[pLog-svn] db optimization

Christoph Feddersen feddersen at herr-der-ringe-film.de
Thu Apr 13 18:23:24 GMT 2006


Yes, the data I worked with probably isn't very realistic for an
installation with several hundred blogs and thousands of articles and
comments. That's why I asked for the generateData script to be enhanced
to generate a more realistic test scenario.
What I have in mind is an installation with about 10,000 users, 15,000
blogs, 300,000 articles, and 500,000 comments (some of them spam), with
the Bayesian filter tables filled accordingly, the standard plugins
installed and in use, and so on.
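A minimal sketch of how such rows could be mass-inserted (the plog_users
column names here are illustrative, not the real schema): build a small
helper table of digits and cross-join it to produce 10,000 numbered dummy
users in a single INSERT ... SELECT:

CREATE TABLE tmp_digits (d INT NOT NULL);
INSERT INTO tmp_digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

INSERT INTO plog_users (user, password, email)
SELECT CONCAT('testuser', t.n),
       MD5(CONCAT('secret', t.n)),
       CONCAT('testuser', t.n, '@example.com')
FROM (SELECT a.d + 10*b.d + 100*c.d + 1000*e.d AS n
      FROM tmp_digits a, tmp_digits b, tmp_digits c, tmp_digits e) t;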

That would be step one. Step two would be to simulate traffic to the
installation. There are testing tools that can replay Apache log entries
to simulate users. That would give us realistic data for identifying the
current bottlenecks.

So my current suggestions are just some general hints. A detailed
analysis will require a good test environment, as described above.

Jon Daley wrote:
>     That looks like an excellent analysis, and some good tips - I didn't
> know any of that.
>     One thing to keep in mind is that Christoph only looked at Oscar's
> log, so there are some features that presumably weren't exercised, and
> the changes might affect other people's blogs negatively - say (I'm
> guessing) if they don't use custom URLs, or other things like that?
> 
> 
> On Thu, 13 Apr 2006, Christoph Feddersen wrote:
> 
>> Hi there,
>>
>> I examined the provided logfiles and came up with some changes to the
>> db-indices. They improve the performance of several queries without
>> changing the db-schema or code.
>> I dropped several indices, because a single-column index is redundant
>> when a multi-column index already starts with that column; it just adds
>> overhead when inserting, updating, or deleting records.
>>
>>
>> ALTER TABLE `plog_articles_comments` ADD INDEX ( `blog_id` , `type` );
>> ALTER TABLE `plog_articles_comments` DROP INDEX `blog_id`;
>> ALTER TABLE `plog_articles_comments` DROP INDEX `article_id`;
>> ALTER TABLE `plog_articles_comments` ADD INDEX ( `article_id` , `type`);
>> ALTER TABLE `plog_articles` DROP INDEX `blog_id`;
>> ALTER TABLE `plog_gallery_albums` DROP INDEX `owner_id`;
>> ALTER TABLE `plog_gallery_resources` DROP INDEX `album_id`;
>> ALTER TABLE `plog_referers` DROP INDEX `blog_id`;
>> ALTER TABLE `plog_users_permissions` DROP INDEX `user_id`;
>> ALTER TABLE `plog_articles` DROP INDEX `blog_id_slug`;
>> ALTER TABLE `plog_articles` ADD INDEX ( `blog_id` , `status` , `date` );
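>>
>> Whether a query actually uses the new composite index can be checked
>> with EXPLAIN; the key column of the output should show the new index:
>>
>> EXPLAIN SELECT date FROM plog_articles
>> WHERE blog_id = 1 AND status = 1 AND date >= 20060300000000;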
>>
>> Some general advice
>> ===================
>>
>> Don't use COUNT(*); use COUNT(primary key column) instead. There's no
>> need to load complete record sets when you only want to know the number
>> of rows.
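>>
>> For example, assuming id is the primary key of plog_articles_comments:
>>
>> SELECT COUNT(id) FROM plog_articles_comments WHERE blog_id = 1;
>>
>> rather than SELECT COUNT(*) with the same condition.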
>>
>>
>> If possible, the conditions of a query should always appear in the same
>> order, and there should be an index covering exactly that order. There
>> are several queries that use the same conditions, but in a different
>> order. Example:
>>
>> SELECT date  FROM plog_articles  WHERE status = 1  AND blog_id = 1 AND
>> date >= 20060300000000 AND date <= NOW();
>>
>> SELECT * FROM plog_articles WHERE blog_id = 1 AND date >
>> '20060311172502' AND status = 1 ORDER BY DATE ASC LIMIT 0,1;
>>
>> You see that I added a multi-column index with:
>> ADD INDEX ( `blog_id` , `status` , `date` );
>>
>> So the conditions should be ordered that way:
>> SELECT date  FROM plog_articles  WHERE  blog_id = 1 AND status = 1 AND
>> date >= 20060300000000 AND date <= NOW();
>>
>> SELECT * FROM plog_articles WHERE blog_id = 1  AND status = 1 AND date >
>> '20060311172502' ORDER BY DATE ASC LIMIT 0,1;
>>
>>
>> Avoid LIKE conditions such as the one below; MySQL can't use an index
>> for a LIKE on a computed expression, so the query is very slow:
>> SELECT a.id AS id, a.id, a.date, a.user_id, a.blog_id, a.status,
>>        a.properties, a.num_reads, a.slug, 1 AS relevance,
>>        a.num_comments AS num_comments,
>>        a.num_nonspam_comments AS num_nonspam_comments,
>>        a.num_trackbacks AS num_trackbacks,
>>        a.num_nonspam_trackbacks AS num_nonspam_trackbacks,
>>        a.global_category_id AS global_category_id,
>>        a.in_summary_page AS in_summary_page
>> FROM plog_articles a, plog_articles_categories c,
>>      plog_article_categories_link l
>> WHERE a.blog_id = 1
>>   AND FROM_UNIXTIME(UNIX_TIMESTAMP(a.date)+0)+0 LIKE '200601%'
>>   AND c.id = l.category_id AND a.id = l.article_id
>>   AND a.status = '1' AND a.date <= '20060402144438'
>> GROUP BY a.id
>> ORDER BY a.date DESC LIMIT 0, 5;
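>>
>> A plain range condition on a.date could use the new
>> (blog_id, status, date) index instead; a sketch of the rewrite, with
>> LIKE '200601%' replaced by the equivalent range for January 2006:
>>
>> SELECT date FROM plog_articles
>> WHERE blog_id = 1 AND status = 1
>>   AND date >= 20060101000000 AND date < 20060201000000;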
>>
>> Use a FULLTEXT index with MATCH ... AGAINST instead of LIKE:
>> SELECT article_id FROM plog_articles_text WHERE (((normalized_topic LIKE
>> 'S') OR (normalized_text LIKE 'S')) OR ((normalized_topic LIKE 'S') OR
>> (normalized_text LIKE 'S')));
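>>
>> A sketch of the fulltext version, assuming MyISAM tables ('S' stands
>> for the search terms from the log):
>>
>> ALTER TABLE plog_articles_text
>> ADD FULLTEXT (normalized_topic, normalized_text);
>>
>> SELECT article_id FROM plog_articles_text
>> WHERE MATCH (normalized_topic, normalized_text) AGAINST ('S');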
>>
>>
>> I'll take a closer look at the queries in the code next week. I'll
>> need to fill my database with some more data to get better test results.
>> That's why I ported generateData.php to 1.1. Maybe someone can enhance
>> it even further so that all database tables are filled with realistic
>> values.
>>
>> The overall db-structure is looking good. Regarding the MyISAM vs.
>> InnoDB question, there's an article that compares these storage engines:
>> http://dev.mysql.com/tech-resources/articles/storage-engine/part_3.html
>>
>> I'd stick with MyISAM in the standard distribution. As most people will
>> run small installations, it's the simplest and fastest option, and you
>> can use the FULLTEXT search. Installations with a large database may
>> benefit from using InnoDB, but I'd leave that up to the admin. Maybe add
>> a notice to the documentation that this can be done (beware: InnoDB
>> doesn't support FULLTEXT indices).
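>>
>> Converting a table is one statement per table, e.g. (a sketch; MySQL
>> 4.0 uses TYPE = InnoDB instead of ENGINE = InnoDB):
>>
>> ALTER TABLE plog_articles ENGINE = InnoDB;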
>>
>> Christoph
>>
> 
> **************************************
> Jon Daley
> http://jon.limedaley.com/
> 
> Our software is not 'released'.
> It *escapes*, leaving a trail of mangled QA engineers in its path.

