[pLog-svn] Indepth pLog SQL Analysis

Matt matt at woodzy.com
Thu Jun 30 22:04:48 GMT 2005


Dear fellow developers,

I could have thrown this into the forums... and still can, because
someone might find this interesting... but I'd rather compose this email
in a mail client. Plus, this is a lot of info and I'd rather spam
your mailboxes  :)

When I'm developing for pLog I have an unfortunate curse: I sit right
next to the rack where the development web server is located. This is
good, of course, if I destroy something foolish... but mostly it's
irritating. Irritating because every time I refresh any of the pLog
pages I hear the RAID array scratch around. Now I, being a stickler for
efficiency, get really irritated hearing the HDs scratch every time I
refresh the main page, on a server that has 2 GB of RAM and runs nothing
other than apache/mysqld.

This is my quest: find the evil, evil culprits that cause this
scratchy sound, because it should NOT, let me repeat that, NOT be
touching the disk on every page request.

(For all of you thinking, "oh, you need to enable caching", let's just
ignore that for now and pretend caching doesn't exist... pLog doesn't do
anything THAT complicated)

^^ pLog on a ramFS ^^

My first thought was that pLog loads tons of files... maybe the OS was
not keeping everything in the page cache...

^^^^^makeramfs.sh^^^^^
#!/bin/bash
curdir=`pwd`
filelist="admin.php album archives blog blog.php category comment class
debug.php error.php config get index.php install locale plugins post
resource resserver.php rss rss.php static styles summary.php templates
trackback.php trackbacks user version.php xmlrpc.php"

#unmount it if it already exists (ignore the error if it isn't mounted)
umount $curdir/plog-ramfs 2>/dev/null
mkdir -p $curdir/plog-ramfs
mount -t ramfs ramfs $curdir/plog-ramfs -o maxsize=30000

#copy each file/directory into the ramfs, preserving permissions
for file in $filelist; do
    cp -R -p $curdir/$file $curdir/plog-ramfs;
done;
^^^^^^^^^^^^^^^^^^^
(notes on the script: you'll have to run it from your pLog directory,
or set curdir to the correct directory)

^^^ changes to index.php somewhere at the top ^^^
    if (!defined( "PLOG_CLASS_PATH" )) {
        if (is_dir(dirname(__FILE__)."/plog-ramfs"))
            define("PLOG_CLASS_PATH",dirname(__FILE__)."/plog-ramfs/");
        else
            define( "PLOG_CLASS_PATH", dirname(__FILE__)."/");
    }
^^^^^^^^^^^^

Now at first glance this seems like an awesome idea... at least I
thought so... but the performance gains are rather small, because the OS
generally does a good job of caching these files. Parsing them is
what takes a long time...

(I came up with these numbers by individually accessing a whole slew of
pLog pages four times each, non-consecutively, with wget, and grep'ing
out the execution time, which I added to index.php using xdebug...)
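(For the curious, the collection loop might look roughly like this. This
is a sketch, not my actual script: the URL list, the "Execution time"
line printed by index.php, and the sample values are all made up.)

```shell
#!/bin/sh
# Sketch of the timing harness described above. The commented-out
# wget loop is the collection step (hypothetical URLs):
#
# for pass in 1 2 3 4; do
#     for url in http://localhost/plog/index.php ...; do
#         wget -q -O - "$url" | grep -o 'Execution time: [0-9.]*' \
#             | grep -o '[0-9.]*' >> times.log
#     done
# done

# averaging step, shown here on four made-up samples:
times="1.71 1.65 1.68 1.67"
echo $times | tr ' ' '\n' \
    | awk '{ s += $1; n++ } END { printf "average: %.4f seconds\n", s/n }'
```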

Average execution time without ramfs:
1.6786 seconds

Average execution time on ramfs:
1.6323 seconds

As you can see, the difference is pretty small... what difference there
is pretty much comes down to the fact that the ramfs filesystem is much
more lightweight than ext3, because it's made to run in RAM, so the
file syscalls (open() and the like) are faster...

^^ summary pLog in ram ^^
So clearly the HD scratches are not caused by the actual PHP files, and
this really isn't a recommendable optimization (on Windows, for example,
it isn't even an option)... next program to target: mysql...

^^ mysql the evil tmp table creator ^^
First thing I noticed when poking around mysql statistics was...

^^^^^
mysql> show status like '%tmp%';
+-------------------------+--------+
| Variable_name           | Value  |
+-------------------------+--------+
| Created_tmp_disk_tables | 34243  |
| Created_tmp_tables      | 107642 |
| Created_tmp_files       | 0      |
+-------------------------+--------+
3 rows in set (0.00 sec)
^^^^^

the Created_tmp_disk_tables count was exceptionally high compared to the
total number of temporary tables created (roughly one in three temporary
tables was going to disk...)

So maybe mysql just wasn't keeping enough memory available for pLog to
run around happily... I checked the default variables (SHOW
VARIABLES)... and they seemed reasonable... but I guess they could be low...

This inspired the commands:

^^^
mysql> set global tmp_table_size=33554432, max_tmp_tables=1024,
join_buffer_size=134217728, max_join_size=134217728;

mysql> set global query_cache_limit=2097152, query_cache_size=33554432,
query_cache_type=1, table_cache=2048;

mysql> set global sort_buffer_size=33554432,max_sort_length=4096;

mysql> flush status;
^^^
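(Note that SET GLOBAL values only last until the server restarts. If you
want to keep them, the rough my.cnf equivalent would be something like
the following sketch. The variable names are from the MySQL 4.x era, so
double-check them against your server's SHOW VARIABLES before using it.)

```ini
[mysqld]
tmp_table_size    = 32M
max_tmp_tables    = 1024
join_buffer_size  = 128M
max_join_size     = 134217728
query_cache_limit = 2M
query_cache_size  = 32M
query_cache_type  = 1
table_cache       = 2048
sort_buffer_size  = 32M
max_sort_length   = 4096
```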

And after running through my timer/wget script again...

^^^
mysql> show status like '%tmp%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| Created_tmp_disk_tables | 351   |
| Created_tmp_tables      | 651   |
| Created_tmp_files       | 0     |
+-------------------------+-------+
3 rows in set (0.00 sec)
^^^

Unfortunately, it still looks like we have a roughly 2:1 ratio... Now,
my entire database should have fit in those buffers I allocated... there
is no way the joined pLog tables could not fit a few times over into the
caches/buffers I set, so there must be another reason why we are going
to disk...

^^^ explain mysql command and query logging ^^^
So, some stuff I noticed about the queries... the queries that go to
disk generally do so because of a filesort, which is required because
there are not enough keys on the table...
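(The usual cure for that kind of filesort is an index matching the
WHERE/GROUP BY columns. Purely as a hedged sketch, not a tested
recommendation for the pLog schema, it would look something like:)

```sql
-- hypothetical: a composite key covering the filter + grouping columns
ALTER TABLE main_articles_categories ADD INDEX blog_id_id (blog_id, id);
```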

I'll attach the results I created, but in order to reproduce them you'll
need to...
- enable query logging on the mysql server...
- then modify some Execute function somewhere (I modified _Execute in
adodb.inc, because I was also timing the SQL queries) to strip the \n's
from all the queries (that's bad form, by the way)... with
str_replace("\n", " ", ...) and so on...

^^^ mysql_log_analyzer.sh ^^^
#!/bin/bash
username="root"
pass=""
database="plog"

# read queries from stdin; append each query and its EXPLAIN
# to the results file given as $1
while read line
do
    echo "$line" >> $1
    mysql -u$username -p$pass $database -e "EXPLAIN $line;" >> $1
    echo "" >> $1
done
^^^^^^
usage:
cat mysql_queries.log.save | sed 's/^[ \t0-9:]\+[ \t][0-9]\+ [^
]\+//g;s/ [ \t]\+/ /g;' | grep "SELECT" -i | sh mysql_log_analyzer.sh
results


(Make sure to run this against a copy of the log file, or the EXPLAINs
will start showing up in the log and you'll be running this all day...
also forgive the evil regex :)... if you want I can explain it)

When looking at the results the string you don't ever want to see is:
"Using temporary; Using filesort"

That means it is an unoptimizable query and it will do a lot of work
every single time it is executed. Filesorting on multiple large tables
generally requires actually writing to disk, as in the following query...

No matter what variables I chose for mysqld, this query ALWAYS went to
disk... I think it's because of my rather large articles table, which I
had already reduced from 6k rows to 1k for these tests, mainly because I
thought the size of the table was the problem...

^^^^
SELECT a.id as article_id, c.id AS id, c.name AS name, c.url AS url,
c.blog_id AS blog_id, c.last_modification AS last_modification,
c.in_main_page AS in_main_page, c.parent_id AS parent_id, c.properties
AS properties, c.description AS description, c.mangled_name AS
mangled_name, IF(a.id IS NULL, 0, COUNT(*)) AS num_articles, a.date AS
last_update FROM main_articles_categories c LEFT JOIN
main_article_categories_link l ON c.id=l.category_id LEFT JOIN
main_articles a ON a.id = l.article_id WHERE a.id IN
("901","897","896","893","869","868","852","847","831","828") AND
a.blog_id = '1' AND c.blog_id = '1' GROUP BY c.id, a.id

table  type    possible_keys                                           key      key_len  ref           rows  Extra
c      ref     blog_id                                                 blog_id  4        const         2     Using where; Using temporary; Using filesort
l      index   NULL                                                    PRIMARY  8        NULL          906   Using index
a      eq_ref  PRIMARY,blog_id,blog_id_slug,blog_id_slug_category_id  PRIMARY  4        l.article_id  1     Using where
^^^^

This gives mysql effectively 1812 rows to sort... and a temporary table
sized something like (key lengths) 4*8*4 * (906*2 rows) * (the row sizes
of the 3 tables), which mysql swaps out, causing a lot of disk seeks.

It is much better to get this count with two queries, using more keys
and MUCH less joining...

^^^ suggested query(s) ^^^
mysql> select * from main_articles_categories where blog_id=1;

mysql> select count(a.id) from main_articles
a,main_article_categories_link l where l.category_id in
(list_from_above_query) and a.id=l.article_id group by l.category_id;

^^^
^^^ and the explain to go with it ^^^
mysql> explain select * from main_articles_categories where blog_id=1;
table                     type  possible_keys  key      key_len  ref    rows  Extra
main_articles_categories  ref   blog_id        blog_id  4        const  2     Using where

mysql> explain select count(a.id) from main_articles
a,main_article_categories_link l where l.category_id in (1,2,3) and
a.id=l.article_id group by l.category_id;
table  type    possible_keys  key      key_len  ref           rows  Extra
l      index   PRIMARY        PRIMARY  8        NULL          906   Using where; Using index; Using temporary; Using filesort
a      eq_ref  PRIMARY        PRIMARY  4        l.article_id  1     Using where; Using index
^^^

This still uses a filesort... we'd have to use mysql-specific stuff to
optimize beyond this... but the mysql optimizer can handle this query
much better. In the original query the optimizer did not recognize that
it could use the PRIMARY key on the articles table to significantly
reduce the query size; here, even though a temporary table is still
used, it can be handled in memory, because there is only one join (two
tables) and the size and number of rows to search are smaller.

Personally, I think this is a monstrously evil query... it would be well
advised to create another field in *articles_categories, called length,
to cache this count.
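(A hedged sketch of that idea, using the suggested "length" name; the
UPDATEs would have to run wherever pLog links or unlinks an article to a
category, and the id value below is made up:)

```sql
-- hypothetical denormalization: cache the per-category article count
ALTER TABLE main_articles_categories ADD length INT NOT NULL DEFAULT 0;

-- whenever an article is added to a category:
UPDATE main_articles_categories SET length = length + 1 WHERE id = 5;
-- (and length - 1 on removal)
```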

Because of this disk access (mainly), the average generation time wasted
on mysql is ~0.45 seconds. The above query alone can take up to 250ms
with a large articles database... (with 6k rows it takes half a second)

^^^ conclusion ^^^

So, after that mouthful of stuff... clearly the PHP can stand to be sped
up some, but there is also real room for improvement in pLog's SQL
statements... as the attached results file shows, we are doing full
filesorts on a LOT of tables and having to create temporary tables to do
so. I also tried putting mysql's temporary files on a ramfs... to little
avail, other than crashing the query :)... I don't have a lot more to
say, heh... maybe someone else can add something...

Tomorrow I'm playing with PHP Optimizers... you'll probably hear from me
again...

For now I'm out,
-Matt


