Friday, March 27, 2009

Mysql query optimization for ORDER BY.. LIMIT queries

Recently I was working with a client to support a text-search type functionality on their website. Given a search category and a list of words, the app was to return the highest relevancy titles that matched those words. For purposes of demonstration, assume the ranking was simple, using only an "or" search on the list of words. (So for example, a search of "Green tea" would match return all titles that matched "Green" or "Tea", and relevancy scoring didn't need to take into account if more than one word was matched)

This was a high-traffic application with upwards of 100 queries per second.

The most painful part of this query was sorting the results in order of relevancy. The simplified table looks like this:
mysql> desc titles;
+-----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| category_id | int(11) | | PRI | 0 | |
| keyword | varchar(20) | | PRI | | |
| relevancy | int(11) | | | 0 | |
| title_id | int(11) | | PRI | 0 | |
| title | varchar(255) | | | | |
+-----------------+--------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

The primary key was the combination of category_id-keyword-title_id - queries always include both the category_id and the keyword, and the title_id was returned as an index into another table containing more information about the title:

mysql> explain select relevancy,title_id,title from titles where category_id=355 and keyword in ("green", "tea")
order by relevancy desc limit 20;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where; Using filesort |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

In the "Extra" part of the query plan, you can see that it is "Using Filesort". This is never good - this means mysql will scan every matching row and build up a temporary store, so that it can re-order the results and return only 20. For this search term, it is creating a new table with 5074 rows. Creating and sorting a 5000 row table is not that big of a deal on its own, but multiply that by 100QPS and the database is doing a lot of extra work.

Without the order by, the query plan is easier on the database. It still matches 5074 rows but no longer needs the filesort. However, this query does not return the required results - they are returned in more or less random order.

mysql> explain select relevancy,title from titles where category_id=355 and keyword in
("green", "tea");
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)


On the mysql documentation about ORDER BY optimization, it says that you can add the column that you are ordering by to the index, to allow the order by to happen in an index scan instead of a table scan. With that in mind, I added the "relevancy" column into the index the optimizer was choosing - in this case, the primary key.

mysql> alter table titles drop primary key, add primary key (category_id, keyword, relevancy, title_id);
Query OK, 170671 rows affected (4.72 sec)
Records: 170671 Duplicates: 0 Warnings: 0

However, this resulted in no change in the mysql query plan, other than the row estimation:

mysql> explain select relevancy,title from titles where category_id=355 and keyword in
("green", "tea") order by relevancy desc limit 20;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 6253 | Using where; Using filesort |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)


I did a little digging and found a great post in the MySQL Performance Blog about this. Using the "IN" clause causes the index to be used slightly differently, and eliminates the use of the "relevancy" column for sorting optimization. The above post links to another article talking about a UNION optimization. I went one step simpler and just had the code make multiple queries to the database, one keyword at a time, then re-ordering the result in code. The explain plan for that now looks like this:

mysql> explain select relevancy,title from titles where category_id=355 and keyword='green'
order by relevancy desc limit 20;
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 24 | const,const | 610 | Using where |
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
1 row in set (0.00 sec)


Even though the code now has to make a few trips to the database to satisfy the entire query, each individual query runs very fast. At the beginning of the exercise, we were unable to sustain a load of even 50QPS, by the time the optimizations were finished, we were running 80 threads against the database (4 core opteron 2216HE) with a combined ~900QPS using no cache.

A welcome side effect of querying on a smaller unit (category ID, plus a single keyword), we are able to make more use of a distributed cache (memcached) to store those results. While we may not get a lot of queries for "Green tea", by storing the results of "green" and "tea" separately, we get at least a partial cache hit for other queries like "green tree", "green grass", "oolong tea", etc.

No comments: