Wednesday, October 27, 2010

Interesting interaction between the Lucene query parser and Java JDWP debugging

Yesterday I solved a problem that had been vexing me for more than a week. I am helping a client load-test an application using Apache Solr (JDK 1.6.0_14, Solr 1.4.1, which uses Lucene 2.9.3).

Under a small amount of load, searches would return quickly, but as load increased the service was quickly overwhelmed, with most queries taking 3+ seconds to return, even when searching cores with very small indexes (hundreds of documents).

Thread dumps and hprof stack profiling of the application were confusing. Solr seemed to be spending huge amounts of time not in the search or scoring code, where I had expected, but in the query-parsing code. Many of the threads looked something like this:


"Thread-234" daemon prio=10 tid=0x00007ff4b7436800 nid=0x4aae waiting on condition [0x000000005a811000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1613)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1308)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
at org.apache.solr.search.QParser.getQuery(QParser.java:131)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103)
(etc)


or this:


"Thread-123" daemon prio=10 tid=0x00007ff4b6d00800 nid=0x4ab6 waiting on condition [0x000000005b011000]
java.lang.Thread.State: RUNNABLE
at org.apache.lucene.queryParser.FastCharStream.readChar(FastCharStream.java:47)
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveNfa_3(QueryParserTokenManager.java:529)
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveStringLiteralDfa0_3(QueryParserTokenManager.java:87)
at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1190)
at org.apache.lucene.queryParser.QueryParser.jj_ntk(QueryParser.java:1776)
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1328)
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
(etc)


Some of the queries are somewhat complicated, with nested parens and such, but they did not seem complicated enough to justify 2-3 second parse times. Additionally, even uncomplicated queries (no parens, no nesting, just a simple q=xyz) were taking tens or hundreds of milliseconds under load. The code for QueryParser.java is complicated: it is generated by JavaCC and uses many nested method calls. It seemed as if there might be some single mutex that all of these method calls were fighting over - especially because these stack traces show the thread as RUNNABLE, yet the state is "waiting on condition".

While under load, I took a backtrace using gdb (thread apply all bt 12) and saw several suspicious-looking threads like this:

Thread 37 (Thread 1237547376 (LWP 32044)):
#0 0x00007f1421c594e4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1 0x00007f1421c59914 in pthread_cond_wait@GLIBC_2.2.5 () from /lib/libpthread.so.0
#2 0x00007f1421388727 in os::PlatformEvent::park () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#3 0x00007f14213617c2 in Monitor::IWait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#4 0x00007f1421361e2a in Monitor::wait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#5 0x00007f14214bcb39 in VMThread::execute () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#6 0x00007f14213f0e4b in OptoRuntime::handle_exception_C_helper () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#7 0x00007f14213f0ef9 in OptoRuntime::handle_exception_C () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#8 0x00007f141d411ec6 in ?? ()


Google comes through again - searching for "OptoRuntime::handle_exception_C_helper" turns up a post from AMD titled Java Performance When Debugging Is Enabled.

It turns out that by default in this environment, apps are launched with jdwp debugger attachment enabled.
(JVM flag: -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=9000)
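One quick way to confirm from inside a running application whether the debug agent was launched (a sketch of my own; the flag can also simply be grepped out of ps output) is to inspect the JVM's input arguments via the runtime MXBean:

```java
import java.lang.management.ManagementFactory;

public class JdwpCheck {
    public static void main(String[] args) {
        boolean jdwp = false;
        // The JVM exposes the flags it was launched with via the runtime MXBean.
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.contains("jdwp")) {
                jdwp = true;
            }
        }
        System.out.println(jdwp ? "JDWP agent enabled" : "no JDWP agent");
    }
}
```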

A performance hit is expected while a debugger is actually attached, but it turns out this flag adds overhead to certain JVM operations even with no debugger connected, specifically around exception handling. The Lucene query parser seems to use exceptions for signaling within a tight loop (the LookaheadSuccess exception), which may be why query parsing is affected so badly.
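To make the exception-signaling pattern concrete, here is a minimal sketch (my own illustration, not Lucene's actual generated code) of a pre-allocated exception used purely as a control-flow signal. With the JDWP agent loaded, every one of these throws takes a slower path through the JVM's debugger-aware exception machinery:

```java
public class LookaheadSketch {
    // Pre-allocated once, so there is no constructor/stack-trace cost per throw.
    private static final RuntimeException LOOKAHEAD_SUCCESS = new RuntimeException();

    // Returns true if the "lookahead" matched. The throw/catch pair is the
    // signaling mechanism, not error handling.
    static boolean lookahead(boolean matches) {
        try {
            if (matches) {
                throw LOOKAHEAD_SUCCESS; // signal success, not an error
            }
            return false;
        } catch (RuntimeException e) {
            if (e != LOOKAHEAD_SUCCESS) {
                throw e; // a real error; re-throw it
            }
            return true;
        }
    }

    public static void main(String[] args) {
        // Even a modest query triggers many such throws during parsing.
        int hits = 0;
        for (int i = 0; i < 100000; i++) {
            if (lookahead(i % 2 == 0)) hits++;
        }
        System.out.println(hits); // prints 50000
    }
}
```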

After disabling the debugging, the average return time for Solr queries dropped from ~3 seconds to under 400ms for complicated queries, and from ~3 seconds to under 30ms for the simple queries. Also, before the change, the Solr process would use at most ~1.2 CPUs of a 4-CPU machine; afterwards we can get it to use more than 3 CPUs on the same machine, and the overall throughput of the load test more than doubled.

Friday, March 27, 2009

Mysql query optimization for ORDER BY.. LIMIT queries

Recently I was working with a client to support text-search functionality on their website. Given a search category and a list of words, the app was to return the highest-relevancy titles matching those words. For purposes of demonstration, assume the ranking was simple, using only an "or" search on the list of words. (So, for example, a search for "Green tea" would return all titles that matched "Green" or "Tea", and relevancy scoring didn't need to take into account whether more than one word was matched.)

This was a high-traffic application with upwards of 100 queries per second.

The most painful part of this query was sorting the results in order of relevancy. The simplified table looks like this:
mysql> desc titles;
+-----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+--------------+------+-----+---------+-------+
| category_id | int(11) | | PRI | 0 | |
| keyword | varchar(20) | | PRI | | |
| relevancy | int(11) | | | 0 | |
| title_id | int(11) | | PRI | 0 | |
| title | varchar(255) | | | | |
+-----------------+--------------+------+-----+---------+-------+
5 rows in set (0.00 sec)

The primary key was the combination of category_id-keyword-title_id - queries always include both the category_id and the keyword, and the title_id was returned as an index into another table containing more information about the title:

mysql> explain select relevancy,title_id,title from titles where category_id=355 and keyword in ("green", "tea")
order by relevancy desc limit 20;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where; Using filesort |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

In the "Extra" column of the query plan, you can see "Using filesort". This is never good: it means MySQL will scan every matching row and build up a temporary store, so that it can re-order the results and return only 20. For this search term, it is creating a new table of 5074 rows. Creating and sorting a 5000-row table is not that big a deal on its own, but multiply that by 100 QPS and the database is doing a lot of extra work.

Without the ORDER BY, the query plan is easier on the database. It still matches 5074 rows but no longer needs the filesort. However, this query does not return the required results - they come back in more or less random order.

mysql> explain select relevancy,title from titles where category_id=355 and keyword in
("green", "tea");
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)


The MySQL documentation on ORDER BY optimization says that you can add the column you are ordering by to the index, allowing the sort to be satisfied by an index scan instead of a filesort. With that in mind, I added the "relevancy" column to the index the optimizer was choosing - in this case, the primary key.

mysql> alter table titles drop primary key, add primary key (category_id, keyword, relevancy, title_id);
Query OK, 170671 rows affected (4.72 sec)
Records: 170671 Duplicates: 0 Warnings: 0

However, this resulted in no change in the mysql query plan, other than the row estimation:

mysql> explain select relevancy,title from titles where category_id=355 and keyword in
("green", "tea") order by relevancy desc limit 20;
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 6253 | Using where; Using filesort |
+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)


I did a little digging and found a great post on the MySQL Performance Blog about this. Using the "IN" clause causes the index to be used slightly differently (a range scan rather than a const ref lookup), which prevents the "relevancy" column from being used to optimize the sort. The above post links to another article describing a UNION optimization. I went one step simpler and just had the code make multiple queries to the database, one keyword at a time, then re-order the results in code. The explain plan for a single-keyword query now looks like this:

mysql> explain select relevancy,title from titles where category_id=355 and keyword='green'
order by relevancy desc limit 20;
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 24 | const,const | 610 | Using where |
+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+
1 row in set (0.00 sec)


Even though the code now has to make a few trips to the database to satisfy the entire query, each individual query runs very fast. At the beginning of the exercise, we were unable to sustain a load of even 50 QPS; by the time the optimizations were finished, we were running 80 threads against the database (4-core Opteron 2216HE) at a combined ~900 QPS with no cache.
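The client-side re-ordering step can be sketched roughly like this (hypothetical names, no real database access - each inner list stands in for the already-sorted top-20 result of one per-keyword query):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TitleMerge {
    static final class Hit {
        final int relevancy;
        final String title;
        Hit(int relevancy, String title) { this.relevancy = relevancy; this.title = title; }
    }

    // Combine the per-keyword top-N lists and keep the overall top N by relevancy.
    static List<Hit> mergeTop(List<List<Hit>> perKeyword, int limit) {
        List<Hit> all = new ArrayList<Hit>();
        for (List<Hit> hits : perKeyword) all.addAll(hits);
        all.sort((a, b) -> Integer.compare(b.relevancy, a.relevancy)); // descending
        return all.subList(0, Math.min(limit, all.size()));
    }

    public static void main(String[] args) {
        List<Hit> green = Arrays.asList(new Hit(90, "Green Eggs"), new Hit(40, "Greenhouse"));
        List<Hit> tea   = Arrays.asList(new Hit(70, "Tea Time"), new Hit(60, "High Tea"));
        for (Hit h : mergeTop(Arrays.asList(green, tea), 3)) {
            System.out.println(h.relevancy + " " + h.title);
        }
        // prints: 90 Green Eggs / 70 Tea Time / 60 High Tea
    }
}
```

In the real application you would also de-duplicate title_ids that matched more than one keyword before applying the limit.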

A welcome side effect of querying on a smaller unit (category ID plus a single keyword) is that we can make better use of a distributed cache (memcached) to store those results. While we may not get a lot of queries for "Green tea", by storing the results of "green" and "tea" separately, we get at least a partial cache hit for other queries like "green tree", "green grass", "oolong tea", etc.
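The caching scheme boils down to keying on the (category, keyword) pair rather than the full phrase. A rough sketch (a HashMap stands in for memcached here; the names are illustrative, not the client's actual code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeywordCache {
    private final Map<String, List<String>> cache = new HashMap<String, List<String>>();

    // One cache entry per (category, keyword) pair, so "green tree" and
    // "green grass" both reuse whatever "green" results are already cached.
    static String key(int categoryId, String keyword) {
        return categoryId + ":" + keyword.toLowerCase();
    }

    List<String> fetch(int categoryId, String keyword) {
        return cache.computeIfAbsent(key(categoryId, keyword),
                k -> queryDatabase(categoryId, keyword)); // cache miss: hit the DB
    }

    // Stand-in for the real per-keyword SQL query.
    private List<String> queryDatabase(int categoryId, String keyword) {
        return new ArrayList<String>(Collections.singletonList("result-for-" + keyword));
    }
}
```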

Monday, June 18, 2007

Rebuilding a 3Ware RAID set in Linux

This information is specific to the 3Ware 9500-series controller (more specifically, the 9500-4LP). However, the 3Ware CLI seems to be the same on the other 3Ware 9XXX controllers I have experience with (the 9550 for sure).


Under Linux, the 3Ware cards can be manipulated through the "tw_cli" command. (The CLI tools can be downloaded for free from 3Ware's support website.)

A healthy RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 OK - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 OK u0 372.61 GB 781422768 3PM0Q3B7


A failed RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Now I will remove this bad disk from the RAID set:


//dev306> maint remove c0 p3
Exporting port /c0/p3 ... Done.




I now need to physically replace the bad drive. Unfortunately, since our vendor wired some of our cables cockeyed, I will usually generate some I/O on the disks at this point to see which of the four disks is actually bad. (Hint: the one with no lights on is the bad one.)


dev306:~# find /opt -type f -exec cat '{}' > /dev/null \;


With the bad disk identified and replaced, now I need to go back into the 3Ware CLI and find the new disk, then tell the array to start rebuilding.


dev306:~# /opt/3Ware/bin/tw_cli
//dev306> maint rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p3].


//dev306> maint rebuild c0 u0 p3
Sending rebuild start request to /c0/u0 on 1 disk(s) [3] ... Done.

//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 REBUILDING 0 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Note that p3 still shows a status of "DEGRADED", but the array itself is now "REBUILDING". Under minimal I/O load, a RAID-5 of 400GB disks such as this one takes about 2.5 hours to rebuild.
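As a rough sanity check on that 2.5-hour figure (my arithmetic, not a vendor spec): the rebuild has to write the full contents of the replaced 372.61 GB disk, which works out to roughly 42 MB/s sustained.

```java
public class RebuildTime {
    public static void main(String[] args) {
        double diskGb = 372.61;  // capacity of the replaced disk
        double hours = 2.5;      // observed rebuild time under minimal load
        // GB -> MB, hours -> seconds, to get sustained write throughput.
        double mbPerSec = diskGb * 1024 / (hours * 3600);
        System.out.println(mbPerSec + " MB/s");
    }
}
```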

Supermicro H8DAR-T BIOS Settings

We run a lot of Supermicro H8DAR-T motherboards in production. These are the BIOS settings that work well for us. I have not done a lot of tweaking trying to get more performance out of our systems with BIOS settings, since stability is key.

Note that unless specified here, we leave the settings at their default values. (Some of the settings below are defaults, but are documented because we need them set that way.) Especially important options are in bold.


Advanced->ACPI Settings->Advanced ACPI Settings
ACPI 2.0 [No]
ACPI APIC Support [Enabled]
ACPI SRAT Table [Enabled]
BIOS->AML ACPI Table [Enabled]
Headless Mode [Enabled]
OS Console Redirection [Always]

Advanced->AMD PowerNow Configuration
PowerNow [Disabled]

Advanced->Remote Access
Remote Access [Enabled]
Serial Port [COM2]
Serial Port Mode [19200,8,N,1]
Flow Control [None]
Redirection After Post [Always]
Terminal Type [vt100]
UT-UTF8 Combo Keys [Enabled]
SRedir Memory Display [No Delay]

Advanced->System Health->System Fan
Fan Speed Control [1) Disable - Full Speed]

PCIPnP
Plug and Play OS [No]
PCI Latency [64]
Allocate IRQ to PCI VGA [Yes]
Palette Snooping [Disabled]
PCI IDE BusMaster [Disabled]

Boot->Boot Device Priority
1) Floppy
2) PC-CD-244E (cdrom)
3) MBA Slot 218 (first ethernet)
4) 3Ware (or Onboard SATA)
5) MBA Slot 219 (second ethernet)

Chipset->NorthBridge->ECC Configuration
DRAM ECC [Enabled]
MCA ECC Logging [Enabled]
ECC Chipkill [Enabled]
DRAM Scrub Redirect [Enabled]
DRAM BG Scrub [163.8us]
L2 Cache BG Scrub [ 10.2us]
Data Cache BG Scrub [ 5.12us]

Chipset->NorthBridge->IOMMU Options
IOMMU Mode [Best Fit]
Aperture Size [64MB]

Supermicro H8DAR-T version detection

The Supermicro H8DAR-T motherboard comes in (at least) two flavors. The differences that I know about between the two versions are:

* The version 2.01 board will run OpenSolaris/Nexenta out of the box, because of a difference in the SATA controller hardware. The version 1.01 board will not run OpenSolaris without an add-on controller card.

* The 1.01 and 2.01 boards use different hardware sensors (for temperature, fan speed, etc.). We get sensor stats through our IPMI cards, so the IPMI cards need to be flashed with firmware matching the specific hardware version. The IPMI cards do work for power-on/power-off and console redirection without the matching firmware; only the sensors fail when the IPMI firmware mismatches the motherboard version.

Unfortunately, I do not see enough of a difference at POST time to be able to tell them apart. However, there are two ways I know of to do the detection.

1. With the cover of the machine off, the version can be seen in the back left corner of the board. (Will post pics later)

2. Under Linux, use the "dmidecode" command. The system board uses "Handle 0x0002". What works well for me is "dmidecode | grep -A3 'Base Board'". v1.01 boards report their Version as "1234567890" (way to go, Supermicro!); v2.01 boards report Version "2.0". Examples:

v1board:~# dmidecode |grep -A3 "Base Board"
Base Board Information
Manufacturer: Supermicro
Product Name: H8DAR-T
Version: 1234567890


v2board:~# dmidecode |grep -A3 "Base Board"
Base Board Information
Manufacturer: Supermicro
Product Name: H8DAR-T
Version: 2.0