Under a small amount of load, searches would return quickly, but the service was quickly overwhelmed with most queries taking 3+ seconds to return, even searching in cores that had very small index sizes (Hundreds of documents).
Thread dumps and hprof stack profiling of the application were confusing. Solr seemed to be spending huge amounts of time not in search code or scoring, where I had expected, but instead in the query parsing code. Many of the threads looked something like this:
"Thread-234" daemon prio=10 tid=0x00007ff4b7436800 nid=0x4aae waiting on condition [0x000000005a811000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1613)
        at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1308)
        at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
        at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
        at org.apache.solr.search.QParser.getQuery(QParser.java:131)
        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103)
        (etc)
or this:
"Thread-123" daemon prio=10 tid=0x00007ff4b6d00800 nid=0x4ab6 waiting on condition [0x000000005b011000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.lucene.queryParser.FastCharStream.readChar(FastCharStream.java:47)
        at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveNfa_3(QueryParserTokenManager.java:529)
        at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveStringLiteralDfa0_3(QueryParserTokenManager.java:87)
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1190)
        at org.apache.lucene.queryParser.QueryParser.jj_ntk(QueryParser.java:1776)
        at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1328)
        at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)
        at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
        (etc)
Some of the queries are somewhat complicated, with nested parens and such, but it did not seem as if they were complicated enough to justify 2-3 second parse times. Additionally even uncomplicated queries (no parens, no nesting, just simple q=xyz) were taking tens or hundreds of milliseconds under load. The code for QueryParser.java is complicated, it is generated code using JavaCC, and uses many nested method calls. It seemed like perhaps there was some single mutex that all of these method calls were fighting over - especially because these stack traces show the thread as runnable, yet the state is "waiting on condition".
While under load, I did a backtrace using gdb (thread apply all bt 12) and saw several suspicious looking threads that looked like this:
Thread 37 (Thread 1237547376 (LWP 32044)):
#0  0x00007f1421c594e4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
#1  0x00007f1421c59914 in pthread_cond_wait@GLIBC_2.2.5 () from /lib/libpthread.so.0
#2  0x00007f1421388727 in os::PlatformEvent::park () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#3  0x00007f14213617c2 in Monitor::IWait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#4  0x00007f1421361e2a in Monitor::wait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#5  0x00007f14214bcb39 in VMThread::execute () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#6  0x00007f14213f0e4b in OptoRuntime::handle_exception_C_helper () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#7  0x00007f14213f0ef9 in OptoRuntime::handle_exception_C () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so
#8  0x00007f141d411ec6 in ?? ()
Google comes through again - looking for "OptoRuntime::handle_exception_C_helper" comes up with a post from AMD titled Java Performance When Debugging Is Enabled.
It turns out that by default in this environment, apps are launched with jdwp debugger attachment enabled.
(JVM flag: -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=9000)
Negative performance impact is expected when a debugger is attached, but this flag adds an overhead to certain JVM calls, specifically around exception handling. The Lucene Query Parser seems to use exceptions for signaling within a tight loop ( LookaheadSuccess ), so this may be why query parsing is affected so much.
After disabling the debugging, the average return time for solr queries dropped from ~3 seconds to < 400ms for complicated queries, and from ~3 seconds to < 30ms for the simple queries. Also, before the test, the solr process would use at most ~1.2 CPU's of a 4CPU machine. After the test, we can get the solr process to use more than 3CPU on the same 4 CPU machine, and the overall throughput of the load test was more than doubled.
