tag:blogger.com,1999:blog-59618971541892198032024-02-20T13:14:38.065-08:00Ops MonkeyMy life as an ops monkey. Research, clippings, technology.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.comBlogger17125tag:blogger.com,1999:blog-5961897154189219803.post-38858912883055618792010-10-27T15:57:00.000-07:002010-10-27T15:57:34.524-07:00Interesting interaction between lucene query parser and java jdwp debuggingYesterday I solved a problem that had been vexing me for more than a week. I am helping a client load test an application using Apache Solr. (JDK 1.6.0_14, Solr version 1.4.1 which uses Lucene 2.9.3).<br />
<br />
Under light load, searches returned quickly, but as load increased the service was soon overwhelmed, with most queries taking 3+ seconds to return, even when searching cores with very small indexes (hundreds of documents). <br />
<br />
Thread dumps and hprof stack profiling of the application were confusing. Solr seemed to be spending huge amounts of time not in the search or scoring code, where I had expected it, but instead in the query parsing code. Many of the threads looked something like this:<br />
<br />
<code><br />
"Thread-234" daemon prio=10 tid=0x00007ff4b7436800 nid=0x4aae waiting on condition [0x000000005a811000]<br />
java.lang.Thread.State: RUNNABLE<br />
at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1613)<br />
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1308)<br />
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)<br />
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)<br />
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)<br />
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)<br />
at org.apache.solr.search.QParser.getQuery(QParser.java:131)<br />
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103)<br />
(etc)<br />
</code><br />
<br />
or this:<br />
<br />
<code><br />
"Thread-123" daemon prio=10 tid=0x00007ff4b6d00800 nid=0x4ab6 waiting on condition [0x000000005b011000]<br />
java.lang.Thread.State: RUNNABLE<br />
at org.apache.lucene.queryParser.FastCharStream.readChar(FastCharStream.java:47)<br />
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveNfa_3(QueryParserTokenManager.java:529)<br />
at org.apache.lucene.queryParser.QueryParserTokenManager.jjMoveStringLiteralDfa0_3(QueryParserTokenManager.java:87)<br />
at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1190)<br />
at org.apache.lucene.queryParser.QueryParser.jj_ntk(QueryParser.java:1776)<br />
at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1328)<br />
at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1265)<br />
at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1254)<br />
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)<br />
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)<br />
(etc)<br />
</code><br />
<br />
Some of the queries are somewhat complicated, with nested parens and such, but they did not seem complicated enough to justify 2-3 second parse times. Additionally, even uncomplicated queries (no parens, no nesting, just a simple q=xyz) were taking tens or hundreds of milliseconds under load. The code for QueryParser.java is complicated: it is generated by JavaCC and uses many nested method calls. It seemed like perhaps there was some single mutex that all of these method calls were fighting over - especially because these stack traces show the thread as RUNNABLE, yet the state is "waiting on condition".<br />
<br />
While under load, I did a backtrace using gdb (thread apply all bt 12) and saw several suspicious looking threads that looked like this:<br />
<code><br />
Thread 37 (Thread 1237547376 (LWP 32044)):<br />
#0 0x00007f1421c594e4 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0<br />
#1 0x00007f1421c59914 in pthread_cond_wait@GLIBC_2.2.5 () from /lib/libpthread.so.0<br />
#2 0x00007f1421388727 in os::PlatformEvent::park () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#3 0x00007f14213617c2 in Monitor::IWait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#4 0x00007f1421361e2a in Monitor::wait () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#5 0x00007f14214bcb39 in VMThread::execute () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#6 0x00007f14213f0e4b in OptoRuntime::handle_exception_C_helper () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#7 0x00007f14213f0ef9 in OptoRuntime::handle_exception_C () from /opt/jdk1.6.0_14/jre/lib/amd64/server/libjvm.so<br />
#8 0x00007f141d411ec6 in ?? ()<br />
</code><br />
<br />
Google comes through again - a search for "OptoRuntime::handle_exception_C_helper" turns up a post from AMD titled <A href="http://developer.amd.com/documentation/articles/pages/Java-Performance-Debugging-Enabled.aspx">Java Performance When Debugging Is Enabled</A>. <br />
<br />
It turns out that by default in this environment, apps are launched with jdwp debugger attachment enabled. <br />
(JVM flag: -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=9000) <br />
<br />
Some performance impact is expected while a debugger is actually attached, but this flag adds overhead to certain JVM operations - specifically around exception handling - even with no debugger connected. The Lucene query parser appears to use exceptions for signaling within a tight loop (LookaheadSuccess), which may be why query parsing is affected so badly.<br />
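The cost of that pattern is easy to see in miniature. Here is an illustrative sketch (Python rather than Java, purely to show the shape of the pattern - only the LookaheadSuccess name is taken from Lucene's generated parser, the rest is invented):

```python
class LookaheadSuccess(Exception):
    """Sentinel used purely for control flow, as in Lucene's generated parser."""

def lookahead_with_exception(tokens):
    # Signal "lookahead matched" by raising - the style the generated code uses.
    try:
        for tok in tokens:
            if tok == "AND":
                raise LookaheadSuccess()
    except LookaheadSuccess:
        return True
    return False

def lookahead_with_return(tokens):
    # The same logic as a plain boolean return - no exception machinery at all.
    return any(tok == "AND" for tok in tokens)

query = ["green", "tea", "AND", "oolong"]
assert lookahead_with_exception(query) == lookahead_with_return(query) == True
```

Both forms compute the same answer; the difference is that every raise goes through the runtime's exception-handling path, which is exactly the path the jdwp agent instruments.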
<br />
After disabling the debugging, the average return time for solr queries dropped from ~3 seconds to < 400ms for complicated queries, and from ~3 seconds to < 30ms for the simple queries. Also, before the change, the solr process would use at most ~1.2 CPUs of a 4-CPU machine. After the change, we can get the solr process to use more than 3 CPUs on the same machine, and the overall throughput of the load test more than doubled.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com1tag:blogger.com,1999:blog-5961897154189219803.post-69767006459114472242009-03-27T22:51:00.000-07:002009-03-29T10:19:15.421-07:00Mysql query optimization for ORDER BY.. LIMIT queriesRecently I was working with a client to support text-search functionality on their website. Given a search category and a list of words, the app was to return the highest-relevancy titles that matched those words. For purposes of demonstration, assume the ranking was simple, using only an "or" search on the list of words. (So, for example, a search of "Green tea" would return all titles that matched "Green" or "Tea", and relevancy scoring didn't need to account for whether more than one word was matched)<br /><br />This was a high-traffic application with upwards of 100 queries per second.<br /><br />The most painful part of this query was sorting the results in order of relevancy. 
The simplified table looks like this:<br /><pre><span style="font-size:85%; color:green;">mysql> desc titles;<br />+-----------------+--------------+------+-----+---------+-------+<br />| Field | Type | Null | Key | Default | Extra |<br />+-----------------+--------------+------+-----+---------+-------+<br />| category_id | int(11) | | PRI | 0 | |<br />| keyword | varchar(20) | | PRI | | |<br />| relevancy | int(11) | | | 0 | |<br />| title_id | int(11) | | PRI | 0 | |<br />| title | varchar(255) | | | | |<br />+-----------------+--------------+------+-----+---------+-------+<br />5 rows in set (0.00 sec)</span><br /></pre>The primary key was the combination of category_id-keyword-title_id - queries always include both the category_id and the keyword, and the title_id was returned as an index into another table containing more information about the title:<br /><pre><span style="font-size:85%; color:green"><br />mysql> explain select relevancy,title_id,title from titles where category_id=355 and keyword in ("green", "tea")<br />order by relevancy desc limit 20;<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where; Using filesort |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />1 row in set (0.00 sec)</span><br /></pre>In the "Extra" part of the query plan, you can see that it is "Using Filesort". This is never good - this means mysql will scan every matching row and build up a temporary store, so that it can re-order the results and return only 20. For this search term, it is creating a new table with 5074 rows. 
Creating and sorting a 5000-row table is not that big of a deal on its own, but multiply that by 100QPS and the database is doing a lot of extra work.<br /><br />Without the ORDER BY, the query plan is easier on the database. It still matches 5074 rows but no longer needs the filesort. However, this query does not return the required results - they come back in more or less random order.<br /><pre><span style="font-size:85%; color:green"><br />mysql> explain select relevancy,title from titles where category_id=355 and keyword in<br />("green", "tea");<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+<br />| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+<br />| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 5074 | Using where |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-------------+<br />1 row in set (0.00 sec)</span><br /></pre>The MySQL documentation on <a href="http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html">ORDER BY optimization</a> says that you can add the column you are ordering by to the index, allowing the ORDER BY to be satisfied with an index scan instead of a filesort. 
With that in mind, I added the "relevancy" column into the index the optimizer was choosing - in this case, the primary key.<br /><code></code><pre><br /><span style="font-size:85%; color:Green">mysql> alter table titles drop primary key, add primary key (category_id, keyword, relevancy, title_id);<br />Query OK, 170671 rows affected (4.72 sec)<br />Records: 170671 Duplicates: 0 Warnings: 0</span><br /></pre>However, this resulted in no change in the mysql query plan, other than the row estimation:<br /><code></code><pre><br /><span style="font-size:85%; color:green">mysql> explain select relevancy,title from titles where category_id=355 and keyword in<br />("green", "tea") order by relevancy desc limit 20;<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />| 1 | SIMPLE | titles | range | PRIMARY | PRIMARY | 24 | NULL | 6253 | Using where; Using filesort |<br />+----+-------------+--------+-------+---------------+---------+---------+------+------+-----------------------------+<br />1 row in set (0.00 sec)</span><br /></pre><br />I did a little digging and found a <a href="http://www.mysqlperformanceblog.com/2006/09/01/order-by-limit-performance-optimization/">great post in the MySQL Performance Blog</a> about this. Using the "IN" clause causes the index to be used slightly differently, and eliminates the use of the "relevancy" column for sorting optimization. 
The above post links to another article talking about a <a href="http://www.mysqlperformanceblog.com/2006/08/14/mysql-followup-on-union-for-query-optimization-query-profiling/">UNION optimization.</a> I went one step simpler and just had the code make multiple queries to the database, one keyword at a time, then re-order the results in code. The explain plan for that now looks like this:<br /><pre><br /><span style="font-size:85%; color:green;">mysql> explain select relevancy,title from titles where category_id=355 and keyword='green'<br />order by relevancy desc limit 20;<br />+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+<br />| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |<br />+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+<br />| 1 | SIMPLE | titles | ref | PRIMARY | PRIMARY | 24 | const,const | 610 | Using where |<br />+----+-------------+--------+------+---------------+---------+---------+-------------+------+-------------+<br />1 row in set (0.00 sec)</span><br /></pre><br />Even though the code now has to make a few trips to the database to satisfy the entire query, each individual query runs very fast. At the beginning of the exercise, we were unable to sustain a load of even 50QPS; by the time the optimizations were finished, we were running 80 threads against the database (4-core Opteron 2216HE) with a combined ~900QPS using no cache.<br /><br />A welcome side effect of querying on a smaller unit (category ID plus a single keyword) is that we are able to make more use of a distributed cache (memcached) to store those results. 
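The per-keyword split-and-merge can be sketched like this (Python; the row shapes are illustrative, and deduplicating on title_id is my assumption about how a title that matches several keywords should be handled):

```python
import heapq

# Each input list is what one per-keyword query returns: rows of
# (relevancy, title_id, title), already sorted by relevancy descending
# thanks to the per-keyword ORDER BY relevancy DESC LIMIT 20.
def merge_results(per_keyword_results, limit=20):
    # Merge the pre-sorted lists, highest relevancy first.
    merged = heapq.merge(*per_keyword_results,
                         key=lambda row: row[0], reverse=True)
    seen, out = set(), []
    for relevancy, title_id, title in merged:
        if title_id in seen:      # a title can match more than one keyword
            continue
        seen.add(title_id)
        out.append((relevancy, title_id, title))
        if len(out) == limit:
            break
    return out

# Two simulated result sets, for the keywords "green" and "tea":
green = [(90, 1, "Green Tea"), (70, 2, "Green Grass")]
tea = [(90, 1, "Green Tea"), (60, 3, "Oolong Tea")]
assert merge_results([green, tea]) == [
    (90, 1, "Green Tea"), (70, 2, "Green Grass"), (60, 3, "Oolong Tea")]
```

Because each input list is already sorted, the merge is linear in the number of rows fetched, and each keyword's list remains independently cacheable.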
While we may not get a lot of queries for "Green tea", by storing the results of "green" and "tea" separately, we get at least a partial cache hit for other queries like "green tree", "green grass", "oolong tea", etc.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-43299919118505555842007-06-18T22:55:00.000-07:002007-06-18T23:17:23.261-07:00Rebuilding a 3Ware Raid set in linuxThis information is specific to the 3Ware 9500 Series controller. (More specifically, the 9500-4LP). However, the 3Ware CLI seems to be the same for other 3Ware 9XXX controllers which I have had experience with. (The 9550 for sure)<br /><br /><br />Under linux, the 3Ware cards can be manipulated through the "tw_cli" command. (The CLI tools can be downloaded for free from <a href="http://www.3ware.com/support">3Ware's support website</a>)<br /><br />A healthy RAID set looks like this:<br /><pre><br />dev306:~# /opt/3Ware/bin/tw_cli<br />//dev306> info c0<br /><br />Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC<br />------------------------------------------------------------------------------<br />u0 RAID-5 OK - 256K 1117.56 ON OFF OFF <br /><br />Port Status Unit Size Blocks Serial<br />---------------------------------------------------------------<br />p0 OK u0 372.61 GB 781422768 3PM0Q56Z <br />p1 OK u0 372.61 GB 781422768 3PM0Q3YY <br />p2 OK u0 372.61 GB 781422768 3PM0PFT7 <br />p3 OK u0 372.61 GB 781422768 3PM0Q3B7 <br /></pre><br /><br />A failed RAID set looks like this:<br /><pre><br />dev306:~# /opt/3Ware/bin/tw_cli<br />//dev306> info c0<br /><br />Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC<br />------------------------------------------------------------------------------<br />u0 RAID-5 DEGRADED - 256K 1117.56 ON OFF OFF <br /><br />Port Status Unit Size Blocks Serial<br />---------------------------------------------------------------<br />p0 OK u0 372.61 GB 
781422768 3PM0Q56Z <br />p1 OK u0 372.61 GB 781422768 3PM0Q3YY <br />p2 OK u0 372.61 GB 781422768 3PM0PFT7 <br />p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7 <br /></pre><br /><br />Now I will remove this bad disk from the RAID set:<br /><br /><pre><br />//dev306> maint remove c0 p3<br />Exporting port /c0/p3 ... Done.<br /><br /></pre><br /><br /><br />I now need to physically replace the bad drive. Unfortunately since our vendor wired some of our cables cockeyed, I will usually cause some I/O on the disks at this point, to see which of the four disks is "actually" bad. (Hint: The one with no lights on is the bad one.)<br /><br /><pre><br />dev306:~# find /opt -type f -exec cat '{}' > /dev/null \;<br /></pre><br /><br />With the bad disk identified and replaced, now I need to go back into the 3Ware CLI and find the new disk, then tell the array to start rebuilding.<br /><br /><pre><br />dev306:~# /opt/3Ware/bin/tw_cli<br />//dev306> maint rescan<br />Rescanning controller /c0 for units and drives ...Done.<br />Found the following unit(s): [none].<br />Found the following drive(s): [/c0/p3].<br /><br /><br />//dev306> maint rebuild c0 u0 p3<br />Sending rebuild start request to /c0/u0 on 1 disk(s) [3] ... Done.<br /><br />//dev306> info c0<br /><br />Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC<br />------------------------------------------------------------------------------<br />u0 RAID-5 REBUILDING 0 256K 1117.56 ON OFF OFF <br /><br />Port Status Unit Size Blocks Serial<br />---------------------------------------------------------------<br />p0 OK u0 372.61 GB 781422768 3PM0Q56Z <br />p1 OK u0 372.61 GB 781422768 3PM0Q3YY <br />p2 OK u0 372.61 GB 781422768 3PM0PFT7 <br />p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7 <br /></pre><br /><br />Note that p3 still shows a status of "DEGRADED" but now the array itself is "REBUILDING". 
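Rather than re-running "info c0" by hand every few minutes, the rebuild can be watched with a small script. This is a sketch assuming the tw_cli path and the output format shown above; the parsing is deliberately loose:

```python
import re
import subprocess
import time

def rebuild_percent(info_output):
    """Pull the %Cmpl figure from a 'tw_cli info c0' unit line, or None if
    no unit is currently rebuilding."""
    m = re.search(r"REBUILDING\s+(\d+)", info_output)
    return int(m.group(1)) if m else None

def watch(controller="c0", interval=300):
    # Poll the controller until the REBUILDING status disappears.
    while True:
        out = subprocess.run(
            ["/opt/3Ware/bin/tw_cli", "info", controller],
            capture_output=True, text=True).stdout
        pct = rebuild_percent(out)
        if pct is None:
            print("rebuild finished (or never started)")
            break
        print(f"rebuilding: {pct}%")
        time.sleep(interval)
```

The sample output above ("u0 RAID-5 REBUILDING 0 256K ...") is exactly what the regex keys on.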
Under minimal IO load, a RAID-5 with 400GB disks such as this one will take about 2.5 hours to rebuild.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com7tag:blogger.com,1999:blog-5961897154189219803.post-23852244392071298382007-06-18T18:55:00.000-07:002007-06-18T21:17:11.354-07:00Supermicro H8DAR-T BIOS SettingsWe run a lot of Supermicro H8DAR-T motherboards in production. These are the BIOS settings that work well for us. I have not done a lot of tweaking trying to get more performance out of our systems with BIOS settings, since stability is key.<br /><br />Note that unless specified here, we leave the settings at their default values. (Some of these settings are default values but documented because we need them set that way) Especially important options in <span style="font-weight: bold;">BOLD</span>.<br /><br /><pre><br />Advanced->ACPI Settings->Advanced ACPI Settings<br /> ACPI 2.0 [No]<br /> ACPI APIC Support [Enabled]<br /> ACPI SRAT Table [Enabled]<br /> BIOS->AML ACPI Table [Enabled]<br /> Headless Mode [Enabled]<br /> OS Console Redirection [Always]<br /><br />Advanced->AMD PowerNow Configuration<br /> PowerNow [Disabled]<br /><br />Advanced->Remote Access<br /> Remote Access [Enabled]<br /> Serial Port [COM2]<br /> Serial Port Mode [19200,8,N,1]<br /> Flow Control [None]<br /> Redirection After Post [Always]<br /> Terminal Type [vt100]<br /> UT-UTF8 Combo Keys [Enabled]<br /> SRedir Memory Display [No Delay]<br /><br />Advanced->System Health->System Fan<br /> Fan Speed Control [1) Disable - Full Speed]<br /><br />PCIPnP<br /> Plug and Play OS [No]<br /> PCI Latency [64]<br /> Allocate IRQ to PCI VGA [Yes]<br /> Pallete Snooping [Disabled]<br /> PCI IDE BusMaster [Disabled]<br /><br /><span style="font-weight: bold;">Boot->Boot Device Priority</span><br /><span style="font-weight: bold;"> 1) Floppy</span><br /><span style="font-weight: bold;"> 2) PC-CD-244E (cdrom)</span><br /><span style="font-weight: bold;"> 3) MBA 
Slot 218 (first ethernet)</span><br /><span style="font-weight: bold;"> 4) 3Ware (or Onboard SATA)</span><br /><span style="font-weight: bold;"> 5) MBA Slot 219 (second ethernet)</span><br /><br /><span style="font-weight: bold;">Chipset->NorthBridge->ECC Configuration</span><br /><span style="font-weight: bold;"> DRAM ECC [Enabled]</span><br /><span style="font-weight: bold;"> MCA ECC Logging [Enabled]</span><br /><span style="font-weight: bold;"> ECC Chipkill [Enabled]</span><br /><span style="font-weight: bold;"> DRAM Scrub Redirect [Enabled]</span><br /><span style="font-weight: bold;"> DRAM BG Scrub [163.8us]</span><br /><span style="font-weight: bold;"> L2 Cache BG Scrub [ 10.2us]</span><br /><span style="font-weight: bold;"> Data Cache BG Scrub [ 5.12us]</span><br /><br /><span style="font-weight: bold;">Chipset->NorthBridge->IOMMU Options</span><br /><span style="font-weight: bold;"> IOMMU Mode [Best Fit]</span><br /><span style="font-weight: bold;"> Aperture Size [64MB]</span><br /></pre>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-29872775878907001802007-06-18T18:22:00.001-07:002007-06-18T18:55:07.065-07:00Supermicro H8DAR-T version detectionThe Supermicro H8DAR-T motherboard comes in (at least) two flavors. The differences that I know about between the two versions are:<br /><br />* The version 2.01 board will run OpenSolaris/Nexenta out of the box. This is because of a difference in the SATA controller hardware. The version 1.01 board will not run OpenSolaris without an add-on controller card.<br /><br />* The 1.01 and 2.01 boards use different hardware sensors (for temperature, fan speed, etc.). We get sensor stats through our IPMI cards; because of this the IPMI cards need to be flashed to the specific version of the hardware. 
The IPMI cards <span style="font-weight: bold;">do</span> work for poweron/poweroff and console redirection without this specific firmware; only the sensors stop working if the IPMI firmware mismatches the motherboard version.<br /><br />Unfortunately, I do not see enough of a difference at POST time to be able to tell them apart. However, there are two ways I know of to do the detection.<br /><br />1. With the cover of the machine off, the version can be seen in the back left corner of the board. (Will post pics later)<br /><br />2. Under Linux, use the "dmidecode" command. The system board uses "Handle 0x0002". What works well for me is "dmidecode |grep -A3 'Base Board' ". v1.01 boards report their Version as "1234567890" (way to go Supermicro!). v2.01 boards report Version "2.0". Examples:<br /><code><br />v1board:~# dmidecode |grep -A3 "Base Board"<br /> Base Board Information<br /> Manufacturer: Supermicro<br /> Product Name: H8DAR-T<br /><span style="font-weight: bold;"> Version: 1234567890</span><br /></code><br /><code><br />v2board:~# dmidecode |grep -A3 "Base Board"<br /> Base Board Information<br /> Manufacturer: Supermicro<br /> Product Name: H8DAR-T<br /><span style="font-weight: bold;"> Version: 2.0</span><br /></code>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com1tag:blogger.com,1999:blog-5961897154189219803.post-82214895895233690822007-02-17T17:02:00.000-08:002007-04-25T21:44:02.570-07:00Path MTU discovery and MTU troubleshootingRecently when debugging some performance issues on a client's site, I came across some very interesting behavior. Some users were reporting that the site performed very well for a short period of time, but after a while, performance became very poor, enough so to render the site unusable. 
Checking the apache logfiles for the IP addresses of those clients showed that the requests themselves were not taking an unusual amount of time, but instead the requests were coming into the webserver at a snail's pace.<br /><br />Checking at the network level, I saw some strange things happening:<pre><br />prod-lb01:~# tethereal -R "http.request and ip.addr == (client)"<br />125.362898 (client) -> (server) HTTP GET /search/stuff HTTP/1.1<br />125.362922 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)<br />126.612994 (client) -> (server) HTTP GET /search/stuff HTTP/1.1<br />126.613018 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)<br />129.615113 (client) -> (server) HTTP GET /search/stuff HTTP/1.1<br />129.615135 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)<br />135.616047 (client) -> (server) HTTP GET /search/stuff HTTP/1.1<br />135.616066 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)<br /></pre>Fragmentation Needed? <a href="http://www.networksorcery.com/enp/protocol/icmp/msg3.htm">(ICMP Type 3/Code 4)</a> Why would we need to fragment incoming packets? This should only happen if a packet is bigger than the Maximum Transmission Unit (MTU), and since this is all connected with Ethernet, at a constant 1500 MTU, it is odd to see this.<br /><br />Then I remembered this site is using Linux Virtual Server (LVS) for load balancing incoming requests. LVS can be configured in several ways, but this site is using IP-IP aka <a href="http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-Tun.html">LVS-Tun</a> load balancing, which encapsulates the incoming IP packet inside another packet and sends that to the destination server. Since this uses IP encapsulation, each request that hits the load balancer will have additional headers tacked on, to address the packet to the appropriate realserver. 
The encapsulation happens to add 20 bytes of headers to each packet.<br /><br />Okay, so the actual MTU of requests that go to the load balancer is 1480 due to the encapsulation overhead. Snooping for this type of packet at the router, I notice that we're sending out a LOT of them:<pre><br />(router):~# tcpdump -n -i eth7 "icmp[icmptype] & icmp-unreach != 0 and icmp[icmpcode] & 4 != 0"<br />tcpdump: verbose output suppressed, use -v or -vv for full protocol decode<br />listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes<br />17:07:00.608444 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:01.288197 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:01.910215 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:01.927728 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:02.391218 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:02.693094 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:02.912513 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:03.019852 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />17:07:03.398335 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br /></pre>These ICMP messages are not bad, per se - they are part of the <a href="http://www.netheaven.com/pmtu.html">Path MTU Discovery</a> process. However, many firewalls indiscriminately block ICMP packets of all kinds. Based on the research I did on this problem, most of the <a href="http://blue-labs.org/howto/mtu-mss.php">documentation</a> I found was from the end-user's perspective, i.e., users who had PPPoE or other types of encapsulated/tunneled connections and had trouble getting to certain websites. 
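The arithmetic behind the advertised 1480-byte MTU - and the MSS numbers that come up below - is simple enough to write down explicitly (standard IPv4 and TCP header sizes assumed):

```python
ETHERNET_MTU = 1500   # bytes per frame payload on an ordinary Ethernet link
IPIP_OVERHEAD = 20    # extra IPv4 header added by LVS-Tun encapsulation
TCP_IP_HEADERS = 40   # 20-byte IPv4 header + 20-byte TCP header

tunnel_mtu = ETHERNET_MTU - IPIP_OVERHEAD    # what the ICMP errors advertise
max_safe_mss = tunnel_mtu - TCP_IP_HEADERS   # largest TCP payload per packet

assert tunnel_mtu == 1480
assert max_safe_mss == 1440
```

Any TCP segment carrying more than max_safe_mss bytes of payload will not fit through the tunnel without fragmentation, which is exactly what the "need to frag" errors are reporting.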
Now with the proliferation of personal firewall hardware and software, some of which may be overzealously configured to block all ICMP (even "good" ICMP like PMTU discovery), this is something that server admins have to worry about, too, especially if running a load balancing solution which encapsulates packets.<br /><br />The research I did on the problem pointed me to the following iptables rule to be added on the router: <pre>iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu</pre> This is intended to force the advertised Maximum Segment Size (MSS) to be 40 less than the smallest MTU that the router knows about. However, this didn't work for us (this tcpdump line looks for any TCP handshakes plus any ICMP unreachable errors):<pre><br />(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \<br /> (tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"<br />tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes<br />18:00:17.479661 IP (tos 0x0, ttl 53, id 47601, offset 0, flags [DF], length: 52)<br />(client).1199 > (server).80: S [tcp sum ok] 2541494183:2541494183(0) win 65535<br /><<span style="font-weight: bold;">mss 1460</span>,nop,wscale 2,nop,nop,sackOK><br /><br />18:00:17.479861 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)<br />(server).80 > (client).1199: S [tcp sum ok] 2875112671:2875112671(0) ack 2541494184 win 5840<br /><<span style="font-weight: bold;">mss 1460</span>,nop,nop,sackOK,nop,wscale 7><br /><br />18:00:17.771080 IP (tos 0xc0, ttl 63, id 10080, offset 0, flags [none], length: 576)<br />(server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)<br />for IP (tos 0x0, ttl 52, id 47613, offset 0, flags [DF], length: 1500)<br />(client).1199 > (server).80: . 546:2006(1460) ack 1 win 64240</pre>It was still negotiating a 1460-byte MSS during the handshake. 
In hindsight, this makes sense, because the router doesn't really know that the MTU of the load balancer and the realservers is actually smaller than 1500 - the router communicates with these machines over their ethernet interfaces, which are all still set to a 1500-byte MTU. Digging some more into the problem (including the LVS-Tun HOWTO linked above), I found quite a few things mentioned, but no real definitive answers.<br /><br />I chose to fix this problem by hardcoding the MSS to 1440 at the router, rather than using the "clamp-mss-to-pmtu" setting:<pre>iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1440:1536 -j TCPMSS --set-mss 1440</pre>1440 is the normal MSS value of 1460, minus the 20-byte overhead for the encapsulated packet. This seems to have fixed the problem entirely: <pre>(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \<br /> (tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"<br />tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes<br />18:02:19.466678 IP (tos 0x0, ttl 53, id 55012, offset 0, flags [DF], length: 52)<br />(client).1298 > (server).80: S [tcp sum ok] 2863214365:2863214365(0) win 65535<br /><<span style="font-weight: bold;">mss 1460</span>,nop,wscale 2,nop,nop,sackOK><br /><br />18:02:19.466886 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)<br />(server).80 > (client).1298: S [tcp sum ok] 2996826059:2996826059(0) ack 2863214366 win 5840<br /><<span style="font-weight: bold;">mss 1440</span>,nop,nop,sackOK,nop,wscale 7><br /><br />.... silence!<br /></pre>PS - The reason that I was seeing this very odd behavior - very fast at first, followed by an unusable site?<br /><ul><li> The client website had recently added a search history, which was stored in a browser cookie. 
Things would go great until enough data was in the cookie to push the request over 1440 bytes.</li><li>I had configured my home DSL router to discard ICMP many years back and had forgotten about it - my firewall was throwing away the ICMP Fragmentation Needed packets, so my PC never "got the memo" that it needed to send smaller packets!</li></ul>This actually worked out for the better, though - this site had had reports of odd slowness in the recent past, and hopefully this was the root cause!<br /><br />EDIT: Note that in the original post, I had missed an important option: in the iptables config it is important to use the "-m tcpmss --mss 1440:1536" setting. Without this flag, iptables will force the MSS of ALL traffic to 1440, including clients which request a size smaller than that. This obviously presents a problem to the client.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com3tag:blogger.com,1999:blog-5961897154189219803.post-56561468801197726132007-02-08T11:30:00.000-08:002009-04-17T14:15:50.714-07:00Search Engine Optimization with Apache and mod_rewriteI've recently been using mod_rewrite to modify the URLs on a client's website. mod_rewrite is a powerful tool that lets you turn "ugly" URLs like<br /><blockquote><p>http://example.com/search.cgi?searchType=pie&searchTerm=pumpkin%20pie </p></blockquote>into cleaner URLs like<br /><blockquote><p>http://www.example.com/pie/pumpkin_pie<br /></p></blockquote>This is useful for a couple of reasons - not only is it cleaner to look at, but it can help with search engine indexing. In this case, because "pumpkin_pie" is part of the URL as opposed to part of the query string, the keyword ranks higher in many search engines.<br /><br />Let's say we have an application that will return search results for various categories, and we want the URLs to have the format of "http://www.example.com/(category)/(search term)". 
Also we want to have a landing page if the URL is simply "http://www.example.com/(category)". We want to make this as generic as possible so that the httpd.conf does not need to be edited every time a category is added.<br /><br />This can be configured a number of ways, but the way I have it installed here is with apache running on port 80, and the application - a java servlet container - running on a different port, say port 8000. Apache intercepts most of the requests for static, on-disk content, and uses the proxy mechanism to send dynamic requests to the servlet container. Let's break down the relevant sections of the apache configuration file:<br /><br />First, it can be useful to funnel all traffic for your site through a single hostname, rather than having links to both "example.com" and "www.example.com". This rule forces clients back to "www.example.com" with an HTTP 301 redirect:<br /><pre><code><br />RewriteCond %{HTTP_HOST} ^example\.com$ [NC]<br />RewriteRule ^/(.*) http://www.example.com/$1 [L,R=301]<br /></code></pre><br />Now let's map the static page elements and HTML to the local filesystem, so that they don't get remapped to a search query, and are served by apache instead of proxied through another layer. Note that we need to map favicon.ico to the local filesystem, else you can end up sending searches to your application when the browser requests the favicon.ico for /pie/pumpkin_pie/favicon.ico! 
The [L] in the rewrite modifier tells the rewrite engine to stop processing at this point and serve the file directly.<br /><pre><code><br />RewriteRule ^/js/(.*) /opt/static/js/$1 [L]<br />RewriteRule ^/pictures/(.*) /opt/static/pictures/$1 [L]<br />RewriteRule ^/images/(.*) /opt/static/images/$1 [L]<br />RewriteRule ^/css/(.*) /opt/static/css/$1 [L]<br />RewriteRule /favicon.ico$ /opt/static/html/favicon.ico [L]<br />RewriteRule ^/robots.txt /opt/static/html/robots.txt [L]<br /></code></pre><br />Another useful trick is to re-map underscores to %20 in the search parameters, so we can use terms like "pumpkin_pie" that get remapped to "pumpkin%20pie" when sent to the backend application. This rule will match any URL that has an underscore in it, rewrite one underscore to a %20, and then send processing back to the first rewrite rule - the [N] ("next") flag restarts the rule set, so it will keep remapping underscores one at a time until they're all gone. This is necessary because we don't know how many underscores there might be in the URL, and there is no "replace all" modifier like "/g" in normal unix search and replace. Note the "QSA" in the rule modifiers; this means "Query String Append" and will leave any query string intact through the processing:<br /><pre><code><br />RewriteCond %{REQUEST_URI} ^/.*_<br />RewriteRule ^/(.*)_(.*) /$1\%20$2 [N,QSA]<br /></code></pre><br />Now let's say there are a couple of URL paths that need special handling - say, the "buy" section of the site. With the way we map the general search cases later in this file, anything that needs to be treated differently must be mapped in a way that will bypass the generic match:<br /><pre><code><br />RewriteRule ^/buy/(.*) /purchase.jsp?cat=$1 [QSA]<br /></code></pre><br />Now for the "/(category)" landing page. We have to limit categories to alphabetic characters here - this is so that things like "purchase.jsp" are not treated as categories! 
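The one-underscore-per-pass behavior of the [N] underscore rule above can be sandboxed from the shell. This is just a sketch of the mechanism, not Apache's implementation; the example URL is made up. sed's greedy `\(.*\)` matches the last underscore on each pass, just like the rule's `(.*)_(.*)` pattern:

```shell
# Rewrite one underscore per pass until none remain, mimicking the [N] flag.
# Greedy \(.*\) means each pass replaces the LAST underscore in the URL.
url="/pie/pumpkin_pie_recipes"
while printf '%s' "$url" | grep -q '_'; do
  url=$(printf '%s' "$url" | sed 's/\(.*\)_\(.*\)/\1%20\2/')
done
printf '%s\n' "$url"   # -> /pie/pumpkin%20pie%20recipes
```

Like the rewrite rule, the loop terminates because each pass removes exactly one underscore.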
Also we prevent any request that contains a query string from being treated as a category, so we can have servlets, etc, continue to work:<br /><pre><code><br />RewriteCond %{QUERY_STRING} ^$<br />RewriteRule ^/([a-z]*)$ /landingPage.jsp?category=$1 [NC]<br /></code></pre><br />Now for the generic /(category)/(searchterm) mapping.<br /><pre><code><br />RewriteRule ^/([a-z]*)/(.*) /search.jsp?category=$1&search=$2 [NC,QSA]<br /></code></pre><br />We are at the end of the line, we proxy the resulting modified URL back to our application:<br /><pre><code><br />RewriteRule ^/(.*) http://127.0.0.1:8000/$1 [P]<br /></code></pre><br />And if you run into any trouble, you can turn logging on with the following commands:<br /><pre><code><br />RewriteLog /opt/app/logs/rewrite.log<br />RewriteLogLevel 9<br /></code></pre><br />Now of course, these remappings only map INCOMING URL's to our application. Our application is still responsible for sending this URL format back to the user, so if a user links to your site they are using this optimized URL format. Another way to get these URLs sent to search engines is with a sitemaps file, see <a href="http://www.sitemaps.org/">www.sitemaps.org</a> for details.<br /><br />Tags: <a href="http://opsmonkey.blogspot.com/search/label/apache" rel="tag">apache</a>, <a href="http://opsmonkey.blogspot.com/search/label/mod_rewrite" rel="tag">mod_rewrite</a>, <a href="http://opsmonkey.blogspot.com/search/label/seo" rel="tag">seo</a>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com1tag:blogger.com,1999:blog-5961897154189219803.post-13822922074243412362007-01-24T22:03:00.000-08:002007-01-29T21:48:04.684-08:00Linux memory overcommitLast week I learned something very interesting about the way Linux allocates and manages memory by default out of the box.<br /><br />In a way, Linux allocates memory the way an airline sells plane tickets. 
An airline will sell more tickets than they have actual seats, in the hopes that some of the passengers don't show up. Memory in Linux is managed in a similar way, but to a much more serious degree.<br /><br />Under the default memory management strategy, malloc() essentially always succeeds, with the kernel assuming you're not _really_ going to use all of the memory you just asked for. The malloc() calls will continue to succeed, but the kernel doesn't 'really' allocate the memory until you actually try to use it. This leads to severe pathology in low memory conditions: the application has already allocated the memory and thinks it can use it free and clear, but when it touches those pages while the system is short on memory, the access takes a very long time as the kernel hunts around for memory to give it.<br /><br />In an extremely low memory condition, the kernel will start firing off the "OOM Killer" routine. Processes are given 'OOM Scores' and the process with the highest score, win^H^H^Hloses. This leads to seemingly random processes on a machine being killed by the kernel. Keeping with the airline analogy, I found <a href="http://lwn.net/Articles/104185/">this entertaining post. </a><br /><br />I found some interesting information about the Linux memory manager <a href="http://www.win.tue.nl/%7Eaeb/linux/lk/lk-9.html">here in section 9.6</a>. This section has three small C programs to test memory allocation. 
The second and third programs produced pretty similar results for me so I'm omitting the third:<br /><br />Here are the results of the test on an 8GB debian Linux box:<br /><pre><br />demo1: malloc memory and do not use it: Allocated <span style="font-weight: bold;">1.4TB</span>, killed by OOM killer<br />demo2: malloc memory and use it right away: Allocated 7.8GB, killed by OOM killer<br /></pre><br /><br />Here are the results on an 8GB Nexenta/Opensolaris machine:<br /><pre><br />demo1: malloc memory and do not use it: Allocated 6.6GB, malloc() fails<br />demo2: malloc memory and use it right away: Allocated 6.5GB, malloc() fails<br /></pre><br /><br />Apparently, a big reason linux manages memory this way out of the box is to optimize memory usage of fork()'ed processes; fork() creates a full copy of the process address space, but with overcommitted memory, only pages which are subsequently written to actually need to be allocated by the kernel. This might work very well for a shell server, a desktop, or perhaps a server with a large memory footprint that forks real child processes rather than spawning threads, but in our situation, this is very undesirable.<br /><br />We run a pretty java-heavy environment, with multiple large JVMs configured per host. The problem is that the heap sizes have been getting larger, and we were running in an overcommitted situation and did not realize it. 
The JVMs would all start up and malloc() their large heaps, and then at some later time, once enough of the heaps were actually used, the OOM killer would kick in and more or less randomly kill off one of our JVMs.<br /><br />I found that linux can be brought more in line with traditional/expected memory management by setting these sysctls (apparently available only in 2.6 kernels):<br /><pre><br />vm.overcommit_memory (0=default, 1=malloc always succeeds(?!?), 2=strict overcommit)<br />vm.overcommit_ratio (50=default, I used 100)<br /></pre><br /><br />The ratio appears to be the percentage of the system's total VM that can be allocated via malloc() before malloc() fails. This MIGHT be on a per-pid basis (need to research). This number can be greater than 100%, presumably to allow for some slop in the copy-on-write fork()'s. When I set this to 100 on an 8GB system, I was able to malloc() about 7.5G of stuff, which seemed about right since I had normal multi-user processes running and no swap configured. I don't know why you'd want to use a number much less than 100, unless it were a per-process limit, or you wanted to force some saved room for fscache.<br /><br />The big benefit here is that malloc() can actually <span style="font-weight: bold;">fail</span> in a low memory condition. This means that the error can be caught and handled by the application. 
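For reference, under strict accounting the kernel enforces a system-wide commit limit of swap + (overcommit_ratio / 100) × RAM, visible as CommitLimit (against Committed_AS) in /proc/meminfo. A persistent version of the sysctls above would look like this /etc/sysctl.conf fragment (a sketch - tune the ratio to your own RAM/swap mix):

```shell
# /etc/sysctl.conf -- make malloc() fail up front instead of letting
# the OOM killer pick a victim later.
# CommitLimit = swap + (overcommit_ratio / 100) * RAM
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
```

Apply with `sysctl -p`, or set the values live with `sysctl -w`.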
In my case, it means that JVMs fail at STARTUP time, with an obvious memory-shortage-related error in the logs, rather than having the rug yanked out from under them hours or days later with no message in the application log, and no opportunity to clean up what they were doing.<br /><br />Here are the demo programs with a linux machine set to strict overcommit/100 ratio:<br /><pre><br />demo1: malloc memory and do not use it: Allocated 7.3GB, malloc fails.<br />demo2: malloc memory and use it right away: Allocated 7.3GB, malloc fails.<br /></pre><br /><br />Technorati Tags: <a href="http://technorati.com/tag/linux" rel="tag">linux</a>, <a href="http://technorati.com/tag/memory" rel="tag">memory</a>, <a href="http://technorati.com/tag/OOM" rel="tag">OOM</a>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com5tag:blogger.com,1999:blog-5961897154189219803.post-87401955100319780742007-01-23T15:10:00.000-08:002007-02-10T23:06:27.646-08:00Debugging mysql5 on NexentaDue to some very favorable benchmarking results, I am planning to migrate some of our production databases to mysql5 on Nexenta. Previously the database was running on a Debian Linux server, and the I/O subsystem performed much worse on that system.<br /><br />I ran into a very strange problem with mysql5 under Nexenta, however. After a certain number of clients connected, the server would sometimes begin to refuse connections in an odd way: it would accept the connection to the mysql port, and then immediately close it. <br /><br />Fortunately, one of the other reasons I want to move to Nexenta is the more robust toolchain for troubleshooting just these kinds of problems. I started out by using 'truss' on thread 1 of the mysql daemon under the assumption that it was the thread responsible for managing incoming client connections - not a bad guess. 
Here is a trace of a mysql connection that works correctly vs one that breaks:<br /><br />Works OK:<br /><blockquote><pre><code><br />root@perftest-db01:~# truss -w all -p 6650/1<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1<br />/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0<br />/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 57<br />/1: fcntl(11, F_SETFL, FWRITE) = 0<br />/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0<br />/1: getpid() = 6650 [6589]<br />/1: getpeername(57, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0<br />/1: getsockname(57, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0<br />/1: open("/etc/hosts.allow", O_RDONLY) = 58<br />/1: fstat64(58, 0x08046B20) = 0<br />/1: fstat64(58, 0x08046A50) = 0<br />/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY<br />/1: read(58, " # / e t c / h o s t s".., 8192) = 677<br />/1: read(58, 0x504EA88C, 8192) = 0<br />/1: llseek(58, 0, SEEK_CUR) = 677<br />/1: close(58) = 0<br />/1: open("/etc/hosts.deny", O_RDONLY) = 58<br />/1: fstat64(58, 0x08046B20) = 0<br />/1: fstat64(58, 0x08046A50) = 0<br />/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY<br />/1: read(58, " # / e t c / h o s t s".., 8192) = 901<br />/1: read(58, 0x504EA88C, 8192) = 0<br />/1: llseek(58, 0, SEEK_CUR) = 901<br />/1: close(58) = 0<br />/1: getsockname(57, 0x08047938, 0x08047958, SOV_DEFAULT) = 0<br />/1: fcntl(57, F_SETFL, (no flags)) = 0<br />/1: fcntl(57, F_GETFL) = 2<br />/1: fcntl(57, F_SETFL, FWRITE|FNONBLOCK) = 0<br />/1: setsockopt(57, ip, 3, 0x0804748C, 4, SOV_DEFAULT) = 0<br />/1: setsockopt(57, tcp, TCP_NODELAY, 0x0804748C, 4, SOV_DEFAULT) = 0<br />/1: time() = 1169599669<br />/1: lwp_kill(73, SIG#0) Err#3 ESRCH<br />/1: lwp_create(0x08047240, LWP_DETACHED|LWP_SUSPENDED, 0x08047464) = 243<br />/1: lwp_continue(243) = 0<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)</code></pre></blockquote><br />Immediately closes connection:<br /><br 
/><blockquote><pre><code>root@perftest-db01:~# truss -w all -p 6650/1<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1<br />/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0<br />/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 255<br />/1: fcntl(11, F_SETFL, FWRITE) = 0<br />/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0<br />/1: getpid() = 6650 [6589]<br />/1: getpeername(255, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0<br />/1: getsockname(255, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0<br />/1: open("/etc/hosts.allow", O_RDONLY) = 257<br />/1: close(257) = 0<br />/1: fxstat(2, 256, 0x08045DF8) = 0<br />/1: time() = 1169599778<br />/1: getpid() = 6650 [6589]<br />/1: putmsg(256, 0x080467B8, 0x080467C4, 0) = 0<br />/1: open("/var/run/syslog_door", O_RDONLY) = 257<br />/1: door_info(257, 0x08045C10) = 0<br />/1: getpid() = 6650 [6589]<br />/1: door_call(257, 0x08045C48) = 0<br />/1: close(257) = 0<br />/1: fxstat(2, 256, 0x080459B8) = 0<br />/1: time() = 1169599778<br />/1: getpid() = 6650 [6589]<br />/1: putmsg(256, 0x08046378, 0x08046384, 0) = 0<br />/1: open("/var/run/syslog_door", O_RDONLY) = 257<br />/1: door_info(257, 0x080457D0) = 0<br />/1: getpid() = 6650 [6589]<br />/1: door_call(257, 0x08045808) = 0<br />/1: close(257) = 0<br />/1: fxstat(2, 256, 0x08046A88) = 0<br />/1: time() = 1169599778<br />/1: getpid() = 6650 [6589]<br />/1: putmsg(256, 0x08047448, 0x08047454, 0) = 0<br />/1: open("/var/run/syslog_door", O_RDONLY) = 257<br />/1: door_info(257, 0x080468A0) = 0<br />/1: getpid() = 6650 [6589]<br />/1: door_call(257, 0x080468D8) = 0<br />/1: close(257) = 0<br />/1: shutdown(255, SHUT_RDWR, SOV_DEFAULT) = 0<br />/1: close(255) = 0<br />/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)<br /></code></pre></blockquote><br />Looks like the main difference starts here:<br /><blockquote><pre><code>/1: open("/etc/hosts.allow", O_RDONLY) = 257<br />/1: close(257) = 
0</code></pre></blockquote>That explains a lot -- the "hosts.allow" file is part of the tcpwrappers system, which controls access to various daemons on the system based on access control rules set by the system administrator. No wonder I am getting a connection but then immediately getting booted. It is trying to open the hosts.allow file, but then is immediately closing it, vs actually reading and processing the file as seen in the working connection. Does the process not have enough filehandles?<br /><blockquote><pre><code>root@perftest-db01:~# pfiles 6650 |head -2<br />6650: /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql<br />Current rlimit: 8192 file descriptors</code></pre></blockquote>Nope, doesn't look that way -- it's configured to use 8192 filehandles. My next clue was the file descriptor number that was returned by the "open" system call, 257. That's awfully near one of those magic "power of 2" boundaries. I started snooping around in google.<br /><br />It turns out that under Solaris, and maybe *BSD also, the tcpwrappers library (libwrap) uses the "stdio" library to manage IO. This library does not understand file handles above 255 (the 32-bit stdio FILE structure stores the descriptor in an 8-bit field). As the mysql server continues to collect client connections and open tables for reading, eventually this file descriptor boundary is crossed, and calls to open "hosts.allow" appear to fail because they return too high a file descriptor number. tcpwrappers appears to fail closed, so since it cannot read the "hosts.allow" file, it denies access to the service by immediately closing the communication channel.<br /><br />Fortunately, <a href="http://technopark02.blogspot.com/2006/08/solaris-workaround-to-stdios-255-open.html">there is a fix</a>. <a href="http://technopark02.blogspot.com/">Giri Mandalika</a> has a <a href="http://technopark02.blogspot.com/2006/08/solaris-workaround-to-stdios-255-open.html">blog entry</a> that references the issue and is a good resource on the problem. 
The solution is to use the extendedFILE library that's provided in Solaris Express 06/06 or later (So this is included in Nexenta Alpha 6, and possibly earlier):<br /><blockquote><pre><code><br />root@perftest-db01:~# export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1<br />root@perftest-db01:~# /etc/init.d/mysql restart<br /></code></pre></blockquote><br />(Obviously I will also need to modify the /etc/init.d/mysql startup script to include the LD_PRELOAD_32). Now, I start up a test program to artificially create a bunch of connections to the database, and see what a truss looks like now:<br /><blockquote><pre><code><br />root@perftest-db01:~# truss -w all -p 6846/1<br />/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)<br />/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) = 1<br />/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0<br />/1: accept(11, 0x08047918, 0x08047928, SOV_DEFAULT) = 294<br />/1: fcntl(11, F_SETFL, FWRITE) = 0<br />/1: sigaction(SIGCLD, 0x080473F0, 0x08047470) = 0<br />/1: getpid() = 6846 [6785]<br />/1: getpeername(294, 0xFEF47A90, 0x08047488, SOV_DEFAULT) = 0<br />/1: getsockname(294, 0xFEF47A80, 0x08047488, SOV_DEFAULT) = 0<br /><span style="font-weight: bold;">/1: open("/etc/hosts.allow", O_RDONLY) = 295</span><br />/1: fstat64(295, 0x08046AF0) = 0<br />/1: fstat64(295, 0x08046A20) = 0<br />/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY<br /><span style="font-weight: bold;">/1: read(295, " # / e t c / h o s t s".., 8192) = 677</span><br />/1: read(295, 0x5122F9D4, 8192) = 0<br />/1: llseek(295, 0, SEEK_CUR) = 677<br />/1: close(295) = 0<br />/1: open("/etc/hosts.deny", O_RDONLY) = 295<br />/1: fstat64(295, 0x08046AF0) = 0<br />/1: fstat64(295, 0x08046A20) = 0<br />/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY<br />/1: read(295, " # / e t c / h o s t s".., 8192) = 901<br />/1: read(295, 0x5122F9D4, 8192) = 0<br />/1: llseek(295, 0, SEEK_CUR) = 901<br />/1: close(295) = 0<br />/1: getsockname(294, 0x08047908, 0x08047928, 
SOV_DEFAULT) = 0<br />/1: fcntl(294, F_SETFL, (no flags)) = 0<br />/1: fcntl(294, F_GETFL) = 2<br />/1: fcntl(294, F_SETFL, FWRITE|FNONBLOCK) = 0<br />/1: setsockopt(294, ip, 3, 0x0804745C, 4, SOV_DEFAULT) = 0<br />/1: setsockopt(294, tcp, TCP_NODELAY, 0x0804745C, 4, SOV_DEFAULT) = 0<br />/1: time() = 1169601081<br />/1: lwp_kill(273, SIG#0) Err#3 ESRCH<br />/1: lwp_create(0x08047210, LWP_DETACHED|LWP_SUSPENDED, 0x08047434) = 274<br />/1: lwp_continue(274) = 0<br />/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)<br /></code></pre></blockquote>As you can see above - the "open" command on the "hosts.allow" file is returning a filehandle greater than 255, but reading and processing the hosts.allow file proceeds normally, and the connection is accepted.<br /><br />Yay for truss!<br /><br />Technorati Tags: <a href="http://technorati.com/tag/mysql5" rel="tag">mysql5</a>, <a href="http://technorati.com/tag/nexenta" rel="tag">nexenta</a>, <a href="http://technorati.com/tag/opensolaris" rel="tag">opensolaris</a>, <a href="http://technorati.com/tag/truss" rel="tag">truss</a>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-55019050111717642962007-01-22T22:44:00.000-08:002007-02-10T23:12:53.743-08:00ZFS featuresHere's a post I just entered on the Nexenta/gnusolaris Beginners Forum that has some good info about ZFS. Apparently the formatting got eaten on the mailing list so I'm reposting it here:<br /><br /><br /><blockquote>Hi all,<br /><br />Can I have it installed concurrently with linux and allocate linux partitions to the RAID Z? or RAID-Z takes the whole disks?<br /></blockquote><br /><br />There are two "layers" of partitions in opensolaris; the first is managed with the "fdisk" utility, the second is managed with the "format" utility - these partitions are aka "slices". 
I am not an expert, but I believe that the "fdisk" managed partitions are the pieces that linux/windows/etc sees. You first would allocate one of these partitions to Solaris, and from there you can additionally split that fdisk partition into root/swap/data "slices". I believe that the linux partitions you'd see would be visible via the "fdisk" command.<br /><br />According to some of the ZFS faq/wiki resources, ZFS is "better" if it manages the entire disk, however, it will work just fine managing either "partitions" or "slices". You can even make a ZFS pool with individual files.<br /><br />Here is an example of one of my disks. There is one "fdisk" partition, and a few "slices":<br /><br /><pre><code><br />root@medb01:~# fdisk -g /dev/rdsk/c0t0d0p0 <br />* Label geometry for device /dev/rdsk/c0t0d0p0<br />* PCYL NCYL ACYL BCYL NHEAD NSECT SECSIZ<br /> 48638 48638 2 0 255 63 512 <br /><br />root@medb01:~# prtvtoc /dev/rdsk/c0t0d0p0 <br />* /dev/rdsk/c0t0d0p0 partition map<br />*<br />* Dimensions:<br />* 512 bytes/sector<br />* 63 sectors/track<br />* 255 tracks/cylinder<br />* 16065 sectors/cylinder<br />* 48640 cylinders<br />* 48638 accessible cylinders<br />*<br />* Flags:<br />* 1: unmountable<br />* 10: read-only<br />*<br />* First Sector Last<br />* Partition Tag Flags Sector Count Sector Mount Directory<br /> 0 0 00 16065 8401995 8418059<br /> 1 0 00 8418060 16787925 25205984<br /> 2 5 01 0 781369470 781369469<br /> 6 0 00 25205985 756147420 781353404<br /> 7 0 00 781353405 16065 781369469<br /> 8 1 01 0 16065 16064<br /></code></pre><br /><br />Note that in the following examples, I'll create ZFS pools with "c0tXd0s6", that is, the 6th "slice" listed in the solaris partition table.<br /><br /><blockquote><br />Alternatively, Can I mount my Linux RAID partitions on Nexenta, at least for migration purposes? 
What about the LVM disks?<br /></blockquote><br /><br />As far as I know, there are no LVM tools or Linux-supported filesystem types built into Opensolaris/Nexenta - i.e. you could not just "mount -t ext3" a linux filesystem and be able to read it. Since you've mentioned that you're running a VMware server, I suppose it may be possible to have both guest operating systems running and copy the data over the 'network'. Also it's likely that Nexenta won't know about LVM managed partitions; it would have to be a real honest-to-goodness partition.<br /><br /><blockquote><br />What about RAID-Z features:<br />Can I hot-swap a defective disk?<br /></blockquote><br /><br />This should be possible, assuming that your hardware supports it. You may need to force a rescan of the devices if you replace a disk; check devfsadm. Reintegrating the new disk into the pool would be accomplished with a "zpool replace pool device [new device]"<br /><br /><blockquote><br />Can I add a disk to the server and tell it to enlarge the pool, to make more space available on the preexisting RAID?<br /></blockquote><br /><br />Yes, with a caveat - ZFS doesn't do any magic stripe re-balancing. If you have a 4-disk raidz pool and add another disk, what you really have is a 4-disk raidz with a single disk tacked on at the end with no redundancy. Best practice would be to add space in 'chunks' of several disks. 
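If you don't have spare disks, the same growth behavior can be explored risk-free with file-backed vdevs (as mentioned above, a pool can be built from plain files). This is a sketch for a Solaris-derived system with ZFS; the paths and sizes are made up, and it needs root:

```shell
# Build a sandbox raidz from three file-backed vdevs, then tack on a
# fourth "disk" -- note zpool refuses the mismatched replication level
# until forced, which is exactly the no-redundancy caveat above.
mkfile 128m /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
zpool create sandbox raidz /var/tmp/d1 /var/tmp/d2 /var/tmp/d3
zpool add sandbox /var/tmp/d4      # refused: mismatched replication level
zpool add -f sandbox /var/tmp/d4   # forced; single vdev with no redundancy
zpool status sandbox
zpool destroy sandbox && rm /var/tmp/d1 /var/tmp/d2 /var/tmp/d3 /var/tmp/d4
```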
Fortunately I am in the middle of building a Nexenta-based box with 4 SATA drives so I can play around with some of the commands and show you the output:<br /><br />Here is a 4-disk zpool using raidZ:<br /><br /><pre><code><br />root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6 <br />root@medb01:~# zpool status u01 <br /> pool: u01<br />state: ONLINE<br />scrub: none requested<br />config:<br /><br /> NAME STATE READ WRITE CKSUM<br /> u01 ONLINE 0 0 0<br /> raidz1 ONLINE 0 0 0<br /> c0t0d0s6 ONLINE 0 0 0<br /> c0t1d0s6 ONLINE 0 0 0<br /> c0t2d0s6 ONLINE 0 0 0<br /> c0t3d0s6 ONLINE 0 0 0<br /><br /></code></pre><br /><br />Here is a 3-disk raidZ pool that I "grow" by adding a single additional disk. Note the subtle indentation difference on c0t3d0s6 in this example; it is not part of the original raidz1 and is just a standalone disk in the pool.<br /><br /><pre><code><br />root@medb01:~# zpool destroy u01 <br />root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 <br />root@medb01:~# zpool add u01 c0t3d0s6 <br />invalid vdev specification<br />use '-f' to override the following errors:<br />mismatched replication level: pool uses raidz and new vdev is disk<br />root@medb01:~# zpool add -f u01 c0t3d0s6 <br />root@medb01:~# zpool status u01 <br /> pool: u01<br />state: ONLINE<br />scrub: none requested<br />config:<br /><br /> NAME STATE READ WRITE CKSUM<br /> u01 ONLINE 0 0 0<br /> raidz1 ONLINE 0 0 0<br /> c0t0d0s6 ONLINE 0 0 0<br /> c0t1d0s6 ONLINE 0 0 0<br /> c0t2d0s6 ONLINE 0 0 0<br /> c0t3d0s6 ONLINE 0 0 0<br /><br /></code></pre><br /><br /><br />Here is an example of adding space in "chunks", note the size of the volume is different in the "zpool list" before and after.<br /><br /><pre><br />root@medb01:~# zpool destroy u01 <br />root@medb01:~# zpool create u01 mirror c0t0d0s6 c0t1d0s6 <br />root@medb01:~# zpool list u01 <br />NAME SIZE USED AVAIL CAP HEALTH ALTROOT<br />u01 360G 53.5K 360G 0% ONLINE -<br />root@medb01:~# 
zpool add u01 mirror c0t2d0s6 c0t3d0s6 <br />root@medb01:~# zpool list u01 <br />NAME SIZE USED AVAIL CAP HEALTH ALTROOT<br />u01 720G 190K 720G 0% ONLINE -<br />root@medb01:~# zpool status u01 <br /> pool: u01<br />state: ONLINE<br />scrub: none requested<br />config:<br /><br /> NAME STATE READ WRITE CKSUM<br /> u01 ONLINE 0 0 0<br /> mirror ONLINE 0 0 0<br /> c0t0d0s6 ONLINE 0 0 0<br /> c0t1d0s6 ONLINE 0 0 0<br /> mirror ONLINE 0 0 0<br /> c0t2d0s6 ONLINE 0 0 0<br /> c0t3d0s6 ONLINE 0 0 0<br /></pre><br /><br />PS, doing it this way appears to stripe writes across the two mirrored "subvolumes".<br /><br /><blockquote><br />Does it have a facility similar to LVM, where I can create 'logical volumes' on top of the RAID and allocate/deallocate space as needed for flexible storage management (without putting the machine offline)?<br /></blockquote><br /><br />Yes, there are two layers in ZFS, the pool management, managed through the "zpool" command, and the filesystem management, through the "zfs" command. Individual filesystems are created as subdirectories of the base pool, or can be relocated with the "zfs set mountpoint" option if you desire. 
Here I create a ZFS called /u01/opt with a 100MB quota, and then increase the quota to 250MB.<br /><br /><pre><br />root@medb01:~# zfs create -oquota=100M u01/opt <br />root@medb01:~# df -k /u01 /u01/opt <br />Filesystem kbytes used avail capacity Mounted on<br />u01 743178240 26 743178105 1% /u01<br />u01/opt 102400 24 102375 1% /u01/opt<br />root@medb01:~# zfs set quota=250m u01/opt <br />root@medb01:~# df -k /u01 /u01/opt <br />Filesystem kbytes used avail capacity Mounted on<br />u01 743178240 26 743178105 1% /u01<br />u01/opt 256000 24 255975 1% /u01/opt<br /></pre><br /><br />Also, things like atime update, compression, etc, can be set on a per-filesystem basis.<br /><br /><br /><blockquote><br />Can I do fancy stuff like plug an e-sata disk to my machine and tell it to 'ghost' a 'logical volume' on-the-fly, online, without unmounting the volume? </blockquote><br /><br />Yes, this is possible. ZFS supports "snapshots" - moment-in-time copies of an entire ZFS filesystem. ZFS also supports a "send" and "receive" of a snapshot, so you can then take that moment-in-time copy of your filesystem and replicate it somewhere else. (Or just leave the snapshot lying around for recovery purposes.)<br /><br /> The procedure would be to create a ZFS volume on your external drive, and then "zpool import" that drive each time you plugged it in. Then create a snapshot on your filesystem and "send" it to the external drive, like so. (I don't have an external drive to import so I'll just create 2 pools). I test by creating a filesystem, creating a file in that filesystem, then snapshotting and sending that snapshot to a different pool. 
Note that the file I created exists in the destination when I'm done.<br /><br /><pre><code><br />root@medb01:/# zpool destroy u01 <br />root@medb01:/# zpool destroy u02 <br />root@medb01:/# zpool create u01 mirror c0t0d0s6 c0t1d0s6 <br />root@medb01:/# zpool create u02 mirror c0t2d0s6 c0t3d0s6 <br />root@medb01:/# zfs create u01/data <br />root@medb01:/# echo "test test test" > /u01/data/testfile.txt <br />root@medb01:/# zfs snapshot u01/data@send_test <br />root@medb01:/# zfs send u01/data@send_test | zfs receive u02/u01_copy <br />root@medb01:/# ls -l /u02/u01_copy <br />total 1<br />-rw-r--r-- 1 root root 15 Jan 23 04:49 testfile.txt<br />root@medb01:/# cat /u02/u01_copy/testfile.txt <br />test test test<br />root@medb01:/# <br /></code></pre><br /><br />Hope all this helps (and maybe makes it into the wiki too :-) )Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-77115197366856413532006-12-18T23:51:00.000-08:002007-02-11T01:23:01.746-08:00Modifying a Nexenta Boot ISOToday I needed to modify a nexenta Alpha-6 install ISO. There were several things I needed to fix.<br /><br />1. We have two revisions of the same Supermicro H8DAR-T based beige-box systems. Solaris10/Nexenta works great on the 2.01 based motherboards. The older version - version 1.01 - does not have a SATA chipset that is supported by Nexenta out of the box. (The PCI ID is "pci11ab,6041.3" versus "pci11ab,6041.9" on the 2.01 version). I needed to add this PCI ID to the "/etc/driver_aliases" file. Aside: As far as I can tell, the only way to differentiate is to crack open the box and look for the silkscreened version on the back lefthand corner. :-( (Or, you could boot Nexenta/Solaris10 and see if it can see the disks :-) )<br /><br />2. There is a bug in the nexenta-install.sh script that will hang the system while scanning for partitions after manual partitioning. 
I found a response to <a href="http://www.gnusolaris.org/phpbb/viewtopic.php?t=5506&sid=f5a62eaf789aad35c617e46389811788">this post</a> by LukeD in the Nexenta forums that is reported to fix this problem.<br /><br />3. We wanted to add a couple of packages to the "minimal" set.<br /><br />If we start a widespread Nexenta rollout there will be much more automation going into these CD images (Or even better, network images) to make for a hands-off install.<br /><br />Here's how I made the changes:<br /><br />1. Copied the .ISO image to another Nexenta system. (There is nothing Nexenta/Solaris specific to making the image, however, the commands used are somewhat different between different operating systems. On a linux system, you would use losetup instead of lofiadm. )<br /><br />2. Create a loopback device for the .iso and mount it:<br /><pre><code>lofiadm -a /opt/temp/elatte_installcd_alpha6_i386.iso<br />( this creates /dev/lofi/1 )<br />mkdir /mnt<br />mount -F hsfs /dev/lofi/1 /mnt<br /></code></pre>3. Copy all of the files to a temporary location. This is necessary because the .ISO image is read only.<br /><pre><code>mkdir /opt/temp/cd<br />cp -av /mnt/. /opt/temp/cd/.<br /></code></pre>4. Find and mount the miniroot image as a second loopback device. The miniroot is a gzipped UFS filesystem created in the "boot" directory of the CD. Most of the filesystem is here, although it appears that nexenta remounts /usr from the CD-ROM later in the boot process. This step is required because I am changing the "driver_aliases" file, which lives inside the gzipped miniroot -- if you plan only to modify the nexenta_install.sh and/or add packages, these steps (4-6) are not necessary.<br /><pre><code>cd /opt/temp/cd/boot/<br />mv miniroot miniroot.gz<br />gunzip miniroot.gz<br />lofiadm -a /opt/temp/cd/boot/miniroot<br />( this creates /dev/lofi/2 )<br />mkdir /mnt2<br />mount /dev/lofi/2 /mnt2</code></pre>5. 
Edit the driver_aliases file:<br /><pre><code>cd /mnt2/etc<br />(vi driver_aliases)</code></pre>... If there are any other files that need to change within the miniroot, edit them now.<br /><br />6. Unmount and re-gzip the miniroot, and clean up the other lofi mount too.<br /><pre><code>umount /mnt2<br />lofiadm -d /dev/lofi/2<br />umount /mnt<br />lofiadm -d /dev/lofi/1<br />cd /opt/temp/cd/boot<br />gzip miniroot<br />mv miniroot.gz miniroot</code></pre>7. Fix the script mentioned in problem #2 and add some packages to the minimal set.<br /><pre><code>cd /opt/temp/cd/root/usr/gnusolaris<br />vi nexenta-install.sh<br />(I chose to do a "grep -i "Unknown_fstype" rather than run the sed script referenced in the forum post above)<br />vi base-minimal.lst<br />(add some packages here)</code></pre>8. Create the new install CD. I found some good instructions on <a href="http://www.sun.com/bigadmin/features/articles/device_driver_install.html">BigAdmin</a> at the end of this article in the section "Using CD/DVD ISO Files":<br /><pre><code>mkisofs -o /opt/temp/nexenta_boot.iso -b boot/grub/stage2_eltorito \<br /> -c .catalog -no-emul-boot -boot-load-size 4 \<br /> -boot-info-table -relaxed-filenames -ldots -N -l -R \<br /> -d -D -V Elatte_InstallCD /opt/temp/cd/<br /></code></pre>This will create the new .ISO file. From there you'll need to burn it to disk.<br /><br />Unfortunately, adding the new driver_alias line didn't work for me. It sees the controller now, but it pukes about not knowing how to talk to it. (Forgot the exact error message but it looked ominous). I will have to do some testing on a known-working box (one of the later revision ones) to see if the other changes I made to this CD were successful.<br /><br />Edit: The mkisofs command I originally listed was incorrect. I have found the magic mkisofs command parameters and updated this post accordingly. 
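Before burning, it's worth sanity-checking the volume name on the finished image. One way to do that (a sketch; assumes the cdrtools/genisoimage `isoinfo` utility is installed, which is not part of the steps above) is:

```
isoinfo -d -i /opt/temp/nexenta_boot.iso | grep "Volume id"
```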
Note that the volume name MUST be "Elatte_InstallCD".Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-60810530048416875522006-10-20T22:33:00.000-07:002006-10-21T00:08:00.184-07:00Debian linux ethernet bondingWe're working on some fault-tolerant deployments of Debian linux systems. It turns out that the bonding support is available in the stock debian kernel (sarge) and, luckily, in the custom kernel we'd built later.<br /><br />I dug up a few howtos out there, but none of them really had all of the pieces in one place, specifically for the packages that Debian uses out of the box.<br /><br />Two software pieces are necessary - the "bonding" driver, and the "ifenslave" package. The bonding driver is in the default sarge kernel, and appears to be in the defaults during a kernel build. The "ifenslave" package is the userspace program used to control the binding of physical interfaces to the bonding driver. To install this, simply<br /><blockquote><pre>apt-get install ifenslave-2.6</pre></blockquote>It's important to get the latest version; the version that installs with a plain "<span style="font-family:courier new;">apt-get install ifenslave</span>" doesn't seem to work properly.<br /><br />Next, there are two files which need to change to tell the kernel to load the bonding driver. I appended these lines to the following files:<blockquote><pre><br /><span style="font-weight: bold;">/etc/modules:</span><br />bonding<br /><br /><span style="font-weight: bold;">/etc/modprobe.d/aliases:</span><br />alias bond0 bonding<br />options bonding mode=active-backup miimon=100 max_bonds=1<br /></pre></blockquote>If more than one bonding interface is needed, add additional aliases in this file, and increase the "max_bonds" option as necessary. We plan on using bonding on a few machines that act as routers, so they will need to have multiple bonded interface sets.
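For instance, a router with two bonded pairs might use the following in /etc/modprobe.d/aliases (a sketch extending the single-bond example above; only the extra alias line and max_bonds=2 change):

```
alias bond0 bonding
alias bond1 bonding
options bonding mode=active-backup miimon=100 max_bonds=2
```

Each bondN interface then gets its own stanza in /etc/network/interfaces, with its own ifenslave line naming the physical NICs for that pair.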
<br /><br />Finally, the <span style="font-family:courier new;">bond0</span> interface must be set up in the <span style="font-family:courier new;">/etc/network/interfaces</span> file. More than likely there is an existing entry for your primary network interface, e.g. <span style="font-family:courier new;">eth0</span>. I just changed the interface name from <span style="font-family:courier new;">eth0</span> to <span style="font-family:courier new;">bond0</span>, and added the following line:<blockquote><pre>up ifenslave bond0 eth0 eth1</pre></blockquote>For reference, the entire <span style="font-family:courier new;">/etc/network/interfaces</span> file looks like this - notice that there are no individual entries for <span style="font-family:courier new;">eth0</span> and <span style="font-family:courier new;">eth1</span>.<br /><blockquote><pre># generated by FAI<br />auto lo bond0<br />iface lo inet loopback<br />iface bond0 inet static<br />address 192.168.195.145<br />netmask 255.255.255.0<br />broadcast 192.168.195.255<br />gateway 192.168.195.1<br />up ifenslave bond0 eth0 eth1<br />post-up /opt/tools/bin/init-ipmi</pre></blockquote>The linux <a href="http://linux-net.osdl.org/index.php/Bonding">Bonding howto </a>is very comprehensive and covers the different modes of operation of this driver, as well as installation instructions for different flavors of linux and some discussion of deployment scenarios.
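Once the interface is up, the mode in use and the state of each slave can be read back from the bonding driver's proc interface (standard for the 2.6 bonding driver; interface name from the example above):

```
cat /proc/net/bonding/bond0
```

This is also a quick way to confirm which slave is currently active when testing failover.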
We'll be experimenting with various bonding modes this week to see what we can get away with; currently we're planning on running in the "active-backup" mode, which is a simple active/passive failover.<br /><br />Here are some other resources about linux/debian ethernet bonding:<br />http://www.howtoforge.com/nic_bonding<br />http://glasnost.beeznest.org/articles/179<br />http://www.debianhelp.co.uk/bonding.htm<br /><br />More about bonding later!Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-14135777518916398312006-09-27T18:39:00.000-07:002007-02-10T23:20:49.764-08:00Apache hackeryThe following is a patch against apache 2.0.54 (Probably applies clean to other versions, I've applied it to 2.0.55 also). It's built for debian linux; it's possible that some hacking may be necessary to get it to apply to a vanilla version of httpd but I doubt it. Copy the attached patch to a file called, for example, 000_ProxyMultiSource.<br /><br />Instructions for building on debian:<br /><pre><code>apt-get install debhelper apache2-threaded-dev<br />(also will need gcc, libtool, autoconf, etc, if not already installed)<br />cd /opt/apache/build<br />apt-get source apache2<br />cp 000_ProxyMultiSource debian/patches/.<br />debian/rules binary</code></pre>Install the resulting .deb files: (We use the worker MPM, YMMV)<br /><pre><code>dpkg -i apache2_2.0.54-5_amd64.deb apache2-common_2.0.54-5_amd64.deb apache2-mpm-worker_2.0.54-5_amd64.deb</code></pre>What it does:<br /><br />This adds a new configuration directive to the apache config file. It is defined within the virtual host. The config item/syntax is:<br /><pre><code>ProxyMultiSource <ip> [IP] [IP] [IP]</code></pre>This causes the server, when acting as a proxy server, to randomly set its source address to one of the <n> IP addresses above for each new request.
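As a concrete (hypothetical) sketch, a forward-proxy virtual host using the directive might look like this; the port and IP addresses are placeholders:

```apache
<VirtualHost *:3128>
    ProxyRequests On
    ProxyMultiSource 66.166.22.5 66.156.12.5 66.156.12.6
</VirtualHost>
```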
This can be used, for example, to have a machine with a few DSL/T1 lines connected to it to split proxy traffic among all the links. It doesn't look all that random, especially at first, since all of the threads presumably have the same random seed and so end up generating the same sequence of numbers. It hasn't been a big enough issue for me to fix it, since it evens out over time.<br /><br />Note that these IP addresses actually have to be live on your system or the bind will fail, probably with spectacular results. (I suspect it will lock up, since it repeatedly retries failures to bind() the local address -- this is to deal with "Address already in use" issues where the local and remote address/port pairs are identical across two transactions). Also see my previous post "Put it where it doesn't belong" to make sure that this IP traffic makes it out the appropriate interface instead of everything riding out the default route. That is not what you want.<br /><br />This also makes the variable "proxy-source" available to the logging system - for example:<br /><pre><code>LogFormat "%h %l %u %t \"%r\" %>s %b %T %{proxy-source}n" proxy</code></pre>will include the chosen proxy source IP address as the last value of the log entry. It will show as a "-" if it's not set -- if the request comes out of cache, or if it's a continuation of an HTTP/1.1 keepalive request, this may happen. (I may look into a way of preserving it for HTTP/1.1 requests in the future)<br /><br />This code seems to be pretty stable; there have been a couple of times where it's started up and given a signal 8(SIGWTF) but that's been rare.
We see it gleefully take over 100 hits per second and push 100mbits+ traffic for extended periods.<br /><br />Note: This has not been tested under the following circumstances:<br />- Multiple virtual hosts sharing the same list of IPs<br />- Multiple virtual hosts with discrete lists of IPs<br />- A single ProxyMultiSource IP (probably useful in its own right, eh)<br />- Apache built with this patch _not_ implementing the ProxyMultiSource directive. (It may fail to bind an address at all)<br />- Virtual Hosts configured but not implementing the ProxyMultiSource directive.<br />If this stuff doesn't work, btw, it should not bomb the whole server, only the proxy functionality.<br /><br /> Note: I haven't written anything substantial in C in like 10 years, please be nice.<br /><br />Get the patch here:<br /><a href="http://amiller.org/000_ProxyMultiSource">000_ProxyMultiSource</a>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-81000283594900185272006-09-22T09:54:00.000-07:002007-02-10T23:34:30.943-08:00Put it where it doesn't belong!Our latest fight with linux has been to get a machine with multiple connections to the internet via different circuits to pass traffic correctly. The machine is a web proxy which will bind the socket to a specific source address to round-robin the traffic across the two networks. This would be way useful for a company to effectively "bind" a few low-bandwidth links (dsl, t1, etc) into a higher-bandwidth office proxy.<br /><br />The first problem is to make the traffic that is sourced from a specific set of IP addresses use a different routing scheme.
In this example, all of this traffic is going out to the internet, so with a default configuration, it would use the default route regardless of what the source address was set to.<br /><br />We have a router machine with a bunch of interfaces - 2 onboard interfaces and a <a href="http://www.silicom.co.il/pgx.php?p2=182">Silicom PXG-6</a> (6-port gigabit card). Let's say we have two DSL lines with a small set of static IPs, 66.166.22.0/28 and 66.156.12.0/28. We had 66.166.22.0 delivered first so it's set up as the default route to the internet over eth7, while eth6 is hooked up to the router for our internal LAN.<br /><br />The routing table looks like this:<br /><pre><br />66.166.22.0 0.0.0.0 255.255.255.240 U 0 0 0 eth7<br />66.156.12.0 0.0.0.0 255.255.255.240 U 0 0 0 eth0<br />192.168.0.0 192.168.1.13 255.255.0.0 UG 0 0 0 eth6<br />10.0.0.0 192.168.1.13 255.0.0.0 UG 0 0 0 eth6<br />0.0.0.0 66.166.22.1 0.0.0.0 UG 0 0 0 eth7<br /></pre>The new link has been configured on eth0.<br /><br />The first thing we'll need to configure this is the "iproute" tool; on debian, installing it is as easy as "apt-get install iproute".<br /><br />The first thing to do with this tool is to create a parallel routing table. We will call the new table "pipe2". First, we need to edit /etc/iproute2/rt_tables and add a new line:<br /><br />/etc/iproute2/rt_tables: (New entry added in <span style="font-weight: bold;">bold</span>)<br /><pre># reserved values<br />#<br />255 local<br />254 main<br />253 default<br /><span style="font-weight: bold;">200 pipe2 </span><br />0 unspec<br />#<br /># local<br />#<br />#1 inr.ruhep<br /></pre>(The number 200 is arbitrary but must fall between local and unspec)<br /><br />Now add routes to this table.
I ended up putting these commands as "post-up" rules in the debian networking scripts for this interface (/etc/network/interfaces)<br /><pre>ip route add 10.0.0.0/8 via 192.168.1.13 table pipe2<br />ip route add 192.168.0.0/16 via 192.168.1.13 table pipe2<br />ip route add 172.16.0.0/12 via 192.168.1.13 table pipe2<br />ip route add default via 66.156.12.1 table pipe2<br /></pre>This sets up what the routing table should look like for traffic sourced from the second set of public addresses. Note that the rules to send office LAN traffic internally have to be duplicated in this table.<br /><br />Next, we must insert a policy route that tells the kernel when to apply this routing table to the traffic:<br /><pre>ip rule add from 66.156.12.0/28 table pipe2<br /></pre><br />This gets traffic that is sourced from an IP on 66.156.12.0/28 to use the correct default router. However, there are still a few more steps. By default, linux will answer ARPs for any IP addresses it owns over any interface. This means that, in the above example, eth7 (66.166.22.0 net) could claim to be the owner of an IP on the 66.156.12.0 network.<br /><br />This is solved with the arp_filter control in /proc/sys/net/ipv4/conf/(interface)/arp_filter. We eliminated this with:<br /><pre>for i in `echo /proc/sys/net/ipv4/conf/*/arp_filter`; do echo 1 > $i; done<br /></pre><br /><a href="http://www.in-addr.de/pipermail/lvs-users/2002-June/005837.html">Here</a> is a great discussion on what arp_filter does.<br />An excellent discussion of ARP as implemented on linux is <a href="http://linux-ip.net/html/ether-arp.html">Here</a>. (This is where I found the solution to this problem, under "Arp Flux")<br /><br />In retrospect, we might have wanted to do that part first, to prevent the arp caches of various equipment from getting the MAC of the wrong interface. If stuff upstream gets the wrong MAC in their table, you can reset the hardware (DSL modem) or ask your ISP to flush their arp cache.
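The local neighbor cache can be inspected and flushed with iproute2 as well (standard `ip neigh` syntax; the device name is from the example above):

```
# show the current ARP/neighbor cache
ip neigh show
# drop entries learned on the second uplink
ip neigh flush dev eth0
```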
We also may or may not have had some luck with the "ip neigh flush" command.<br /><br />Next, I will publish the patches to apache2 that allow for this proxy multisourcing.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-58628576103792047942006-09-21T18:51:00.000-07:002007-02-11T00:49:06.602-08:00Home fileserver on Solaris Express/ZFS(Mirroring this for posterity in case <a href="http://unixfoo.com/">Svens</a> ever forgets to pay his yahoo hosting bill :-) )<br /><br />Hardware:<br />Basically I bought everything at Fry's. Going online, a lot of this stuff might be even cheaper, but probably not by a lot. With the exception of the raid controllers, all of this stuff was available at every store. The raid controller I finally found at the Brokaw Fry's. The only thing I re-used was an old 20x IDE CD-ROM; in retrospect, since I ended up re-installing three times, it would have been worth $30 for a fast CD-ROM to get those 2 hours of my life back.<br /><br /><b>Case: <a href="http://www.provantage.com/aluminus-atx-full-tower-case%7E220154976.htm" class="external free" title="http://www.provantage.com/aluminus-atx-full-tower-case~220154976.htm" rel="nofollow">Aluminus Ultra</a>: $130</b><br />This is kinda flashy because of the side window and high gloss and everything, but it did have the advantage of having a neat internal 3.5-inch hard drive mounting rail system. Also it has five 5.25-inch bays, which is important for the other disks plus CD-ROM. Since it's mostly 120mm fans it's relatively quiet -- still a lot noisier than my Shuttle, but quieter than my old tower.<br /><br /><b>SATA Cage: "random fry's brand": $120</b><br />This thingie lets you mount 4 sata drives in a neat hotswap case that takes up three 5.25-inch bays.<br /><br /><b>2 x SATA Controller: "SIIG 4-port RAID": $80/ea</b><br />These are SiliconImage 3114 chipset based RAID cards.
They come with all of the power and data cables you'll need.<br /><br /><b>8 x SATA disks: "Maxtor 250G Maxline Plus II" $100/ea </b><br />These are plain SATA-I disks but the controller only does SATA-I anyway.<br /><br /><b>Motherboard: "Asus K8N socket 754" $47 - open stock</b><br />Piece of crap open box motherboard. I got it because it has AGP video, 3-4 PCI slots, and builtin gigabit ethernet.<br /><br /><b>CPU: "AMD Sempron 2800+ 64bit" $41</b><br />Retail box -- slower CPU than what I expected, but it's good enough to run the RAID calcs, samba, httpd, etc, and it's 64 bit :-)<br /><br /><b>RAM: "Patriot 1G value Ram" $90</b><br />With $15 rebate that I'll probably forget about (edit 02/10/2007 -- I forgot)<br /><br /><b>Video: "No-Name GeForce MX400" $50</b><br />I bought it cause it was cheap and did not have a fan, enough crap spinning in there already.<br /><br /><b>I did not buy a floppy drive</b><br />If you do this, buy a floppy drive, or at least make sure you have one handy you can use temporarily.<br /><br />The total damage was around $1500 w/ tax and everything. The total space after RAID is 1.8T, less than a dollar a gig.<br /><br />II. Trials and Tribulations:<br />Everything snapped into this case pretty well. I spent extra time making sure that the sata and power cables were tied down nicely, to promote good airflow.<br /><br />I started installing the latest build of OpenSolaris, Nevada Build 31, because I wanted to play with ZFS and I wanted the most mature sata/network driver support available. It's 4 CDs not counting the language or "add ons" (/opt/sfw) pack. The main problem I had was that Solaris 10 does not recognize these RAID cards out of the box. Solaris still does not recognize any SATA hardware that is acting as a raid card. The way to "fix" these cards is to remove their RAID functionality by loading a straight IDE BIOS.
Download the IDE BIOS <a href="http://www.siliconimage.com/support/supportsearchresults.aspx?pid=28&cid=15&ctid=2&osid=0&" class="external free" title="http://www.siliconimage.com/support/supportsearchresults.aspx?pid=28&cid=15&ctid=2&osid=0&" rel="nofollow">here</a>. Also download the "BIOS Updater Utilities".<br /><br />You'll need a DOS boot disk to copy these files to. If you do not have a DOS boot disk, there are instructions in this .ZIP file that tell you where to download a FreeDOS boot disk image where you can then copy the BIOS updater and the IDE BIOS .bin file. Getting a floppy drive hooked up is the hard part. Once you're booted to DOS, the command I used was <pre>A:\> UPDFLASH B5304.BIN -v </pre><br />The command will carp about various stuff and then go about its business updating your Flash BIOS. For this specific RAID card, I think I had to tell it that the Flash memory was compatible with STT 39??010 1M flash. The command updated both of the cards in my system at the same time; it did not require me to run it twice or use special command-line flags.<br /><br />Thus updated, you can now reboot your system. You may notice that during POST, the cards are now called "Silicon Image 3114 SATALink" instead of "3114 SATARaid" and have no option to press <ctl-s> to enter their BIOS. You can now install Solaris from CD as normal.<br /><br />I do not care for installing Solaris off of CD. It gets to the "Using RPC for sysid configs" (or something) step of the boot, and then just hangs there for 5+ minutes. There's really no way to tell if your machine is horked, or your CD drive froze up, or what, it's just sitting there not doing anything.<br /><br />The installer could now see all 8 of my disks. I chose to put a small ~4G partition on c1d0s0 and 512M for swap on c1d0s1. I left slice 6 empty and a 1-cylinder slice at s7 for the metadb.
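Rather than re-entering the slices by hand in format(1M), the partition table can be copied from the first disk with prtvtoc and fmthard (a sketch; assumes identical disks and uses this system's device names):

```
# copy the VTOC of the first disk to the first disk on the second controller
prtvtoc /dev/rdsk/c1d0s2 | fmthard -s - /dev/rdsk/c3d0s2
```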
Once the operating system was installed, I mirrored onto the first disk of the second controller (disk4) by giving it the same partitioning scheme as disk 0 and using the standard solaris meta-commands:<br /><pre># metadb -fa c1d0s7 c3d0s7<br /># metainit -f d1 1 1 c1d0s0<br /># metainit -f d2 1 1 c3d0s0<br /># metainit -f d6 1 1 c1d0s1<br /># metainit -f d7 1 1 c3d0s1<br /># metainit d0 -m d1<br /># metainit d5 -m d6<br /># metaroot d0<br /># (edit vfstab to use d5 for swap)<br /># lockfs -fa<br /># reboot<br /></pre><br />Once the system came back up, I attached the metadevices:<br /><pre># metattach d0 d2<br /># metattach d5 d7<br /></pre><br />The sync went fast since these are small slices. At this point I did some additional configuration. You'll want to use the "svcadm" command to turn off things like autofs, telnet, ftp, etc:<pre># svcadm disable telnet<br /># svcadm disable autofs<br /># svcadm disable finger<br /></pre> (etc, I did not document exactly what I disabled but autofs has to be turned off if you want home directories to work :-)) Also if you have this specific motherboard, you'll need to add the following to your /etc/driver_aliases for the system to find your network card: nge "pci10de,df" (This tells the system to bind the Nvidia Gigabit Ethernet driver to the PCI card with vendor ID 10de and product ID df). Do the usual editing of /etc/hostname.nge0 /etc/inet/ipnodes /etc/hosts /etc/netmasks /etc/defaultrouter to get your network up and running.<br /><br />If you had something that Solaris just saw out of the box, you probably set this up already during the install. If not, you might have to add something different to your /etc/driver_aliases - there is pretty good google juice on the various ethernet cards out there. I rebooted at this point for the changes to take effect. <br /><br />Now for the fun part, ZFS.
For my non-root disks, I created another partition using almost all of the disks (I saved the first three cylinders since it looked like there was some boot information or something on there). ZFS was a lot easier than I thought it would be. Do a man on zfs and zpool. To create the pool, it was easy:<br /><pre>zpool create -f u01 raidz c1d0s6 c1d1s6 c2d0s6 c2d1s6 c3d0s6 c3d1s6 c4d0s6 c4d1s6</pre><br />The -f is to force it, because c1d0s6 and c3d0s6 are smaller than all of the other partitions. This command returned in about 4 seconds. It literally takes longer to type out the command than it does for ZFS to create you a 1.78TB filesystem. The filesystem will auto-magically be mounted on /u01 (because the pool name is u01, specified on the command line above, it could be any arbitrary name). From here, you can use the "zfs" command to create new filesystems that share data from this pool:<br /><pre><br />zfs create u01/home <br />zfs create u01/home/amiller <br />zfs set mountpoint=/home/amiller u01/home/amiller<br /></pre>None of this has to go into /etc/vfstab, the system just knows about it and mounts it at boot with "zfs mount -a".<br />So now on my box, I have: <br /><pre><br />amiller$ df -kh |grep u01<br />u01 1.8T 121K 1.7T 1% /u01 <br />u01/home 1.8T 119K 1.7T 1% /u01/home <br />u01/home/amiller 1.8T 3.7M 1.7T 1% /home/amiller <br /></pre>Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-66415244452891345042006-09-15T17:37:00.000-07:002006-09-15T18:20:17.824-07:00ZFS on NexentaThis week I've played around with <a href="http://gnusolaris.org/">Nexenta</a>. This is a neat operating system. It is basically a "distribution" of <a href="http://opensolaris.org/">OpenSolaris</a>, using a Debian-like package management system, built on the <a href="http://ubuntu.com/">Ubuntu</a> "Dapper Drake" release.
Fortunately, it just happened to support all of the pieces of hardware in the test system, a Supermicro H8DAR-T, with Broadcom ethernet and a Marvell88-based onboard SATA2 controller.<br /><br />I installed this in support of some filesystem/raid benchmarking I have been working on. ZFS seems to benchmark similarly to XFS on linux using software raid, while also having a lot of additional neat features, like snapshots, filesystem-level compression, clones, etc. With filesystem compression enabled, ZFS beats the pants off of these other filesystems in synthetic benchmarks, but I suspect that the files that tiobench creates compress very well and so require very little actual IO to the disks. The other side of that coin is that if you are working with compressible files (text, html, possibly even databases), ZFS+compression might work very well for real-world performance. Granted, the CPU utilization involved in managing the filesystem will be higher, but I've realized that I'd rather waste the CPU running the disks as fast as possible rather than have the systems sitting idle waiting for data to come off of the spindles. If I get motivated I may try to create a mysql database with some of our application data and do some read/write benchmarks.<br /><br />One interesting ZFS-ism I ran into was the size reporting of ZFS volumes. I created a raidz (ZFS version of Raid5) with 4 "400G" disks (which are actually about 370 gigs), which showed up in "df" as ~1.46TB -- which is the size I'd expect if all 4 of them were actual data drives. Software raid5 on linux and the hardware raid card both show ~1.1TB of space. It turns out that you don't magically gain space with ZFS; rather, the "parity tax" of raid5 is tacked on for each file as it is added to the filesystem.
To demonstrate:<br /><br />root@test:/u01/asdf# mkfile <span style="font-weight: bold;">2g</span> testfile <br /><br />root@test:/u01/asdf# ls -lh <br /><span style="font-weight: bold;">total 2.7G</span><br />-rw------T 1 root root <span style="font-weight: bold;">2.0G</span> Sep 16 00:58 testfile<br /><br />root@test:/u01/asdf# df -kh /u01/asdf <br />Filesystem Size Used Avail Use% Mounted on<br />u01/asdf 1.3T <span style="font-weight: bold;">2.7G</span> 1.3T 1% /u01/asdf<br /><br />Hoping to complete the benchmarks I've been doing and post them here soon.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0tag:blogger.com,1999:blog-5961897154189219803.post-61883387339020153362006-09-10T22:11:00.000-07:002006-09-10T22:25:41.525-07:00Linux BIOS researchWe did a lot of work in the BIOS of our machines last week. Doing BIOS edits by hand sucks; there's got to be a better way than connecting the keyboard and monitor to 100+ machines, rebooting them, and changing settings by hand.<br /><br />Ben dug up some information in a <a href="http://www.sun.com/products-n-solutions/hardware/docs/html/817-5248-18/chapter2.html">Sun v20z configuration guide</a>, which seems to share a similar BIOS with the machines we have. We had ECC memory enabled in the BIOS before, but we seemed to have a lot of machines dying and hanging. We changed the following additional settings:<br />ECC Logging: [Enabled]<br />ECC Chipkill: [Enabled]<br />ECC Scrubbing: [163 us]<br />L2 Scrubbing: [10.2 us]<br />L1 Scrubbing: [5.12 us]<br /><br />We also upgraded to a later Linux kernel, version 2.6.17.8, and added the <a href="http://bluesmoke.sourceforge.net/">EDAC module</a>.
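Once EDAC is loaded, its correctable-error counters show up under sysfs, and summing them makes a cheap health check to run from cron. This is a sketch: the function takes the sysfs directory as an argument (normally /sys/devices/system/edac/mc), since the exact layout varies a bit between kernel versions.

```shell
#!/bin/sh
# Sum the correctable-error counts across all memory controllers
# under an EDAC sysfs tree.
# Usage: edac_ce_total /sys/devices/system/edac/mc
edac_ce_total() {
    total=0
    for f in "$1"/mc*/ce_count; do
        # the glob stays literal if EDAC isn't loaded; skip unreadable paths
        [ -r "$f" ] || continue
        total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}
```

A nonzero (and especially a growing) total is the cue to start swapping DIMMs before the machine starts hanging.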
The k8/Opteron support was not in the mainline kernel but is buildable as a module.<br /><br />It's hard to declare a success based on the absence of machines hanging, but we've seen the EDAC catch some errors and haven't had a memory-related machine hang yet.<br /><br />Edit: A couple of weeks later, the failure rate is way down.Andy Millerhttp://www.blogger.com/profile/16912923643909155650noreply@blogger.com0