
Monday, June 18, 2007

Rebuilding a 3Ware Raid set in linux

This information is specific to the 3Ware 9500 series controller (more specifically, the 9500-4LP). However, the 3Ware CLI appears to be the same on the other 3Ware 9XXX controllers I have worked with (the 9550, for sure).


Under linux, the 3Ware cards can be manipulated through the "tw_cli" command. (The CLI tools can be downloaded for free from 3Ware's support website)

A healthy RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 OK - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 OK u0 372.61 GB 781422768 3PM0Q3B7


A failed RAID set looks like this:

dev306:~# /opt/3Ware/bin/tw_cli
//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 DEGRADED - 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Now I will remove this bad disk from the RAID set:


//dev306> maint remove c0 p3
Exporting port /c0/p3 ... Done.




I now need to physically replace the bad drive. Unfortunately, since our vendor wired some of our cables cockeyed, I usually generate some I/O on the disks at this point to see which of the four disks is actually bad. (Hint: the one with no lights on is the bad one.)


dev306:~# find /opt -type f -exec cat '{}' > /dev/null \;


With the bad disk identified and replaced, now I need to go back into the 3Ware CLI and find the new disk, then tell the array to start rebuilding.


dev306:~# /opt/3Ware/bin/tw_cli
//dev306> maint rescan
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p3].


//dev306> maint rebuild c0 u0 p3
Sending rebuild start request to /c0/u0 on 1 disk(s) [3] ... Done.

//dev306> info c0

Unit UnitType Status %Cmpl Stripe Size(GB) Cache AVerify IgnECC
------------------------------------------------------------------------------
u0 RAID-5 REBUILDING 0 256K 1117.56 ON OFF OFF

Port Status Unit Size Blocks Serial
---------------------------------------------------------------
p0 OK u0 372.61 GB 781422768 3PM0Q56Z
p1 OK u0 372.61 GB 781422768 3PM0Q3YY
p2 OK u0 372.61 GB 781422768 3PM0PFT7
p3 DEGRADED u0 372.61 GB 781422768 3PM0Q3B7


Note that p3 still shows a status of "DEGRADED" but now the array itself is "REBUILDING". Under minimal IO load, a RAID-5 with 400GB disks such as this one will take about 2.5 hours to rebuild.
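The rebuild can be monitored without sitting in the interactive session, since tw_cli also accepts its commands directly on the command line (a minimal sketch; the output format may vary slightly between CLI versions):

dev306:~# watch -n 60 /opt/3Ware/bin/tw_cli info c0

The %Cmpl column counts up as the rebuild progresses, and the unit status returns to OK when it finishes.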

Saturday, February 17, 2007

Path MTU discovery and MTU troubleshooting

Recently, while debugging some performance issues on a client's site, I came across some very interesting behavior. Some users were reporting that the site performed very well for a short period of time, but after a while performance became very poor, enough so to render the site unusable. Checking the apache logfiles for the IP addresses of those clients showed that the requests themselves were not taking an unusual amount of time; instead, the requests were coming into the webserver at a snail's pace.

Checking at the network level, I saw some strange things happening:

prod-lb01:~# tethereal -R "http.request and ip.addr == (client)"
125.362898 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
125.362922 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
126.612994 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
126.613018 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
129.615113 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
129.615135 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
135.616047 (client) -> (server) HTTP GET /search/stuff HTTP/1.1
135.616066 (server) -> (client) ICMP Destination unreachable (Fragmentation needed)
Fragmentation Needed? (ICMP Type 3/Code 4) Why would we need to fragment incoming packets? This should only happen if a packet is bigger than the Maximum Transmission Unit (MTU), and since this is all connected over ethernet, with a constant 1500-byte MTU, it is odd to see this.

Then I remembered that this site uses Linux Virtual Server (LVS) for load balancing incoming requests. LVS can be configured in several ways, but this site uses IP-IP (aka LVS-Tun) load balancing, which encapsulates the incoming IP packet inside another packet and sends that to the destination realserver. Because of this encapsulation, each request that hits the load balancer has an extra IP header tacked on to address the packet to the appropriate realserver, which adds 20 bytes to the packet.

Okay, so the actual MTU of requests that go to the load balancer is 1480 due to the encapsulation overhead. Snooping for this type of packet at the router, I notice that we're sending out a LOT of them:

(router):~# tcpdump -n -i eth7 "icmp[icmptype] & icmp-unreach != 0 and icmp[icmpcode] & 4 != 0"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
17:07:00.608444 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.288197 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.910215 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:01.927728 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.391218 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.693094 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:02.912513 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.019852 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
17:07:03.398335 IP (server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
These ICMP messages are not bad, per se; they are part of the Path MTU Discovery process. However, many firewalls indiscriminately block ICMP packets of all kinds. Most of the documentation I found while researching this problem was written from the end-user's perspective, i.e., users who had PPPoE or other types of encapsulated/tunneled connections and had trouble getting to certain websites. With the proliferation of personal firewall hardware and software, some of which is overzealously configured to block all ICMP (even "good" ICMP like PMTU discovery), this is something server admins have to worry about too, especially when running a load balancing solution which encapsulates packets.
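A quick way to check from the client side whether the PMTU discovery messages are getting through is to send a full-sized packet with the Don't Fragment bit set (a sketch using the Linux iputils ping; 1472 bytes of payload plus 28 bytes of ICMP and IP headers makes a 1500-byte packet, which will not fit down the 1480-byte tunnel path):

ping -M do -s 1472 (server)

If the "fragmentation needed" replies are making it back, ping reports the 1480-byte path MTU right away; if a firewall is eating them, the pings simply vanish.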

The research I did on the problem pointed me to the following iptables rule to be added on the router:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1400:1536 -j TCPMSS --clamp-mss-to-pmtu
This is intended to force the advertised Maximum Segment Size (MSS) to be 40 less than the smallest MTU that the router knows about. However, this didn't work for us (this tcpdump line looks for any TCP handshakes plus any ICMP unreachable errors):

(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:00:17.479661 IP (tos 0x0, ttl 53, id 47601, offset 0, flags [DF], length: 52)
(client).1199 > (server).80: S [tcp sum ok] 2541494183:2541494183(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:00:17.479861 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1199: S [tcp sum ok] 2875112671:2875112671(0) ack 2541494184 win 5840
<mss 1460,nop,nop,sackOK,nop,wscale 7>

18:00:17.771080 IP (tos 0xc0, ttl 63, id 10080, offset 0, flags [none], length: 576)
(server) > (client): icmp 556: (server) unreachable - need to frag (mtu 1480)
for IP (tos 0x0, ttl 52, id 47613, offset 0, flags [DF], length: 1500)
(client).1199 > (server).80: . 546:2006(1460) ack 1 win 64240
It was still negotiating a 1460-byte MSS during the handshake. In hindsight, this makes sense: the router doesn't really know that the MTU of the load balancer and the realservers is actually smaller than 1500 - it talks to those machines over their ethernet interfaces, which are all still set to a 1500-byte MTU. Digging some more into the problem (including the LVS-Tun HOWTO), I found quite a few things mentioned, but no real definitive answers.

I chose to fix this problem by hardcoding the MSS to 1440 at the router, rather than using the "clamp-mss-to-pmtu" setting:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -m tcpmss --mss 1440:1536 -j TCPMSS --set-mss 1440
1440 is the normal MSS value of 1460, minus the 20 byte overhead for the encapsulated packet. This seems to have fixed the problem entirely:
(router):~# tcpdump -vv -n -i eth7 "(host (client) ) and \
(tcp[tcpflags] & tcp-syn != 0 or icmp[icmptype] & icmp-unreach != 0)"
tcpdump: listening on eth7, link-type EN10MB (Ethernet), capture size 96 bytes
18:02:19.466678 IP (tos 0x0, ttl 53, id 55012, offset 0, flags [DF], length: 52)
(client).1298 > (server).80: S [tcp sum ok] 2863214365:2863214365(0) win 65535
<mss 1460,nop,wscale 2,nop,nop,sackOK>

18:02:19.466886 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], length: 52)
(server).80 > (client).1298: S [tcp sum ok] 2996826059:2996826059(0) ack 2863214366 win 5840
<mss 1440,nop,nop,sackOK,nop,wscale 7>

.... silence!
PS - The reason that I was seeing this very odd behavior - very fast at first, followed by an unusable site?
  • The client website had recently added a search history, which was stored in a browser cookie. Things would go great until enough data was in the cookie to push it up over 1440 bytes.
  • I had configured my home DSL router to discard ICMP some many years back and had forgotten about it - My firewall was throwing away the ICMP Fragmentation Needed packets, so my PC never "Got the memo" that it needed to send smaller packets!
This actually worked out for the better, though - this site had had reports of odd slowness in the recent past, and hopefully this was the root cause!

EDIT: Note that in the original post I had missed an important option: in the iptables config it is important to use the "-m tcpmss --mss 1440:1536" match. Without this, iptables will force the MSS of ALL traffic to 1440, including clients which request a size smaller than that, which obviously presents a problem for those clients.

Wednesday, January 24, 2007

Linux memory overcommit

Last week I learned something very interesting about the way Linux allocates and manages memory by default out of the box.

In a way, Linux allocates memory the way an airline sells plane tickets. An airline will sell more tickets than they have actual seats, in the hopes that some of the passengers don't show up. Memory in Linux is managed in a similar way, but actually to a much more serious degree.

Under the default memory management strategy, malloc() essentially always succeeds, with the kernel assuming you're not _really_ going to use all of the memory you just asked for. The malloc()s continue to succeed, but the kernel doesn't 'really' allocate the memory until you actually try to use it. This leads to severe pathology in low memory conditions, because the application has already allocated the memory and thinks it can use it free and clear; when the system is low on memory and the application touches memory it has already allocated, the access takes a very long time while the kernel hunts around for pages to hand out.

In an extremely low memory condition, the kernel will start firing off the "OOM Killer" routine. Processes are given 'OOM scores', and the process with the highest score win^H^H^Hloses. This leads to seemingly random processes on a machine being killed by the kernel. Keeping with the airline analogy, I found this entertaining post.

I found some interesting information about the Linux memory manager here in section 9.6. This section has three small C programs to test memory allocation. The second and third programs produced pretty similar results for me, so I'm omitting the third:

Here are the results of the test on an 8GB debian Linux box:

demo1: malloc memory and do not use it: Allocated 1.4TB, killed by OOM killer
demo2: malloc memory and use it right away: Allocated 7.8GB, killed by OOM killer


Here are the results on an 8GB Nexenta/Opensolaris machine:

demo1: malloc memory and do not use it: Allocated 6.6GB, malloc() fails
demo2: malloc memory and use it right away: Allocated 6.5GB, malloc() fails


Apparently, a big reason linux manages memory this way out of the box is to optimize memory usage for fork()'ed processes; fork() creates a full copy of the process address space, but with overcommitted memory, only pages which are actually written to need to be allocated by the kernel. This might work very well for a shell server, a desktop, or perhaps a server with a large memory footprint that forks whole processes rather than threads, but in our situation it is very undesirable.

We run a pretty java-heavy environment, with multiple large JVMs configured per host. The problem is that the heap sizes have been getting larger, and we were running in an overcommitted situation and did not realize it. The JVMs would all start up and malloc() their large heaps, and then at some later time once enough of the heaps were actually used, the OOM killer would kick in and more or less randomly off one of our JVMs.

I found that linux can be brought more in line with traditional/expected memory management by setting the following sysctls (apparently these are available only in 2.6 kernels):

vm.overcommit_memory (0=default, 1=malloc always succeeds(?!?), 2=strict overcommit)
vm.overcommit_ratio (50=default, I used 100)


The ratio appears to be the percentage of the system's total VM that can be allocated via malloc() before malloc() fails. This MIGHT be on a per-pid basis (need to research). This number can be greater than 100%, presumably to allow for some slop in the copy-on-write fork()s. When I set this to 100 on an 8GB system, I was able to malloc() about 7.5GB of stuff, which seemed about right since I had normal multi-user processes running and no swap configured. I don't know why you'd want to use a number much less than 100, unless it were a per-process limit, or you wanted to force some saved room for fscache.

The big benefit here is that malloc() can actually fail in a low memory condition. This means that the error can be caught and handled by the application. In my case, it means that JVMs fail at STARTUP time, with an obvious memory shortage related error in the logs, rather than having the process have the rug yanked out from under it hours or days later with no message in the application log, and no opportunity to clean up what it was doing.
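For reference, here is how these can be set at runtime and made to persist across reboots (a sketch using the values discussed above):

sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100
echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf
echo "vm.overcommit_ratio = 100" >> /etc/sysctl.conf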

Here are the demo programs with a linux machine set to strict overcommit/100 ratio:

demo1: malloc memory and do not use it: Allocated 7.3GB, malloc fails.
demo2: malloc memory and use it right away: Allocated 7.3GB, malloc fails.



Friday, October 20, 2006

Debian linux ethernet bonding

We're working on some fault tolerant deployments of debian linux systems using ethernet bonding. It turns out that bonding support is available in the stock debian kernel (sarge) and, luckily, in the custom kernel we'd built at a later time.

I dug up a few HOWTOs out there, but none of them really had all of the pieces in one place, specifically for the packages that Debian uses out of the box.

Two software pieces are necessary: the "bonding" driver and the "ifenslave" package. The bonding driver is in the default sarge kernel, and appears to be enabled by default during a kernel build. The "ifenslave" package provides the userspace program used to bind physical interfaces to the bonding driver. To install it, simply:
apt-get install ifenslave-2.6
It's important to get the latest version; the version that installs with a plain "apt-get install ifenslave" doesn't seem to work properly.

Next, there are two files which need to change to tell the kernel to load the bonding driver. I appended these lines to the following files:

/etc/modules:
bonding

/etc/modprobe.d/aliases:
alias bond0 bonding
options bonding mode=active-backup miimon=100 max_bonds=1
If more than one bonding interface is needed, add additional aliases in this file, and increase the "max_bonds" option as necessary. We plan on using bonding on a few machines that act as routers, so they will need to have multiple bonded interface sets.
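For example, a two-bond setup would look something like this (a sketch; I haven't actually rolled this configuration out yet):

alias bond0 bonding
alias bond1 bonding
options bonding mode=active-backup miimon=100 max_bonds=2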

Finally, the bond0 interface must be set up in the /etc/network/interfaces file. More than likely there is an existing entry for your primary network interface, i.e. eth0. I just changed the interface name from eth0 to bond0, and added the following line:
up    ifenslave bond0 eth0 eth1
For reference, the entire /etc/network/interfaces file looks like this - notice that there are no individual entries for eth0 and eth1.
# generated by FAI
auto lo bond0

iface lo inet loopback

iface bond0 inet static
    address 192.168.195.145
    netmask 255.255.255.0
    broadcast 192.168.195.255
    gateway 192.168.195.1
    up ifenslave bond0 eth0 eth1
    post-up /opt/tools/bin/init-ipmi
The linux Bonding HOWTO is very comprehensive and covers the different modes of operation of this driver, as well as installation instructions for different flavors of linux and some discussion of deployment scenarios. We'll be experimenting with various bonding modes this week to see what we can get away with; currently we're planning on running in the "active-backup" mode, which is a simple active/passive failover.
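The current state of the bond - which slave is active, and what the MII monitor thinks of each link - is exposed by the driver under /proc, which is handy for a quick sanity check (sketch):

cat /proc/net/bonding/bond0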

Here are some other resources about linux/debian ethernet bonding:
http://www.howtoforge.com/nic_bonding
http://glasnost.beeznest.org/articles/179
http://www.debianhelp.co.uk/bonding.htm

More about bonding later!

Wednesday, September 27, 2006

Apache hackery

The following is a patch against apache 2.0.54 (it probably applies cleanly to other versions; I've applied it to 2.0.55 as well). It's built for debian linux; some hacking may be necessary to get it to apply to a vanilla version of httpd, but I doubt it. Copy the attached patch to a file called, for example, 000_ProxyMultiSource.

Instructions for building on debian:
apt-get install debhelper apache2-threaded-dev
(also will need gcc, libtool, autoconf, etc, if not already installed)
cd /opt/apache/build
apt-get source apache2
cp 000_ProxyMultiSource debian/patches/.
debian/rules binary
Install the resulting .deb files: (We use the worker MPM, YMMV)
dpkg -i apache2_2.0.54-5_amd64.deb  apache2-common_2.0.54-5_amd64.deb  apache2-mpm-worker_2.0.54-5_amd64.deb
What it does:

This adds a new configuration directive to the apache config file. It is defined within the virtual host. The config item/syntax is:
ProxyMultiSource <ip> [IP] [IP]  [IP]
This causes the server, when acting as a proxy server, to randomly set its source address to one of the <n> IP addresses above for each new request. This can be used, for example, to have a machine with a few DSL/T1 lines connected to it split proxy traffic among all the links. It doesn't look all that random, especially at first, since all of the threads presumably start with the same random seed and so generate the same sequence of numbers. It hasn't been a big enough issue for me to fix, since it evens out over time.
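For illustration, a forward-proxy virtual host using the directive might look something like this (a sketch with made-up addresses and port; the listed IPs have to already be configured on the box):

<VirtualHost *:3128>
    ProxyRequests On
    ProxyMultiSource 192.0.2.10 192.0.2.11 192.0.2.12
</VirtualHost>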

Note that these IP addresses actually have to be live on your system or the bind will fail, probably with spectacular results. (I suspect it will lock up, since it repeatedly retries failures to bind() the local address -- this is to deal with "Address already in use" issues where the local and remote address/port pairs are identical across two transactions.) Also see my previous post "Put it where it doesn't belong" to make sure that this IP traffic makes it out the appropriate interface instead of everything riding out the default route. That is not what you want.

This also makes the variable "proxy-source" available to the logging system - for example:
LogFormat "%h %l %u %t \"%r\" %>s %b %T %{proxy-source}n" proxy
will include the IP address of the chosen proxy as the last value of the log entry. It will show as a "-" if it's not set -- if the request comes out of cache, or if it's a continuation of an HTTP/1.1 keepalive request, this may happen. (I may look into a way of preserving it for HTTP/1.1 requests in the future)

This code seems to be pretty stable; there have been a couple of times where it's started up and received a signal 8 (SIGWTF), but that's been rare. We see it gleefully take over 100 hits per second and push 100mbit+ of traffic for extended periods.

Note: This has not been tested under the following circumstances:
- Multiple virtual hosts sharing the same list of IPs
- Multiple virtual hosts with discrete lists of IPs
- A single ProxyMultiSource IP (probably useful in its own right, eh)
- Apache built with this patch _not_ using the ProxyMultiSource directive. (It may fail to bind an address at all)
- Virtual hosts configured but not using the ProxyMultiSource directive.
If this stuff doesn't work, btw, it should not bomb the whole server, only the proxy functionality.

Note: I haven't written anything substantial in C in like 10 years; please be nice.

Get the patch here:
000_ProxyMultiSource

Friday, September 22, 2006

Put it where it doesn't belong!

Our latest fight with linux has been to get a machine with multiple connections to the internet, via different circuits, to pass traffic correctly. The machine is a web proxy which binds its outgoing sockets to specific source addresses to round-robin the traffic across the two networks. This would be way useful for a company that wants to effectively "bond" a few low-bandwidth links (dsl, t1, etc) into a higher bandwidth office proxy.

The first problem is to make the traffic that is sourced from a specific set of IP addresses use a different routing scheme. In this example, all of this traffic is going out to the internet, so with a default configuration, it would use the default route regardless of what the source address was set to.

We have a router machine with a bunch of interfaces - 2 onboard interfaces and a Silicom PXG-6 (6-port gigabit card). Let's say we have two DSL lines with a small set of static IPs, 66.166.22.0/28 and 66.156.12.0/28. We had 66.166.22.0 delivered first, so it's set up as the default route to the internet over eth7, while eth6 is hooked up to the router for our internal LAN.

The routing table looks like this:

66.166.22.0 0.0.0.0 255.255.255.240 U 0 0 0 eth7
66.156.12.0 0.0.0.0 255.255.255.240 U 0 0 0 eth0
192.168.0.0 192.168.1.13 255.255.0.0 UG 0 0 0 eth6
10.0.0.0 192.168.1.13 255.0.0.0 UG 0 0 0 eth6
0.0.0.0 66.166.22.1 0.0.0.0 UG 0 0 0 eth7
The new link has been configured on eth0.

The first thing we'll need to configure this is the "iproute" tool; on debian, installing it is as easy as "apt-get install iproute".

The first thing to do with this tool is to create a parallel routing table. We will call the new table "pipe2". First, we need to edit /etc/iproute2/rt_tables and add a new line:

/etc/iproute2/rt_tables: (the new entry is the "200 pipe2" line)
# reserved values
#
255 local
254 main
253 default
200 pipe2
0 unspec
#
# local
#
#1 inr.ruhep
(The number 200 is arbitrary but must fall between local and unspec)

Now add routes to these tables. I ended up putting these commands as "post-up" rules in the debian networking scripts for this interface (/etc/network/interfaces)
ip route add 10.0.0.0/8 via 192.168.1.13 table pipe2
ip route add 192.168.0.0/16 via 192.168.1.13 table pipe2
ip route add 172.16.0.0/12 via 192.168.1.13 table pipe2
ip route add default via 66.156.12.1 table pipe2
This sets up what the routing table should look like for traffic sourced from the second set of public addresses. Note that the rules to send office LAN traffic internally have to be duplicated in this table.
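Once those are in, the contents of the new table can be double-checked with (sketch):

ip route show table pipe2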

Next, we must insert a policy route that tells the kernel when to apply this routing table to the traffic:
ip rule add from 66.156.12.0/28 table pipe2

This gets traffic that is sourced from an IP on 66.156.12.0/28 to use the correct default router. However, there are still a few more steps. By default, linux will answer arps for any IP addresses it owns over any interface. This means that, in the above example, eth7(66.166.22.0 net) could claim to be the owner of an IP on the 66.156.12.0 network.

This is solved with the arp_filter control in /proc/sys/net/ipv4/conf/(interface)/arp_filter. We enabled it on every interface with:
for i in /proc/sys/net/ipv4/conf/*/arp_filter; do echo 1 > $i; done

Here is a great discussion on what arp_filter does.
An excellent discussion of ARP as implemented on linux is here. (This is where I found the solution to this problem, under "ARP Flux".)

In retrospect, we might have wanted to do that part first, to prevent the arp caches of various equipment from getting the MAC of the wrong interface. If stuff upstream gets the wrong MAC in its table, you can reset the hardware (DSL modem) or ask your ISP to flush their arp cache. We also may or may not have had some luck with "ip neigh flush".
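For the local side, flushing the neighbor cache per interface looks something like this (a sketch; upstream gear still has to expire or be told to flush its own cache):

ip neigh flush dev eth0
ip neigh flush dev eth7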

Next, I will publish the patches to apache2 that allow for this proxy multisourcing.

Friday, September 15, 2006

ZFS on Nexenta

This week I've played around with Nexenta. This is a neat operating system: it is basically a "distribution" of OpenSolaris, using a debian-like package management system, built on the Ubuntu "Dapper Drake" release. Fortunately, it just happened to support all of the pieces of hardware in the test system, a Supermicro H8DAR-T with Broadcom ethernet and a Marvell88-based onboard SATA2 controller.

I installed this in support of some filesystem/raid benchmarking I have been working on. ZFS seems to benchmark similarly to XFS on linux using software raid, while also having a lot of additional neat features, like snapshots, filesystem-level compression, clones, etc. With filesystem compression enabled, ZFS beats the pants off of these other filesystems in synthetic benchmarks, but I suspect that the files tiobench creates compress very well and so require very little actual IO to the disks. The other side of that coin is that if you are working with compressible files (text, html, possibly even databases), ZFS+compression might work very well for real-world performance. Granted, the CPU utilization involved in managing the filesystem will be higher, but I've realized that I'd rather spend the CPU running the disks as fast as possible than have the systems sitting idle waiting for data to come off of the spindles. If I get motivated I may try to create a mysql database with some of our application data and do some read/write benchmarks.
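Enabling compression is a per-filesystem one-liner (a sketch against the u01/asdf dataset used below; only data written after the property is set gets compressed):

zfs set compression=on u01/asdf
zfs get compression,compressratio u01/asdf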

One interesting ZFS-ism I ran into was the size reporting of ZFS volumes. I created a raidz (ZFS version of Raid5) with 4 "400G" disks (Which are actually about 370 gigs), which showed up in "df" as ~1.46TB -- which is the size I'd expect if all 4 of them were actual data drives. Software raid5 on linux and the hardware raid card both show ~1.1TB of space. It turns out that you don't magically gain space with ZFS, rather, the "parity tax" of raid5 is tacked on for each file as it is added to the filesystem. To demonstrate:

root@test:/u01/asdf# mkfile 2g testfile

root@test:/u01/asdf# ls -lh
total 2.7G
-rw------T 1 root root 2.0G Sep 16 00:58 testfile

root@test:/u01/asdf# df -kh /u01/asdf
Filesystem Size Used Avail Use% Mounted on
u01/asdf 1.3T 2.7G 1.3T 1% /u01/asdf

Hoping to complete the benchmarks I've been doing and post them here soon.

Sunday, September 10, 2006

Linux BIOS research

We did a lot of work in the BIOS of our machines last week. Doing BIOS edits by hand sucks; there's got to be a better way than connecting a keyboard and monitor to 100+ machines, rebooting them, and changing settings by hand.

Ben dug up some information in a Sun v20z configuration guide, which seems to describe a BIOS similar to the one on the machines we have. We had ECC memory enabled in the BIOS before, but we seemed to have a lot of machines dying and hanging. We changed the following additional settings:
ECC Logging: [Enabled]
ECC Chipkill: [Enabled]
ECC Scrubbing: [163 us]
L2 Scrubbing: [10.2 us]
L1 Scrubbing: [5.12 us]

Also, we upgraded to a later Linux kernel, version 2.6.17.8, and added the EDAC module. The k8/opteron code was not in the mainline kernel but is buildable as a module.

It's hard to declare a success based on the absence of machines hanging, but we've seen the EDAC catch some errors and haven't had a memory related machine hang yet.
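The corrected and uncorrected error counts that EDAC has logged are exposed under sysfs, which makes them easy to collect across a lot of machines (a sketch; the exact sysfs layout can differ between kernel versions):

grep . /sys/devices/system/edac/mc/mc*/ce_count
grep . /sys/devices/system/edac/mc/mc*/ue_count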

Edit: A couple weeks later and the failure rate is way down.