Ops Monkey: 09/01/2006

Wednesday, September 27, 2006

Apache hackery

The following is a patch against Apache 2.0.54. It probably applies cleanly to other versions; I've also applied it to 2.0.55. It's built against the Debian package source, so some hacking may be necessary to get it to apply to a vanilla httpd tree, but I doubt it. Copy the attached patch to a file called, for example, 000_ProxyMultiSource.

Instructions for building on debian:

apt-get install debhelper apache2-threaded-dev
(also will need gcc, libtool, autoconf, etc, if not already installed)
cd /opt/apache/build
apt-get source apache2
cd apache2-2.0.54   (or whatever directory apt-get source just unpacked)
cp /path/to/000_ProxyMultiSource debian/patches/
debian/rules binary

Install the resulting .deb files: (We use the worker MPM, YMMV)

dpkg -i apache2_2.0.54-5_amd64.deb  apache2-common_2.0.54-5_amd64.deb  apache2-mpm-worker_2.0.54-5_amd64.deb

What it does:

This adds a new configuration directive to the apache config file. It is defined within the virtual host. The config item/syntax is:

ProxyMultiSource <IP> [IP] [IP] [IP]

This causes the server, when acting as a proxy server, to randomly set its source address to one of the listed IP addresses for each new request. This can be used, for example, to let a machine with a few DSL/T1 lines connected to it split proxy traffic among all the links. The selection doesn't look all that random, especially at first, since all of the threads presumably start with the same random seed and so end up generating the same sequence of numbers. It hasn't been a big enough issue for me to fix, since it evens out over time.
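A rough sketch of the selection logic described above, written in shell rather than the patch's C. `pick_source` is a hypothetical helper (it is not part of the patch), and the addresses are made-up hosts in the example subnets. It picks one address from a list using a seeded random number, which also shows why workers that share a seed make identical choices.

```shell
# pick_source SEED IP [IP...]
# Hypothetical helper: prints one of the given addresses, chosen by a
# random number generator seeded with SEED. Not the patch's actual code.
pick_source() {
    seed=$1
    shift
    idx=$(awk -v seed="$seed" -v n="$#" \
        'BEGIN { srand(seed); print int(rand() * n) + 1 }')
    shift "$((idx - 1))"
    printf '%s\n' "$1"
}

# Two workers sharing a seed make the same choice:
pick_source 1 66.166.22.2 66.156.12.2
pick_source 1 66.166.22.2 66.156.12.2

# Mixing something per-worker into the seed (here the shell's PID)
# lets different workers diverge:
pick_source "$$" 66.166.22.2 66.156.12.2
```

For a quick manual test of a link, the chosen address could be handed to curl as `curl --interface ADDRESS URL`, which forces the source address of a single request.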

Note that these IP addresses actually have to be live on your system or the bind will fail, probably with spectacular results. (I suspect it will lock up, since the code repeatedly retries failed attempts to bind() the local address -- this is to deal with "Address already in use" errors where the local and remote address/port pairs are identical across two transactions.) Also see my previous post, "Put it where it doesn't belong", to make sure that this IP traffic makes it out the appropriate interface instead of everything riding out the default route, which is not what you want.

This also makes the variable "proxy-source" available to the logging system - for example:

LogFormat "%h %l %u %t \"%r\" %>s %b %T %{proxy-source}n" proxy

will include the chosen source IP address as the last value of the log entry. It will show as a "-" if it's not set; this may happen if the request is served out of cache, or if it's a continuation of an HTTP/1.1 keepalive connection. (I may look into a way of preserving it for HTTP/1.1 requests in the future.)
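Since the chosen address is the last field of each log line, a quick tally shows how evenly the load is being split. A minimal sketch, assuming the "proxy" format above is being written to a log file; `source_counts` is a hypothetical helper name and the log path is a placeholder.

```shell
# source_counts: read access-log lines on stdin and print
# "<count> <proxy-source>" for each distinct last field.
# Requests with no chosen source are counted under "-".
source_counts() {
    awk '{ n[$NF]++ } END { for (ip in n) print n[ip], ip }' | sort -k2
}

# Example (path is a placeholder):
# source_counts < /var/log/apache2/proxy_access.log
```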

This code seems to be pretty stable; there have been a couple of times where it started up and got a signal 8 (SIGFPE, or as I call it, SIGWTF), but that's been rare. We see it gleefully take over 100 hits per second and push 100+ Mbit/s of traffic for extended periods.

Note: This has not been tested under the following circumstances:
- Multiple virtual hosts sharing the same list of IPs
- Multiple virtual hosts with discrete lists of IPs
- A single ProxyMultiSource IP (probably useful in its own right, eh)
- Apache built with this patch but _not_ using the ProxyMultiSource directive anywhere. (It may fail to bind an address at all)
- Virtual hosts configured without the ProxyMultiSource directive.
If this stuff doesn't work, btw, it should not bomb the whole server, only the proxy functionality.

Note. I haven't written anything substantial in C in like 10 years, please be nice.

Get the patch here:
000_ProxyMultiSource

Friday, September 22, 2006

Put it where it doesn't belong!

Our latest fight with Linux has been to get a machine with multiple connections to the internet via different circuits to pass traffic correctly. The machine is a web proxy which binds each socket to a specific source address to round-robin the traffic across the two networks. This would be way useful for a company that wants to effectively "bind" a few low-bandwidth links (DSL, T1, etc.) into a higher-bandwidth office proxy.

The first problem is to make the traffic that is sourced from a specific set of IP addresses use a different routing scheme. In this example, all of this traffic is going out to the internet, so with a default configuration, it would use the default route regardless of what the source address was set to.

We have a router machine with a bunch of interfaces - 2 onboard interfaces and a Silicom PXG-6 (6-port gigabit card). Let's say we have two DSL lines, each with a small set of static IPs: 66.166.22.0/28 and 66.156.12.0/28. We had 66.166.22.0/28 delivered first, so it's set up as the default route to the internet over eth7, while eth6 is hooked up to the router for our internal LAN.

The routing table looks like this:


Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
66.166.22.0     0.0.0.0         255.255.255.240 U         0 0          0 eth7
66.156.12.0     0.0.0.0         255.255.255.240 U         0 0          0 eth0
192.168.0.0     192.168.1.13    255.255.0.0     UG        0 0          0 eth6
10.0.0.0        192.168.1.13    255.0.0.0       UG        0 0          0 eth6
0.0.0.0         66.166.22.1     0.0.0.0         UG        0 0          0 eth7

The new link has been configured on eth0.

The first thing we'll need is the "iproute" package, which provides the "ip" tool; on Debian it's as easy as "apt-get install iproute".

The first thing to do with this tool is to create a parallel routing table, which we will call "pipe2". To do that, edit /etc/iproute2/rt_tables and add a new line:

/etc/iproute2/rt_tables: (the new entry is the "200 pipe2" line)

# reserved values
#
255     local
254     main
253     default
200     pipe2   
0       unspec
#
# local
#
#1      inr.ruhep

(The number 200 is arbitrary, but it must not already be in use and must fall between unspec (0) and the reserved values at 253 and above.)

Now add routes to the new table. I ended up putting these commands in as "post-up" rules for this interface in the Debian networking configuration (/etc/network/interfaces):

ip route add 10.0.0.0/8 via 192.168.1.13 table pipe2
ip route add 192.168.0.0/16 via 192.168.1.13 table pipe2
ip route add 172.16.0.0/12 via 192.168.1.13 table pipe2
ip route add default via 66.156.12.1 table pipe2

This sets up what the routing table should look like for traffic sourced from the second set of public addresses. Note that the rules to send office LAN traffic internally have to be duplicated in this table.
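Because every extra uplink needs the same internal routes plus its own default, the commands can be generated rather than typed. A minimal sketch: `emit_table_routes` is a hypothetical helper that only prints the commands, so they can be reviewed before being run as root or pasted in as post-up lines.

```shell
# emit_table_routes TABLE LAN_GATEWAY DEFAULT_GATEWAY NET [NET...]
# Prints (does not run) the "ip route" commands for one extra table.
emit_table_routes() {
    table=$1
    lan_gw=$2
    default_gw=$3
    shift 3
    for net in "$@"; do
        printf 'ip route add %s via %s table %s\n' "$net" "$lan_gw" "$table"
    done
    printf 'ip route add default via %s table %s\n' "$default_gw" "$table"
}

# Reproduces the four commands listed above:
emit_table_routes pipe2 192.168.1.13 66.156.12.1 \
    10.0.0.0/8 192.168.0.0/16 172.16.0.0/12
```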

Next, we must insert a policy route that tells the kernel when to apply this routing table to the traffic:

ip rule add from 66.156.12.0/28 table pipe2

This gets traffic that is sourced from an IP on 66.156.12.0/28 to use the correct default router. However, there are still a few more steps. By default, Linux will answer ARP requests for any IP address it owns on any interface. This means that, in the above example, eth7 (on the 66.166.22.0 net) could claim to be the owner of an IP on the 66.156.12.0 network.

This is solved with the arp_filter control in /proc/sys/net/ipv4/conf/(interface)/arp_filter. We turned it on for every interface with:

for i in /proc/sys/net/ipv4/conf/*/arp_filter; do echo 1 > "$i"; done

Here is a great discussion on what arp_filter does.
An excellent discussion of ARP as implemented on linux is Here. (This is where I found the solution to this problem, under "Arp Flux")
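Values written under /proc do not survive a reboot unless they are persisted somewhere (for example through sysctl), so it is worth checking that the setting is still in place on every interface. Something like the following could be used; `arp_filter_unset` is a hypothetical helper, and its optional argument exists only so it can be pointed at a test directory.

```shell
# arp_filter_unset [BASE]
# Prints the name of every interface whose arp_filter is not 1.
# BASE defaults to the real /proc location.
arp_filter_unset() {
    base=${1:-/proc/sys/net/ipv4/conf}
    for f in "$base"/*/arp_filter; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" != "1" ]; then
            basename "$(dirname "$f")"
        fi
    done
}

# No output means every interface has the filter enabled.
arp_filter_unset
```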

In retrospect we might have wanted to do that part first, to prevent the ARP caches of various equipment from getting the MAC of the wrong interface. If stuff upstream gets the wrong MAC in its table, you can reset the hardware (DSL modem) or ask your ISP to flush their ARP cache. We also may or may not have had some luck with "ip neigh flush".

Next, I will publish the patches to apache2 that allow for this proxy multisourcing.

Thursday, September 21, 2006

Home fileserver on Solaris Express/ZFS

(Mirroring this for posterity in case Svens ever forgets to pay his yahoo hosting bill :-) )

I. Hardware:
Basically I bought everything at Fry's. Online, a lot of this stuff might be even cheaper, but probably not by a lot. With the exception of the RAID controllers, all of this stuff was available at every store; the RAID controllers I finally found at the Brokaw Fry's. The only thing I re-used was an old 20x IDE CD-ROM. In retrospect, since I ended up re-installing three times, it would have been worth $30 for a fast CD-ROM to get those 2 hours of my life back.

Case: Aluminus Ultra: $130
This is kinda tacky because of the side window and high gloss and everything, but it did have the advantage of a neat internal 3.5-inch hard drive mounting rail system. It also has five 5.25-inch bays, which is important for the other disks plus the CD-ROM. Since it uses mostly 120mm fans it's relatively quiet -- still a lot noisier than my Shuttle, but quieter than my old tower.

SATA Cage: "random fry's brand": $120
This thingie lets you mount 4 SATA drives in a neat hotswap cage that takes up three 5.25-inch bays.

2 x SATA Controller: "SIIG 4-port RAID": $80/ea
These are Silicon Image 3114 chipset based RAID cards. They come with all of the power and data cables you'll need.

8 x SATA disks: "Maxtor 250G Maxline Plus II" $100/ea
These are plain SATA-I disks but the controller only does SATA-I anyway.

Motherboard: "Asus K8N socket 754" $47 - open stock
Piece of crap open box motherboard. I got it because it has AGP video, 3-4 PCI slots, and builtin gigabit ethernet.

CPU: "AMD Sempron 2800+ 64bit" $41
Retail box -- slower CPU than what I expected, but it's good enough to run the RAID calcs, samba, httpd, etc, and it's 64 bit :-)

RAM: "Patriot 1G value Ram" $90
With $15 rebate that I'll probably forget about (edit 02/10/2007 -- I forgot)

Video: "No-Name GeForce MX400" $50
I bought it because it was cheap and did not have a fan; there's enough crap spinning in there already.

I did not buy a floppy drive
If you do this, buy a floppy drive, or at least make sure you have one handy you can use temporarily.

The total damage was around $1500 w/ tax and everything. The total space after RAID is 1.8T, less than a dollar a gig.

II. Trials and Tribulations:
Everything snapped into this case pretty well. I spent extra time making sure that the SATA and power cables were tied down nicely, to promote good airflow.

I started by installing the latest build of OpenSolaris, Nevada Build 31, because I wanted to play with ZFS and I wanted the most mature SATA/network driver support available. It's 4 CDs, not counting the language or "add ons" (/opt/sfw) pack. The main problem I had was that Solaris 10 does not recognize these RAID cards out of the box. Solaris still does not recognize any SATA hardware that is acting as a RAID card. The way to "fix" these cards is to remove their RAID functionality by loading a straight IDE BIOS. Download the IDE BIOS here. Also download the "BIOS Updater Utilities".

You'll need a DOS boot disk to copy these files to. If you do not have a DOS boot disk, there are instructions in this .ZIP file that tell you where to download a FreeDOS boot disk image where you can then copy the BIOS updater and the IDE BIOS .bin file. Getting a floppy drive hooked up is the hard part. Once you're booted to DOS, the command I used was

A:\> UPDFLASH B5304.BIN -v

The command will carp about various stuff and then go about its business updating your flash BIOS. For this specific RAID card, I think I had to tell it that the flash memory was compatible with STT 39??010 1M flash. The command updated both of the cards in my system at the same time; it did not require me to run it twice or use special command line flags.

Thus updated, you can now reboot your system. You may notice that during POST, the cards are now called "Silicon Image 3114 SATALink" instead of "3114 SATARaid" and have no option to press <ctl-s> to enter their BIOS. You can now install Solaris from CD as normal.

I do not care for installing Solaris off of CD. It gets to the "Using RPC for sysid configs" (or something) step of the boot, and then just hangs there for 5+ minutes. There's really no way to tell if your machine is horked, or your CD drive froze up, or what, it's just sitting there not doing anything.

The installer could now see all 8 of my disks. I chose to put a small ~4G partition on c1d0s0 and 512M for swap on c1d0s1. I left slice 6 empty and put a 1-cylinder slice at s7 for the metadb. Once the operating system was installed, I mirrored it onto the first disk of the second controller (disk 4) by giving that disk an identical partitioning scheme to disk 0 and using the standard Solaris meta-commands:

#  metadb -fa c1d0s7 c3d0s7
#  metainit -f d1 1 1 c1d0s0
#  metainit -f d2 1 1 c3d0s0
#  metainit -f d6 1 1 c1d0s1
#  metainit -f d7 1 1 c3d0s1
#  metainit d0 -m d1
#  metainit d5 -m d6
#  metaroot d0
#  (edit vfstab to use d5 for swap)
#  lockfs -fa
#  reboot

Once the system came back up, I attached the metadevices:

#  metattach d0 d2
#  metattach d5 d7

The sync went fast since these are small slices. At this point I did some additional configuration. You'll want to use the "svcadm" command to turn off things like autofs, telnet, ftp, etc:

#  svcadm disable telnet
#  svcadm disable autofs
#  svcadm disable finger

(etc. I did not document exactly what I disabled, but autofs has to be turned off if you want home directories to work :-)) Also, if you have this specific motherboard, you'll need to add the following to your /etc/driver_aliases for the system to find your network card: nge "pci10de,df" (This tells the system to bind the Nvidia Gigabit Ethernet driver to the PCI device with vendor ID 10de and product ID 00df.) Do the usual editing of /etc/hostname.nge0, /etc/inet/ipnodes, /etc/hosts, /etc/netmasks and /etc/defaultrouter to get your network up and running.

If you had a network card that Solaris just saw out of the box, you probably set this up already during the install. If not, you might have to add something different to your /etc/driver_aliases - there is pretty good google juice on the various ethernet cards out there. I rebooted at this point for the changes to take effect.

Now for the fun part, ZFS. For my non-root disks, I created another partition using almost all of each disk (I skipped the first three cylinders since it looked like there was some boot information or something on there). ZFS was a lot easier than I thought it would be. Do a man on zfs and zpool. Creating the pool was easy:

zpool create -f u01 raidz c1d0s6 c1d1s6 c2d0s6 c2d1s6 c3d0s6 c3d1s6 c4d0s6 c4d1s6

The -f is to force it, because c1d0s6 and c3d0s6 are smaller than all of the other partitions. This command returned in about 4 seconds. It literally takes longer to type out the command than it does for ZFS to create you a 1.78TB filesystem. The filesystem will auto-magically be mounted on /u01 (because the pool name given on the command line above is u01; it could be any arbitrary name). From here, you can use the "zfs" command to create new filesystems that share space from this pool:


zfs create u01/home 
zfs create u01/home/amiller 
zfs set mountpoint=/home/amiller u01/home/amiller

None of this has to go into /etc/vfstab, the system just knows about it and mounts it at boot with "zfs mount -a".
So now on my box, I have:


amiller$ df -kh |grep u01
u01                   1.8T   121K   1.7T     1%    /u01 
u01/home              1.8T   119K   1.7T     1%    /u01/home 
u01/home/amiller      1.8T   3.7M   1.7T     1%    /home/amiller

Friday, September 15, 2006

ZFS on Nexenta

This week I've played around with Nexenta. This is a neat operating system. It is basically a "distribution" of OpenSolaris, using a Debian-like package management system, built on the Ubuntu "Dapper Drake" release. Fortunately, it just happened to support all of the pieces of hardware in the test system, a Supermicro H8DAR-T with Broadcom ethernet and a Marvell88-based onboard SATA2 controller.

I installed this in support of some filesystem/RAID benchmarking I have been working on. ZFS seems to benchmark similarly to XFS on Linux using software RAID, while also having a lot of other neat features, like snapshots, filesystem-level compression, clones, etc. With filesystem compression enabled, ZFS beats the pants off of these other filesystems in synthetic benchmarks, but I suspect that the files that tiobench creates compress very well and so require very little actual IO to the disks. The other side of that coin is that if you are working with compressible files (text, HTML, possibly even databases), ZFS+compression might work very well for real-world performance. Granted, the CPU utilization involved in managing the filesystem will be higher, but I've realized that I'd rather spend the CPU running the disks as fast as possible than have the systems sitting idle waiting for data to come off of the spindles. If I get motivated I may try to create a MySQL database with some of our application data and do some read/write benchmarks.

One interesting ZFS-ism I ran into was the size reporting of ZFS volumes. I created a raidz (ZFS version of Raid5) with 4 "400G" disks (Which are actually about 370 gigs), which showed up in "df" as ~1.46TB -- which is the size I'd expect if all 4 of them were actual data drives. Software raid5 on linux and the hardware raid card both show ~1.1TB of space. It turns out that you don't magically gain space with ZFS, rather, the "parity tax" of raid5 is tacked on for each file as it is added to the filesystem. To demonstrate:

root@test:/u01/asdf# mkfile 2g testfile

root@test:/u01/asdf# ls -lh
total 2.7G
-rw------T 1 root root 2.0G Sep 16 00:58 testfile

root@test:/u01/asdf# df -kh /u01/asdf
Filesystem Size Used Avail Use% Mounted on
u01/asdf 1.3T 2.7G 1.3T 1% /u01/asdf
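The numbers above line up with a simple model in which a single-parity raidz of n disks stores n/(n-1) bytes for every byte of file data. A back-of-the-envelope sketch under that assumption (it ignores metadata and padding, and the function names are made up for illustration):

```shell
# Sizes are in whatever unit you pass in (GB here).
# raidz_raw N SIZE      : total space across all N disks
# raidz_usable N SIZE   : space left after one disk's worth of parity
# raidz_on_disk N FILE  : space charged for FILE once parity is added
raidz_raw()     { awk -v n="$1" -v s="$2" 'BEGIN { printf "%d\n", n * s }'; }
raidz_usable()  { awk -v n="$1" -v s="$2" 'BEGIN { printf "%d\n", (n - 1) * s }'; }
raidz_on_disk() { awk -v n="$1" -v f="$2" 'BEGIN { printf "%.1f\n", f * n / (n - 1) }'; }

raidz_raw 4 370       # prints 1480, roughly the size df showed for the empty pool
raidz_usable 4 370    # prints 1110, roughly what the raid5 tools show up front
raidz_on_disk 4 2     # prints 2.7, the space used by the 2G test file
```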

Hoping to complete the benchmarks I've been doing and post them here soon.

Sunday, September 10, 2006

Linux BIOS research

We did a lot of work in the BIOS of our machines last week. Doing BIOS edits by hand sucks, there's got to be a better way than connecting the keyboard and monitor to 100+ machines, rebooting them, and changing settings by hand.

Ben dug up some information in a Sun v20z configuration guide; that machine seems to share a similar BIOS with the machines we have. We had ECC memory enabled in the BIOS before, but we still seemed to have a lot of machines dying and hanging. We changed the following additional settings:
ECC Logging: [Enabled]
ECC Chipkill: [Enabled]
ECC Scrubbing: [163 us]
L2 Scrubbing: [10.2 us]
L1 Scrubbing: [5.12 us]

We also upgraded to a later Linux kernel, version 2.6.17.8, and added the EDAC module. The k8/Opteron EDAC code was not in the mainline kernel, but it is buildable as a module.

It's hard to declare a success based on the absence of machines hanging, but we've seen the EDAC catch some errors and haven't had a memory related machine hang yet.
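One way to keep an eye on what EDAC is catching is to total the per-controller counters it exposes. A minimal sketch, assuming the counters live in ce_count and ue_count files under /sys/devices/system/edac/mc/mc*/ (check where your kernel and module actually put them); `edac_totals` is a hypothetical helper name.

```shell
# edac_totals [BASE]
# Sums corrected (ce_count) and uncorrected (ue_count) error counters
# across all memory controllers found under BASE.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$base"/mc*; do
        if [ -r "$mc/ce_count" ]; then
            ce=$((ce + $(cat "$mc/ce_count")))
        fi
        if [ -r "$mc/ue_count" ]; then
            ue=$((ue + $(cat "$mc/ue_count")))
        fi
    done
    printf 'corrected=%d uncorrected=%d\n' "$ce" "$ue"
}

edac_totals
```

Run from cron across the fleet, a rising corrected count on one box is a hint to swap its memory before it hangs.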

Edit: A couple weeks later and the failure rate is way down.
