Monday, December 18, 2006

Modifying a Nexenta Boot ISO

Today I needed to modify a Nexenta Alpha-6 install ISO. There were several things I needed to fix.

1. We have two revisions of the same Supermicro H8DAR-T based beige-box systems. Solaris10/Nexenta works great on the 2.01 based motherboards. The older version - version 1.01 - does not have a SATA chipset that is supported by Nexenta out of the box. (The PCI ID is "pci11ab,6041.3" versus "pci11ab,6041.9" on the 2.01 version). I needed to add this PCI ID to the "/etc/driver_aliases" file. Aside: As far as I can tell, the only way to differentiate is to crack open the box and look for the silkscreened version on the back lefthand corner. :-( (Or, you could boot Nexenta/Solaris10 and see if it can see the disks :-) )

2. There is a bug in the script that will hang the system while scanning for partitions after manual partitioning. I found a response to this post by LukeD in the Nexenta forums that is reported to fix this problem.

3. We wanted to add a couple of packages to the "minimal" set.

If we start a widespread Nexenta rollout there will be much more automation going into these CD images (Or even better, network images) to make for a hands-off install.

Here's how I made the changes:

1. Copied the .ISO image to another Nexenta system. (There is nothing Nexenta/Solaris-specific about making the image; however, the commands differ somewhat between operating systems. On a linux system, you would use losetup instead of lofiadm.)

2. Create a loopback device for the .iso and mount it:
lofiadm -a /opt/temp/elatte_installcd_alpha6_i386.iso
( this creates /dev/lofi/1 )
mkdir /mnt
mount -F hsfs /dev/lofi/1 /mnt
3. Copy all of the files to a temporary location. This is necessary because the .ISO image is read only.
mkdir /opt/temp/cd
cp -av /mnt/. /opt/temp/cd/.
4. Find and mount the miniroot image as a second loopback device. The miniroot is a gzipped UFS filesystem stored in the "boot" directory of the CD. Most of the filesystem is here, although it appears that Nexenta remounts /usr from the CD-ROM later in the boot process. This step is required because I am changing the "driver_aliases" file, which lives inside the gzipped miniroot -- if you plan only to modify the script and/or add packages, these steps (4-6) are not necessary.
cd /opt/temp/cd/boot/
mv miniroot miniroot.gz
gunzip miniroot.gz
lofiadm -a /opt/temp/cd/boot/miniroot
( this creates /dev/lofi/2 )
mkdir /mnt2
mount /dev/lofi/2 /mnt2
5. Edit the driver_aliases file:
cd /mnt2/etc
(vi driver_aliases)
... If there are any other files that need to change within the miniroot, edit them now.

6. Unmount and re-gzip the miniroot, and clean up the other lofi mount too.
umount /mnt2
lofiadm -d /dev/lofi/2
umount /mnt
lofiadm -d /dev/lofi/1
cd /opt/temp/cd/boot
gzip miniroot
mv miniroot.gz miniroot
7. Fix the script mentioned in problem #2 and add some packages to the minimal set.
cd /opt/temp/cd/root/usr/gnusolaris
(I chose to do a grep -i "Unknown_fstype" rather than run the sed script referenced in the forum post above)
vi base-minimal.lst
(add some packages here)
8. Create the new install CD. I found some good instructions on BigAdmin at the end of this article in the section "Using CD/DVD ISO Files":
mkisofs -o /opt/temp/nexenta_boot.iso -b boot/grub/stage2_eltorito \
-c .catalog -no-emul-boot -boot-load-size 4 \
-boot-info-table -relaxed-filenames -ldots -N -l -R \
-d -D -V Elatte_InstallCD /opt/temp/cd/
This will create the new .ISO file. From there you'll need to burn it to a disc.
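The driver_aliases edit from step 5 can also be scripted rather than done by hand in vi. The following is only a sketch: clone_driver_alias is a made-up helper name, and it assumes the file uses the usual one-entry-per-line layout of a driver name followed by a quoted PCI ID. Rather than guessing the driver name, it looks up whichever driver is already bound to the working ID and binds the new ID to the same driver.

```shell
# clone_driver_alias FILE KNOWN_ID NEW_ID
# Look up the driver already bound to KNOWN_ID and bind NEW_ID to it too.
# (clone_driver_alias is a hypothetical helper, not a Solaris tool.)
clone_driver_alias() {
    file=$1
    known=$2
    new=$3
    # driver_aliases lines look like: driver "pciVVVV,DDDD.R"
    driver=$(awk -v id="$known" '$2 == "\"" id "\"" { print $1; exit }' "$file")
    if [ -z "$driver" ]; then
        echo "no driver bound to $known in $file" >&2
        return 1
    fi
    line="$driver \"$new\""
    # Append only if the exact line is not already there (safe to re-run).
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example, run against the miniroot mounted in step 4:
# clone_driver_alias /mnt2/etc/driver_aliases pci11ab,6041.9 pci11ab,6041.3
```

Because the append is skipped when the line already exists, the helper can be re-run on the same miniroot without piling up duplicate entries.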

Unfortunately, adding the new driver_alias line didn't work for me. It sees the controller now, but it pukes about not knowing how to talk to it. (Forgot the exact error message but it looked ominous). I will have to do some testing on a known-working box (one of the later revision ones) to see if the other changes I made to this CD were successful.

Edit: The mkisofs command I originally listed was incorrect. I have found the magic mkisofs command parameters and updated this post accordingly. Note that the volume name MUST be "Elatte_InstallCD".

Friday, October 20, 2006

Debian linux ethernet bonding

We're working on some fault tolerant deployments of debian linux systems, which calls for ethernet bonding. It turns out that bonding is available in the stock debian kernel (sarge) and luckily, in the custom kernel we'd built at a later time.

I dug up a few howto's out there, but none of them really had all of the pieces in one place, specifically for the packages that Debian uses out of the box.

Two software pieces are necessary - the "bonding" driver, and the "ifenslave" package. The bonding driver is in the default sarge kernel, and appears to be in the defaults during a kernel build. The "ifenslave" package is the userspace program used to control the binding of physical interfaces to the bonded driver. To install this, simply
apt-get install ifenslave-2.6
It's important to get the latest version; the version that installs with a plain "apt-get install ifenslave" doesn't seem to work properly.

Next, there are two files which need to change to tell the kernel to load the bonding driver. I appended these lines to the following files:


alias bond0 bonding
options bonding mode=active-backup miimon=100 max_bonds=1
If more than one bonding interface is needed, add additional aliases in this file, and increase the "max_bonds" option as necessary. We plan on using bonding on a few machines that act as routers, so they will need to have multiple bonded interface sets.

Finally, the bond0 interface must be set up in the /etc/network/interfaces file. More than likely there is an existing entry for your primary network interface, i.e. eth0. I just changed the interface name from eth0 to bond0, and added the following line:
up    ifenslave bond0 eth0 eth1
For reference, the entire /etc/network/interfaces file looks like this - notice that there are no individual entries for eth0 and eth1.
# generated by FAI
auto lo bond0
iface lo inet loopback
iface bond0 inet static
up ifenslave bond0 eth0 eth1
post-up /opt/tools/bin/init-ipmi
The linux Bonding howto is very comprehensive and covers the different modes of operation of this driver, as well as installation instructions for different flavors of linux and some discussion of deployment scenarios. We'll be experimenting with various bonding modes this week to see what we can get away with; currently we're planning on running in the "active-backup" mode, which is a simple active/passive failover.
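Once bond0 is up, the bonding driver reports its state in /proc/net/bonding/bond0. A small sketch for checking which interface is carrying traffic in active-backup mode; bond_status is a hypothetical helper name, and the optional file argument is there only so it can be tried against a saved copy of the status file:

```shell
# bond_status [FILE]
# Print the bonding mode and the currently active slave. FILE defaults to
# the status file the bonding driver exposes for bond0.
bond_status() {
    status_file=${1:-/proc/net/bonding/bond0}
    sed -n \
        -e 's/^Bonding Mode: /mode: /p' \
        -e 's/^Currently Active Slave: /active: /p' \
        "$status_file"
}

# Example:
# bond_status
```

Pulling the cable on the active interface and running it again should show the other slave taking over, which is a quick way to test failover.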

Here are some other resources about linux/debian ethernet bonding:

More about bonding later!

Wednesday, September 27, 2006

Apache hackery

The following is a patch against apache 2.0.54 (it probably applies cleanly to other versions; I've applied it to 2.0.55 also). It's built for debian linux; some hacking may be necessary to get it to apply to a vanilla version of httpd, but I doubt it. Copy the attached patch to a file called, for example, 000_ProxyMultiSource.

Instructions for building on debian:
apt-get install debhelper apache2-threaded-dev
(also will need gcc, libtool, autoconf, etc, if not already installed)
cd /opt/apache/build
apt-get source apache2
(this unpacks the source into a versioned directory, e.g. apache2-2.0.54)
cd apache2-2.0.54
cp ../000_ProxyMultiSource debian/patches/.
debian/rules binary
Install the resulting .deb files: (We use the worker MPM, YMMV)
dpkg -i apache2_2.0.54-5_amd64.deb  apache2-common_2.0.54-5_amd64.deb  apache2-mpm-worker_2.0.54-5_amd64.deb
What it does:

This adds a new configuration directive to the apache config file. It is defined within the virtual host. The config item/syntax is:
ProxyMultiSource <ip> [IP] [IP]  [IP]
This causes the server, when acting as a proxy server, to randomly set its source address to one of the <n> IP addresses above for each new request. This can be used, for example, to have a machine with a few DSL/T1 lines connected to it split proxy traffic among all the links. It doesn't look all that random, especially at first, since all of the threads presumably have the same random seed and so end up generating the same sequence of numbers. It hasn't been a big enough issue for me to fix it, since it evens out over time.

Note that these IP addresses actually have to be live on your system or the bind will fail, and probably with spectacular results. (I suspect it will lock up, since it repeatedly retries failures to bind() the local address -- this is to deal with "Address already in use" issues where the local and remote address/port pairs are identical across two transactions). Also see my previous post "Put it where it doesn't belong" to make sure that this IP traffic makes it out the appropriate interface instead of everything riding out the defaultroute. That is not what you want.

This also makes the variable "proxy-source" available to the logging system - for example:
LogFormat "%h %l %u %t \"%r\" %>s %b %T %{proxy-source}n" proxy
will include the IP address of the chosen proxy as the last value of the log entry. It will show as a "-" if it's not set -- if the request comes out of cache, or if it's a continuation of an HTTP/1.1 keepalive request, this may happen. (I may look into a way of preserving it for HTTP/1.1 requests in the future)
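To see how evenly requests are being spread across the source addresses, the last field of that log can be tallied. This is a sketch only: proxy_source_tally is a hypothetical helper name, and it assumes the "proxy" LogFormat above, so that the proxy-source value is the final whitespace-separated field on every line.

```shell
# proxy_source_tally LOGFILE
# Count requests per source address, busiest first. Entries served from
# cache or over a reused keepalive connection show up under "-".
proxy_source_tally() {
    awk '{ count[$NF]++ } END { for (src in count) print count[src], src }' "$1" |
        sort -rn
}

# Example (the log path is an assumption):
# proxy_source_tally /var/log/apache2/proxy.log
```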

This code seems to be pretty stable; there have been a couple times where it's started up and given a signal 8(SIGWTF) but that's been rare. We see it gleefully take over 100 hits per second and push 100mbits+ traffic for extended periods.

Note: This has not been tested under the following circumstances:
- Multiple virtual hosts sharing the same list of IP's
- Multiple virtual hosts with discrete lists of IP's
- A single ProxyMultiSource IP (probably useful in its own right eh)
- Apache built with this patch _not_ implementing the ProxyMultiSource directive. (It may fail to bind an address at all)
- Virtual Hosts configured but not implementing the ProxyMultiSource directive.
If this stuff doesn't work, btw, it should not bomb the whole server, only the proxy functionality.

Note. I haven't written anything substantial in C in like 10 years, please be nice.

Get the patch here:

Friday, September 22, 2006

Put it where it doesn't belong!

Our latest fight with linux has been to get a machine with multiple connections to the internet via different circuits to pass traffic correctly. The machine is a web proxy which will bind the socket to a specific source address to round-robin the traffic across the two networks. This would be way useful for a company to effectively "bind" a few low-bandwidth links (dsl, t1, etc) into a higher bandwidth office proxy.

The first problem is to make the traffic that is sourced from a specific set of IP addresses use a different routing scheme. In this example, all of this traffic is going out to the internet, so with a default configuration, it would use the default route regardless of what the source address was set to.

We have a router machine with a bunch of interfaces - 2 onboard interfaces and a Silicom PXG-6 (6-port gigabit card). Let's say we have two DSL lines, each with a small set of static IP's. One line was delivered first, so it's set up as the default route to the internet over eth7, while eth6 is hooked up to the router for our internal LAN.

The routing table looks like this:
U 0 0 0 eth7
U 0 0 0 eth0
UG 0 0 0 eth6
UG 0 0 0 eth6
UG 0 0 0 eth7
The new link has been configured on eth0.

The first thing we'll need to configure this is the "iproute" tool, on debian, as easy as "apt-get install iproute".

First thing to be done with this tool is to create a parallel routing table. We will call the new table "pipe2". First, we need to edit /etc/iproute2/rt_tables and add a new line:

/etc/iproute2/rt_tables: (The new entry is the "200 pipe2" line)
# reserved values
255 local
254 main
253 default
200 pipe2
0 unspec
# local
#1 inr.ruhep
(The number 200 is arbitrary but must fall between local and unspec)

Now add routes to these tables. I ended up putting these commands as "post-up" rules in the debian networking scripts for this interface (/etc/network/interfaces)
ip route add via table pipe2
ip route add via table pipe2
ip route add via table pipe2
ip route add default via table pipe2
This sets up what the routing table should look like for traffic sourced from the second set of public addresses. Note that the rules to send office LAN traffic internally have to be duplicated in this table.

Next, we must insert a policy route that tells the kernel when to apply this routing table to the traffic:
ip rule add from table pipe2
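For reference, here is roughly how the full set of commands fits together. This is a sketch with placeholder values: pipe2_rules is a hypothetical helper name, and every address in the example is invented from documentation and private ranges, not taken from our network. It only prints the commands, so they can be reviewed before being pasted into post-up lines or piped to sh as root.

```shell
# pipe2_rules SRC_NET GATEWAY LAN_NET LAN_GW
# Print (not run) the commands that populate table pipe2 and select it
# for traffic sourced from SRC_NET.
pipe2_rules() {
    src_net=$1
    gateway=$2
    lan_net=$3
    lan_gw=$4
    # Internal LAN route, duplicated from the main table.
    echo "ip route add $lan_net via $lan_gw table pipe2"
    # Default route out the second link.
    echo "ip route add default via $gateway table pipe2"
    # Policy rule: use pipe2 for anything sourced from the second network.
    echo "ip rule add from $src_net table pipe2"
}

# Example with made-up addresses:
# pipe2_rules 198.51.100.0/29 198.51.100.1 10.0.0.0/8 10.1.1.1
```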

This gets traffic that is sourced from an IP on the second network to use the correct default router. However, there are still a few more steps. By default, linux will answer ARP requests for any IP address it owns over any interface. This means that, in the above example, eth7 (on the first network) could claim to be the owner of an IP on the second network.

This is solved with the arp_filter control in /proc/sys/net/ipv4/conf/(interface)/arp_filter. We turned it on for every interface with:
for i in /proc/sys/net/ipv4/conf/*/arp_filter; do echo 1 > "$i"; done
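To confirm the setting took effect on every interface, the same /proc files can be read back. A minimal sketch; arp_filter_report is a hypothetical helper name, and the optional directory argument exists only so it can be tried against a scratch directory instead of the live /proc tree:

```shell
# arp_filter_report [CONF_DIR]
# List each interface with its arp_filter value. The exit status is
# non-zero if any interface still has the filter turned off.
arp_filter_report() {
    conf_dir=${1:-/proc/sys/net/ipv4/conf}
    status=0
    for f in "$conf_dir"/*/arp_filter; do
        [ -r "$f" ] || continue
        iface=$(basename "$(dirname "$f")")
        value=$(cat "$f")
        echo "$iface $value"
        [ "$value" = "1" ] || status=1
    done
    return $status
}

# Example:
# arp_filter_report || echo "arp_filter is still off somewhere"
```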

Here is a great discussion on what arp_filter does.
An excellent discussion of ARP as implemented on linux is Here. (This is where I found the solution to this problem, under "Arp Flux")

In retrospect we might have wanted to do that part first, to prevent the arp caches of various equipment from getting the MAC of the wrong interface. If stuff upstream gets the wrong MAC in its table, you can reset the hardware (DSL modem) or ask your ISP to flush their arp cache. We also may or may not have had some luck with "ip neigh flush".

Next, I will publish the patches to apache2 that allow for this proxy multisourcing.

Thursday, September 21, 2006

Home fileserver on Solaris Express/ZFS

(Mirroring this for posterity in case Svens ever forgets to pay his yahoo hosting bill :-) )

Basically I bought everything at Fry's. Going online, a lot of this stuff might be even cheaper, but probably not by a lot. With the exception of the raid controllers, all of this stuff was available at every store. The raid controller I finally found at the Brokaw Fry's. The only thing I re-used was an old 20x IDE CD-ROM, in retrospect since I ended up re-installing three times, it would have been worth $30 for a fast CDrom to get that 2 hours of my life back.

Case: Aluminus Ultra: $130
This is kinda gaudy because of the side window and high gloss and everything, but it did have the advantage of having a neat internal 3.5-inch hard drive mounting rail system. Also it has 5 5 1/4 inch bays which is important for the other disks plus CD-ROM. Since it's mostly 120mm fans it's relatively quiet -- still a lot noisier than my shuttle, but quieter than my old tower.

SATA Cage: "random fry's brand": $120
This thingie lets you mount 4 sata drives in a neat hotswap case that takes up 3 5 1/4 inch bays.

2 x SATA Controller: "SIIG 4-port RAID": $80/ea
These are SiliconImage 3114 chipset based RAID cards. They come with all of the power and data cables you'll need.

8 x SATA disks: "Maxtor 250G Maxline Plus II" $100/ea
These are plain SATA-I disks but the controller only does SATA-I anyway.

Motherboard: "Asus K8N socket 754" $47 - open stock
Piece of crap open box motherboard. I got it because it has AGP video, 3-4 PCI slots, and builtin gigabit ethernet.

CPU: "AMD Sempron 2800+ 64bit" $41
Retail box -- slower CPU than what I expected, but it's good enough to run the RAID calcs, samba, httpd, etc, and it's 64 bit :-)

RAM: "Patriot 1G value Ram" $90
With $15 rebate that I'll probably forget about (edit 02/10/2007 -- I forgot)

Video: "No-Name GeForce MX400" $50
I bought it cause it was cheap and did not have a fan, enough crap spinning in there already.

I did not buy a floppy drive
If you do this, buy a floppy drive, or at least make sure you have one handy you can use temporarily.

The total damage was around $1500 w/ tax and everything. The total space after RAID is 1.8T, less than a dollar a gig.
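As a sanity check on those numbers, the parts list can be added up. The sketch below uses only the quantities and prices listed above; parts_total and cost_per_gb are hypothetical helper names, and treating 1.8T as 1800 gigs is a rounding assumption.

```shell
# parts_total
# Sum the parts list, given as "quantity unit_price" pairs in the order
# the parts are listed above (pre-tax, before the RAM rebate).
parts_total() {
    awk '{ total += $1 * $2 } END { print total }' <<'EOF'
1 130
1 120
2 80
8 100
1 47
1 41
1 90
1 50
EOF
}

# cost_per_gb TOTAL_DOLLARS GIGABYTES
cost_per_gb() {
    awk -v t="$1" -v g="$2" 'BEGIN { printf "%.2f\n", t / g }'
}

# parts_total            prints 1438
# cost_per_gb 1500 1800  prints 0.83
```

The pre-tax sum of 1438 is consistent with "around $1500" after tax, and 0.83 dollars per gig backs up the "less than a dollar a gig" claim.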

II. Trials and Tribulations:
Everything snapped into this case pretty well. I spent extra time making sure that the sata and power cables were tied down nicely, to promote good airflow.

I started installing the latest build of OpenSolaris, Nevada Build 31, because I wanted to play with ZFS and I wanted the most mature sata/network driver support available. It's 4 CD's not counting the language or "add ons" (/opt/sfw) pack. The main problem I had was that Solaris 10 does not recognize these RAID cards out of the box. Solaris still does not recognize any SATA hardware that is acting as a raid card. The way to "fix" these cards is to remove their RAID functionality by loading a straight IDE BIOS. Download the IDE BIOS here. Also download the "BIOS Updater Utilities".

You'll need a DOS boot disk to copy these files to. If you do not have a DOS boot disk, there are instructions in this .ZIP file that tell you where to download a FreeDOS boot disk image where you can then copy the BIOS updater and the IDE BIOS .bin file. Getting a floppy drive hooked up is the hard part. Once you're booted to DOS, the command I used was
A:\> UPDFLASH B5304.BIN -v 

The command will carp about various stuff and then go about its business updating your Flash BIOS. For this specific RAID card, I think I had to tell it that the Flash memory was compatible with STT 39??010 1M flash. The command updated both of the cards in my system at the same time; it did not require me to run it twice or use special command line flags.

Thus updated, you can now reboot your system. You may notice that during POST, the cards are now called "Silicon Image 3114 SATALink" instead of "3114 SATARaid" and have no option to press <ctl-s> to enter their BIOS. You can now install Solaris from CD as normal.

I do not care for installing Solaris off of CD. It gets to the "Using RPC for sysid configs" (or something) step of the boot, and then just hangs there for 5+ minutes. There's really no way to tell if your machine is horked, or your CD drive froze up, or what, it's just sitting there not doing anything.

The installer could now see all 8 of my disks. I chose to put a small ~4G partition on c1d0s0 and 512M for swap on c1d0s1. I left slice 6 empty and a 1-cylinder slice at s7 for the metadb. Once the operating system installed, I mirrored onto the first disk of the second controller (disk4) by giving it an identical partitioning scheme as disk 0 and using the standard solaris meta-commands:
#  metadb -fa c1d0s7 c3d0s7
# metainit -f d1 1 1 c1d0s0
# metainit -f d2 1 1 c3d0s0
# metainit -f d6 1 1 c1d0s1
# metainit -f d7 1 1 c3d0s1
# metainit d0 -m d1
# metainit d5 -m d6
# metaroot d0
# (edit vfstab to use d5 for swap)
# lockfs -fa
# reboot

Once the system came back up, I attached the metadevices:
#  metattach d0 d2
# metattach d5 d7

The sync went fast since these are small slices. At this point I did some additional configuration. You'll want to use the "svcadm" command to turn off things like autofs, telnet, ftp, etc:
#  svcadm disable telnet
# svcadm disable autofs
# svcadm disable finger
(etc, I did not document exactly what I disabled but autofs has to be turned off if you want home directories to work :-))

Also, if you have this specific motherboard, you'll need to add the following to your /etc/driver_aliases for the system to find your network card:
nge "pci10de,df"
(This tells the system to bind the Nvidia Gigabit Ethernet driver to the PCI device with vendor ID 10de and product ID 00df.) Do the usual editing of /etc/hostname.nge0, /etc/inet/ipnodes, /etc/hosts, /etc/netmasks and /etc/defaultrouter to get your network up and running.

If you had something that Solaris just saw out of the box, you probably set this up already during the install. If not, you might have to add something different to your /etc/driver_aliases - there is pretty good google juice on the various ethernet cards out there. I rebooted at this point for the changes to take effect.

Now for the fun part, ZFS. For my non-root disks, I created another partition using almost all of the disks (I saved the first three cylinders since it looked like there was some boot information or something on there). zfs was a lot easier than I thought it would be. Do a man on zfs and zpool. To create the pool, it was easy:
zpool create -f u01 raidz c1d0s6 c1d1s6 c2d0s6 c2d1s6 c3d0s6 c3d1s6 c4d0s6 c4d1s6

The -f is to force it, because c1d0s6 and c3d0s6 are smaller than all of the other partitions. This command returned in about 4 seconds. It literally takes longer to type out the command than it does for ZFS to create you a 1.78TB filesystem. The filesystem will auto-magically be mounted on /u01 (because the pool name is u01, specified on the command line above, it could be any arbitrary name). From here, you can use the "zfs" command to create new filesystems that share data from this pool:

zfs create u01/home
zfs create u01/home/amiller
zfs set mountpoint=/home/amiller u01/home/amiller

None of this has to go into /etc/vfstab, the system just knows about it and mounts it at boot with "zfs mount -a".
So now on my box, I have:

amiller$ df -kh |grep u01
u01 1.8T 121K 1.7T 1% /u01
u01/home 1.8T 119K 1.7T 1% /u01/home
u01/home/amiller 1.8T 3.7M 1.7T 1% /home/amiller

Friday, September 15, 2006

ZFS on Nexenta

This week I've played around with Nexenta. This is a neat operating system. It is basically a "distribution" of OpenSolaris, using a debian-like package management system, built on the Ubuntu "Dapper Drake" release. Fortunately, it just happened to support all of the pieces of hardware in the test system, a Supermicro H8DAR-T, with Broadcom ethernet and a Marvell88 based onboard SATA2 controller.

I installed this in support of some filesystem/raid benchmarking I have been working on. ZFS seems to benchmark similarly to XFS on linux using software raid, while also having a lot of other neat features, like snapshots, filesystem level compression, clones, etc. With filesystem compression enabled, ZFS beats the pants off of these other filesystems in synthetic benchmarks, but I suspect that the files that tiobench creates compress very well and so require very little actual IO to the disks. The other side of that coin is that if you are working with compressible files (text, html, possibly even databases), ZFS+compression might work very well for real-world performance. Granted, the CPU utilization involved in managing the filesystem will be higher, but I've realized that I'd rather spend the CPU running the disks as fast as possible than have the systems sitting idle waiting for data to come off of the spindles. If I get motivated I may try to create a mysql database with some of our application data and do some read/write benchmarks.

One interesting ZFS-ism I ran into was the size reporting of ZFS volumes. I created a raidz (ZFS version of Raid5) with 4 "400G" disks (Which are actually about 370 gigs), which showed up in "df" as ~1.46TB -- which is the size I'd expect if all 4 of them were actual data drives. Software raid5 on linux and the hardware raid card both show ~1.1TB of space. It turns out that you don't magically gain space with ZFS, rather, the "parity tax" of raid5 is tacked on for each file as it is added to the filesystem. To demonstrate:

root@test:/u01/asdf# mkfile 2g testfile

root@test:/u01/asdf# ls -lh
total 2.7G
-rw------T 1 root root 2.0G Sep 16 00:58 testfile

root@test:/u01/asdf# df -kh /u01/asdf
Filesystem Size Used Avail Use% Mounted on
u01/asdf 1.3T 2.7G 1.3T 1% /u01/asdf
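The arithmetic behind those numbers can be sketched as follows. The helper names are hypothetical, and the parity overhead of n/(n-1) for an n-disk single-parity raidz is an approximation that holds for large files; it ignores metadata and padding.

```shell
# raidz_sizes DISKS MARKETED_GB
# Print raw and usable capacity in binary terabytes for a single-parity
# raidz, starting from the decimal gigabytes printed on the disk label.
raidz_sizes() {
    awk -v n="$1" -v gb="$2" 'BEGIN {
        tib = 1024 * 1024 * 1024 * 1024
        printf "raw %.2f\n", n * gb * 1e9 / tib
        printf "usable %.2f\n", (n - 1) * gb * 1e9 / tib
    }'
}

# raidz_on_disk DISKS FILE_GB
# Approximate space charged for a file once parity is included.
raidz_on_disk() {
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.2f\n", s * n / (n - 1) }'
}

# raidz_sizes 4 400    prints "raw 1.46" and "usable 1.09"
# raidz_on_disk 4 2    prints 2.67
```

The raw figure matches the ~1.46TB that ZFS advertises, the usable figure matches the ~1.1TB the other raid5 implementations report, and 2.67 rounds to the 2.7G charged for the 2g test file.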

Hoping to complete the benchmarks I've been doing and post them here soon.

Sunday, September 10, 2006

Linux BIOS research

We did a lot of work in the BIOS of our machines last week. Doing BIOS edits by hand sucks, there's got to be a better way than connecting the keyboard and monitor to 100+ machines, rebooting them, and changing settings by hand.

Ben dug up some information in a Sun v20z configuration guide, which seems to share a similar BIOS to the machines we have. We had ECC memory enabled in the BIOS before, but we seemed to have a lot of machines dying and hanging. We changed the following additional settings:
ECC Logging: [Enabled]
ECC Chipkill: [Enabled]
ECC Scrubbing: [163 us]
L2 Scrubbing: [10.2 us]
L1 Scrubbing: [5.12 us]

We also upgraded to a later Linux kernel version and added the EDAC module. The k8/opteron code was not in the mainline kernel but is buildable as a module.

It's hard to declare a success based on the absence of machines hanging, but we've seen the EDAC catch some errors and haven't had a memory related machine hang yet.
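One way to watch what EDAC is catching is to read its error counters out of sysfs. A minimal sketch, assuming the EDAC core exposes per-controller ce_count (corrected) and ue_count (uncorrected) files under /sys/devices/system/edac/mc, which may differ between kernel versions; edac_counts is a hypothetical helper name, and the directory argument is only there for trying it against a scratch tree.

```shell
# edac_counts [MC_DIR]
# Print corrected (ce) and uncorrected (ue) error counts for each
# memory controller that EDAC knows about.
edac_counts() {
    mc_dir=${1:-/sys/devices/system/edac/mc}
    for mc in "$mc_dir"/mc*; do
        [ -r "$mc/ce_count" ] || continue
        echo "$(basename "$mc") ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}

# Example:
# edac_counts
```

A ce count that keeps climbing on one controller points at a DIMM worth swapping before it starts hanging the machine.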

Edit: A couple weeks later and the failure rate is way down.