Wednesday, January 24, 2007

Linux memory overcommit

Last week I learned something very interesting about the way Linux allocates and manages memory out of the box.

In a way, Linux allocates memory the way an airline sells plane tickets. An airline will sell more tickets than they have actual seats, in the hopes that some of the passengers don't show up. Memory in Linux is managed in a similar way, but actually to a much more serious degree.

Under the default memory management strategy, malloc() essentially always succeeds, with the kernel assuming you're not _really_ going to use all of the memory you just asked for. The malloc() calls keep succeeding, but the kernel doesn't 'really' allocate the memory until you actually touch it. This leads to severe pathology in low memory conditions: the application has already allocated the memory and thinks it can use it free and clear, but when the system runs low on memory and the application finally touches pages it allocated earlier, the access takes a very long time while the kernel hunts around for memory to give it.
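
You can actually watch the kernel making these promises from userspace. On the 2.6 kernels this applies to, /proc/meminfo carries the commit accounting, so a quick sketch like this (field names could vary a bit between kernel versions) shows how much memory has been promised to processes versus what the box physically has:

# Memory the kernel has promised (Committed_AS) vs. what exists.
# Under the default policy, Committed_AS can happily exceed
# MemTotal + SwapTotal.
grep -E 'MemTotal|SwapTotal|Committed_AS' /proc/meminfo

# The overcommit policy currently in effect (0 = heuristic default)
cat /proc/sys/vm/overcommit_memory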

In an extremely low memory condition, the kernel will start firing off the "OOM Killer" routine. Processes are given 'OOM scores', and the process with the highest score win^H^H^Hloses. This leads to seemingly random processes on a machine being killed by the kernel. Keeping with the airline analogy, I found this entertaining post.
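
If you're curious which process the OOM killer would pick on, 2.6 kernels expose the scores under /proc. A rough sketch, assuming your kernel has the /proc/<pid>/oom_score and /proc/<pid>/oom_adj files (PID 1234 below is just a made-up example):

# List processes by OOM score, highest (most likely to be killed) first
for p in /proc/[0-9]*; do
    echo "$(cat $p/oom_score 2>/dev/null) ${p##*/}"
done | sort -rn | head

# Make a hypothetical PID 1234 exempt from the OOM killer
# (-17 is the "never kill" value on kernels that support it)
echo -17 > /proc/1234/oom_adj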

I found some interesting information about the Linux memory manager here in section 9.6. That section has three small C programs to test memory allocation. The second and third programs produced pretty similar results for me, so I'm omitting the third:

Here are the results of the test on an 8GB Debian Linux box:

demo1: malloc memory and do not use it: Allocated 1.4TB, killed by OOM killer
demo2: malloc memory and use it right away: Allocated 7.8GB, killed by OOM killer


Here are the results on an 8GB Nexenta/OpenSolaris machine:

demo1: malloc memory and do not use it: Allocated 6.6GB, malloc() fails
demo2: malloc memory and use it right away: Allocated 6.5GB, malloc() fails


Apparently, a big reason Linux manages memory this way out of the box is to optimize memory usage for fork()'ed processes; fork() creates a full copy of the process address space, but with overcommitted memory, only the pages that actually get written to need to be backed by the kernel. This might work very well for a shell server, a desktop, or perhaps a server with a large memory footprint that forks separate processes rather than spawning threads, but in our situation it is very undesirable.

We run a pretty Java-heavy environment, with multiple large JVMs configured per host. The problem is that the heap sizes have been getting larger, and we were running in an overcommitted situation and did not realize it. The JVMs would all start up and malloc() their large heaps, and then at some later time, once enough of the heaps were actually used, the OOM killer would kick in and more or less randomly kill off one of our JVMs.

I found that Linux can be brought more in line with traditional/expected memory management by setting the following sysctls (apparently these are available only on 2.6 kernels):

vm.overcommit_memory (0=default, 1=malloc always succeeds(?!?), 2=strict overcommit)
vm.overcommit_ratio (50=default, I used 100)
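
Setting them looks something like this; sysctl -w changes the running kernel, and appending to /etc/sysctl.conf makes it stick across a reboot (adjust to taste):

# Switch the running kernel to strict commit accounting
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=100

# Persist the settings across reboots
cat >> /etc/sysctl.conf <<EOF
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
EOF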


The ratio appears to be the percentage of the system's total VM that can be allocated via malloc() before malloc() fails. This MIGHT be on a per-pid basis (need to research). This number can be greater than 100%, presumably to allow for some slop from the copy-on-write fork()'s. When I set this to 100 on an 8GB system, I was able to malloc() about 7.5GB of stuff, which seemed about right since I had normal multi-user processes running and no swap configured. I don't know why you'd want to use a number much less than 100, unless it were a per-process limit, or you wanted to reserve some room for the filesystem cache.
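
For what it's worth, the commit accounting in /proc/meminfo suggests the limit is system-wide rather than per-process: with overcommit_memory=2 the kernel computes a single ceiling, roughly swap plus overcommit_ratio percent of RAM, and you can watch allocations count against it:

# CommitLimit = SwapTotal + (overcommit_ratio/100) * MemTotal (roughly);
# Committed_AS is how much address space has been promised so far.
# malloc() starts failing once Committed_AS would pass CommitLimit.
grep -E 'CommitLimit|Committed_AS' /proc/meminfo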

The big benefit here is that malloc() can actually fail in a low memory condition, which means the error can be caught and handled by the application. In my case, it means that JVMs fail at STARTUP time, with an obvious memory-shortage error in the logs, rather than having the rug yanked out from under them hours or days later with no message in the application log and no opportunity to clean up what they were doing.

Here are the demo programs on a Linux machine set to strict overcommit with a 100 ratio:

demo1: malloc memory and do not use it: Allocated 7.3GB, malloc fails.
demo2: malloc memory and use it right away: Allocated 7.3GB, malloc fails.



Tuesday, January 23, 2007

Debugging mysql5 on Nexenta

Due to some very favorable benchmarking results, I am planning to migrate some of our production databases to MySQL 5 on Nexenta. Previously the database was running on a Debian Linux server, where the I/O subsystem performed much worse.

I ran into a very strange problem with MySQL 5 under Nexenta, however. After a certain number of clients had connected, the server would sometimes begin to refuse connections in an odd way: it would accept the connection on the mysql port and then immediately close it.

Fortunately, one of the other reasons I want to move to Nexenta is the more robust toolchain for troubleshooting just these kinds of problems. I started out by using 'truss' on thread 1 of the mysql daemon under the assumption that it was the thread responsible for managing incoming client connections - not a bad guess. Here is a trace of a mysql connection that works correctly vs one that breaks:

Works OK:

root@perftest-db01:~# truss -w all -p 6650/1
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 57
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0
/1: getpid() = 6650 [6589]
/1: getpeername(57, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0
/1: getsockname(57, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 58
/1: fstat64(58, 0x08046B20) = 0
/1: fstat64(58, 0x08046A50) = 0
/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY
/1: read(58, " # / e t c / h o s t s".., 8192) = 677
/1: read(58, 0x504EA88C, 8192) = 0
/1: llseek(58, 0, SEEK_CUR) = 677
/1: close(58) = 0
/1: open("/etc/hosts.deny", O_RDONLY) = 58
/1: fstat64(58, 0x08046B20) = 0
/1: fstat64(58, 0x08046A50) = 0
/1: ioctl(58, TCGETA, 0x08046AEC) Err#25 ENOTTY
/1: read(58, " # / e t c / h o s t s".., 8192) = 901
/1: read(58, 0x504EA88C, 8192) = 0
/1: llseek(58, 0, SEEK_CUR) = 901
/1: close(58) = 0
/1: getsockname(57, 0x08047938, 0x08047958, SOV_DEFAULT) = 0
/1: fcntl(57, F_SETFL, (no flags)) = 0
/1: fcntl(57, F_GETFL) = 2
/1: fcntl(57, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: setsockopt(57, ip, 3, 0x0804748C, 4, SOV_DEFAULT) = 0
/1: setsockopt(57, tcp, TCP_NODELAY, 0x0804748C, 4, SOV_DEFAULT) = 0
/1: time() = 1169599669
/1: lwp_kill(73, SIG#0) Err#3 ESRCH
/1: lwp_create(0x08047240, LWP_DETACHED|LWP_SUSPENDED, 0x08047464) = 243
/1: lwp_continue(243) = 0
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)

Immediately closes connection:

root@perftest-db01:~# truss  -w all  -p 6650/1
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047948, 0x08047958, SOV_DEFAULT) = 255
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x08047420, 0x080474A0) = 0
/1: getpid() = 6650 [6589]
/1: getpeername(255, 0xFEF67A90, 0x080474B8, SOV_DEFAULT) = 0
/1: getsockname(255, 0xFEF67A80, 0x080474B8, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 257
/1: close(257) = 0
/1: fxstat(2, 256, 0x08045DF8) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x080467B8, 0x080467C4, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x08045C10) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x08045C48) = 0
/1: close(257) = 0
/1: fxstat(2, 256, 0x080459B8) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x08046378, 0x08046384, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x080457D0) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x08045808) = 0
/1: close(257) = 0
/1: fxstat(2, 256, 0x08046A88) = 0
/1: time() = 1169599778
/1: getpid() = 6650 [6589]
/1: putmsg(256, 0x08047448, 0x08047454, 0) = 0
/1: open("/var/run/syslog_door", O_RDONLY) = 257
/1: door_info(257, 0x080468A0) = 0
/1: getpid() = 6650 [6589]
/1: door_call(257, 0x080468D8) = 0
/1: close(257) = 0
/1: shutdown(255, SHUT_RDWR, SOV_DEFAULT) = 0
/1: close(255) = 0
/1: pollsys(0x080473C0, 2, 0x00000000, 0x00000000) (sleeping...)

Looks like the main difference starts here:
/1:     open("/etc/hosts.allow", O_RDONLY)              = 257
/1: close(257) = 0
That explains a lot -- the "hosts.allow" file is part of the tcpwrappers system, which controls access to various daemons on the system based on access control rules set by the system administrator. No wonder I am getting a connection but then immediately getting booted. It is trying to open the hosts.allow file, but then is immediately closing it, vs actually reading and processing the file as seen in the working connection. Does the process not have enough filehandles?
root@perftest-db01:~# pfiles  6650 |head -2
6650: /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql
Current rlimit: 8192 file descriptors
Nope, doesn't look that way -- it's configured to use 8192 filehandles. My next clue was the file descriptor number returned by the "open" system call: 257. That's awfully near one of those magic "power of 2" boundaries, so I started snooping around on Google.

It turns out that under Solaris, and maybe *BSD also, the tcpwrappers library (libwrap) uses the "stdio" library to manage I/O. In a 32-bit process, this library cannot handle file descriptors above 255, so as the mysql server continues to collect client connections and open tables for reading, it eventually crosses this file descriptor boundary and calls to open "hosts.allow" appear to fail because they return too high a file descriptor number. tcpwrappers appears to fail closed, so since it cannot read the "hosts.allow" file, it denies access to the service by immediately closing the communication channel.
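
A quick way to confirm the daemon has drifted past that boundary (6650 is the mysqld PID from the truss output above) is just to count its open descriptors:

# Count mysqld's open file descriptors; once the descriptors below 256
# are all in use, 32-bit stdio consumers like libwrap start to fail.
ls /proc/6650/fd | wc -l

# pfiles shows the same thing with more detail, one S_IF* line per descriptor
pfiles 6650 | grep -c S_IF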

Fortunately, there is a fix. Giri Mandalika has a blog entry that references the issue and is a good resource on the problem. The solution is to use the extendedFILE library provided in Solaris Express 06/06 or later (so it is included in Nexenta Alpha 6, and possibly earlier):

root@perftest-db01:~# export LD_PRELOAD_32=/usr/lib/extendedFILE.so.1
root@perftest-db01:~# /etc/init.d/mysql restart

(Obviously I will also need to modify the /etc/init.d/mysql startup script to include the LD_PRELOAD_32.) Next, I start up a test program to artificially create a bunch of connections to the database, and see what a truss looks like now:

root@perftest-db01:~# truss -w all -p 6846/1
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) = 1
/1: fcntl(11, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: accept(11, 0x08047918, 0x08047928, SOV_DEFAULT) = 294
/1: fcntl(11, F_SETFL, FWRITE) = 0
/1: sigaction(SIGCLD, 0x080473F0, 0x08047470) = 0
/1: getpid() = 6846 [6785]
/1: getpeername(294, 0xFEF47A90, 0x08047488, SOV_DEFAULT) = 0
/1: getsockname(294, 0xFEF47A80, 0x08047488, SOV_DEFAULT) = 0
/1: open("/etc/hosts.allow", O_RDONLY) = 295
/1: fstat64(295, 0x08046AF0) = 0
/1: fstat64(295, 0x08046A20) = 0
/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY
/1: read(295, " # / e t c / h o s t s".., 8192) = 677
/1: read(295, 0x5122F9D4, 8192) = 0
/1: llseek(295, 0, SEEK_CUR) = 677
/1: close(295) = 0
/1: open("/etc/hosts.deny", O_RDONLY) = 295
/1: fstat64(295, 0x08046AF0) = 0
/1: fstat64(295, 0x08046A20) = 0
/1: ioctl(295, TCGETA, 0x08046ABC) Err#25 ENOTTY
/1: read(295, " # / e t c / h o s t s".., 8192) = 901
/1: read(295, 0x5122F9D4, 8192) = 0
/1: llseek(295, 0, SEEK_CUR) = 901
/1: close(295) = 0
/1: getsockname(294, 0x08047908, 0x08047928, SOV_DEFAULT) = 0
/1: fcntl(294, F_SETFL, (no flags)) = 0
/1: fcntl(294, F_GETFL) = 2
/1: fcntl(294, F_SETFL, FWRITE|FNONBLOCK) = 0
/1: setsockopt(294, ip, 3, 0x0804745C, 4, SOV_DEFAULT) = 0
/1: setsockopt(294, tcp, TCP_NODELAY, 0x0804745C, 4, SOV_DEFAULT) = 0
/1: time() = 1169601081
/1: lwp_kill(273, SIG#0) Err#3 ESRCH
/1: lwp_create(0x08047210, LWP_DETACHED|LWP_SUSPENDED, 0x08047434) = 274
/1: lwp_continue(274) = 0
/1: pollsys(0x08047390, 2, 0x00000000, 0x00000000) (sleeping...)

As you can see above, the open() call on the "hosts.allow" file is returning a file descriptor greater than 255, but this time reading and processing of the hosts.allow file proceeds normally, and the connection is accepted.

Yay for truss!


Monday, January 22, 2007

ZFS features

Here's a post I just entered on the Nexenta/gnusolaris Beginners Forum that has some good info about ZFS. Apparently the formatting got eaten on the mailing list so I'm reposting it here:


Hi all,

Can I have it installed concurrently with linux and allocate linux partitions to the RAID Z? or RAID-Z takes the whole disks?


There are two "layers" of partitions in opensolaris; the first is managed with the "fdisk" utility, the second is managed with the "format" utility - these partitions are aka "slices". I am not an expert, but I believe that the "fdisk" managed partitions are the pieces that linux/windows/etc sees. You first would allocate one of these partitions to Solaris, and from there you can additionally split that fdisk partition into root/swap/data "slices". I believe that the linux partitions you'd see would be visible via the "fdisk" command.

According to some of the ZFS FAQ/wiki resources, ZFS is "better" if it manages the entire disk; however, it will work just fine managing either "partitions" or "slices". You can even make a ZFS pool out of individual files.
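
A file-backed pool is really only useful for experimenting, but as a quick sketch (file names below are made up, and a mirror is used just for illustration):

# Create two 100MB backing files and build a mirrored pool on them
mkfile 100m /var/tmp/zdisk1 /var/tmp/zdisk2
zpool create testpool mirror /var/tmp/zdisk1 /var/tmp/zdisk2
zpool status testpool

# Tear it down when finished
zpool destroy testpool
rm /var/tmp/zdisk1 /var/tmp/zdisk2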

Here is an example of one of my disks. There is one "fdisk" partition, and a few "slices":


root@medb01:~# fdisk -g /dev/rdsk/c0t0d0p0
* Label geometry for device /dev/rdsk/c0t0d0p0
* PCYL NCYL ACYL BCYL NHEAD NSECT SECSIZ
48638 48638 2 0 255 63 512

root@medb01:~# prtvtoc /dev/rdsk/c0t0d0p0
* /dev/rdsk/c0t0d0p0 partition map
*
* Dimensions:
* 512 bytes/sector
* 63 sectors/track
* 255 tracks/cylinder
* 16065 sectors/cylinder
* 48640 cylinders
* 48638 accessible cylinders
*
* Flags:
* 1: unmountable
* 10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      0    00      16065   8401995   8418059
       1      0    00    8418060  16787925  25205984
       2      5    01          0 781369470 781369469
       6      0    00   25205985 756147420 781353404
       7      0    00  781353405     16065 781369469
       8      1    01          0     16065     16064


Note that in the following examples, I'll create ZFS pools with "c0tXd0s6", that is, slice 6 in the Solaris partition table.


Alternatively, Can I mount my Linux RAID partitions on Nexenta, at least for migration purposes? What about the LVM disks?


As far as I know, there is no LVM or Linux filesystem support built into OpenSolaris/Nexenta; i.e. you could not just "mount -t ext3" a Linux filesystem and be able to read it. Since you've mentioned that you're running a VMware server, I suppose it may be possible to have both guest operating systems running and copy the data over the 'network'. Also, it's likely that Nexenta won't know about LVM-managed partitions; it would have to be a real honest-to-goodness partition.


What about RAID-Z features:
Can I hot-swap a defective disk?


This should be possible, assuming that your hardware supports it. You may need to force a rescan of the devices if you replace a disk; check devfsadm. Reintegrating the new disk into the pool would be accomplished with "zpool replace pool device [new device]", as sketched below.
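
A sketch of what that might look like, using c0t2d0s6 from my pool below as a stand-in for the failed disk (device names are just examples):

# Make sure Solaris sees the new disk after the physical swap
devfsadm

# Rebuild onto the replacement disk sitting in the same slot
zpool replace u01 c0t2d0s6

# Watch the resilver progress
zpool status u01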


Can I add a disk to the server and tell it to enlarge the pool, to make more space available on the preexisting RAID?


Yes, with a caveat - ZFS doesn't do any magic stripe re-balancing. If you have a 4-disk raidz pool and add another disk, what you really have is a 4-disk raidz with a single disk tacked on at the end with no redundancy. Best practice would be to add space in 'chunks' of several disks. Fortunately I am in the middle of building a Nexenta-based box with 4 SATA drives, so I can play around with some of the commands and show you the output:

Here is a 4-disk zpool using raidZ:


root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6 c0t3d0s6
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
u01           ONLINE       0     0     0
  raidz1      ONLINE       0     0     0
    c0t0d0s6  ONLINE       0     0     0
    c0t1d0s6  ONLINE       0     0     0
    c0t2d0s6  ONLINE       0     0     0
    c0t3d0s6  ONLINE       0     0     0



Here is a 3-disk raidZ pool that I "grow" by adding a single additional disk. Note the subtle indentation difference on c0t3d0s6 in this example; it is not part of the original raidz1 and is just a standalone disk in the pool.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 raidz c0t0d0s6 c0t1d0s6 c0t2d0s6
root@medb01:~# zpool add u01 c0t3d0s6
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool uses raidz and new vdev is disk
root@medb01:~# zpool add -f u01 c0t3d0s6
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
u01           ONLINE       0     0     0
  raidz1      ONLINE       0     0     0
    c0t0d0s6  ONLINE       0     0     0
    c0t1d0s6  ONLINE       0     0     0
    c0t2d0s6  ONLINE       0     0     0
  c0t3d0s6    ONLINE       0     0     0




Here is an example of adding space in "chunks"; note that the size of the pool is different in the "zpool list" output before and after.


root@medb01:~# zpool destroy u01
root@medb01:~# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:~# zpool list u01
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
u01 360G 53.5K 360G 0% ONLINE -
root@medb01:~# zpool add u01 mirror c0t2d0s6 c0t3d0s6
root@medb01:~# zpool list u01
NAME SIZE USED AVAIL CAP HEALTH ALTROOT
u01 720G 190K 720G 0% ONLINE -
root@medb01:~# zpool status u01
pool: u01
state: ONLINE
scrub: none requested
config:

NAME          STATE     READ WRITE CKSUM
u01           ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c0t0d0s6  ONLINE       0     0     0
    c0t1d0s6  ONLINE       0     0     0
  mirror      ONLINE       0     0     0
    c0t2d0s6  ONLINE       0     0     0
    c0t3d0s6  ONLINE       0     0     0


PS, doing it this way appears to stripe writes across the two mirrored "subvolumes".


Does it have a facility similar to LVM, where I can create 'logical volumes' on top of the RAID and allocate/deallocate space as needed for flexible storage management (without putting the machine offline)?


Yes. There are two layers in ZFS: pool management, handled through the "zpool" command, and filesystem management, handled through the "zfs" command. Individual filesystems are created as subdirectories of the base pool, or can be relocated with the "zfs set mountpoint" option if you desire. Here I create a ZFS filesystem called /u01/opt with a 100MB quota, and then increase the quota to 250MB.


root@medb01:~# zfs create -oquota=100M u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem kbytes used avail capacity Mounted on
u01 743178240 26 743178105 1% /u01
u01/opt 102400 24 102375 1% /u01/opt
root@medb01:~# zfs set quota=250m u01/opt
root@medb01:~# df -k /u01 /u01/opt
Filesystem kbytes used avail capacity Mounted on
u01 743178240 26 743178105 1% /u01
u01/opt 256000 24 255975 1% /u01/opt


Also, things like atime updates, compression, etc. can be set on a per-filesystem basis.
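
For example, using the u01/opt filesystem from above:

# Enable compression and disable atime updates on one filesystem
zfs set compression=on u01/opt
zfs set atime=off u01/opt

# Verify the settings (and where they are inherited from)
zfs get compression,atime u01/opt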



Can I do fancy stuff like plug an e-sata disk to my machine and tell it to 'ghost' a 'logical volume' on-the-fly, online, without unmounting the volume?


Yes, this is possible. ZFS supports "snapshots" - moment-in-time copies of an entire ZFS filesystem. ZFS also supports "send" and "receive" of a snapshot, so you can take that moment-in-time copy of your filesystem and replicate it somewhere else (or just leave the snapshot lying around for recovery purposes).

The procedure would be to create a ZFS pool on your external drive, and then "zpool import" that pool each time you plugged the drive in. Then create a snapshot of your filesystem and "send" it to the external drive, like so. (I don't have an external drive to import, so I'll just create 2 pools.) I test by creating a filesystem, creating a file in that filesystem, then snapshotting and sending that snapshot to a different pool. Note that the file I created exists in the destination when I'm done.


root@medb01:/# zpool destroy u01
root@medb01:/# zpool destroy u02
root@medb01:/# zpool create u01 mirror c0t0d0s6 c0t1d0s6
root@medb01:/# zpool create u02 mirror c0t2d0s6 c0t3d0s6
root@medb01:/# zfs create u01/data
root@medb01:/# echo "test test test" > /u01/data/testfile.txt
root@medb01:/# zfs snapshot u01/data@send_test
root@medb01:/# zfs send u01/data@send_test | zfs receive u02/u01_copy
root@medb01:/# ls -l /u02/u01_copy
total 1
-rw-r--r-- 1 root root 15 Jan 23 04:49 testfile.txt
root@medb01:/# cat /u02/u01_copy/testfile.txt
test test test
root@medb01:/#


Hope all this helps (and maybe makes it into the wiki too :-) )