Wednesday, January 24, 2007

Linux memory overcommit

Last week I learned something very interesting about the way Linux allocates and manages memory out of the box.

In a way, Linux allocates memory the way an airline sells plane tickets. An airline will sell more tickets than it has actual seats, in the hope that some of the passengers don't show up. Memory in Linux is managed in a similar way, but to a much more serious degree.

Under the default memory management strategy, malloc() essentially always succeeds, with the kernel assuming you're not _really_ going to use all of the memory you just asked for. The malloc()s will continue to succeed, but the kernel doesn't 'really' allocate the memory until you actually try to use it. This leads to severe pathology in low memory conditions: the application has already allocated the memory and thinks it can use it free and clear, but when the system is low on memory and the application tries to use memory it has already allocated, the memory access takes a very long time as the kernel hunts around for memory to give.

In an extremely low memory condition, the kernel will start firing off the "OOM Killer" routine. Processes are given 'OOM Scores' and the process with the highest score, win^H^H^Hloses. This leads to seemingly random processes on a machine being killed by the kernel. Keeping with the airline analogy, I found this entertaining post.
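To see which processes currently "lose" on a given box, here is a minimal sketch that ranks processes by OOM score. It assumes the kernel exposes each score as /proc/<pid>/oom_score; the function names and the proc_root parameter are made up for illustration.

```python
import os

def read_oom_scores(proc_root="/proc"):
    """Return {pid: oom_score} for every process whose score is readable."""
    scores = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                scores[int(entry)] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # process exited, or the file is missing/unreadable
    return scores

def likely_victims(scores, count=5):
    """Sort (pid, score) pairs by score, highest (most at risk) first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:count]

if __name__ == "__main__":
    for pid, score in likely_victims(read_oom_scores()):
        print(pid, score)
```

Run as a normal user, this only ranks what the kernel reports at that instant; the scores change as processes allocate and free memory.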

I found some interesting information about the Linux memory manager here in section 9.6. This section has three small C programs to test memory allocation. The second and third programs produced pretty similar results for me so I'm omitting the third:

Here are the results of the test on an 8GB Debian Linux box:

demo1: malloc memory and do not use it: Allocated 1.4TB, killed by OOM killer
demo2: malloc memory and use it right away: Allocated 7.8GB, killed by OOM killer

Here are the results on an 8GB Nexenta/OpenSolaris machine:

demo1: malloc memory and do not use it: Allocated 6.6GB, malloc() fails
demo2: malloc memory and use it right away: Allocated 6.5GB, malloc() fails

Apparently, a big reason Linux manages memory this way out of the box is to optimize memory usage on fork()'ed processes; fork() creates a full copy of the process space, but with overcommitted memory, only pages which have been written to actually need to be allocated by the kernel. This might work very well for a shell server, a desktop, or perhaps a server with a large memory footprint that forks real processes rather than using threads, but in our situation, this is very undesirable.

We run a pretty Java-heavy environment, with multiple large JVMs configured per host. The problem is that the heap sizes have been getting larger, and we were running in an overcommitted situation and did not realize it. The JVMs would all start up and malloc() their large heaps, and then at some later time, once enough of the heaps were actually used, the OOM killer would kick in and more or less randomly off one of our JVMs.

I found that Linux can be brought more in line with traditional/expected memory management by setting the sysctls: (Apparently these are available only on 2.6 kernels)

vm.overcommit_memory (0=default, 1=malloc always succeeds(?!?), 2=strict overcommit)
vm.overcommit_ratio (50=default, I used 100)

The ratio appears to be the percentage of the system's total VM that can be allocated via malloc() before malloc() fails. This MIGHT be on a per-pid basis (need to research). This number can be greater than 100%, presumably to allow for some slop in the copy-on-write fork()'s. When I set this to 100 on an 8GB system, I was able to malloc() about 7.5G of stuff, which seemed about right since I had normal multi-user processes running and no swap configured. I don't know why you'd want to use a number much less than 100, unless it were a per-process limit, or you wanted to force some saved room for fscache.
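One way to check how close a box is to the limit is to compare the CommitLimit and Committed_AS fields in /proc/meminfo. The sketch below is only an illustration: the function names are hypothetical, and the kernel enforces CommitLimit only when vm.overcommit_memory is 2.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain counts with no unit.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def commit_headroom_kb(meminfo):
    """kB that can still be committed before the strict-mode limit is hit."""
    return meminfo["CommitLimit"] - meminfo["Committed_AS"]

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("CommitLimit (kB):  ", info["CommitLimit"])
    print("Committed_AS (kB): ", info["Committed_AS"])
    print("Headroom (kB):     ", commit_headroom_kb(info))
```

A negative headroom on a box still in the default mode would suggest it is already overcommitted, which is the situation our JVM hosts were in without us knowing.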

The big benefit here is that malloc() can actually fail in a low memory condition. This means that the error can be caught and handled by the application. In my case, it means that JVMs fail at STARTUP time, with an obvious memory shortage related error in the logs, rather than having the process have the rug yanked out from under it hours or days later with no message in the application log, and no opportunity to clean up what it was doing.

Here are the demo programs with a Linux machine set to strict overcommit/100 ratio:

demo1: malloc memory and do not use it: Allocated 7.3GB, malloc fails.
demo2: malloc memory and use it right away: Allocated 7.3GB, malloc fails.



Unknown said...

See also this RedHat document on the topic.

To quote their description of overcommit_memory setting 2:
"The kernel fails requests for memory that add up to all of swap plus the percent of physical RAM specified in /proc/sys/vm/overcommit_ratio. This setting is best for those who desire less risk of memory overcommitment."

And from their coverage of overcommit_ratio:
"Specifies the percentage of physical RAM considered when /proc/sys/vm/overcommit_memory is set to 2. The default value is 50."

They also suggest that overcommit_memory should only be set to 2 on systems with large swap space. I suppose that's something to keep in mind if you're running an embedded system with no writeable hard drive, for instance. Even in that case, though, I would prefer my malloc()s return failure immediately rather than lead me to believe the memory is available.

Anonymous said...

Played about with this; it appears that overcommit_ratio is 'percentage of FREE physical memory', rather than total. Just thought I'd highlight this as it makes a difference to what you set the ratio to...

Craig said...

Fascinating. I had never really understood before what the OOM killer was all about. What an incredibly stupid idea. "Let's allow memory allocations we can't satisfy and then semi-randomly kill processes when the demand for actual pages exceeds what we can supply!" In other words, no matter how good my code is, no matter how carefully tested it is, it can still randomly die on Linux because Linux itself can't be trusted to honor its contracts with applications. That other post about passengers being ejected from airplanes has it right.

Anonymous said...

From the kernel docs: The CommitLimit is calculated with the following formula: CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap. For example, on a system with 1G of physical RAM and 7G of swap with a `vm.overcommit_ratio` of 30 it would yield a CommitLimit of 7.3G.
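The formula quoted above can be sketched as a few lines of arithmetic. This is a simplification that treats the ratio as a percentage and ignores any other adjustments the kernel may make (huge pages, for instance); commit_limit_kb and GB are names invented for this example.

```python
GB = 1024 * 1024  # kB per GB

def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """CommitLimit = (overcommit_ratio / 100) * physical RAM + swap, in kB."""
    return ram_kb * overcommit_ratio // 100 + swap_kb

if __name__ == "__main__":
    # The example from the kernel docs: 1G RAM, 7G swap, ratio 30.
    print(commit_limit_kb(1 * GB, 7 * GB, 30) / GB)
    # The machine from the post: 8GB RAM, no swap, ratio 100.
    print(commit_limit_kb(8 * GB, 0, 100) / GB)
    # Same machine with the default ratio of 50.
    print(commit_limit_kb(8 * GB, 0, 50) / GB)
```

With no swap and the default ratio of 50, only half of physical RAM can be committed in strict mode, which is why the ratio matters so much on swapless boxes.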

Christopher Smith said...

Craig: It is not nearly as stupid as you think, nor is it behaviour unique to Linux. Processes frequently ask for more virtual memory than they need *right now*, so without overcommit you get errors when none are warranted.

Let's take those giant JVMs. Imagine now that you fork off a child process to do something as basic as "chown". For a brief moment in time, the child will take up just as much virtual memory as the parent. In these tight memory scenarios, that would mean the fork would fail. Yet in practice the child only needs a few megabytes at most to run chown, and that is about to become very apparent once "exec()" runs. Overcommit allows the "chown" to succeed as one might hope it would.