We did a lot of work in the BIOS of our machines last week. Doing BIOS edits by hand sucks, there's got to be a better way than connecting the keyboard and monitor to 100+ machines, rebooting them, and changing settings by hand.
Ben dug up some informaiton in a Sun v20z configuration guide, which seems to share a similar BIOS to the machines we have. We had ECC memory enabled in the BIOS before, but we seemed to have a lot of machines dying and hanging. We changed the following additional settings:
ECC Logging: [Enabled]
ECC Chipkill: [Enabled]
ECC Scrubbing: [163 us]
L2 Scrubbing: [10.2 us]
L1 Scrubbing: [5.12 us]
Also we upgraded to a later Linux kernel; version 126.96.36.199, and added the EDAC module. The k8/opteron code was not in the mainline code but is buildable as a module.
It's hard to declare a success based on the absence of machines hanging, but we've seen the EDAC catch some errors and haven't had a memory related machine hang yet.
Edit: A couple weeks later and the failure rate is way down.