
We once had a new server with all new hardware which had weird problems and kept crashing mysteriously. Memory tests showed no errors, so we were all tearing our hair out. We took the server offline and set it to test continuously – still no errors. Only after running Memtest86 on nothing but test #4 for about a day or so did a few memory errors show up. Replaced memory, problem gone, server started working.

Memory errors are especially insidious given how common they are. ECC is worth it.




I remember circa 1999 having a database server which had a stuck bit in memory. The bit happened to be placed in the page cache, so it subtly corrupted disk writes resulting in the database throwing checksum errors. It took an insane amount of time to even diagnose where the problem could be. We of course thought it was the disks themselves and tried many variations of disks and external RAID cards. Finally, one run of memtest86 found the real problem, and I threw away the memory and motherboard and replaced it with one capable of ECC RAM.

I forget now why we even thought to build a server without ECC RAM, but I sure learned my lesson after that.


i wouldn't even call a machine without ecc a server or workstation. more like a consumer device that's been given a job it can't do.


This was many, maybe more than 10, years ago.



