
This is very interesting. However, the OP's idea of "failure" is when something goes wrong with the system -- in this case, ext3 detected an error when trying to write to its journal. It's possible that silent data corruption occurred much earlier. If the OP reads this, I'd love to see the experiment repeated, with the added step of verifying the written data every time it's written.

That is, you would dd a random bit pattern to disk. Then you would read it back and XOR it with the pristine pattern you know you wrote. If you see any difference, you've got a much scarier kind of failure than the fail-stop behavior exhibited in the post.
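A rough sketch of that write-then-verify step, in Python rather than dd. The function names `xor_diff` and `write_and_verify`, the file path, and the 1 MiB default size are all made up for illustration. One caveat: a plain read-back like this may be served from the OS page cache rather than the flash itself, so a real test would need to bypass or drop the cache first.

```python
import os

def xor_diff(expected: bytes, actual: bytes) -> list[int]:
    """Return the byte offsets where expected XOR actual is nonzero."""
    return [i for i, (a, b) in enumerate(zip(expected, actual)) if a ^ b]

def write_and_verify(path: str, size: int = 1 << 20):
    """Write a random pattern, read it back, and compare (hypothetical helper)."""
    pattern = os.urandom(size)          # the pristine pattern we know we wrote
    with open(path, "wb") as f:
        f.write(pattern)
        f.flush()
        os.fsync(f.fileno())            # ask the OS to push the data to the device
    with open(path, "rb") as f:
        readback = f.read()             # may come from page cache, see caveat above
    bad_offsets = xor_diff(pattern, readback)
    ok = len(readback) == len(pattern) and not bad_offsets
    return ok, bad_offsets
```

Any nonzero XOR byte (or a length mismatch) would be the silent kind of corruption, as opposed to the fail-stop error ext3 reported.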

Also, does the OP know whether any write buffering is going on? That could explain why some writes took much longer than others -- when the write actually goes to disk, it would take much longer than when it gets cached in memory. Use the "sync" command between each write to make everything go right to disk (I think?).
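For what it's worth, here is roughly what "sync between each write" could look like from a script, timing each write including the flush. This is only a sketch: `timed_synced_writes` is a hypothetical helper, and `os.fsync` only flushes the OS-level buffers for that file, so it says nothing about caching inside the drive's own controller.

```python
import os
import time

def timed_synced_writes(path: str, chunks: list[bytes]) -> list[float]:
    """Write each chunk, fsync after it, and return per-chunk durations in seconds."""
    timings = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            start = time.perf_counter()
            view = memoryview(chunk)
            while len(view) > 0:
                written = os.write(fd, view)   # os.write may write fewer bytes than asked
                view = view[written:]
            os.fsync(fd)                       # flush this file's buffered data to the device
            timings.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return timings
```

If the per-write times still vary a lot with the flush included, OS write buffering alone probably isn't the explanation.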




> It's possible that silent data corruption occurred much earlier.

That is possible, but since flash uses ECC the controller should notice corruption.

> Also, does the OP know whether any write buffering is going on?

There shouldn't be, since he's already using O_DIRECT.

> why some writes took much longer than others

That's probably caused by erasing and data copying, since one erase is required for every ~1MB of data written.

If y'all are really interested in this topic, there are some good academic papers.

http://nvsl.ucsd.edu/ftest/


Flash drives use ECC precisely so that they can detect when memory cells are nearing the end of their useful writable lifetime before the data is unreadable.



