Bit errors are still not a myth

I've had this post in draft for a long time now, but the topic is evergreen. Just consider the recent exploits targeting RAM, such as Rowhammer. Do note, though, that ECC by itself can't give 100% protection against Rowhammer, so it is only loosely related to this post.

An excellent empirical study was published in 2009: DRAM Errors in the Wild: A Large-Scale Field Study.

"About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year." - analysis by James Hamilton

"Besides error rates much higher than expected - which is plenty bad - the study found that error rates were motherboard, not DIMM type or vendor, dependent. This means that some popular mobos have poor EMI hygiene." - analysis by ZDNet

Note that such an event plainly translates to faults on non-ECC machines.
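To see what ECC actually buys you, here is a toy Hamming(7,4) single-error-correcting code in Python. Real ECC DIMMs use wider SECDED codes (typically 72 bits protecting 64), but the principle is the same; all function names below are illustrative, not from any library.

```python
# Toy Hamming(7,4): 4 data bits protected by 3 parity bits at
# positions 1, 2 and 4. A single flipped bit can be located and fixed.

def hamming_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming_correct(c):
    """Return (corrected codeword, 1-based error position or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3     # the syndrome spells out the bad position
    if pos:
        c[pos - 1] ^= 1            # flip the bad bit back
    return c, pos

code = hamming_encode([1, 0, 1, 1])
corrupted = list(code)
corrupted[4] ^= 1                  # simulate a single bit flip in "RAM"
fixed, pos = hamming_correct(corrupted)
print(fixed == code, pos)          # True 5
```

Without the parity bits - that is, on a non-ECC machine - the same flip would simply be wrong data from then on.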

So although human error is much more probable, it is not at all unlikely that you will encounter bit errors in everyday life. Consider the following key points of failure:

  1. You type your phone number into a web site on your computer. The data gets shuffled along various paths: from the keyboard (PS/2 parity is not safe at all) to the operating system's character buffers, through library layers, APIs, and interpreters, and into an output packet buffer. RAM corruption is not probable, but it is possible. Let's assume the data is protected by a checksum once it leaves for the network.
  2. Your home wireless router picks up the signal, verifies the checksum(s) and hopefully drops the packet on error. Depending on the exact implementation, it then stores and shuffles your packet around within its multi-megabyte packet/frame buffer and forwards it later, depending on QoS and other factors. RAM corruption is a bit more probable here. I wouldn't be surprised at all if these cheap embedded devices had a lower standard of quality control than desktop computers. How many factories run memtest for hours on units costing $10? Your packet is at a minimum rewritten because of NAT, so a new checksum will be calculated before it is sent out.
  3. Your home router is connected to customer-premises equipment, such as a modem, to reach the Internet. These embedded devices do very similar reframing as described in the previous point.
  4. The same can be said about the server-side CPE/router.
  5. When the budget non-ECC server receives your request, the data is again shuffled around in its memory between packet buffers, interpreters, the database engine and the disk cache before reaching nonvolatile storage. This is usually a highly loaded machine, so the incidence of bit errors is higher.
  6. Operating systems like to cache a lot, with good reason: Linux started using most of the free RAM as a page cache a long time ago. If any single bit or byte is touched, a whole block will need to be rewritten to disk later on, and write-back caches can hold onto dirty pages for quite some time. This happens a lot with the complicated, high-performance data structures used in databases and file systems, especially when running one on top of the other. The size of a block can be influenced at the OS level, but it is usually anywhere from a few kB to a few MB, depending on use case and tuning; it must be at least as large as a sector (usually 4096 bytes, previously 512 bytes).
  7. For SSDs, modifying a single sector involves shuffling around data as large as an erase (multi-)block, which can be multiple megabytes in size. All this shuffling boils down to more read-modify-write cycles in the internal DRAM. Fortunately, I would expect the data to be stored in the buffers with its original ECC intact for performance reasons, so a bit error should be detected on read-back - unless an error had to be corrected, which warrants some more in-memory shuffling and checksum recomputation. Note that this analysis usually does not apply to most hardware caches, because they use SRAM.
  8. You are out of luck if your nonvolatile storage system periodically performs any of data scrubbing, static wear leveling or defragmentation. Each of these mechanisms reads large chunks of data into (embedded) DRAM and then (potentially) rewrites (parts of) them. A new checksum is recalculated in hardware on each write.
  9. Of course at a later date, you will need to get back the data or pass it along an API, which repeats all the above steps in reverse.
  10. Note that nonvolatile storage devices deserve an article of their own, because data corruption can occasionally happen with any kind of technology. I have personally witnessed data corruption on each of the following: 64 MB-4 GB USB flash keychain drives, 80 MB-80 GB (1 TB?) HDDs and CD-ROMs (I guess we've all seen corruption on floppies and tapes).
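The checksum-recalculation hazard in steps 2-4 can be sketched in a few lines of Python. The RFC 1071 Internet checksum below is the real algorithm used by IP/TCP/UDP; the payload and the flipped bit are made-up illustrations.

```python
# Why hop-by-hop checksum recomputation (e.g. at a NAT box) can launder
# a bit flip: the checksum is recomputed over the already-corrupted
# payload, so every downstream verification passes.

def inet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum as used by IP/TCP/UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

payload = bytearray(b"phone=555-0123")
original_sum = inet_checksum(payload)

# A bit flips while the packet sits in the router's DRAM buffer...
payload[6] ^= 0x01          # first '5' becomes '4'

# ...and then NAT rewrites the packet and recomputes the checksum.
rewritten_sum = inet_checksum(payload)

print(original_sum == inet_checksum(payload))   # False - the old sum would catch it
print(rewritten_sum == inet_checksum(payload))  # True  - the new sum hides it
print(payload.decode())                         # phone=455-0123
```

This is exactly why end-to-end integrity checks (application-level hashes) matter: any hop that recomputes a checksum resets the protection window.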

I've listed quite a few potential places for error, and I haven't even considered erroneous CPU computation due to heat or interference, concentrating instead only on temporary storage in memory. As you can see, computers can indeed be mistaken, and in fact they are quite often.
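A quick back-of-the-envelope check shows the quoted figures are self-consistent. The 8% per-DIMM rate comes from the study; the fleet size and DIMMs-per-machine below are made-up assumptions, and the independence of errors across DIMMs is itself an approximation.

```python
# Rough consistency check of the quoted study figures.

p_dimm_year = 0.08     # P(a DIMM sees >= 1 correctable error per year), from the study
dimms_per_box = 4      # assumed
machines = 1000        # assumed small fleet

# P(a machine sees an error) = 1 - P(all its DIMMs are clean),
# assuming independence between DIMMs.
p_machine_hit = 1 - (1 - p_dimm_year) ** dimms_per_box
expected_machines = machines * p_machine_hit
print(f"{p_machine_hit:.1%} of machines, ~{expected_machines:.0f} out of {machines}")
# -> 28.4% of machines, ~284 out of 1000
```

With 4 DIMMs per box this lands close to the "about a third of machines" figure from the quote above, which is reassuring.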
