Silent Data Corruption
A brief introduction to the reality of hardware errors in large-scale computing, and to breaking the myth of fail-stop computing.

Are you sitting uncomfortably? No? Let me fix that for you. Let’s talk about GPU failures.
I don’t want to talk about the easy stuff either: the kinds of failures that have you reaching for your checkpointing, or treating your on-prem GPU machines like low-priority cloud VMs, as prone to eviction as a low-income tenant with Scrooge McDuck for a landlord.
No, instead let’s talk about the stuff of nightmares. Let’s talk about silent data corruption (SDC).
You know, the kinds of errors that neither hardware nor software thinks are errors at all; everything keeps happily churning away but hands you a nice steaming pile of garbage as output. No, I don’t mean your LLMs 😁.
SDC was always a possibility. We know that transistors aren’t perfect, and the idea of fail-stop computing, whilst comforting, was always a bit of a myth. Mostly, though, we’d put this down to cosmic rays, install some ECC RAM, pray to the gods of lithography, and go about our merry way.
Oh, yeah, that does mean it affects CPUs too. This isn’t a GPU-only problem, though anecdotally at least it does seem to be a lot more prevalent on GPUs. The hook was just to get your attention. Sorry!
If you made it this far but haven’t a monkey’s what I’m talking about, let me give you an example lifted straight from Meta’s paper, Silent Data Corruptions at Scale.
You compute 1.1^52 as an integer and it works fine; you compute 1.1^53 as an integer on the exact same core and you get 0. No errors. Nothing to indicate something went wrong. Just the wrong answer.
Yes, you read that right: on a particular core/CPU/GPU, one code path will start giving you the wrong answer, but only with specific input data. No errors. Just the wrong output.
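To make that concrete, here’s a minimal sketch of the kind of known-answer check that catches this class of defect. It’s in Python rather than the Scala/Spark pipeline Meta described, and the structure is my own illustration, not their test harness:

```python
import math

def power_known_answer_test() -> bool:
    """On a healthy core, int(1.1**53) is 156 (1.1**53 is about 156.25).
    Meta reported a defective core that silently returned 0 here."""
    expected = 156
    actual = int(math.pow(1.1, 53))
    return actual == expected

if __name__ == "__main__":
    if power_known_answer_test():
        print("Core passed this (single) known-answer test.")
    else:
        # No exception, no machine-check error, just a wrong value:
        # that's what makes the corruption "silent".
        print("SDC suspected on this core!")
```

The unnerving part is that passing this test tells you almost nothing: the defect only shows up for specific instruction-and-data combinations, so you’d need a large battery of such tests, pinned to each individual core, to build any real confidence.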
If you work on anything critical at scale and this isn’t giving you nightmares, please come and tell me how you solved it!
We embraced software failure as a part of life when we moved to operating at scale, simply because even at low probabilities of failure, the sheer volume of software being run meant that dealing with failure had to become the norm, not the exception.
The same needs to be true for dealing not only with silent data corruption but with all errors at the hardware level. What used to be a once-in-a-lifetime event when running your single-core Xeon in 2005 now happens frequently on your multi-thousand-GPU supercomputer.
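To put some illustrative numbers on that (the rates below are assumptions I’ve picked purely for the arithmetic, not measured figures): if each device independently hits an SDC with probability p over some window, a fleet of N devices stays clean with probability (1 − p)^N, and that shrinks fast as N grows.

```python
# Illustrative arithmetic only: p_device is an assumed rate, not a measurement.
def p_fleet_hit(p_device: float, n_devices: int) -> float:
    """Probability that at least one device misbehaves, assuming independence."""
    return 1 - (1 - p_device) ** n_devices

# A hypothetical 1-in-100,000 chance per device per day is invisible on one
# box, but across a 10,000-GPU cluster it's nearly a coin flip per week.
print(f"{p_fleet_hit(1e-5, 1):.4%}")           # one machine/day:  0.0010%
print(f"{p_fleet_hit(1e-5, 10_000):.2%}")      # 10k GPUs/day:    ~9.52%
print(f"{p_fleet_hit(7e-5, 10_000):.2%}")      # 10k GPUs/week:  ~50.34%
```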
There is ample evidence not only that manufacturers’ QA checks will not capture all defects, but also that defects can emerge later in life. This may be early on, within months, due to manufacturing imperfections, or much later due to age-related wear. Yep, believe it or not, transistors wear out too, and the probability of SDCs goes up as your hardware gets older.
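Which is why one-time burn-in testing can’t be the whole answer. One mitigation (this is my own hedged sketch of the general idea of in-production screening, not any vendor’s actual tooling) is to keep re-running short known-answer test batches on machines throughout their life:

```python
import math
import random
import time

# Hypothetical known-answer tests: (description, function, expected) triples.
# Real fleet screening uses large vendor test suites; these stand-ins just
# exercise the FPU with inputs whose correct answers are known in advance.
KNOWN_ANSWER_TESTS = [
    ("int(1.1**53)", lambda: int(math.pow(1.1, 53)), 156),
    ("sqrt(2)**2", lambda: round(math.sqrt(2.0) ** 2, 9), 2.0),
]

def opportunistic_scan() -> list[str]:
    """Run one short test batch; return descriptions of any failures.
    Intended to run periodically on idle cores, not as a one-off QA gate."""
    failures = []
    for name, test, expected in KNOWN_ANSWER_TESTS:
        actual = test()
        if actual != expected:
            failures.append(f"{name}: expected {expected!r}, got {actual!r}")
    return failures

if __name__ == "__main__":
    while True:
        if failures := opportunistic_scan():
            # In a real fleet you'd drain and quarantine the host for deeper
            # offline diagnosis rather than just printing a warning.
            print("SDC suspected:", failures)
        time.sleep(random.uniform(300, 900))  # re-test every 5-15 minutes
```

Because defects can be data-dependent and core-specific, real implementations pin tests to individual cores and rotate through a far larger test space than this; the point is the cadence over the hardware’s lifetime, not the two toy tests.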
What I find in equal parts strange and frustrating is the relative secrecy and lack of conversation around this topic. Sure, there are a few good papers out there, but if you actually wanted to implement a solution, there’s precious little information in them. Want to know what the failure rates are so you can make an informed risk decision? Go fish.
As much out of curiosity as necessity, I think this is something I’m going to end up digging into further, but if you’d like to start reading up on it yourself, here are a couple of interesting starting points.
Understanding GPU Memory Corruption at Extreme Scale