> Bitflips are waaay more common than you think they are... Over the course of a...

shakna · 2026-01-01T14:02:07 1767276127

1 in 100,000 devices is about 1 in about 40,000 customers due to how many devices most people own.

Which means if you're about medium business or above, one of your customers will see this about once a year.

That classifies more as "inevitable" than "rare" in my book.

swiftcoder · 2026-01-01T15:13:15 1767280395

> That classifies more as "inevitable" than "rare" in my book.

But also pretty much insignificant. Is any other component in your product achieving 5 9s reliability?

shakna · 2026-01-01T15:58:27 1767283107

We're not talking 5 9s, here.

> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.

The guarantees offered by our hardware suppliers today, is not "never happens" but "accounted for in software".

So, if you ignore it, and start to operate at any scale, you will start to see random irreproducible faults.

Sure, you can close all tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software has less glitches than the competitor.

swiftcoder · 2026-01-01T17:07:37 1767287257

> We're not talking 5 9s, here.

1 in 40,000 customer devices experiencing a failure annually is considerable better than 4 9s of reliability. So we are debating whether going from 4 9s to 5 9s is worth it.

And like, sure, if the rest of your stack is sufficiently polished (and your scale is sufficiently large) that the once-a-year bit flip event becomes a meaningful problem... then by all means do something about it.

But I maintain that the vast majority of software developers will never actually reach that point, and there are a lot of lower-hanging fruit on the reliability tree