Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> Bitflips are waaay more common than you think they are... Over the course of about seven months, 52,317 requests...

Your data does not show them to be common - less than 1 in 100,000 computing devices seeing an issue during a 7 month test qualifies as "rare" in my book (and in fact the vast majority of those events seem to come from a small number of server failures).

And we know from Google's datacenter research[0] that bit flips are highly correlated hard failures (i.e. they tend to result from a faulty DRAM module, and so affect a small number of machines repeatedly).

It's hard to pin down numbers for soft failures, but it seems to be somewhere in the realm of 100 events/gigabyte/year - and that's before any of the many ECC mechanisms do their thing. In practical sense, no consumer software worries about bit flips in RAM (whereas bit flips in storage are much more likely, hence checksumming DB rows, etc).

[0]: https://static.googleusercontent.com/media/research.google.c...



1 in 100,000 devices is about 1 in about 40,000 customers due to how many devices most people own.

Which means if you're about medium business or above, one of your customers will see this about once a year.

That classifies more as "inevitable" than "rare" in my book.


> That classifies more as "inevitable" than "rare" in my book.

But also pretty much insignificant. Is any other component in your product achieving 5 9s reliability?


We're not talking 5 9s, here.

> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.

The guarantees offered by our hardware suppliers today, is not "never happens" but "accounted for in software".

So, if you ignore it, and start to operate at any scale, you will start to see random irreproducible faults.

Sure, you can close all tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software has less glitches than the competitor.


> We're not talking 5 9s, here.

1 in 40,000 customer devices experiencing a failure annually is considerable better than 4 9s of reliability. So we are debating whether going from 4 9s to 5 9s is worth it.

And like, sure, if the rest of your stack is sufficiently polished (and your scale is sufficiently large) that the once-a-year bit flip event becomes a meaningful problem... then by all means do something about it.

But I maintain that the vast majority of software developers will never actually reach that point, and there are a lot of lower-hanging fruit on the reliability tree




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: