I don't get the sense that there was any attempt to recover from errors - it sounds more like they were enforcing that error checking occurred, by replacing `malloc` with a version that always returned `NULL`. It sounds like the goal was to make sure that one didn't assume `malloc` would always succeed and just use the memory.
Indeed, recovery is basically futile in this case and your program is going to shut down pretty quickly either way. Maybe you'll get the chance to tell the user that you ran out of memory before you die, which seems polite.
In systems that overcommit memory (like Linux), malloc() can return non-NULL and then crash when you read or write that address because the system doesn't have enough real memory to back that virtual address.
Even on Linux when it's set to overcommit, malloc() can still return NULL if you exhaust your process's virtual address space, though I expect that's much less likely now on 64-bit platforms.
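One way to make overcommit bite at allocation time instead of at some arbitrary later access is to touch every page right after allocating. A rough sketch; `malloc_committed` is an invented name, and the hard-coded page size is an assumption (real code would use `sysconf(_SC_PAGESIZE)`):

```c
#include <stdlib.h>

/* Allocate and immediately write to every page so the kernel has to back
 * the memory now.  Under overcommit, plain malloc() can "succeed" and the
 * process only dies when the pages are first touched; this moves that
 * failure to the allocation site.  (It does not make you immune to the
 * OOM killer.) */
static void *malloc_committed(size_t size)
{
    unsigned char *p = malloc(size);
    if (p == NULL)
        return NULL;                    /* address space exhausted */
    for (size_t off = 0; off < size; off += 4096)
        p[off] = 0;                     /* force a real page behind it */
    if (size > 0)
        p[size - 1] = 0;                /* don't miss a trailing partial page */
    return p;
}
```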
Yes, a better option is to make sure this error cannot happen, by making sure the program has enough memory to begin with. Fly-by-wire shouldn't need unbounded memory allocations at runtime.
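The usual shape of "make sure the program has enough memory to begin with" is a fixed arena sized at design time, handed out once at startup and never grown. A minimal sketch; the names and the 64 KiB budget are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (64 * 1024)      /* design-time budget, invented here */

static uint8_t arena[ARENA_SIZE];
static size_t  arena_used;

/* Bump allocator over a static buffer.  Returns NULL only if the
 * design-time budget was wrong - something you catch in testing on the
 * ground, not at runtime in flight. */
static void *arena_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;       /* keep 8-byte alignment */
    if (n > ARENA_SIZE - arena_used)
        return NULL;
    void *p = &arena[arena_used];
    arena_used += n;
    return p;
}
```

Note there's no `arena_free` at all: everything is allocated during init, so the failure mode simply cannot occur later.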
There are some applications where you can try to recover by freeing something that isn't critical, or by waiting and trying again. Or you can gracefully fail whatever computation is going on right now, without aborting the entire program. But these are last resort things and will not always save you. If your fly-by-wire ever depends on such a last resort, it's broken by design :-)
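The "free something non-critical and retry" idea looks roughly like this. Purely a sketch: `drop_noncritical_cache` is a placeholder for whatever your application can actually shed, and as said above, this is a last resort, not a plan:

```c
#include <stdlib.h>

static int cache_present = 1;

/* Placeholder: in a real app this would release a cache, a preview
 * buffer, anything the program can live without. */
static void drop_noncritical_cache(void)
{
    cache_present = 0;
}

/* On failure, shed the non-critical stuff and retry exactly once;
 * if that also fails, report upward rather than looping forever. */
static void *malloc_with_fallback(size_t n)
{
    void *p = malloc(n);
    if (p == NULL && cache_present) {
        drop_noncritical_cache();
        p = malloc(n);
    }
    return p;
}
```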
From what I understand, in those sort of absolutely critical applications the standard is to design software that fails hard, fast, and safe. You don't want your fly-by-wire computer operating in an abnormal state for any amount of time, you want that system to click off and you want the other backup systems to come online immediately.
The computer in the Space Shuttle was actually 5 computers, 4 of them running in lockstep and able to vote out malfunctioning systems. The fifth ran an independent implementation of much of the same functionality. If there was a software fault with the 4 main computers, they wanted everything to fail as fast as possible so that they could switch to the 5th system.
Tangent: I was thinking about Toyota's software process failure and how they _invented_ industrial-level mistake proofing yet did not apply it to their engine throttle code.
C is obviously the wrong language, but from a software perspective they should have at least tested the engine controller from an adversarial standpoint (salt water on the board, stuck sensors). That is the crappy thing about Harvard architecture CPUs (separate instruction and data memory): you can have while loops that NEVER crash controlling a machine that continues to wreak havoc. Sometimes you want a hard reset and a fast recovery.
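The standard answer to "sometimes you want a hard reset" is a watchdog timer: the main loop must pet it every iteration, and if a loop wedges, the timer expires and hardware yanks the reset line. A software simulation of the idea (a real one is a hardware countdown register, and the timeout value here is invented):

```c
/* Simulated watchdog: counts timer ticks since the last pet.  On real
 * hardware the expiry would force a CPU reset; here we just report it. */
static int wdt_counter;
#define WDT_TIMEOUT 1000    /* ticks before a reset fires; made-up value */

static void watchdog_pet(void)
{
    wdt_counter = 0;        /* the healthy main loop calls this each pass */
}

/* Called from a periodic timer interrupt; returns 1 when the reset
 * would fire, i.e. when the main loop has stopped petting. */
static int watchdog_tick(void)
{
    return ++wdt_counter > WDT_TIMEOUT;
}
```

The key property is that the watchdog fires on a *wedged* loop too, not just a crashed one - exactly the "while loop that never crashes" case above.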
I wasn't trying to nitpick. Correcting the example: yes, recovering from a malloc failure _could_ be a worthy goal, but on Linux, by the time your app is getting signaled about malloc failures, the OOM killer is already playing little bunny foo foo with your processes.
If your app can operate under different allocation regimes then there should be side channels for dynamically adjusting the memory usage at runtime. On Linux, failed malloc is not that signal and since _so many_ libraries and language runtime semantics allocate memory, making sure you allocate no memory in your bad-malloc path is very difficult.
Like eliteraspberrie said, the proper way to recover from an error is to unroll your stack back to your main function and return 1 there.
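In C that unrolling looks like every layer checking, cleaning up its own resources, and propagating failure upward, with only the top level choosing the exit code. A sketch of the pattern; the file path is invented, and `app_main` stands in for the real `main()` (which would just `return app_main();`):

```c
#include <stdio.h>

/* Each layer reports failure instead of aborting, so callers can
 * clean up on the way back out. */
static int load_config(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;          /* propagate; don't exit() from down here */
    /* ... parse ... */
    fclose(f);
    return 0;
}

/* The one place that decides the process exit status. */
static int app_main(void)
{
    if (load_config("/nonexistent/app.conf") != 0) {
        fprintf(stderr, "fatal: cannot load configuration\n");
        return 1;
    }
    return 0;
}
```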
Error checking was enforced for EVERY syscall, be it malloc() or open(). Checking for errors was indeed required but not enough: a proper and graceful shutdown was required too.
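A common way to get "checked everywhere, graceful shutdown on failure" without littering every call site is a wrapper. A sketch in that spirit - `xmalloc` is a conventional name, not something from the original post, and the shutdown hook is a placeholder:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

/* Either returns a valid pointer or reports the failure and shuts down
 * cleanly; callers never see NULL, so an unchecked use is impossible. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes): %s\n",
                n, strerror(errno));
        /* graceful-shutdown hook would go here: flush logs, sync state */
        exit(EXIT_FAILURE);
    }
    return p;
}
```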