I don't get the sense that there was any attempt to recover from errors - it sounds more like they were enforcing that error checking occurred, by replacing `malloc` with a version that always returned `NULL`. It sounds like the goal was to make sure that one didn't assume `malloc` would always succeed and just use the memory.
Indeed, recovery is basically futile in this case and your program is going to shut down pretty quickly either way. Maybe you'll get the chance to tell the user that you ran out of memory before you die, which seems polite.
In systems that overcommit memory (like Linux), malloc() can return non-NULL and then crash when you read or write that address because the system doesn't have enough real memory to back that virtual address.
Even on Linux when it's set to overcommit, malloc() can still return NULL if you exhaust your process's virtual address space, though I expect that's much less likely now on 64-bit platforms.
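One way to make overcommit bite at allocation time instead of at some arbitrary later access is to touch every page right after allocating. A rough sketch; `malloc_committed` is an invented name, and the hard-coded page size is an assumption (real code would use `sysconf(_SC_PAGESIZE)`):

```c
#include <stdlib.h>

/* Allocate and immediately write to every page so the kernel has to back
 * the memory now.  Under overcommit, plain malloc() can "succeed" and the
 * process only dies when the pages are first touched; this moves that
 * failure to the allocation site.  (It does not make you immune to the
 * OOM killer.) */
static void *malloc_committed(size_t size)
{
    unsigned char *p = malloc(size);
    if (p == NULL)
        return NULL;                    /* address space exhausted */
    for (size_t off = 0; off < size; off += 4096)
        p[off] = 0;                     /* force a real page behind it */
    if (size > 0)
        p[size - 1] = 0;                /* don't miss a trailing partial page */
    return p;
}
```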
Yes, a better option is to make sure this error cannot happen, by making sure the program has enough memory to begin with. Fly-by-wire shouldn't need unbounded memory allocations at runtime.
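The usual shape of "make sure the program has enough memory to begin with" is a fixed arena sized at design time, handed out once at startup and never grown. A minimal sketch; the names and the 64 KiB budget are invented for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (64 * 1024)      /* design-time budget, invented here */

static uint8_t arena[ARENA_SIZE];
static size_t  arena_used;

/* Bump allocator over a static buffer.  Returns NULL only if the
 * design-time budget was wrong - something you catch in testing on the
 * ground, not at runtime in flight. */
static void *arena_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;       /* keep 8-byte alignment */
    if (n > ARENA_SIZE - arena_used)
        return NULL;
    void *p = &arena[arena_used];
    arena_used += n;
    return p;
}
```

Note there's no `arena_free` at all: everything is allocated during init, so the failure mode simply cannot occur later.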
There are some applications where you can try to recover by freeing something that isn't critical, or by waiting and trying again. Or you can gracefully fail whatever computation is going on right now, without aborting the entire program. But these are last resort things and will not always save you. If your fly-by-wire ever depends on such a last resort, it's broken by design :-)
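The "free something non-critical and retry" idea looks roughly like this. Purely a sketch: `drop_noncritical_cache` is a placeholder for whatever your application can actually shed, and as said above, this is a last resort, not a plan:

```c
#include <stdlib.h>

static int cache_present = 1;

/* Placeholder: in a real app this would release a cache, a preview
 * buffer, anything the program can live without. */
static void drop_noncritical_cache(void)
{
    cache_present = 0;
}

/* On failure, shed the non-critical stuff and retry exactly once;
 * if that also fails, report upward rather than looping forever. */
static void *malloc_with_fallback(size_t n)
{
    void *p = malloc(n);
    if (p == NULL && cache_present) {
        drop_noncritical_cache();
        p = malloc(n);
    }
    return p;
}
```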
From what I understand, in those sort of absolutely critical applications the standard is to design software that fails hard, fast, and safe. You don't want your fly-by-wire computer operating in an abnormal state for any amount of time, you want that system to click off and you want the other backup systems to come online immediately.
The computer in the Space Shuttle was actually 5 computers, 4 of them running in lockstep and able to vote out malfunctioning systems. The fifth ran an independent implementation of much of the same functionality. If there was a software fault with the 4 main computers, they wanted everything to fail as fast as possible so that they could switch to the 5th system.
Tangent: I was thinking about Toyota's software process failure and how they _invented_ industrial-level mistake proofing yet did not apply it to their engine throttle code.
C is obviously the wrong language, but from a software perspective they should have at least tested the engine controller from an adversarial standpoint (salt water on the board, stuck sensors). That is the crappy thing about Harvard architecture CPUs (separate instruction and data memory): you can have while loops that NEVER crash controlling a machine that continues to wreak havoc. Sometimes you want a hard reset and a fast recovery.
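The standard answer to "sometimes you want a hard reset" is a watchdog timer: the main loop must pet it every iteration, and if a loop wedges, the timer expires and hardware yanks the reset line. A software simulation of the idea (a real one is a hardware countdown register, and the timeout value here is invented):

```c
/* Simulated watchdog: counts timer ticks since the last pet.  On real
 * hardware the expiry would force a CPU reset; here we just report it. */
static int wdt_counter;
#define WDT_TIMEOUT 1000    /* ticks before a reset fires; made-up value */

static void watchdog_pet(void)
{
    wdt_counter = 0;        /* the healthy main loop calls this each pass */
}

/* Called from a periodic timer interrupt; returns 1 when the reset
 * would fire, i.e. when the main loop has stopped petting. */
static int watchdog_tick(void)
{
    return ++wdt_counter > WDT_TIMEOUT;
}
```

The key property is that the watchdog fires on a *wedged* loop too, not just a crashed one - exactly the "while loop that never crashes" case above.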
I wasn't trying to nitpick. Correcting the example: yes, recovering from a malloc failure _could_ be a worthy goal, but on Linux, by the time your app is getting signaled about malloc failures, the OOM killer is already playing little bunny foo foo with your processes.
If your app can operate under different allocation regimes then there should be side channels for dynamically adjusting the memory usage at runtime. On Linux, failed malloc is not that signal and since _so many_ libraries and language runtime semantics allocate memory, making sure you allocate no memory in your bad-malloc path is very difficult.
Like eliteraspberrie said, the proper way to recover from an error is to unroll your stack back to your main function and return 1 there.
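In C that unrolling looks like every layer checking, cleaning up its own resources, and propagating failure upward, with only the top level choosing the exit code. A sketch of the pattern; the file path is invented, and `app_main` stands in for the real `main()` (which would just `return app_main();`):

```c
#include <stdio.h>

/* Each layer reports failure instead of aborting, so callers can
 * clean up on the way back out. */
static int load_config(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;          /* propagate; don't exit() from down here */
    /* ... parse ... */
    fclose(f);
    return 0;
}

/* The one place that decides the process exit status. */
static int app_main(void)
{
    if (load_config("/nonexistent/app.conf") != 0) {
        fprintf(stderr, "fatal: cannot load configuration\n");
        return 1;
    }
    return 0;
}
```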
Error checking was enforced for EVERY syscall, be it malloc() or open(). Checking for errors was indeed required but not enough: a proper and graceful shutdown was required too.
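A common way to get "checked everywhere, graceful shutdown on failure" without littering every call site is a wrapper. A sketch in that spirit - `xmalloc` is a conventional name, not something from the original post, and the shutdown hook is a placeholder:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

/* Either returns a valid pointer or reports the failure and shuts down
 * cleanly; callers never see NULL, so an unchecked use is impossible. */
static void *xmalloc(size_t n)
{
    void *p = malloc(n);
    if (p == NULL) {
        fprintf(stderr, "out of memory (%zu bytes): %s\n",
                n, strerror(errno));
        /* graceful-shutdown hook would go here: flush logs, sync state */
        exit(EXIT_FAILURE);
    }
    return p;
}
```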