Crash-only programs crash safely and recover quickly. There is only one way to stop such software -- by crashing it -- and only one way to bring it up -- by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users. In this paper we advocate a crash-only design for Internet systems, showing that it can lead to more reliable, predictable code and faster, more effective recovery. We present ideas on how to build such crash-only Internet services, taking successful techniques to their logical extreme.
Sent to me by Aahz, who found the link on a Python mailing list. Intriguing, but it really only applies at the lower end of the high-reliability spectrum. Before I'd trust software with my life while passing a truck on a two-lane road with oncoming traffic, I'd expect a proof of correctness on some little single-tasking microcontroller, with all of its state in either registers or ROM.
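For anyone who wants the abstract's idea in concrete form, here is a minimal Python sketch of a crash-only component with a transparent retry wrapper. It is not code from the paper. The names `Counter`, `ComponentCrash`, and `call_with_retry` are hypothetical, and the sketch assumes that all durable state fits in one small JSON file that can be replaced atomically. The point it tries to show is that there is no shutdown path at all: construction is always recovery, and a crash is handled by discarding the instance and recovering a new one.

```python
import json
import os
import tempfile


class ComponentCrash(Exception):
    """Stands in for a component dying abruptly (hypothetical name)."""


class Counter:
    """Toy crash-only component: all durable state lives in one file."""

    def __init__(self, path):
        self.path = path
        self.value = self._recover()  # starting up *is* recovery

    def _recover(self):
        # The only start-up path: rebuild in-memory state from disk.
        try:
            with open(self.path) as f:
                return json.load(f)["value"]
        except FileNotFoundError:
            return 0  # first run, nothing committed yet

    def increment(self, fail=False):
        if fail:
            # Simulate a crash before anything has been committed.
            raise ComponentCrash("simulated crash")
        new_value = self.value + 1
        # Commit atomically: write a temp file, then rename it into place.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"value": new_value}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        self.value = new_value
        return new_value


def call_with_retry(path, failures, attempts=3):
    """Retry an increment, restarting the component after each crash.

    `failures` is how many leading attempts should crash (for the demo).
    """
    component = Counter(path)
    for attempt in range(attempts):
        try:
            return component.increment(fail=attempt < failures)
        except ComponentCrash:
            # No shutdown code: drop the instance and recover a fresh one.
            component = Counter(path)
    raise RuntimeError("component kept crashing")


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "counter.json")
    print(call_with_retry(demo_path, failures=0))
    print(call_with_retry(demo_path, failures=2))
```

The caller of `call_with_retry` never sees the two simulated crashes in the second call, which is the "transparent component-level retries" part of the abstract in miniature. A real service would also need idempotent operations and a bound on how long recovery may take, neither of which this toy addresses.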