Talk given at the IEEE Computer Society, Vancouver Chapter, May 20, 2010. [ PDF Slides ]
Abstract: The complexity of computer systems has grown to the point where it is no longer feasible to make them perfectly reliable. In the past, computer systems used techniques such as triple-modular redundancy and type-safe languages to mask and avoid errors. Such techniques either incur high overheads or require intensive effort from programmers, which drives up their cost. As errors become more prevalent in commodity systems, it is important to develop software that can anticipate and adapt to them. In this talk, we propose building software systems that produce acceptable outputs in the face of hardware and software errors. The key insight is that many errors are not serious enough to warrant detection and recovery, so it is sufficient to tolerate the few serious errors that do. Our goal is to provide “good enough” reliability using low-cost techniques on commodity platforms.
I’ll present two systems that embody this philosophy: Samurai, which tolerates software memory errors, and Flicker, which tolerates hardware memory errors. Both systems focus on protecting a subset of the application’s data, called its critical data, from errors. Samurai protects critical data from memory corruption errors in untrusted third-party code by replicating the data. Flicker achieves significant power savings by exposing hardware errors to the application while protecting critical data from such errors.
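To make the idea of protecting critical data by replication concrete, here is a minimal sketch in Python. It is not Samurai's actual implementation or API: the class name `ReplicatedCell`, the choice of three copies, and the majority-vote repair on load are all assumptions made for illustration.

```python
from collections import Counter


class ReplicatedCell:
    """Hypothetical holder that keeps several copies of one critical value.

    Illustration only: values must be hashable, and the replicas here live
    in ordinary Python objects rather than in separately protected memory.
    """

    def __init__(self, value, copies=3):
        if copies < 3 or copies % 2 == 0:
            raise ValueError("use an odd number of copies, at least 3")
        self._replicas = [value] * copies

    def store(self, value):
        # A critical store updates every replica.
        self._replicas = [value] * len(self._replicas)

    def load(self):
        # A critical load votes across the replicas and repairs outvoted copies.
        winner, votes = Counter(self._replicas).most_common(1)[0]
        if votes <= len(self._replicas) // 2:
            raise RuntimeError("no majority among replicas; value is unrecoverable")
        self._replicas = [winner] * len(self._replicas)
        return winner

    def corrupt(self, index, bad_value):
        # Test hook (assumed, for demonstration) simulating a stray write.
        self._replicas[index] = bad_value
```

In this sketch a single corrupted copy is outvoted and silently repaired on the next load, while data that is not wrapped in such a cell receives no protection and pays no overhead. That split is the sense in which only the critical subset is protected.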
Finally, I will conclude by looking at the broader implications of the “good enough” approach and future directions.
* The title is based on an article in the Aug’09 issue of Wired magazine titled “The Good Enough Revolution: When cheap and simple is just fine”.