This project presents two systems that protect critical data from a wide range of hardware and software errors. It is part of the work I did at Microsoft Research (Redmond).
We are in the age of “good-enough”, where imperfect and cheap trumps perfect and expensive, e.g., MP3s, netbooks, and IP telephony. We know that hardware and software systems experience errors, yet we continue to use these systems for our day-to-day needs. Rather than eradicate errors, our goal is to build robust systems that can tolerate both hardware and software errors and still provide acceptable outputs.
We present two systems that embody our philosophy: (1) Samurai, which tolerates software memory-corruption errors in C/C++ programs, and (2) Flicker, which tolerates hardware memory errors introduced by lowering DRAM refresh rates to save power. Samurai allows programs written in unsafe languages (i.e., C and C++) to continue executing with sound semantics despite memory errors. Similarly, Flicker allows DRAM to be refreshed at far lower rates than it is today, thereby reducing power consumption.
Both Samurai and Flicker emphasize the protection of critical data in programs. We define critical data as any data that cannot be regenerated if the application crashes (i.e., its persistent state) and that is important for the application to produce correct or acceptable outputs. For example, in a word-processing application the document data would be critical, and in a computer game the score and user data would be critical. While both Samurai and Flicker require the programmer to explicitly identify critical data, the two systems differ in how they protect it. Samurai protects critical data by replicating it within the process’s address space, while Flicker protects critical data by allocating it in a separate, high-refresh memory partition.
Samurai is a memory allocator and runtime system to protect critical data from accidental overwrites due to memory-corruption errors in type-unsafe programs. Samurai assumes that the portion of the program that manipulates the critical data is type-safe, and hence legitimate reads/writes of critical data can be identified in the program. Samurai replaces loads and stores of critical data by cload and cstore operations respectively. This can be done either automatically through the compiler or manually by the programmer. Further, the protection provided by Samurai is currently limited to heap objects (i.e., dynamically allocated data).
The above figure shows the operation of Samurai. The goal of Samurai is to prevent illegitimate pointer writes in the program from overwriting critical heap data. It achieves this goal probabilistically by replicating critical heap objects at random locations in the heap, which minimizes the probability that a single error corrupts several replicas in the same way. Information about an object’s replicas is stored as part of its metadata, which the cstore and cload operations use to update and compare the replicas respectively. Mismatches detected during the comparison are corrected by majority voting among the object’s replicas.
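The replicate, compare, and vote cycle can be illustrated with a minimal sketch in C. This is not Samurai's implementation: the type critical_int, the helpers critical_init and critical_free, the function signatures, and the choice of three replicas are all assumptions made for illustration, and the randomized placement of replicas is not modeled (separate malloc calls stand in for it). Only the operation names cstore and cload come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_REPLICAS 3  /* assumption: three copies, so a majority can exist */

/* Hypothetical metadata for one critical int: where each replica lives. */
typedef struct {
    int *replica[NUM_REPLICAS];
} critical_int;

/* Allocate every replica separately. Randomized placement is NOT modeled
   here; independent malloc calls merely stand in for it. */
int critical_init(critical_int *c) {
    assert(c != NULL);
    for (int i = 0; i < NUM_REPLICAS; i++) {
        c->replica[i] = malloc(sizeof(int));
        if (c->replica[i] == NULL) {
            while (i-- > 0) free(c->replica[i]);  /* undo partial allocation */
            return -1;
        }
        *c->replica[i] = 0;
    }
    return 0;
}

void critical_free(critical_int *c) {
    assert(c != NULL);
    for (int i = 0; i < NUM_REPLICAS; i++) {
        free(c->replica[i]);
        c->replica[i] = NULL;
    }
}

/* cstore: a legitimate write updates every replica. */
void cstore(critical_int *c, int value) {
    assert(c != NULL);
    for (int i = 0; i < NUM_REPLICAS; i++) {
        *c->replica[i] = value;
    }
}

/* cload: compare the replicas, repair any minority copy by majority vote,
   and return the agreed value through *out. Returns 0 on success and -1
   if no value is held by a strict majority of the replicas. */
int cload(critical_int *c, int *out) {
    assert(c != NULL && out != NULL);
    for (int i = 0; i < NUM_REPLICAS; i++) {
        int votes = 0;
        for (int j = 0; j < NUM_REPLICAS; j++) {
            if (*c->replica[j] == *c->replica[i]) votes++;
        }
        if (2 * votes > NUM_REPLICAS) {
            int winner = *c->replica[i];
            for (int j = 0; j < NUM_REPLICAS; j++) {
                *c->replica[j] = winner;  /* overwrite corrupted copies */
            }
            *out = winner;
            return 0;
        }
    }
    return -1;
}
```

In this sketch, a stray write that damages one replica between a cstore and the next cload is outvoted by the other two copies and repaired. If every replica holds a different value, cload reports failure rather than guessing.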
DRAM refresh is a significant consumer of power in mobile systems. DRAM must be refreshed constantly, even when not in use, or it will lose its data. Memory manufacturers conservatively set the refresh rate of DRAM systems to that required by the fastest-leaking cells. However, leakage rates vary considerably among the memory cells in a DRAM, and hence many cells retain their data even if the refresh rate is lowered. The Flicker system assigns critical and non-critical data to different parts of DRAM, and lowers the refresh rate of the part containing non-critical data at the cost of introducing a modest number of errors in it. The part containing critical data is still refreshed at the regular rate and is hence error-free. This differentiated allocation strategy allows Flicker to obtain power savings (up to 25%) with almost no reduction in the program’s reliability.
The above figure shows the steps in the operation of Flicker. First, the programmer identifies critical data at the granularity of program objects. Second, the Flicker allocator assigns these objects to separate virtual pages, and does not mix critical and non-critical data on the same page. Third, the operating system (OS) maps the virtual pages containing critical data to the high-refresh portion of the DRAM. Finally, the DRAM chip is partitioned into a high-refresh and a low-refresh portion, which the OS can configure before putting the mobile device to sleep. In sleep mode, the high-refresh portion is refreshed at the regular interval (every 32 milliseconds), while the rest of memory is refreshed far less often (every 1 second, roughly 30 times less frequently). The hardware changes required by Flicker are minimal and are based on the existing Partial Array Self-Refresh (PASR) feature of mobile DRAMs.
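The second step, keeping critical and non-critical objects on separate pages, can be sketched in C as two page-aligned arenas with a simple bump allocator. This is an illustration only, not Flicker's allocator: the names flicker_init, flicker_alloc, flicker_shutdown, and page_of, the 4 KiB page size, and the arena size are all assumptions, and the OS mapping and DRAM partitioning steps are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE   4096u  /* assumption: 4 KiB pages */
#define ARENA_PAGES 16u    /* assumption: small fixed-size arenas */
#define ARENA_BYTES ((size_t)PAGE_SIZE * ARENA_PAGES)

typedef enum { NON_CRITICAL = 0, CRITICAL = 1 } criticality;

/* One bump-allocated region per criticality class. */
typedef struct {
    unsigned char *base;
    size_t used;
} arena;

static arena arenas[2];

/* Reserve two page-aligned regions. Because each region starts on a page
   boundary and spans whole pages, no page holds both kinds of data. */
int flicker_init(void) {
    for (int i = 0; i < 2; i++) {
        arenas[i].base = aligned_alloc(PAGE_SIZE, ARENA_BYTES);  /* C11 */
        arenas[i].used = 0;
        if (arenas[i].base == NULL) {
            if (i == 1) {
                free(arenas[0].base);
                arenas[0].base = NULL;
            }
            return -1;
        }
    }
    return 0;
}

void flicker_shutdown(void) {
    for (int i = 0; i < 2; i++) {
        free(arenas[i].base);
        arenas[i].base = NULL;
        arenas[i].used = 0;
    }
}

/* Hand out memory from the arena that matches the object's criticality.
   Returns NULL if the arena is uninitialized or has too little room. */
void *flicker_alloc(size_t size, criticality kind) {
    arena *a = &arenas[kind];
    if (a->base == NULL || size > ARENA_BYTES - a->used) {
        return NULL;
    }
    size_t rounded = (size + 15u) & ~(size_t)15u;  /* 16-byte granularity */
    if (rounded > ARENA_BYTES - a->used) {
        return NULL;
    }
    void *p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Virtual page number of an address, used to check the separation. */
uintptr_t page_of(const void *p) {
    return (uintptr_t)p / PAGE_SIZE;
}
```

With this layout, a game could call flicker_alloc(sizeof(int), CRITICAL) for its score and flicker_alloc(n, NON_CRITICAL) for a pixel buffer, and the two objects would always land on different pages. An OS that follows the third step could then place the critical arena's pages in the high-refresh portion of DRAM.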