ToleRace

This project is about detecting and tolerating race conditions in programs. It is carried out in collaboration with colleagues at Microsoft Research and elsewhere (Cornell, UT Austin, IISc).

Motivation

The emergence of multi-core processors has ushered in an era of parallelism in the desktop computing world. Yet most existing software systems are unprepared for multi-core processors and cannot take advantage of the parallelism. As software systems are made multi-threaded (for parallel execution), it is inevitable that they will suffer from concurrency bugs such as race conditions and deadlocks. A variety of techniques have been developed in the literature to detect and prevent race conditions in programs. However, such techniques assume that they are deployed on the entire code-base, an unrealistic assumption in real systems where third-party libraries and plugins are loaded into the application. Even if the application program is free of data races, libraries may access shared data without proper synchronization and hence cause race conditions. These are called asymmetric races.

ToleRace is a runtime system that detects and tolerates asymmetric races in programs. Unlike many static analysis techniques, ToleRace does not require access to the entire code-base (especially not the libraries). Further, unlike dynamic techniques, ToleRace does not instrument every load and store in the program, thus avoiding prohibitive overheads. Finally, upon detecting a race, ToleRace gracefully recovers the program by guaranteeing correct semantics to the application thread (under certain conditions, see below).

Project Summary
[Figure: ToleRace Oracle]

The operation of ToleRace is shown above. The main idea behind ToleRace is to create two copies of the shared variable (V) within the critical section of the application’s (good) thread. All reads and writes to the shared variable within the critical section are instrumented to refer to one of the replicas (V’) instead of the main copy. The other copy (V”) holds the original value of the shared variable prior to entering the critical section. Prior to exiting the critical section, the replica is compared with the original value (V). In case of a mismatch, a repair mechanism is activated to restore the correct value to the shared variable. A more detailed description of the detection and repair mechanism can be found here.
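
The replicate, compare, and repair steps described above can be illustrated with a minimal single-threaded sketch in Python. It is not the ToleRace implementation. The names Shared, run_critical_section, and interfere are hypothetical, the racy library thread is simulated by a callback, and two points are assumptions: the comparison is read as checking the main copy V against the entry snapshot V”, and the repair policy is taken to be "the critical section's result wins".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Shared:
    """Hypothetical holder for the shared variable V (the main copy)."""
    value: int


def run_critical_section(
    shared: Shared,
    body: Callable[[int], int],
    interfere: Optional[Callable[[Shared], None]] = None,
) -> bool:
    """Run `body` on a private replica of V; return True if a race was seen."""
    v_work = shared.value   # V': replica the critical section reads and writes
    v_orig = shared.value   # V'': snapshot of V taken at critical-section entry

    # Instrumented accesses touch V', never the main copy V.
    v_work = body(v_work)

    # Simulated unsynchronized write to V by a racy (library) thread.
    if interfere is not None:
        interfere(shared)

    # Detection (assumed reading): V changed since entry, so someone raced.
    raced = shared.value != v_orig

    # Repair (assumed policy): publish the critical section's result, which
    # overwrites whatever the racy write left in V.
    shared.value = v_work
    return raced


if __name__ == "__main__":
    v = Shared(10)
    raced = run_critical_section(
        v, lambda x: x + 1, interfere=lambda s: setattr(s, "value", 99)
    )
    print(raced, v.value)
```

With this policy, the outcome matches an execution in which the racy write happened just before the critical section began, so the application thread observes consistent values throughout. Other repair policies are possible, and the sketch does not cover the conditions under which repair is not feasible.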

Publications

Collaborators: Benjamin Zorn (Microsoft Research), Darko Kirovski (Microsoft Research), Paruj Ratanaworabhan, Martin Burtscher (Univ of Texas, Austin), Rahul Nagpal (IISc)