Bo Fang, Hassan Halawa, Karthik Pattabiraman, Matei Ripeanu and Sriram Krishnamurthy, , Proceedings of the ACM International Conference on Supercomputing (ICS), 2019. (Acceptance Rate: 23.2 %). [ PDF | Talk ]
Abstract: Detectable but Uncorrectable Errors (DUEs) in the memory subsystem are becoming increasingly frequent. Today, upon encountering a DUE, applications crash, and the recovery methods used incur significant performance, storage, and energy overheads. To mitigate the impact of these errors, we start from two high-level observations that apply to some classes of HPC applications (e.g., stencil computations on regular grids or irregular meshes). First, these applications display a property we dub spatial data smoothness: data items that are nearby in the application's logical space are relatively similar. Second, since these data items are generally used together, programmers go to great lengths to place them in nearby memory locations, improving the application's performance through better access locality. Based on these observations, we explore the feasibility of a roll-forward recovery scheme that leverages spatial data smoothness to repair the memory location corrupted by a DUE and continue the application's execution. We present BonVoision, a run-time system that intercepts DUE events, analyzes the application binary at runtime to identify the data elements in the neighbourhood of the memory location that generated the DUE, and uses them to fix the corrupted data. Our evaluation demonstrates that BonVoision (i) is efficient: it incurs negligible overhead, (ii) is effective: it is frequently successful in continuing the application with benign outcomes, and (iii) does not require programmer input or access to source code. We demonstrate that using BonVoision can lead to significant savings in the context of a checkpoint/restart scheme by enabling longer checkpoint intervals.
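
The idea of repairing a corrupted value from its spatial neighbours can be illustrated with a minimal sketch. This is not BonVoision's implementation, which operates on the application binary at runtime; it only assumes a 2D stencil grid stored as a list of lists, and the function name `repair_from_neighbours` and the choice of a plain average over the four adjacent cells are hypothetical, chosen for illustration.

```python
def repair_from_neighbours(grid, row, col):
    """Overwrite grid[row][col] with the mean of its in-bounds 4-neighbours.

    Illustrative assumption: averaging adjacent cells is a reasonable
    stand-in for the lost value when the data is spatially smooth.
    """
    neighbours = []
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        # Skip positions that fall outside the grid (edges and corners).
        if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
            neighbours.append(grid[r][c])
    if not neighbours:
        raise ValueError("no neighbours available to repair from")
    grid[row][col] = sum(neighbours) / len(neighbours)
    return grid[row][col]


# A smooth grid whose centre cell has been lost to a (simulated) DUE.
grid = [
    [1.0, 2.0, 3.0],
    [2.0, float("nan"), 4.0],
    [3.0, 4.0, 5.0],
]
repaired = repair_from_neighbours(grid, 1, 1)
```

In this example the repaired centre value is the mean of 2.0, 2.0, 4.0 and 4.0, which is the kind of close-enough approximation a roll-forward scheme relies on when the data is smooth.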