Zitao Chen, Guanpeng Li, Karthik Pattabiraman, and Nathan DeBardeleben, To appear in The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2019. (Acceptance Rate: 21%) [ PDF | Talk ] ( Code )
Abstract: As machine learning (ML) becomes pervasive in high performance computing (HPC), ML has also found its way into safety-critical domains such as autonomous vehicles (AVs). Thus, the reliability of ML has grown in importance. Specifically, failures of ML systems can have catastrophic consequences (e.g., AV collisions), and can occur due to soft errors, which are increasing in frequency due to system scaling. Therefore, we need to evaluate ML systems in the presence of soft errors in order to protect them at low cost.
In this work, we propose an efficient fault injector for finding the safety-critical bits in ML systems. We find that the computations in widely-used ML models are often monotonic. We can thus approximate the error propagation behavior (a composition of different ML computations) as a monotonic function, which implies that a larger fault is more likely to cause a greater outcome deviation. Therefore, we design a binary-search-like fault injection (FI) technique, which we call BinFI, that can pinpoint the safety-critical bits in ML systems while also measuring the overall resilience. We evaluate BinFI on 8 ML models across 6 datasets (including a real-world AV dataset). Our results show that BinFI can identify an average of 99.56% of the safety-critical bits (with 99.63% precision), which significantly outperforms random FI, and has a speedup of 5X over exhaustive FI.
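To illustrate why monotonicity permits a binary search, the following is a minimal Python sketch and not the authors' implementation. It assumes that bits are ordered by the magnitude of the deviation a flip causes, and that if flipping bit i is safety-critical then flipping any higher bit is as well. The names find_critical_boundary and make_counting_oracle are hypothetical, and the toy oracle (a simple threshold on the deviation) stands in for an actual fault injection run on an ML model.

```python
from typing import Callable, List, Tuple


def find_critical_boundary(num_bits: int, is_critical: Callable[[int], bool]) -> int:
    """Return the lowest bit index whose flip is critical, or num_bits if none is.

    Assumes monotonicity: if bit i is critical, every bit j > i is critical too.
    """
    lo, hi = 0, num_bits
    while lo < hi:
        mid = (lo + hi) // 2
        if is_critical(mid):
            hi = mid        # the boundary is at mid or below
        else:
            lo = mid + 1    # the boundary is strictly above mid
    return lo


def make_counting_oracle(
    threshold: int, deviations: List[int]
) -> Tuple[Callable[[int], bool], List[int]]:
    """Toy stand-in for one fault injection: records each bit that is tried."""
    calls: List[int] = []

    def oracle(bit: int) -> bool:
        calls.append(bit)
        return deviations[bit] >= threshold

    return oracle, calls


# Toy setting (an assumption): flipping bit i of an unsigned integer
# changes the value by 2**i, and a deviation of 1000 or more is critical.
deviations = [2 ** i for i in range(32)]
oracle, calls = make_counting_oracle(threshold=1000, deviations=deviations)

boundary = find_critical_boundary(32, oracle)
critical_bits = list(range(boundary, 32))

print(f"lowest critical bit: {boundary}")
print(f"injections used: {len(calls)} (exhaustive would need 32)")
```

In this toy setting the search locates the boundary with a logarithmic number of injections instead of one injection per bit. A real injector would also need to handle the cases where the monotonic approximation does not hold, which this sketch ignores.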