Udit Agarwal, Abraham Chan, and Karthik Pattabiraman, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2022. (Acceptance Rate: 29%) [ PDF | Talk (video) ] (Code)
Abstract: As machine learning (ML) has become more prevalent across many critical domains, so has the need to understand ML applications’ resilience. While prior work like TensorFI [1], MindFI [2], and PyTorchFI [3] has focused on building ML fault injectors for specific ML frameworks, there has been little work on performing fault injection (FI) for ML applications written in multiple frameworks. We present LLTFI, a Framework-Agnostic Fault Injection tool for ML applications, allowing users to run FI experiments on ML applications at the LLVM IR level. LLTFI provides users with finer FI granularity at the level of instructions and a better understanding of how faults manifest and propagate between different ML components. We evaluate LLTFI on six ML programs and compare it with TensorFI. We found significant differences in the Silent Data Corruption (SDC) rates for similar faults between the two tools. Finally, we use LLTFI to evaluate the efficacy of selective instruction duplication – an error mitigation technique – for ML programs.