DLAFI: Software-Based Fault Injection for Permanent Faults in Deep Learning Accelerators

Seyedmani Sadati, Abraham Chan, Udit Kumar Agarwal and Karthik Pattabiraman. To appear in the IEEE International Symposium on Software Reliability Engineering (ISSRE) 2025. (Acceptance Rate: 28%) [ PDF (coming soon) | Talk ] (Code)

Abstract: Deep learning accelerators (DLAs) are used in safety-critical applications, making it crucial to address their reliability, particularly with respect to permanent faults arising from wear and tear or manufacturing defects. Current permanent fault injection methods are either slow (hardware simulation) or inaccurate (software-level injection). We introduce DLAFI, an LLVM-based fault injection framework that accurately simulates the hardware behavior of systolic arrays (SAs), the core compute components of DLAs, while achieving the speed of software-level injection. DLAFI models the SA's scheduler and its strategy for dynamically mapping machine learning (ML) operations to the SA's processing elements.
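
As a rough illustration of the idea (not DLAFI's actual implementation, which is LLVM-based), the Python sketch below models a permanent stuck-at fault in a single processing element (PE). It assumes a hypothetical output-stationary array in which output element (i, j) is computed by PE (i mod rows, j mod cols). This mapping, the function names, and the single stuck-at-bit fault model are all assumptions made for the example.

```python
def matmul_on_sa(a, b, rows, cols, faulty_pe=None, bit=0, stuck_value=1):
    """Integer matmul on a hypothetical rows x cols output-stationary array.

    Assumed mapping: output (i, j) is computed by PE (i % rows, j % cols).
    If faulty_pe is given, one bit of that PE's result is permanently
    stuck at stuck_value, so every output scheduled onto it is affected.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            if faulty_pe is not None and (i % rows, j % cols) == faulty_pe:
                if stuck_value:
                    acc |= 1 << bit      # bit stuck at 1
                else:
                    acc &= ~(1 << bit)   # bit stuck at 0
            out[i][j] = acc
    return out


def count_corrupted(golden, observed):
    """Number of output elements that differ from the fault-free result."""
    return sum(
        g != o
        for g_row, o_row in zip(golden, observed)
        for g, o in zip(g_row, o_row)
    )


# Toy 4x2 by 2x4 workload (values chosen arbitrarily for the example).
a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 1, 0], [0, 1, 0, 1]]

golden = matmul_on_sa(a, b, rows=2, cols=2)

# Same faulty PE (0, 0), bit 3 stuck at 1, on two array sizes.
small = matmul_on_sa(a, b, rows=2, cols=2, faulty_pe=(0, 0), bit=3)
large = matmul_on_sa(a, b, rows=4, cols=4, faulty_pe=(0, 0), bit=3)

corrupted_small = count_corrupted(golden, small)
corrupted_large = count_corrupted(golden, large)
```

Under this toy mapping, the same faulty PE corrupts four outputs on the 2x2 array but only one on the 4x4 array, because fewer output elements are scheduled onto it. The sketch only shows why the scheduler's mapping determines where a permanent fault surfaces; it does not reproduce the paper's measurements.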

Compared with hardware simulation-based fault injection, DLAFI is three orders of magnitude faster overall and enables the analysis of higher-complexity ML applications such as object detection and large language models. Using DLAFI, we evaluate the resilience of various ML workloads across SA sizes and scheduling strategies. We find that larger SAs reduce fault impact, balanced schedulers can lower resilience, final layers are more vulnerable to faults, and vision models are more resilient than language models.
