Udit Kumar Agarwal, Abraham Chan, Ali Asgari, and Karthik Pattabiraman. 19th IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE), 2023. Received Best-of-SELSE award (one of three papers). [ PDF | Presentation ] (Code)
Abstract: Neural networks are widely used in safety-critical applications such as autonomous vehicles and medical diagnostics. The increasing complexity and compute intensity of deep neural networks (DNNs) have motivated DNN accelerators such as Google’s Tensor Processing Unit (TPU), which accelerate convolution and matrix multiplication operations. At its core, a TPU consists of a two-dimensional array of multiply-and-accumulate (MAC) units, called a systolic array, which is susceptible to both permanent hardware faults (e.g., stuck-at faults in the data path) and transient hardware faults (e.g., radiation-induced). We propose an RTL-level fault injection (FI) framework for systolic arrays. Using this framework, we characterize the software-level effects of errors (called fault patterns) induced by stuck-at faults within the MAC units of the systolic array. We further analyze the effect of different dataflow mapping schemes (output stationary and weight stationary), operation types (convolution and matrix multiplication), and operation configurations (e.g., input size, convolution kernel size). Through the FI experiments, we categorize the fault patterns for stuck-at faults into well-defined classes based on their spatial patterns.
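To make the notion of a spatial fault pattern concrete, the following is a minimal software-level sketch, not the RTL-level framework described in the abstract. It assumes a toy model in which the matrix dimensions equal the array dimensions (no tiling), a stuck-at fault forces one bit of a single MAC unit's accumulated value, and all operands are non-negative integers. All names (`stuck_at`, `matmul_os`, `matmul_ws`, `fault_pattern`) and the fault description dictionary are hypothetical and chosen only for illustration.

```python
def stuck_at(value, bit, stuck_to):
    """Force one bit of a non-negative integer to 0 or 1 (toy stuck-at model)."""
    mask = 1 << bit
    return value | mask if stuck_to else value & ~mask


def matmul_os(A, B, fault=None):
    """Output stationary (assumed mapping): PE (i, j) owns and accumulates C[i][j]."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                acc += A[i][t] * B[t][j]
                # The faulty PE corrupts its accumulator on every cycle.
                if fault and fault["pe"] == (i, j):
                    acc = stuck_at(acc, fault["bit"], fault["stuck_to"])
            C[i][j] = acc
    return C


def matmul_ws(A, B, fault=None):
    """Weight stationary (assumed mapping): PE (t, j) holds B[t][j];
    partial sums flow down column j through PEs (0, j), (1, j), ..."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):  # each input row streams through the whole array
        for j in range(m):
            psum = 0
            for t in range(k):
                psum += A[i][t] * B[t][j]
                # Every partial sum passing through the faulty PE is corrupted.
                if fault and fault["pe"] == (t, j):
                    psum = stuck_at(psum, fault["bit"], fault["stuck_to"])
            C[i][j] = psum
    return C


def fault_pattern(golden, faulty):
    """Return the (row, col) positions where the faulty output differs."""
    return [(i, j)
            for i, row in enumerate(golden)
            for j, value in enumerate(row)
            if faulty[i][j] != value]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]

# Hypothetical fault: bit 20 of the accumulated value in PE (1, 2) stuck at 1.
fault = {"pe": (1, 2), "bit": 20, "stuck_to": 1}

os_pattern = fault_pattern(matmul_os(A, B), matmul_os(A, B, fault))
ws_pattern = fault_pattern(matmul_ws(A, B), matmul_ws(A, B, fault))

print("output stationary:", os_pattern)  # [(1, 2)]
print("weight stationary:", ws_pattern)  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

In this toy model, the same faulty MAC unit corrupts a single output element under the output-stationary mapping and an entire output column under the weight-stationary mapping, which illustrates why the dataflow can change the spatial shape of the corruption. The sketch says nothing about the classes reported in the paper, which are derived from RTL-level experiments.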