Understanding Error Propagation in Deep-Learning Neural Networks (DNN) Accelerators and Applications

Guanpeng Li, Siva Hari, Michael Sullivan, Timothy Tsai, Karthik Pattabiraman, Joel Emer, Stephen Keckler, International Conference for High-Performance Computing, Networking, Storage and Analysis (SC), 2017. (Acceptance Rate: 19%) [PDF | Talk] (Injector code)
Chosen for IEEE Top Picks in Test and Reliability (TPTR), 2023.

(NEW: UBC ECE did a story about this paper)

Abstract: Deep learning neural networks (DNNs) have been successful in solving a wide range of machine learning problems. Specialized hardware accelerators have been proposed to accelerate the execution of DNN algorithms for high-performance and energy efficiency. Recently, they have been deployed in data centers (potentially for business-critical or industrial applications) and safety-critical systems such as self-driving cars. Soft errors caused by high-energy particles have been increasing in hardware systems, and these can lead to catastrophic failures in DNN systems. Traditional methods for building resilient systems, e.g., Triple Modular Redundancy (TMR), are agnostic of the DNN algorithm and the DNN accelerator’s architecture. Hence, these traditional resilience approaches incur high overheads, which makes them challenging to deploy. In this paper, we experimentally evaluate the resilience characteristics of DNN systems (i.e., DNN software running on specialized accelerators). We find that the error resilience of a DNN system depends on the data types, values, data reuses, and the types of layers in the design. Based on our observations, we propose two efficient protection techniques for DNN systems.
