Zitao Chen, Guanpeng Li, and Karthik Pattabiraman, To appear in the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2021. (Acceptance Rate: 16.5%). [ PDF | Talk ] (arXIV, code)
Abstract: The adoption of deep neural networks (DNNs) in safety-critical domains has engendered serious reliability concerns. A prominent example is hardware transient faults that are growing in frequency due to the progressive technology scaling, and can lead to failures in DNNs. This work proposes Ranger, a low-cost fault corrector, which directly rectifies the faulty output due to transient faults without re-computation. DNNs are inherently resilient to benign faults (which will not cause output corruption), but not to critical faults (which can result in erroneous output). Ranger is an automated transformation to selectively restrict the value ranges in DNNs, which reduces the large deviations caused by critical faults and transforms them to benign faults that can be tolerated by the inherent resilience of the DNNs. Our evaluation on 8 DNNs demonstrates Ranger significantly increases the error resilience of the DNNs (by 3x to 50x), and with negligible memory and performance overheads.