{"id":6416,"date":"2026-04-29T18:07:10","date_gmt":"2026-04-30T01:07:10","guid":{"rendered":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/?p=6416"},"modified":"2026-05-24T08:53:17","modified_gmt":"2026-05-24T15:53:17","slug":"thinking-inside-the-box-injecting-realistic-radiation-faults-in-ml-accelerators","status":"publish","type":"post","link":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/2026\/04\/29\/thinking-inside-the-box-injecting-realistic-radiation-faults-in-ml-accelerators\/","title":{"rendered":"Thinking Inside the Box: Injecting Realistic Radiation Faults in ML Accelerators"},"content":{"rendered":"<p>Bruno Loureiro Coelho, Seyedmani Sadati, Abraham Chan, Alex Hands, Karthik Pattabiraman, and Paolo Rech. To appear in the Proceedings of the <a href=\"https:\/\/dsn2026.github.io\/\">IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN), 2026<\/a>. (Acceptance Rate: 20%) [ <a href=\"https:\/\/www.dropbox.com\/scl\/fi\/ws2vhpub4nslzr516y1zc\/DSN26-Thinking-Inside-Camera-Ready.pdf?rlkey=iro7ekilevakdoyqhshq4ie7m&#038;st=vkws2dtd&#038;dl=0\">PDF<\/a> | Talk ] (<a href=\"https:\/\/github.com\/DependableSystemsLab\/TPU-FI\">Code<\/a>). <strong>Best Paper Award (one of three)<\/strong> <strong>Code Reproducible, Dataset Reproducible<\/strong><br \/>\n<!--more--><\/p>\n<p><strong>Abstract:<\/strong> Machine Learning accelerators are increasingly deployed in safety-critical applications such as autonomous vehicles and space environments, where exposure to radiation poses serious reliability risks. To evaluate and mitigate this risk, we test tensor processing units (TPUs) with neutron, proton, and heavy-ion beams that account for 104 <em>million<\/em> years of natural exposure on Earth, and more than 1,000 <em>years<\/em> of mission in outer space, showing that, in contrast to conventional single-bit models, faults affect localized rectangular regions of tensor outputs. 
We then build TPU-FI, an experimentally tuned software-based fault injector that predicts TPU failure-in-time rates to within 1.7x of beam results on average. Through 6 <em>million<\/em> software fault injections, we analyze how fault type, model architecture, layer type, and input data affect prediction correctness, finding that fully connected and convolutional layers exhibit higher vulnerability and that different inputs trigger distinct fault propagation paths, especially in attention-based models.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Bruno Loureiro Coelho, Seyedmani Sadati, Abraham Chan, Alex Hands, Karthik Pattabiraman, and Paolo Rech. To appear in the Proceedings of the IEEE\/IFIP International Conference on Dependable Systems and Networks (DSN), 2026. (Acceptance Rate: 20%) [ PDF | Talk ] (Code). &hellip; <a href=\"https:\/\/blogs.ubc.ca\/dependablesystemslab\/2026\/04\/29\/thinking-inside-the-box-injecting-realistic-radiation-faults-in-ml-accelerators\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":10348,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[64,44,21,4,63,6,7],"class_list":["post-6416","post","type-post","status-publish","format-standard","hentry","category-publications","tag-64","tag-abraham","tag-award","tag-conference","tag-mani","tag-reliability","tag-many-core"],"_links":{"self":[{"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/posts\/6416","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/users\/10348"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/comments?post=6416"}],"version-history":[{"count":7,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/posts\/6416\/revisions"}],"predecessor-version":[{"id":6452,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/posts\/6416\/revisions\/6452"}],"wp:attachment":[{"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/media?parent=6416"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/categories?post=6416"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/dependablesystemslab\/wp-json\/wp\/v2\/tags?post=6416"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}