{"id":3437,"date":"2016-01-01T20:55:59","date_gmt":"2016-01-02T04:55:59","guid":{"rendered":"https:\/\/blogs.ubc.ca\/karthik\/?p=3437"},"modified":"2017-02-21T21:15:32","modified_gmt":"2017-02-22T05:15:32","slug":"a-systematic-methodology-for-evaluating-the-errorresilience-of-gpgpu-applications","status":"publish","type":"post","link":"https:\/\/blogs.ubc.ca\/karthik\/2016\/01\/01\/a-systematic-methodology-for-evaluating-the-errorresilience-of-gpgpu-applications\/","title":{"rendered":"A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications"},"content":{"rendered":"<p>Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi, <a href=\"http:\/\/www.computer.org\/web\/tpds\">IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016<\/a>. [ <a href=\"https:\/\/blogs.ubc.ca\/karthik\/files\/2014\/06\/2016_TPDS_GPUQin.pdf\">PDF<\/a> ]<br \/>\n<!--more--><\/p>\n<p>This paper supersedes our <a href=\"https:\/\/blogs.ubc.ca\/karthik\/2013\/12\/11\/gpu-qin-a-methodlogy-for-evaluating-the-error-resilience-of-gpgpu-applications\/\">conference paper<\/a>.<\/p>\n<p><strong>Abstract:<\/strong> The wide adoption of graphics processing units (GPUs) as accelerators for general-purpose applications makes the end-to-end reliability implications of their use increasingly significant. Fault injection is a widely adopted method to evaluate the resilience of applications. However, building a fault injector for general-purpose GPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while being time-efficient. <\/p>\n<p>This paper makes four key contributions. First, it presents a fault-injection methodology to evaluate the end-to-end reliability properties of application kernels running on GPUs. 
Second, it introduces GPU-Qin, a fault-injection tool that uses real GPU hardware and offers a tunable and efficient balance between the representativeness and the cost of a fault-injection campaign. Third, it characterizes the error resilience of seventeen application kernels. Finally, it provides preliminary insights on correlations between the algorithmic properties of applications and their error resilience.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi, IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016. [ PDF ]<\/p>\n","protected":false},"author":10348,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2267],"tags":[628456,416327,2835,7090,416309],"class_list":["post-3437","post","type-post","status-publish","format-standard","hentry","category-publications","tag-628456","tag-bo","tag-journal","tag-reliability","tag-many-core"],"_links":{"self":[{"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/posts\/3437","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/users\/10348"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/comments?post=3437"}],"version-history":[{"count":4,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/posts\/3437\/revisions"}],"predecessor-version":[{"id":3815,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/posts\/3437\/revisions\/3815"}],"wp:attachment":[{"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/media?parent=3437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/w
p\/v2\/categories?post=3437"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.ubc.ca\/karthik\/wp-json\/wp\/v2\/tags?post=3437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}