Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

Xin Chen, Charng-da Lu and Karthik Pattabiraman, 25th IEEE International Symposium on Software Reliability Engineering (ISSRE), 2014. (Acceptance Rate: 25%). (This paper was chosen as one of the “highlights of ISSRE” in 2019: one of 26 papers selected from over 1000 papers in the 30-year history of the conference.)

Abstract: In this paper, we analyze a cloud workload trace from Google and characterize job failures. The goal of our work is to improve the understanding of failures in compute clouds. We present the statistical properties of job and task failures, and correlate them with key scheduling constraints, node operations, and attributes of users in the cloud. We also explore the potential for early failure prediction and anomaly detection for the jobs. Our results point to several opportunities to enhance the reliability of the cloud, such as proactive maintenance of nodes or limiting job resubmissions. We further find that resource usage patterns of the jobs can be leveraged by failure prediction techniques. Finally, we find that the termination statuses of jobs and tasks can be clustered into six dominant categories based on the user profiles.
