Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study

Xin Chen, Charng-da Lu and Karthik Pattabiraman. To appear in the Workshop on Reliability and Security Data Analysis (RSDA), held in conjunction with the IEEE International Symposium on Software Reliability Engineering (ISSRE), 2014. [ PDF | Talk ]

Abstract: Most cloud computing clusters are built from unreliable, commercial off-the-shelf components. The high failure rates in their hardware and software components can result in frequent node and application failures. Therefore, it is important to predict application failures before they occur to avoid resource wastage. In this paper, we investigate how to identify application failures based on resource usage measurements from the Google cluster traces. We apply recurrent neural networks to the resource usage measures, and generate features to categorize the input resource usage time series into different classes. Our results show that the model is able to predict failures of batch applications, which are the dominant jobs in the Google cluster. Moreover, we explore early classification to identify failures, and find that the prediction algorithm provides the cloud system enough time to take proactive actions much earlier than the termination of applications, with an average of 6% to 10% resource savings.
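
To make the early-classification idea concrete, here is a minimal sketch of how the resource saving from an early failure prediction could be computed. It does not reproduce the paper's model. It assumes a hypothetical classifier has already produced a failure probability at each time step, and the names `first_alarm_step` and `resource_saved_fraction`, the threshold, and the sample numbers are all illustrative assumptions, not values from the paper.

```python
from typing import Optional, Sequence


def first_alarm_step(fail_probs: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the first step whose predicted failure
    probability reaches the threshold, or None if it never does."""
    for step, prob in enumerate(fail_probs):
        if prob >= threshold:
            return step
    return None


def resource_saved_fraction(usage: Sequence[float], alarm_step: Optional[int]) -> float:
    """Fraction of the job's total resource usage that is avoided if the
    job is stopped right after alarm_step instead of running to the end."""
    total = sum(usage)
    if alarm_step is None or total == 0:
        return 0.0
    consumed = sum(usage[: alarm_step + 1])
    return (total - consumed) / total


# Illustrative inputs only (hypothetical, not taken from the Google traces).
usage = [2.0, 2.0, 3.0, 3.0]          # resource usage per time step
fail_probs = [0.1, 0.3, 0.7, 0.9]     # hypothetical classifier output per step

alarm = first_alarm_step(fail_probs, threshold=0.6)
saved = resource_saved_fraction(usage, alarm)
print(f"alarm at step {alarm}, fraction of usage saved: {saved:.2f}")
```

The earlier the alarm fires on a job that eventually fails, the larger the saved fraction, and a lower threshold trades earlier alarms against more false positives on jobs that would have finished normally.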