In this project, we investigate the reliability of large-scale cloud systems from the application, or job, perspective. Our overall goal is to ensure that cloud applications execute successfully despite node and machine failures. Toward this end, we study correlations between job and application properties and their failure characteristics, and attempt to predict when (and whether) an application will fail, so that proactive recovery measures can be taken.
We use the Google cluster workload traces for our evaluation. These contain job and resource-usage information for jobs executing on a Google cluster over a one-month period. Although many details are obfuscated in the trace, quite a few interesting facts can still be learned from it. For example, we find that failed and killed jobs consume a larger fraction of resources than jobs that finish successfully (see left figure), suggesting that failure prediction can be quite beneficial. We further find that many extrinsic factors can influence whether a job will succeed, and that these factors manifest as early as halfway into the job's execution. Thus, failure prediction techniques can achieve significant resource savings, since they can identify failing jobs well before those jobs end.
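The resource-consumption observation above can be reproduced with a simple aggregation over per-job records. The sketch below uses hypothetical field names and toy numbers, not the trace's actual schema; it only illustrates the kind of computation involved.

```python
from collections import defaultdict

# Hypothetical per-job records in the spirit of the Google cluster trace:
# (job_id, final_status, cpu_core_hours). Names and values are illustrative.
jobs = [
    ("j1", "finish", 10.0),
    ("j2", "fail",   40.0),
    ("j3", "kill",   25.0),
    ("j4", "finish", 15.0),
    ("j5", "fail",   10.0),
]

def resource_share_by_status(records):
    """Fraction of total CPU core-hours consumed by jobs of each final status."""
    usage = defaultdict(float)
    for _job_id, status, cpu_hours in records:
        usage[status] += cpu_hours
    total = sum(usage.values())
    return {status: used / total for status, used in usage.items()}

shares = resource_share_by_status(jobs)
# In this toy sample, failed and killed jobs together account for 75% of
# the CPU core-hours even though they are only 3 of the 5 jobs.
```

On the real trace, the same aggregation would be run over the job-events and resource-usage tables rather than an in-memory list.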
We explore failure prediction using Recurrent Neural Networks (RNNs) for batch jobs, the dominant job class in the Google cluster trace. We consider two classes of predictors, aggressive and conservative, which trade off false positives against false negatives. We find that conservative prediction achieves high specificity and accuracy for batch jobs (see right figure), and that it can yield significant resource savings for this class of jobs. We plan to explore more diverse workloads in the future and to extend our prediction technique to them.
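One way to realize the aggressive/conservative distinction, independent of the underlying RNN, is to threshold the model's predicted failure probability: a high threshold yields a conservative predictor (few false positives, more false negatives), while a low threshold yields an aggressive one. The sketch below shows that thresholding step and the specificity/accuracy computation; the function names and sample probabilities are illustrative, not the project's actual code.

```python
def classify(prob_fail, threshold):
    """Flag a job as likely to fail when the model's predicted failure
    probability meets the decision threshold. A high threshold gives a
    conservative predictor; a low threshold gives an aggressive one."""
    return prob_fail >= threshold

def specificity_accuracy(probs, labels, threshold):
    """Compute specificity and accuracy of the thresholded predictor.

    probs  -- predicted failure probabilities (e.g., RNN outputs)
    labels -- True if the job actually failed or was killed
    """
    preds = [classify(p, threshold) for p in probs]
    true_neg  = sum(1 for p, y in zip(preds, labels) if not p and not y)
    false_pos = sum(1 for p, y in zip(preds, labels) if p and not y)
    correct   = sum(1 for p, y in zip(preds, labels) if p == y)
    negatives = true_neg + false_pos
    specificity = true_neg / negatives if negatives else 1.0
    accuracy = correct / len(labels)
    return specificity, accuracy

# Toy example: four jobs, two of which actually failed.
probs  = [0.9, 0.2, 0.6, 0.1]
labels = [True, False, True, False]
spec, acc = specificity_accuracy(probs, labels, threshold=0.8)  # conservative
```

A conservative threshold keeps specificity high (successful jobs are rarely killed preemptively) at the cost of missing some failures, which matches the resource-savings goal described above.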
Students: Xin Chen
Other: Charng-da Lu