Predicting Job Completion Times Using System Logs in Supercomputing Clusters

Xin Chen, Charng-da Lu and Karthik Pattabiraman, Proceedings of the IEEE Workshop on Reliability and Security Data Analysis, 2013.

Abstract: Most large systems, such as HPC and cloud computing clusters and data centers, are built from commercial off-the-shelf components. System logs are usually the main source of insight into system issues, so mining logs to diagnose anomalies has been an active research area. Because the logs of commodity PC clusters lack organization and semantic consistency, what constitutes a fault or an error is subjective, and building an automatic failure prediction model from log messages is therefore hard. In this paper we sidestep the difficulty by asking a different question: given the concomitant system log messages of a running job, can we predict the job’s remaining time? We adopt a Hidden Markov Model (HMM) coupled with frequency analysis to achieve this. Our HMM approach can predict 75% of jobs’ remaining times with an error of less than 200 seconds.
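Illustrative sketch: the abstract does not describe the model's structure, so the following is not the authors' implementation. It is a minimal toy example of how an HMM could map a stream of frequency-binned log observations to a remaining-time estimate. The three job phases, the three frequency bins, every probability, and the per-phase remaining times are hypothetical values chosen only for illustration, and the function names `filter_states` and `expected_remaining` are invented here.

```python
# Toy illustration only: every state, symbol, and number below is hypothetical
# and is NOT taken from the paper.

# Hidden states: phases of a job (0 = start-up, 1 = compute, 2 = wrap-up).
START = [1.0, 0.0, 0.0]          # assume a job always begins in phase 0
TRANS = [                        # TRANS[i][j] = P(next phase j | phase i)
    [0.6, 0.4, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]

# Observations: log-message volume per time window, binned by frequency
# (0 = low, 1 = medium, 2 = high).
EMIT = [                         # EMIT[i][k] = P(bin k | phase i)
    [0.1, 0.3, 0.6],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
]

# Assumed mean remaining time in seconds while the job is in each phase.
REMAINING = [900.0, 500.0, 60.0]


def filter_states(observations):
    """Forward algorithm with per-step normalisation.

    Returns P(current phase | all observations so far) as a list.
    """
    n = len(START)
    belief = [START[i] * EMIT[i][observations[0]] for i in range(n)]
    total = sum(belief)
    belief = [b / total for b in belief]
    for obs in observations[1:]:
        # Predict: push the belief through the transition matrix.
        predicted = [
            sum(belief[i] * TRANS[i][j] for i in range(n)) for j in range(n)
        ]
        # Update: weight by the likelihood of the new observation.
        belief = [predicted[j] * EMIT[j][obs] for j in range(n)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


def expected_remaining(observations):
    """Posterior-weighted average of the per-phase remaining times."""
    belief = filter_states(observations)
    return sum(b * r for b, r in zip(belief, REMAINING))


if __name__ == "__main__":
    windows = [2, 0, 0, 1, 1]    # hypothetical frequency bins, one per window
    print(filter_states(windows))
    print(expected_remaining(windows))
```

In this toy setup the estimate shrinks as the observed bins shift from the pattern typical of start-up toward the pattern typical of wrap-up. A real model would learn the transition and emission probabilities from historical job logs rather than fixing them by hand.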