How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-level Fault Injection Method

Yong Yang, Yifan Yu, Karthik Pattabiraman, Long Wang, Ying Li, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2020. (Acceptance Rate: 26%). [ PDF | Talk ] (Code)

Abstract: Anomaly detection in distributed systems has been a fertile research area, and a range of anomaly detectors have been proposed for such systems. Unfortunately, there has been no systematic quantitative study of the efficacy of different anomaly detectors, although such a study is important for revealing the deficiencies of existing detectors and shedding light on future research directions. In this paper, we investigate how various anomaly detectors behave on anomalies of different types, and why, by extensively injecting software faults into three widely-used distributed applications. We use a statement-level fault injection method to observe the anomalies, characterize them, and analyze the detection results from anomaly detectors of three categories. We find that: (1) the distributed systems’ own error reporting mechanisms are able to report most of the anomalies (from 82.1% to 92.8%), but they incur a high false alarm rate of 26.6%; (2) state-of-the-art anomaly detectors are able to detect the existence of anomalies with 99.08% precision and 90.60% recall, but there is still a long way to go to pinpoint the accurate location of the detected anomalies; and (3) log-based anomaly detection techniques outperform other anomaly detection techniques, but not for all anomaly types.
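The abstract reports detector quality as precision and recall. As a reminder of how those two figures are usually derived, here is a minimal Python sketch using the standard textbook definitions. The `DetectionCounts` type, its field names, and the counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own evaluation procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical tally of detector verdicts against fault-injection ground truth."""
    true_positives: int   # anomalous runs the detector flagged
    false_positives: int  # normal runs the detector flagged
    false_negatives: int  # anomalous runs the detector missed


def precision(c: DetectionCounts) -> float:
    # Fraction of flagged runs that were truly anomalous.
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0


def recall(c: DetectionCounts) -> float:
    # Fraction of truly anomalous runs that the detector flagged.
    anomalous = c.true_positives + c.false_negatives
    return c.true_positives / anomalous if anomalous else 0.0


# Illustrative counts only, not data from the paper.
counts = DetectionCounts(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={precision(counts):.2f} recall={recall(counts):.2f}")
```

High precision with lower recall, as in this toy tally, means that alarms are trustworthy but some anomalies go unreported.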
