Paper Details
Reference:
Yongle Zhang, Kirk Rodrigues, Yu Luo, Michael Stumm, and Ding Yuan,
"The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure",
In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19), Huntsville, ON, Canada, ACM, October, 2019, pp. 131–146.
Abstract:
The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a totally ordered sequence of instructions, then the root cause can be identified by the first instruction where the failure execution deviates from the non-failure execution that has the longest instruction sequence prefix in common with that of the failure execution. Thus, root cause analysis is transformed into a principled search problem to identify the non-failure execution with the longest common prefix. We present Kairux, a tool that does just that. It is, in most cases, capable of pinpointing the root cause of a failure in a distributed system, in a fully automated way. Kairux uses tests from the system's rich unit test suite as building blocks to construct the non-failure execution that has the longest common prefix with the failure execution in order to locate the root cause. By evaluating Kairux on some of the most complex, real-world failures from HBase, HDFS, and ZooKeeper, we show that Kairux can accurately pinpoint each failure's respective root cause.
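The search described in the abstract can be illustrated with a toy sketch. It assumes each execution is already available as a list of instruction labels and that a finite set of candidate non-failure executions is given up front; Kairux itself constructs such executions from unit tests, which this sketch does not attempt. The function names and the sample traces are hypothetical and chosen only for illustration.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def common_prefix_length(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Return the number of leading positions at which a and b agree."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def locate_inflection_point(
    failure: Sequence[Hashable],
    non_failures: List[Sequence[Hashable]],
) -> Tuple[int, int, Optional[Hashable]]:
    """Pick the non-failure execution sharing the longest prefix with the failure.

    Returns (index of that execution, prefix length, first deviating
    instruction of the failure execution). The instruction is None when the
    whole failure execution is a prefix of the chosen non-failure execution.
    """
    if not non_failures:
        raise ValueError("at least one non-failure execution is required")

    best_index = 0
    best_length = -1
    for index, execution in enumerate(non_failures):
        length = common_prefix_length(failure, execution)
        if length > best_length:  # ties keep the earliest candidate
            best_index, best_length = index, length

    deviating = failure[best_length] if best_length < len(failure) else None
    return best_index, best_length, deviating


# Hypothetical traces: instruction labels stand in for real instructions.
failure_run = ["open", "read", "close", "write", "crash"]
passing_runs = [
    ["open", "read", "write", "close"],
    ["open", "read", "close", "flush", "exit"],
]

print(locate_inflection_point(failure_run, passing_runs))
# → (1, 3, 'write')
```

In this example the second passing run shares the three-instruction prefix "open, read, close" with the failing run, so the first deviating instruction of the failing run, "write", is the candidate root cause under the hypothesis.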
Keywords:
Distributed systems, root cause, failure diagnosis, debugging
Reference Info:
DOI: 10.1145/3341301.3359650
ISBN: 9781450368735
OCLC: 8877132593
BibTeX:
@inproceedings{Zhang-sosp-19,
  author = {Yongle Zhang and Kirk Rodrigues and Yu Luo and Michael Stumm and Ding Yuan},
  title = {The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure},
  booktitle = {Proceedings of the 27th ACM Symposium on Operating Systems Principles (\textbf{SOSP'19})},
  location = {Huntsville, ON, Canada},
  publisher = {ACM},
  month = {October},
  year = {2019},
  pages = {131--146},
  doi = {10.1145/3341301.3359650},
  isbn = {9781450368735},
  keywords = {Distributed systems, root cause, failure diagnosis, debugging}
}