The 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) had an interesting presentation on “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems”.
Below is a summary of the study:
- The study explores the reasons why distributed systems fail in production by analyzing the root causes of around 200 confirmed system failures.
- The failures were reported against widely used open-source distributed systems: Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis.
- The key findings are:
  - 92% of the catastrophic system failures resulted from incorrect handling of non-fatal errors explicitly signaled in software.
  - In 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of the error-handling code.
  - A majority (77%) of the failures required more than one input event to manifest, but most (90%) required no more than three.
  - For a majority (84%) of the failures, all of the triggering events were logged.
- The practical recommendations that follow from these findings:
  - Put more focus on writing and validating error-handling code.
  - Run integration tests with a small number of nodes, since most failures require only a few input events to manifest.
  - Examine your application logs to diagnose failures, since the triggering events are usually logged.
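The "simple testing" the study refers to amounts to forcing the error path to execute at least once, for example by injecting the non-fatal error directly. A minimal sketch of that idea (the component and names here are hypothetical, not taken from the paper):

```java
import java.io.IOException;

// Hypothetical component whose error handling we want to exercise.
class BlockFetcher {
    interface Source { byte[] read() throws IOException; }

    // Returns the data, or an empty block if the source fails.
    static byte[] fetch(Source src) {
        try {
            return src.read();
        } catch (IOException e) {
            // Error path under test: fall back instead of crashing.
            return new byte[0];
        }
    }
}

public class BlockFetcherTest {
    public static void main(String[] args) {
        // Inject the non-fatal error to drive the handler directly.
        byte[] out = BlockFetcher.fetch(() -> { throw new IOException("disk gone"); });
        if (out.length != 0) throw new AssertionError("expected the fallback block");
        System.out.println("error path exercised: ok");
    }
}
```

Even a test this small would have caught many of the catastrophic failures in the study, because the buggy handlers failed the very first time they ran.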
Here is a link to the full paper: (Link)
- The authors also developed Aspirator, a static checker that looks for trivial bug patterns in the exception handlers of Java (or other JVM-compatible) programs.