Preventing Production Failures in Distributed Data-Intensive Systems

The 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) had an interesting presentation on “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems”.

Below is a summary of the study:

  • The study explores why distributed systems fail in production by analyzing the root causes of around 200 confirmed system failures.
  • The failures were reported against widely used open-source distributed data-intensive systems: Cassandra, HBase, the Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis.
  • The key takeaways are:
    • 92% of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
    • In 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code.
    • A majority (77%) of the failures require more than one input event to manifest, but most of the failures (90%) require no more than 3.
    • For a majority (84%) of the failures, all of their triggering events are logged.
  • Solutions:
    • Focus more on writing and validating error-handling code (a minimal sketch follows this list).
    • Run integration tests with a small number of nodes.
    • Examine your application logs to diagnose failures.
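
To make the findings concrete, here is a minimal Java sketch; the class, method, and block names are hypothetical and not taken from any of the studied systems. It shows a handler that silently swallows an explicitly signaled IOException, a handler that retries and then propagates the failure, and a tiny self-contained test that forces the error path, which is the kind of simple test of error-handling code the study argues would have caught many of the catastrophic failures.

```java
import java.io.IOException;

// Hypothetical sketch: a block fetch that can fail with a non-fatal error
// that is explicitly signaled as an IOException.
public class BlockFetcher {

    // Buggy handler in the spirit of the failures the study describes:
    // the signaled error is swallowed, so the caller keeps going with an
    // empty result instead of failing over to another replica.
    public byte[] fetchSwallowing(String blockId) {
        try {
            return readFromPeer(blockId);
        } catch (IOException e) {
            return new byte[0]; // error silently dropped
        }
    }

    // More deliberate handling: retry a bounded number of times, log each
    // attempt, then propagate the error so the caller can react.
    public byte[] fetchWithRetry(String blockId, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return readFromPeer(blockId);
            } catch (IOException e) {
                last = e;
                System.err.println("fetch " + blockId + " failed, attempt "
                        + attempt + "/" + maxAttempts + ": " + e.getMessage());
            }
        }
        throw last; // surface the failure instead of hiding it
    }

    // Stand-in for the real network read. Failing deterministically here is
    // the point: force the error path and observe the handler's behavior.
    protected byte[] readFromPeer(String blockId) throws IOException {
        throw new IOException("connection to replica holding " + blockId + " refused");
    }

    // Minimal self-contained test of the error-handling code.
    public static void main(String[] args) {
        BlockFetcher fetcher = new BlockFetcher();

        // The buggy handler makes the failure invisible to the caller.
        byte[] data = fetcher.fetchSwallowing("blk_0001");
        System.out.println("swallowing handler returned " + data.length
                + " bytes and no error");

        // The corrected handler propagates the failure after retrying.
        try {
            fetcher.fetchWithRetry("blk_0001", 3);
            System.out.println("TEST FAILED: error was not propagated");
        } catch (IOException expected) {
            System.out.println("TEST PASSED: failure propagated after retries");
        }
    }
}
```

Note that the test needs only a single process and a deterministically failing stub; the point is simply that the error path gets executed at least once before production executes it for you.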

Here is a link to the full paper: (Link)

Bonus information:

  • The authors developed a tool called Aspirator, a static checker that flags trivial bug patterns in the exception handlers of Java and other JVM-compatible programs (illustrated in the snippet below).
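
As an illustration of what such a checker looks for, the hypothetical Java snippet below collects the kinds of trivial handler patterns described in the paper: an empty catch block, a handler containing only a placeholder comment, and an overly general catch that aborts the process. The class and method names are illustrative only.

```java
import java.io.IOException;

// Hypothetical snippets showing trivial exception-handler bug patterns
// that a checker like Aspirator is designed to flag.
public class HandlerPatterns {

    // Pattern: empty handler -- the signaled error silently disappears.
    static void emptyHandler() {
        try {
            openRegionFile();
        } catch (IOException e) {
            // (nothing here)
        }
    }

    // Pattern: handler that is only a placeholder comment.
    static void todoHandler() {
        try {
            openRegionFile();
        } catch (IOException e) {
            // TODO: figure out what to do here
        }
    }

    // Pattern: overly general catch that aborts the process, turning a
    // recoverable local error into a node-wide outage.
    static void abortOnOverCatch() {
        try {
            openRegionFile();
        } catch (Throwable t) {
            System.exit(1);
        }
    }

    // Stand-in for an operation that signals a non-fatal error explicitly.
    static void openRegionFile() throws IOException {
        throw new IOException("region file not found");
    }

    public static void main(String[] args) {
        // Both handlers below swallow the signaled error: nothing is printed
        // and the caller has no way to know the operation failed.
        emptyHandler();
        todoHandler();
        System.out.println("both failing operations returned without any sign of error");
        // abortOnOverCatch() is intentionally not called: it would exit the JVM.
    }
}
```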

 
