What approach would we take to test or validate a distributed system?


When distributed systems fail, it’s usually because of one of the following reasons:

  • Concurrency bugs
  • Inadequate failure handling
  • Incorrect input validation (usually leading to security bugs) or flawed threat models

Distributed systems typically don’t have global locking, so concurrent modification of state is more likely to occur. Distributed systems are usually not “fail-stop” so hardware failures and network partitions must be tolerated. Both of these are similar to problems that occur in non-distributed systems too, and the techniques for test and validation are similar.
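
To make the concurrency point concrete, here is a minimal sketch in Python of the lost-update race that unsynchronized state modification invites. The threads stand in for nodes and the sleep call stands in for network or disk latency; the counts are illustrative.

```python
import threading
import time

# Two "nodes" applying read-modify-write updates to shared state with no
# global lock. The sleep(0) stands in for network or disk latency between
# the read and the write, which is exactly where a concurrent update is lost.
counter = 0

def apply_updates(n):
    global counter
    for _ in range(n):
        current = counter       # read the current state
        time.sleep(0)           # yield: another node may update here
        counter = current + 1   # write back a possibly stale value

threads = [threading.Thread(target=apply_updates, args=(1000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # usually well below 2000: updates were silently lost
```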

Model Checking and Proofs

It pains me to say this, but human-written systems proofs are basically useless. They’re almost always wrong. I cover some cases where this has occurred in my talk “Never Trust Any Published Algorithm”.

However, we can test for correctness of the system design if we can model it and identify its invariants. Languages and tools for doing so include model checkers and verification systems such as Spin and KeY.

A limitation of some of these systems is that they do not work directly with the implementation code (though KeY and Spin can do so), so the model may not accurately reflect bugs in the implementation. The model checkers can find cases where the invariants fail, but cannot prove the implementation correct. Fortunately (?) most real systems have bugs that can be expressed in terms of small use cases.
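
As a toy illustration of what a model checker does (real tools such as Spin explore far larger state spaces, far more efficiently), the sketch below enumerates every interleaving of two read-then-write transactions on a shared counter and reports the ones that break the invariant that no increment is lost.

```python
# A toy explicit-state "model checker": enumerate every interleaving of two
# read-then-write transactions on a shared counter and check the invariant
# that neither increment is lost.

def interleavings(a, b):
    """Yield all interleavings of sequences a and b, preserving each order."""
    if not a:
        yield list(b)
    elif not b:
        yield list(a)
    else:
        for rest in interleavings(a[1:], b):
            yield [a[0]] + rest
        for rest in interleavings(a, b[1:]):
            yield [b[0]] + rest

def run(trace):
    """Execute a trace of (process, op) steps against a tiny state machine."""
    counter, local = 0, {}
    for proc, op in trace:
        if op == "read":
            local[proc] = counter
        else:  # "write"
            counter = local[proc] + 1
    return counter

p1 = [("p1", "read"), ("p1", "write")]
p2 = [("p2", "read"), ("p2", "write")]
for trace in interleavings(p1, p2):
    if run(trace) != 2:  # invariant: both increments must survive
        print("invariant violated by interleaving:", trace)
```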

Failure Injection and Fuzzing

To test the real implementation, it’s necessary to define what its correct behavior looks like, and then try to break that behavior. So again we need some invariant, like linearizability in a database, or system availability in a web service, or “not crashing”. Then we can “throw a lot of spaghetti at the wall and see what sticks” by exploring the behavior of the system when we inject different types of inputs or failures. Chaos Engineering tries to formalize this into a set of practices around observability and testing in production.

  • Netflix/SimianArmy is a suite of tools that induce failures or check for unusual behavior in a distributed system
  • Jepsen is a library for configuring and applying load to a distributed system, inducing failures and network partitions, and looking for inconsistencies in the results. It has been quite successful at demonstrating failures in distributed databases that violate the guarantees made in the documentation.
  • Fault injection libraries like libfiu wrap system calls and cause functions to return errors in predictable, repeatable ways.
  • The code itself can have instrumentation like “crash points”, which deliberately crash the system in critical regions, or “delay points”, which increase the time the code spends performing an operation and so raise the chance of hitting a race condition. (A sketch combining injected errors and delays appears after this list.)
  • Tools like netem or Linux Traffic Control can be used to artificially induce network latency, packet loss, or duplication. (There are also various benchmarking systems which can be used to generate large numbers of HTTP requests against a test system.)
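
As an illustration of the fault-injection and delay-point style described above, here is a homemade sketch rather than libfiu’s actual interface: a wrapper that, driven by a seeded random generator, makes an operation fail or stall in a repeatable way. The names, rates, and the write_record example are hypothetical.

```python
import random
import time
from functools import wraps

# Sketch of the fault-injection idea (libfiu and similar tools work at the
# system-call level): wrap an operation so that, driven by a seeded random
# generator, it sometimes raises an error or sleeps, in a repeatable way.

def inject_faults(error_rate=0.1, delay_rate=0.1, delay_s=0.5, seed=42):
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            roll = rng.random()
            if roll < error_rate:
                raise OSError(f"injected fault in {fn.__name__}")
            if roll < error_rate + delay_rate:
                time.sleep(delay_s)  # a "delay point": widens race windows
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.2)
def write_record(key, value):
    # stand-in for a real storage or network call
    return True
```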

Test systems which work mainly on single-node code can still be useful for finding classes of error common in networked systems, such as security vulnerabilities.

  • libFuzzer and american fuzzy lop are tools for discovering test cases that maximize code coverage, thereby reaching corner cases that trigger failures such as assertion violations and buffer overruns.
  • Manticore and other symbolic execution engines build a “virtual machine” on which the program executes that has additional capabilities to help determine correctness. These systems try to provide some of the benefits of model checking while running on the real code.
  • Hypothesis or QuickCheck perform “property-based testing” by trying a lot of different generated test cases, then finding a minimal example which violates the properties you specified. (A small example follows this list.)
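
Here is a small example in the Hypothesis style. The last-writer-wins merge function is hypothetical, not taken from any real library; the point is the structure the tool provides: generate many inputs, assert the stated properties, and shrink any failure to a minimal counterexample.

```python
from hypothesis import given, strategies as st

# The merge function models a hypothetical last-writer-wins register as a
# (timestamp, value) pair; it is not code from any real library.

def lww_merge(a, b):
    """Keep the newer write; ties are broken by value so the result is deterministic."""
    return max(a, b)

register = st.tuples(st.integers(min_value=0), st.text())

@given(register, register, register)
def test_merge_properties(a, b, c):
    assert lww_merge(a, b) == lww_merge(b, a)                              # commutative
    assert lww_merge(a, lww_merge(b, c)) == lww_merge(lww_merge(a, b), c)  # associative
    assert lww_merge(a, a) == a                                            # idempotent

if __name__ == "__main__":
    test_merge_properties()  # Hypothesis generates the arguments itself
```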

Monitoring and Redundancy

In addition to testing in the lab, most large distributed systems are tested in production as well. Monitoring and tracing capabilities let the system’s designers and operators understand if it is working as expected. Redundant design ensures that a request gets answered even if a software bug or component failure prevents the original server from answering.

  • Systems like Honeycomb can assemble measurements taken in a distributed system into a larger picture of system health, and highlight unusual activity.
  • Rolling out a new version of the software can be done in an incremental fashion, so that the service remains up while the new code is exercised by real users; the decision step of such a canary-style rollout is sketched below.
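
As a sketch of that decision step, where the error-rate margin and the (errors, requests) counts are placeholders for numbers a real monitoring system would supply:

```python
# Decide whether to widen a canary rollout by comparing error rates between
# the new version and the current fleet. The margin and counts are illustrative.

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def should_continue_rollout(stable: tuple, canary: tuple, margin: float = 0.005) -> bool:
    """Return True if the canary's error rate is within `margin` of stable's."""
    return error_rate(*canary) <= error_rate(*stable) + margin

# Example: the stable fleet served 100000 requests with 120 errors; the canary
# served 2000 requests with 30 errors, which is too many, so pause and roll back.
print(should_continue_rollout(stable=(120, 100_000), canary=(30, 2_000)))  # False
```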

All of the above

Robust development tends to use several of these approaches, as well as human-centered ones such as code, design, and test plan reviews. These tools and techniques give multiple overlapping ways to find errors. For example, a concurrency bug can be found by:

  • a model checker finding a problematic sequence of operations violating the model constraints
  • a symbolic execution engine detecting a potential race when the code is run
  • a property-based tester exploring different interleavings of operations
  • a test harness exposing the inconsistency by injecting delay and failure
  • a monitoring system observing the system in production and highlighting inconsistent behavior

All of these techniques start with some definition, even if it’s just “don’t crash”, of what the correct behavior actually is. That’s often the trickiest part of any testing effort; as the system becomes more complex and gains more features, the interactions between them become harder to describe precisely.

Originally answered on Quora: https://www.quora.com/What-approach-would-we-take-to-test-or-validate-a-distributed-system/answer/Mark-Gritter

The postings on this site are my own and do not represent DataDirect Networks’ positions, strategies, or opinions.
