A new discipline is easiest to understand when it can be explained simply and tied to something meaningful. That holds for engineering in general and for software development in particular. Complex systems, however carefully structured, still need to be put to the test. By experimenting on live deployments of a software product, we can detect weaknesses before they manifest as system-wide outages.
Chaos Engineering is this proactive approach: stress testing a system’s capability to withstand turbulent conditions in order to detect possible failure points. However complex a company’s infrastructure may be, the principles of managed, controlled chaos help build confidence in solutions that are constantly being transformed and scaled.
As more and more software companies rely on the availability of their services to serve millions of active users, even the smallest outage can be costly. (Just think about what happens when critical services like transportation, energy or medical systems go down.)
When software applications moved to distributed cloud architectures, this challenge became even greater, with service providers worldwide racing to avoid downtime. These providers can usually quantify the cost of a production incident, which depends on the business they serve.
In 2010, Netflix’s engineering team created an in-house solution named Chaos Monkey to build confidence during the move from physical infrastructure to a cloud-based deployment.
As mentioned in the Chaos Engineering – System Resiliency in Practice book, “this simple app would go through a list of clusters, pick one instance at random from each cluster, and at some point during business hours, turn it off without warning. It would do this every workday. The solution of the engineering teams was up to them, but this provided a very good approach to the problem of vanishing instances that AWS had back then.”
In 2011, the Netflix team added further failure modes and the Simian Army was born. It was meant to keep the cloud safe, secure and highly available by diversifying the types of failure scenarios.
After Netflix open sourced the Chaos Monkey code in 2012, many companies adopted the practice, and dedicated chaos engineer roles began to appear.
Later, the project was scaled up into Chaos Kong, which supports failing entire AWS regions and allowed the Netflix team to shorten their failover run-book from 50 minutes to just six minutes.
In 2015, the discipline was officially formalized and a manifesto was defined with a few key principles listed here: https://principlesofchaos.org.
Since then, the field has improved considerably and many products have appeared that help companies handle chaos. Many large companies have adopted the practice as part of their day-to-day processes, using open source projects, licensed products or platforms as a service.
The goal of chaos engineering is to identify weaknesses before they manifest in system-wide, aberrant behaviours. Systemic weaknesses could take the form of:
• improper fallback settings when a service is unavailable
• retry storms from improperly tuned timeouts
• outages when a downstream dependency receives too much traffic
• cascading failures when a single point of failure crashes
The most significant weaknesses must be addressed proactively, before they affect customers in production. Doing so lets teams take advantage of increasing flexibility and resources, and have confidence in their production deployments despite the complexity they represent.
Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase the flexibility of development and the velocity of deployment, but how much confidence can we have in the complex systems that we put into production?
As the same book puts it, “even when all the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.
Because Chaos Engineering was born from complex system problems, it is essential that the discipline works with experimentation as opposed to testing.”
A few steps to follow
1. Begin by defining a baseline, or steady state, as some quantifiable output of your system that indicates expected behaviour.
2. Form a hypothesis that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect plausible real-world events such as hardware crashes, network issues, or memory and disk resources being starved.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
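The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the steady-state metric is mean request latency, the 10% tolerance is arbitrary, and `healthy_latency_ms` and `degraded_latency_ms` are hypothetical stand-ins for measurements that would come from a real system.

```python
import random
import statistics
from typing import Callable, Dict


def steady_state(sample: Callable[[], float], n: int) -> float:
    """Step 1: summarise the system's output as one measurable number."""
    return statistics.mean(sample() for _ in range(n))


def run_experiment(control: Callable[[], float],
                   experiment: Callable[[], float],
                   n: int = 1000,
                   tolerance: float = 0.10) -> Dict[str, object]:
    """Steps 2 and 4: assume steady state holds, then try to disprove it."""
    baseline = steady_state(control, n)      # control group
    observed = steady_state(experiment, n)   # experimental group
    delta = abs(observed - baseline) / baseline
    return {
        "baseline": baseline,
        "observed": observed,
        "delta": delta,
        "hypothesis_holds": delta <= tolerance,
    }


# Hypothetical stand-ins for real measurements (assumptions, not a real API).
def healthy_latency_ms(rng: random.Random) -> float:
    return rng.gauss(100.0, 5.0)


def degraded_latency_ms(rng: random.Random) -> float:
    # Step 3: a real-world event, here 30% of calls hit a slow dependency.
    extra = 50.0 if rng.random() < 0.3 else 0.0
    return healthy_latency_ms(rng) + extra


rng = random.Random(42)
result = run_experiment(lambda: healthy_latency_ms(rng),
                        lambda: degraded_latency_ms(rng))
print(result["hypothesis_holds"], round(result["delta"], 3))
```

With these made-up numbers the injected delay raises mean latency by roughly 15%, so the hypothesis is disproved, which is exactly the kind of finding an experiment is meant to surface.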
A few principles to follow
As presented in “Chaos Engineering – System Resiliency in Practice”:
• Build a hypothesis around steady-state behaviour
Every experiment begins with a hypothesis. The advanced principles emphasize building the hypothesis around a steady-state definition. This means focusing on the way the system is expected to behave, and capturing that in a measurement.
This focus on steady state forces engineers to step back from the code and focus on the holistic output. It captures Chaos Engineering’s bias toward verification over validation. We often have an urge to dive into a problem, find the “root cause” of a behaviour, and try to understand a system via reductionism. Doing a deep dive can help with exploration, but it is a distraction from the best learning that Chaos Engineering can offer. At its best, Chaos Engineering is focused on key performance indicators (KPIs) or other metrics that track with clear business priorities, and those make for the best steady-state definitions.
• Vary real-world events
This advanced principle states that the variables in experiments should reflect real-world events. While this might seem obvious in hindsight, there are two good reasons for explicitly calling this out:
- Variables are often chosen for what is easy to do rather than what provides the most learning value.
- Engineers have a tendency to focus on variables that reflect their experience rather than the users’ experience.
• Run experiments in production
Experimentation teaches you about the system you are studying. If you are experimenting on a Staging environment, then you are building confidence in that specific environment. To the extent that the Staging and Production environments differ, often in ways that a human cannot predict, you are not building confidence in the environment that you really care about. For this reason, the most advanced Chaos Engineering takes place in Production.
This principle is not without controversy. In some fields there are regulatory requirements that preclude the possibility of affecting the Production systems. In some situations there are insurmountable technical barriers to running experiments in Production. It is important to remember that the point of Chaos Engineering is to uncover the chaos inherent in complex systems, not to cause it. If we know that an experiment is going to generate an undesirable outcome, then we should not run that experiment. This is especially important guidance in a Production environment where the repercussions of a disproved hypothesis can be high.
As an advanced principle, there is no all-or-nothing value proposition to running experiments in Production. In most situations, it makes sense to start experimenting on a Staging system, and gradually move over to Production. Of course this also raises the need to have a solid rollback plan and potentially some controlled dry run exercises.
• Automate experiments to run continuously
This principle recognizes a practical implication of working on complex systems. Automation has to be brought in for two reasons:
- To cover a larger set of experiments than humans can cover manually. In complex systems, the conditions that could possibly contribute to an incident are so numerous that they can’t be planned for. In fact, they can’t even be counted because they are unknown in advance.
- To empirically verify our assumptions over time, as unknown parts of the system are changed. Imagine a system where the functionality of a given component relies on other components outside of the scope of the primary operators. This is the case in almost all complex systems. Without tight coupling between the given functionality and all the dependencies, it is entirely possible that one of the dependencies will change in such a way that it creates a vulnerability. Continuous experimentation provided by automation can catch these issues and teach the primary operators about how the operation of their own system is changing over time. This could be a change in performance or a change in functionality (e.g., the response bodies of downstream services are including extra information that could impact how they are parsed) or a change in human expectations (e.g., the original engineers leave the team, and the new operators are not as familiar with the code).
Automation itself can have unintended consequences. The idea is to make these experiments part of the CI/CD pipelines where possible, run either independently or at release time.
• Minimize blast radius
This final advanced principle was added to “The Principles” after the Chaos Team at Netflix found that they could significantly reduce the risk to Production traffic by engineering safer ways to run experiments. By using a tightly orchestrated control group to compare with a variable group, experiments can be constructed in such a way that the impact of a disproved hypothesis on customer traffic in Production is minimal.
How a team goes about achieving this is highly context-sensitive to the complex system at hand. In some systems it may mean using shadow traffic; or excluding requests that have high business impact like transactions over $100; or implementing automated retry logic for requests in the experiment that fail. In the case of the Chaos Team’s work at Netflix, sampling of requests, sticky sessions, and similar functions were added into the Chaos Automation Platform. These techniques not only limited the blast radius; they had the added benefit of strengthening signal detection, since the metrics of a small variable group can often stand out starkly in contrast to a small control group. However it is achieved, this advanced principle emphasizes that in truly sophisticated implementations of Chaos Engineering, the potential impact of an experiment can be limited by design.
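One way to picture request sampling with a control group is the sketch below. It is not the Chaos Automation Platform's implementation; `assign_group`, its parameters and the group names are hypothetical. It assumes each request carries a stable identifier, hashes that identifier so the same request or session always lands in the same group (loosely mirroring sticky sessions), and excludes transactions over $100.

```python
import hashlib


def assign_group(request_id: str,
                 amount_usd: float,
                 sample_percent: float = 1.0,
                 max_amount_usd: float = 100.0) -> str:
    """Return 'excluded', 'variable', 'control' or 'untouched' for a request."""
    # High business impact requests never take part in the experiment.
    if amount_usd > max_amount_usd:
        return "excluded"
    # Deterministic bucket in [0, 10000): the same id always maps to the same group.
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    threshold = int(sample_percent * 100)  # percent expressed in basis points
    if bucket < threshold:
        return "variable"      # fault is injected for these requests
    if bucket < 2 * threshold:
        return "control"       # same size, no fault, used for comparison
    return "untouched"         # the rest of production traffic


print(assign_group("session-1234", amount_usd=20.0))
print(assign_group("session-1234", amount_usd=250.0))
```

Keeping the variable and control groups equally small is what lets a difference between them stand out against the rest of the traffic.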
All of these advanced principles are presented to guide and inspire, not to dictate.
Types of attacks
Network attacks can take multiple forms, from partially cutting off the network to simulating Denial of Service (DoS) attacks. A very common one is the latency attack, which induces artificial delays in the communication layer to simulate service degradation and measures whether upstream services respond appropriately; with very large delays it can even simulate the downtime of an entire service. This is particularly useful when testing the fault tolerance of a new service by simulating the failure of its dependencies, without making these dependencies unavailable to the rest of the system. DNS attacks are also used; they block all outgoing traffic over the standard DNS port 53.
Other approaches introduce packet loss in a controlled manner, either by dropping a specific number of IP packets or by randomly dropping some of them at the transport layer, targeted by port and host.
Black hole attacks cause total downtime of a specific service or of communication on a specific port.
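The network attacks described above can be imitated at application level by wrapping the call to a dependency. The sketch below is only that, an in-process imitation: real tools typically act on the network itself. The names `with_network_faults` and `InjectedFault` are hypothetical, and "packet loss" is modelled as a whole call failing with some probability.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(ConnectionError):
    """Raised when the wrapper simulates a dropped or black-holed call."""


def with_network_faults(call: Callable[..., Any],
                        *,
                        latency_s: float = 0.0,
                        loss_rate: float = 0.0,
                        black_hole: bool = False,
                        rng: Optional[random.Random] = None,
                        sleep: Callable[[float], None] = time.sleep
                        ) -> Callable[..., Any]:
    """Wrap a dependency call with latency, loss or black hole behaviour."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if black_hole:
            # Total downtime: the dependency is never reached.
            raise InjectedFault("black hole: dependency unreachable")
        if rng.random() < loss_rate:
            # Controlled loss: a fraction of calls is dropped at random.
            raise InjectedFault("simulated packet loss")
        if latency_s > 0:
            # Latency attack: artificial delay before the real call.
            sleep(latency_s)
        return call(*args, **kwargs)

    return wrapped


def fetch_price(item: str) -> float:   # hypothetical dependency
    return 9.99


slow_fetch = with_network_faults(fetch_price, latency_s=0.05)
print(slow_fetch("book"))
```

Only the service under test sees the wrapped call, so the dependency itself stays available to the rest of the system, as the text requires.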
Many types of scenarios in chaos engineering can target behaviour related to CPU, memory, IO or disk availability.
The CPU attack generates high load for one or more CPU cores. It does this by running expensive arithmetic operations across a number of threads (one for each core targeted).
For example, an experiment can hold CPU utilisation at 98% while the performance of the apps or services is assessed.
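A minimal sketch of such a CPU attack follows. Two assumptions are worth flagging: it uses one process per targeted core rather than one thread, because CPython threads share a global interpreter lock, and it simply drives each targeted core as hard as it can rather than holding a precise figure such as 98%. The function names `burn` and `cpu_attack` are hypothetical.

```python
import multiprocessing
import time


def burn(seconds: float) -> int:
    """Keep one core busy with arithmetic until the deadline passes."""
    deadline = time.monotonic() + seconds
    iterations = 0
    x = 1.0
    while time.monotonic() < deadline:
        x = (x * 1.0000001 + 1.0) % 1000.0   # throwaway arithmetic
        iterations += 1
    return iterations


def cpu_attack(cores: int, seconds: float) -> None:
    """Start one busy worker per targeted core and wait for all of them."""
    workers = [multiprocessing.Process(target=burn, args=(seconds,))
               for _ in range(cores)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

A call such as `cpu_attack(cores=2, seconds=30.0)` would load two cores for thirty seconds; on platforms that start processes by spawning, it needs to sit under an `if __name__ == "__main__":` guard.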
The memory type of attack consumes a set amount of memory, or as much as is available (whichever is lower), and holds onto it for the duration of the experiment. It allocates blocks of memory until it reaches the desired amount, deallocating the memory upon completion.
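The allocate, hold and release cycle can be sketched as below. The function `memory_attack` and its parameters are hypothetical; in particular, `available_bytes` is passed in by the caller here, whereas a real tool would query the operating system for it.

```python
import time
from typing import Callable, List


def memory_attack(target_bytes: int,
                  available_bytes: int,
                  hold_seconds: float,
                  block_size: int = 1024 * 1024,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Allocate min(target, available) bytes, hold them, then release."""
    goal = min(target_bytes, available_bytes)   # whichever is lower
    blocks: List[bytearray] = []
    allocated = 0
    while allocated < goal:
        size = min(block_size, goal - allocated)
        # Non-zero fill so the memory is written to, not just reserved.
        blocks.append(bytearray(b"\x01") * size)
        allocated += size
    sleep(hold_seconds)    # hold the memory for the experiment's duration
    blocks.clear()         # drop references so the memory can be freed
    return allocated


print(memory_attack(8 * 1024 * 1024, 4 * 1024 * 1024, hold_seconds=0.1))
```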
The disk attack fills up a target block device (specified by the filesystem on which it is mounted) by writing to multiple files. This way the disk is filled up and little storage space is left for usage.
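As a rough sketch of the same idea, the code below writes several files into a directory until a requested number of bytes has been consumed. It is an assumption-laden simplification: `disk_attack` and `cleanup` are hypothetical names, the target is a directory and a byte count rather than a percentage of a block device, and the demo writes only a small amount into a temporary directory.

```python
import os
import tempfile
from typing import List


def disk_attack(directory: str,
                total_bytes: int,
                file_size: int = 1024 * 1024,
                chunk: int = 64 * 1024) -> List[str]:
    """Write files into `directory` until `total_bytes` have been written."""
    paths: List[str] = []
    written = 0
    block = b"\0" * chunk
    while written < total_bytes:
        this_file = min(file_size, total_bytes - written)
        path = os.path.join(directory, f"chaos_fill_{len(paths):04d}.bin")
        with open(path, "wb") as handle:
            remaining = this_file
            while remaining > 0:
                n = min(chunk, remaining)
                handle.write(block[:n])
                remaining -= n
            handle.flush()
            os.fsync(handle.fileno())   # push the data to the device
        paths.append(path)
        written += this_file
    return paths


def cleanup(paths: List[str]) -> None:
    """Remove the fill files once the experiment is over."""
    for path in paths:
        os.remove(path)


with tempfile.TemporaryDirectory() as demo_dir:
    created = disk_attack(demo_dir, total_bytes=300_000, file_size=128 * 1024)
    print(len(created), sum(os.path.getsize(p) for p in created))
    cleanup(created)
```

To size the fill relative to free space, the standard library's `shutil.disk_usage` could supply the starting figures.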
IO experiments generate large amounts of IO requests (read, write, or both) to filesystems that are mounted on block devices. They can run multiple requests in parallel by spawning multiple threads (workers), up to the number of available cores on the target machine.
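A small sketch of parallel IO load is shown below, with hypothetical names `io_worker` and `io_attack`. Each worker thread repeatedly writes a block, forces it to the device, and reads it back until the time is up. Note one limitation of this simplified version: the read-back is likely to be served from the operating system's cache rather than from the device.

```python
import os
import tempfile
import threading
import time
from typing import List, Optional


def io_worker(directory: str, worker_id: int, seconds: float,
              block_size: int, counts: List[int]) -> None:
    """Write, sync and read one file in a loop until the deadline."""
    path = os.path.join(directory, f"chaos_io_{worker_id}.bin")
    block = os.urandom(block_size)
    deadline = time.monotonic() + seconds
    ops = 0
    try:
        while time.monotonic() < deadline:
            with open(path, "wb") as handle:
                handle.write(block)
                handle.flush()
                os.fsync(handle.fileno())
            with open(path, "rb") as handle:
                handle.read()
            ops += 2
    finally:
        if os.path.exists(path):
            os.remove(path)
    counts[worker_id] = ops


def io_attack(directory: str, seconds: float,
              workers: Optional[int] = None, block_size: int = 4096) -> int:
    """Run IO workers in parallel, capped at the number of available cores."""
    max_workers = os.cpu_count() or 1
    n = min(workers or max_workers, max_workers)
    counts = [0] * n
    threads = [threading.Thread(target=io_worker,
                                args=(directory, i, seconds, block_size, counts))
               for i in range(n)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(counts)   # total read and write operations performed


with tempfile.TemporaryDirectory() as demo_dir:
    print(io_attack(demo_dir, seconds=0.2, workers=2) > 0)
```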
The process killer attack kills targeted processes at a specified interval throughout the length of the attack.
Shutdown issues a system call to shut down or reboot the operating system on which the target is running.
Conformity finds instances or services that don’t adhere to best practices or internal company configurations.
Health checks or monitors tap into the health checks that run on each instance and also monitor other external signs of health to detect unhealthy instances. Once unhealthy instances are detected, they are removed from service and, after the service owners have had time to root-cause the problem, eventually terminated.
Janitor ensures that our cloud environment is running free of clutter and waste. It searches for unused resources and disposes of them.
Security is an extension of Conformity Monkey. It finds security violations or vulnerabilities, such as improperly configured AWS security groups, and terminates the offending instances. It also ensures that all our SSL and DRM certificates are valid and are not coming up for renewal.
Localisation and internationalisation (l10n/i18n) checks detect configuration and runtime problems in instances serving customers in multiple geographic regions, using different languages and character sets.
Chaos Gorilla is similar to Chaos Monkey, but simulates an outage of an entire Amazon availability zone. We want to verify that our services automatically re-balance to the functional availability zones without user-visible impact or manual intervention.
Kubernetes is a complex framework for a complex job. Managing several containers can be complicated, and managing hundreds or thousands of them by hand is not humanly possible. Kubernetes makes highly available and highly scalable cloud applications a reality, and it usually does its job remarkably well. In recent years it has been the choice of many tech companies and has become a widely discussed topic in the industry.
In Kubernetes, chaos engineering targets scenarios inside a cluster, such as:
- pod failure – injects errors into the selected pods so that pod creation fails for a while; the selected pods are unavailable for the specified period
- pod kill – terminates the selected pods, causing them to be restarted or respawned
- container kill – terminates a specified container in the target pods
- network partition – blocks the communication between two pods
- network emulation – injects regular network faults, such as network delay, duplication, loss, and corruption
- kernel chaos – targets a specific pod, but might also impact the performance of other pods
- DNS chaos – simulates faulty DNS responses, such as a DNS error or a random IP address returned after a request is sent
- IO, memory, CPU, disk – experiments similar to the regular use cases also apply
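The pod kill scenario from the list above can be sketched as follows. The function `kill_random_pod` is hypothetical and takes the API object as a parameter; it is written against the `list_namespaced_pod` and `delete_namespaced_pod` methods of `CoreV1Api` in the official Kubernetes Python client, but to stay runnable without a cluster the example uses `FakeApi`, an in-memory stand-in that is purely an assumption of this sketch.

```python
import random
from types import SimpleNamespace
from typing import List, Optional


def kill_random_pod(api, namespace: str, label_selector: str,
                    rng: Optional[random.Random] = None) -> Optional[str]:
    """Delete one randomly chosen pod matching the selector; return its name."""
    rng = rng or random.Random()
    pods = api.list_namespaced_pod(namespace, label_selector=label_selector).items
    if not pods:
        return None                       # nothing to experiment on
    victim = rng.choice(pods).metadata.name
    api.delete_namespaced_pod(victim, namespace)
    return victim


class FakeApi:
    """In-memory stand-in for the cluster API (illustration only)."""

    def __init__(self, names: List[str]) -> None:
        self.names = list(names)

    def list_namespaced_pod(self, namespace: str, label_selector: str = ""):
        items = [SimpleNamespace(metadata=SimpleNamespace(name=n))
                 for n in self.names]
        return SimpleNamespace(items=items)

    def delete_namespaced_pod(self, name: str, namespace: str) -> None:
        self.names.remove(name)


api = FakeApi(["web-1", "web-2", "web-3"])
print(kill_random_pod(api, "default", "app=web"), api.names)
```

Against a real cluster the deleted pod would normally be recreated by its controller, and the experiment would then check that the steady-state metric stayed within tolerance while that happened.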
Of course, a Kubernetes deployment brings many challenges, and running chaos scenarios on one requires tools capable of handling such complexity and its many possible configurations.
Existing solutions for experimenting with chaos engineering range from open source libraries to platforms that offer it as a service. Gremlin is one well-known product offering failure as a service; you can sign up for a free account and start stress testing your app or infrastructure.
You can also go with an open source library that provides more basic functionalities or even fork an existing repository and custom tailor features and use cases as needed.
Ideally, such a tool evolves into part of the CI/CD pipeline, so that with every deployment the new changes are put through multiple chaos scenarios.
One of the most hyped directions in the last few years is the decentralised architecture of platforms and apps, relying on technologies like blockchain. It is very interesting to see how chaos engineering will be used in this context. The use cases will have to cover concepts like smart contracts, transaction validators, proof of stake, proof of work etc. At the same time it could uncover problems with performance or bottlenecks in writing to the blockchain under stress conditions.
“No one tells the story of the incident that didn’t happen.” John Allspaw
Intuitively, it makes sense that adding redundancy to a system makes it safer. However, redundancy alone does not make a system safer, and in many cases it makes a system more likely to fail. Consider the redundant O-rings on the solid rocket booster of the Space Shuttle Challenger. Because of the secondary O-ring, engineers working on the solid rocket booster normalized over time the failure of the primary O-ring, allowing the Challenger to operate outside of specification, which ultimately contributed to the catastrophic failure in 1986.
Resilience is created by people. The engineers who write the functionality, those who maintain the system, and even the management that allocates resources toward it are all part of a complex system. Tools can help. Tools don’t create resilience. People do. Chaos Engineering is an essential tool in our pursuit of resilient systems.
1. Casey Rosenthal & Nora Jones, Chaos Engineering – System Resiliency in Practice, O’Reilly Media, 2020
2. Gremlin online documentation: https://www.gremlin.com/chaos-monkey/for-engineers/