Controlled Chaos
At the end of 2021, Microsoft introduced Azure Chaos Studio. Chaos Studio is embedded in the Azure Portal and enables Chaos Engineering, which is the act of purposely injecting errors/faults into your resources to see how they respond, allowing you to test how resilient they are in a production outage. You basically break your Azure environment on purpose to see how well it holds up.
NOTE: Azure Chaos Studio is not a simulation of any kind and will inject the errors/faults into your selected resources for the specified duration in your experiment. This means it will cause actual outages on your services and/or applications.
Azure Chaos Studio
Azure Chaos Studio can be found in the Azure Portal and lets you test different scenarios. For example, you can see whether your application fails over to another region or availability zone when it is made unavailable, or add stress to your application to see whether it automatically scales up as intended. With the results you can improve your application and make it more resilient in general.
Now that you know what Azure Chaos Studio is meant for, let's have a look at what we can do with it!
Targets
Within the Targets tab you can onboard the Azure resources that are currently supported and that you want to run your tests against. To onboard a resource, select it and click on the Enable targets button.
This will give two options: Enable service-direct targets, which works for all supported resources, or Enable agent-based targets, which is specifically for Virtual Machines and Virtual Machine Scale Sets.
Once you have onboarded the resources you want to test, you will see that Managed actions underneath the manageFault column becomes blue and available to click on.
When you click on it, you will get a detailed overview in which you can allow specific tests from the Fault Library. Since all tests run against live resources, you might want to allow stress tests but not abrupt shutdowns of your application, so that business continuity is still guaranteed.
In my case, CosmosDB only has the Fault available to test its failover, so I'll keep this enabled.
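Everything done in the Targets tab can also be done outside the Portal, because a target and each allowed Fault (a "capability") are child resources of the resource you onboard. The following is only a minimal sketch of how the two REST URLs could be put together. The target type name `Microsoft-CosmosDB`, the capability name `Failover-1.0`, the preview API version and the subscription, resource group and account names are assumptions on my side, so check them against the current documentation before using them.

```python
# Sketch: build the ARM REST URLs that onboard a resource as a Chaos Studio
# target and allow one Fault (capability) on it. Nothing is sent here.
ARM = "https://management.azure.com"
API_VERSION = "2021-09-15-preview"  # assumption: preview API version


def target_url(resource_id: str, target_type: str) -> str:
    """URL of the target child resource that marks a resource as onboarded."""
    return (
        f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
        f"{target_type}?api-version={API_VERSION}"
    )


def capability_url(resource_id: str, target_type: str, capability: str) -> str:
    """URL of the capability child resource that allows a single Fault."""
    return (
        f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
        f"{target_type}/capabilities/{capability}?api-version={API_VERSION}"
    )


# Hypothetical CosmosDB account, used for illustration only.
cosmos_id = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/rg-demo/providers/Microsoft.DocumentDB"
    "/databaseAccounts/cosmos-demo"
)
print(target_url(cosmos_id, "Microsoft-CosmosDB"))
print(capability_url(cosmos_id, "Microsoft-CosmosDB", "Failover-1.0"))
```

Both URLs would be called with an HTTP PUT and an authenticated session, which is left out here on purpose.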
Experiments
In the Experiments tab, you can create various chaos Experiments to test the resilience of your applications. You can do so by clicking on the + Create button or with Infrastructure as Code like ARM templates, Bicep files, etc., since an Azure Chaos Studio Experiment is a resource in itself.
When experimenting via the Azure Portal, you will be asked to give your Experiment a proper name and location. I'm only testing for CosmosDB, but if you are testing all services for the specific workload of an application, you can give it a more generic name.
Next up is the Experiment Designer, where you can create all the scenarios you want to experiment with. A scenario is built out of Steps, and each Step can have multiple Branches, which in turn contain the Actions. Clicking on the + Add action button will give you the option to add a Fault or a Delay. Delays allow you to wait a given number of minutes before you continue, while Faults cause the actual damage.
All the Faults currently available come from the Fault Library. You can select the Fault needed for your Experiment. In my case, this would still be the CosmosDB Failover, for which you can specify how long this Fault needs to run. In the case of CosmosDB, you can add a readRegion.
Continue by selecting the resource itself and click on Add. When you're happy with your test scenario, click on Review + create to finish the creation of the Experiment.
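Since an Experiment is a resource in itself, the Steps, Branches and Actions from the Experiment Designer end up as a nested definition. Below is a rough sketch of what that definition could look like for my CosmosDB failover, written as a Python dictionary that can be dumped to JSON. The Fault and Delay identifiers (`urn:csci:...`), the property names, the region and the resource names are assumptions based on the preview schema as I understand it, not something to copy blindly.

```python
import json


def minutes(n: int) -> str:
    """ISO 8601 duration for n minutes, e.g. 10 -> 'PT10M'."""
    return f"PT{n}M"


def delay_action(wait_minutes: int) -> dict:
    """A Delay: wait before the next action in the same branch starts."""
    return {
        "type": "delay",
        "name": "urn:csci:microsoft:chaosStudio:timedDelay/1.0",  # assumed URN
        "duration": minutes(wait_minutes),
    }


def cosmos_failover_action(run_minutes: int, read_region: str, selector_id: str) -> dict:
    """A Fault: fail CosmosDB over to read_region for run_minutes."""
    return {
        "type": "continuous",
        "name": "urn:csci:microsoft:cosmosDB:failover/1.0",  # assumed URN
        "duration": minutes(run_minutes),
        "parameters": [{"key": "readRegion", "value": read_region}],
        "selectorId": selector_id,
    }


def build_experiment(location: str, cosmos_resource_id: str,
                     read_region: str, run_minutes: int = 10) -> dict:
    """One Step with one Branch that runs a single failover Fault."""
    selector_id = "Selector1"
    target_id = f"{cosmos_resource_id}/providers/Microsoft.Chaos/targets/Microsoft-CosmosDB"
    return {
        "location": location,
        "identity": {"type": "SystemAssigned"},
        "properties": {
            # The selector is the "select the resource" part of the designer.
            "selectors": [{
                "type": "List",
                "id": selector_id,
                "targets": [{"type": "ChaosTarget", "id": target_id}],
            }],
            "steps": [{
                "name": "Step 1",
                "branches": [{
                    "name": "Branch 1",
                    "actions": [cosmos_failover_action(run_minutes, read_region, selector_id)],
                }],
            }],
        },
    }


# Hypothetical account and region, used for illustration only.
experiment = build_experiment(
    "westeurope",
    "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-demo"
    "/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-demo",
    read_region="West Europe",
)
print(json.dumps(experiment, indent=2))
```

A `delay_action(5)` could be placed in front of the Fault in the same `actions` list to wait five minutes before the failover starts.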
Running an Experiment
When your Experiment has been created and is ready to run, look it up in the list of Experiments in the Experiments tab. Click on it, followed by clicking on Start. You will be prompted with the question: You are about to start a chaos experiment that could impact subscription resources or cause serious outages. Are you sure you want to proceed?
Click on OK and off you go!
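The Start button should boil down to a single POST call against the Experiment resource, which is handy if you want to kick off an Experiment from a pipeline. Here is a small sketch that only prepares that request without sending it. The `/start` path and the API version are assumptions taken from the preview REST API, and the subscription, resource group, Experiment name and token are made up.

```python
import urllib.request


def start_request(subscription_id: str, resource_group: str,
                  experiment_name: str, bearer_token: str,
                  api_version: str = "2021-09-15-preview") -> urllib.request.Request:
    """Prepare (but do not send) the POST that starts a chaos Experiment."""
    url = (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Chaos/experiments/{experiment_name}"
        f"/start?api-version={api_version}"
    )
    return urllib.request.Request(
        url,
        data=b"",  # the start call carries no body
        method="POST",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


# Hypothetical values, used for illustration only.
req = start_request(
    "00000000-0000-0000-0000-000000000000",
    "rg-demo",
    "cosmos-failover",
    bearer_token="<access token>",
)
print(req.get_method(), req.full_url)
```

Sending it with `urllib.request.urlopen(req)` would start the Experiment for real, with the same consequences as clicking OK in the Portal.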
NOTE: Since Azure Chaos Studio is still in Preview, it will not always work right out of the box. Experiments work with Managed Identities, which should automatically get assigned to the resources you selected during the creation of the Experiment, but this is not always the case. When this happens, you will need to add the Experiment itself as a role assignment via the Access Control (IAM) of those resources. A Contributor role will do the trick for now since there are no specific Chaos roles yet.
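If you would rather script the role assignment from the note than click through Access Control (IAM), it comes down to one PUT on the resource you are testing. The sketch below only builds the URL and the request body. The role definition ID is the built-in Contributor role; the API version, the scope and the principal ID (the object ID of the Experiment's Managed Identity) are assumptions or placeholders.

```python
import uuid

# Built-in Contributor role definition ID (the same in every tenant).
CONTRIBUTOR_ROLE_ID = "b24988ac-6180-42a0-ab88-20f7382dd24c"


def contributor_assignment(scope: str, subscription_id: str, principal_id: str,
                           assignment_name: str = "") -> tuple:
    """Return (url, body) for assigning Contributor on scope to principal_id."""
    # Role assignment names are GUIDs; generate one if none was given.
    name = assignment_name or str(uuid.uuid4())
    url = (
        f"https://management.azure.com{scope}"
        f"/providers/Microsoft.Authorization/roleAssignments/{name}"
        "?api-version=2022-04-01"  # assumption: a recent stable API version
    )
    body = {
        "properties": {
            "roleDefinitionId": (
                f"/subscriptions/{subscription_id}/providers"
                f"/Microsoft.Authorization/roleDefinitions/{CONTRIBUTOR_ROLE_ID}"
            ),
            "principalId": principal_id,
            # A Managed Identity is a service principal under the hood.
            "principalType": "ServicePrincipal",
        }
    }
    return url, body


# Hypothetical scope and principal, used for illustration only.
sub = "00000000-0000-0000-0000-000000000000"
scope = (
    f"/subscriptions/{sub}/resourceGroups/rg-demo"
    "/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-demo"
)
url, body = contributor_assignment(scope, sub, "11111111-1111-1111-1111-111111111111")
print(url)
print(body)
```

Scoping the assignment to the single CosmosDB account, instead of the whole resource group or subscription, keeps the blast radius of the Experiment's identity as small as possible.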
Before I ran my Experiment, my CosmosDB looked as follows:
While running the Experiment, the failover became active and switched to the other region, allowing the database to still be used and still assuring business continuity.
And after 10 minutes the Experiment was successfully completed and the CosmosDB switched back to its original region. Azure Chaos Studio has a lot of potential for Cloud Engineers or old-school Testers to make sure everything keeps working in case of an outage.
What's Next?
Currently I'm swamped in projects and things that still need to be finished. Too much to do, in so little time. Stay tuned and be surprised next week!