Antifragile - Hidden Benefit of Chaotic Systems
Although not related to IT, there are ideas worth considering as we think about our systems in the following book:
Antifragile: Things That Gain from Disorder by Nassim Nicholas Taleb
“Just as human bones get stronger when subjected to stress and tension, and rumors or riots intensify when someone tries to repress them, many things in life benefit from stress, disorder, volatility, and turmoil. What Taleb has identified and calls “antifragile” is that category of things that not only gain from chaos but need it in order to survive and flourish.”
The idea is that a bunch of un-aligned, disorganized systems suffer from a bunch of small, recoverable failures which make the whole more resilient. Whereas large, organized, homogeneous systems may suffer from fewer small failures, they are susceptible to larger failures which can lead to catastrophe.
I have seen some evidence of these patterns at work. I worked for years on a product which required an Intel server running RedHat acting as a proxy, an Intel server running Windows handling connectivity with other systems, a Sun server running the SunONE application server, and a Sun server running Oracle (all before Oracle bought Sun!). When I worked in support, a “Severity 1” server down alert might mean an issue with the application server, and a single client would be out of service.
In 2012, significant upgrades were made to our infrastructure. All of those Intel servers for all of our lenders were consolidated onto VMWare clusters. All of our Sun servers were consolidated onto larger Sun servers. Significant savings were realized in infrastructure expenses, and systems became easier to manage. The number of incidents decreased.
But as we consolidated our infrastructure, an outage now had much greater scale. A “Severity 1” server down alert now meant that multiple customers were out of service. As we consolidated our servers, we also consolidated our incidents. A Sev 1 became bigger and more complex. If we were using the number of Sev 1 incidents as a performance metric, were we counting the same thing?
As we look to the cloud, the potential scale is even bigger - here are a few examples:
- How Microsoft Azure’s service went down due to a February 29th bug
- The Linux leap second issue’s impact on Facebook
- An Amazon US East Region outage
What happens when all applications are hosted by Amazon AWS, Microsoft Azure and Google Cloud? When every server runs Linux on Intel?
Given the choice, I don’t think anyone wants to manage a impossible patchwork 1000’s of systems unsupported by vendors that no one understands with different versions of everything. However, the dangers of homogeneous systems should be considered as we design and assess our systems – there can be strength in disorder!