Enjoying our lemonade: How my team came out ahead from a 4-week outage

2023-06-28

I work on a web-based product that supports hundreds of thousands of paying subscribers. Our servers were first built out in 2014, before I joined the team. By 2018, our environments were starting to look dated, with some components approaching end of life. I submitted an intake request with our shared-services infrastructure team to modernize them and to assess a move to Azure. But everything was still supported and we had no new requirements, so we were unable to make a case while other lines of business had much more pressing concerns.

In March of 2020, an incident shut down our development environments for an indeterminate amount of time. We were stuck: no API gateway, no environments for QA, nowhere to deploy and test code, no way to progress the product. The team would be idle until our infrastructure team restored our environments, and other lines of business were ahead of us in the queue. We did have access to a company-funded Azure sandbox, which had been acquired for developing independent proofs of concept.

No environments, no way to progress the product, an idle team, and access to an Azure sandbox. Although there was no approved infrastructure project to migrate our product to Azure, our team decided to migrate our .NET web app to Azure PaaS, in the hope that one day the project would get green-lit.

4 weeks elapse. Our pre-production environments are restored, and we’re now in a position to continue regular product progression. We’ve made great progress, but we’re not quite running on Azure PaaS, and a number of business-driven priorities are 4 weeks behind. Do we stop, and resume work as we had before the incident? If we resume progression work, how do we keep this Azure PaaS code branch current?

We estimated we needed another 2 weeks to get the app fully functional on Azure PaaS. We met with all stakeholders.

Here’s what we decided to do:

Complete the Azure PaaS migration work (it did take another 2 weeks!)
Replace our on-premise Dev environment with the Azure PaaS environment. Going forward, all deployments would start with our new Azure PaaS dev environment, and then progress to our regular on-premise QA environment. This ensured our Azure PaaS work didn’t rust, sitting unused, and allowed us to continue stable development on our on-premise infrastructure.
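
The promotion order described above can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the environment names and the `next_environment` helper are hypothetical, and the real pipeline may well have had more stages than the two named here.

```python
# Hypothetical environment names. The only ordering taken from the post is
# that the Azure PaaS dev environment comes first, then the on-premise QA one.
PROMOTION_ORDER = ["azure-paas-dev", "onprem-qa"]


def next_environment(current=None):
    """Return the environment a build should be deployed to next.

    `current` is the environment the build last passed in, or None for a
    brand-new build. Returns None once the build has passed every
    environment in PROMOTION_ORDER.
    """
    if current is None:
        return PROMOTION_ORDER[0]
    index = PROMOTION_ORDER.index(current)  # raises ValueError for unknown names
    if index + 1 < len(PROMOTION_ORDER):
        return PROMOTION_ORDER[index + 1]
    return None
```

Because every build has to pass through the first entry in the list, the Azure PaaS environment is exercised on each deployment, which is what kept that work from going stale.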

About 7 weeks after the incident, we resumed our regular product progression cadence.

Fast forward to July 1st, 2020, and, unexpectedly, our line of business is sold to another company. We need to plan for a migration to the acquiring firm’s data center. The initial plan is to “lift and shift”: move the production on-premise virtual machines and configuration as they are. We propose to the new management team that we move the application to Azure instead. A couple of meetings later, this becomes the plan.

The migration date is set for November 2020, and the migration goes more smoothly than any of us expected, with disruption limited to a single planned service window of a few hours.

That was almost three years ago. For our use case, the Azure environment has proven to be far more reliable and cost-effective than our previous on-premise data center. This was all enabled by a four-week dev environment outage. We made lemonade with our lemons!