Practise Continuous Integration to mitigate CloudFormation stack deployment pains

Being able to provision cloud resources with just a few lines of code can make Infrastructure-as-Code feel like a superpower. For me, it’s arguably the single most important practice for a development team new to serverless to adopt.

But with this power comes the potential for quite a lot of pain if done wrong. CloudFormation (AWS’s hosted IaC service on which most serverless deployment frameworks rely) has several edge cases which can cause stack deployments to fail.

I recently encountered such an edge case with a client team I’m helping and it was related to DynamoDB’s Global Secondary Indexes (GSIs). A developer on the team was adding two new GSIs to a table to support a new feature he was building. In his own cloud dev environment, this all worked fine, as he’d added each one incrementally as he was writing the code that queried it. However, when he merged to main and the CD pipeline attempted to deploy to a shared test environment, he got an error stating:

Cannot perform more than one GSI creation or deletion in a single update.

Sure enough, the DynamoDB docs state:

You can only create one global secondary index per UpdateTable operation.

Note: A similar issue arises if you delete one GSI and add another on the same table within a single update. This isn’t allowed either and will also fail with an error.
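
To catch this before the pipeline does, a pre-deploy check could compare the index names in the currently deployed table definition against those in the new template. Here is a minimal sketch in Python, assuming both definitions are available as plain dicts shaped like the properties of a CloudFormation `AWS::DynamoDB::Table` resource. The helper names (`gsi_names`, `count_gsi_changes`, `is_safe_single_update`) are hypothetical and not part of any AWS SDK.

```python
def gsi_names(table_properties: dict) -> set:
    """Return the index names declared under GlobalSecondaryIndexes."""
    indexes = table_properties.get("GlobalSecondaryIndexes", [])
    return {gsi["IndexName"] for gsi in indexes}


def count_gsi_changes(deployed: dict, desired: dict) -> tuple:
    """Return (created, deleted) index name sets between two definitions."""
    before, after = gsi_names(deployed), gsi_names(desired)
    return after - before, before - after


def is_safe_single_update(deployed: dict, desired: dict) -> bool:
    """True if the change involves at most one GSI creation or deletion.

    Assumption: indexes are compared by name only, so an index whose
    definition changed under the same name is not detected here.
    """
    created, deleted = count_gsi_changes(deployed, desired)
    return len(created) + len(deleted) <= 1
```

A check like this could run in the pipeline before the deploy step and fail the build early, with a message listing the created and deleted index names, rather than leaving CloudFormation to roll back the stack.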

This was a bit tricky to resolve in the CD pipeline and involved the following multi-deploy sequence of steps:

  1. Ensure code that relies on the presence of the new GSIs is not invoked anywhere from the application code. For example, any new integration/E2E tests that call code relying on these indexes will fail.
  2. Update the GlobalSecondaryIndexes part of the DynamoDB table definition to remove one of the new GSIs.
  3. Push this change to main, triggering its auto-deployment to the test environment. Wait for async index creation to complete.
  4. Add the second GSI back into the table definition. Re-enable any code/tests which refer to these new GSIs.
  5. Push these changes to main, triggering their auto-deployment to the test environment.
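
The same sequence generalises to any number of index changes: each deployment carries exactly one GSI creation or deletion, and the next one waits until the table's indexes have settled. The sketch below plans those intermediate states in Python. `plan_gsi_deploys` and `indexes_settled` are hypothetical helpers. The second assumes it is given the response dict from DynamoDB's `DescribeTable` API, where each global secondary index reports an `IndexStatus`.

```python
def plan_gsi_deploys(deployed: list, desired: list) -> list:
    """Return the list of GSI names to declare at each intermediate deploy.

    Each step differs from the previous one by exactly one index.
    Deletions are planned first, then creations (an arbitrary choice).
    """
    current = list(deployed)
    steps = []
    for name in deployed:
        if name not in desired:
            current = [n for n in current if n != name]
            steps.append(list(current))
    for name in desired:
        if name not in deployed:
            current.append(name)
            steps.append(list(current))
    return steps


def indexes_settled(describe_table_response: dict) -> bool:
    """True when every GSI on the table reports IndexStatus == 'ACTIVE'.

    An index that is still CREATING or DELETING makes this return False.
    """
    table = describe_table_response["Table"]
    gsis = table.get("GlobalSecondaryIndexes", [])
    return all(gsi["IndexStatus"] == "ACTIVE" for gsi in gsis)
```

Each entry in the returned plan corresponds to one push to main. Between pushes, the pipeline (or the developer) would poll the table description until `indexes_settled` returns true before moving on.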

This allowed the CD pipeline to proceed through to the ungated staging and prod environments, since the changes reaching these stages arrived in small batches.

While these manual steps were a pain to do, at least the problem was caught right at the start of the CD pipeline. And the developer now knows that in the future, if they require multiple new GSIs, they should do each one as a separate push to main.

However, in teams that don’t practise regular continuous integration into the main branch and/or who accumulate a large set of changes before deployment to certain environments, this type of scenario could be much more painful to resolve.

Consider a team that does manual QA in a staging environment and whose production releases only happen once every two weeks or once a month. In this scenario, separate developers may have introduced new GSIs for the same table within this period and (assuming each developer only included one new GSI) these would’ve been successfully deployed to staging, so there’s no sign of a problem yet. QA engineers will then do their testing and at the end of the iteration the deployment to production will be attempted. But the changeset now contains several new GSIs for the same table, so the CloudFormation deployment will fail. 😢

Now comes the task of identifying all the new GSI changes and the code which depends on them, paring them right back and applying them one by one, all while making sure all the test cases still pass in each intermediate state so the CD pipeline can proceed to the next stage. Messy. 😫

This DynamoDB GSI issue is just one specific example of the problems with applying large CloudFormation changesets to an environment. In general, my recommendation is to get your IaC changes into production ASAP and in as small a batch as possible, even if the feature the change is supporting is still a work-in-progress. This minimises the need for any time-consuming and error-prone manual resolutions for failed deployments.
