Practise Continuous Integration to mitigate CloudFormation stack deployment pains

Being able to provision cloud resources with just a few lines of code can make Infrastructure-as-Code feel like a superpower. For me, it’s arguably the single most important practice for a development team new to serverless to adopt.

But with this power comes the potential for quite a lot of pain if things go wrong. CloudFormation (AWS's hosted IaC service, which most serverless deployment frameworks rely on) has several edge cases that can cause stack deployments to fail.

I recently encountered such an edge case with a client team I’m helping and it was related to DynamoDB’s Global Secondary Indexes (GSIs). A developer on the team was adding two new GSIs to a table to support a new feature he was building. In his own cloud dev environment, this all worked fine, as he’d added each one incrementally as he was writing the code that queried it. However, when he merged to main and the CD pipeline attempted to deploy to a shared test environment, he got an error stating:

Cannot perform more than one GSI creation or deletion in a single update.

Sure enough, the DynamoDB docs state:

You can only create one global secondary index per UpdateTable operation.

Note: There's another scenario that causes a similar issue: deleting one GSI and adding another to the same table within a single update also isn't allowed and will fail with an error.
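
To make the failure mode concrete, here's a minimal sketch of the kind of change that triggers it. I'm using the AWS CDK in TypeScript purely for illustration (the same applies to raw CloudFormation or Serverless Framework templates), and the table and index names are hypothetical:

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import { AttributeType, BillingMode, Table } from 'aws-cdk-lib/aws-dynamodb';
import { Construct } from 'constructs';

export class OrdersStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // A table that has already been deployed to every environment.
    const ordersTable = new Table(this, 'OrdersTable', {
      partitionKey: { name: 'pk', type: AttributeType.STRING },
      sortKey: { name: 'sk', type: AttributeType.STRING },
      billingMode: BillingMode.PAY_PER_REQUEST,
    });

    // Both GSIs below are new in the same commit. In the dev environment they
    // were added across two separate deploys, but once merged to main the CD
    // pipeline applies them in a single CloudFormation stack update — i.e. a
    // single UpdateTable call creating two indexes — which DynamoDB rejects.
    ordersTable.addGlobalSecondaryIndex({
      indexName: 'byStatus',
      partitionKey: { name: 'status', type: AttributeType.STRING },
    });
    ordersTable.addGlobalSecondaryIndex({
      indexName: 'byCustomerId',
      partitionKey: { name: 'customerId', type: AttributeType.STRING },
    });
  }
}
```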

This was a bit tricky to resolve and fix in the CD pipeline and involved the following multi-deploy sequence of steps:

  1. Ensure code that relies on the presence of the new GSIs isn't invoked anywhere in the application, e.g. if any new integration/E2E tests call code which queries these indexes, they will fail.
  2. Update the GlobalSecondaryIndexes part of the DynamoDB table definition to remove one of the new GSIs (see the code sketch after this list).
  3. Push this change to main, triggering its auto-deployment to the test environment. Wait for the asynchronous index creation to complete.
  4. Add the second GSI back into the table definition and re-enable any code/tests which refer to the new GSIs.
  5. Push these changes to main, triggering its auto-deployment to the test environment.
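
Continuing the hypothetical CDK sketch from earlier, steps 2–4 boil down to landing the two index additions in separate pushes to main:

```typescript
// Push 1: the table definition contains only the first new GSI.
ordersTable.addGlobalSecondaryIndex({
  indexName: 'byStatus',
  partitionKey: { name: 'status', type: AttributeType.STRING },
});

// Push 2: added only after push 1 has deployed and the new index is ACTIVE.
// Any code/tests that query either index are re-enabled in this commit too.
ordersTable.addGlobalSecondaryIndex({
  indexName: 'byCustomerId',
  partitionKey: { name: 'customerId', type: AttributeType.STRING },
});
```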

This allowed the CD pipeline to proceed with deploying through to the ungated staging and prod environments, since the changes that reached those stages arrived in small batches.

While these manual steps were a pain to do, at least the problem was caught right at the start of the CD pipeline. And the developer now knows that in the future, if they require multiple new GSIs, they should do each one as a separate push to main.

However, in teams that don’t practise regular continuous integration into the main branch and/or who accumulate a large set of changes before deployment to certain environments, this type of scenario could be much more painful to resolve.

Consider a team that does manual QA in a staging environment, where production releases only happen once every two weeks or once a month. Within that period, separate developers may each have introduced a new GSI for the same table, and (assuming each developer only included one new GSI) each change would've deployed to staging successfully, so there's no sign of a problem yet. QA engineers then do their testing, and at the end of the iteration the deployment to production is attempted. But the accumulated changeset now adds multiple GSIs to the same table in a single stack update, so the CloudFormation deployment fails. 😢

Now comes the task of identifying all the new GSI changes and the code which depends on them, paring them right back and applying them one by one, all while making sure all the test cases still pass in each intermediate state so the CD pipeline can proceed to the next stage. Messy. 😫

This DynamoDB GSI issue is just one specific example of the problems with applying large CloudFormation changesets to an environment. In general, my recommendation is to get your IaC changes into production ASAP and in as small a batch as possible, even if the feature the change is supporting is still a work-in-progress. This minimises the need for any time-consuming and error-prone manual resolutions for failed deployments.
