The naive Lambda-as-transaction-coordinator pattern

There’s a recurring pattern I see from developers writing Lambda functions that handle API Gateway or AppSync requests and perform a form of distributed transaction. The code looks something like this:

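// updateSystemA and updateSystemB stand in for calls to two separate systems
export const handler = async (event) => {
	// Step 1: write to System A
	const systemAResult = await updateSystemA(event.orderData)
	// Step 2: write to System B. If this throws, System A has already been updated
	await updateSystemB(systemAResult.orderId, event.orderData)
	return { statusCode: 200, body: JSON.stringify({ orderId: systemAResult.orderId }) }
}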

While this implementation is certainly a quick way to update two systems in response to a single event, it has one key problem: what happens on a partial failure, where the update to System B fails after System A has been successfully updated? Your data is now inconsistent between the two, which could cause significant bugs for your users and be difficult to rectify manually after the fact.

Overly optimistic or hurried developers may discount this as being very unlikely, especially if the systems being updated are reliable AWS services such as DynamoDB. They may add a try-catch and log the error before returning an error to the user, but this isn’t enough and is just inviting future pain: the error response does nothing to undo the write that System A has already accepted. A robust design needs to account for this potential for failure so that if an error does occur, the data will be left in a consistent state across all services.
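To make the problem concrete, here’s a minimal sketch of the try-catch variant. The two in-memory Maps and the always-failing `updateSystemB` are stand-ins I’ve made up for illustration; real systems would be remote services that fail far less predictably.

```javascript
// Hypothetical in-memory stand-ins for the two systems
const systemA = new Map()
const systemB = new Map()

const updateSystemA = async (orderData) => {
	const orderId = `order-${systemA.size + 1}`
	systemA.set(orderId, orderData)
	return { orderId }
}

// Simulates System B being unavailable: it never writes anything
const updateSystemB = async (orderId, orderData) => {
	throw new Error('System B unavailable')
}

const naiveHandler = async (event) => {
	try {
		const { orderId } = await updateSystemA(event.orderData)
		await updateSystemB(orderId, event.orderData)
		return { statusCode: 200, body: JSON.stringify({ orderId }) }
	} catch (err) {
		// Logging and returning an error does not roll back System A
		console.error('Order update failed', err)
		return { statusCode: 500, body: JSON.stringify({ message: 'Order update failed' }) }
	}
}
```

Calling `naiveHandler` returns a 500 to the user, yet `systemA` still holds the order while `systemB` has no record of it.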

So what does such a robust design look like for this scenario?

There are a few potential solutions, and which one fits depends on your use case and the specific services being written to:

  1. If System A is an AWS service which has built-in Lambda event triggers (such as S3 or DynamoDB Streams), then the update to System B can be performed asynchronously in a separate Lambda function with built-in retries.
  2. Move the entire transaction off the request path by having the API’s Lambda handler asynchronously start a Step Functions state machine and then return an “in-progress” acknowledgment to the user. The state machine will coordinate the two updates and build in robust error handling, retries and a compensating action to manage partial failures.
  3. If both the updates really must be performed synchronously (e.g. so that the user can be immediately provided with error feedback), then consider using a synchronous Step Functions Express Workflow which will allow a compensating/undo action to be built in, albeit with less scope for multiple retry attempts due to the integration timeout imposed by API Gateway (roughly 30 seconds by default).

In general, if you write a single Lambda function which is coordinating updates to multiple services, then that’s a bit of a smell that your solution is not robust and risks affecting data integrity.
