Strategies for syncing denormalized data in DynamoDB
When working with DynamoDB (and indeed with most NoSQL databases), you may find that you need to copy certain data fields into different locations within the same table, or even into a different table. This duplication is often referred to as denormalization.
This is needed because, unlike relational databases, DynamoDB doesn’t support performing joins in a single query. So, to get all the data you require in a single efficient read operation, you need to store it in the same place as the other data being fetched with it.
There are a few different ways you can implement this copying of data which we’ll cover below.
Example use case
To illustrate each strategy we’ll use the example of an application which uses a Single-table design data model and has User and Organization entities, each stored in separate partitions.
The “master copy” of the User data is stored under a USER-{userId} partition. But we also need to store the User details of the Organization owner under an ORGANIZATION-{orgId} partition.
So whenever the user updates their displayName (say through an AppSync API mutation or API Gateway request), there are 2 separate DynamoDB items we need to update.
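To make the example concrete, here’s a sketch of the two items holding copies of the same User data. It assumes a single-table design with generic PK/SK attributes; the attribute and key names are illustrative, not prescriptive.

```python
def master_user_item(user_id: str, display_name: str) -> dict:
    """The authoritative ("master") copy of the User."""
    return {
        "PK": f"USER-{user_id}",
        "SK": f"USER-{user_id}",
        "displayName": display_name,
    }


def org_owner_item(org_id: str, user_id: str, display_name: str) -> dict:
    """The denormalized copy of the owner's details, stored inside
    the Organization partition so it can be fetched in the same
    query as the rest of the Organization's data."""
    return {
        "PK": f"ORGANIZATION-{org_id}",
        "SK": f"OWNER-{user_id}",
        "displayName": display_name,
    }
```

Whenever `displayName` changes, both of these items must be written, which is exactly what the strategies below address.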
Implementation strategies
Strategy 1: In a single atomic transaction
This involves using the TransactWriteItems operation to perform all the necessary changes (Put, Update, Delete) within a single atomic action.
Pros:
- Copies of data are always consistent
- No need for separate Lambda function
- Easy to reason about in codebase as all updates are kept together in single function
Cons:
- Increased user latency. A transactional write is marginally slower than a standard put/update, and further latency is added if you need to perform a Get/Query before the TransactWrite in order to fetch the primary keys of all the items to be updated
- Can’t be (easily) performed in “constrained code” environments such as AppSync VTL resolvers and StepFunctions tasks, so you’ll probably need to do it in a Lambda function
- If a particular data field can be updated from multiple sources (say different API endpoints), then this transactional logic will need to be carried out in each handler. This can be mitigated by keeping all the transactional updates in a shared module, but developers still need to know to use this.
- The TransactWriteItems API operation has a limit of 100 items that can be written in a single transaction (previously 25). So if you have more copies than this (e.g. when copying parent root data into several child entities), you’ll lose the consistency benefit and have to batch the API requests.
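A sketch of this strategy, building the TransactWriteItems parameters that update both copies of `displayName` atomically. The table and key names carry over from the example above and are assumptions; the pure parameter-building is separated out so it’s easy to test.

```python
def build_display_name_transaction(
    table: str, user_id: str, org_id: str, new_name: str
) -> dict:
    """Build TransactWriteItems params updating the master User item
    and the denormalized copy in the Organization partition together."""
    update = {
        "UpdateExpression": "SET displayName = :n",
        "ExpressionAttributeValues": {":n": {"S": new_name}},
    }
    return {
        "TransactItems": [
            {   # master copy in the USER partition
                "Update": {
                    "TableName": table,
                    "Key": {
                        "PK": {"S": f"USER-{user_id}"},
                        "SK": {"S": f"USER-{user_id}"},
                    },
                    **update,
                }
            },
            {   # denormalized copy in the ORGANIZATION partition
                "Update": {
                    "TableName": table,
                    "Key": {
                        "PK": {"S": f"ORGANIZATION-{org_id}"},
                        "SK": {"S": f"OWNER-{user_id}"},
                    },
                    **update,
                }
            },
        ]
    }


# In the Lambda handler you would then pass these params to the SDK, e.g.:
#   boto3.client("dynamodb").transact_write_items(**params)
```

Because both updates sit in one request, either both copies change or neither does.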
Strategy 2: Asynchronously in a DynamoDB streams handler
This involves the API handler code simply updating the master copy of the User item in the DynamoDB table, with a separate Lambda function triggered off the DynamoDB stream to perform the required “copies”.
Pros:
- Low-latency for user
- Guaranteed to run irrespective of what source triggers the update of the master copy item
Cons:
- Slight delay in updates to master and duplicate copies
- DynamoDB Streams are noisy and don’t allow filtering (see Pros and cons of DynamoDB streams). This can result in complex handler logic in the same function if you have several different denormalized data items. This is particularly an issue for single-table design data models.
- Risk of an infinite recursion bug if the stream handler accidentally updates the master copy again
- Harder to reason about in codebase as the master copy changes are separate from the duplicates
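A sketch of the stream handler’s filtering logic, using the DynamoDB streams record format. It ignores everything except modifications to master USER items and returns the copy writes to apply; the key and attribute names are the assumptions from the example above.

```python
def handle_stream(event: dict) -> list:
    """Extract the denormalized-copy updates implied by a batch of
    DynamoDB stream records (filtering must be done in code, since
    the stream delivers every write to the table)."""
    copies = []
    for record in event.get("Records", []):
        if record.get("eventName") != "MODIFY":
            continue
        new_image = record["dynamodb"]["NewImage"]
        pk = new_image["PK"]["S"]
        # Only react to master USER items. This guard is also what
        # protects against the infinite-recursion risk of reacting
        # to our own copy writes.
        if not pk.startswith("USER-"):
            continue
        copies.append({
            "userId": pk[len("USER-"):],
            "displayName": new_image["displayName"]["S"],
        })
    return copies
```

In a real handler, each entry in `copies` would drive an UpdateItem against the corresponding ORGANIZATION partition.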
Strategy 3: Asynchronously via an EventBridge handler
This involves the API handler code updating the master copy of the item in DynamoDB and then publishing a USER_UPDATED event to EventBridge. A separate Lambda handler would subscribe to this event and perform the required “copies”.
Pros:
- Doesn’t require DynamoDB reads before performing the write
- Can maintain single-purpose Lambda functions
Cons:
- Slightly slower user-facing latency given extra network call to EventBridge
- Slight delay in updates to master and duplicate copies
- Can’t be (easily) performed in “constrained code” environments such as AppSync VTL resolvers and StepFunctions tasks, so you’ll probably need to do it in a Lambda function
- If a particular data field can be updated from multiple sources (say different API endpoints), then this EventBridge publishing logic will need to be carried out in each handler. This can be mitigated by keeping all the denormalized updates in a shared module, but developers still need to know to use this.
- Rollback code may be required. Since this is effectively a distributed transaction (a write to DynamoDB and a publish to EventBridge) within a single Lambda function, we need to wrap the EventBridge write in a try-catch and, in the situation that a transient error occurs in EventBridge, roll back the DynamoDB update and then return an error to the user. Such a failure is highly unlikely, but if it happens and there is no rollback code, the data in DynamoDB will be left inconsistent
- Harder to reason about in codebase as the master copy changes are separate from the duplicates
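The rollback concern can be sketched as follows. The DynamoDB and EventBridge clients are injected as plain callables so the control flow is visible (and testable) without real AWS calls; the function and event names are illustrative assumptions, not a prescribed API.

```python
def update_display_name(ddb, events, user_id: str,
                        old_name: str, new_name: str) -> None:
    """Strategy 3: update the master copy, then publish USER_UPDATED.
    If the publish fails, roll back the DynamoDB write so the two
    systems don't end up disagreeing."""
    ddb.put(user_id, new_name)  # 1. update the master copy
    try:
        # 2. publish so a subscribing Lambda can sync the duplicates
        events.publish({
            "type": "USER_UPDATED",
            "userId": user_id,
            "displayName": new_name,
        })
    except Exception:
        # 3. EventBridge failed: restore the previous value, then
        #    surface the error to the caller/user
        ddb.put(user_id, old_name)
        raise
```

In production the two callables would wrap `PutItem`/`UpdateItem` and EventBridge `PutEvents` respectively.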
Deciding between these strategies
The pros and cons of each strategy will have different weights depending on your use case.
My default approach would be strategy 1 as it has the fewest moving parts, and when all other factors are (almost) equal, I like to optimise for greater code maintainability. But if your context requires a very fast user response and you need to perform several reads to gather the data items to be updated, you may opt for 2 or 3.