How to ensure data integrity in your DynamoDB applications
If you’re a developer coming to DynamoDB for the first time, a fundamental shift is that there’s now a greater onus on you to ensure the integrity of the data from inside your application’s code. In this article, we’ll explore the different concerns you as a developer need to consider when writing data to DynamoDB.
“Integrity” here means that the application data stored across all your DynamoDB tables is both accurate and consistent. Let’s explore these two properties in turn to see where and when they come into play.
Accuracy considerations
Accuracy is concerned with ensuring that items within your DynamoDB tables are complete and that individual attributes on each item have the correct value in the correct type and format for the domain entity that they represent.
DynamoDB provides three types of write operation: Put, Update and Delete. If you’re writing code to perform one of these operations (either on a single item or a batch), there are a few obvious accuracy checks that you’ll need to perform in your code (a rough sketch follows the list below):
- Ensuring required fields are set for Puts and Updates.
- Ensuring fields adhere to certain types (numbers, strings, arrays, etc).
- Ensuring fields adhere to certain formats (emails, ISO date strings, UUIDs/ULIDs, etc).
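Here’s a rough sketch of what these checks could look like for a hypothetical User entity, written by hand in Node.js (the field names and the `validateUser` helper are just for illustration; in practice you may prefer a schema validation library):

```js
// Illustrative accuracy checks for a hypothetical User entity before a write.
// The field names (username, emailAddress, createdAt) are assumptions, not
// anything prescribed by DynamoDB itself.
const EMAIL_REGEX = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

function validateUser(user) {
  const errors = [];

  // Required fields must be present for Puts (and for Updates that touch them)
  for (const field of ['username', 'emailAddress', 'createdAt']) {
    if (user[field] === undefined || user[field] === null) {
      errors.push(`${field} is required`);
    }
  }

  // Type checks
  if (user.username !== undefined && typeof user.username !== 'string') {
    errors.push('username must be a string');
  }

  // Format checks
  if (typeof user.emailAddress === 'string' && !EMAIL_REGEX.test(user.emailAddress)) {
    errors.push('emailAddress must be a valid email address');
  }
  if (typeof user.createdAt === 'string' && Number.isNaN(Date.parse(user.createdAt))) {
    errors.push('createdAt must be an ISO date string');
  }

  if (errors.length > 0) {
    throw new Error(`Invalid User: ${errors.join(', ')}`);
  }
  return user;
}
```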
Another accuracy concern is uniqueness constraints. Perhaps you can only create a new instance of an entity if another one doesn’t already exist with the same value for a certain field, let’s say an `emailAddress` for a User. You may get this for free from DynamoDB if the `emailAddress` field is part of the table’s partition key schema definition. However, if it’s not, you will need to perform this check in your code (one common approach is sketched below).
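When the unique field isn’t part of the key schema, one common pattern (shown here only as a sketch, with assumed table and key names) is to write a separate “uniqueness marker” item alongside the real item in a single transaction, conditioning both on not already existing:

```js
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient();

async function createUser(user) {
  try {
    await docClient.transactWrite({
      TransactItems: [
        {
          // The real User item
          Put: {
            TableName: 'AppTable',
            Item: { pk: `USER#${user.username}`, sk: `PROFILE#${user.username}`, ...user },
            ConditionExpression: 'attribute_not_exists(pk)',
          },
        },
        {
          // A marker item whose only job is to reserve the email address
          Put: {
            TableName: 'AppTable',
            Item: { pk: `USEREMAIL#${user.emailAddress}`, sk: `USEREMAIL#${user.emailAddress}` },
            ConditionExpression: 'attribute_not_exists(pk)',
          },
        },
      ],
    }).promise();
  } catch (err) {
    if (err.code === 'TransactionCanceledException') {
      throw new Error('That username or email address is already taken');
    }
    throw err;
  }
}
```

Because both items live in the same table, a later `createUser` call with a duplicate email address fails the condition on the marker item and cancels the whole transaction.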
A new accuracy concern that DynamoDB introduces for developers is that of composite key fields. These are attributes whose values are dynamically calculated before writing, and whose purpose is to act as the partition key or sort key for a table or one of its GSIs. Composite fields are almost always strings, are particularly prevalent in single-table designs and often have a generic name such as `pk`, `sk` or `gsi1pk`. They involve concatenating a “core” field from the domain item (e.g. `username`) with string constants or other core fields to give a value like `PROFILE#paulswail`. So if you’re writing code to perform an Update operation, you need to consider whether one of the fields you’re updating also feeds into a composite field, and if so, recompute and update that composite field as well.
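As a sketch (again with assumed key shapes, in the spirit of the `PROFILE#paulswail` example), the composite key calculation can be kept in one helper, and any Update that touches a core field recomputes the derived composite fields in the same request:

```js
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient();

// Compute all composite key fields for a User item in one place before writing
function toUserItem(user) {
  return {
    ...user,
    pk: `USER#${user.username}`,
    sk: `PROFILE#${user.username}`,
    gsi1pk: `EMAIL#${user.emailAddress}`, // e.g. a GSI keyed by email address
  };
}

// An Update that changes emailAddress must also recompute gsi1pk,
// otherwise the GSI will keep serving the stale value.
async function changeEmailAddress(username, newEmailAddress) {
  await docClient.update({
    TableName: 'AppTable',
    Key: { pk: `USER#${username}`, sk: `PROFILE#${username}` },
    UpdateExpression: 'SET emailAddress = :email, gsi1pk = :gsi1pk',
    ExpressionAttributeValues: {
      ':email': newEmailAddress,
      ':gsi1pk': `EMAIL#${newEmailAddress}`,
    },
  }).promise();
}
```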
Some libraries in the Node.js ecosystem that can help you with these accuracy concerns are dynamodb-onetable, dynaglue and dynamodb-toolbox.
Consistency considerations
Once we’ve addressed the above accuracy concerns in our code, we also need to consider potential consistency issues.
Now DynamoDB provides eventual consistency for reads, with the AWS docs stating: “data is eventually consistent across all storage locations, usually within one second or less”. But while the DynamoDB service takes care of replicating writes across those storage locations and to GSIs for us, there is one consistency concern that we need to address in our code: that of denormalisation.
With DynamoDB (and other NoSQL databases), it’s quite common to need to duplicate data for a specific domain entity across multiple items in order to support efficient reads for different access patterns. This means that our code, when performing a Put, Delete or Update for a specific domain entity, needs to consider all the locations where this data is stored and write to those items as well. This can sometimes require first performing a Query operation to look up the keys of all the items where these copies exist and then using a BatchWriteItem operation to rewrite all the copies.
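For example, a sketch of propagating a changed `displayName` to every denormalised copy could look like this (the GSI, key shapes and attribute names are assumptions, and the GSI is assumed to project all attributes so the copies can be re-put in full):

```js
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient();

async function propagateDisplayNameChange(username, newDisplayName) {
  // 1. Look up every item that holds a copy of this user's displayName
  //    (assumes a GSI where such items share gsi1pk = USER#<username>)
  const { Items: copies } = await docClient.query({
    TableName: 'AppTable',
    IndexName: 'GSI1',
    KeyConditionExpression: 'gsi1pk = :gsi1pk',
    ExpressionAttributeValues: { ':gsi1pk': `USER#${username}` },
  }).promise();

  // 2. Rewrite each copy with the new value. BatchWriteItem only supports
  //    Put and Delete requests, so the full items are re-put, 25 at a time.
  const putRequests = copies.map((item) => ({
    PutRequest: { Item: { ...item, displayName: newDisplayName } },
  }));
  for (let i = 0; i < putRequests.length; i += 25) {
    await docClient.batchWrite({
      RequestItems: { AppTable: putRequests.slice(i, i + 25) },
    }).promise();
  }
}
```

A production version would also need to retry any `UnprocessedItems` that `batchWrite` returns, but the shape of the read-then-fan-out logic is the same.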
Alternatively, sometimes it makes sense to perform the writes for denormalised copies of an item asynchronously from the update to the “master” copy in order to reduce user-facing latency. However, this disconnect between the master write and the writes to the denormalised copies adds extra overhead to your codebase, as the logic is now spread out.
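One common mechanism for this (shown only as a sketch, not something prescribed above) is to let the user-facing code write just the master item and have a Lambda function subscribed to the table’s DynamoDB stream perform the fan-out, reusing the `propagateDisplayNameChange` helper sketched earlier:

```js
const AWS = require('aws-sdk');
// propagateDisplayNameChange would live in the data access layer (see the
// earlier sketch); the module path here is hypothetical.
const { propagateDisplayNameChange } = require('./users-db');

// Assumes the table's stream is configured with NEW_AND_OLD_IMAGES.
exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'MODIFY') continue;
    const oldImage = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.OldImage);
    const newImage = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);

    // Only fan out when the master (profile) item's displayName actually changed
    const isProfileItem = typeof newImage.sk === 'string' && newImage.sk.startsWith('PROFILE#');
    if (isProfileItem && oldImage.displayName !== newImage.displayName) {
      await propagateDisplayNameChange(newImage.username, newImage.displayName);
    }
  }
};
```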
Use a data access layer to manage these concerns
I’m always reluctant to introduce extra layers of indirection into my code and generally don’t adopt techniques such as hexagonal architecture as a matter of course in my serverless apps (which I know many folks favour). However, one area where I make an exception is the data access layer. Reading through the concerns described in the previous sections, you’ll see there are a lot of things that could go wrong, and given that data integrity is a critical concern, channelling all data access through a centralised location makes it much easier to manage and enforce.
Here are some key features of the data access layers that I write (in Node.js), with a skeleton sketch after the list:
- Each domain entity usually has its own data access module in its own file.
- The module exports public functions for each access pattern for that entity (e.g. `createUser`, `changeUsername`, `getUserByUsername`).
- The function implementations use the DynamoDB DocumentClient SDK to perform the requisite API calls.
- Lambda handler functions, or indeed any other types of library modules used in the codebase, never reference the DynamoDB SDK directly but instead always go via the relevant data access module.
- The data access module enforces all the data integrity concerns listed in the previous sections around accuracy and consistency.
- The data access module is concerned purely with data access and integrity and is not responsible for performing any other business logic checks such as authorization checks.
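A skeleton of such a module might look like this (the table design, environment variable and function bodies are all illustrative; a real module would carry the full set of accuracy and consistency checks):

```js
// users-db.js — skeleton of a per-entity data access module
const AWS = require('aws-sdk');

const docClient = new AWS.DynamoDB.DocumentClient();
const TABLE_NAME = process.env.DYNAMODB_TABLE_NAME; // assumed configuration

// Key schema and composite field calculation stay private to this module
const userKey = (username) => ({ pk: `USER#${username}`, sk: `PROFILE#${username}` });

function validateUser(user) {
  // Accuracy checks as sketched earlier (trimmed here for brevity)
  if (!user.username || !user.emailAddress) {
    throw new Error('username and emailAddress are required');
  }
}

async function createUser(user) {
  validateUser(user);
  await docClient.put({
    TableName: TABLE_NAME,
    Item: { ...user, ...userKey(user.username) },
    ConditionExpression: 'attribute_not_exists(pk)', // don't overwrite an existing user
  }).promise();
  return user;
}

async function getUserByUsername(username) {
  const { Item } = await docClient.get({
    TableName: TABLE_NAME,
    Key: userKey(username),
  }).promise();
  return Item;
}

// Only the per-access-pattern functions are exported; Lambda handlers never
// see the DynamoDB SDK or the key schema directly.
module.exports = { createUser, getUserByUsername };
```

A Lambda handler then simply calls `createUser(...)` or `getUserByUsername(...)` without knowing anything about `pk`/`sk` or the DocumentClient.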
What about functionless integrations to DynamoDB?
A data access layer works well if all your data access is performed from code that you write in a general-purpose programming language such as Node.js. However, AWS provides functionless direct service integrations into DynamoDB from other services such as AppSync and Step Functions. These integrations bypass Lambda functions and instead use their own domain-specific languages (such as Velocity Template Language or Amazon States Language) to map received inputs into DynamoDB write operations. Crucially, these DSLs do not allow you to insert your data access layer module in between, so all your integrity checks are bypassed too.
So while there are many benefits of using functionless integrations, because of the criticality of ensuring data integrity in most apps I build, a simple rule of thumb that I’ve come around to is: If your use case involves writing to a DynamoDB table, use a Lambda function.
There may well be exceptions to this (e.g. if you have a really simple data model with no denormalisation or composite fields), but I think it’s a safe default position to take. And as your app grows, so will your data access patterns, and what’s now a really simple data model may require techniques such as denormalisation and composite fields in the future. At that stage, locating all the data access code scattered across your codebase and refactoring it to go via a data access layer will be time-expensive and risky.