
How to handle flaky tests

Paul Swail・6 min read

Testing, Software Engineering

Flaky tests are automated tests that are non-deterministic. That means they may pass or fail when executed against the same build artifact or deployed system.

If you’ve ever retried the execution of a failed test run in your CI/CD pipeline tool without any code or config changes in order to get a failing test case to pass, that’s an indicator that you have a flaky test case.

Flaky tests can be worse than no tests. As well as burning time in getting your code into production, they can make you lose confidence in your test suite and question whether the effort to write and maintain it is worth it.

Flaky tests are much more common with integration and E2E tests than with unit tests. If you’re building serverless apps, you’ll need integration and E2E tests for a sufficient confidence level, so you’ll likely encounter a flaky test sooner or later.

Let’s look at a few common causes of flaky tests within cloud-based applications along with recommendations for how to fix or avoid them.

Slow tests causing timeouts

Integration and E2E tests go over the network and may involve multiple steps, so they are slower than unit tests. This can result in your test runner timing out and failing the run (for example, Jest.js has a default timeout of 5 seconds for individual test cases).

Test cases which need to poll a data store as part of an assertion step are particularly vulnerable to timeout issues: not only are they slow, but their duration is highly variable (see this example of polling CloudWatch logs). Predictably slow is OK; highly variable slowness is harder to deal with.

Here are a few techniques to mitigate this form of flakiness:

Set a higher default timeout for all E2E tests (either via a test runner CLI argument or in a shared configuration file) so you don’t need to do it in the code for each individual test suite. I find that 15 seconds is usually a good default: slower tests can still override it, while it still pushes me to keep my test durations short.
Minimise the number of steps in each individual test case. As far as possible (when testing backend APIs at least), an E2E test case should only test one endpoint as its System Under Test (it may need to call others to set up data).
Use polling instead of a naive “sleep then check” process when querying for the presence of eventually consistent data (e.g. data that was created by a cloudside Lambda function asynchronously triggered by your test case).
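
To make the polling suggestion concrete, here is a minimal sketch of a polling helper. The name pollUntil and its options are my own invention for illustration, not part of Jest or any library, and you would tune the timeout and interval to your own system.

```javascript
// Hypothetical helper (not part of any library): repeatedly runs `check`
// until it returns a truthy value or the overall timeout would be exceeded.
async function pollUntil(check, { timeoutMs = 10000, intervalMs = 500 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const result = await check();
    if (result) return result;
    // Give up if another wait would take us past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Condition not met within ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage inside a test (getOrderFromTable is a hypothetical data store lookup):
//   const order = await pollUntil(() => getOrderFromTable(orderId), { timeoutMs: 12000 });
//   expect(order.status).toEqual('PROCESSED');
```

Unlike a fixed sleep, this returns as soon as the data appears, so the test only pays the full wait when something is actually wrong. Keep the polling timeout below the test runner's timeout so that a failure surfaces as a readable error rather than a generic timeout.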

Concurrent test suites acting on a shared stateful resource

This is a particularly insidious form of test flakiness. If one of your tests is failing occasionally and you still have no idea why even after multiple re-reads of the code for the test case and the component it’s testing, there’s a fair chance you have a concurrency issue. Maybe this test was always passing before but now is starting to fail, even though you’re sure nothing in its SUT has been changed recently.

Certain test runner frameworks allow test suites to be run in parallel and often this is the default behaviour. For example, when you run the Jest.js CLI, it works as follows:

Find individual test suites (where 1 suite == 1 file) to be executed, typically using a glob pattern
Test suites are distributed across a pool of worker processes and execution of the suites begins in parallel. process.env and other global memory stores (such as module imports) are isolated between suites
Within each test suite file, individual test cases are executed in series.

Go here to read a worked example of a concurrent test execution causing flakiness that I uncovered when building the Serverless Testing Workshop.

Here’s how you can reduce the risk of hitting a data concurrency issue in your tests:

Ensure your test suite only acts on a partition of the shared stateful resource so you can guarantee unique access to this partition. If your application’s data model has a naturally occurring partition entity (e.g. a tenant, an organisation, an account) and the System Under Test sits underneath this entity, then have each test suite create data within its own partition entity. This means that although concurrent suites may be hitting the same DynamoDB table, they should not be hitting the same partitions and thus their reads and writes should not interfere.
Quarantine test suites that simply must be run in isolation and run these separately after the rest have finished. This is a good solution whenever the System Under Test requires solitary access to all members of a global resource or root-level entity (e.g. to all the clubs in the database, as per my GET /clubs use case). In Jest, you can supply a --runInBand CLI argument to tell it to run any test suites it finds in series. You can combine this with either the --testNamePattern or --testPathPattern argument to select the quarantined suites, either by a test name regex (e.g. describe('getClubs [ISOLATE]', () => { ... })) or by a file name/folder pattern.
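
As a rough illustration of the partitioning approach, each suite could generate its own partition ID at startup and create all of its data underneath it. The helper name and the pk/sk key layout below are assumptions for the sake of the example, not a prescribed data model.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical helper: builds a partition (e.g. tenant) ID unique to one suite.
function createSuitePartitionId(suiteName) {
  return `test-${suiteName}-${randomUUID()}`;
}

// Within a suite, every entity is created under this partition.
// The TENANT#/CLUB# key format is an assumed single-table layout.
const tenantId = createSuitePartitionId('create-club');
const club = { pk: `TENANT#${tenantId}`, sk: 'CLUB#1', name: 'Test Club' };
```

Because the ID includes a random UUID, two suites running in parallel (or two CI runs against the same environment) should never read or write each other's items.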

Time-based logic

This is the classic example most cited for non-deterministic code issues. You’re building a feature that relies upon the current date/time.

An instance of this is with Wait states in AWS Step Functions. These allow an orchestration to wait either for a relative amount of time to pass, or until an absolute date-time is reached before proceeding.

The solution to this is to ensure that the System Under Test is parameterisable so that it can be fed a time/duration argument from the test case, instead of hardcoding it into the SUT.
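
For the Step Functions case, one way to do this is to have the Wait state read its duration from the execution input using SecondsPath rather than a hardcoded Seconds value. The sketch below is only an assumed example: the state names and the waitSeconds input field are hypothetical.

```javascript
// Hypothetical state machine fragment: the wait duration comes from the
// execution input via SecondsPath instead of being hardcoded with Seconds.
function buildReminderStates() {
  return {
    StartAt: 'WaitBeforeReminder',
    States: {
      WaitBeforeReminder: {
        Type: 'Wait',
        SecondsPath: '$.waitSeconds',
        Next: 'SendReminder',
      },
      SendReminder: { Type: 'Pass', End: true },
    },
  };
}

// Production callers might wait a day; a test supplies a short duration.
const productionInput = { waitSeconds: 24 * 60 * 60 };
const testInput = { waitSeconds: 1 };
```

The test then starts an execution with testInput and asserts on the outcome a second or so later, while production code passes the real duration.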

API rate limits

If your test involves calling a rate-limited API, either as part of the setup or invoked by the SUT itself, then you may encounter occasional failures due to a rate limit being hit.

An example of this is with apps that use AWS Cognito. A common test setup step is to create a new user for running the tests. But Cognito’s SignUp and AdminCreateUser API calls have a 50 requests/second quota. If you have scores of test suites each creating several users all in parallel, this limit could be hit.

There are a few solutions to this (with different pros and cons):

Minimise the number of API calls made from a single test suite (e.g. create 1 user for an entire test suite file that all tests within that file can use).
Avoid the rate-limited API altogether within your test run by creating well-known seed data already set up that your tests can assume the existence of (e.g. a set of 10 users that your tests can reference). I generally dislike this approach as it makes the test harder to understand for new developers on the team.
Run each test suite in series to spread out the burst of API requests (e.g. by using the Jest runInBand option). This is heavy-handed as it could slow down your entire suite if the rate-limited API is needed in the majority of your test suites.
Manually add throttling/wait steps to your tests to slow them down. Last resort.
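
The first option (one user per suite file) can be sketched with a small memoising wrapper. Both createOnce and createTestUser are hypothetical names; createTestUser here is a stand-in that only counts invocations, where a real suite would call the rate-limited API.

```javascript
// Hypothetical helper: wraps an async factory so it runs at most once and
// every caller shares the same promise (a rejected promise is shared too).
function createOnce(factory) {
  let cached;
  return () => {
    if (!cached) cached = factory();
    return cached;
  };
}

// Stand-in for a real call to a rate-limited API such as Cognito's
// AdminCreateUser; here it only counts how often it was invoked.
let createUserCalls = 0;
async function createTestUser() {
  createUserCalls += 1;
  return { username: `test-user-${createUserCalls}` };
}

// Every test in the file calls getSuiteUser() and receives the same user.
const getSuiteUser = createOnce(createTestUser);
```

In Jest you could get a similar effect by creating the user in a beforeAll hook; the wrapper is just a runner-agnostic way of showing the idea.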

Interference from accumulated test data from previous runs

In this case, your test runs fine the first time, possibly even the first several times, but it eventually starts failing. A likely cause is that it makes assumptions about the data store behind the SUT which no longer hold because test data has accumulated over successive test runs.

The solution to this one is pretty simple: each individual test suite is responsible for deleting all the data entities that it creates. An even simpler solution (if your tests are running in an environment solely dedicated to automated testing) is to have the entire database cleaned down before every test run begins.
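
One possible way to keep suites honest about cleanup is a small registry that records how to delete each entity at the moment it is created. This is a sketch under assumptions: createCleanupRegistry is a hypothetical helper, and an in-memory Map stands in for the real data store.

```javascript
// Hypothetical helper: records a delete function for each entity a suite
// creates, then runs them in reverse creation order when the suite ends.
function createCleanupRegistry() {
  const deleters = [];
  return {
    track(deleteFn) {
      deleters.push(deleteFn);
    },
    async cleanup() {
      while (deleters.length > 0) {
        const deleteFn = deleters.pop();
        await deleteFn();
      }
    },
  };
}

// Example with an in-memory Map standing in for the real data store.
const fakeTable = new Map();
const registry = createCleanupRegistry();

function createClub(id) {
  fakeTable.set(id, { id });
  // Register the matching delete as soon as the entity exists.
  registry.track(async () => fakeTable.delete(id));
}
```

In a Jest suite, registry.cleanup() would typically be awaited from an afterAll hook, so data is removed even when an assertion fails part-way through.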

Conclusion

Flaky tests can be a real PITA and time suck for you and your development team. Hopefully the symptoms discussed in this article and their potential solutions will save you some time in diagnosing and fixing the cause of your flaky tests.

Originally published May 12, 2021.

