How to test your EventBridge integrations


So you’re building a PubSub messaging workflow on AWS using EventBridge. The ability to decouple the publisher’s logic from the subscriber’s is a powerful architectural pattern both in terms of system scalability and the maintainability of your codebase.

But with these benefits comes an obstacle—how do you test your EventBridge integrations?

Relative to unit testing and most other forms of integration testing, EventBridge presents a few unique challenges:

  • PubSub is inherently asynchronous. Your test can’t just make an API request and then perform an assertion on the response.
  • EventBridge has no persistent storage of events that you can query via its API.
  • How do you control the side effects caused by multiple downstream subscribers when publishing to EventBridge from your tests?

But fear not, there are techniques to work through these challenges which you can implement in your test suite. In this guide, we’ll cover:

  • What test cases you need to write for your EventBridge integration
  • Methods for verifying correct behaviour of publishers and subscribers
  • Approaches to managing side effects of your test runs

How can your EventBridge integration fail?

As with any form of integration or E2E testing, the first step to uncovering what tests to write is to understand what could go wrong. Put another way, what are the potential failure modes of your integration?

This requires knowledge of the functional and operational behaviours and limitations of each component participating in the integration. In our case, that means the EventBridge service itself, components which send events to it (publishers) and components which receive events from it (subscribers).

The three high-level failure symptoms to look for are:

  1. Publisher fails to send event to EventBridge
  2. Subscriber does not receive event that it should have
  3. Subscriber receives event but fails to process it correctly

Before we walk through each of these, consider the following use case:

You’re building a REST API for a Sports Club Manager app using API Gateway backed by Lambda functions. The API has an endpoint that allows users to join a club. You have an immediate requirement to send notifications to existing club members when a new member joins.

To keep your notification logic decoupled, you decide to implement this by having the joinClub Lambda function triggered from API Gateway publish a MemberJoinedClub event to EventBridge. You then have a separate notifyMemberJoined Lambda function which subscribes to this event type and is responsible for delivering the notifications to the correct users.

Now let’s look at how we can test for each of the above three failure symptoms with this concrete use case.

Publisher fails to send event to EventBridge

Here are a few potential causes of this issue:

  • Name of event bus misconfigured or not set in Lambda environment variable
  • Lambda function doesn’t have IAM permissions to put events to EventBridge
  • Bug in code that invokes the AWS SDK (e.g. it’s very easy to forget to add the .promise() call in the AWS JavaScript v2 SDK)
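As a rough illustration of guarding against the first and third causes, here is a minimal sketch of a publisher. The `EventBridgeLike` interface is a hypothetical stand-in that mirrors the shape of the v2 SDK's `putEvents(...).promise()` call so that the example needs no imports. The `publishMemberJoined` function name and the `sportsclub.api` source value are also assumptions, not taken from the use case above.

```typescript
// Hypothetical minimal shape of the v2 SDK client, so the sketch needs no imports.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded payload
}
interface PutEventsResult {
  FailedEntryCount?: number;
}
interface EventBridgeLike {
  putEvents(params: { Entries: PutEventsEntry[] }): { promise(): Promise<PutEventsResult> };
}

async function publishMemberJoined(
  client: EventBridgeLike,
  busName: string | undefined, // e.g. read from a Lambda environment variable
  detail: object,
): Promise<void> {
  // Fail fast on a missing bus name instead of sending to an unintended bus.
  if (!busName) {
    throw new Error('Event bus name is not configured');
  }
  const result = await client
    .putEvents({
      Entries: [
        {
          EventBusName: busName,
          Source: 'sportsclub.api', // assumed source name
          DetailType: 'MemberJoinedClub',
          Detail: JSON.stringify(detail),
        },
      ],
    })
    .promise(); // in the v2 SDK the request is only sent once .promise() (or .send()) is called
  // The call itself can succeed while individual entries are rejected.
  if ((result.FailedEntryCount ?? 0) > 0) {
    throw new Error(`Failed to publish ${result.FailedEntryCount} event(s)`);
  }
}
```

Passing the client in as a parameter is a design choice made here so that the function can be exercised with a fake. It does not replace the end-to-end test, which is still needed to catch IAM and configuration problems.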

This may seem like a simple scenario to test for seeing as there’s only a single binary outcome to verify—either the event got to EventBridge or it didn’t. However, this is actually the hardest test to implement as we need to set up some extra infrastructure to help verify the delivery of events to EventBridge.

There are several different ways to implement this, each with its own pros and cons. My preferred approach is to provision an auxiliary SQS queue in our test environment which is subscribed to all events in our EventBridge event bus.

Figure: Testing the publishing of an EventBridge event

The test case invokes the System Under Test (in our case the POST /clubs/{id}/join API Gateway endpoint) which should then cause the message to be published to EventBridge. The test case then polls the E2ETestQueue until it finds the matching message.

Since this queue exists purely for automated testing, it can have a low MessageRetentionPeriod so that messages are cleaned up automatically. This avoids a build-up of unprocessed messages that would slow down successive test runs.
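The polling step could look something like the sketch below. It assumes the queue receives the default EventBridge event envelope as the message body (that is, no input transformation is configured on the rule target). The `receive` parameter is a hypothetical abstraction over an SQS ReceiveMessage call against the E2ETestQueue, and `pollQueueForEvent` is an illustrative name rather than a library function.

```typescript
// Envelope EventBridge delivers to targets by default (only the fields used here).
interface EventEnvelope<T> {
  source: string;
  'detail-type': string;
  detail: T;
}

// Hypothetical abstraction over an SQS ReceiveMessage call: resolves to raw message bodies.
type ReceiveBodies = () => Promise<string[]>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollQueueForEvent<T>(
  receive: ReceiveBodies,
  matches: (evt: EventEnvelope<T>) => boolean,
  timeoutMs = 20000,
  intervalMs = 500,
): Promise<EventEnvelope<T> | undefined> {
  const deadline = Date.now() + timeoutMs;
  do {
    for (const body of await receive()) {
      const evt = JSON.parse(body) as EventEnvelope<T>;
      if (matches(evt)) return evt;
    }
    await sleep(intervalMs);
  } while (Date.now() < deadline);
  return undefined; // no matching event arrived before the timeout
}
```

In a test, the `matches` callback would typically compare a value that is unique to the test run, such as a generated user ID, so that messages left over from other runs are ignored.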

Subscriber does not receive event that it should have

The next category of failure we’ll look at is when an event is published but a subscriber isn’t triggered. In our use case, this would mean that the notifyMemberJoined Lambda function isn’t triggered.

This could be caused by a few things:

  • The Lambda function wasn’t hooked up to EventBridge at all
  • The filter rule used in the event pattern to connect the Lambda to EventBridge was misconfigured, e.g. it used the wrong source or detail-type

To write a test to check for this symptom, we need to publish an event to EventBridge and verify that the Lambda function was indeed invoked. We do not care (for this test case) what the function does nor if it succeeds.

In order to verify that the Lambda function was invoked, I check for the presence of a CloudWatch log statement. The diagram below shows how such a test works for our use case (the System Under Test is highlighted in pink).

Figure: Testing the triggering of an EventBridge event subscriber

If you’re using JavaScript, you can use the aws-testing-library NPM package to check for the presence of a CloudWatch log statement.

Here’s a test for our use case that uses this library’s toHaveLog Jest extension function:

it('is triggered whenever MemberJoinedClubEvent is sent to EventBridge', async () => {
  // Arrange: create event matching rule that will cause triggering of Lambda function
  const evt: MemberJoinedClubEvent = {
    member: {
      user: {
        id: `notifyMemberJoinedTest1_${uuid()}`, // ensure this data is uniquely identifiable to each test run
        email: 'clubMember1@example.com',
      },
      role: MemberRole.PLAYER,
      club: { id: 'a12345', name: 'Belfast United' },
    },
  };

  // Act: send event to EventBridge
  await publishEvent(evt, EventDetailType.MEMBER_JOINED_CLUB);

  // Assert: Check the unique data is present in the CloudWatch logs for the Lambda function
  const expectedLog = evt.member.user.id;
  await expect({
    region: AWS_REGION,
    function: lambdaFunctionName,
    timeout: 20000, // needs a high timeout to account for variable latency in shipping logs from Lambda to CloudWatch
  }).toHaveLog(expectedLog);
});

A couple of things to note here:

  • The test relies on the Lambda function logging out a unique piece of data at the start of its invocation, in this case the new member’s user ID.
  • Due to the delay in logs being available from the CloudWatch API, the toHaveLog function polls CloudWatch until it finds the expected log statement instead of the more naive approach of just delaying and querying once, which could lead to flaky test runs. The configurable timeout puts an upper limit on how long it polls for.

Because of the CloudWatch log polling, this type of test will be slower to run. But since you only need one such test case for each Lambda function that subscribes to EventBridge, I find the trade off to be worth it.
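On the subscriber side, this kind of test only works if the handler logs the identifying data before doing anything that could fail. A minimal sketch of such a handler follows. The type names, the `notifyMemberJoinedHandler` name and the injected `sendNotifications` dependency are assumptions made for illustration, not the actual function from the use case.

```typescript
// Minimal event shape for this sketch; field names follow the MemberJoinedClub example above.
interface MemberJoinedClubDetail {
  member: {
    user: { id: string; email: string };
    club: { id: string; name: string };
  };
}
interface EventBridgeEventLike<T> {
  'detail-type': string;
  detail: T;
}

// Hypothetical handler: logs the identifying data first, then does the real work.
async function notifyMemberJoinedHandler(
  event: EventBridgeEventLike<MemberJoinedClubDetail>,
  sendNotifications: (detail: MemberJoinedClubDetail) => Promise<void>, // assumed dependency
): Promise<void> {
  // Logged before any downstream call so the trigger test can find it even if processing fails.
  console.log(`Received ${event['detail-type']} for user ${event.detail.member.user.id}`);
  await sendNotifications(event.detail);
}
```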

Subscriber processes event incorrectly

The final category of failure we’ll look at is when the subscriber receives the event but fails to process it correctly. Unlike the other two categories, the cause of this type of failure is mostly specific to your use case and will require writing multiple test cases to cover different failures.

For our Sports Club Manager app example, here are a few potential causes of why our notifyMemberJoined Lambda function might not work correctly:

  • URLs/ARNs to downstream services are missing or misconfigured in the function’s environment variables
  • It doesn’t have IAM permissions to call these downstream services
  • There’s a bug in the code, e.g. it selects the wrong members to send notifications to due to a bad database query
  • The function isn’t idempotent. EventBridge guarantees at-least-once event delivery so there may be a very rare occasion where your function is invoked twice with the same event. If this would be particularly bad for your use case (e.g. cause data inconsistencies), then you need to cater for this inside your handler code and write a test case for it.
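One possible way to cater for duplicate deliveries is to record the ID of each processed event and skip any event whose ID has been seen before. The sketch below assumes that a duplicate delivery carries the same event `id` as the original. The `ProcessedEventStore` interface and `handleOnce` function are hypothetical names, and the in-memory store is only there to keep the example self-contained.

```typescript
// Hypothetical store of already-processed event IDs. A deployed function would need a
// shared persistent store; the in-memory version below only illustrates the idea.
interface ProcessedEventStore {
  // Resolves to true if the ID was newly recorded, false if it had been seen before.
  markProcessed(eventId: string): Promise<boolean>;
}

class InMemoryProcessedEventStore implements ProcessedEventStore {
  private readonly seen = new Set<string>();

  async markProcessed(eventId: string): Promise<boolean> {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }
}

async function handleOnce<T>(
  event: { id: string; detail: T },
  store: ProcessedEventStore,
  processDetail: (detail: T) => Promise<void>,
): Promise<'processed' | 'duplicate'> {
  if (!(await store.markProcessed(event.id))) return 'duplicate';
  await processDetail(event.detail);
  return 'processed';
}
```

A test case for idempotency can then invoke the handler twice with the same event and assert that the side effect happened only once. Note that this sketch records the ID before processing, so an event whose processing fails would not be reprocessed on retry. Whether that trade-off is acceptable depends on your use case.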

To implement your test cases, you should invoke the Lambda function directly without going via EventBridge as this is faster and also gives you access to the response payload. How you verify the correctness of the function depends on what services it integrates with, but here’s a list of common techniques I use in the assert phase of my test cases (most preferable first):

  1. Query the downstream system via its API to make sure the data was correctly sent to it. This is possible for any persistent datastores such as DynamoDB, S3, etc.
  2. For services that you can’t query (e.g. SES for delivering email), have the Lambda function return a payload with IDs and/or status codes returned by the downstream service. These can then be verified in the test case. EventBridge ignores any response payload after it invokes the Lambda function, so returning one has no effect on the integration itself.
  3. Add logging code to the end of the Lambda function when it completes without error. Then your test case can poll CloudWatch Logs via its API for the presence of a specific log statement.
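To illustrate the second technique, here is a minimal sketch of processing logic that returns the IDs handed back by a downstream email service. The `SendEmail` type is a hypothetical abstraction over the real service call, and `notifyMembers` and `NotifyResult` are illustrative names.

```typescript
interface NotifyResult {
  // IDs returned by the downstream email service for each message it accepted.
  messageIds: string[];
}

// Hypothetical abstraction over the email service call; resolves to the service's message ID.
type SendEmail = (to: string, subject: string) => Promise<string>;

async function notifyMembers(
  recipients: string[],
  clubName: string,
  sendEmail: SendEmail,
): Promise<NotifyResult> {
  const messageIds: string[] = [];
  for (const to of recipients) {
    messageIds.push(await sendEmail(to, `A new member joined ${clubName}`));
  }
  // Returned purely so that a test invoking the function directly can assert on it.
  return { messageIds };
}
```

A test that invokes the function directly can then assert that the number of returned message IDs matches the number of members it expected to be notified.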

This approach to integration testing subscriber Lambda functions isn’t specific to EventBridge integrations and can be applied more generally to Lambda functions that are asynchronously triggered by any AWS service, e.g. SNS, SQS, DynamoDB Streams.

Managing side effects & data sprawl

PubSub integrations inevitably involve writing data somewhere. And because of their one-to-many nature, managing the data sprawl that a single test case can generate can be tricky. Publishing a single event could cause several downstream subscribers to write to multiple downstream services.

Here are a few negative impacts of leaving such side effects unaddressed in a pre-production automated testing environment:

  • Cost of storing data in AWS that isn’t cleaned up will increase over time (most AWS services that persist data bill by the GB-hour)
  • Cost of calling third-party APIs, many of which bill by number of requests
  • Future automated test runs may eventually fail due to test data being left behind from previous runs (e.g. a storage limit may be hit)
  • Notifications could be sent to real people (e.g. by email/SMS)
  • Financial transactions might be triggered
  • Build-up of automated test data causes a mental load, e.g. in the UI, for a human user

There are several techniques you can employ to help deal with these side effects (these list items aren’t mutually exclusive):

  • If the data creation and accrual has zero or a negligible cost (this is very rarely the case), just let it happen and do nothing.
  • If the writing of data is core to the System Under Test (as opposed to being a side effect) and this data has one of the above costs, then have the test suite clean it up in its teardown phase.
  • If the data being created is not core to the SUT, then prevent it from being created by inserting some data into the test event payload such that downstream subscribers won’t process it.
  • Run the tests in a transient isolated environment that is spun up and torn down for each test run (e.g. as a CI/CD pipeline stage). This ensures all data will be removed from AWS data stores without needing to rely on individual test suites cleaning up their data. This approach slows down the pipeline though, as provisioning a new stack is time-consuming, and it also doesn’t prevent data from being created in non-AWS third-party services.
  • Disable the EventBridge subscription rule (either programmatically in your test or via IaC when creating the test environment) to prevent costly subscriber functions from firing altogether whenever they aren’t the SUT.
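As an example of the third technique in the list above, a subscriber could check for a marker in the event payload before performing any costly work. This is only a sketch of one possible convention: the `metadata.isTestEvent` field and the function names are assumptions, not part of EventBridge itself.

```typescript
// Hypothetical convention: test events carry a marker that side-effecting subscribers check.
interface DetailWithMetadata {
  metadata?: { isTestEvent?: boolean };
}

function shouldSkipSideEffects(detail: DetailWithMetadata): boolean {
  return detail.metadata?.isTestEvent === true;
}

async function notifyUnlessTestEvent(
  detail: DetailWithMetadata,
  deliver: () => Promise<void>, // assumed side-effecting step, e.g. sending an email
): Promise<'skipped' | 'delivered'> {
  if (shouldSkipSideEffects(detail)) return 'skipped';
  await deliver();
  return 'delivered';
}
```

The marker would only be set by test cases in which this subscriber is not the System Under Test. Events without the marker are processed as normal.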

Which of these techniques you employ depends on the nature of your downstream processing and impact (financial or otherwise) of the side effects.

Conclusion

Let’s recap on the key points we’ve covered in this guide:

  • EventBridge is a powerful tool for building scalable and maintainable serverless systems, but testing it is more difficult than testing synchronous integrations.
  • The three high-level failure symptoms to write tests for are:

    1. Publisher fails to send event to EventBridge
    2. Subscriber does not receive event that it should have
    3. Subscriber receives event but fails to process it correctly
  • Use an auxiliary SQS queue to verify publishing of events
  • Use CloudWatch log polling to verify triggering of subscriber rule
  • Invoke subscriber Lambda function directly to verify correct processing
  • Be cognisant of any side effects that subscribers to the events you create in your test cases could introduce. Consider some of the techniques listed in the previous section to help mitigate the effect of these.

If you’re interested in diving deeper into EventBridge testing along with other AWS services, check out my 4-week Serverless Testing Workshop. The workshop is a mixture of self-paced video lessons alongside weekly live group sessions where you will join me and other engineers to discuss and work through different testing scenarios. A 30% early-bird discount is available if you sign up by April 27th and you get instant access to the course materials as soon as you sign up.
