Capacity planning for serverless workloads
“Surely we don’t need to worry about capacity planning any more now that we’re building our applications using serverless?”
The promise of serverless is that your cloud provider gives you services that can scale themselves based on demand without any input from you, the application builder. This in turn frees up you and your team to spend more time on the higher-value tasks involved in the building of your app.
But how close is the current state of serverless to this ideal? Before we answer that, let’s start with a definition…
What is capacity planning?
In the context of custom software development projects, capacity planning is the upfront analysis of an application or workload, performed before it is first released into a production environment, to identify the infrastructure resource types, numbers, sizes and other scale-related attributes it will require.
The goals of capacity planning are twofold:
- System Availability and Performance — To ensure your system has sufficient infrastructure resources to stay up and fast under peak load in production.
- Cost Estimation — Predicting costs to your organisation of procuring and operating these resources, both in terms of upfront purchasing cost and ongoing charges.
In this article, I’m going to focus primarily on the first goal (although it’s arguable that the second goal is even more relevant in the serverless world given the variability of its pay-per-use pricing model; that topic will be a future article).
Capacity planning requirements for AWS services
Let’s take a look at the main services in the AWS serverless ecosystem. For each service, I’ll specify the main units of scale and associated limits that you should be aware of, as well as any knobs available to you to manually tune the service. (All figures are taken from US East 1 region and are correct at time of writing — Feb 2020).
Lambda
The Lambda service will automatically scale concurrently executing Lambda functions in your AWS account on-demand by spinning up underlying containers where the function will be executed. If a request is received while another execution is in progress, another container is spun up. The latency in this spinning-up process is referred to as a “cold start”.
- ↕️ Core Unit of Scale: Function executions per time
- 🎚Setting: Memory allocation — How much memory will your function need? It’s important to note that this dimension is coarse-grained as it also controls CPU and network allocation proportionally, which will ultimately control how long your function will take to execute. 1024MB is a good allocation to start with for all your functions (and is the default used by the Serverless Framework). If you expect a particular Lambda function to receive a high load, then give more consideration to this setting.
- 🎚Setting: Provisioned Concurrency — You can pre-allocate a pool of “pre-warmed” Lambda containers at an individual function level if you have a requirement for extremely low-latency and cannot tolerate any cold starts.
- 🛑Limit: Maximum Concurrent Executions: 1000 (soft limit that can be increased with request to AWS Support).
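A common rule of thumb for checking a workload against the concurrency limit above is that concurrent executions are roughly the average request rate multiplied by the average function duration. The following is a minimal sketch under that assumption (steady traffic, no bursts); the function name and the example numbers are illustrative and not taken from AWS.

```python
import math

# Default account-level soft limit quoted in the list above.
ACCOUNT_CONCURRENCY_LIMIT = 1000


def estimate_concurrency(requests_per_second: float, avg_duration_seconds: float) -> int:
    """Rough steady-state estimate: concurrency = arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_seconds)


# Hypothetical peak: 400 requests/second, each taking 250 ms on average.
peak = estimate_concurrency(requests_per_second=400, avg_duration_seconds=0.25)
print(f"Estimated peak concurrency: {peak}")
print(f"Within default limit: {peak <= ACCOUNT_CONCURRENCY_LIMIT}")
```

If the estimate approaches the limit, that is a signal to request an increase ahead of launch or to look at reducing function duration.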
API Gateway
- ↕️ Core Unit of Scale: API requests per time
- 🎚Settings: RateLimit and BurstLimit — Throttle requests at the HTTP method level.
- 🛑Limit: Throttle limit per region — 10,000 requests per second (soft limit)
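To build intuition for how a rate limit and a burst limit interact, here is a toy token bucket, the general technique that rate-plus-burst throttles are usually described with. It is a simplified sketch, not API Gateway's actual implementation; the class name, method names and the tiny limits used are all hypothetical.

```python
class TokenBucket:
    """Toy token bucket: refills at `rate_limit` tokens/second, holds at most `burst_limit`."""

    def __init__(self, rate_limit: float, burst_limit: int) -> None:
        self.rate_limit = rate_limit
        self.burst_limit = burst_limit
        self.tokens = float(burst_limit)  # start with a full bucket
        self.last_time = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (in seconds) is admitted."""
        elapsed = now - self.last_time
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.burst_limit, self.tokens + elapsed * self.rate_limit)
        self.last_time = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # a real client would receive a throttling error here


bucket = TokenBucket(rate_limit=2, burst_limit=5)
burst = [bucket.allow(now=0.0) for _ in range(7)]  # 7 requests at the same instant
later = bucket.allow(now=1.0)                      # one second later, 2 tokens have refilled
print(burst.count(True), "admitted in the burst; request one second later admitted:", later)
```

The burst limit caps how many requests can be absorbed at once, while the rate limit governs how quickly that headroom is restored.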
DynamoDB
DynamoDB tables can be run in one of two modes:
- On-Demand — Auto-scaling on reads and writes is managed for you. Less capacity planning is required, although you may still see throttling issues if the autoscaling isn’t sufficiently fast to cope with your load. See here for more info on strategies to help with this.
- Provisioned — You need to preconfigure your table based on the number of reads and writes you expect it will need.
↕️ 🎚Core Units of Scale:
- Read Request Unit — “One read request unit represents one strongly consistent read request, or two eventually consistent read requests, for an item up to 4 KB in size. Transactional read requests require 2 read request units to perform one read for items up to 4 KB. If you need to read an item that is larger than 4 KB, DynamoDB needs additional read request units.”
- Write Request Unit — “One write request unit represents one write for an item up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB needs to consume additional write request units.”
- 🛑Limit: Throughput per table — 40,000 read request units and 40,000 write request units (soft limit)
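The request unit definitions quoted above translate into simple arithmetic. This sketch assumes 1 KB means 1,024 bytes and that item sizes round up to the next 4 KB (reads) or 1 KB (writes) boundary; the function names are illustrative.

```python
import math


def read_request_units(item_size_bytes: int, mode: str = "strong") -> float:
    """Read request units for a single read of one item.

    mode is one of "strong", "eventual" or "transactional".
    """
    blocks = max(1, math.ceil(item_size_bytes / 4096))  # one unit covers up to 4 KB
    multiplier = {"strong": 1.0, "eventual": 0.5, "transactional": 2.0}[mode]
    return blocks * multiplier


def write_request_units(item_size_bytes: int) -> int:
    """Write request units for a single standard write of one item."""
    return max(1, math.ceil(item_size_bytes / 1024))  # one unit covers up to 1 KB


item = 6 * 1024  # a hypothetical 6 KB item
for mode in ("strong", "eventual", "transactional"):
    print(mode, read_request_units(item, mode))
print("write", write_request_units(item))
```

Multiplying these per-request figures by the expected reads and writes per second gives a first estimate to compare against the per-table limit.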
SNS
- ↕️ Core Unit of Scale: Published messages per time
- 🛑Limit: Subscribers per topic — 10 million (soft limit).
- 🛑Limit: Messages per second — Currently the AWS SNS docs have not published this limit 🤔
SQS
- ↕️ Core Unit of Scale: Transactions per time
- Standard Throughput (reading and writing) — “Standard queues support a nearly unlimited number of transactions per second”.
- FIFO Throughput — 3,000 messages per second with batching, 300 without.
- Messages per queue (backlog) — Unlimited
- Messages per queue (in-flight) — 120,000
Kinesis Data Streams
Kinesis is probably the “least serverless” of the services listed here in terms of auto-scaling. And therefore it probably requires the most thought with respect to capacity planning. It has the concept of a “shard” which as a serverless developer you really shouldn’t have to care about! Anyway, here’s the definition from the AWS docs:
“A shard has a sequence of data records in a stream. When you create a stream, you specify the number of shards for the stream. The total capacity of a stream is the sum of the capacities of its shards. You can increase or decrease the number of shards in a stream as needed. However, you are charged on a per-shard basis.”
You need to manage the number of allocated shards yourself. If a stream’s throughput exceeds what its allocated shards can handle, subsequent read and write requests will be throttled.
- ↕️ Core Unit of Scale: Transactions (reads and writes) per time
- 🎚Setting: Shard Count — Number of shards the stream uses.
- 🛑Limit: Shards per stream — Unlimited
- 🛑Limit: Shard Input Data per second — 1MB or 1,000 records for writes per shard.
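Using the per-shard write limits above, a first estimate of the shard count is whichever of the two limits (bytes or records) is hit first. This is a rough sketch that assumes records are spread evenly across shards, treats 1 MB as 1,024 KB, and ignores read-side limits; the function name and traffic figures are hypothetical.

```python
import math

# Per-shard write limits quoted in the list above.
MAX_WRITE_MB_PER_SHARD = 1.0
MAX_WRITE_RECORDS_PER_SHARD = 1000


def shards_for_writes(records_per_second: float, avg_record_kb: float) -> int:
    """Minimum shard count that satisfies both per-shard write limits."""
    mb_per_second = records_per_second * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_second / MAX_WRITE_MB_PER_SHARD)
    by_records = math.ceil(records_per_second / MAX_WRITE_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)


# Many small records: the record-count limit dominates.
print(shards_for_writes(records_per_second=5000, avg_record_kb=0.5))
# Fewer, larger records: the byte limit dominates.
print(shards_for_writes(records_per_second=500, avg_record_kb=4))
```

An uneven partition key distribution can throttle a hot shard well before the stream-wide total is reached, so treat the result as a lower bound.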
EventBridge
- ↕️ Core Unit of Scale: Event transactions per time
- 🛑Limit: API Put event requests — 400 requests per second (soft limit)
- 🛑Limit: Put event requests from other AWS services (non-API) — unlimited
- 🛑Limit: Invocations — unlimited
Step Functions
↕️ Core Units of Scale:
- Workflow executions started per time
- State transitions per time
- 🛑Limit: Maximum open executions per account (Standard workflows) — 1 million
- 🛑Limit: StartExecution API calls/second — 1,300 for Standard Workflow; 6,000 for Express Workflow
- 🛑Limit: StateTransition API calls/second — 5,000 for Standard Workflow; Unlimited for Express Workflow
Integrating with non-serverless services
Another reason for performing capacity planning is if your application integrates with downstream services or systems that are not self-scaling or have a throughput limit (e.g. an RDBMS database or a rate-limited third-party API).
In these cases, the “infinite” scaling of services such as Lambda could actually harm your overall system by flooding these downstream services with more requests than they can handle. There are patterns to mitigate this, such as the Scalable Webhook pattern, which involves putting an SQS queue in front of Lambda functions that talk to such less scalable services in order to throttle the rate at which requests are sent to them.
You will need to understand the capacity limits of these downstream services in order to set the throttling threshold appropriately.
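As a rough illustration of turning a downstream limit into a throttling threshold, the sketch below derives a cap on concurrent workers from the downstream rate limit and estimates how long a queued backlog would take to drain. It assumes each worker makes one downstream call at a time, back to back; the function names and numbers are hypothetical.

```python
import math


def max_concurrent_workers(downstream_limit_rps: float, avg_call_seconds: float) -> int:
    """Largest worker count whose combined call rate stays within the downstream limit.

    Returns at least 1; if the limit is lower than a single worker's call rate,
    that worker would also need to pause between calls.
    """
    return max(1, math.floor(downstream_limit_rps * avg_call_seconds))


def drain_time_seconds(backlog_messages: int, downstream_limit_rps: float) -> float:
    """Time to clear a backlog when processing at exactly the downstream limit."""
    return backlog_messages / downstream_limit_rps


# Hypothetical third-party API: 40 requests/second, 250 ms per call.
workers = max_concurrent_workers(downstream_limit_rps=40, avg_call_seconds=0.25)
print(f"Cap concurrent workers at: {workers}")
print(f"Seconds to drain 12,000 queued messages: {drain_time_seconds(12_000, 40)}")
```

The worker cap could then be applied as a concurrency limit on the function consuming the queue, with the drain time indicating whether the resulting delay is acceptable for the workload.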
Conclusion — is upfront capacity planning needed for production serverless workloads?
If you have a small internal app that uses API Gateway->Lambda->DynamoDB and you know it will only ever receive a handful of requests each day, then you probably don’t need to do any further analysis.
For anything beyond that, the answer is yes, but the amount of planning required will depend upon the specific AWS services you are using and also on any downstream non-serverless or third-party services you are integrating with.