AWS Step Functions vs. AWS Lambda benchmark

Benchmarking latencies of AWS Lambda executions and AWS Step Functions using SDK integrations


Looking at the AWS ecosystem of serverless services, AWS Step Functions is one of my personal favorites. I recently had a chat with some colleagues about a potential use case for Step Functions instead of AWS Lambda. While we discussed the general concept of AWS Step Functions, one of my beloved colleagues argued in favor of AWS Lambda like this:

Let us use AWS Lambda because a workflow described as a state machine sounds like it is much slower.

I could neither substantiate nor refute this statement. So I started to examine the original assumption "Step Functions is slower than Lambda" with facts. Time for a benchmark!

For me, the results were crystal clear 😆

One does not simply

Just kidding! Let us first get a common understanding of what AWS Step Functions and AWS Lambda are. If you are familiar with these services, you can jump right into the section about the test setup and results.

By the way: the source code is also available for you on GitHub.

🤹 What is AWS Step Functions?

AWS Step Functions was launched in 2016 as a serverless orchestration service. I think the following definition explains very well what kind of problems AWS Step Functions solves:

Step Functions is a serverless orchestration service that lets you combine […] AWS services to build business-critical applications. Through Step Functions’ graphical console, you see your application’s workflow as a series of event-driven steps.

Step Functions is based on state machines and tasks. A state machine is a workflow. A task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state. Source: What is AWS Step Functions? - AWS Step Functions

State machines can be invoked both asynchronously and synchronously. Step Functions itself offers several ways to invoke your state machine, for example:

  • via an explicit StartExecution call using your favorite AWS SDK,

  • on each HTTP request hitting your Amazon API Gateway,

  • as a target of a rule on your Amazon EventBridge event bus.
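
As a rough illustration of the first option, the sketch below starts an execution with the AWS SDK for JavaScript (v2), the same SDK generation used by the Lambda code later in this post. The function names, the `STATE_MACHINE_ARN` environment variable and the payload are placeholders I made up. The client is passed in as an argument, so the snippet itself does not need the SDK or credentials to load.

```javascript
// Build the parameters for a StartExecution call.
// `stateMachineArn` and `payload` are placeholders for your own values.
function buildStartExecutionParams(stateMachineArn, payload) {
  return {
    stateMachineArn,
    // Step Functions expects the execution input as a JSON string.
    input: JSON.stringify(payload),
  };
}

// Start an execution using an injected Step Functions client,
// e.g. `new AWS.StepFunctions()` from the `aws-sdk` (v2) package.
async function startWorkflow(stepFunctions, stateMachineArn, payload) {
  const params = buildStartExecutionParams(stateMachineArn, payload);
  // startExecution returns as soon as the execution has been accepted.
  // For Express workflows, startSyncExecution waits for the result instead.
  const result = await stepFunctions.startExecution(params).promise();
  return result.executionArn;
}

// Usage (assumes the aws-sdk package, valid credentials and a
// hypothetical STATE_MACHINE_ARN environment variable):
//   const AWS = require("aws-sdk");
//   startWorkflow(new AWS.StepFunctions(), process.env.STATE_MACHINE_ARN, { id: 1 })
//     .then(console.log);
```

Injecting the client keeps the parameter handling easy to test with a stub and leaves the choice between asynchronous and synchronous invocation to the caller.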

Typical use cases for AWS Step Functions include data processing, machine learning, microservices orchestration, and governance and security automation. Since the launch of the AWS SDK service integrations, you can use out-of-the-box integrations with nearly every service that is supported by the AWS SDK. This offers you a huge number of new opportunities to integrate with AWS services without writing a single line of code.

When creating a new state machine, you choose between two workflow types, “Standard” and “Express”. Each type has its own characteristics and strengths. While Standard workflows are a good fit for long-running workflows, Express workflows are a good fit for high-traffic workloads, data streaming or mobile application backends.

⚡️ What is AWS Lambda?

Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. With Lambda, you can run code for virtually any type of application or backend service. Source: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html

Don’t get me wrong, I am also a big fan of AWS Lambda. But since AWS announced the game-changing SDK service integrations for Step Functions, I have started to think more about the typical use cases for AWS Lambda, so that in the future I use it more for the things it is amazing at.

Or, to quote Eric Johnson from Serverless Office Hours:

Use Lambda to transform not to transport

⏰ Benchmarking latencies

The goal of this benchmark is not to say that service A is better or worse than service B. Each service has its strengths and weaknesses. What we want to achieve is a better understanding of the latencies we can measure for AWS Step Functions and how they compare to a similar integration based on AWS Lambda.

General setup

We want to measure the time it takes to write data to and read data from Amazon S3, both from a state machine and from an AWS Lambda function.

We test the behavior in two different versions. Version 1 simply writes to S3. Version 2 extends this by executing a GetObject operation afterwards. The code of the Lambda function is written in JavaScript; the listing below shows version 2.

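const AWSXRay = require("aws-xray-sdk-core");
// Wrap the AWS SDK so that every SDK call shows up as a subsegment in X-Ray.
const AWS = AWSXRay.captureAWS(require("aws-sdk"));
const S3 = new AWS.S3();
const bucketName = process.env.DestinationBucketName;

exports.lambdaHandler = async (event, context) => {
  try {
    console.log("EVENT: " + JSON.stringify(event));
    // Use the API Gateway request ID as a unique object key.
    const key = "lambda/" + event.requestContext.requestId;
    await S3.putObject({
      Bucket: bucketName,
      Key: key,
      Body: new Date().toISOString(),
    }).promise();

    // Version 2 only: read the object back after writing it.
    await S3.getObject({
      Bucket: bucketName,
      Key: key,
    }).promise();

    const response = {
      statusCode: 200,
      isBase64Encoded: false,
    };
    return response;
  } catch (err) {
    console.error(err);
    // Return a well-formed response instead of the raw error object,
    // which API Gateway cannot map to an HTTP response.
    return {
      statusCode: 500,
      isBase64Encoded: false,
    };
  }
};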

The state machine workflow is similarly straightforward and chains the same Amazon S3 calls as the AWS Lambda function.

State machine graph
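
The actual definition lives in the GitHub repository. The following is only a sketch of what such a workflow could look like in Amazon States Language, written as a JavaScript object for version 2 of the experiment. The state names, the `stepfunctions/` key prefix, the choice of key and body values, and the bucket name are my assumptions and are not taken from the benchmark code.

```javascript
// Hypothetical bucket name; in a real template this would be injected.
const bucketName = "my-benchmark-bucket";

// Amazon States Language definition chaining PutObject and GetObject
// through the AWS SDK service integrations (no Lambda function involved).
const definition = {
  StartAt: "PutObject",
  States: {
    PutObject: {
      Type: "Task",
      Resource: "arn:aws:states:::aws-sdk:s3:putObject",
      Parameters: {
        Bucket: bucketName,
        // Use the execution name from the context object as a unique key.
        "Key.$": "States.Format('stepfunctions/{}', $$.Execution.Name)",
        // Store the time the state was entered, similar to the Lambda body.
        "Body.$": "$$.State.EnteredTime",
      },
      // Discard the PutObject result and pass the original input along.
      ResultPath: null,
      Next: "GetObject",
    },
    GetObject: {
      Type: "Task",
      Resource: "arn:aws:states:::aws-sdk:s3:getObject",
      Parameters: {
        Bucket: bucketName,
        "Key.$": "States.Format('stepfunctions/{}', $$.Execution.Name)",
      },
      End: true,
    },
  },
};

// The JSON string is what you would pass as the state machine definition.
console.log(JSON.stringify(definition, null, 2));
```

For version 1, the `GetObject` state would be dropped and `PutObject` would end the workflow with `End: true` instead of `Next`.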

Both the AWS Lambda function and the state machine can be invoked via an API Gateway. All experiments are triggered using Apache Bench with the following parameters:

ab -n 15000 -c 1 https://hash.execute-api.eu-central-1.amazonaws.com/Prod/invoke-lambda/

  • -n configures the total number of requests that are triggered, in our case 15,000

  • -c is the number of concurrent requests, in our setup 1

I decided to use this setting because I wanted to generate a moderate stream of load for both integrations.

X-Ray is activated on all integration layers so that we can get a complete trace from the API Gateway down to S3.

Experiment 1 - Writing to S3

The first experiment focuses only on the execution of a PutObject without reading the files afterwards. The automatic Amazon CloudWatch dashboards for AWS Lambda, Amazon API Gateway and AWS Step Functions are a good starting point to provide us with valuable insights.

Let us start by analyzing the Apache Bench reports. The complete reporting is available on GitHub. Here are some highlights:

  • The state machine was able to process all requests 539 seconds faster than the Lambda function.

  • The state machine was able to process 2.07 more requests per second.

  • The mean time per request for the state machine is 35.92 ms lower than for the AWS Lambda based integration.
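
The first and the third highlight describe the same measurement from two angles. Because Apache Bench runs with a concurrency of 1, requests are processed one after another, so the total time saved divided by the number of requests should roughly equal the difference in mean time per request. A quick sanity check, with a helper name of my own choosing:

```javascript
// With `ab -c 1`, total time is roughly requests x mean time per request,
// so a difference in total time maps directly to a per-request difference.
function perRequestDifferenceMs(totalSecondsSaved, requestCount) {
  return (totalSecondsSaved * 1000) / requestCount;
}

const requests = 15000;        // -n from the ab command
const totalSecondsSaved = 539; // rounded value from the report

const diffMs = perRequestDifferenceMs(totalSecondsSaved, requests);
console.log(diffMs.toFixed(2) + " ms per request"); // 35.93 ms per request
```

The result is in line with the reported 35.92 ms; the small gap presumably comes from the total time being rounded to full seconds.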

API Gateway latencies

A closer look at the Amazon CloudWatch dashboard underlines what Apache Bench tells us. Over the complete length of the benchmark, the average latency of Step Functions is consistently below that of AWS Lambda.

Average latencies on API Gateway

Both integration types show a drop in latencies, which indicates some kind of cold start behavior. On average, the drop is more significant for Step Functions than for AWS Lambda.

When we take a closer look at the 99th percentile, we see some more spikes but in general a similar result over time.

99th percentile latencies on API Gateway
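
To see why the 99th percentile exposes spikes that the average smooths over, here is a small sketch that computes both from a list of latency samples. It uses the nearest-rank method, which is an assumption on my side and not necessarily how CloudWatch calculates percentiles. The sample values are made up.

```javascript
// Arithmetic mean of the samples.
function mean(samples) {
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Made-up latencies in ms: 98 fast requests and two slow outliers.
const latencies = Array(98).fill(50).concat([400, 900]);

console.log("mean:", mean(latencies));          // 62
console.log("p50:", percentile(latencies, 50)); // 50
console.log("p99:", percentile(latencies, 99)); // 400
```

Two slow requests out of a hundred barely move the mean, while the 99th percentile jumps to the value of the slower requests.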

State machine and AWS Lambda function execution

Let us now jump into the next integration layer and take a look at the duration of the AWS Lambda function and the state machine itself. Not surprisingly, the state machine is much faster: in the end, its duration is roughly 60% below that of the Lambda function.

State machine and Lambda execution

The AWS Lambda function runs with the default memory setting of 128 MB and the default timeout of 3 seconds. Depending on the concrete use case, fine-tuning your memory settings might have a significant impact on the Lambda metrics.

Downstream service latencies

I was very surprised to see that the connection between Step Functions and S3 seems to be much more efficient. Looking at our X-Ray service map and traces, the average latency between Lambda and S3 is 63 ms, compared to 28 ms for the integration with Step Functions. It may be a coincidence that the relative difference is also almost 60%. Or it might reveal that Step Functions does some optimization in handling the AWS SDK client under the hood.

X-Ray service map experiment 1
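
For reference, the relative difference between the two X-Ray averages can be worked out as follows (the function and variable names are mine):

```javascript
// Relative reduction of `improved` compared to `baseline`, as a fraction.
function relativeReduction(baseline, improved) {
  return (baseline - improved) / baseline;
}

const lambdaToS3Ms = 63;        // average latency Lambda -> S3 from X-Ray
const stepFunctionsToS3Ms = 28; // average latency Step Functions -> S3

const reduction = relativeReduction(lambdaToS3Ms, stepFunctionsToS3Ms);
console.log((reduction * 100).toFixed(1) + "% lower"); // 55.6% lower
```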

Experiment 2 - Write and read from S3

I was interested to know whether the amount of work a state machine has to do impacts latencies and execution times compared to my AWS Lambda function. Hence, we extended our experiment to also read data from S3 after writing it.

Again, let us first check the report from Apache Bench:

  • The state machine was able to process all requests 1287 seconds faster than the Lambda function.

  • The state machine was able to process 3.01 more requests per second.

  • The mean time per request for the state machine is 85.83 ms lower than for the AWS Lambda based integration.

API Gateway latencies and execution duration

Long story short, the results are comparable to the ones from the first experiment. But it is interesting to see that the gap between the state machine and the Lambda function is getting bigger. Some factors will influence this, like the chosen implementation and runtime of the AWS Lambda function.

💡 Please check out the awesome article by my fellow AWS Community Builder Alexandr Filichkin about a performance comparison of the different Lambda runtimes.

The AWS Lambda function is not able to get closer to the latency behavior of the state machine implementation.

API Gateway latencies experiment 2

The AWS Lambda function needs almost double the amount of time to write data to and read it from S3.

Execution duration experiment 2

It is also interesting to see that the average latency between my AWS Lambda function and Amazon S3 seems to increase slightly compared to the first experiment. AWS Step Functions keeps on optimizing the connection to Amazon S3 🤩.

X-Ray service map experiment 2

Conclusions

Based on the things I learned, what would I answer now if someone states:

Let us use AWS Lambda because a workflow described as a state machine sounds like it is much slower.

My general answer would be: measure first. My specific answer on the comparison of AWS Step Functions and an AWS Lambda function is that this is not true in all cases. Our little experiment revealed some interesting insights:

  • AWS Step Functions scales well and, in our setup, is much faster than my AWS Lambda function.

  • In this experiment, the state machine shows more efficient communication with S3 compared to my custom code implementation.

  • Compared to the AWS Lambda implementation, the Step Functions implementation achieves the same result without any custom code.

  • The new capabilities of the Step Functions Workflow Studio and the SDK service integrations lower the barrier to achieving the same result in this use case while reducing time-to-market.

But be cautious in generalizing the test results. There is a lot you can do to optimize your AWS Lambda functions for performance efficiency. Your results might also differ in other use cases. These results should not discourage you from creating additional benchmarks for your specific use cases to measure what is important to you.
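
If you want to start with a benchmark of your own, a minimal sketch could look like this. It calls an async function sequentially, similar to `ab -c 1`, and records how long each call takes. `runBenchmark` and the dummy task are hypothetical; you would replace the task with a call to your own endpoint.

```javascript
// Run `task` sequentially `iterations` times and collect durations in ms.
async function runBenchmark(task, iterations) {
  const durations = [];
  for (let i = 0; i < iterations; i++) {
    const start = process.hrtime.bigint();
    await task(i);
    const end = process.hrtime.bigint();
    durations.push(Number(end - start) / 1e6); // nanoseconds -> milliseconds
  }
  const total = durations.reduce((sum, d) => sum + d, 0);
  return {
    iterations,
    meanMs: total / iterations,
    minMs: Math.min(...durations),
    maxMs: Math.max(...durations),
  };
}

// Dummy task standing in for a real request (e.g. an HTTPS call to your API).
const dummyTask = () => new Promise((resolve) => setTimeout(resolve, 5));

runBenchmark(dummyTask, 20).then((report) => console.log(report));
```

Running the calls one after another keeps the numbers comparable to the Apache Bench setup used here. For concurrent load, a dedicated tool is the better choice.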

Please also ask yourself whether you really have to optimize for performance, and consider whether your use case could be implemented asynchronously instead.