AWS Step Function vs. AWS Lambda benchmark

🥊 It is time for a battle again. After I published the first part of my comparison, I was overwhelmed by the amount of feedback I received, whether as comments on my post or in discussions on Twitter and LinkedIn.

The fact that the initial post triggered so many inspiring discussions is very valuable. While reading through your feedback, it became obvious that there is a need for a second part.

I received a lot of feedback about optimizations for AWS Lambda, and people are curious about how these affect performance in comparison to our state machine. We will also take a closer look at costs to get a more complete view of how the services differ. So here we are.

As in the first part, all experiments are triggered using Apache Bench with the following parameters.

ab -n 15000 -c 1 https://hash.execute-api.eu-central-1.amazonaws.com/.../

-n configures the total number of requests that are triggered, in our case 15,000
-c is the number of concurrent requests, in our setup 1

⚠️ IMPORTANT: Keep in mind that the results from apache-bench are not 100% accurate. The measured throughput depends on the hardware and network capabilities of my local workstation. For upcoming benchmarks, I am considering something like CloudShell. Still, apache-bench gives some very early feedback and potential indications. Hence we use these results in combination with the Lambda duration and the Step Functions execution duration.

🔋 Optimizing our Lambda function

So what is the goal of our upcoming experiments? We want to apply some optimizations to our Lambda function with a clear focus on decreasing latency. Based on the feedback I got, there were two main approaches for optimization:

Reusing downstream HTTP connections by activating keep-alive settings.
Improving overall execution performance by increasing the allocated memory.

Reusing Connections with Keep-Alive in Node.js

For short-lived operations, such as writing to and reading from S3 in our case, the latency overhead of setting up a TCP connection might be greater than the operation itself. To activate HTTP keep-alive, you simply have to set an environment variable in your Lambda function configuration.

Environment:
  Variables:
    AWS_NODEJS_CONNECTION_REUSE_ENABLED: 1

If you already use v3 of the AWS SDK for JavaScript, connection reuse is enabled by default. For v2 you have to activate it explicitly.
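For v2, the same effect can also be achieved in code instead of through the environment variable. The following is a minimal sketch, assuming you prefer to configure the HTTP agent yourself; the function name `createKeepAliveAgent` and the `maxSockets` value are illustrative choices and not part of the benchmark setup.

```javascript
const https = require("https");

// Build an HTTPS agent that keeps sockets open between requests,
// so follow-up calls can skip the TCP/TLS handshake.
// maxSockets is an illustrative value, not taken from the benchmark.
function createKeepAliveAgent() {
  return new https.Agent({ keepAlive: true, maxSockets: 50 });
}

// Create the agent once, outside the handler, so warm invocations reuse it.
const agent = createKeepAliveAgent();

// With SDK v2 the agent would be handed to the client, for example:
//   const s3 = new AWS.S3({ httpOptions: { agent } });
// This is left as a comment so the snippet runs without the SDK installed.
console.log("keep-alive enabled:", agent.keepAlive);
```

Creating the agent at module scope matters: an agent created inside the handler would be thrown away after each invocation, and with it the open connections.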

Let us deploy the change and start our first test, beginning with the Apache Bench reports. The complete report is available on GitHub. Here are some highlights:

The Lambda function processed all requests 43 seconds faster than the state machine.
Both the state machine and the Lambda function processed roughly 7 requests per second.
The mean time per request was 131ms for the Lambda function and 134ms for the state machine.
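As a rough plausibility check, the per-request means can be related to the total run time. The sketch below assumes strictly sequential requests (-c 1) and uses only the figures quoted above; the function names are made up for illustration.

```javascript
// Total wall-clock time in seconds for sequential requests.
function totalSeconds(requests, meanMsPerRequest) {
  return (requests * meanMsPerRequest) / 1000;
}

// Throughput when only one request is in flight at a time.
function requestsPerSecond(meanMsPerRequest) {
  return 1000 / meanMsPerRequest;
}

const REQUESTS = 15000;
const lambdaMeanMs = 131;
const stateMachineMeanMs = 134;

// Difference in total run time implied by the two means.
const gapSeconds =
  totalSeconds(REQUESTS, stateMachineMeanMs) -
  totalSeconds(REQUESTS, lambdaMeanMs);

console.log("implied gap in seconds:", gapSeconds);
console.log("Lambda req/s:", requestsPerSecond(lambdaMeanMs).toFixed(2));
console.log("State machine req/s:", requestsPerSecond(stateMachineMeanMs).toFixed(2));
```

The 3ms difference in the means adds up to about 45 seconds over 15,000 requests, which is in the same range as the 43 seconds reported by Apache Bench (the means are rounded).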

Looking at these results, this little tweak of activating HTTP keep-alive helped a lot to speed up the Lambda function. In terms of end-to-end performance and latency, both solutions are now very close to each other.

Let us take a closer look into CloudWatch and X-Ray to confirm the observations.

[Figure: latencies with keep-alive]

The average execution time of the state machine is 46.4ms and Lambda performs at 49ms.

[Figure: X-Ray service map with keep-alive]

Here things still look interesting. The average Lambda function duration still shows some ups and downs during the test, while the duration of the state machine is stable. Both solutions show some cold-start behavior, although the state machine seems to need less time to become "warm".

Overall, though, the impact on the Lambda function performance is very impressive compared to the results in the first part.

Give the Lambda function some RAM

But the question is: how much memory does my Lambda function need? The range is quite large, from 128 MB to 10,240 MB. There is an awesome open-source tool called "Lambda Power Tuner" that helps you determine your memory setting based on different strategies like speed, cost, or balanced.

If you use "cost" the state machine will suggest the cheapest option (disregarding its performance), while if you use "speed" the state machine will suggest the fastest option (disregarding its cost). When using "balanced" the state machine will choose a compromise between "cost" and "speed"

Source: Lambda Power Tuner @ AWS Serverless Application Repository
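To make the three strategies more tangible, here is a simplified sketch of how such a selection could work. It is not the tool's actual implementation: the measurements are hypothetical, the price per GB-second is an assumed value, and the "balanced" score (an equal-weight mix of normalized cost and duration) is my own simplification.

```javascript
// Hypothetical measurements: memory size (MB) and average duration (ms).
// These numbers are made up for illustration, not taken from the benchmark.
const measurements = [
  { memoryMb: 128, avgDurationMs: 120 },
  { memoryMb: 256, avgDurationMs: 55 },
  { memoryMb: 1024, avgDurationMs: 45 },
  { memoryMb: 2048, avgDurationMs: 40 },
];

// Assumed price per GB-second; check the current AWS pricing page.
const PRICE_PER_GB_SECOND = 0.0000166667;

// Compute cost of a single invocation (request fee ignored, it is the same for all sizes).
function costPerInvocation({ memoryMb, avgDurationMs }) {
  return (memoryMb / 1024) * (avgDurationMs / 1000) * PRICE_PER_GB_SECOND;
}

// Return the memory size with the lowest score for the given strategy.
function pickMemory(results, strategy) {
  const scored = results.map((r) => ({ ...r, cost: costPerInvocation(r) }));
  const maxCost = Math.max(...scored.map((r) => r.cost));
  const maxDuration = Math.max(...scored.map((r) => r.avgDurationMs));

  const scoreFns = {
    cost: (r) => r.cost,
    speed: (r) => r.avgDurationMs,
    // Simplified assumption: equal weight for normalized cost and duration.
    balanced: (r) => 0.5 * (r.cost / maxCost) + 0.5 * (r.avgDurationMs / maxDuration),
  };
  const score = scoreFns[strategy];
  if (!score) {
    throw new Error(`Unknown strategy: ${strategy}`);
  }
  return scored.reduce((best, r) => (score(r) < score(best) ? r : best)).memoryMb;
}

for (const strategy of ["cost", "speed", "balanced"]) {
  console.log(strategy, "->", pickMemory(measurements, strategy), "MB");
}
```

The point of the sketch is that more memory is not automatically more expensive: if the duration drops faster than the memory grows, a larger setting can be cheaper per invocation.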

In my case, the "Lambda Power Tuner" suggested 256 MB as the "Best cost" and 2048 MB as the "Best Time".

[Figure: Lambda Power Tuner output]

Awesome, now we have a good start for the final tests.

Best time setting

As we aim to reduce latency, let us start with the proposed "Best Time" setting of 2048 MB memory and have a look at the apache-bench metrics:

The Lambda function processed all requests 81 seconds faster than the state machine.
Both the state machine and the Lambda function processed roughly 8 requests per second.
The mean time per request was 121ms for the Lambda function and 127ms for the state machine.

Compared to our first test, there is some improvement, but on average it seems to be marginal. Let us try to get some more insights using CloudWatch and X-Ray.

[Figure: CloudWatch latencies, 2048 MB]

For the most part, the duration of the Lambda function is just below the execution time of the state machine. The average execution time of the state machine is 45.1ms and Lambda shines with 41.8ms.

[Figure: X-Ray service map, 2048 MB]

What would happen if we set our memory configuration to the setting considered as "Best cost"? Let us review the results in the next section.

Best cost setting

Again, in short, our apache-bench metrics:

The Lambda function processed all requests 155 seconds faster than the state machine.
The state machine processed 7.5 requests per second, while the Lambda function processed 8 requests per second.
The mean time per request was 122ms for the Lambda function and 132ms for the state machine.

CloudWatch and X-Ray also confirm that the results are very close.

[Figure: CloudWatch latencies, 256 MB]

The average execution time of the state machine is 54.8ms and Lambda is just in the lead with 50.5ms.

[Figure: X-Ray service map, 256 MB]

💰 Cost comparison

Given the scale of my test, the AWS Cost Explorer was not really helpful, as the load I generated was too low. The AWS Pricing Calculator is a helpful tool to better compare the costs of both services.

The estimate is publicly available if you want to have a detailed look.

I calculated with 5 million invocations per month per service. Based on our test results, I was able to determine very precise values for the parameters that influence pricing, like Lambda invocation duration, state machine execution duration, or consumed memory. The monthly costs are:

8 USD for AWS Lambda with 2048 MB memory (Best time)
1.83 USD for AWS Lambda with 256 MB memory (Best cost)
5.52 USD for the AWS Step Functions express workflow
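The estimate can be approximated with a few lines of arithmetic. The sketch below is my own simplification, not the calculator's implementation: the list prices, the 64 MB memory assumption and the rounding of express workflow durations to 100ms steps are assumptions you should verify against the current AWS pricing pages, and the free tier is ignored.

```javascript
// Assumed list prices in USD; verify against the current AWS pricing pages.
const LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.2;
const LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667;
const EXPRESS_PRICE_PER_MILLION_REQUESTS = 1.0;
const EXPRESS_PRICE_PER_GB_SECOND = 0.00001667;

// Monthly Lambda cost: request fee plus compute fee (GB-seconds).
function lambdaMonthlyCost(invocations, memoryMb, avgDurationMs) {
  const gbSeconds = invocations * (avgDurationMs / 1000) * (memoryMb / 1024);
  return (
    (invocations / 1e6) * LAMBDA_PRICE_PER_MILLION_REQUESTS +
    gbSeconds * LAMBDA_PRICE_PER_GB_SECOND
  );
}

// Monthly express workflow cost. Assumes duration is billed in 100ms steps
// and memory in 64 MB chunks, with 64 MB as the default assumption.
function expressMonthlyCost(executions, avgDurationMs, memoryMb = 64) {
  const billedMs = Math.ceil(avgDurationMs / 100) * 100;
  const billedMemoryMb = Math.ceil(memoryMb / 64) * 64;
  const gbSeconds = executions * (billedMs / 1000) * (billedMemoryMb / 1024);
  return (
    (executions / 1e6) * EXPRESS_PRICE_PER_MILLION_REQUESTS +
    gbSeconds * EXPRESS_PRICE_PER_GB_SECOND
  );
}

const INVOCATIONS = 5e6;
// 42ms is the measured 41.8ms average at 2048 MB, rounded up (an assumption).
console.log("Lambda 2048 MB:", lambdaMonthlyCost(INVOCATIONS, 2048, 42).toFixed(2));
// 46.4ms is one of the measured state machine averages; any value up to 100ms bills the same.
console.log("Express workflow:", expressMonthlyCost(INVOCATIONS, 46.4).toFixed(2));
```

Under these assumptions the sketch lands at roughly 8 USD for the 2048 MB Lambda function and roughly 5.52 USD for the express workflow. The inputs behind the "Best cost" figure are not stated here, so that value is not reproduced.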

💡 Conclusion

In this part, we covered some important aspects like options to improve the performance of a Lambda function. I think it is again very important to mention that this benchmark should not be interpreted as "use Step Functions whenever you can".

My goal was rather to spark discussions about the importance of not basing your decisions on hypotheses or rumors. Base them on data instead, so that you can make the best possible choice.

I would again like to point out a quote from Eric Johnson at serverless office hours:

Use Lambda to transform not to transport

Or in my words: the best code is the code that is never written.

☝️ And here comes the thing and this is very important to keep in mind:

BOTH SERVICES ARE AWESOME.

If you need to write a Lambda function, you will be able to solve a lot of problems. But depending on what you want to achieve, AWS Step Functions gives you a lot of power to get the same results without writing ANY line of code and without having to make up your mind about things like HTTP keep-alive or how to figure out the best memory setting. In all tests, AWS Lambda showed the well-known cold-start behavior, which is something you should keep in mind. AWS Step Functions also needs some warm-up time, but it is not comparable to AWS Lambda cold starts. There was an interesting discussion around this on Twitter:

https://twitter.com/diegosantiviago/status/1453733187666857985

It only remains to say: happy coding AND happy orchestrating! 🥳 I hope that my analysis and approach to decision-making help you in deciding towards or against one of these services for your individual use cases.
