Takeaways from the State of Serverless Report
I recently discussed the methodology, findings, and key takeaways of The State of Serverless Report by Datadog with Stephen Pinkerton and Darcy Rayner.
On a recent episode of Serverless Chats, I spoke with Stephen Pinkerton and Darcy Rayner of Datadog to dig into The State of Serverless report, which was released at the end of February 2020. After frequently fielding customer questions about the topic, Datadog looked at its data and customer use cases, and examined how those customers were using serverless. Datadog's report is a way to break it all down, but it's also an opportunity for its customers (and serverless users in general) to see, in a data-driven way, how other people are using serverless. I discussed methodology, findings, and key takeaways with Stephen and Darcy, and thought it'd be worthwhile to consolidate and share that insight.
Behind the Methodology
Context is important, so before diving into the findings, let's take a step back and look at how Datadog defined some of the classifications in the report. Most importantly, this report only includes data from Datadog customers, who are likely more cloud savvy than your average company. Also, even though the serverless world extends beyond AWS, the report only includes Datadog's AWS customers.
The report also specifically focuses on Lambda usage as an indication of serverless adoption. While both Stephen and Darcy agreed that serverless is much more than Functions-as-a-Service (FaaS), it's still a good baseline to use. When it comes to its reference of Lambda adoption, Datadog includes any account in AWS running more than five Lambda functions a month, which it considers the point at which Lambda is being used regularly.
But this doesn't mean that only large customers would end up qualifying; Datadog's broader definition of AWS usage includes not only anyone who's currently using Lambda, but also any organization that has more than five EC2 instances running in a given month. Given that five EC2 instances is a relatively low bar, this gives a nice broad perspective. When Datadog defines customers in this sense as "small," "medium," or "large," it refers to the scale or "footprint" of the other infrastructure that they're running. Now that we've got the basics covered, we can make our way to the findings and key takeaways.
Finding #1: Half of AWS users have adopted Lambda
When I asked Stephen about this statistic, he speculated that growing Lambda adoption comes down to the need for speed. Teams can move a lot faster with Lambda, especially because from a development perspective, there's less red tape to cut through and teams can hit the ground running. For many, getting a Lambda function approved is much easier than getting approval for other types of compute infrastructure.
Lambda usage by Datadog customers has more than doubled in two years -- a trendline that seems to indicate that by 2022, the vast majority of customers will be using Lambda. This makes sense for a number of workloads and use cases, but Stephen thinks this trend isn't just limited to Lambda. Instead, he thinks people are starting to realize the value of other serverless services like databases and message queues. The whole ecosystem is definitely here to stay, but the way that teams are running code is certainly changing.
Finding #2: Lambda is more prevalent in large environments
This finding seems like it could be due to two factors: cloud sophistication and the serverless learning curve. Darcy observed that when there's a large organization with several teams, there is a broader movement towards microservices, having team ownership boundaries of services, and giving engineers more autonomy. With that, there's an increased likelihood of a few teams experimenting with and adopting serverless, with those experiments serving as the gateway. Compare that to a smaller company, which may end up with more unified technology, but with less of a chance that it will have the time or resources to adopt new technologies.
The learning curve in getting people started with serverless could be another factor. Small organizations can't take the same risks as larger enterprises, which often have the luxury of experimenting within smaller teams without having to implement company-wide at first. Many developers will first experiment with serverless on their own before bringing it to their teams.
Finding #3: Container users have flocked to Lambda
Clearly, customers are not going to abandon containers altogether, but the idea of being able to use Lambda functions to do some of the workloads has become increasingly appealing to companies. 80% of Datadog's AWS container users have at least five Lambda functions running. Given these findings, it seems it's become less important where you're actually running your code. So if you're already running in a microservice architecture, then it's much easier to adopt serverless.
However, the data doesn't suggest there's a reduction in the use of containers while people are migrating their compute to Lambda functions. When coupled with findings from another Datadog report on containers, both Lambda function usage and container usage are growing steadily. Darcy speaks to the fact that there are so many opportunities for organizations to migrate from traditional internally-hosted infrastructure or even older cloud infrastructures to containers and serverless.
Finding #4: Amazon SQS and DynamoDB pair well with Lambda
This one does seem a bit obvious. You would think people who build serverless applications would want to use tools that are serverless themselves, or at least, tools that play really, really well with Lambda. But the finding is interesting, because it seems like even though pay-per-use downstream services are very popular with Lambda, there are still a lot of people using it to connect to relational databases like MySQL.
So it's clear that companies aren't necessarily moving away from SQL to DynamoDB. Darcy notes that it can come down to comfort level for some people, as they tend to prefer different databases for different solutions. He thinks we'll see the story of using relational databases and having managed relational databases become easier and easier, and more of the scaling overhead being taken away from engineers.
It seems as though many Datadog customers are starting to embrace the idea of asynchronous thinking, especially seeing that SQS, Kinesis, and SNS are so popular as Lambda triggers. Darcy noted that an event-driven microservices revolution is happening. It takes a long time for large organizations to really buy into that idea, but it's still a growing trend in general.
Finding #5: Node.js and Python dominate among Lambda users
Since "large" customers are seemingly adopting Lambda faster, I would have expected them to favor languages like Java. Instead, Python and Node.js are by far the most popular. This popularity could be because they're not compiled, so you can launch them quickly, and the findings show that cold start time was a factor, which corroborates that idea.
Darcy explained that use cases with Node and Python are really all over the map, but in particular, it's very common to see a background job utilizing these languages. He said that some workloads aren't necessarily entirely appropriate for services like Lambda, and something like ECS or Fargate would be more appropriate. I agree that not all workloads are a great fit, but as Lambda gets better, the use cases increase.
Finding #6: The median Lambda function runs for 800 milliseconds
While half of them run for less than 800 milliseconds, the longer-running ones seem to suggest that there are other tasks they might be performing. Darcy thinks that people have probably decided to use Lambda to run more computationally heavy workloads. He anticipates that some of it is trying to fit a square workload into a round hole -- which you could boil down to a form of misuse.
Another part of this finding was that one-fifth of Lambda functions run for 100 milliseconds or less, which is interesting, because of course, that is the AWS unit of billing. I know there have been some calls (mine included) to get that granularity down to maybe 50 milliseconds as opposed to 100 milliseconds.
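To make the rounding concrete, here is a minimal sketch of how a run time gets rounded up to a billing increment. The helper name and the sample durations are illustrative assumptions; the only figures taken from the discussion above are the 100 millisecond unit and the hoped-for 50 millisecond one.

```python
import math

def billed_ms(duration_ms: float, increment_ms: int = 100) -> int:
    """Round a run time up to the next billing increment (hypothetical helper)."""
    return math.ceil(duration_ms / increment_ms) * increment_ms

# Illustrative durations, not figures from the report.
for duration in (20, 51, 100, 101):
    print(duration, billed_ms(duration, 100), billed_ms(duration, 50))
```

Under these assumptions, a 20 millisecond function is billed as 100 milliseconds at the coarser granularity and as 50 at the finer one, which is why the shortest functions would have the most to gain.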
Finding #7: Half of Lambda functions have the minimum memory allocation
Speaking of misuse, improper memory allocation would be near the top of the list. Darcy says that with memory allocation, it probably comes down to miseducation. People don't spend a lot of time thinking about how to optimize these services. Serverless has removed so much of the thinking about overhead and infrastructure that he thinks developers are just putting things in Lambda without spending the time to tweak the configuration, reduce latency, and potentially cut costs.
Stephen says that they frequently get the question: "How do I optimize my Lambda workloads?" He says even though Lambda lets people upload code to a cloud provider and hand off the responsibility for running it, there are still knobs that can be turned to adjust performance. However, there's a lot of miseducation or complete lack of education around what these knobs actually do.
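One of those knobs is memory. Lambda compute is charged in proportion to memory multiplied by billed duration (GB-seconds), so the minimum allocation is only the cheapest option if the function doesn't run noticeably longer because of it. The sketch below compares two configurations; the durations are hypothetical, and whether a particular function actually speeds up with more memory is something you'd have to measure.

```python
def gb_seconds(memory_mb: int, billed_ms: int) -> float:
    """Compute consumed by one invocation, in GB-seconds (hypothetical helper)."""
    return (memory_mb / 1024) * (billed_ms / 1000)

# Assumed measurements: doubling memory halves the run time for this function.
minimum = gb_seconds(memory_mb=128, billed_ms=800)
doubled = gb_seconds(memory_mb=256, billed_ms=400)

print(f"128 MB for 800 ms: {minimum:.3f} GB-s")
print(f"256 MB for 400 ms: {doubled:.3f} GB-s")
```

In this made-up case both configurations consume the same compute, but the second one returns in half the time. If the speedup were smaller, the larger allocation would cost more, which is exactly the kind of trade-off that goes unexamined when memory is left at the default.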
Finding #8: Two-thirds of defined timeouts are under one minute
This is one to note as a best practice: setting good timeouts. With Lambda functions, you're paying while they're processing. If you have something that hangs for some reason, and it just keeps running, then obviously you're paying for something you don't need. The fact that developers are setting low timeouts is likely due to defaults or API Gateway use cases, but overall, this finding is probably a good thing, and there's likely more granularity in there. The one thing that concerned me, however, was that a lot of timeouts were set to 15 minutes, the maximum.
I personally tend to lean more towards this being used as a crutch, because I think the mentality there is: "I can let it run for 15 minutes. I'll let it run for 15 minutes." But I wonder about concerns with denial of wallet (DoW) attacks, this idea of flooding Lambdas or API Gateway with requests so that they just keep running. Stephen said that Datadog does occasionally hear security concerns similar to this, split between timeouts and concurrency limits, both of which can help to avoid starving resources in your account.
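A rough way to see why the timeout matters for a denial of wallet scenario is to work out what gets billed if every invocation hangs until it times out. This is only a sketch: the memory size, the invocation count, and the 30 second alternative are assumptions chosen for illustration, and only the 15 minute maximum comes from the discussion above.

```python
def hung_gb_seconds(timeout_s: int, memory_mb: int, hung_invocations: int) -> float:
    """Compute billed if every invocation runs until its timeout (hypothetical helper)."""
    return hung_invocations * timeout_s * (memory_mb / 1024)

MAX_TIMEOUT_S = 15 * 60  # the 15 minute maximum

# Assumed scenario: 1,000 hung invocations of a 512 MB function.
short = hung_gb_seconds(timeout_s=30, memory_mb=512, hung_invocations=1000)
long = hung_gb_seconds(timeout_s=MAX_TIMEOUT_S, memory_mb=512, hung_invocations=1000)

print(short, long, long / short)
```

With these numbers the exposure is 30 times larger at the maximum timeout than at 30 seconds, simply because the worst case scales linearly with the timeout.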
Finding #9: Only 4 percent of functions have a defined concurrency limit
I think for a lot of internal workloads, setting a function's concurrency probably isn't necessary, but not setting it on the ones that are customer-facing or processing off of queues seems a little bit scary to me. Darcy speculates that it's the promise of serverless and its scalability, and teams not asking themselves the right questions or planning for the maximum capacity of their systems. And this promise could be a detriment, because teams might want to keep scaling and scaling, but nothing is infinitely scalable. Everything has limits at some point.
I'd guess that maybe this is an education thing too, because your average developer won't necessarily know a lot about the underlying infrastructure. When it comes to things like Lambda functions, there are just a lot of questions that a developer probably never had to ask before. Distributed systems in and of themselves are very difficult, and now when you start talking about breaking them down even further into all of these small building blocks, being able to understand where all the failures are happening is hugely important.
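One of the questions worth asking is how much concurrency a function actually needs. A common back-of-the-envelope estimate is Little's law: the number of executions in flight is roughly the request rate multiplied by the average duration. The sketch below applies it; the traffic numbers and the limit are hypothetical, and the 800 millisecond figure is the report's median, used here only as a stand-in for an average.

```python
import math

def estimated_concurrency(requests_per_second: int, avg_duration_ms: int) -> int:
    """Little's law: in-flight executions = arrival rate x average duration."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000)

def exceeds_limit(requests_per_second: int, avg_duration_ms: int, limit: int) -> bool:
    """True if steady traffic would need more concurrency than the limit allows."""
    return estimated_concurrency(requests_per_second, avg_duration_ms) > limit

# Hypothetical steady-state traffic.
print(estimated_concurrency(100, 800))     # 80
print(exceeds_limit(250, 800, limit=100))  # True
```

An estimate like this gives a starting point for choosing a limit that protects downstream systems without throttling normal traffic; bursts and slow dependencies would push the real number higher.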
While there's a ton of information to sift through, it seems like no matter what technologies companies are using out there, serverless is likely a part of the mix. To hear Stephen, Darcy, and me continue the conversation, including the topic of "what comes after serverless," listen to our Serverless Chats episode.
Listen to the episode:
Watch the episode: