Serverless Platform Engineering

How does the discipline of platform engineering apply to serverless and companies that embrace the serverless-first mindset?

I'm a huge fan of Charity Majors, and I always learn something new when she shares her thoughts on anything that has to do with cloud and engineering. So when I came across this Twitter post, I grabbed some 🍿 and got ready to have my mind blown. I was not disappointed.

If you need a primer on this "new" field, check out Platform Engineering: What Is It and Who Does It? We should start by saying the concept of "Platform Engineering" certainly isn't new. Netflix has been talking about it since 2017, and plenty of other companies have been building internal developer platforms for years. But suddenly, it seems to be all the rage. I think this is mostly because of good marketing, and some emerging companies that are starting to sell this as a "product".

There is a lot that you can learn from digging through Charity's Twitter thread. She also wrote an excellent blog post with the premise that The Future Of Ops Is Platform Engineering. This gives a much more nuanced take on the evolution of DevOps and where the roles of Platform Engineers and Ops Engineers differ. So, I don't want to spend too much time on the broader topic. However, what is interesting to me is how this concept applies to serverless, and maybe more importantly, companies that embrace the serverless-first mindset.

Can't we just build our own internal serverless developer platform?

When I think about serverless-first companies that have created internal developer platforms, the two examples that always come to mind are Liberty Mutual (with their Serverless Enablement Team) and Lego (with their Platform Squad). Now, I haven't kept tabs on how these teams have evolved, but when they first started, their missions seemed to be quite clear:

Provide developers and teams with the resources they need to quickly deploy well-architected serverless applications with the appropriate guardrails in place.

This made a lot of sense to me, and considering that the vast majority of services they used were "serverless", one might think that the amount of "Ops" required would be dramatically reduced as well. We're not talking #NoOps here, but offloading much of that operational complexity frees up resources to work on things that move the business forward and help developers deliver better software faster. I know both teams faced all kinds of implementation and adoption challenges, which is to be expected. But even though their approaches were very different, they succeeded because smart, specialized teams (with C-level support) iterated until they found what worked for their companies.

But here's the thing. The vast majority of companies don't have the resources (or foresight) to hire dedicated teams to build internal developer platforms. Most of that responsibility falls on the developers, who end up spending a large portion of their time learning and configuring cloud infrastructure. As these teams grow, they quickly find themselves with a patchwork of tools, configuration files, and deployment pipelines, all heaped into a giant mountain of technical debt. I don't think this is particularly unique to serverless, but I think it's exacerbated by the paradigm shift.

Serverless makes guardrails harder

Serverless lets us build faster than ever. It makes things that seemed completely impossible just a few years ago completely possible today. But just as easily as I can spin up a globally distributed, highly-available API backed by one of the most scalable databases ever built, I can DDoS my own application with a recursive Lambda function. As I said before, you have the ability to make very bad choices when setting up serverless (or any cloud) infrastructure. It would be nice if we could enable sensible guardrails (that don't involve hiring a platform team) as well as reduce the overall complexity and cognitive load, all without taking away too much flexibility.
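To make the recursive function problem a little more concrete, here's a minimal sketch of one kind of guardrail: a hop counter carried on the event itself. Everything in it is hypothetical and invented for illustration (the "hops" field, the MAX_HOPS limit, and the injected dispatch callable that stands in for whatever re-invokes the function). It isn't how any particular platform does it, just one way the idea could look.

```python
MAX_HOPS = 3  # hypothetical limit, chosen only for illustration


class RecursionLimitExceeded(Exception):
    """Raised when an event has already been forwarded too many times."""


def next_event(event: dict) -> dict:
    """Return a copy of the event with its hop counter incremented.

    The "hops" key is an assumed convention, not a field the cloud provides.
    """
    hops = event.get("hops", 0)
    if hops >= MAX_HOPS:
        raise RecursionLimitExceeded(f"event already forwarded {hops} times")
    return {**event, "hops": hops + 1}


def handler(event: dict, dispatch) -> None:
    """Do some work, then forward the event through the injected dispatcher.

    In a real function, dispatch would publish to a queue or invoke another
    function. Here it is just a callable so the sketch stays self-contained.
    """
    dispatch(next_event(event))
```

If the dispatcher ends up routing the event straight back to the same handler, the loop stops after MAX_HOPS forwards instead of running until the bill arrives. The catch is that every developer has to remember to use it, which is exactly the kind of thing a platform would ideally handle for them.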

When using serverless cloud infrastructure, the code a developer writes is often directly tied to the primitives it uses. If you've ever looked at a complex serverless application's architecture diagram, you'll likely see a myriad of managed services glued together by multiple functions, queues, streams, event buses, state machines, and more. Not only does the developer need the ability to provision these services and connectors, but their code needs to be written in a way that appropriately interfaces with them. It's hard to put guardrails on this type of required flexibility without blocking the developer at every turn.

The other reason it's hard to limit serverless developers is this idea of "configuration over code." When a Lambda function fails, times out, or throttles requests, you can't write defensive code to handle those exceptions. The cloud has to do it for you. And the only way to tell the cloud how you want to handle it is through configuration. This creates a necessary bifurcation of your business logic that developers must be able to control. Take this a step further and add VTL mapping templates to API Gateways for direct service integrations, configure resolvers in AppSync, or build entirely functionless workflows using Step Functions. Put too many blockers in place, and you'll kill developer creativity and productivity.
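Here's a rough sketch of what that "configuration over code" split can look like for an asynchronously invoked Lambda function. The AsyncFailurePolicy class and its field names are my own invention for illustration. The keys in the returned dictionary are meant to mirror the parameters of boto3's put_function_event_invoke_config call as I understand them, and the ranges in the checks reflect the documented limits as I recall them, so verify both against the current AWS documentation before relying on this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncFailurePolicy:
    """Hypothetical description of how the cloud should handle failures."""

    max_retry_attempts: int      # assumed valid range: 0 to 2
    max_event_age_seconds: int   # assumed valid range: 60 to 21600
    on_failure_arn: str          # hypothetical queue or topic ARN

    def validate(self) -> None:
        """Reject values outside the ranges assumed above."""
        if not 0 <= self.max_retry_attempts <= 2:
            raise ValueError("max_retry_attempts must be between 0 and 2")
        if not 60 <= self.max_event_age_seconds <= 21600:
            raise ValueError("max_event_age_seconds must be between 60 and 21600")
        if not self.on_failure_arn.startswith("arn:"):
            raise ValueError("on_failure_arn must be an ARN")

    def to_invoke_config(self, function_name: str) -> dict:
        """Build keyword arguments shaped like the event invoke config call."""
        self.validate()
        return {
            "FunctionName": function_name,
            "MaximumRetryAttempts": self.max_retry_attempts,
            "MaximumEventAgeInSeconds": self.max_event_age_seconds,
            "DestinationConfig": {
                "OnFailure": {"Destination": self.on_failure_arn}
            },
        }
```

Notice that none of this lives in the function's handler. The retry count, the event age, and the failure destination are all decisions about business behavior, yet they are expressed entirely as settings the developer has to be allowed to change.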

Finding the right balance

As Charity said, "If you draw the line a little too far to the left, you won't be able to support enough product differentiation to succeed. A little to the right, and the maintenance costs will drown you and put you out of business." I'm not entirely sure where that line is for serverless applications, but I wholeheartedly agree that generic solutions likely aren't the answer. Right now, it has to be much more nuanced and specific to the organization.

Charity also said, "The beautiful thing about infrastructure is that there comes a point where you can stop treating infrastructure code like code, and instead treat it like trusty little black box building blocks that nobody has to waste any cognitive capacity on." That sounds like the holy grail to me, and I think it's even more possible with serverless given the number of patterns that have emerged over the years.
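As a thought experiment, here's a tiny sketch of what one of those building blocks might look like if a well-known pattern (a queue-backed worker) were wrapped with safe defaults and a short list of permitted overrides. All of the names and values here are hypothetical (SAFE_DEFAULTS, ALLOWED_OVERRIDES, queue_worker, and the specific numbers), and a real block would hand this description to whatever provisions the infrastructure.

```python
# Hypothetical defaults a platform might bake into the block.
SAFE_DEFAULTS = {
    "timeout_seconds": 30,
    "memory_mb": 256,
    "reserved_concurrency": 10,
    "dead_letter_queue": True,
}

# Hypothetical set of knobs developers are allowed to turn.
ALLOWED_OVERRIDES = {"timeout_seconds", "memory_mb"}


def queue_worker(name: str, **overrides) -> dict:
    """Describe a queue-backed worker, allowing only the permitted overrides."""
    blocked = set(overrides) - ALLOWED_OVERRIDES
    if blocked:
        raise ValueError(f"not overridable: {sorted(blocked)}")
    return {"name": name, **SAFE_DEFAULTS, **overrides}
```

Calling queue_worker("resize-images", memory_mb=1024) gives the developer more memory without a conversation, while an attempt to switch off the dead letter queue is rejected. Where to draw the line between those two sets is the hard part, and it will differ from one organization to the next.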

Of course, there are still the questions of control, responsibility, and who owns what. How much of the platform can we abstract away into these "little black box building blocks" knowing that there are hidden complexities beyond our control, yet still critical to our workloads? What happens when we reach their limits? What happens when they fail, as everything does all the time?

This is a hard problem to solve. I know because my team has been working on a solution for almost 2 years now. But ultimately, I think the goal should be to bring the full power of the cloud directly into the hands of the everyday developer. That requires both guardrails and guidance, as well as the right abstractions to minimize cognitive overhead on all that undifferentiated stuff.

I too worry that a "platform engineering" product could get this wildly wrong, limiting developer creativity in any number of ways. But at the same time, I think that cloud complexity has made it very difficult for companies to do this on their own.