Advanced serverless workflow patterns and best practices

Hello everyone. Good afternoon. Hello to those of you watching in the overflow room. Anyone watching online in the future? Hello to you as well. This is API 401 - Advanced Serverless Workflow Patterns and Best Practices. It's my favorite topic to talk about. My name's Ben Smith. I'm a Principal Developer Advocate at AWS. This is my fifth re:Invent.

But really what I'm more interested in is who you are and why you've come to this particular talk. I know your time's very valuable at re:Invent. It's very difficult to get around sometimes. So I'm gonna make some assumptions about you, if that's OK.

I'm gonna assume that you're here because you're interested in serverless. Maybe you're building serverless applications in production, maybe you're already building with AWS Step Functions and you want validation on what you're working on. You want to learn what's new in the world of Step Functions, maybe in serverless in general, and you want to learn maybe some new patterns to implement in your Step Functions workflows. These are all good reasons to be in here today.

Now, since joining AWS, I've been building predominantly with Step Functions and that's given me the opportunity to travel all over the world, to talk to customers who are successful with Step Functions and to talk about some of the challenges they're meeting.

Now here is a consolidation of all of that learning in one landing page. And indeed, that's what this talk is as well. It takes all of that learning and delivers it to you in a one-hour talk. I'll put this slide up again at the very end of the talk in case you missed it.

So as I said, I've been building a lot with Step Functions and I found that when you start to build everything as a workflow, you tend to think and plan everything out as a workflow as well. So this is my agenda for the next 60 minutes as a workflow.

I'm gonna talk about why I think you should build with Step Functions first, maybe even always, sometimes, maybe not, but mostly always. I'm gonna talk about the two different modes that you can run your Step Functions workflows in: as a standard workflow and as an express workflow.

I'm going to spend a lot of time talking about how you can reduce the cost of your serverless applications when you're using Step Functions. Seems like a big theme of the keynote earlier on this morning. So I was pleased to see that. I think developers should be thinking about cost when they're building with cloud services, right?

I want to talk about extensibility with Step Functions as well: how you can make sure your applications are easy to extend going forward. And I'll be showing you patterns throughout the whole talk that you can implement yourselves, patterns for things like managing failures and building applications that can handle really high scale.

And really every talk I hope that you've been to at re:Invent this week will be in service to at least some of these goals, right? To help you deliver more value to your customers, to build applications that are reliable and resilient to faults and failures, that scale easily and are simple.

I think it's quite easy once you get to a certain point at AWS to draw very complex architectural diagrams to look clever. But the real magic is when you manage to simplify things again and, of course, to be able to do this as quickly as possible. Everything is serverless these days, it seems; there were some new serverless announcements two days ago, and it's becoming more and more difficult, I think, to define what serverless is when everything has a serverless version.

This is my take on it. I see it as a kind of pyramid of managed services where at the base of that pyramid, you have the more manual services, the more customisable things like EC2. And then as you go up this pyramid, you get to the really managed services, things like Lambda, App Runner, Step Functions, API Gateway. And when you're building serverless applications in your AWS account, it often consists of more than one of these managed services.

It's not just going to be one Lambda function, and you'll need to pass data between these services. You can do this using APIs, using events, using messages, topics, queues, many different ways. But the challenge is, when you're building serverless applications that are relying heavily on these managed services, how do you then understand how your application is actually held together? Right? How do you track, inspect and visualize your application? And this is where the service AWS Step Functions instantly kind of solves this problem.

This is a service for building out workflows on AWS that orchestrate pretty much all the other AWS services. So here is an animation of me building a workflow in real time, where I'm selecting various actions on one side of the screen and I'm dropping them into my canvas. Now, these are actions like invoking a Lambda function or getting something from a DynamoDB table. I drag that into my canvas. I organize it with decision logic: with choice states, with map states, with parallel states, with branching.

I can catch errors, I can make retry statements, and I can double-click on any one of those states and configure it with the editable form on the other side of the page there. I can also, of course, export the definition that comprises my workflow. It's a JSON-based language called ASL, the Amazon States Language. I can export that and drop that into my infrastructure-as-code templates, and I can commit that to my git repositories, share that with the other developers in my organization and build that into my automated deployment pipelines.

Now, Step Functions is right at the top of that managed service pyramid. It's pay per use, it scales automatically, it has this neat drag-and-drop interface. But the real special bit about Step Functions is that it integrates natively with over 220 other AWS services. And the way it does that is to use that service's SDK action, and Step Functions runs that action directly itself. It doesn't spin up a Lambda function or anything like that.

So what this means is you're able to orchestrate over 11,000 actions, and maybe even more, which I'll talk about in a moment. So I mentioned earlier, I think you should build Step Functions first. It should be the first thing you go to when you build your next microservice. And here's why I say that: this is based on the most typical customer story that I hear when people start building with a Lambda function, with serverless in general.

The first thing they go to is Lambda, which is reasonable. And let's say I wanna make a really simple microservice to just get an item from a DynamoDB table. So the first thing I'm gonna do is I'm gonna choose my runtime. So I'll choose Node.js. I'm gonna bring in the AWS SDK for Node.js and I'm gonna bring in the DocumentClient for DynamoDB, so I can marshal and unmarshal data in DynamoDB.

Then I create my params object. So here it's things like the name of the table, the partition key, the sort key. Then I create the actual query. So now I'm gonna use the DocumentClient to access the get item action. I drop my params object in, I turn it into a promise, I put that in a try-catch block, I return the data and I return any errors. And then I have to learn all about the Lambda exports handler, find out what the context object is, what the event object is, a new try-catch block. I run my query, I stringify the response and I return any errors in my catch statement. 20 lines of code to get an item from DynamoDB, when I already knew the partition key and I already knew the sort key, and I haven't even configured any retries in here yet.
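
To make that concrete, here's a minimal sketch of roughly the 20 lines being described, using the AWS SDK v2 DocumentClient. The table name and key values are placeholders, not the names from the actual demo.

```javascript
// Rough sketch of the Lambda approach described above (placeholder table and keys)
const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Params object: table name, partition key, sort key
const params = {
  TableName: "Orders",
  Key: { pk: "order-123", sk: "metadata" },
};

// The actual query, wrapped in a try-catch
const queryItem = async () => {
  try {
    const data = await docClient.get(params).promise();
    return data.Item;
  } catch (err) {
    throw err;
  }
};

// The Lambda handler: run the query, stringify the response, return any errors
exports.handler = async (event, context) => {
  try {
    const item = await queryItem();
    return { statusCode: 200, body: JSON.stringify(item) };
  } catch (err) {
    return { statusCode: 500, body: JSON.stringify(err) };
  }
};
```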

Let's try the same thing with a Step Functions workflow. Done. I simply drag in the get item action from my Step Functions action selection. I still have to configure the params object, so I have cheated a tiny bit. But what you have here are two different versions of exactly the same workload, equally performant. Arguably the Step Functions version is more performant, by the way, and we can get into that in a moment.

I can still configure my retry statements. I can still catch any errors and send them downstream to a dead letter queue. And this is not just a pretty picture in a PowerPoint slide. This is how it looks in the service as well. So if I come back to this in a day, a week, a month's time, I've been building other things, I've been context switching, which one of these two examples is gonna be easier for you to understand and reason about? There will be some hardcore coders out there saying the code example, and that's also fair enough.
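
For comparison, here's a hedged sketch of what that same workload can look like in ASL, calling the DynamoDB getItem SDK action directly, with a retry and a catch that routes failures to a dead letter queue. The table, key and queue names are placeholders.

```json
{
  "StartAt": "GetOrderItem",
  "States": {
    "GetOrderItem": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Comment": "Direct SDK integration - no Lambda function, no code to manage",
      "Parameters": {
        "TableName": "Orders",
        "Key": {
          "pk": { "S": "order-123" },
          "sk": { "S": "metadata" }
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "SendToDLQ" }
      ],
      "End": true
    },
    "SendToDLQ": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq",
        "MessageBody.$": "$"
      },
      "End": true
    }
  }
}
```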

So with Step Functions, I start sending executions to my workflow. And this allows me to have a kind of single pane of glass where I can see every execution that failed and every execution that succeeded. I can drill down into any one of them. And I can see exactly where it failed. I can see the error message bubble up to the top. These are things that are quite difficult to do with serverless applications that are running managed services distributed over sometimes multiple AWS accounts.

There's whole business models that are invented to solve this problem. What I can also do is I can actually inspect the data as it passes through each individual state. So I can see what the data looked like before it entered the state and when it left the state again. This is quite difficult to do if I were to build this as an asynchronous event-driven application, to kind of chase my data as it moves around my application through multiple logs.

Let me go back to the Lambda example here. So let's say I went with Lambda cos Lambda's great. It's scaling automatically in response to events and I'm happy with that code that I had before. And now I want to expand this microservice to do more things. Now, I want it to essentially become a REST CRUD API: I want it to create, update and delete items in my DynamoDB table as well.

So I've learned about API Gateway. I know I can make an endpoint, a URL, to access that Lambda function and authenticate those requests. That's great. So I set up API Gateway and I send all requests to this Lambda function, because I'm comfortable with this Lambda function now, I know how it works. There's a few things here. The first is that now this Lambda function is doing multiple things, right? It's creating, updating, deleting. Each one of them requires its own permission, its own IAM action to be allowed. So it's quite a loosely permissive function at this point.

Then I have to configure my Lambda function with its timeout and its memory allocation. And these are the things that affect the cost of my Lambda function, and it might be different depending on which one of those actions it needs to perform. I also have to add routing logic in my function to say, based on this inbound request from API Gateway, do I create, update or delete? And I've got to write that as code in my function if I do it in this way.

So I'm adding more and more code to my application, more places where things can go wrong instead of leaning into this managed service way of doing things. This is what most people at AWS will suggest for you to do as the next step. And this is a good next step. By the way, it's ok to stop here, it's reasonable.

So what I've done is I've split out my Lambda function into individual functions. Each one of those functions is now performing its individual action that's associated with its IAM permission: create, update or delete. What that means is I can take that routing logic out of my big Lambda function and put that into API Gateway, and API Gateway can decide which one of these to call, right? So that's good. I've reduced the code. I can also configure the timeout and the memory allocation and the security permissions more tightly for each individual function. So they're probably more performant as well.

So what's the problem? Well, now I've got 60 lines of code. So I haven't really reduced the amount of code in my application. That's 60 other places where things can go wrong. It's very similar code. The only thing that's different is the action that it's performing and the params object and how that needs to look. So I've got duplicate code. Imagine this application grows and grows, and this is exactly what we see with customers: it spreads out, and before you know it, it's very difficult to track and inspect when something goes wrong.

Here's the first pattern that I wanna show you. It's the REST CRUD API with API Gateway and Step Functions. No Lambda function in this, right? So here I still use API Gateway to authenticate and authorize every request. But I send every request directly to my Step Functions workflow. The first state in my workflow is a choice state, which says: based on the shape of this inbound request, if it's a GET, POST, PATCH, if it's got parameters, query parameters, I'll decide which branch to take. And then I use the DynamoDB SDK to run that action directly in DynamoDB. No Lambda invocation costs, no cold start time, no code to manage, exactly the same result.
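
A sketch of that routing idea in ASL, assuming the API Gateway integration passes httpMethod and pathParameters fields through in the execution input (the exact shape depends on your request mapping). Only the GET branch is shown here; the POST and DELETE branches would be analogous tasks calling putItem and deleteItem.

```json
"RouteRequest": {
  "Type": "Choice",
  "Comment": "Route on the shape of the inbound request from API Gateway",
  "Choices": [
    { "Variable": "$.httpMethod", "StringEquals": "GET", "Next": "GetItem" },
    { "Variable": "$.httpMethod", "StringEquals": "POST", "Next": "PutItem" },
    { "Variable": "$.httpMethod", "StringEquals": "DELETE", "Next": "DeleteItem" }
  ],
  "Default": "BadRequest"
},
"GetItem": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:getItem",
  "Parameters": {
    "TableName": "Orders",
    "Key": {
      "pk": { "S.$": "$.pathParameters.id" }
    }
  },
  "End": true
}
```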

The reason I show you this is because we came up with that little pattern when we built an application called Serverless Espresso.

"Has anyone used or heard of ser and espresso already? Oh, wow, that's about half the room. So s espresso was a demonstration application that was built by myself and some other people in the serve developer advocate team for re invent of two years ago. We wanted to have a demo that people could enjoy using that was they didn't even know they were using a demo, right? It allows you to order a cup of coffee from your mobile phone. And it's here this week in the expo hall, people really enjoy using it. It's free coffee at the end of the day.

So the first time we launched that at re:Invent two years ago, we had this order manager service as one of the various microservices. And this was a REST CRUD API that had an API Gateway with multiple endpoints that allowed us to do actions on our DynamoDB table. And we did that via a Lambda function. What our Lambda function also did was to run some other actions, like start another Step Functions workflow. So it wasn't just doing things with DynamoDB. So this is what we used at re:Invent two years ago.

We got through that re:Invent, but we had some issues with things like latency. We had some errors. We had some problems trying to debug that in production. I think we did about 2,000 orders on that day, and I think every order goes through one of these APIs seven times. So we couldn't debug it in production. We had some issues. And when we got back home from re:Invent, we took the whole thing apart and we rebuilt it as this workflow.

So here you'll see there's a lack of Lambda functions, apart from in one place where this function is actually doing some compute. It's sanitizing the order. It's checking that the order that's just come in on this API request is something that actually exists in our menu, right? Because, believe it or not, people tried to find a way to order things that didn't exist. This is just as performant, more performant, in fact, because we chose to run it as an express workflow. And I want to kind of explain why we chose that mode next.

So when you build with Step Functions, there are two modes that you can choose to run your workflow in: you can run it as an express workflow or as a standard workflow. And it's really important, I think, if you're gonna use Step Functions, to understand the difference and how the two modes are built.

Standard workflows are the original Step Functions workflow. They're long lasting, they run for up to one year, in fact. They're asynchronous, and they have an exactly-once execution model, so you can be sure there are no duplicate invocations with the same payload. Express workflows: higher throughput, so they transition through the individual states much faster, a totally different billing model, and an at-least-once execution model, so you could get duplicate invocations if you're sending multiple requests per second. They can run synchronously. So what that means is you can invoke your express workflow through something like API Gateway, your workflow will run, and it will return that response back through API Gateway, which is one of the reasons we chose to run the workflow that I showed you earlier as an express workflow. We could send in a request via the API, it took about half a second to run, and the result would come back as the response on our API Gateway call.

Now, here's the interesting bit. They can only run for up to five minutes. So if your workload takes longer than five minutes, it will time out. No response. I wanna show you an example of how we can calculate the cost difference. OK?

So here's a kind of shopping cart workflow, and it does some simple things, like put a message onto an SQS queue when an order is ready to be paid for. Then it polls a DynamoDB table to check that the order's been processed, puts some notifications onto an SNS topic, polls another DynamoDB table to check that the payment's been taken, before running a series of idempotent Lambda functions and finally a final notification to ship the package. And I ran this 1,000 times as an express workflow and 1,000 times as a standard workflow, cos that's the kind of fun guy I am, and I want to show you the results, because they're interesting and they're useful for understanding the cost, right?

So running this 1,000 times as a standard workflow cost me about 42 cents. As an express workflow, it cost me one cent. Exactly the same workflow definition, exactly the same results. And in fact, the express workflow on average finished half a second quicker. So I wanna show you why. Here's how you work out the cost of your standard workflow. It costs 2.5 cents for every 1,000 state transitions. So a state transition is when the data moves from one state across to the next. So in this workflow here, I've got 17 state transitions on the happy path. So if I want to work out the cost, if I run it 1,000 times, I do 17 times 2.5 cents, and that's how I got my 42 cents. Now, express workflows.

So your express workflow is charged based on how long the workflow takes to complete, to the nearest 100 milliseconds, at the given memory allocation. And the memory allocation is not something you can configure, but it is something you have an impact on. Let's work out what the cost was first. So the first thing I need to get here is the memory cost, and you can find this after you run any workflow, or in a CloudWatch dashboard. It goes up in 64-megabyte increments, and my workflow here was very small, the input payload was very small. The definition and the input payload are the two things that affect the memory allocation. So this is the memory allocation that my workflow needed, it's the smallest one, and this is the price per 100 milliseconds for the first 1,000 GB-hours. It's a very small number. In this case, my workflow took 11 and a bit seconds to run. So I put that into my calculation, and that's how I get the final cost of my express workflow as one cent.

Now, let's say I were to run this a million times, which is not unreasonable if I'm running applications at large scale. You'll see, maybe you can't see that cos it's quite difficult to see, but the two lines start to separate very quickly, right? My standard workflow would cost $420. My express workflow, $12.77. It's doing exactly the same thing, producing exactly the same results.

We launched express workflows in 2019 to kind of solve the problem of people challenging us that Step Functions was expensive. Most workflows that we see customers running complete within five minutes. So this is low-hanging fruit. This is something you can immediately go home and try: change your workflow to express mode and see what the new cost would be. You might be asking, why would I ever bother using a standard workflow if express is so much cheaper and faster?

Well, if your workload takes more than five minutes, then you'll need to use a standard workflow. If you need the exactly-once execution, then you'll need to use a standard workflow. So there is a decision to make there, for sure, but there are still things you can do, many things you can do. And the first thing you can do, if you have a standard workflow that needs to remain a standard workflow, when you go back, I would suggest this pattern: the nester. Extract those states from your workflow that you can run separately as an express workflow, things that transition quickly, that are idempotent, that don't take long to complete, and save them as a separate express workflow. Call that from your parent workflow.

Let me show you what I mean. This workflow here has a loop, a polling loop. Let's say sometimes that polling loop takes more than five minutes. That would mean I would have to run this as a standard workflow. So what do I do? I strip out those idempotent Lambda functions, I save them as a separate express workflow and I call them synchronously from my standard parent workflow. And to work out the new cost, I just put the costs of those two workflows together. The key thing is that my standard workflow now only has 14 state transitions. Remember, the cost is all about state transitions for a standard workflow. So the price of that has dropped straight down to 30 cents. My express workflow takes barely a second to complete, so that doesn't even really affect the overall cost of my application. So using this combined method, I can get the same result with a cost of 30 cents per 1,000 executions.
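
This is roughly what the synchronous call from the standard parent to the extracted express workflow looks like in ASL; the state machine ARN and state names here are placeholders. The `.sync:2` variant waits for the child express workflow to finish and returns its output as parsed JSON.

```json
"RunIdempotentSteps": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Comment": "Call the extracted express workflow synchronously and wait for its result",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:IdempotentSteps",
    "Input.$": "$"
  },
  "Next": "ShipPackage"
}
```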

Great. The key takeaway is: if you have a standard workflow, you want to reduce the number of state transitions, and if you have an express workflow, you wanna reduce the duration that the workflow takes to complete, or the memory that it needs, by reducing things like the definition size or the input payload. But I want to stress that this is just the cost of the orchestration bit. That's just the Step Functions cost, right? There's still a cost associated with all of those services that you're orchestrating. So in this simple example here, I'm getting an item from DynamoDB, that's using read capacity units and there's a cost associated with that. Then I have my first state transition, that's my Step Functions cost. Then I'm invoking a Lambda function, that's a Lambda cost. Next state transition, back to my orchestration cost, and so on. So you want to reduce the reliance on other AWS services wherever you can.

Again, another low-hanging fruit here is to look at workflows where you have these small Lambda functions that are performing data transformations, not running compute but moving data around and reshaping it. We have something in Step Functions called intrinsics. And these are little helper functions built into Step Functions itself that let you do simple data transformations, for things like manipulating arrays or pushing strings together or doing simple maths operations. I'll give you one example. The most requested intrinsic function until recently was to generate a unique ID. Until we introduced this a couple of months ago, you had to use a Lambda function with a bunch of code. So here I'm pulling in some package and I'm using my exports handler to produce a unique ID. Now I can use the States.UUID intrinsic and it produces exactly the same result. And if I use an intrinsic, I can just add that intrinsic function to the parameter of an existing state, so I don't have to add more states into my workflow to do this.
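
For example, here's a hedged sketch of adding States.UUID() to the parameters of an existing DynamoDB task, rather than adding a new state or a Lambda function. The table and attribute names are made up for illustration.

```json
"SaveOrder": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "Comment": "States.UUID() is evaluated inline in the existing state's parameters",
  "Parameters": {
    "TableName": "Orders",
    "Item": {
      "pk": { "S.$": "States.UUID()" },
      "detail": { "S.$": "$.orderDetail" }
    }
  },
  "End": true
}
```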

Why is this good? Just to remind you why: there are no invocation delays, because there's no Lambda cold start time or anything like that. There's no cost of Lambda, OK? We don't charge you for intrinsics, we charge you for state transitions. And there's way less code to manage, one line versus four, or however many you have in your Lambda function. So the point is, try and use intrinsics wherever you can.

There's another thing you can do. We have here a polling loop that I mentioned earlier. This is because this is an asynchronous task that I'm sending off; I need to check when an order's been approved. There's actually another one down here. Now, this is a common problem with synchronous applications that have asynchronous tasks, right? You need some way of knowing when that's complete, and often people will use polling because it's really easy to implement. We've been polling since we were small. My daughter's really good at polling. I have a story for you. Not long ago, I took my daughter on the boat to France. I live on the south coast of England in a little city called Brighton, and we took a train to the boat at this town called Newhaven. Lovely town. And we got on the boat, and as soon as the boat left the port, my daughter asked me, she's seven, by the way, she said, Daddy, how long is this gonna take? I thought, oh, I hadn't checked, five hours. And my daughter proceeded to poll me for status updates every 10 minutes of this journey to find out when we were gonna arrive. Really efficient. Really annoying. Right? It's the same with your Step Functions workflow, because every time you go around this polling loop, that's three more state transitions. It's costing you money. OK?

Now, you could change the wait time, you could wait longer. Maybe instead of 10 minutes you could wait one hour. Well, that would mean that me and my daughter were waiting at the port in France one hour before we got off the boat, right? You don't get a timely response if you do that, and if you turn the wait time down, then it's even more expensive.

So you can use something called the callback pattern. So this would be like me and my daughter getting on the boat and her saying to me, "Hey, dad, can you let me know when we get there?" And I say to her, "Yep, I'll do that." And then we can just look at our phones for the next five hours and arrive happy. Not really.

So this is how you do that in Step Functions. There are two patterns in this talk which are in every single thing I build, either one or the other or both. And this is the first one. This pattern uses Step Functions and its partner in crime, the Batman to its Robin, Amazon EventBridge. These go really well together. And what we do here is we are creating a workflow that's orchestrating a series of milestone events, where they need to happen in a certain order. You can use this for existing applications, even applications that are not running in the cloud; you can still orchestrate them using Step Functions.

It goes like this. I use the Emit and Wait pattern for Milestone One. It sends an event onto my Amazon EventBridge event bus. And along with that event, it sends this unique task token, which is generated by the Step Functions service. I route that event to the microservice that handles Milestone One. But I also route my task token to some storage, like a DynamoDB table.

When Milestone One has finished running what it needs to run, it needs to grab that task token and call back to my Step Functions workflow and say, right, we're done, you can carry on to Milestone Two, and the workflow will then unpause. Now, you don't pay for that pause time, and it can wait for up to one year. You don't pay because there's no state transition, right? And then the pattern continues.
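
Here's a sketch of what that emit-and-wait step can look like in ASL, using the EventBridge putEvents integration with .waitForTaskToken. The event bus, source and detail fields are illustrative; the key part is passing `$$.Task.Token` along with the event so the downstream microservice can later call SendTaskSuccess (or SendTaskFailure) with it. The one-hour heartbeat is just an example escalation trigger.

```json
"EmitMilestoneOne": {
  "Type": "Task",
  "Resource": "arn:aws:states:::events:putEvents.waitForTaskToken",
  "Comment": "Pause here until something calls back with the token below",
  "Parameters": {
    "Entries": [
      {
        "EventBusName": "default",
        "Source": "order.workflow",
        "DetailType": "MilestoneOne",
        "Detail": {
          "orderId.$": "$.orderId",
          "TaskToken.$": "$$.Task.Token"
        }
      }
    ]
  },
  "HeartbeatSeconds": 3600,
  "Next": "EmitMilestoneTwo"
}
```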

Now, if you don't wanna wait for one year, of course, you can set a heartbeat or a timeout limit and when that limit's reached, you can catch that with Step Functions and it will take a different path, maybe an escalation path. Ok?

So let me put that back into my workflow. What I'm gonna do is I'm gonna take out those polling loops, and on the state before the polling loop, so in this case, it's an SQS queue, I'm gonna add this little syntax that says "wait for task token". And what that means is my workflow will pause there until it receives the task token back from another service. And let's look at how that affects my costs.

Well, now I only have eight state transitions. So now my overall cost is just 20 cents. So we've just applied some of the features and functions of Step Functions there to rapidly and fairly easily, I think, reduce the cost of my workflow.

Excuse me. I want to talk about extensibility now. One of the challenges people come to me with about Step Functions is: oh, you're just building a monolith as a workflow. And that's an interesting thought. But I think as long as you stay within the context of the microservice that you're orchestrating, and use events to communicate across those microservices, that's completely fine.

And I want to exemplify this with this application that we built called Serverless Video. Has anyone heard of this or seen this in another talk this week? Oh, about half the room again. Cool.

Serverless Video is a live video streaming application built by the same team that built Serverless Espresso, entirely on Step Functions and EventBridge and some other serverless services. It allows us to stream directly from the expo hall. We have a little booth there where we're interviewing customers and partners and AWS speakers. The stream goes live, you can interact with the stream in various ways. But when the stream finishes, that's when the interesting stuff happens.

I want to show the very high-level architecture now. This architecture is just a few microservices that are interacting with each other by passing events onto EventBridge and consuming those events. There are two important workflows in all of this. One of them is the Post Processing service, and what this does is it takes the video streaming files, TS files, and the workflow puts them together into a single MP4 file. When it's finished doing that, it puts an event onto the bus that's picked up by this Plugin Manager service.

And what the Plugin Manager service does is it uses that Emit and Wait pattern to add all sorts of functionality to each video that gets streamed on Serverless Video. What we did is we asked all the speakers in the Serverless and AppSync track at re:Invent this week to build a plugin for Serverless Video.

So we built the core, we built that Emit and Wait workflow, and we said: please go and add some functionality, anything you can think of. And people came back with things like title generation plugins, metadata tag generation, summarizations; people built transcript plugins, translation plugins, leaderboards, notifications, all sorts of interesting things that we can integrate into Serverless Video.

This is how we did it. We built this workflow that would emit a series of milestone events when that MP4 file was ready. The first milestone event was this one, called the "PreValidate" event. We just made up the name PreValidate, and we told our plugin developers: we want you to register for this event in your plugin template, make a rule that routes this event, for plugins to do with moderation, OK? Plugins that check the duration of the video, check the language is appropriate to be published, things like this.

All the responses of all of those plugins, when they're done, were then sent back to the workflow, and we move on to the next event, which is the "PostValidate" event. And here we wanted plugins for data reporting and kind of internal notifications. The data from that is collected back and then passed on to the "PreMetadata" plugins, and here we want things like transcripts, filters, tags and leaderboards.

And then finally we have a "PostMetadata" event hook and these are for things like titles and external notifications and translations. And this is how we were able to scale out the functionality of Serverless Video. An application we built in only a couple of months by allowing various speakers that attend re:Invent to extend the functionality using this Emit and Wait pattern.

Let me drill down into one of those plugins and how it works. So here's our Step Functions workflow, here are our milestones, right? The PreValidate event is the first one. We have a bunch of plugins that are ready to receive their PreValidate event. That event gets fired out, and this plugin here picks it up. It's a moderation plugin; it uses Amazon Rekognition to check there are no bad images in this video.

When it's finished, it returns valid: true, the video is good to be published, and it sends an event back to the event bus. And we have a little Lambda function waiting to pick up all these events and route them back to our workflow, along with that task token that was emitted, so the workflow knows where to resume. The workflow waits for all those other plugins to come back and then it moves the payload along.
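
That routing Lambda function might look something like this sketch. The field names on the incoming EventBridge event (taskToken, result) are assumptions about how the plugins echo the token back.

```javascript
// Sketch of the "little Lambda function" that resumes the paused workflow
const { SFNClient, SendTaskSuccessCommand } = require("@aws-sdk/client-sfn");

const sfn = new SFNClient({});

exports.handler = async (event) => {
  // The plugin's response event is assumed to carry the task token it was given, plus its result
  const { taskToken, result } = event.detail;

  // Resume the paused Step Functions execution with the plugin's output
  await sfn.send(new SendTaskSuccessCommand({
    taskToken,
    output: JSON.stringify(result),
  }));
};
```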

Now, the first question our plugin developers had for us is: OK, we've built the plugin, how do I know it's gonna work when you deploy it into your account? And we thought, yeah, good question. We don't really want to give you access to our account. How will we know it works?

So what we did, we said: here's the payload that your plugin will receive, here's the event payload that you need to work with, and as long as you return the response that we're telling you to return for each event, then it should work. And we thought, nah, that's not really good. Let's build a Step Functions workflow to test that each of these plugins works. And I think there's a lot of potential for using Step Functions as a kind of integration or unit tester, right?

Let me show you what we did. First thing we did was decide the characteristics of every Serverless Video plugin. So every plugin needs to be registered for one of these four milestone events. And if it was registered for something other than that, we'd throw an error.

Then we would emit the event, they had to route that event and return a valid response. So we'd check that the response was valid. If they didn't return a response within the time limit, which is 30 seconds, then we would say this is not good enough, we're gonna throw an error. And every single plugin that was submitted to us, we ran through this simulator to check that it would integrate.

And then we went, ooh, let me show you how one plugin works. So this is the moderation plugin. This is the happy path. It receives the event along with the task token, it runs, it returns the task token back along with a valid response. We know this will integrate successfully.

Then we thought it's a bit annoying to keep logging into an AWS account and deploying a plugin and hit play on that Step Functions workflow. It'd be great if we could automate that. So we came up with this, we used GitHub Actions to deploy every plugin that was submitted to us as soon as the pull request was made.

So that plugin gets automatically pushed out along with the simulation workflow. And then the GitHub Action uses the AWS CLI to start the workflow and poll it for the response, because it's an asynchronous workflow. And when the response comes back, it's either still running or it's succeeded or it's failed.

So, what that meant is whenever we opened up our Step Functions GitHub repository the next day, we would see a list of plugins that have pull requests, and we could instantly see for every plugin whether it would successfully integrate or not. And we could go ahead and merge it directly into our account. And it's all because we used Step Functions as an integration tester.

This handsome chap is Werner Vogels, CTO of Amazon. He was doing the keynote this morning, and he's well known as having said that everything fails all the time. Good news for Step Functions, cos it's really good at managing failures. And there's a couple of patterns that I wanna show you.

The first is the Saga pattern. This is a well-known pattern, and you can use this in Step Functions to sequentially roll back a series of events in the order in which you need. So let's say I have a holiday booking system where I book a hotel and I book a flight and I book a car. Keep it simple. Now, if any one of those fails, I can capture the failure and I can roll back sequentially in the order in which they were made. It's a really visual pattern, I think, when you use Step Functions for it.
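
A compact, hedged sketch of that Saga idea in ASL: each booking step catches failures and jumps into the compensation chain, which unwinds the earlier bookings in reverse order. The function names are placeholders.

```json
{
  "StartAt": "BookHotel",
  "States": {
    "BookHotel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "book-hotel", "Payload.$": "$" },
      "Next": "BookFlight"
    },
    "BookFlight": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "book-flight", "Payload.$": "$" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelHotel" }],
      "Next": "BookCar"
    },
    "BookCar": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "book-car", "Payload.$": "$" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelFlight" }],
      "End": true
    },
    "CancelFlight": {
      "Type": "Task",
      "Comment": "Compensations chain in reverse order of the bookings",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "cancel-flight", "Payload.$": "$" },
      "Next": "CancelHotel"
    },
    "CancelHotel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "cancel-hotel", "Payload.$": "$" },
      "Next": "BookingFailed"
    },
    "BookingFailed": {
      "Type": "Fail",
      "Error": "BookingFailed",
      "Cause": "A booking step failed and prior bookings were rolled back"
    }
  }
}
```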

The next one is the Circuit Breaker, another commonly used pattern. Here I prevent a downstream service from being called if I know that the circuit, or the application, is broken or unhealthy. So what I do is I maintain the overall status of my application in something like a DynamoDB table. And every time I run this workload, I check my DynamoDB table to see: is my circuit open or closed? If it's closed, i.e. good, I go ahead and I run that downstream service. In this trivial case, it's just a Lambda function.

The next execution comes in: is the circuit open or closed? It's closed, OK, run the Lambda function. This time it failed. I catch the failure, I update the circuit. And then the next time my application runs, I know that the circuit's open and I can take some kind of escalation path. So this will prevent me calling downstream services that maybe don't throttle quite as quickly as some of the managed services on AWS do.
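
Here's a rough ASL sketch of that circuit breaker, assuming the circuit status lives in a DynamoDB table called CircuitStatus; the table, key and function names are placeholders.

```json
{
  "StartAt": "GetCircuitStatus",
  "States": {
    "GetCircuitStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "CircuitStatus",
        "Key": { "serviceName": { "S": "payment-service" } }
      },
      "ResultPath": "$.circuit",
      "Next": "IsCircuitClosed"
    },
    "IsCircuitClosed": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.circuit.Item.status.S", "StringEquals": "OPEN", "Next": "EscalationPath" }
      ],
      "Default": "CallDownstreamService"
    },
    "CallDownstreamService": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "call-downstream", "Payload.$": "$" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "OpenCircuit" }],
      "End": true
    },
    "OpenCircuit": {
      "Type": "Task",
      "Comment": "Record the failure so the next execution takes the escalation path",
      "Resource": "arn:aws:states:::dynamodb:updateItem",
      "Parameters": {
        "TableName": "CircuitStatus",
        "Key": { "serviceName": { "S": "payment-service" } },
        "UpdateExpression": "SET #s = :open",
        "ExpressionAttributeNames": { "#s": "status" },
        "ExpressionAttributeValues": { ":open": { "S": "OPEN" } }
      },
      "Next": "EscalationPath"
    },
    "EscalationPath": {
      "Type": "Fail",
      "Error": "CircuitOpen",
      "Cause": "Downstream service is unhealthy"
    }
  }
}
```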

I mentioned earlier that in that Serverless Video example, we have this workflow that converts multiple TS files of about eight seconds long into a single MP4 file for each video stream.

"And we did that in a workflow and we did that with a lambda function that uses ffm peg to create an mp four file. And it worked for the most part. But we found when the streams got longer, 2030 40 50 minutes, it would start timing out cos lambda has a 15 minute time out.

So we thought, aha, we'll use Step Functions, we'll catch the timeout and then we'll deploy an ECS task on Fargate and we'll run the same code, and ECS Fargate won't time out and we'll get a nice MP4 file. And we were pretty happy with this.

The more executions we sent to this workload, we realized there's a sweet spot where our Lambda function times out, and that was with videos that were about 20 minutes in length. So what we did is we preemptively queried the length of the video, and if it was more than 20 minutes, we'd use ECS, and if it was less than 20 minutes, we'd use Lambda. Again, I think this is a really interesting pattern: I'm using Step Functions to decide which compute substrate to run my workload on.
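
A hedged sketch of that compute-choosing pattern in ASL: pick Fargate up front for long videos, and fall back to Fargate if the Lambda attempt fails anyway. The 1,200-second threshold mirrors the roughly 20-minute sweet spot mentioned above; cluster, task definition and function names are placeholders, and the catch uses States.ALL since the exact error a timeout surfaces as depends on how the task is configured.

```json
"CheckVideoLength": {
  "Type": "Choice",
  "Comment": "Around 20 minutes is where the Lambda version started timing out",
  "Choices": [
    { "Variable": "$.durationSeconds", "NumericGreaterThan": 1200, "Next": "TranscodeOnFargate" }
  ],
  "Default": "TranscodeOnLambda"
},
"TranscodeOnLambda": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "ffmpeg-transcode", "Payload.$": "$" },
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "Next": "TranscodeOnFargate" }
  ],
  "End": true
},
"TranscodeOnFargate": {
  "Type": "Task",
  "Comment": "Same code, no 15-minute limit; network configuration omitted for brevity",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "Parameters": {
    "Cluster": "transcode-cluster",
    "TaskDefinition": "ffmpeg-transcode",
    "LaunchType": "FARGATE"
  },
  "End": true
}
```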

Oh, the interesting thing again is that the code here is almost identical in my ECS task and my Lambda function. The only difference really is that my Lambda function, of course, needs an exports handler.

Customers were asking us for the capability to rerun a workflow from the state that failed and not have to go through all the prior states. And you know why, right? It's because if you go through the prior states, you have to wait for them to complete again and you incur the cost of all of those state transitions.

So we launched this about two weeks ago. It's called redrive. And what this allows you to do is exactly that: if a state in your workflow fails, you can now restart it from that failure. Now, these are the errors that you didn't catch, things like incorrect IAM permissions, something broken in the service downstream that you need to resolve.

There are some important things to know, though. You can't change the definition of your workflow when you restart; it has to be exactly the same workflow. And you can't change the input payload; it has to be exactly the same input payload. But this was, I think, our most requested feature for the last 12 months.

I want to show you some patterns for running Step Functions at scale. Now, the most simple way of doing this is to just send lots of executions to a particular workflow and they'll all spin up and run independently. So if any of those fail, you don't have to manage the failure; it won't affect the other workflows that are running, they're all independent.

Another way of running things in parallel in Step Functions is the parallel state. So the parallel state allows you to run a fixed number of branches for a given input. So each one of these branches has the same input given to it. Once they're all finished, the results are put into an array and sent on to the final state.

And here's a little tip that I discovered, because the parallel state is a bit of an odd one, but it's got some cool things you can take advantage of. If you're trying to pass an object, or pass some data, from the top of your workflow all the way down the chain to something at the bottom, you'll probably be doing it like this, right? You'll just pass it along and pass it along and pass it along. This gets tedious and it's very prone to error.

Well, what you can do is you can use the parallel state to preserve that thing that you want to pass along, right? So I have one branch to just hold the data, the other branch to do all the work. And then at the end of my parallel state, both of those things are available to me at the next step. So I don't have to keep passing that through my workflow.
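
In ASL that tip looks roughly like this: one branch is just a Pass state holding the original input, the other does the work, and the state's output is an array containing both. The names are placeholders.

```json
"DoWorkKeepInput": {
  "Type": "Parallel",
  "Comment": "Output is an array: [ original input, work result ]",
  "Branches": [
    {
      "StartAt": "KeepOriginalInput",
      "States": {
        "KeepOriginalInput": { "Type": "Pass", "End": true }
      }
    },
    {
      "StartAt": "DoTheWork",
      "States": {
        "DoTheWork": {
          "Type": "Task",
          "Resource": "arn:aws:states:::lambda:invoke",
          "Parameters": { "FunctionName": "do-the-work", "Payload.$": "$" },
          "End": true
        }
      }
    }
  ],
  "Next": "UseBothResults"
}
```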

Another thing to know with the parallel state is that if any of those states inside that parallel state error out and you don't catch the error, then your whole workflow will error. So just as a best practice, make sure you catch any errors you can with your parallel state.

Then we have something called dynamic parallelism. Now, this is basically a sub-workflow within your workflow. It's called the map state. I can provide my map state with an array of items, and then each one of those items will execute in parallel, so I can do things like fan out.

So here's an example. We built a web application called Serverless Land. Has anyone ever used serverlessland.com? Oh, about a third of the room. That's great. So Serverless Land is a great resource for learning more about anything to do with serverless: serverlessland.com.

The first thing we did on serverlessland.com when we built it about three years ago was to just have a place to aggregate all the new AWS blog posts that were about serverless, and this workflow is what did that.

So the first thing this workflow does is it uses a Lambda function to grab an RSS feed of all the new blog posts on a given day. That creates an array of blog post URLs that it hands to the map state, and then in parallel it scrapes or scans each one of those URLs. That creates an object for each blog post, a metadata object, which we then save to GitHub before rebuilding the site. It's a static site.

This worked great. We improved it, though, with this example of the scatter-gather pattern, or the fan-out fan-in. So here I'm still scanning for all of my URLs, I still provide each one of those URLs to scrape the metadata, but I move the GitHub state outside of my map state, so I'm only writing to GitHub once. This means I'm not incurring a state transition every time I write to GitHub.
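
A sketch of that scatter-gather shape in ASL: the map state fans out over the array of URLs, and the single GitHub write happens in a separate state after the map finishes. Field, state and function names are placeholders.

```json
"ScrapeEachPost": {
  "Type": "Map",
  "Comment": "Runs the inner states once per item in the input array, in parallel",
  "ItemsPath": "$.blogPostUrls",
  "MaxConcurrency": 40,
  "Iterator": {
    "StartAt": "ScrapeMetadata",
    "States": {
      "ScrapeMetadata": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "scrape-metadata", "Payload.$": "$" },
        "End": true
      }
    }
  },
  "Next": "SaveToGitHub"
},
"SaveToGitHub": {
  "Type": "Task",
  "Comment": "Single write after the fan-in, rather than one write per item",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": { "FunctionName": "save-to-github", "Payload.$": "$" },
  "End": true
}
```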

Then customers were asking for more parallelism. The map state only allows you to run 40 items at a time. So people were embedding map states within map states within map states to get more parallelism. Pretty clever, but pretty awkward to manage.

So last year, we introduced something called the distributed map state, and this allows you a max parallelism of 10,000. So we went from 40 to 10,000 with this new state. It interacts really nicely with S3. So you can point it at your S3 bucket, and you can say: for every object in my S3 bucket, I want you to run these workflow steps in parallel. Or you can point it at a file in your S3 bucket, like a JSON file or a CSV file, and you can say: I want to run these steps for every item in this CSV file.
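
Here's a hedged sketch of a distributed map state that reads objects straight from an S3 bucket and fans each one out to an express child execution; the bucket, prefix and function names are placeholders.

```json
"ProcessEveryObject": {
  "Type": "Map",
  "Comment": "Distributed mode reads items from S3 and fans out up to 10,000 parallel child executions",
  "MaxConcurrency": 10000,
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": { "Bucket": "my-video-bucket", "Prefix": "segments/" }
  },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "ProcessObject",
    "States": {
      "ProcessObject": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "process-object", "Payload.$": "$" },
        "End": true
      }
    }
  },
  "End": true
}
```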

Here's a nice visual example of using the distributed map state. So this is an application that creates a GIF animation for every 30 seconds of an MP4 file. I drop my MP4 file into S3, and that starts the workflow. The first thing the workflow does is it uses a Lambda function to say: OK, based on the size of this MP4 file, if I have to make a GIF animation for every 30 seconds, here's an array of where every one of those GIF animations needs to start and finish. I save that array in S3, and I pass that to my distributed map state. My distributed map state, in parallel, runs a Lambda function to create each GIF animation. And here's the result: a silly picture of me giving a thumbs up.

Why am I showing you that? This is the other pattern, by the way, that I'm using all the time. Because what this means is it doesn't matter if that MP4 file is 10 seconds or 10 hours: this will take about the same amount of time to complete, because I'm running it in parallel. This is what serverless applications are all about: breaking down tasks into smaller and smaller pieces so that you can use the power of the resources available to you from AWS to run things in parallel, right?

Another thing to know about the distributed map state is that every iteration of a distributed map state run is actually its own separate workflow execution, with its own payload limits. So this comes with some cheeky bonuses as well. And I stumbled on this by accident.

Here's an application I wrote to use the GitHub API every day to get statistics about a number of, not blog posts, repositories on the AWS GitHub org, hundreds of repositories every day. We wanted to scan the data so that we could keep it, because I think you can only look back two weeks on GitHub. And after a few months, I wanted to run some calculations on that data to analyze it a little bit.

And so I created this other workflow to scan all the data that was in my DynamoDB table and pass that along to my map state to run. It didn't work, because the payload was too big. Step Functions has a max payload limit of 256 KB, and I'm trying to pass in nearly 3,000 objects, and they're very large objects, directly into my map state.

So I thought, oh, that's the first time I've hit a scaling or a limit issue with Step Functions. So we had to think, and what we did is we broke down the task into smaller and smaller tasks that we could run in parallel, and we got it working almost immediately.

What we do instead is we scan for each unique GitHub repository URL, and that creates an array. We put that array into a parallel-processing distributed map state, which then scans for each item of that particular GitHub repo in parallel. That produces about 90 items each, which we then hand to another distributed map state. So it's exactly the same workload that I was trying to do before, just broken down into smaller tasks running with greater parallelism. It's the fundamental reason you should be using, I think, all the resources available to you on AWS.

Here's a great resource for learning more. This is s12d.com/workflows. And here you have well over 100 different Step Functions workflows, different patterns and best practices. Every single pattern I've shown you today, I've either added to this or taken from this as an example of what people are doing. These are contributed by AWS experts, by customers and by partners. You can browse by use case, browse by service. You can view the infrastructure-as-code template. If you're googling some kind of issue with Step Functions in CDK or SAM or Terraform or CloudFormation, you've probably been to this website, or if you're looking for an ASL definition, you've probably landed on this website already.

And what you can do is you can deploy any one of these workflows directly into your AWS account with that launch stack button. Here's that extra resources page, where you can get downloadable templates, blog posts, many, many videos, some really good workshops that are quite in-depth and will take a few hours for you to complete, and some code samples, literally hundreds of code samples you can use as a jumping-off point.

I was going to end there. But then some stuff came out this week with Step Functions that I just wanted to mention in case you missed it, right? There's even one that's not on there because it came out today, and I wasn't sure if I was gonna be before that announcement or after.

The first one is redrive, which I mentioned earlier. You can rerun a workflow from a failed state, from the point of failure, right? That came out two weeks ago.

The next is the HTTP state. What this allows you to do, from your workflow, is call out to any public URL, and you can authenticate that call in various ways. So this means you can integrate your own APIs, SaaS APIs that you're using, things like Stripe, Adobe, whatever it might be, Salesforce, or anything that you've already created an API for, maybe your own API with API Gateway. You can call out directly from Step Functions. This was launched two days ago.
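
The HTTP task looks roughly like this sketch; the endpoint and the EventBridge connection ARN (which holds the credentials used to authenticate the call) are placeholders.

```json
"CallExternalApi": {
  "Type": "Task",
  "Comment": "The EventBridge connection stores the API credentials and handles the auth",
  "Resource": "arn:aws:states:::http:invoke",
  "Parameters": {
    "ApiEndpoint": "https://api.example.com/v1/charges",
    "Method": "POST",
    "Authentication": {
      "ConnectionArn": "arn:aws:events:us-east-1:123456789012:connection/my-api/example"
    },
    "RequestBody": {
      "orderId.$": "$.orderId"
    }
  },
  "End": true
}
```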

Along with that came this test API, which I think is the biggest launch that went out this week. And what this allows you to do is to test any individual task on its own, so you don't have to run the whole workflow to see what will happen. You can select any task, you can give it an input payload, and you can click play, and you can see what the results look like. You can see if you didn't give it the correct IAM security permissions. You can track the input/output path processing, and you can do that directly from Workflow Studio: the test API.

And then finally, there's this new Bedrock integration, which allows you to invoke Bedrock directly from Step Functions. There's no Lambda involved in that. Not only can you invoke the model with a prompt, you can actually customize your model. So you can do prompt chaining with Step Functions. It's brand new, so I've not seen many examples yet, but I'm quite excited to see what people come up with.

And then the final one, which was announced just this morning, is that you can actually use the Step Functions Workflow Studio in your IDE now, through App Composer. So App Composer is now available in your IDE, and App Composer has an integration with Workflow Studio. So you can spin up something like VS Code and you can use Workflow Studio from within your IDE.

I didn't want you to miss any of those new launches, you know, at the end of this talk. So I just wanted to at least let you know about them. I'm happy to talk after with any questions that you have, but that's all for now. So thank you very much and enjoy the rest of your week. Thanks.
