Advanced serverless workflow patterns and best practices

This is Advanced Serverless Workflow Patterns and Best Practices. My name's Ben Smith. I'm a Principal Developer Advocate for Serverless at AWS. I'm based out of our London office, and I'm absolutely buzzing to be able to give you this particular talk. It's a title that I wanted to submit to re:Invent myself, because I've spent the last two years building predominantly with Step Functions and talking to advanced customers who are using Step Functions, and I wanted to pass on some of that learning to you. That's really what this whole session is all about.

My background is in service orchestration and workflow automation, and I was a web developer for about 15 years before joining AWS. This link here is a landing page that covers everything I'm going to talk about over the next hour or so. You have deployable templates, SAM templates, CDK templates, lots of blog posts about some of the features I'll talk about, introduction videos, instructional videos, some really good online workshops (including the Serverlesspresso workshop and a new Step Functions workshop), and some code samples as well. I'm not going to be putting up lots of links throughout the show, but I will give this slide again at the very end in case you didn't manage to catch it here.

So one of my goals for this talk is to have you thinking about how you can build your next application or your next microservice as a workflow. And to that end, I'm even doing my agenda as a workflow; I think of every new problem as a workflow at the moment because I'm spending so much time building with Step Functions. This is what I'm going to be covering today. I'll talk about why I think you should use Step Functions first when you start building your next application, and I would even go so far as to say why I think you should use Step Functions always. Then we'll talk about the different modes you can run Step Functions in: as a standard workflow or an express workflow. I'll spend a little bit of time on the step life cycle and how you can transform data as it passes through an individual state. I'll spend quite a lot of time on optimizing your workflow to reduce costs, and then we'll look at some patterns for handling errors. At the end, I'm going to talk about a brand new feature that was announced today in Werner's keynote, for using Step Functions at super high scale. I'll save that for the end.

These are the goals I want to accomplish, the things I want you to walk away with. I want you to get more value out of the service itself, and hopefully you can pass that value on to your customers. I want to help you build applications that are reliable and resilient to faults and failures; applications that scale automatically, that are simple to reason about and simple to build, and that you can build as quickly as possible. And really, every talk you come to at re:Invent should help fulfill at least one of these goals for you.

This is how I think about building serverless applications. It's all about using each individual service for the unique purpose for which it was developed: moving your code into services (not servers, that was an accidental slip up). This is a pyramid of the way I think of services on AWS, where the services at the bottom are the least managed for you, and as you move up the pyramid, the services become more and more managed, and you're able to let AWS handle that undifferentiated heavy lifting for you.

And when you're building applications this way, relying on services rather than big monoliths of code, you more often than not have to rely on more than one service. So the challenge changes to: how do you pass data between these services? How do you orchestrate them? Do you use an API approach? Do you use an asynchronous, event-driven approach, which matches really nicely with serverless applications? This is where Step Functions really helps solve the challenge of understanding how your application is put together when it's spread out across lots of different resources and services in AWS. It helps you visualize, build, inspect, and orchestrate the passing of data between those services.

And this is a 300-level talk on Step Functions, but for anyone who's very new to Step Functions, or just interested in what it is, these next couple of slides are for you. Step Functions is a service on AWS that lets you create workflows. These workflows move the output of one step to the input of the next step, and you can arrange that with conditional logic, branches, parallel states, loops in the form of the Map state, and Wait states.

And here is a GIF animation of me building out a simple workflow, where I'm picking my action from a column on the left, dragging it into my design view, adding logic to it, and then configuring each individual action with a column on the right, with these nice editable forms.

Now, Step Functions is serverless. So it's pay-per-use, it's a fully managed service, and it scales automatically. It has this nice drag-and-drop interface, which I think is actually one of the nicest UXs on the AWS console. But of course, you can also build Step Functions as code using ASL, the Amazon States Language, which is really JSON with a sprinkle of Step Functions syntax on top. What that means is you can take that ASL and commit it to repositories, share it with other developers on your team, make pull requests, add it to your deployment pipelines, and build out workflows locally as well as in the AWS Management Console.

Now, what's really special about Step Functions is that it integrates natively with over 220 AWS services, and it does that by using their SDK actions. That means there are over 9,000 actions you can invoke directly from your Step Functions workflow. There's no Lambda involved or anything like that; the Step Functions service is invoking the SDK directly for you. This is why I think you should use Step Functions always when you're building your application, or at least you should start with a Step Functions workflow.

Traditionally at AWS, when we talk about getting into serverless architecture, you'll normally hear people say: start with a Lambda function, understand the event-driven model, understand the Lambda event handler. Then introduce another service for state, maybe S3 or DynamoDB, and use the SDK to interact with DynamoDB. Then maybe add API Gateway, so you have a front door to your business logic, and then add messaging with SNS, maybe queuing with SQS, maybe EventBridge, and so on and so forth. And then you'll hear people say: when you know all of that, put them together in a workflow and orchestrate them somehow. I say come at it from the other way round: start with a Step Functions workflow and start dragging in your actions as you build. It makes more sense, and I'll show you why I think that makes more sense.

This is the more traditional way we speak about building serverless applications. I have a Lambda function and I just want to get an item from a DynamoDB table. So what do I need to do? It's my first day building a serverless application. Well, I need to pull in the SDK to interact with DynamoDB. I'll probably use the DocumentClient library so I can marshal and unmarshal data from DynamoDB. I need to set up my parameters to tell DynamoDB what table I'm going to interact with, so I've got my table name, my partition key, and my sort key there. Then I set up my query: here it is in a try/catch block where I use the DocumentClient, I promisify the call so I can await it, I wrap that inside my try/catch block, and I return any errors. And then I've got my Lambda exports handler: my event object, my context object, a new try/catch block. I call my query items function, I stringify the response (because I probably want to stringify it if I'm returning from Lambda), and I return any errors. Those 20 lines of code are just to get something from a DynamoDB table where I already knew the partition key and the sort key. And I've had to go through and understand what the event object is and what the exports handler is, and write all 20 lines. That's 20 places where something can go wrong, because it's my code, right?
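For reference, here's a minimal sketch of the roughly 20 lines being described, assuming Node.js with the AWS SDK v2; the table name, keys, and values are illustrative:

```javascript
// Minimal sketch of the handler described above (AWS SDK v2 for Node.js;
// table name and key values are illustrative).
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Parameters telling DynamoDB which table and item to read
const params = {
  TableName: 'OrdersTable',
  Key: { PK: 'customer#123', SK: 'order#456' },
};

// The query: promisified so it can be awaited, wrapped in try/catch
const queryItems = async () => {
  try {
    const data = await docClient.get(params).promise();
    return data.Item;
  } catch (err) {
    throw err;
  }
};

// The Lambda exports handler
exports.handler = async (event, context) => {
  try {
    const item = await queryItems();
    return { statusCode: 200, body: JSON.stringify(item) };
  } catch (err) {
    return { statusCode: 500, body: JSON.stringify({ error: err.message }) };
  }
};
```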

I can get a lot of value straight away by moving this same workload into a Step Functions workflow that looks just like this: a single-step workflow with a GetItem action. It's just as scalable. It's comparable in terms of cost and performance; in fact, it probably performs quicker, because there are no cold start times, no initialization phase, no Lambda functions at all. I can still configure retries, and I can still catch any errors and send them to a dead-letter queue. And this is not just a nice diagram in a PowerPoint presentation; this is how it actually looks in the service.
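As a sketch, the ASL for that single-step workflow might look something like this (the table, key names, and queue URL are illustrative):

```json
{
  "Comment": "Single-step get-item workflow; names are illustrative",
  "StartAt": "GetItem",
  "States": {
    "GetItem": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "OrdersTable",
        "Key": {
          "PK": { "S.$": "$.pk" },
          "SK": { "S.$": "$.sk" }
        }
      },
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 1,
          "MaxAttempts": 3,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        { "ErrorEquals": ["States.ALL"], "Next": "SendToDLQ" }
      ],
      "End": true
    },
    "SendToDLQ": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage",
      "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/my-dlq",
        "MessageBody.$": "$"
      },
      "End": true
    }
  }
}
```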

So if I come back to this in six months' time, having built lots of other things and been context switching, and I need to edit my application or extend it, which one of these two do you think would be easier to pick up and understand? It's the workflow, right? You can look at that and see exactly what's happening with this application.

Then, with the workflow version, as I start sending executions to it, I can see all the failed executions and all the successful executions aggregated nicely. I can drill down into one of those failures, see that failure bubble up, and understand exactly what part of the workflow it failed at, based on this nice visual graph. I can also click through each individual step in my workflow and see what the input and the output of that step look like. These are things that are really difficult with serverless applications, with resources spread across services and maybe multiple accounts: understanding how the data is transformed as it moves through each service. But here it's built into the workflow.

So let's go back to this other version, where I'm building in Lambda. This is a very common pattern that we see customers getting into; I was doing a workshop on Monday and someone I was speaking to had fallen into exactly this (hello if you're in the crowd). This is what we see happen a lot. I built my Lambda function. It's doing something, say getting an item from a DynamoDB table, and it's scaling automatically, and I'm really happy with it because it's handling all those requests. Then I get more requests from the business for my little microservice to do more things. So I add the ability for this Lambda function to create items in DynamoDB, to update, to delete, because I've got to move quickly, right? I've got to keep up with the requirements of the business. I put API Gateway in front of it, so I have this front door to my application, and I route all requests to Lambda. So now, in my Lambda function, I need some routing logic to say: if the request looks like this, do a create; but if it looks like that, do an update. That's more code in my Lambda function.

Now, all the configuration for my function is applied no matter which one of those requests is running. So it's quite a loosely permissive IAM role: it's got IAM permissions to create, update, and delete, regardless of which of those requests it needs to perform. The memory setting is applied to the whole function regardless of which one, and the duration and timeout settings are applied to all of it. You could get little tweaks in optimization and cost if you were to separate this out into multiple functions. So this is what people at AWS will often say to do, and this is good; this is a step in the right direction for sure.

Now I've moved all the routing logic out of my Lambda function and into API Gateway. So API Gateway is configured to know which one of these three functions to invoke based on the request. I've split the code out across three functions: very similar code doing slightly different things (create, update, and delete), with different permissions. Now I can configure them with different performance settings, memory, and timeouts. But if I come back to this in a few weeks' or months' time, I need to go through my infrastructure-as-code template to figure out which Lambda function I need to edit, or which Lambda function is causing the error. Then I need to drill down into the code to understand what the code's doing again. And there's no visual representation of what this looks like; this is just a drawing.

This is the first pattern I want to show you: a REST CRUD API based on Step Functions workflows. Here I still have API Gateway in front of my workflow, and that's handling authorization and authentication. Maybe it's still throttling for me, and it's routing away any unwanted requests. But anything that's valid gets sent on to my workflow, where the first step is to say: OK, which branch do I go to next — GET, PUT, POST, or PATCH? And then it runs that action directly against DynamoDB. So there's no code involved in this entire little REST API. I could of course add other branches here that did have Lambda functions, or something to do with an S3 bucket, or whatever my application needed, and that would be nice and easy to do, because I could open this up and understand exactly what my application's doing. There's no Lambda cold start time, the performance is probably better, and the cost is probably similar.
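A sketch of that first routing step in ASL, assuming API Gateway passes the HTTP method through in the request payload (the $.httpMethod field and the state names are illustrative):

```json
"RouteByMethod": {
  "Type": "Choice",
  "Choices": [
    { "Variable": "$.httpMethod", "StringEquals": "GET",    "Next": "GetItem" },
    { "Variable": "$.httpMethod", "StringEquals": "POST",   "Next": "PutItem" },
    { "Variable": "$.httpMethod", "StringEquals": "PATCH",  "Next": "UpdateItem" },
    { "Variable": "$.httpMethod", "StringEquals": "DELETE", "Next": "DeleteItem" }
  ],
  "Default": "BadRequest"
}
```

Each target state would then be a direct service integration, such as arn:aws:states:::dynamodb:getItem or arn:aws:states:::dynamodb:putItem, with no Lambda function in between.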

We built this. Has anyone tried the Serverlesspresso booth in the expo hall? A couple of hands, cool. It was mentioned this morning by Werner Vogels in his keynote, which was pretty awesome. Serverlesspresso is a pop-up coffee shop that we built for re:Invent last year. The idea was that we wanted to build a completely production-ready serverless application that solves a real problem at re:Invent, something people could come and use so we could talk to them about the application they'd just used. And so we came up with Serverlesspresso. One of the microservices in Serverlesspresso was this thing called the order manager service. It was responsible for maintaining the state of each individual order submitted into the system when people order their coffee. And it was pretty straightforward: just an API Gateway URL that routed on to a Lambda function, which would interact with the DynamoDB table and create, update, or edit the order according to what it needed to do. There's some compute going on in here too; it would have to sanitize the order as well. But this was quite slow. We were getting latency: people were waiting a few seconds to see their order number when it was generated, it was causing some weird flickering issues with the WebSocket connection to the screen that showed all the running orders, and we just weren't happy with the overall performance. We got through re:Invent, but we wanted to make this better. So we used this REST CRUD API pattern for Serverlesspresso: we rebuilt the whole order manager service as a workflow. Now, when we need to update something or check why an execution failed, we can just open up the workflow execution, see what path was taken, and drill down into the individual step to find the error. And extending it is just as simple: we can just add a new branch. So this is more performant, it costs less money, and it's much easier to build out for the future.

Now, we chose to run this as an express workflow, and that's what I want to talk about next: why we made that decision. There are two ways you can run a workflow in Step Functions: you can create a standard workflow, or you can create an express workflow. Standard workflows are what came first; that's the original Step Functions workflow type. These are long-lasting; in fact, they can run for up to one year. They're asynchronous, so when you trigger a standard workflow, you don't hang around and wait for the response, because it could take a year.

So you need to fetch that response by some other means. And they have an exactly-once execution model: if you send a payload to invoke a standard workflow, you can be sure there won't be any duplicate invocations for that payload.

Then, on the other hand, we have express workflows. These were introduced, I think, in 2019, and they're much higher throughput: they transition through states much faster. They have a different cost model, which I'll talk about in a minute, and they have an at-least-once execution model. So if you trigger an express workflow with a given input, there is a chance it will have a duplicate execution.

So you need to make the steps in your express workflow idempotent, so that no matter how many times you run it with a given input, it will always give the same result. Now, express workflows can be run synchronously or asynchronously, and this is one of the reasons we chose them for the order manager service: we could send a request via API Gateway and run the workflow synchronously. API Gateway can wait for up to 29 seconds for the response, so we could send the response back via API Gateway to the user, right back to their phone. You can also run them asynchronously as well.

Now, here's the most important thing: they can only run for up to five minutes. So if you have a workload that takes longer than five minutes, you're going to have to choose the standard workflow instead. I wanted to show you what this looks like with some real data, comparing an express workflow and a standard workflow side by side.

And this is a simple implementation of an ecommerce workflow; let me see if I can get this to scroll. In this, you do things like approve an order request, poll a DynamoDB table to check that the request is ready, notify that there's a successful order about to go out, and then poll to check that the payment is done. And then there are some Lambda functions that run to update inventory, the warehouse, and stuff like that. A high-level version of an ecommerce workflow, right?

So I saved this as a standard workflow and I saved it as an express workflow, and I created a simulator to submit 1,000 orders to each workflow to see what the results were, with a CloudWatch dashboard to see what happened. And this is what I found.

The standard workflow's average duration was 11.9 seconds, and the express workflow's average duration was 11.3 seconds. So it's a little bit faster on average. That's kind of interesting, but it's not mind-blowing on its own; it's what we would have expected. This is interesting though: the standard workflow, for 1,000 executions, cost 42 cents, whereas the express workflow cost one cent.

Now, there's a slight caveat to that: the standard workflow does have a free tier, so we're assuming we've gone past the free tier, and I'm not factoring in the cost of any downstream services or Lambda functions, just the cost of the workflow orchestration piece. So there's a big difference there. And this is important, because I sometimes hear from customers that Step Functions is expensive. I think that's a misconception: there are so many features in Step Functions, so many parts of the service you can use to optimize for cost. That's where I think some customers are going wrong, and why the service has this misconception of being expensive.

Here's how I worked out the cost of my standard workflow. It's very easy to work out how much a standard workflow costs to run: it's based on the number of state transitions. This workflow, on the happy path, has 17 state transitions; a state transition is when the data moves from one state to another, pretty straightforward. And you pay $0.025 per 1,000 state transitions.

So to work out the overall cost, you take the number of state transitions, multiply by the number of executions, and multiply that by $0.000025 per transition. And there's the final result: 42 cents.
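Worked through with the numbers quoted above, that's:

```
17 transitions × 1,000 executions × $0.000025 per transition
  = 17,000 × $0.000025
  ≈ $0.425  →  about 42 cents
```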

The express workflow is a little bit harder to calculate. It's based on the duration of your workflow (how long it takes to complete) at a given memory allocation: how much memory Step Functions needs to complete that workflow.

So this is how I calculated it: you take the request cost, add the duration cost, and multiply by the number of requests. The duration cost is the execution time, billed in 100-millisecond increments, multiplied by the memory price.

Now, the memory cost: this table is taken from the AWS website, and this is where I say you need a certain amount of memory to run Step Functions. Think of it as RAM, and it's based on the size of your workflow definition (the size of the ASL) and the size of the input payload you send as it moves through the workflow.

And you can get this from a CloudWatch dashboard. This was the memory configuration: it jumps up in 64-megabyte increments, and my workflow was using the lowest tier here. This is the price per 100 milliseconds for the first 1,000 GB-hours: $0.0000001042 at the 64 MB tier.

So the final result is one cent. OK, now let's say I run that a million times and extrapolate the cost. Now it gets very interesting and very different: the standard workflow would cost about $425, and the express workflow would cost $12.77, which when you put them together is kind of an amazing difference. It's the same definition; I'm just running it in a different mode.
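Here's the express side of that, worked through at a million executions using the prices quoted above ($1.00 per million requests, and $0.0000001042 per 100 ms at the 64 MB tier):

```
request cost  = 1,000,000 × $0.000001               = $1.00
duration cost = (11,300 ms ÷ 100 ms) × $0.0000001042
              ≈ $0.0000118 per execution
              × 1,000,000 executions                ≈ $11.77
total         ≈ $1.00 + $11.77                      = $12.77
```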

So the question is: why wouldn't you always run something as an express workflow? Well, there are maybe three times when you wouldn't. The first is if the workload takes more than five minutes; then you can't run it as an express workflow on its own. If you want to use the .sync or the .waitForTaskToken callback patterns, you can't run it as an express workflow. And if you actually require exactly-once execution, then again, you have to run it as a standard workflow.

But there's still a lot you can do here. How about both? Why not use both workflow types in the same workload? And this is the next pattern I want to show you: the nested pattern, where a standard parent workflow invokes a synchronous express child workflow. You get the cost benefits of the express child workflow and the long-running benefits of the standard workflow.

Coming back to my example, see this bit here? This is polling, waiting for something to finish. Let's say sometimes this takes more than five minutes; maybe it requires some human interaction, for example. If it took more than five minutes, I would have to run this as a standard workflow.

So here's what I would do. I would take these idempotent Lambda functions out of my workflow and save them as a separate express workflow, and then I would invoke that with the SDK integration for Step Functions, because Step Functions can trigger Step Functions with the StartExecution action. And so now I've embedded one workflow inside the other, and we can have a look at the cost.
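A sketch of the parent's task state, assuming the synchronous express child is called through the Step Functions SDK integration (the state machine ARN is illustrative; States.JsonToString is used because the SDK's Input parameter expects a string):

```json
"RunExpressChild": {
  "Type": "Task",
  "Resource": "arn:aws:states:::aws-sdk:sfn:startSyncExecution",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:ChildExpress",
    "Input.$": "States.JsonToString($)"
  },
  "Next": "ContinueParent"
}
```

The parent blocks on this task until the express child returns, so the child's result flows straight back into the standard workflow.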

I won't go through the maths in detail, for obvious reasons. But the thing to know is that my standard workflow now has fewer transitions, only 14, so the cost of my standard workflow has gone down. The cost of my express workflow is almost nothing; it barely affects the overall cost when I add them together. So the total cost is now just 30 cents. I'll come back to that example in a moment, but first, let's have a little look at the step life cycle.

When you invoke your Step Functions workflow to start executing, you give it a payload of some sort. You also have available to your workflow, at any point, something called the context object. The context object contains things like the execution ID, the time the workflow started, the ID of the state machine, and stuff like that.

Now, you can access the context object using a double dollar sign ($$), and you can access the event or payload object using a single dollar sign ($). And there are various points in the execution of each task where you can use these two objects to manipulate and transform the data as it moves through the task.

You can do that in the InputPath, in the ResultSelector, in the ResultPath, and in the OutputPath, and this is all within one single state. So let's say I invoke my workflow and this is my inbound event, but I only really want to pass everything inside this white box to the actual DynamoDB call I'm going to make.

So in my InputPath, I put "$.invokingEvent". I'm saying that everything inside the object called invokingEvent is what I want to use in my next call to the AWS service. So that's what's used in the call.

Then I get a response from DynamoDB, because that's the service I'm using in this particular call, and this is a long, typical response from DynamoDB. But let's say I don't want all of that; I just want three values: the status code, the request ID, and the request date.

Well, in the ResultSelector I can construct a brand new object based on just those three values, which you can see here in the gray box, and you can see what it evaluates to in the blue box below. This then gets passed on to the ResultPath, and in the ResultPath I tell it where I want to put these three new values. Let's say I want to put them inside something called detail.logInfo, and you can see there it's inside detail.logInfo.

And then finally, in the OutputPath, I can say what part of all of this I want to pass on to the next state. I say I want to pass on everything inside the detail block, and that then becomes what gets passed to the next state, and the cycle continues again.
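Putting the whole cycle together, a single state using all four fields might look like this sketch (the table name and the response metadata field names are illustrative of what a service integration returns):

```json
"RecordEvent": {
  "Type": "Task",
  "Resource": "arn:aws:states:::dynamodb:putItem",
  "InputPath": "$.invokingEvent",
  "Parameters": {
    "TableName": "EventsTable",
    "Item": { "PK": { "S.$": "$.id" } }
  },
  "ResultSelector": {
    "statusCode.$": "$.SdkHttpMetadata.HttpStatusCode",
    "requestId.$": "$.SdkResponseMetadata.RequestId",
    "requestDate.$": "$.SdkHttpMetadata.HttpHeaders.Date"
  },
  "ResultPath": "$.detail.logInfo",
  "OutputPath": "$.detail",
  "Next": "NextState"
}
```

InputPath filters what goes into the call, ResultSelector picks the three values out of the response, ResultPath inserts them into the state's input under detail.logInfo, and OutputPath passes only the detail block on to the next state.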

So why is that useful? Well, it's useful because you can use those different points to reduce, manipulate, and transform the payload as it moves through your workflow. That will help you reduce cost and keep your data nice and tidy and easy to understand.

When you need to reduce the cost of a standard workflow, you have to reduce the number of state transitions. That's really all that's costing you: the number of times you transition from one state to another.

Of course, it's also useful to reduce the number of downstream services for the total cost of your application. With my express workflow, what I want to reduce is the amount of time my workflow takes to complete and the amount of memory it needs in order to complete.

The way I do that is to reduce the payload as it passes through each state, which reduces the memory that's needed. Here are some other things you can do. You can use JSONPath filters to pick out different parts of the payload. For example, if I have an input payload from a library and I just want every book with an ISBN number, I can use a filter like that, and I will get back just the books with an ISBN number.

I can choose just the third book in the array, just books with an author, just books with a category of fiction and a price less than 10, and so on and so forth. It's just another simple way of picking data out of the payload.
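As a sketch, the kinds of JSONPath expressions being described look like this (the library payload shape is assumed):

```
$.library.books[?(@.isbn)]                                    books that have an ISBN
$.library.books[2]                                            just the third book
$.library.books[?(@.author)]                                  books that have an author
$.library.books[?(@.category == 'fiction' && @.price < 10)]   fiction under 10
```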

I can also use something called intrinsic functions. Intrinsic functions are built into the Step Functions service and allow me to perform simple transform and compute functionality without having to invoke another service like a Lambda function.

Typically, customers were having to use Lambda to do really simple things like concatenating two strings together or generating a unique ID. And customers were asking for a way to do that in Step Functions, so they wouldn't have to have an additional task, and an additional start-up time, just for calling a Lambda function.

So that's what intrinsic functions are for. About a month ago, we introduced 14 new intrinsic functions to the library, for things like managing arrays, JSON manipulation, and encoding and decoding data; there are things for math calculations and simple string operations as well. And this one's really useful: there's one for generating a unique ID. The way you use them is with this magic prefix, States., followed by the name of the intrinsic you want to use.

So States.Array will create an array from whatever input you give it; you can use States.ArrayContains to look for an item in a given array; States.UUID is the one that generates a unique ID; and so on and so forth. These are ways of transforming your data and doing simple calculations and manipulations without using Lambda as a kind of glue function.
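For example, a Pass state's Parameters block using a few intrinsics might look like this sketch (the field names are illustrative):

```json
"TransformOrder": {
  "Type": "Pass",
  "Parameters": {
    "orderId.$": "States.UUID()",
    "summary.$": "States.Format('Order for {}', $.customerName)",
    "isPriority.$": "States.ArrayContains($.tags, 'priority')"
  },
  "Next": "NextState"
}
```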

OK, back to the workflow. Here are some other ways I can optimize this workflow using some of the features of Step Functions. This polling mechanism here is an interesting one: I'm triggering an asynchronous service, and I have to go around this loop however many times until I get an answer I'm happy with before I can continue. There's another one down here as well, where I'm waiting for a payment to be processed.

So this is three state transitions times however many times I need to go through the loop. This is not good; we don't need to do it this way. And it's not just a problem with Step Functions; this is how asynchronous events work. The client sends a request, it gets an acknowledgement back, the work carries on in the background, and there's no way of getting the response without requesting it somehow.

There are two common ways around this. One is polling, and polling is very easy to implement. It's so easy that my six-year-old daughter does it to me every time we leave the house. Just last week, we were on a train to London to go to the theatre, and as soon as the train pulled off, she started asking me, "Dad, are we there yet?" And she polled me every six minutes until we arrived at Victoria station. It's really effective; she definitely got the result she wanted. But it's very noisy, right?

And in terms of Step Functions, it's also very noisy, because you're paying for that loop each time, for each of those state transitions. So there's a better way of doing this, both for my sanity and for your Step Functions cost, and that's to use the callback mechanism.

This would be the equivalent of us getting on the train and her asking me, "Dad, can you tell me when we're there?" And I say, "Sure thing, let's just look at our phones for an hour," and then when we arrive, I say, "Good news." This also works with Step Functions, and it's the next pattern I want to show you: the callback pattern, or the "call me, maybe" pattern. This is really powerful; I think this is where Step Functions really starts to excel.

Here's a very simple implementation of that, where I put something on an SNS topic. It might be hard to read there, but the action is publishing to an SNS topic, and in red you can see it says .waitForTaskToken. So as well as putting the payload on the SNS topic, I add a unique task token to that message. And then Step Functions will wait on this task until it receives that task token back from some other service.

"And it will wait for up to a year. Normally you'd use a Lambda function with that task token. And this SDK call called SendTaskSuccess to say to your Step Functions, "OK, here's the token you're waiting for. You can go ahead and continue now."

And this is the pattern that expands out into all sorts of interesting use cases. Imagine you have a workload where individual milestones have to be met before you can progress, like a production line, like Serverlesspresso.

So here I start at milestone one, and I emit an event onto my EventBridge event bus along with my task token, and my workflow pauses there. I have a rule in EventBridge to route that event to a DynamoDB table; I'm going to save the token for later on, because I'm going to need it to resume my workflow.

I have another rule, presumably, which will do whatever milestone one needs to do, so that will invoke some other microservice. And when that's finished, it grabs the task token back from DynamoDB and says: "OK, Step Functions, milestone one is done. You can go ahead." And so it rolls straight on to milestone two, and the pattern continues.

I save the task token for milestone two. But let's say I don't want to wait for up to a year; let's say if nothing happens within a certain time, I need to do something. Again, Serverlesspresso: if your coffee is not fulfilled within 15 minutes, we turn off the QR code and stop more orders coming in, because it means the baristas are too busy.

So you can set something called HeartbeatSeconds. This would say: only wait for 20 seconds, and if you don't hear back within 20 seconds, go down this route instead — throw an error, catch the error, and emit that there's been a timeout, for example.
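Sketching that onto the callback state from before, with a hypothetical timeout-handling state (heartbeat timeouts surface as the States.Timeout error):

```json
"NotifyAndWait": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish.waitForTaskToken",
  "HeartbeatSeconds": 20,
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:orders",
    "Message": { "taskToken.$": "$$.Task.Token" }
  },
  "Catch": [
    { "ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout" }
  ],
  "Next": "ContinueWorkflow"
}
```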

It's a really interesting pattern, I think. So I put that back into my workflow here. I can chop away six states, plus however many times it loops round, and all I need to do on the state before is add a .waitForTaskToken. Then, obviously, I need a separate Lambda function to call back into the workflow at some point.

Now I've removed a whole bunch of states from my standard workflow, so it has just eight state transitions and costs just 20 cents to run, and my express workflow still costs almost nothing in the scale of things.

So the overall cost of this optimized workflow, which started at 42 cents, is now just 20 cents. And I haven't even applied any intrinsics or JSONPath processing; I could go on and on to reduce this further. But I want to talk about managing failures first.

This is Werner Vogels. He's the CTO of Amazon.com, he was the keynote speaker today, and he's well known for having said "everything fails all the time." Step Functions has a brilliant mechanism for dealing with failures, and it's very visual too.

This is the pattern called the saga pattern. It's quite a commonly known pattern, but Step Functions executes it really well. Let's say I have a holiday booking system where I book a hotel, then book a flight, then book a car. These are sequential steps, and if they all finish on the happy path, my workflow exits.

But let's say my book-hotel task fails and I need to roll it back: I can catch that failure and un-book the hotel. If the book-hotel step passes and I go on to book a flight, but that fails, well, I've already booked my hotel, so I need to roll both back. I can do that sequentially by catching the failure, cancelling the flight, and then cancelling my hotel; and the same if I were to book a car. So I can waterfall back down, rolling back depending on which part of the workload failed. A really elegant pattern for catching failures and rolling back.
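A compressed sketch of that saga in ASL, with hypothetical Lambda ARNs, where each compensation step chains back down:

```json
{
  "StartAt": "BookHotel",
  "States": {
    "BookHotel": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:BookHotel",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelHotel" }],
      "Next": "BookFlight"
    },
    "BookFlight": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:BookFlight",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelFlight" }],
      "Next": "BookCar"
    },
    "BookCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:BookCar",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "CancelCar" }],
      "End": true
    },
    "CancelCar": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CancelCar",
      "Next": "CancelFlight"
    },
    "CancelFlight": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CancelFlight",
      "Next": "CancelHotel"
    },
    "CancelHotel": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CancelHotel",
      "Next": "BookingFailed"
    },
    "BookingFailed": { "Type": "Fail", "Error": "BookingFailed" }
  }
}
```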

This is another interesting one: the circuit breaker. Here I have a DynamoDB table that maintains the overall status of my circuit, my application. It's good for preventing repeated requests to a microservice or a Lambda function when I know that function is not going to work.

So I keep the overall state in my DynamoDB table. The first thing I do is get the status of my circuit, and then I query that status with a Choice state. If the circuit's closed, meaning my application is healthy, I go ahead and run my Lambda function.

If my Lambda function doesn't fail, I exit, and everything's good. The next request comes in a moment later; it gets the overall status of the circuit and runs the Lambda function, but this time the function fails. I catch that failure, and I update the circuit to say: the circuit's open now, we have an error.

Next time a request comes in, I get the overall status of the circuit and I see that my circuit's open. So I don't go ahead and invoke my Lambda function, and I don't incur whatever cost that may be. Maybe instead I emit an event to an administrator to say something's broken and needs to be looked at. The circuit breaker pattern.
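A sketch of that circuit breaker in ASL (the table, attribute, and function names are illustrative):

```json
{
  "StartAt": "GetCircuitStatus",
  "States": {
    "GetCircuitStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:getItem",
      "Parameters": {
        "TableName": "CircuitStatus",
        "Key": { "ServiceName": { "S": "my-service" } }
      },
      "ResultPath": "$.circuit",
      "Next": "IsCircuitClosed"
    },
    "IsCircuitClosed": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.circuit.Item.Status.S",
          "StringEquals": "CLOSED",
          "Next": "InvokeFunction"
        }
      ],
      "Default": "NotifyAdministrator"
    },
    "InvokeFunction": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "MyBusinessFunction", "Payload.$": "$" },
      "Catch": [{ "ErrorEquals": ["States.ALL"], "Next": "OpenCircuit" }],
      "End": true
    },
    "OpenCircuit": {
      "Type": "Task",
      "Resource": "arn:aws:states:::dynamodb:updateItem",
      "Parameters": {
        "TableName": "CircuitStatus",
        "Key": { "ServiceName": { "S": "my-service" } },
        "UpdateExpression": "SET #s = :open",
        "ExpressionAttributeNames": { "#s": "Status" },
        "ExpressionAttributeValues": { ":open": { "S": "OPEN" } }
      },
      "Next": "NotifyAdministrator"
    },
    "NotifyAdministrator": {
      "Type": "Pass",
      "Comment": "In practice this might emit an event to an admin topic",
      "End": true
    }
  }
}
```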

OK, Step Functions at scale. There are different ways of scaling out with Step Functions. You have parallelism and concurrency, and the terms tend to be used interchangeably, but with Step Functions at least, parallelism can be thought of as giving the same input to different steps or to different workflows, whereas concurrency is the same steps with a different input each time. Here's what I mean.

Let's say I have a workflow, a given ASL definition, and I send lots of invocations to it. Each execution is unaware of the other executions, so if one of them fails, it shouldn't affect the others. This is great for things like remediation workflows for non-compliant resources.

And really, the only limit here is the account quota, which you can often get increased. Then you have something called the Parallel state. This executes a fixed number of branches in parallel, and what's important to know is that the input that goes into the Parallel state is handed to each branch within it.

So they all get the same input, they all run, and the Parallel state waits for all of them to complete before exiting. It aggregates the various results into a final array, which is what gets passed on to the next state.

Now, if any one of them fails and you don't catch the failure, the whole Parallel state fails, which is bad. So you want to catch the failures, and then they'll be part of the aggregate array that gets passed on to the next state.

The point of that is: catch the failures wherever you can. Then we have something called dynamic parallelism. Here you're able to run a subset of steps, a sort of sub-workflow, scaled out to achieve something faster. This is the Map state in Step Functions.

Now, the idea is that you run these iterations against an input array, and you can either do that sequentially, by setting the MaxConcurrency control to one so it loops through one at a time, or you can set it to zero and it will run up to 40 at the same time. And you can set anything in between one and 40.
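A sketch of an inline Map state with that control (the items path, function name, and state names are illustrative):

```json
"ProcessItems": {
  "Type": "Map",
  "ItemsPath": "$.items",
  "MaxConcurrency": 40,
  "Iterator": {
    "StartAt": "ProcessItem",
    "States": {
      "ProcessItem": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "ProcessItemFunction", "Payload.$": "$" },
        "End": true
      }
    }
  },
  "Next": "Aggregate"
}
```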

We use this pattern for a website called serverlessland.com. Has anyone used Serverless Land for serverless knowledge? Oh, great, quite a few. Serverless Land was built by myself and some other people on the serverless developer advocate team to aggregate blog posts and code samples from AWS, all about serverless.

When we first launched it, all it really had was an aggregation of all the serverless blog posts in one place, and in order to do that, we made this workflow. What this workflow does is scrape an RSS feed here, and then it generates an array of results, which it passes on to a Map state. So it passes on an array of URLs which we're going to go and scrape.

Then this Map state, up to 40 at a time, reaches out and grabs information from those URLs using a scrape-metadata Lambda function. It then passes on that metadata; for example, it has things like the title of the blog post, the thumbnail, and the tags the blog post is focused on. And it sends that on to a GitHub repository, which is where we host the static JSON files that comprise serverlessland.com.

And when the Map state is finished, it triggers a rebuild using AWS Amplify. Now, we can improve that with this pattern: the scatter-gather. We branch out to multiple resources at the same time, and then we branch back in and reduce. You could call it map-reduce as well.

So we still get the list of all the different blog posts, and we still scrape the metadata concurrently, but then we move the save-to-GitHub action out of our Map state, so we only have to invoke it once. That means fewer state transitions, which, as we know, is what costs money.

So we move that out, and our application is just as performant, but it's now more optimized for cost. And finally, we still run the SDK call to trigger the Amplify build.

Now, I mentioned the Map state has a maximum concurrency of 40, but there are interesting ways you can get around that. Let's say you wanted a max concurrency of 1,024. What you can do is set a Map state to a concurrency of 32, embed it inside another Map state also set to a concurrency of 32, and scale out that way to achieve a higher level of concurrency.

But customers have been asking for more ways of building applications with Step Functions that process much larger arrays concurrently. And this was the launch mentioned earlier in Dr. Werner Vogels' keynote: the new distributed map state. This is a big theme for Step Functions this year; this is a big launch.

The distributed map state gives you up to 10,000 parallel executions. So we've gone from 40 to 10,000 parallel executions in your map workflow. This is great for large-scale workloads, and it plays very nicely with S3: you can iterate over millions of S3 objects, things like JSON files, CSV files, movie files, images, and you can still do whatever you need to do inside that map state, invoke Lambda functions, et cetera.

Here's some information that will help you explore the distributed map state. The Map state now has two modes: inline mode and distributed mode. Inline mode is what you formerly knew the Map state as; distributed mode is this new version.

If you choose to run your Map state as distributed, you then have two options for how it executes: you can say, I want this to execute as a standard workflow or as an express workflow. And given everything we've been talking about, I would suggest you default to running it as an express workflow, if that makes sense for your workload.

You can also set up a batcher, so you can process items in batches. Then it has something called an item reader. The item reader allows you to point at an object in S3, or a bucket in S3, to bring in as the array for your Map state. This is really useful for circumnavigating the payload limit of Step Functions, which is 256 KB.

That's the payload limit. But if you go directly to the object or bucket in S3 and iterate over the items one at a time, you're not passing them through as a payload. There's also something called the result writer, which does the same thing in reverse: it lets you write results back to S3 directly from the Map state itself.
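Pulling those pieces together, a distributed Map state might look like this sketch (the bucket and function names are illustrative):

```json
"ProcessS3Objects": {
  "Type": "Map",
  "MaxConcurrency": 10000,
  "ItemReader": {
    "Resource": "arn:aws:states:::s3:listObjectsV2",
    "Parameters": { "Bucket": "my-source-bucket" }
  },
  "ItemBatcher": { "MaxItemsPerBatch": 100 },
  "ItemProcessor": {
    "ProcessorConfig": { "Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS" },
    "StartAt": "HandleObject",
    "States": {
      "HandleObject": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "HandleObjectFunction", "Payload.$": "$" },
        "End": true
      }
    }
  },
  "ResultWriter": {
    "Resource": "arn:aws:states:::s3:putObject",
    "Parameters": { "Bucket": "my-results-bucket", "Prefix": "results" }
  },
  "End": true
}
```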

This is how it looks in the console. The example on the left is how it looks in Workflow Studio when I'm building out my Map state; the other side is how it looks when it's running. You can see the progression as it iterates through each of the individual items, you can see things like the success rate and the failure rate, and you can drill down into your Map state to look at an individual workflow execution and see what happened.

Here's a nice demonstration of the distributed map state. One of the best things about my job is that I get to play with some features and services before they're released, so I had a good amount of time to play around with this. And I created this thing called a serverless GIF generator.

What this does is: you drop a movie file into a bucket in S3, like an .mp4 or .mov, and it triggers the workflow, and the workflow generates a GIF animation for every 30 seconds of the movie, which you can put into an application like a movie timeline scrubber, for example, which is this website on the right here.

So I drop my video file into S3, which triggers my workflow, and I have here the key, which is the movie file name, and the bucket, which is the location of the S3 bucket. Then I use FFmpeg in a Lambda function to figure out: if each GIF animation is 30 seconds, how many GIF animations will I need for this particular file, and where should each animation start and end?

I create an array of metadata for each GIF animation, which I save into S3. Then my map state starts: it pulls out that array and runs an individual Lambda function for each GIF animation, and that Lambda function again uses FFmpeg to generate the GIF and exports it back to S3.

And here's the result. What's incredible about this is that the example here is just a little 20-second clip, but I could drop a five-hour movie into my S3 bucket and the workload would take about the same amount of time to execute, because I have a Lambda function generating each individual GIF animation, and because of how it's able to scale out to up to 10,000 concurrent executions.

This is a really good resource if you want to dive deeper and learn more about serverless applications using workflows: s12d.com/workflows. It's a collection of Step Functions workflows contributed by experts at AWS and by advanced customers as well, and they're sorted by complexity, by use case, and by the services they integrate with.

So this is the circuit breaker workflow, I think. You're able to look at the workflow image, and you can look at the infrastructure-as-code template; we have SAM, Terraform, and CDK. So if you're having problems building your workflow with infrastructure as code, there's probably a similar solution in here you can look up, and they all come with an ASL definition as well.

So for any tricky issues you're having with ASL, you'll probably find a similar example in here too, and you can deploy these directly into your AWS account using the Launch Stack button in the top right. It's a great jumping-off point, and it covers all the patterns I've spoken about here today.

Now, you do still have a couple of hours to try out Serverlesspresso. It's in the expo hall in the Venetian, and the whole developer advocate team has been manning it in between sessions, so you can not just have a coffee but also talk through some of the patterns I've been showing here, some of the patterns that Serverlesspresso uses, and dive deep and discuss Step Functions, EventBridge, and serverless applications in general.

So if you do have time and you're going back that way, I do encourage you to come and check out Serverlesspresso. Here's that final slide again, in case you didn't manage to catch it at the beginning, where you can get downloadable templates, blog posts, workshops, and code samples covering everything I've spoken about here today.

And that brings me to the very end. Thank you very much for listening, and I hope you enjoy the rest of your time at re:Invent. Thank you.
