Scale interactive data analysis with Step Functions Distributed Map

All right. Hey, everybody. Thanks for making your way down to the Mandalay Bay on a Thursday afternoon. My name is Adam Wagner. I'm a Principal Serverless Solutions Architect at AWS. Really excited to talk with you today. I'm joined by Roberto Iturralde, Senior Director of Software Engineering at Vertex Pharmaceuticals. And we're going to talk about scaling interactive data analysis with Step Functions Distributed Map.

Just to give you an idea of how the talk will flow: I'm going to do a little bit of an introduction to the problem space that we're talking about, a Step Functions overview, and a bit of a deep dive into Distributed Map in the abstract. Then Roberto is going to come up and talk about how they're using this to accelerate drug discovery at Vertex Pharmaceuticals, and we'll wrap up with next steps.

So to introduce this problem space, I want to talk about a simple use case. Imagine we have 500,000 files sitting out in S3. These are invoice files. We need to process each file, extract sales data, run some calculations, and load that data into a database. Then at the end, we want to kick off some reporting jobs.

So what are some different ways we could solve this? Well, at my core, I'm a software developer, so I reach for the really simple solution: I could just loop through all 500,000 objects. But 500,000 objects is a lot of objects; that's going to take a long time. If it were 5,000 invoices, sure. If it were 50,000, maybe. But at this scale, we really need to look at the problem a little differently. This type of scale is generally what leads folks to distributed processing. We're going to break up those 500,000 invoice files and process them across multiple compute nodes. We're going to have a coordinator, which serves a number of functions: it tells each compute node which invoices it's going to process, and it also knows if one of those nodes fails or goes down. And then we have storage on the back end, so that as we compute the intermediate results for each invoice, we can store them there.

There are a lot of benefits to distributed processing like this. It allows you to complete that processing faster, giving you quicker time to market. If you do it well, it's scalable: if you go from 50,000 invoices you need to process to 500,000, you should be able to scale fairly easily. You also get improved fault tolerance, because if one of the compute nodes processing some of that data happens to fail, the coordinator can rerun that work on another node. You're not failing the whole run of 500,000 invoices; you're just failing a smaller subset, and you can rerun that smaller subset.

And if you do it right, it's fairly cost effective as well. Generally, though, these distributed data processing frameworks are not without new types of challenges. Traditionally, you need to think about how wide you want to go and how much concurrency you want; often that means provisioning clusters. AWS has made this easier; there are a lot of serverless options where you don't have to think about that these days. There's also tooling and skill sets to think about. I mentioned I'm kind of a developer at my core. I don't know Spark, I don't know Hadoop. I'm not a big data person, but I can write some code. And for me to move to one of those big data frameworks, there's a pretty steep learning curve across a number of those technologies.

And then we need to balance cost, security, and the speed at which we're going to do this data analysis. More and more, we're seeing interesting use cases where these aren't batch jobs running overnight on a schedule; these data processing jobs, which are still processing a lot of data, have a user waiting for them on the other end. And if you think about interactive data processing workloads like this, you have additional challenges. Instead of just thinking, oh, I need to be able to process 500,000 files, maybe you need to process 500,000 files every time a user clicks a button. Sometimes you might have no users, sometimes 50 users, sometimes 100 users. So how do you think about that scaling? With traditional data processing, very often you would queue up those jobs and the end users would have to wait. People don't like waiting, so it's good if we can figure out a way to do this such that users don't have to wait.

So what if there were a serverless solution to this problem? We already have all these invoices sitting out there in S3. We want some way to iterate over that data in S3 and then summarize the results back out into S3. And we want to do this in a simple, effective way.

And just a quick aside: when we say serverless, why serverless? Serverless helps accelerate innovation, and it does this because it removes the heavy lifting. You don't have to do a lot of the lower-level plumbing that you would have to do with other solutions. A lot of the security is baked in, so you're thinking about the higher-level security of your code, but you're not thinking about things like operating system patches and lower-level concerns. And then it performs well at scale and it's priced well at scale, because you're only paying when it's being used.

So in this case, that serverless solution we're talking about is Step Functions. Step Functions is a serverless workflow service. We see it used across a wide variety of use cases and industries. The one we'll spend the most time on today is data processing, and that spans a number of different types: media and file processing pipelines, batch processing, unstructured data, structured data, ETL pipelines, and more. We also see it used for application orchestration, when you need to orchestrate between multiple microservices. It also very much excels when you need to orchestrate with a real human being. Maybe you need to call two or three different microservices and then have a human make a final decision. With Step Functions, you can wait for that human to make the final decision, and whether it waits an hour or a week, it doesn't cost any more to wait. You can wait for free with Step Functions, which is a great feature.
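For readers who want to see what "waiting on a human" looks like in practice, a common way to implement it is the task token callback pattern: the state machine sends a token out (here via SQS) and pauses until something calls SendTaskSuccess with that token. This is a minimal, hypothetical sketch in ASL authored as YAML, the format used later in this talk; the queue URL, field names, and state names are invented:

```yaml
# Hypothetical human-approval step using the task token callback pattern.
WaitForHumanDecision:
  Type: Task
  Resource: arn:aws:states:::sqs:sendMessage.waitForTaskToken
  Parameters:
    QueueUrl: https://sqs.us-east-1.amazonaws.com/123456789012/approvals  # invented queue
    MessageBody:
      CaseId.$: $.caseId           # invented field from the workflow input
      TaskToken.$: $$.Task.Token   # reviewer's app returns this via SendTaskSuccess
  Next: ApplyDecision              # invented downstream state
```

Because standard workflows bill per state transition rather than per unit of time, the pause itself costs nothing, however long the reviewer takes.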

We see it a lot in security and IT automation as well. As I'll talk about in a minute, you can orchestrate AWS services with it very well, so we see a lot done in the security and IT automation space, whether it's internal security automation that folks are doing on AWS or use cases like fraud detection. We see it work really well with machine learning as well. Earlier this week, we announced direct integration between Step Functions and Amazon Bedrock, so you can call foundation models from a Step Functions workflow with no code whatsoever. A really, really cool feature there.

One of the great patterns we see with machine learning: very often, people automate a process where a machine learning model makes a judgment call based on some data you gave it and returns a confidence score. If the confidence is high enough, you continue along the automated path; if not, you branch out to have a human take a second look at it. So, lots of great use cases with Step Functions. Step Functions accomplishes this by integrating with almost every AWS service; basically any AWS API call can be made directly from Step Functions without having to write any code.
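Coming back to that confidence-check pattern for a moment: in ASL, it's typically just a Choice state after the inference step. A hedged sketch, with the field name, threshold, and state names invented for illustration:

```yaml
# Hypothetical branch on a model's confidence score.
CheckConfidence:
  Type: Choice
  Choices:
    - Variable: $.inference.confidence   # invented output field
      NumericGreaterThanEquals: 0.9      # invented threshold
      Next: ContinueAutomatedPath        # confident enough: stay automated
  Default: SendToHumanReview             # otherwise, route to a person
```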

Earlier this week, we also announced support for calling HTTP API endpoints. So if you're integrating with a partner or vendor, or calling your own public HTTP services, you can now do that directly from Step Functions as well without writing any code. Now I'll talk a little bit about the anatomy of Step Functions and what that looks like.

So with Step Functions, you build workflows that are called state machines. Each step within that workflow is a state. When you execute those state machines, moving from one state to another is what we refer to as a state transition. The visual builder you see on the right-hand side there is what we call Workflow Studio. The great thing about Workflow Studio is you can build your workflows graphically, dragging and dropping in components, but then switch to the text representation to put into your deployment pipelines, your CI/CD pipelines. So we make it very easy to move back and forth between those.

Another announcement we made during re:Invent was integration between Workflow Studio and Application Composer. So you can build the components outside of Step Functions with Application Composer, build the details of your Step Functions workflow within Workflow Studio, and have the two of those work seamlessly together.

All right. So now that we've done an overview of Step Functions, let's talk about Distributed Map. Distributed Map is that serverless solution I alluded to earlier. It allows us to iterate over objects in S3 and run a child Step Functions workflow for each one of those objects, or for a batch of those objects. We can then summarize those results into S3.

So if you think about this, and again I come back to me being a developer, if you think about this as a developer, all you need to do is think about the child workflow that processes a single one of those objects. Then you let Step Functions Distributed Map take care of the scaling, all of that distributed processing, that coordination. This can scale out to 10,000 concurrent child workflows, so it can go really, really wide.

It iterates over objects in an S3 bucket, or over the contents of a single object there. It gives you a nice operator dashboard to watch Distributed Map runs while they're happening, so you get an idea of what they look like as they're running, see how long it's going to take, all of that sort of thing.

So let's dive into a little more detail on what Distributed Map looks like. Here's an example Step Functions state machine that has a couple of states that happen before the Distributed Map and one that happens after.

So first, we're going to call Athena to stage some data. Then we have a Lambda function that's going to gather some third-party data for us. And then we start the Distributed Map; that's the pink box there. Inside that Distributed Map, in this case, we have a fairly simple use case: the child workflow just has a single Lambda function.

And then after all of those iterations of that child workflow in the Distributed Map finish, we have a final aggregation step: another Lambda function that's going to aggregate our results.
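Putting those pieces together, a state machine shaped like that example might look roughly like this in ASL authored as YAML. This is a sketch, not the speaker's actual definition; every function, bucket, and query name is invented:

```yaml
StartAt: StageDataWithAthena
States:
  StageDataWithAthena:
    Type: Task
    Resource: arn:aws:states:::athena:startQueryExecution.sync
    Parameters:
      QueryString: SELECT * FROM invoices   # invented staging query
      WorkGroup: primary
    Next: GatherThirdPartyData
  GatherThirdPartyData:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: gather-third-party-data # invented function name
      Payload.$: $
    Next: ProcessInvoices
  ProcessInvoices:
    Type: Map                               # the Distributed Map state
    ItemReader:
      Resource: arn:aws:states:::s3:listObjectsV2
      Parameters:
        Bucket: example-invoice-bucket      # invented bucket
        Prefix: invoices/
    ItemProcessor:
      ProcessorConfig:
        Mode: DISTRIBUTED
      StartAt: ProcessInvoice
      States:
        ProcessInvoice:                     # the single-Lambda child workflow
          Type: Task
          Resource: arn:aws:states:::lambda:invoke
          Parameters:
            FunctionName: process-invoice   # invented function name
            Payload.$: $
          End: true
    Next: AggregateResults
  AggregateResults:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: aggregate-results       # invented function name
      Payload.$: $
    End: true
```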

The input to the Distributed Map can be a number of different things. I talked about iterating over objects in S3, and that's one of the most popular things to do with Distributed Map. Basically, you can point at an S3 bucket, or at a prefix within an S3 bucket, and Distributed Map will list those objects, iterate over the keys, and feed them into your child workflows. But we can also iterate over one large file. Maybe you have a four-gigabyte CSV file; we can iterate over the rows in that CSV file. We can do that with JSON as well: if you have a large JSON file that's a JSON list, we can iterate over the items in the list.
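For reference, those three input sources correspond to three variants of the Map state's ItemReader. These are hedged sketches with invented bucket and key names; a real Map state uses one of them:

```yaml
# 1) Iterate over the objects under an S3 prefix (child workflows receive keys):
ItemReader:
  Resource: arn:aws:states:::s3:listObjectsV2
  Parameters:
    Bucket: example-bucket        # invented name
    Prefix: invoices/

# 2) Iterate over the rows of one large CSV file:
ItemReader:
  Resource: arn:aws:states:::s3:getObject
  ReaderConfig:
    InputType: CSV
    CSVHeaderLocation: FIRST_ROW  # treat the first row as column names
  Parameters:
    Bucket: example-bucket
    Key: big-file.csv

# 3) Iterate over the items of a JSON array stored in one large file:
ItemReader:
  Resource: arn:aws:states:::s3:getObject
  ReaderConfig:
    InputType: JSON
  Parameters:
    Bucket: example-bucket
    Key: items.json
```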

You can specify the maximum concurrency, how much you want to fan out to those child workflows, how many of them you want to run concurrently. Here I'm showing a concurrency limit of 1,000; that can go up to 10,000, and it can go down to whatever you want. This is great because Step Functions can scale quite quickly, but not every downstream dependency can necessarily match that scale. Maybe you have a Lambda function in there that's talking to a relational database, and you don't want to create too many inserts or queries against that database at one time.

So you can dial in that concurrency to something that's safe and reasonable.

You can also choose to run that child workflow as either a Step Functions standard workflow or a Step Functions express workflow. Most of the features I've been talking about so far are available in both. Standard is what we use for longer-running workflows; you pay per state transition with standard workflows, so that's where you get that waiting for free. Express workflows are used for high-volume, transactional workloads. They can only run for a maximum of five minutes, but they can be a good choice for the child workflow within the Distributed Map if it's going to run for less than five minutes.

Another great feature you can use to adjust the performance of your workflow and tune that child workflow is the batch size. If you think about iterating over those S3 objects, maybe we do just want to send a single object into that child workflow, but maybe it's more efficient to send a small batch, maybe 50 or 100. You can configure that batch size either by the number of items or by the amount of data, the kilobytes of JSON, that you send into that child workflow. It's important to note what gets passed to your workflow when you're using that S3 listing: it's the object key name, not the data inside the object. So your workflow will gather that data if it needs to. It's a really, really powerful construct that you can use for a number of cool data processing workloads. I'm really excited to have Roberto tell you about how they've used this at Vertex.
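For reference before Roberto takes over: those tuning knobs, the concurrency, the child workflow type, and the batching, all live on the Map state itself. A hedged sketch with illustrative values and an invented function name:

```yaml
ProcessInvoices:
  Type: Map
  MaxConcurrency: 1000            # fan-out cap; protects downstream dependencies
  ItemBatcher:
    MaxItemsPerBatch: 100         # each child workflow receives up to 100 items
    MaxInputBytesPerBatch: 262144 # or cap the batch by payload size (256 KB here)
  ItemProcessor:
    ProcessorConfig:
      Mode: DISTRIBUTED
      ExecutionType: EXPRESS      # child runs are short (under 5 minutes), high volume
    StartAt: ProcessBatch
    States:
      ProcessBatch:
        Type: Task
        Resource: arn:aws:states:::lambda:invoke
        Parameters:
          FunctionName: process-batch   # invented; receives keys and fetches the objects itself
          Payload.$: $
        End: true
  End: true
```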

Thanks, Adam. So for those of you who are not familiar with Vertex, Vertex is a global biotechnology company headquartered in Boston, Massachusetts, where I'm from, with offices all over the world. Vertex invests in scientific innovation to create transformative medicines for people with serious diseases. We have approved medicines that treat the underlying causes of multiple diseases, including cystic fibrosis, sickle cell disease, and transfusion-dependent beta thalassemia. For those of you who are not familiar with pharmaceuticals or drug discovery, I'm going to give you a one-slide crash course in some of the challenges we face in this industry.

So when a company is choosing a disease area where they want to pursue a potential medicine, one of the first things they'll do is decide on the modality, the type of approach they want to take to create that medicine. Examples of modalities: it may be a small molecule ingested as a pill, it could be a CRISPR-Cas9 gene-editing-based therapy, it could be an mRNA-based therapy (mRNA became more popular with the COVID vaccine), and there are others. In some cases, you'll pursue multiple because you think there are multiple that could lead to a potential medicine.

Now, if we look just at the molecular space, there are tens of billions of potential molecules you could evaluate, where one could lead to a potential medicine. That is an enormous search space. And based on some recent research on this industry, it was found that companies that take a drug all the way to market will spend on average 13 years on that process and over a billion dollars on the entirety of it. That is a lot of money and a lot of time, and as I mentioned earlier, Vertex invests in serious diseases; there are patients waiting. So time is really critical to us. We have to look for opportunities to shorten that cycle, to bring safe medicines to patients as fast as we can. And we in technology really want to use technical innovation that matches the scientific innovation our scientists are using, so that together we can accelerate this process.

Now, at certain phases of research, we're running experiments. Our scientists will design experiments; in some cases, those experiments involve testing different experimental conditions or combinations of conditions. You can imagine things like the size of the dose, the type of tissue on which you want to test, the duration of time that the potential therapy is exposed to that tissue, all these different conditions. In some phases of that research, they will use plates like the one shown here. The plate has many wells, and in each well you'll be testing a different set of conditions, and they will run that experiment. Then one of the ways to check whether your hypothesis for that experiment was correct may be to run that plate of wells under a microscope and collect images.

Now, sometimes what you're trying to use to test your hypothesis is visible on the image itself, but sometimes you introduced a dye or a fluorescent marker or something that's only visible under certain color channels. And so what starts as one image for that one well becomes three, four, five, six different images, because you're looking at different features of the image to measure the impact of whatever you were testing in that well. Now, I'm showing one plate here with about 96 wells. In practice, there may be many plates with many more wells, and so you can see how what may start as a simply designed experiment can yield many hundreds of images for one experiment. And, by the way, we have multiple groups running multiple experiments all the time. This can turn into many thousands of images being collected as part of the process.

Now, we need to find a way to analyze these images efficiently. Usually the scientists have an idea, for a given experiment, of what areas of the image are of interest to them to measure the impact of a potential therapy. It's not just looking at the image; it's, okay, maybe we're counting the number of cells in the image, maybe we're measuring how big the cells are, maybe we're measuring the width of the cell lines, and we're trying to see which of these experimental conditions had the greatest impact.

Now, you could do this manually, of course. You could have highly skilled scientists looking at every image; they know what they're looking for in this batch of images, identifying and measuring things manually. That would not be efficient. This is actually a very good use case for a common machine learning technique called image segmentation. Image segmentation is kind of what it sounds like: you're training a machine learning model to identify areas of an image for you. Usually it starts with some training; you train the machine learning model to identify the areas you care about in the image, and then you send it off and it applies what it's learned to the remaining images.

So let's walk through an example of what this might look like. Here I'm showing again an image of cells. As I mentioned, one image can actually yield more than one image; in this case, let's imagine we're splitting that image into the red, blue, and green color channels, and that the green color channel is the one that shows the thing we're interested in. And we've trained a model to identify, perhaps count or measure, areas of the image, maybe that area of the cell. So again, just an example of how one image can become multiple, and then you train a model to identify the things you care about in this set of images; it could be completely different for another set of images.

So Vertex, in fact, built a system to do exactly this, and the workflow from a scientific perspective goes like this. The scientist will have already designed the experiment. She has a hypothesis. She knows the experiment ID, maybe it's ID 123. She comes into a web application we designed to pull up a subset of the images; imagine the experiment created 1,000 images. She then manually teaches a base model that we have about the things she cares about for this experiment. So she labels areas of the image, and as she goes through the first one, two, three, four, five images, the machine learning model is learning: ah, this is what you care about for this experiment. Once she feels the model is sufficiently trained, she sends it off on the remaining images. So maybe she trains on the first 10 or 20, then sends it off and it runs as a batch on the remaining 980 or 990 images in this 1,000-image experiment. And when it's done, the system actually has some additional information about what it is you cared about across this image set, and it'll create aggregated results for you at the end. Maybe you wanted to know, on average, how many cells it found in an image, on average how big they were, something like that. Those are the results she's really interested in, to see, hey, were my hypotheses accurate in this experiment?

So this is what the system looked like initially. That web application I mentioned ran on a web server in the data center. After she had trained the model via that UI, she would submit the information to a batch system, onto a queue, which would go and fetch all of the images from an image store. It would process the images by applying the trained machine learning model to every image, and then it would calculate the aggregated results. It wrote intermediary results and the final results to a relational database, and it used a cache for some performance improvement across that process.

We had challenges with this system as our scientific process grew, as image quality improved, and as this type of science accelerated in general. The first, as you might have guessed from the image, is that this is a static fleet of servers, and this is an inherently spiky workload. First of all, it aligns with the scientific process: there are times during research when they're collecting lots of images, and there are times when they're analyzing those images and not collecting as many. So, times of busyness and times of quiet. And we don't know when it's going to happen. This begins with a UI, so there's a human level of spikiness to the traffic: when the scientists are logged into the UI kicking off jobs, we see flurries of traffic; when they're not, we don't. And this was a static server fleet. As with all static server fleets, and you hear AWS talk about this a lot with elasticity, you can provision to peak, but then it's expensive when idle; you can provision to average, but then you're constrained by resources. It's hard to find that sweet spot to get maximum performance and optimize for cost.

Well, one way you can try to optimize a static fleet is to put multiple workloads on it, and we did. The workload I'm describing is just one of several that used this architecture. And so, of course, you have noisy neighbor problems. Sometimes the batch segmentation runs fast because the fleet is quiet; other times the same job runs slow because other teams are using that infrastructure. That's a challenge. And then this is infrastructure we have to maintain: OSes we have to patch, programming languages and other frameworks we have installed that we have to keep up with. That's a burden on the engineering team and our infrastructure teams. The system had also been around for a while, and we had a wish list of things we always wanted to get around to that we thought would make it perform better.

We thought we could perhaps introduce some batching of the images to improve throughput. We were trying to figure out ways to improve the concurrency of the number of jobs that ran. We also had, I'd say, okay error handling and robustness on this legacy system, but it had some failure scenarios where it would just stop a job, when we wished it had more robust retries.

So we redesigned it. Let me start with the things on the edges. We at Vertex have a serverless-first approach to AWS. We actually have a pretty small engineering team, and we want those engineers to be spending as much time on Vertex-differentiating work as they can. We think that serverless helps us achieve that: we get high availability, elasticity, and cost optimization really out of the box, we can accelerate our development cycles, and we lean into AWS to make things easier for us. We're also pragmatic; serverless is a spectrum, and some technologies from AWS are more serverless than others. So we tend to start with our most serverless ideas and fall back onto whatever we think is the right fit for each workload.

In this example, we took that web application I mentioned, the one that exposed the UI to the scientists, containerized it, and put it onto Fargate. We took all of the images and the system that generated them and adjusted it to write the images to Amazon S3. This was work we actually had planned independent of this specific workload, due to the durability and cost economics of S3, but it really benefited this workload. And then we took the cache and moved it onto Amazon ElastiCache, which is a little more managed.

Now, that leaves the hard part: that big empty spot up in the diagram. We had a couple of ideas for serverless approaches to do what almost feels like a MapReduce problem, right? We have a set of images, we want to analyze them all in parallel, but then do some aggregation at the end. We were bouncing around some ideas, and I was actually sitting in the audience at re:Invent last year when Distributed Map was launched by Step Functions.

"And I actually remember sitting in a talk that Adam gave and hearing about Distributed Map and messaging my team and saying pause what we're doing. Let's look in Distributed Map. I think it's going to actually do just what we needed to do and it did.

So what we did with our system is we took the workflow we knew the system was doing, taking an experiment ID, grabbing all the images for that experiment, applying a model to all the images, and then effectively creating summarized results at the end, and represented that as a Step Functions workflow using Distributed Map. And really, it looks almost as simple as what you see here in this diagram.

We have one Lambda function at the top of the workflow that takes an experiment ID and looks up all the images for that experiment. It then passes off to Distributed Map, and Distributed Map farms out all the images across Lambda functions, applying the machine learning model to every image. And then there's that aggregation step at the end that takes all of the results, creates the summarized report the scientists want, and writes that back to the same relational database that was used by prior iterations of the system, which helps all the downstream systems.

The downstream systems are kind of unaware that this system has been updated; they just keep interacting with the same relational database they always have. And this is not theoretical; this has actually been in production for over six months.
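Based on that description, the shape of the workflow maps naturally onto a Distributed Map definition. Purely as an illustration, not Vertex's actual code, with every name invented, the skeleton might look like this; note the ItemsPath variant here, which iterates over an array produced by the previous state rather than an S3 listing:

```yaml
StartAt: ListExperimentImages
States:
  ListExperimentImages:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: list-experiment-images   # invented: looks up image keys by experiment ID
      Payload:
        ExperimentId.$: $.experimentId
    Next: SegmentImages
  SegmentImages:
    Type: Map
    ItemsPath: $.Payload.imageKeys           # invented: array returned by the Lambda above
    MaxConcurrency: 1000                     # illustrative value
    ItemBatcher:
      MaxItemsPerBatch: 10                   # illustrative batch size
    ItemProcessor:
      ProcessorConfig:
        Mode: DISTRIBUTED
        ExecutionType: EXPRESS
      StartAt: ApplyModel
      States:
        ApplyModel:
          Type: Task
          Resource: arn:aws:states:::lambda:invoke
          Parameters:
            FunctionName: apply-segmentation-model  # invented: runs the trained model per image
            Payload.$: $
          End: true
    Next: AggregateResults
  AggregateResults:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: aggregate-results        # invented: writes the summary to the database
      Payload.$: $
    End: true
```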

So we have some data on the impact we're seeing, and it's actually been a little better than we had estimated ahead of the work. We can scale much higher now than we could before; we're not constrained by that static server fleet. We've adjusted all of our service limits such that, for really any reasonable number of jobs the scientists could possibly submit to us, Lambda will scale out and out and out.

The jobs can all run in parallel; they're not competing for resources, so they run a lot faster than they did, and there's no noisy neighbor problem anymore. For the jobs run since we went live, we're seeing on average about 11 times faster overall, from job start to job finish, on this architecture compared to the previous one. It's also 90% more cost-effective.

So if you look at our monthly costs in a given month: as I mentioned, this whole thing comes alive in response to traffic, and when the job is done, it all goes away. We don't pay for it when the Step Functions workflow is not running and the Lambdas are not executing. That leads to incredible cost savings across a given month, a given year of the system running.

And on top of that, because, as I mentioned, we tried to use serverless services everywhere we could, these services are highly available by default. Our estimated uptime, multiplying through the availability of all of our dependencies, is 99.9%, which is better than we had with our previous system.

Again, it's fully managed: no operating systems to patch, a much lower operational burden with this architecture than we had before. And, something we didn't actually expect, we have better visibility into the system than we did. We had okay metrics on our old system that we had always been meaning to improve.

But since AWS services by default emit a lot of metrics about themselves to CloudWatch, we found that we actually got a lot of additional metrics and additional data emitted from our system that we had always wished we had added. And we got it really out of the box through these services.

Now, for folks who aren't familiar, this was touched on a little bit in Adam's portion of the talk. When you're authoring a Step Functions workflow, there are a few ways to do it. Adam talked about using some UI-based tools in the console. That's a wonderful way to start; that's often how we start building our workflows.

But we also want our infrastructure represented as code; we want everything in the source code repository, deployed via pipelines. The way you represent a Step Functions workflow in code is using what's called the Amazon States Language, or ASL. There's a snippet of ASL on the right. You can author in YAML or JSON; we prefer YAML. What you're looking at in that snippet is actually the ASL for a Distributed Map, authored in YAML, for Step Functions.

Now, you might remember a few slides ago I mentioned we had this wish list of things we wanted to improve in our system. And initially my developers were a little grumpy about having to learn a new domain-specific language. They have the languages they know and love; we know CloudFormation, we know other infrastructure-as-code languages. Why am I having to learn another YAML-based language for authoring Step Functions workflows?

But I encouraged them to push through that, and I want to point out a couple of reasons why: we got a lot out of the box with Step Functions that we would have had to build otherwise. Look at the sections of the YAML here that are highlighted in purple boxes.

The first is the max concurrency configuration value for a Distributed Map. That's how many Lambdas, well, technically how many child workflows, but in our case how many Lambdas, will process in parallel. As we were adjusting and trying to tweak our throughput, if we wanted to test different concurrency values, we just changed that configuration, checked in a three-character code change, sometimes a one-character code change, ran it, and then tested performance.

Same thing with the second box in the middle there that I highlighted in purple, around the batch size. We could just tweak the batch size and Step Functions would do it for us. This really helped us very quickly find the sweet spot of concurrency and batch size to optimize the throughput of our system. It's part of the reason we got the performance gains you saw earlier.

And then lastly, I mentioned earlier that we wanted to improve our error handling. There's some pretty robust error handling out of the box with Step Functions. What I have highlighted here is how it handles when a given Lambda, for example in my diagram, fails processing an image: you can tell it what exceptions to catch, how many times to retry, with what backoff, et cetera.
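That retry and catch configuration sits on the task state inside the child workflow. A hedged sketch of what the highlighted section might look like; the error names are standard Step Functions and Lambda error codes, everything else is invented:

```yaml
ApplyModel:
  Type: Task
  Resource: arn:aws:states:::lambda:invoke
  Parameters:
    FunctionName: apply-segmentation-model  # invented name
    Payload.$: $
  Retry:
    - ErrorEquals:
        - Lambda.TooManyRequestsException   # throttled: back off and retry
        - Lambda.ServiceException
      IntervalSeconds: 2
      MaxAttempts: 3
      BackoffRate: 2.0                      # waits 2s, 4s, 8s between attempts
  Catch:
    - ErrorEquals:
        - States.ALL                        # anything still failing after retries
      Next: RecordImageFailure              # invented failure-handling state
  End: true
```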

And that's been a really big benefit to us as we try to improve the robustness of our system. Of course, we do additional error handling in our code, but it's a really nice backstop. Now, it's not all rainbows and butterflies, I'm here to tell you, especially when you're adopting a new service from AWS. Sometimes you do hit rough edges, and we did. As we really started flexing the service and playing with it, we found some edge cases where the features either didn't work quite as we understood them from the docs or didn't work quite as we expected.

Of course, we worked closely with Adam and the engineering and product teams for Step Functions to either clarify our understanding or, in some cases, influence changes they were going to make to the product. We like to use AWS SAM, the Serverless Application Model. It's both a YAML-based language and a command-line tool for building serverless applications.

Given that Distributed Map was brand new when we started using it, we found that AWS SAM didn't quite know how to validate our templates when they included Distributed Map configuration. What we knew was valid, SAM thought wasn't, because SAM wasn't up to date yet.

We also used the AWS VS Code plug-in, which initially wasn't aware of how to parse Distributed Map because it was so new. So there was a bit of a lag between when the Step Functions feature launched and when all the tooling caught up, which was friction in the beginning. Eventually it caught up; we got almost all of these issues addressed, and we have others that we know are coming down the pike. So it really was still worth it, but I'd be remiss if I sat up here and said it was all perfect. I'll hand back to Adam, who's going to finish up.

Awesome. Thank you so much, Roberto.

Oh, 11 times faster and 90% cheaper. I love that. It's so good. So I think if you take away just one thing from this presentation, it's that if you have the code to process a single object and you have lots and lots of objects to process, think about Distributed Map, because it makes this super, super easy.

We're seeing so many interesting use cases, and I'd love to hear after this about any use cases you have as well. Obviously in the life sciences space, drug discovery, we see things like what Roberto and his team are doing. We also see this in financial modeling: picture taking some sort of financial asset and modeling what will happen to it if various geopolitical or market changes happen. I have customers now doing simulations that are much, much broader than the simulations they could do in the past, in a real-time way, where there's a financial analyst sitting there waiting for those results to come back.

We also have customers doing large unstructured file processing. Picture a virus scanning or security scanning type of use case where, as new files are created, you're scanning those files in real time, but maybe you update your algorithms and you want to go back and do a historical run to see, hey, what does it look like if I look at the past 30 days of files? That's very easy to do with Distributed Map: you can often use the same Lambda code you're using for the real-time, one-at-a-time processing to now look back at an overall batch and cover that as well.

So really cool use cases there, and then some really interesting ones around data transformation and migration. I talked with one customer that had all their users in a US-based system, but they were expanding more and more into the European market, and they wanted to move their European users from that US-based system into a parallel system running in the EU. That system involved something like five or six DynamoDB tables and an identity store, and all of these things had to be created on the other side before they could migrate the users.

And so what they did is they wrote a Lambda function that would migrate a single user, and then over time, in a slow and controlled manner, they fed in CSV files of the users they wanted to migrate from the US-based system into the European one. It gave them a very easy way to orchestrate this and scale it out in a controlled manner without having to write complicated code to do so.

So once again, just think about the benefits: you're only paying when you're processing, and you get really fast startup times. With more traditional data processing frameworks and services, there's often time needed to spin up clusters and nodes, and you're either paying to keep those running so you don't have to deal with the startup times, or you're waiting for them to start up. With Distributed Map, you're starting in milliseconds, very, very fast.

And then it's easy for all developers to use. If you're using Step Functions and then Lambda to process that data, you can author Lambda in basically any programming language you want, so it's very accessible for all developers. You don't have to be a data engineer or a data scientist, although if you are, you're welcome to use it too.

If you're interested in learning more, we have lots of resources, including some free training guides up on our training website, so you can earn badges that you can then throw up on LinkedIn to show your knowledge.

We have one around serverless technology in general, and another around events and workflows. Some colleagues of mine put a lot of effort into these; I think they're really high quality, and they're free, which is always nice.

Then, if you're writing Lambda functions, I highly, highly recommend you take a look at Powertools for AWS Lambda. This is a set of libraries and utilities available across Python, TypeScript, Java, and .NET, and we're hoping to add more runtimes soon. It has a core set of functionality around observability, with tracing, logging, and metrics, but there are a number of other features as well, whether you want it to help you build your own middleware, implement feature flags, or handle idempotency; there are all sorts of cool features in there. So I highly recommend you take a look at those.

Visit us at the expo. It's getting towards the end of re:Invent; I think the expo closes at four today, so you don't have a ton of time to visit us there, but there's lots of really cool stuff to see, including Serverless Espresso and Serverless Video. Both of those are powered heavily by Step Functions, which makes it really easy to write these apps but also to operate them.

And with that, we do have a little bit of extra time, so if folks have some questions, I think Roberto will come back up and we can take some questions.

If you are going to take off, please do us a favor and complete the session survey; I promise we read all the feedback, probably more than once. And feel free to get in contact with us if you want to tell us more things.
