AWS storage for serverless application development

Hello out there. How's everyone doing? First of all, is this 4pm on the first day? Is everyone feeling like they're learning about the AWS stuff yet? Alright. Well, just to let everyone know, we are in STG311-R because this is actually the repeat that comes before the actual real one. And we're gonna talk about AWS storage for serverless application development. Really, what we want to focus on is that it is really ok to provision infrastructure, but we also want to make sure that folks who want to take advantage of serverless technologies actually have proper access to data.

So that's what we're going to focus on today: serverless compute and then storage. I'm Brian Lyles. I am a Senior Principal Engineer in Amazon Simple Storage Service, S3. And then we have Sebastian Bea, and he is a Manager of Specialist Solution Architecture at AWS. And we actually brought a customer with us today: we have Jefferson Frazer, and Jefferson is a Senior Director for Edge, Storage, CI/CD, DevOps and Database at Shutterstock.

So let's get started. Our agenda today is trying to take us on a journey. We're gonna start off with serverless compute, then we're gonna go into serverless storage, and then we're gonna talk about, well, what should I use when, and maybe even why? Then we'll go into some workflows, and from there we'll go right into our customer use case.

So let's get started: what are serverless architectures? Really, these are the architectures that we create whenever, instead of provisioning infrastructure to run our services and applications, we are using our cloud provider, in this case AWS, to provide those services for us. But really, why serverless? We've distilled it down into four reasons.

You have an idea and you wanna move, move, move on this idea. You want to prove out this business case. So instead of thinking about, well, I need to provision this, I need to scale this this way, I need to scale that that way, what we wanna do is allow you to go from idea right to the market. We provide services for that.

And then also, you start off and you have this much scale, and then as you move on, you get more customers, more excitement, and you go to this scale. But then maybe you're a periodic business and you realize, well, I don't actually need all this scale all the time. Maybe I need it at the end of the month, maybe I need it at the beginning of the day. Well, we can take care of that for you. We can automatically scale up and down to where you need it.

And a very important thing for, you know, people who run businesses: I only want to pay for what I use. I don't want to pay for what I provisioned. I only want to pay for what I am actually consuming.

And then finally, for all the builders in the room and watching this later on: I want to walk, I want to run, I want to shop. I don't want to think about what my body needs to do to do these things. I just want to focus on building my application.

So our catalog of services which support serverless compute is actually fairly large, and I just want to go through these really quickly to make sure that when you see these icons later, everyone has at least a little bit of context. What I've done is I've colored them: this off-yellow color for compute, a pink color for the integration bits, and green for storage. I won't go into the compute right now because it gets its own slides. But we have Amazon EventBridge: how can we use events to tie things together?

We have AWS Step Functions: let's build workflows, so now we can provide orchestration. We have Amazon Simple Queue Service, a.k.a. Amazon SQS: queues on demand, a queue that will grow and shrink with you. And then we also have Amazon Simple Notification Service, Amazon SNS. And what is that? Well, I want to broadcast messages.

And then the last two: we have Amazon API Gateway, and Amazon API Gateway handles all sorts of concerns, whether it be management or security, if you're providing APIs to anyone internal or on the internet. And then finally, we have AWS AppSync. If you're interested in GraphQL, AWS AppSync is actually a great option for you.

And then we have the two storage products that we're going to talk about today. We have Amazon Simple Storage Service, a.k.a. Amazon S3, and then we have Amazon Elastic File System, Amazon EFS.

So let's get into serverless compute. When I think about compute, I think about the two foundations, and one of the questions that I hear a lot is, well, what should I use when? Well, we have Lambda and we have Fargate, and you know something, there's lots of overlap, and that's actually great, but there are places where both shine.

So let's just go through a couple of bullet points about each. First, with AWS Lambda. Really, what it is is it allows you to run code without provisioning infrastructure. You write a JavaScript application, you write a Python application. Matter of fact, you can write a Go application, a Java application, a Rust application, and you don't have to worry about provisioning.

And the best part is you also don't have to worry about provisioning the scale. You build it, you put it in the zip or the OCI container, and we will take care of all that for you. But Fargate is actually a little bit different. There are no zip files in Fargate; everything is an OCI container. And I know someone's saying in the back of the room, hey, Brian, what about zips in OCI? And I'm saying no, let's just stick with OCI containers.

But the most important part about Fargate is it gives you a lot more flexibility in certain cases than Lambda does, because now you can choose how much memory you want to use and how much compute you want to use. And the interesting bit is that it also works with Amazon ECS and Amazon EKS.

And let's go back into that just a little bit. So, another choice, because the world is always full of choices. And when you work in the cloud, it is always about choices: what choices do we make to actually give us the best outcome? So we have ECS and we have EKS. Which should you choose? Let's actually talk about this for just a second.

EKS is more about flexibility and access to that very broad Kubernetes ecosystem and community. And if you think about ECS, it's more about simple. We built ECS to run our services inside of AWS. It scales very, very large and it is very functional, but really what we built it to do is not just the scaling but also to work well with our other AWS services.

So when we think about serverless workloads, and before we get into the actual storage for our serverless workloads, let's think about this piece. As we have moved on and as computing has matured, we're actually seeing different types of workloads. It's not just "I want to serve my website" anymore. What we're really seeing now is people saying: I have data, whether it's structured or not, and I want to process it through a pipeline, either to give it to someone else or to solve a problem. And we need to figure out, well, how can we accommodate people who are trying to do that?

And at the same time, we need to realize that storage is often a significant component of the cost of moving to the cloud. And we want to make sure that we are working on ways to allow that to never be bigger than it needs to be.

So as an introduction to serverless storage, I'm gonna pass it over to Sebastian, and he's gonna go more into that.

Thank you, Brian. So as Brian mentioned, workloads are changing drastically in terms of the workloads that are running in serverless environments. We're seeing things like EMR, AI and ML training, and so on. So we need storage that can actually support those types of workloads.

So let's start by defining what serverless storage is. Well, there are really three criteria that make serverless storage what it is today. First, we have no infrastructure to manage. That's very important. You don't have to provision a fixed amount of space that you have to change later down the road when your requirements change. You don't necessarily have to provision any capacity aspect or performance aspect of the storage. The performance will scale with your workload, similarly to Lambda or SNS or SQS, for example.

And then, you pay only for what you use. And I know what you're thinking. You're thinking, Sebastian, I pay only for what I use for everything within AWS. Yes, but there's a difference. For example, if you're using EBS today, you're going to provision a volume, and that volume might be, you know, a terabyte in size, because you know you're going to need about a terabyte to host all of your data. Well, you're going to pay for a terabyte, whether you store five megabytes, five gigabytes or 900 gigabytes of data. With serverless storage, you pay only for the data that you store within the storage service, and not for any provisioned capacity per se.

And then the last property of serverless storage is that it is fully elastic. It provides you with virtually unlimited capacity and very high performance that will scale as you need that performance.

So there are two services that fall into that category. Brian mentioned them: Amazon S3 and Amazon EFS.

So if we look at Amazon S3, what defines Amazon S3 in terms of storage for serverless? Well, first off and most importantly, more than anything else, it provides an object semantic. So if you're accessing data through REST, through HTTP or HTTPS, and that's how you want to be accessing data, then you want to be using S3 for that. It provides legitimately the highest level of parallelism of any of the storage services out there. When you think of S3 and the ability to distribute data over something like CloudFront and have objects accessed by millions of different users at the same time, it really does provide the highest level of parallelism. It also provides the highest level of throughput.

So if you're looking for a service that can provide, you know, terabytes per second of throughput and that will enable hundreds of thousands or millions of connections at the same time, you want to be looking at a service like S3. It also has an EventBridge integration, which allows customers, when data gets created or modified within an S3 bucket, to send a notification or launch a Lambda function, so that your workflow can actually start with data creation or data modification within the storage.

The storage can basically be an orchestrator of the function that you're going to be running on the data that you've just received. It's also very cost efficient. We've got seven different storage classes as of today (eight, right? I do believe so) to help you manage where your data lives in the most cost-efficient way over time.

So as your data becomes older and gets accessed less, you can trickle it down through intelligent tiering over the different storage classes, so that it is always positioned in the most economical way. And it's accessible from anywhere.

How does that differentiate from other services? Well, S3 is built to be accessed over HTTP and HTTPS, meaning that you can access an S3 bucket directly from any workstation, any phone, across the world. You don't need to be close to your data or to your S3 bucket; you can just access it. It also offers an object replication target SLA of 99.9% within 15 minutes, but it really replicates objects faster than that: the target for the replication is within seconds.

So if you create new objects within a bucket and you want to replicate that bucket, whether it's for data protection, for resiliency, or just for making those objects available closer to where the consumers or customers are consuming them, it will be done with a target replication time of about a second.

And lastly, you have object versioning. Object versioning enables you to keep a copy of every single object that gets written or overwritten within your bucket. Now, S3 also offers you a durability of 99.9... actually eleven nines, I'm not going to try it, eleven nines of durability for your data, as well as four nines of availability.

Now, let's compare this a little bit to Amazon EFS. Amazon EFS offers file access through NFS v4, right? So this is important: that's the biggest difference between EFS and S3. You're talking about object semantics for S3 and file semantics for Amazon EFS. It also offers sub-millisecond latency.

So if you're accessing data within the same region where your file system is, through the same availability zones, you're going to get sub-millisecond latencies from EFS. It is designed for more IOPS-intensive workloads. So if you've got a workload that deals with smaller files, or with data that is read in very small increments, it can be more efficient to run this on EFS versus running something like this on S3.

It also has the benefit that NFS in and of itself implements client-side caching, meaning that the client that's actually accessing the NFS mount point will retain some of the metadata and some of the data that is being read from the file system. So upon rereading that data, you might get even better performance. And then it implements a POSIX file permission scheme.

So if you have a need to replicate the POSIX file system permissions that you have on premises, or if you're using POSIX users within your application workflow to determine who can access what, EFS can absolutely do that for you. From a storage class standpoint, we've got Standard, IA (or Infrequent Access) and Archive storage classes that are available for you to use, as well as intelligent tiering, which allows you to trickle your data down, similarly to S3, based on the last time that data has actually been accessed. You can access an EFS file system within its own VPC to get the lowest possible latencies.

We were talking about sub-millisecond latency access earlier. And then we've got a replication target of minutes for copying data between two EFS file systems across regions. Lastly, it does provide the same level of durability and availability that S3 does.

So when we think of the Well-Architected Framework, these serverless services designed by AWS are designed with the Well-Architected Framework in mind. When we talk about operational excellence, we mentioned the services do not need to be reconfigured based on the workload that you're running against them. From a security standpoint, we have integration with IAM, POSIX permissions, and so on and so forth. From a reliability standpoint, we've got eleven nines of durability and four nines of availability. Performance efficiency: again, not needing to reconfigure your file system or reconfigure your bucket to get more performance out of it; just the fact that you're going to drive more workload will provide you with the performance that you need. And cost optimization: we've talked about the different storage classes that are available to you, and using intelligent tiering to trickle the data down through those storage classes provides you the best possible cost optimization.

So let's talk a little bit about when you should use what, if I haven't made that clear already. The biggest differentiator between Amazon S3 and Amazon EFS is the fact that S3 gets accessed through object semantics, whereas Amazon EFS gets accessed through file semantics. That's the biggest differentiator. If you've got an application, or a component of an application, where maybe you're using a third-party product and that third-party product does not have support for Amazon S3 or object access to data, then you have to use file semantics, right?

S3 offers the highest level of scalability of any storage service that we have. It is best for event-driven patterns. We've talked about the EventBridge integration with the service; having that EventBridge integration allows us to trigger workflows from the data that's being changed or created in the S3 bucket. And then lastly, if you're using tools like Kinesis, Glue or Athena, then you absolutely want to use S3. All those tools were built around the capabilities within S3 to let you basically work with your data directly from an S3 bucket without having to put it in a database.

On the EFS side: random data access within a large file. S3 can allow you to do a byte-range read if you want, so if you've got a very large file, you can read just a section of that file. But when it comes to writes, you absolutely have to write an entire object back to the S3 bucket, which means that if you're working with a very large file, it might mean a lot of writes have to be done in order to update a very small amount of data within your object. Whereas if you're working with EFS, you can just seek within the file, write the data that you need to write, close your file, and be done with the update within that very large file. All that considered, EFS is best for any environment where you're going to have a high rate of change on existing files. It's going to provide you with the benefit of the lower latency and the ability to update files in the middle of the file itself, and not have to rewrite entire files.
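
As a rough sketch of that difference in code (a minimal example; the bucket, key and mount-path names are placeholders, not from the talk):

```python
import boto3

s3 = boto3.client("s3")

# S3: a ranged GET reads just one slice of a large object.
resp = s3.get_object(
    Bucket="example-bucket",      # placeholder bucket
    Key="large-video.bin",        # placeholder key
    Range="bytes=0-1048575",      # first 1 MiB only
)
first_chunk = resp["Body"].read()

# S3: changing even a few bytes means re-uploading the whole object,
# e.g. s3.put_object(Bucket=..., Key=..., Body=entire_new_payload).

# EFS (mounted at a path such as /mnt/assets): seek and update in place.
with open("/mnt/assets/large-video.bin", "r+b") as f:
    f.seek(10 * 1024 * 1024)      # jump to the region that changed
    f.write(b"updated bytes")     # rewrite only that region
```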

We've talked about the benefits of having NFS client-side caching, which will speed up your workload if you're rereading the same data over and over again without changing it. And then, working with small files might be very significant for some customers. As you probably know if you've worked with S3, every time you do a PUT or a GET, there's a very, very small charge applied to your account. But those very small charges, if you're doing millions, tens of millions, hundreds of millions of accesses to those objects, can amount to some significant costs. In the case of EFS, there are no I/O charges related to that. All you're paying for is the throughput that you're using between the clients and the actual file system. And then, for anyone that's worked with Lambda in integration with EFS, that's a really cool feature: EFS actually has a native integration with Lambda that allows you to create a Lambda function and just configure the EFS file system that you want. Every single time an invocation of that Lambda function happens, that file system will be mounted and the data within that file system will be available to the Lambda function.
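
A minimal sketch of what that Lambda/EFS integration looks like from the function's point of view; the mount path and file names are placeholders, and the function is assumed to already be configured with an EFS access point (for example via a FileSystemConfigs entry when the function is created):

```python
import json
import os

# Assumed local mount path configured on the function's EFS access point.
MOUNT_PATH = "/mnt/shared"

def handler(event, context):
    # Every invocation sees the same shared, durable file system.
    path = os.path.join(MOUNT_PATH, "config", "settings.json")
    with open(path) as f:
        settings = json.load(f)
    return {"loaded_keys": sorted(settings)}
```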

Why do I bring this up? Well, Brian talked about containers; he talked about ECS and EKS. That is not the case with ECS and EKS; we don't have that native integration there. You need to be using something called a CSI driver in order to create that same integration, and that same integration is possible. By installing the CSI driver, you enable the cluster administrator to define a storage class. By defining a storage class, you basically create storage that the developer can access, and you eliminate the need for the developer to know anything about the storage. They can just look at the storage class and say, this is what I need to access, and get a persistent volume claim by creating storage through that storage class and through the CSI driver, without ever having to know anything about the underlying storage. I'll pass it back to Brian to talk to us about the common workflows that we see.
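
Roughly, that flow with the EFS CSI driver could look like the sketch below, using the Kubernetes Python client; the file system ID, names and parameters are placeholders based on the driver's usual dynamic-provisioning setup, not something shown in the talk:

```python
from kubernetes import client, config

config.load_kube_config()

# Cluster administrator: define a storage class backed by the EFS CSI driver.
storage_class = client.V1StorageClass(
    metadata=client.V1ObjectMeta(name="efs-sc"),
    provisioner="efs.csi.aws.com",
    parameters={
        "provisioningMode": "efs-ap",              # access point per volume
        "fileSystemId": "fs-0123456789abcdef0",    # placeholder file system ID
        "directoryPerms": "700",
    },
)
client.StorageV1Api().create_storage_class(body=storage_class)

# Developer: request storage by class name only, with no knowledge of EFS.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="efs-sc",
        resources=client.V1ResourceRequirements(requests={"storage": "5Gi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```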

Alright. Thank you, Sebastian. So that was the prep, you know: we start here, make sure everybody is on the same page and knows all the words. From here on out, we're just going to talk brass tacks, and I want to start with some workflows. The interesting thing is that, as a conference speaker, when you're thinking about this, I could just make anything up, and I actually didn't have to. As I talk to customers and spend time with customers, I run into real-world scenarios where I didn't have to make up anything. People are telling me, you know, Brian, this is how we are using your services. So I wanted to come with one or two of these just to explain: when you think about serverless compute and serverless storage, here are some things you could actually think about that might even be up your alley.

So in this case, what we have here is: some kind of client releases an asset, and that asset is video, and it puts it in an asset bucket. Then some hand waving happens, and we prepare that asset for streaming, we generate a thumbnail, and then we put it into a distribution bucket. And you're saying, ok, Brian, hold on. Did you just tell us to create two buckets? I'm like, yes, yes, I actually did tell you to create two buckets. And why would I do this?

A couple of reasons. One is, if I'm doing anything of any type of complexity, I want to separate my inputs from my outputs. In this case, we can separate the raw assets away from what we're going to serve to our customers. Another thing we can do here is, because the access patterns are different for the asset bucket versus the distribution bucket, you can use intelligent tiering to actually save some money on the asset bucket. Because really, what we're going to do is just dump assets in there; they will not change once we put them in our asset bucket. So we can tier this down to the point where we have access that is not very frequent.
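
One way to express that, sketched with boto3 and a placeholder bucket name: a lifecycle rule that moves newly landed raw assets into S3 Intelligent-Tiering so rarely accessed originals drift to cheaper tiers on their own.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-asset-bucket",  # placeholder asset-bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-assets-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # every object in the bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```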

So now, as we think about this, the way that our video processing system works is that we need a directory with these source files in it. And we're gonna run two Lambdas, or two types of Lambdas; we can actually run any number of Lambdas. The first type of Lambda we're gonna run is gonna generate our thumbnail from our video. The second one is going to resize it, or maybe it'll chunk it up, or maybe we'll run multiple resizes. I don't know, but these are two things that we might do in this case.

So what should we think about as we're moving through this process? Well, because we're working in a serverless world and we're not provisioning infrastructure, we want the whole process to decompose into events. You basically want to say, well, if this happens, I do this; if this happens, I do this. This is actually great, because now you have seams throughout your application where you can test new functionality and insert new functionality, but also places throughout your application where you can detect and handle failure. And then also, like I was saying before, we want to take advantage of the tiering system within S3 to make sure that we are only spending the dollars that we need to. And then also, we're not perfect, so one thing we wanna do is make sure that we never put ourselves in a situation where we cannot make changes.

So there are two steps to this.

First, well, actually, the big part is that we need to somehow copy this asset out of S3, because it's object storage and we're gonna process it. We need a file system, so we're gonna use EFS, and we wanna be able to run that.

So how does that look?

Well, with no provisioning of any infrastructure, what you can do is place an object in your asset bucket, and then you can configure S3 to generate an S3 notification. What will happen in step two here is that Amazon EventBridge can listen to this, and then it will just launch something that says, hey, I need to copy source to destination on EFS.

And at the end of your copy to EFS, you can launch another notification to say, hey, I'm done. Notice what we've done here: we've built this system that is composed, and each piece is small in itself, so we can test it. And if we need to change it in a little bit, nothing is coupled to the other steps, because it's all event driven.
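
A minimal sketch of what that first step might look like as a Python Lambda handler invoked by the EventBridge rule; the mount path, event source name and detail type are hypothetical:

```python
import json
import os

import boto3

s3 = boto3.client("s3")
events = boto3.client("events")

EFS_MOUNT = "/mnt/assets"  # placeholder EFS mount path on the function

def handler(event, context):
    # The S3 "Object Created" event delivered via EventBridge carries the
    # bucket and key under event["detail"].
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    dest = os.path.join(EFS_MOUNT, os.path.basename(key))
    s3.download_file(bucket, key, dest)

    # Emit a follow-up event so the next step stays decoupled from this one.
    events.put_events(
        Entries=[
            {
                "Source": "video.pipeline",        # hypothetical source name
                "DetailType": "AssetCopiedToEFS",  # hypothetical detail type
                "Detail": json.dumps({"bucket": bucket, "key": key, "path": dest}),
            }
        ]
    )
```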

The next thing we would want to do here is, now that we have our asset available, we can use Amazon EventBridge again to process it, and we can basically copy this file from our EFS directory to our distribution bucket.

And then finally, something that we never think about: even though we have unbounded compute and unbounded storage, we always want to clean up after ourselves.

So another process that we could throw in here, because we realized we were using way too much EFS: we only need to generate once, so we'll have a cron-like schedule triggered by EventBridge, and we'll actually clean up our file system to keep it small, to make sure we're not getting charged for anything that we're not going to use.
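
That cleanup step could be as small as the sketch below: a Lambda function on an EventBridge schedule that prunes scratch files older than a day (the mount path and retention window are assumptions):

```python
import os
import time

EFS_MOUNT = "/mnt/assets"     # placeholder mount path
MAX_AGE_SECONDS = 24 * 3600   # assumed one-day retention for scratch files

def handler(event, context):
    """Invoked by a scheduled EventBridge rule; removes stale working files."""
    cutoff = time.time() - MAX_AGE_SECONDS
    removed = 0
    for root, _dirs, files in os.walk(EFS_MOUNT):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed += 1
    return {"removed": removed}
```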

So that was the first one.

The next workflow is actually a little bit more common, but there are ways that we can solve it in a serverless fashion, actually just using S3 and S3's integration with Lambda.

So let's say I have an S3 bucket, and it feeds data into an e-commerce application. And what does it do? Well, that's not important. It just fulfills a business objective.

So what we would really like to do is figure out how to change that data in real time. We want it to work with our generic e-commerce application, but we would also like to have redacted data and maybe enriched data. Why would we want redacted data? Because we're using analytics. And why would we want enriched data? Well, because our marketing team likes data.

So how can we take that one piece of data and make it available in these three use cases? And I'm only choosing three because that was the size of the slide. How can we make that one piece of data available in n use cases without having n copies of that data?

So why would we want to do this? Well, the first thing is, data is always changing, and we want to make sure that we never lock ourselves into the point where data is changed in a way that is no longer useful, or is mutated in a way that it doesn't provide the value that we need. And then also, we want to keep control of cost. We don't want to have any number of copies of a piece of data, because, you know what, we have to pay for those.

So how would we approach this problem? Well, inside of S3, we have a product called Object Lambda. And the way that it works is this: we can create an S3 Object Lambda access point, and it can work with a Lambda function so that whenever you request a piece of data through a particular URL, the data can be changed based on the URL that you called.

But at the end, it's just an endpoint for data that's sitting in an S3 bucket. So in this case, for our e-commerce application, when you request the object, you go to the bucket directly and you get that object back.

But let's say that I wanted to get this data and I'm actually getting it from my analytics application, and I don't want any personally identifying information in there. We can remove that in the GET call. We can actually do it in front of a GET, HEAD or LIST call as well.

But in that GET call for that piece of data, we can just change the data in real time. And then also, for a marketing application, maybe we have another database that has our customer loyalty information in there. What we can do now is have that Lambda, while we're doing the GET, also call that customer loyalty database, and we'll actually make new data so we can enrich what we're sending out. And if we do this, what we've done is we've basically used compute, a.k.a. Lambda in this case, to enrich the data that we have inside of our S3 buckets without having to copy it more than one time.
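
As a rough sketch, an S3 Object Lambda function for the redaction case might look like this; the field names being removed are placeholders, and the event shape follows the standard Object Lambda GetObject contract:

```python
import json
import urllib.request

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # S3 Object Lambda supplies a presigned URL for the original object plus
    # a route and token used to return the transformed bytes to the caller.
    ctx = event["getObjectContext"]
    original = urllib.request.urlopen(ctx["inputS3Url"]).read()

    record = json.loads(original)
    record.pop("email", None)   # hypothetical PII fields to redact
    record.pop("phone", None)

    s3.write_get_object_response(
        Body=json.dumps(record).encode("utf-8"),
        RequestRoute=ctx["outputRoute"],
        RequestToken=ctx["outputToken"],
    )
    return {"statusCode": 200}
```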

And what's gonna happen now is Sebastian is gonna take us through a couple more use cases, and then we'll move into our customer. Thank you, Brian.

So let's talk about a couple of real-world use cases.

The first one is The Brains. If you've never heard of The Brains AI, they're a super interesting company; you've probably used or been involved with some of their products before. What they do is create a user-friendly interface for companies to go in and create those little videos with avatars. They've got over 100 different avatars, and those avatars can speak 55 different languages.

So you can log into their website and just basically say, I want one person here, one person there, I want this dialogue to happen, and they will basically create the video to match what you're saying. And all of this is being done using AI. The challenge the company was having is that they were very popular.

And the reason I said you've probably been exposed to this before is that, you know, HR tends to have a season where they actually create those videos and make you watch them. Usually it's the beginning of the fiscal year: you get like 10 or 15 different trainings that you have to run through.

Well, this company was basically a victim of the success they were having. So they were looking at, with the limited compute resources that they have, how could they maintain a specific SLA with their customers and be able to deliver within a certain amount of time if they had a lot of demand coming in. A quote from one of the executives of the company: our legacy infrastructure hindered our ability to rapidly respond to fluctuating demand and changing market conditions.

So what they decided to do is to extend their on-premises infrastructure out to AWS. They were using Citrix on prem, and they decided, well, what's the best way we can actually do this? To do that, they leveraged EKS within AWS to create more pods, so that when new workloads would come in, they would just be able to dispatch those workloads to a different location based on the available capacity.

That way they could maximize the utilization of the on-premises assets that they already had. And then, when burst was required to actually meet the customer requirements and be able to maintain the SLAs, they would burst into AWS using EKS to actually deliver on this. And the backing storage for that was Amazon EFS and Amazon S3.

That way they would only pay for the storage that they used while they were running those jobs, and not have to pay anything more than that.

The second example I want to talk a little bit about is Woot. Who is familiar with Woot? Yeah. All right.

Woot was purchased by Amazon a few years ago, but before that happened, these guys were actually running into a problem with their data warehouse. They only had one person managing their data warehouse, and they wanted to modernize it. They wanted to be able to scale even more with just the limited resources that they had on hand.

So what they wanted to do is decouple the legacy monolithic database that they had into microservices. And they did that by leveraging S3 as the data source for their data lake, without having any databases per se.

They used DMS, or AWS Database Migration Service, to look at all of the databases that they had and translate those into data sitting directly on S3. And then, as part of that, they actually used a Glue crawler to go through all of the data that had been exported by DMS and reconcile all the schemas that they needed to use.

But the most interesting part is that they flipped their ETLs around. Instead of writing jobs that would read all of the logs they had from their various systems and input that into their databases, they used Kinesis Data Firehose to have the applications push their logs directly into the system. That way they didn't have to worry about: well, did my job run, did I capture all of the logs that were there, making sure on a day-to-day basis that all of the new data had been ingested into the data warehouse and that the data warehouse was consistent.

They could just assume that, or rather put the burden of this on each one of the applications, and have Kinesis Firehose be the entry point for all of the data that was going in.

So with that said, I will pass it over to Jefferson so he can talk to us about his personal use case at Shutterstock. Thank you, Sebastian. Thank you, Brian.

Hello everyone. My name is Jefferson Frazer. I am the Director of Infrastructure for Shutterstock. I lead a number of platform teams which help maintain our distributed storage footprint and our content life cycles.

If you're not familiar with Shutterstock, we have a large portfolio collection of over 730 million images. Outside of our image catalog, we also have 45 million videos and millions more MP3s, sound effects and 3D models. All of this sits on top of an 80 petabyte data lake on Amazon S3, spread across 2,500 buckets and more than 150 AWS accounts.

Our catalog is consistently growing and it grows by a tick of about 200,000 images every day and 75,000 videos every week.

Shutterstock is really a house of brands. We have several customer-facing entities, each of which specializes in a specific type and style of content and static content creation. As you can see from the images on the screen here, several of these brands overlap, and as we've grown over the last 20 years, we had developed a number of internal bespoke pipelines, and we saw many of our brands were duplicating a lot of their workloads.

Many of these brands were building their own content ingestion, optimization and delivery platforms. So as we started to move into 2023, we wanted to look at how we could reduce duplication in our ingestion process.

How could we look at these 20 years of historical compute build, pick just the best aspects of each one of these pipelines, and isolate those into Lambda functions? How could we design a system that had multiple integration endpoints, so that we could still execute our workloads with our happy-path, battle-tested code while we're driving innovation and building our new systems?

At the same time, we wanted to optimize human content review. All of our content is reviewed by an actual human being at the end of our process before it's marked sellable and delivered to customers. And we settled on our content unification architecture.

We found that there were several commonalities between all of our successful workloads. All of these things began with an event: a contributor uploaded an object that was going to be marked sellable for our site, something they wanted someone else to find. When the object gets put into S3, we fire off an event notification, and that event notification goes to Amazon EventBridge. Using custom EventBridge buses and custom rules, we're able to route these events to either single or multiple Step Functions, which are a collection of these tiny morsels that we had lifted out of our code base.

We do a lot of the same workloads that Brian and Sebastian mentioned earlier in the presentation. We do things like verifying the uniqueness of objects that are uploaded, and making sure that the associated metadata is accurate and reflects what's actually in the image or the video that you're trying to look at.

And we also call out to a wide variety of other AWS services outside of Lambda. We wanted to have this system in place so that we could have a loosely coupled, extensible task definition where engineering teams could modify a small, important bit of our ingestion process without needing to be aware of our distributed application portfolios.

And this set us up for some really nice high concurrency serverless processing.

These workloads are highly unpredictable. We know that we have average growth, but we found over time that we had a lot of compute sitting around idle. Some days we would have millions of objects published; some days we would have a trickle of only a few hundred thousand. Step Functions gave us a large degree of flexibility for how we wanted to schedule and execute our compute workloads.

What we see on the screen here is an AWS console view of the definition graph of one of our Step Functions, which runs ingestion for multiple brand properties. Each one of these small blobs is a dedicated Lambda function which performs a specific, important task and then passes the compute on to the next step in the Step Function. This complex mapping allows us to execute some workloads highly concurrently and in parallel, or handle things that need to be serialized. We can handle that logic inside of the Step Function definition, including complex tasks like retries or error recording.

We lean heavily on the shared storage between steps in Step Functions. This allows us to optimize our use of S3 to make sure that we're not pulling objects into memory unless we absolutely need to. We don't want any large surprises at the end of the billing cycle for millions of unnecessary S3 GETs or for unnecessarily modifying content at rest.

And as we started to evolve this posture, we found that a lot of our code didn't actually need to be a Lambda function. Step Functions is highly extensible, and Step Functions supports direct service API access to a wide variety of Amazon services. We found that simple tasks that we had written in a variety of languages, like an S3 GET or modifying metadata, could be accomplished directly as part of the Step Function definition. And this allowed us to have an extensible and highly concurrent operating model that was based around JSON and object notation instead of the dedicated language that our developers were writing for that specific problem they were trying to solve.
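
As a sketch of that idea (names, ARNs and the state shape here are placeholders, not Shutterstock's actual definitions), a state machine can call S3 directly through the SDK integration instead of wrapping the call in a Lambda function:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# A single Task state that copies an object with the S3 SDK integration,
# so no Lambda function is needed for this simple step.
definition = {
    "StartAt": "CopyToDistribution",
    "States": {
        "CopyToDistribution": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:s3:copyObject",
            "Parameters": {
                "Bucket": "example-distribution-bucket",
                "Key.$": "$.key",
                "CopySource.$": "States.Format('{}/{}', $.sourceBucket, $.key)",
            },
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="example-copy-step",
    roleArn="arn:aws:iam::123456789012:role/example-sfn-role",  # placeholder
    definition=json.dumps(definition),
)
```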

We're also heavy consumers of EventBridge. We've standardized on EventBridge as our orchestration layer for chaining together these complex tasks and multiple compute layers. We have an ingress and egress EventBridge in each of our accounts where a workload runs. This gives us a common interface and a common definition which we can use as a schema and a contract between our services. Further reinforcing that each one of our workloads can be modified by itself and developers do not need to be aware of our larger ecosystem. This also sets us up for isolated and least privileged workload execution.

Each one of these accounts is owned and maintained by one of our service teams, and we want to make sure that the roles these applications are using are not overpowered. They're finely grained and scoped to the specific challenge that team is trying to solve in that account. EventBridge also sets us up for easy integration with our existing workloads. Having been in business for more than 20 years, we were already very familiar with loosely coupled workloads and were heavy consumers of SQS.

Using EventBridge custom buses and custom rules, we're able to funnel events to where they need to go, and even duplicate them so that they can hit multiple compute layers. We can therefore subscribe our legacy compute systems to our new ingestion process to make sure that as we're developing these new tools for our customers, we're still hitting our happy-path, battle-tested code, making sure that our production stability is our highest goal.

So what are we doing with Amazon S3 and our storage conventions that sets us up to be able to execute these workloads?

The first is strong prefix entropy in Amazon S3. We like to maximize the IOPS in each one of our S3 buckets so that we can retrieve the most data possible in the fastest timeline, all while maintaining a high degree of uptime and no production impact for our working systems. You have to plan how you're going to retrieve your data from S3 when you put it in. This is not just an object store where you drop it and forget it. We like to be intentional about our prefix naming schemes, so that when we go to retrieve an object with something like CloudFront to show an image to a customer, we don't need to waste compute on superfluous tasks like routing through an extra S3 prefix.

We're also strong believers in transparent bucket naming conventions. As we continue to grow and scale, tracking all of these distributed Amazon S3 buckets becomes a very challenging task. We're very strong believers in putting the AWS account ID number and the region into the name of the bucket. This makes complex tracing tasks, like figuring out what region or what account a bucket is managed in, very simple, and it lowers the time to resolution when engineers are actively troubleshooting.

We're also big fans of disabling object ACLs. There are very few use cases remaining where we need to have object ACLs enabled, and most of these are centered around log delivery and aggregation from other services. As a general rule, we try to make sure that ACLs are disabled by default at the account level for all of our AWS accounts.

And lastly, all of our content is encrypted at rest. We like to make sure that the data is secure even when it's at rest and not being accessed.
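
A small sketch of those two defaults with boto3, using a placeholder bucket name: enforcing bucket-owner ownership (which disables object ACLs) and turning on default encryption at rest.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-content-bucket"  # placeholder

# Enforce bucket-owner ownership so object ACLs are disabled entirely.
s3.put_bucket_ownership_controls(
    Bucket=bucket,
    OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]},
)

# Default server-side encryption so every new object is encrypted at rest.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)
```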

The largest change that we've made that has accelerated our Shutterstock journey has been using attribute-based access controls in Amazon S3. S3 supports resource-based IAM policies, and we heavily leverage powerful conditional statements in these policies to give dynamic access to our application fleet.

The example on the screen shows a conditional with two conditions; each of the conditions must be matched for access to be granted. The first is our principal org ID. As we are strong consumers of AWS Control Tower, all of our accounts are underneath a single principal org, so we can say with certainty that all requesting IAM entities have the associated principal org ID in their request payload.

We're also consumers of principal tags. All of our requesting IAM roles are paired with tag keys and values that give them varying levels of access to our distributed S3 footprint. This allows developers to spin new roles up and down without needing to modify our S3 bucket policies. At scale, the service control policies that we have in place protect these tags. We don't want anyone to be able to reach out and add or modify these tags, so our service control policies restrict which entities can create, modify or delete a list of known tags.

So now we know with certainty that, as we scale, only our pipelines are able to grant access to these buckets, and we have a very robust code review process in place to make sure that no access is being granted that isn't absolutely necessary, all while optimizing for developer time without needing to make superfluous modifications to our bucket policies.
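
A minimal sketch of a bucket policy along those lines; the org ID, tag key and value, bucket name, and actions are all placeholders rather than Shutterstock's real policy:

```python
import json

import boto3

s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowTaggedPipelineRoles",
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-content-bucket/*",
            "Condition": {
                # Both conditions must match for access to be granted.
                "StringEquals": {
                    "aws:PrincipalOrgID": "o-exampleorgid",
                    "aws:PrincipalTag/storage-access": "content-pipeline",
                }
            },
        }
    ],
}

s3.put_bucket_policy(
    Bucket="example-content-bucket", Policy=json.dumps(policy)
)
```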

I have another example of a pipeline that we use at scale. This is our event driven content delivery. One of the core tasks for our storage team is delivery of large data sets for machine learning and programmatic compute consumption. Just like with our content ingestion, all of this begins with an event. In this case, the developer is preparing a manifest, a long list of files, a long list of criteria for a specific data delivery for a specific task. And they're uploading this through our JavaScript front ends. This manifest file is dropped into our S3 bucket. And from there, an event notification is sent off to EventBridge.

From EventBridge, we can funnel again into Step Functions. At this point, we'll perform preparation for our operations to make sure that all of our assets are ready to be delivered to the customer in question. Frequently, we'll find that we need to modify something about the objects at rest. Maybe it's the storage class or the regionality of a bucket; we want to copy something into a region where it will be more cost optimized for delivery. So we'll fire off an AWS S3 Batch Operations job to perform actions like an in-place copy to modify the storage tier and the storage class.
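
A hedged sketch of kicking off such a job with the S3 Control API; the account ID, ARNs, manifest location and ETag are placeholders, and the exact operation settings would depend on the change being made:

```python
import boto3

s3control = boto3.client("s3control")

s3control.create_job(
    AccountId="123456789012",  # placeholder account ID
    ConfirmationRequired=False,
    RoleArn="arn:aws:iam::123456789012:role/example-batch-role",
    Priority=10,
    Operation={
        # In-place copy back into the same bucket to change the storage class.
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::example-delivery-bucket",
            "StorageClass": "STANDARD",
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::example-delivery-bucket/manifests/manifest.csv",
            "ETag": "example-manifest-etag",  # placeholder ETag
        },
    },
    Report={
        "Enabled": True,
        "Bucket": "arn:aws:s3:::example-delivery-bucket",
        "Prefix": "batch-reports",
        "Format": "Report_CSV_20180820",
        "ReportScope": "AllTasks",
    },
)
```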

As this completes, the same bucket is used: the completion manifest and the success and failure manifests from S3 Batch Operations are published back into the bucket, and the whole process begins again. After the first operation is completed, the next event fires through and we're ready for the final delivery of our content.

As the event comes into Step Functions and we see that our catalog is ready for delivery, we'll deploy an AWS Serverless Application Repository application definition. This is going to include a dedicated deployment of everything needed to make this batch operation successful, and that includes things like a dedicated Lambda function and a dedicated role with the service-control-policy-managed principal tags.

This way, we have several levers to control concurrency. We can control concurrency on a batch operation through the batch operation's prioritization, through tuning concurrency on the Lambda function, or by tuning concurrency at the account level. With each of these in place, we have many levers that we can use to control many concurrent batch deliveries at scale.

When we're performing these operations, we've seen a huge return on our investment. As with most footprints in AWS, this was not our first architecture. We've evolved over time and grown into this delivery platform. We spent a lot of time in our v0 building out on EC2, using things like RabbitMQ and SQS, and we found that our developers were spending a lot of time with toil: a lot of time tracing, troubleshooting, and managing platforms.

And so as we started to build out this new content delivery orchestration, we wanted to make sure that we were leveraging serverless components that allowed our engineers to focus on their core business priorities and their core business problems rather than "Are there enough threads on this machine? Did I provision enough CPUs? Is this the right box I'm on while I'm SSHing and trying to figure out what's going on?"

We've seen a great return on our investment from prioritizing serverless components, and have experienced a 4,700% velocity improvement over our legacy delivery platforms.

Sure, you wanna come up and join me here? Yeah.

So, thank you for this. Actually, this is a great example of a real customer using this at scale. But the question I wanna ask you is about the journey that you've gone through, going beyond experimenting with serverless and actually employing it to good use. What would you share with the audience, things you have learned that would help someone who is either at the beginning of their journey or stumbling on this journey?

Sure. We wanted to focus on the smallest possible slice when deconstructing our legacy systems: how can we find the smallest repeatable processes that we can execute here? We had a lot of use cases where we had provisioned EC2 nodes and full services for very small subsections of this flow, and we were controlling lots of different components and lots of different microservices.

I would have started earlier. I would have started disassembling a lot of our legacy management so that we could have our engineers focus on solving our problems and delivering value to customers, instead of making sure that our RabbitMQ cluster was appropriately provisioned or that our storage layer on our EC2 nodes was meeting the demand that we were hitting.

All right. Well, thank you, Jefferson. We'll have Sebastian close this out, and, yeah, thank you guys.

Lastly, the last slide here: we've got AWS training badges for storage. So if you're into storage, or if it's something that's important to you, you can go to aws.training/storage and earn those badges. We've got full curriculums for object, block and file storage, for data protection, and so on, and you can really learn about each one of those technologies. At the end of the curriculum, you can take a badge, a badge that's going to be recognized by Cred.

So please go out, and, you know, if you're interested in storage, if you want to become more well versed in all of our different storage offerings, go out there and try your hand at getting those badges.

Thank you, everyone.
