Deep dive on Amazon S3 Express One Zone storage class

Hello and welcome to Deep Dive on Amazon S3 Express One Zone. Thanks for spending your evening with us.

My name is Matt Sibley. I'm a Senior Manager of Product Management on Amazon S3. I'm joined by Shay Holly, a Principal Product Manager on Amazon S3; Christy Lee, a Principal Storage Solutions Architect; and Ambud Sharma, Tech Lead of Data Engineering at Pinterest.

We've got a packed agenda today. To start, we'll overview S3 Express One Zone, our new high performance storage class. We'll talk about why we built it, how it helps application performance and dive into the details of how it works.

Then Shay will talk about some of the use cases and partner solutions and how S3 Express One Zone improves performance and reduces TCO.

Then Christy will walk through a demo on how to get started with S3 Express One Zone and show off a few performance numbers.

And then Ambud will share how Pinterest is using S3 Express One Zone to accelerate their machine learning pipelines.

So why did we build S3 Express One Zone? If you look at Amazon S3 today, customers use S3 for a wide variety of use cases and achieve very high performance across everything from data lakes to genomics to origin storage for CDNs. Customers use S3 for these performance-critical applications because S3 has a massive front end that lets them scale applications to millions of requests per second.

For example, it's not uncommon for a customer to scan petabytes of data in a data lake and achieve terabits per second of throughput. However, there are a number of applications out there, whether that's video editing or interactive analytics, that need lower latency because the speed of access impacts the job or query time. These applications are often waiting on a request to complete before the next step can be taken.

Another common pattern is in data pipelines, where data can go through multiple stages and each stage is dependent on the previous stage completing. As a result, some customers move the most frequently accessed data of their performance-critical applications to custom caching solutions to reduce storage latency. However, this increases complexity because you now have to maintain additional storage infrastructure and multiple sets of APIs.

So for these workloads, we built S3 Express One Zone. It's a new high performance storage class that delivers the lowest latency and highest performance object storage in the cloud. It provides up to 10 times faster access speed than Amazon S3 Standard and supports millions of requests per minute all while using the same S3 APIs that many customers are already familiar with.

So how does S3 Express One Zone help latency-sensitive applications? Let's go back to that data pipeline example where the compute tasks are waiting on the storage I/O to complete. For these request-intensive applications, S3 Express One Zone's very low latency means your compute spends less idle time waiting on storage, so your job or query completes faster.

A good example of this is model training at Capital Insets, a quantitative trading team that combines world-class technology with some of the brightest technical minds. They train models for high-frequency trading continuously over extensive amounts of time-series data to keep up with the lightning-fast pace of the financial markets. With S3 Express One Zone, they saw a 78% improvement in workload speed, which allows them to scale their models to higher levels of granularity across asset classes, datasets, and points in time.

So speed is a great benefit when you need a job to complete faster to meet a specific business objective. But S3 Express One Zone can also save you money. With that reduction in compute idle time, in combination with S3 Express One Zone's 50% lower request costs, we're seeing customers save up to 60% on the total cost of ownership of an application.

So I think the thing everyone is curious about is how S3 Express One Zone actually achieves this very high performance. It uses a unique architecture that is optimized for performance to deliver very low, consistent single-digit millisecond latency.

First, it stores data in a single Availability Zone on purpose-built, high-performance hardware.

Second, to further increase that access speed, it introduces a new bucket type, which we call S3 directory buckets. This new bucket type can support hundreds of thousands of transactions per second (TPS) and allows customers to scale their object storage very quickly.

Third, S3 Express One Zone uses a new session-based authorization model that is optimized to provide the lowest latency for requests.

All three of these in combination provide the lowest latency and highest performance, and we'll dive into each of them and how they work in the next few slides.

Alright, so first, let's talk a little bit about the one zone architecture. Today, by default, Amazon S3 stores data redundantly across a minimum of three availability zones. For example, on a write, data is stored durably across these three AZs before returning a 200 success. This provides the highest resiliency and durability.

With directory buckets, you specify the Availability Zone, which gives you the option to co-locate your object storage and compute in the same AZ to further reduce latency.

When you use S3 Express One Zone, data is stored redundantly in that single AZ instead of multiple AZs. And with the one zone architecture, there's a couple of considerations to think about.

First is AZ placement. Storing data in the same AZ as your compute provides the lowest latency, but if you access data from a different AZ, there are no additional network costs. Like all Amazon S3 storage classes, there are no inter-AZ data transfer charges.

Second, one zone storage classes have a different durability model than S3's regional storage classes. So let's talk more about that durability model.

S3 Express One Zone has end-to-end integrity checking on every object upload and verifies that all data is correctly uploaded. Next, data is redundantly stored across multiple devices before it considers your upload to be successful. Once your data is stored in S3, S3 continuously monitors data durability over time with periodic integrity checks.

S3 also actively monitors the redundancy of your data to help verify that the objects that you've created can tolerate concurrent failures of multiple storage devices. And all three of these apply to all Amazon S3 storage classes.

With S3 Express One Zone, there is a difference and it is important. In the unlikely case of loss or damage to all or part of an AWS Availability Zone, data in a one zone storage class may be lost. For example, fires or water damage could result in data loss. So it is something to think about before placing your data in the storage class.

With S3 Express One Zone, we're also introducing a new bucket type, the S3 directory bucket, which supports hundreds of thousands of requests per second. So with this launch, there are now two different bucket types within S3: what we now call the general purpose bucket, which supports storage classes like S3 Standard and S3 Intelligent-Tiering, and the new S3 directory bucket, which supports the S3 Express One Zone storage class.

And one of the key differences between these bucket types is how they scale. With general purpose buckets, the transactions per second can be very high but scale incrementally under load. With directory buckets, buckets scale to hundreds of thousands of TPS, which means workloads can burst very quickly to very high transaction rates.

The third piece of the puzzle in providing this very high performance is a new authorization mechanism. With S3 Express One Zone, you authenticate and authorize requests through a new session-based mechanism, which we call S3 CreateSession, and which is optimized to provide that very low latency.

You use S3 CreateSession to request temporary credentials that provide low-latency access to the bucket. These temporary credentials are scoped to a specific S3 directory bucket and to one of two session modes: read-only or read/write.

Once that call is made and the session is created, a token is returned and used on subsequent requests. Session and token management is fully automated in the latest SDKs, so as a best practice we recommend using the latest SDK so all of this is handled for you.

CreateSession access is granted in the bucket policy. There, you allow the CreateSession action and specify in a condition statement whether the session mode should be read-only or read/write.
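To make that concrete, here's a minimal sketch of what such a bucket policy might look like, applied with boto3. The account ID, Region, and bucket name are placeholders rather than values from the session, and it assumes a recent boto3 release that knows how to route directory bucket requests.

```python
import json
import boto3

# Placeholders: swap in your own account ID, Region, AZ ID, and bucket name.
account_id = "111122223333"
region = "us-east-1"
bucket_name = "my-express-demo--use1-az5--x-s3"

# Allow principals in this account to create read/write sessions on the directory bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{account_id}:root"},
            "Action": "s3express:CreateSession",
            "Resource": f"arn:aws:s3express:{region}:{account_id}:bucket/{bucket_name}",
            "Condition": {"StringEquals": {"s3express:SessionMode": "ReadWrite"}},
        }
    ],
}

s3 = boto3.client("s3", region_name=region)
s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
```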

So to sum up everything we discussed: when you're architecting for high performance with S3 Express One Zone, you first create a directory bucket, which scales to hundreds of thousands of TPS. You co-locate that directory bucket with your compute resources in the same AZ to achieve the highest performance. Then you use the new session-based auth model, S3 CreateSession, which is managed for you automatically by the SDKs.
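As a rough end-to-end sketch of those steps with the AWS SDK for Python, assuming a recent boto3 and with the Region, Availability Zone ID, and bucket name as placeholders:

```python
import boto3

region = "us-east-1"
az_id = "use1-az5"  # Availability Zone ID to co-locate with your compute (placeholder)
bucket = f"my-express-demo--{az_id}--x-s3"  # directory bucket names carry the AZ ID suffix

s3 = boto3.client("s3", region_name=region)

# Create the directory bucket in a single, chosen Availability Zone.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={
        "Location": {"Type": "AvailabilityZone", "Name": az_id},
        "Bucket": {"DataRedundancy": "SingleAvailabilityZone", "Type": "Directory"},
    },
)

# Reads and writes use the same S3 APIs; a recent SDK calls CreateSession behind
# the scenes and reuses the session token on subsequent requests.
s3.put_object(Bucket=bucket, Key="demo/object-1", Body=b"hello express one zone")
obj = s3.get_object(Bucket=bucket, Key="demo/object-1")
print(obj["Body"].read())
```

The only Express One Zone-specific piece here is the bucket creation; the puts and gets are the same calls you already make against general purpose buckets.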

Some use cases will require you to move data from an existing S3 storage class or data lake in an S3 bucket to S3 Express One Zone, and the natural question is, what's the best way to do that? With this launch, we're also introducing single-step Batch Operations, which lets you move data in one step using S3 Batch Operations.

If you're not familiar with S3 Batch Operations, it lets you easily perform one time or recurring batch workloads such as copying objects between buckets and easily scales to millions of requests. With this new feature, you can copy objects from your general purpose bucket to your directory bucket. You can copy millions of objects in just a few minutes.
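For teams that prefer the API over the console Import button, a rough sketch of an equivalent Batch Operations copy job might look like the following; the account ID, role, bucket ARNs, and prefix are placeholders, and the exact CreateJob field names should be checked against the current SDK reference.

```python
import boto3

account_id = "111122223333"  # placeholder
source_bucket_arn = "arn:aws:s3:::my-general-purpose-bucket"
dest_bucket_arn = (
    "arn:aws:s3express:us-east-1:111122223333:bucket/my-express-demo--use1-az5--x-s3"
)

s3control = boto3.client("s3control", region_name="us-east-1")

# One job that both generates the object manifest from the source bucket and
# copies every matching object into the directory bucket.
response = s3control.create_job(
    AccountId=account_id,
    ConfirmationRequired=False,
    Priority=10,
    RoleArn=f"arn:aws:iam::{account_id}:role/my-batch-operations-role",  # placeholder
    Operation={"S3PutObjectCopy": {"TargetResource": dest_bucket_arn}},
    ManifestGenerator={
        "S3JobManifestGenerator": {
            "SourceBucket": source_bucket_arn,
            "EnableManifestOutput": False,
        }
    },
    Report={"Enabled": False},
)
print(response["JobId"])
```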

Alright, now I'm gonna hand it off to Shay to talk more about the use cases and partner solutions with S3 Express One Zone.

Shay: Thanks, Matt. So far we've learned about the performance characteristics of Express One Zone and how some of its new features help your workloads achieve the performance goals we set for the storage class. In this section, we'll look at the AWS and partner services that support Express One Zone and the use cases you can unlock with them. So let's get started.

Our customers depend on a large number of AWS services to unlock value from S3 for their use cases, and that remains true for Express One Zone as well. So we built integrations with key AWS services, which we'll cover now.

First is monitoring and automation, where you can track key storage metrics with services like CloudWatch, track resource activity with services like CloudTrail, and automate your infrastructure with CloudFormation. Like previous S3 storage classes, security management is built in with services like IAM and VPC. And to accelerate time to insight at lower cost, key analytics services like EMR and Athena support Express One Zone today.

Machine learning with Amazon SageMaker is a core use case enabled with Express One Zone. And because customers see the biggest benefit when compute has the fastest access to storage, we have enabled integration with key compute services like EKS, EC2, and Lambda.

And lastly, to continue our focus on developer tools, we have enabled automatic performance tuning in popular developer tools like Mountpoint for Amazon S3, the AWS SDKs (including the AWS SDK for C++), and the Amazon S3 Connector for PyTorch.

We will now cover use cases in detail for four key services where we think you can differentiate on performance with S3 Express One Zone.

First is Athena, which accelerates SQL query performance by up to 2x. Today, you use Athena for interactive querying workloads, where you use familiar SQL commands to join data across several data sources and store the results in Amazon S3 for later use. This provides a flexible way to ingest and analyze data while avoiding complex ETL pipelines. You can now use Express One Zone as a caching layer on top of your long-term data lake. This helps accelerate time to insight for downstream applications while also reducing your request costs by up to 50%.
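As an illustration of that caching-layer pattern, here's a hedged sketch using boto3 and Athena, where the database, table, columns, and bucket names are placeholders: a table is defined over a hot copy of the data held in a directory bucket, and interactive queries run against that copy instead of the long-term data lake.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
results_location = "s3://my-general-purpose-bucket/athena-results/"  # placeholder

def run(sql):
    # Fire-and-forget for brevity; in practice poll get_query_execution for completion.
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": results_location},
    )["QueryExecutionId"]

# Table over the hot copy of the data held in the directory bucket.
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.game_events_hot (
  player_id string, spend double, platform string
)
STORED AS PARQUET
LOCATION 's3://my-express-demo--use1-az5--x-s3/hot-cache/'
""")

# Interactive queries hit the low-latency copy.
run("SELECT platform, sum(spend) FROM analytics.game_events_hot GROUP BY platform")
```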

Next is EMR, where in our testing Apache Spark queries run up to 4 times faster with Express One Zone. With traditional big data processing systems, a consistent pattern we see is that storage is closely coupled with compute, so customers struggle to provision the right amount of local storage for workloads with unknown access patterns. Under-provisioned storage leads to out-of-space errors, whereas over-provisioned storage sits unused and wastes cost.

Another pattern we see is provisioning additional storage to maintain multiple copies of your data to protect against system failures. With Express One Zone, you get high-performance storage that does not require any pre-provisioning and scales independently of your compute workload. In addition, you no longer have to maintain data redundancy yourself, because Express One Zone automatically maintains redundant copies of your data to provide high durability.

So you can now use Express One Zone as the intermediate data store for your big data processing workloads, for Spark workloads that read from and write to S3 Standard today. Performance will be up to 4 times faster when you use Express One Zone, accelerating downstream applications like QuickSight dashboards, machine learning training, and other analytics applications. Your workloads benefit not only from faster data access but also from reduced cost and reduced operational overhead.
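Here's a minimal PySpark sketch of that pattern, assuming an EMR release that supports directory buckets; the bucket names, paths, and column names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("express-intermediate-store").getOrCreate()

# Placeholders: the long-term data lake lives in a general purpose bucket;
# intermediate results go to the directory bucket in the same AZ as the cluster.
raw_path = "s3://my-general-purpose-bucket/events/2023/11/"
intermediate_path = "s3://my-express-demo--use1-az5--x-s3/stage1/"

# Stage 1: heavy scan of the data lake, write the working set to Express One Zone.
events = spark.read.parquet(raw_path)
working_set = events.filter(events.event_type == "purchase")
working_set.write.mode("overwrite").parquet(intermediate_path)

# Stage 2 (and any downstream job) reads the low-latency intermediate copy.
purchases = spark.read.parquet(intermediate_path)
purchases.groupBy("platform").count().show()
```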

Third is Amazon SageMaker, where in our testing machine learning applications run 5.8 times faster on Express One Zone than on S3 Standard in fast file mode. As Matt covered earlier, compute is often stalled by storage access times, leading to spiky utilization of provisioned compute resources. And with deep learning models growing to billions of parameters today, GPUs often drive the biggest portion of your total model training cost, so even small optimizations to your compute utilization can result in needle-moving savings in both training time and cost.

We conducted internal benchmarking tests to train a deep learning model called Vision Transformer for image classification and used about 1 million images as the training dataset. And because Express One Zone delivers consistent single-digit millisecond latency even at P99, we found it resulted in more consistent and efficient utilization of GPU resources, leading to faster job run times and up to 16% savings in training costs over S3 Standard.

So how does this work with built-in machine learning AWS services? Customers often use SageMaker to train their models directly from their training data in S3 and then store their trained models for machine learning inference back in S3. Training a deep learning model with S3 Express One Zone is not just faster, it also helps you utilize your compute efficiently.

So for example, in our internal benchmark tests, training the same Vision Transformer model with Amazon SageMaker fast file mode on Express One Zone was 5.8 times faster than on S3 Standard.
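A hedged sketch of what that looks like with the SageMaker Python SDK follows; the script name, role, instance type, and framework versions are placeholders, and the key detail is simply pointing a FastFile-mode TrainingInput at a directory bucket.

```python
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/my-sagemaker-role"  # placeholder

# Training images live in the directory bucket, co-located with the training instances.
train_data = TrainingInput(
    s3_data="s3://my-express-demo--use1-az5--x-s3/image-train/",
    input_mode="FastFile",  # stream objects on demand instead of copying the full dataset
)

estimator = PyTorch(
    entry_point="train_vit.py",        # your training script (placeholder)
    role=role,
    instance_count=1,
    instance_type="ml.p4d.24xlarge",   # placeholder GPU instance
    framework_version="2.0",           # placeholder framework and Python versions
    py_version="py310",
    sagemaker_session=session,
)

estimator.fit({"train": train_data})
```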

And lastly, Express One Zone is now also supported by Mountpoint for Amazon S3, a new high-performance file client for S3 that we launched earlier this year, where it runs up to 6 times faster than with S3 Standard. Mountpoint translates local file system calls to S3 API calls, so customers can mount an S3 bucket on their compute instances as a local file system for their machine learning workloads and analytics applications. And with Express One Zone, you can now achieve even faster performance for these workloads while reducing request costs and optimizing compute usage.

We understand that our customers use a wide range of services with S3. So we work closely with our AWS partners to seamlessly integrate Express One Zone and unlock the performance and cost benefits for our customers.

Customers use Databricks to build an open data lakehouse to manage their cloud and on premises data. With Express One Zone, customers can get up to 40% improved read and write performance for this use case.

Similarly, a number of S3 customers use ClickHouse, an open source database, for their downstream analytical applications. With S3 Express One Zone, the platform speeds up query performance by 283% and lowers TCO by 65%.

ChaosSearch provides an S3-native live analytics database that facilitates search queries using SQL. They were able to see up to 60% faster query times and substantial infrastructure cost savings with Express One Zone.

Colorfront offers a media editing platform to process and deliver media for their Hollywood and entertainment customers. Starting today, Colorfront customers can improve the performance for their digital processing workloads by up to 70%.

And that's not all. We would like to thank the more than 20 AWS partner services, spanning analytics, machine learning, media editing, governance, and backup, that support Express One Zone to deliver the fastest performance for our customers' workloads.

And with that, I will now hand over to Christy, who will walk us through a live demo of how to get started with S3 Express One Zone.

Alright. So I'm Christy Lee. I'm a storage solutions architect and I'm really excited to show you what you could do with S3 Express One Zone.

Okay, cool. I'll use the laser pointer a little bit and I'll try to bounce between the two screens just so that everyone in the room can see. It's also because I didn't get enough steps while I was here at re:Invent, so you'll have to forgive me.

Alright, you guys ready? Okay. There are three parts to the demo. In the first part, we're going to show you how easy it is to get started. We'll start somewhere familiar, the AWS console. We'll show you how to create a new S3 directory bucket, and then we'll look at how to easily import data into that bucket. Okay.

So, log into the AWS console and browse to S3, hopefully a place some of you have seen before. You'll notice there are two tabs at the top. One is for your general purpose buckets, the S3 that we know and love and that has been around since 2006. But you'll also have a new tab for your directory buckets. Okay.

So let's go ahead and set up a new bucket, making sure that we're in the expected region.

Alright. Once we've confirmed that we're in the region we wish to deploy in, we'll select that we want to create a new directory bucket. This is the first type of S3 bucket where you can specify which AZ you wish to deploy in. We'll go ahead and select AZ number five. Do keep in mind, as the name suggests, this is a one zone storage class, so we ask you to acknowledge that in the event that particular AZ is impacted, you may be unable to reach your data at that time.

We'll pop in the bucket name we would like, and notice that the bucket name gets suffixed with the Availability Zone ID and --x-s3. Okay. By default, block public access is turned on, ACLs are disabled on the bucket, and encryption is on by default as well. And that's all you have to do: hit create, and that's the directory bucket. Okay, perfect.

So now we've got our bucket created. Like all S3 buckets, when you first create one it'll be empty. Looks a little lonely, so let's give it some friends. We'll go back, and you'll notice that if we select the bucket, there's also an Import button presented at the top right. We'll select Import.

The input we need to give it is where we want to import from. This could be a bucket, or a prefix within an existing bucket. The only requirement for your import source is that it has to be in the same Region as the bucket you're importing into, so North Virginia to North Virginia, US West to US West, and so forth. For the demo, we've got a general purpose bucket set up with a sample dataset, so we'll select that as our prefix to import.

We then verify which IAM role we want to use, confirm it's the correct S3 directory bucket we want to import into, and once we're set, we'll hit the Import button.

At this point I'll note that objects in the S3 Express One Zone storage class live exclusively in directory buckets, just as objects in S3 Standard, Intelligent-Tiering, Glacier, and so forth live in your general purpose buckets. So naturally, any objects you copy into a directory bucket will be in Express One Zone. And this is the single-step Batch Operations we mentioned earlier; that's what Import is doing. It creates the Batch Operations job for you, telling it in bulk, I want to move this many objects, thousands or millions of objects, into my S3 directory bucket.

Here's one I completed earlier. It finished in a couple of minutes and easily moved 5,000 objects from that sample dataset. It's a really quick and easy way to get started once you've set up a directory bucket, and just to show you, the dataset did move into the test bucket. So at this point you're ready to start connecting your applications, hopefully having chosen the same AZ for your compute as well.

For the next part of the demo, we're going to do a download performance test: 100,000 objects at 512 kilobytes each. To give you the lay of the land, the top two terminals are connected to one EC2 instance and the bottom two are connected to another, so the top two are one instance and the bottom two are a different one. The top two will be talking to S3 Express One Zone objects in an S3 directory bucket, and the bottom two will be talking to a general purpose bucket with S3 Standard. Just for your information, the instance type we're using is an r6in.32xlarge.

Alright. What I'm showing you here is the config file for how we set up the test. This is a custom Python script we use, and I just want to show you the parameters. We modify the config to make sure it's the test we want to run, in this case a download test, or GET requests. The only difference between the two configs is that one is talking to the S3 Express bucket and the other is talking to a general purpose bucket. On the right-hand side you'll see the network load performance stats, so as soon as we start the test we'll see some graphs come up on the right side.
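The script itself is internal to the demo, but a single-process sketch of the same idea, many concurrent GETs against one bucket with wall-clock and per-request timing, might look like this; the bucket name, key layout, and worker count are placeholders, and a real test would use heavier parallelism across processes or instances.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

# Placeholders: swap in the directory bucket or the general purpose bucket to compare.
BUCKET = "my-express-demo--use1-az5--x-s3"
KEYS = [f"dataset/object-{i:06d}" for i in range(100_000)]
WORKERS = 256

s3 = boto3.client("s3", config=Config(max_pool_connections=WORKERS))

def fetch(key):
    # Time a single GET, including reading the body to completion.
    start = time.perf_counter()
    s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return time.perf_counter() - start

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    latencies = list(pool.map(fetch, KEYS))
elapsed = time.perf_counter() - t0

print(f"total: {elapsed:.1f}s  mean latency: {1000 * sum(latencies) / len(latencies):.1f} ms")
```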

Let me just take a moment to set up.

Alright, we'll hit go. Okay. So it's the exact same test; the only difference is the two different buckets. Okay.

S3 Express is finished; it completed in 5.3 seconds, downloading 100,000 objects from an S3 directory bucket. Note the single-digit millisecond latency: we got an average of eight milliseconds. We'll get the S3 Standard results when it finishes; it's still going, as you can see from the chart on the right. It's still downloading.

Now, I should mention that this is a synthetic test. Your applications will run a little bit differently. There's different ways to scale S3 performance. So this is just to show you what happens when we run from a single EC2 instance talking to S3.

Okay, the S3 Standard test is done. For the same test of downloading 100,000 objects, the average latency is about 80 milliseconds and we got about one gigabyte per second of throughput, whereas with S3 Express we got about nine gigabytes per second of throughput.

For the majority of use cases, we do expect customers to continue to use general purpose buckets with S3 Standard and similar storage classes. But for those low latency workloads, we expect customers to consider whether S3 Express would be a good fit for them.

Let's try another performance test, but this time we'll modify the config file for the S3 Express instance to download 1 million objects. We'll keep the S3 Standard test the same, downloading just 100,000 objects, and see how this goes.

Alright, we'll kick those off, and we should see those network load charts come up. Yep. Okay. If you look at the numbers on the right, I know they might be a little tiny for people at the back, we're pushing about 11 gigabytes per second on the Express instance and about one gigabyte per second on the other. Yep. Thank you.

So for those high throughput workloads, you should see a considerable difference, but it really will depend on your use case. It's something to think about.

Now, I'd expect most customers to distribute their compute when using S3 Express so they can get that benefit of hundreds of thousands of transactions per second.

Okay, both tests are done. Both of them completed in roughly 40 seconds or so, but one was doing 10 times more, 1 million versus 100,000 object downloads, while still keeping that single-digit millisecond latency. I think that's pretty cool as a storage person.

Alright, hopefully you found that as interesting as I did. I also have the TPS numbers in there: 22,000 TPS for S3 Express and about 2,000 to 3,000 for S3 Standard.

Alright, let's move on to the third section of our demo. Shay mentioned some of the benefits for our analytics services, including Amazon Athena.

With Amazon Athena, we're going to try something similar: we've got a dataset in S3 Standard and the same dataset in S3 Express One Zone, and we'll see how they do. For this demo, I have a sample dataset of about 100 gigabytes; the only difference is that it's mirrored and stored in two different buckets.

We'll set up our table so that we can query it in just a moment. It's the same config for CREATE TABLE; the only difference is pointing to a different S3 bucket.

Okay. Once that's set up, we'll pop in our queries. We just need to make sure we're using the right table here: looking at the screen, S3 Standard is here and S3 Express is here.

Alright. For the first query, we'll do a simple one. The dataset we have is sample game data, so we're going to look at transactions for customer spend by gaming platform. It's the exact same query; the only difference is that we're querying two different tables, and those two tables point at the two mirrored datasets.

So we'll let this run; it shouldn't take too long.

Alright. For the S3 Standard table, we see that the run time for the simple query is about 16 seconds on that 100 gigabyte dataset, and for S3 Express it took about 12 seconds on the same 100 gigabyte dataset, roughly a 33% improvement, or four seconds. Let's try a slightly more complex query, or at least the most complex one that I can write.

So this time we're still going to do SELECT statements, but we're going to do a table join.

Alright, same idea, same data. We'll let those run. Okay, S3 Express is finished, and S3 general purpose is done.

Okay. So with the same data, S3 Standard finished in about 20 seconds and S3 Express in about 12 seconds for a slightly more complex SQL query.

So I imagine customers would apply this to all sorts of use cases when it comes to analytics. But it's really where you have those performance-critical workloads that require low latency that it's worth considering whether S3 Express One Zone would be a good fit for your environment.

So speaking of use cases, I would like to welcome our co-presenter, Ambud Sharma.

Ambud: Thanks, Christy. How's everyone doing? Awesome. Hey, so let's quickly take a look at what Pinterest is first, before we dive into S3 Express.

So Pinterest is a visual inspiration platform for discovering and shopping the best ideas in the world. It's basically like a digital bazaar, and I frequently use Pinterest myself for my woodworking hobby as well as home improvement.

So with that said, let's see what makes Pinterest work behind the scenes: Pinterest infrastructure. Here's a quick overview of what we're doing. We have over 300 services built on a shared services stack and a lot of open source technologies, running on tens of thousands of EC2 instances with exabyte-scale data in Amazon S3. And having been born in the cloud has taught us a few things over the last decade.

First and foremost, stateful and stateless systems scale very differently. Stateful systems often require data replication, rebalancing, and repair, and doing so can take hours or days depending on the size of the data. Therefore, to achieve scalability in the cloud, it's imperative that we separate storage and compute. It's also interesting that doing so has an impact on efficiency.

Running custom replication in the cloud can be expensive. And last but not least, Amazon S3 provides some very powerful primitives which allow us to gain efficiency from economies of scale.

So these are four lessons we have learned over the years. And over these years, we have also seen two evergreen challenges for Pinterest. The first is scalability, as our scale keeps growing, and the second is efficiency.

So scalability, let's define that: for us, that is the time to serve business needs. How long does it take to scale up our services? How long does it take to launch a new feature, or if you have an existing feature, how long does it take to scale it up? And the second part is cost, the cost of doing business, and it really matters to us.

So can we do both, have scalability and efficiency at the same time? And how does that impact both our operational and infrastructure expenses? When we talk about cost, it's not just about infrastructure cost; there's an operational component to it as well. How long does it take to do operations on a cluster?

So let's take a look at these challenges through the lens of a use case at Pinterest, a platform we call our data ingestion stack. This powers our ads, our home feed, and much more. Basically, if there's data flowing through Pinterest, it's going to go through this stack. From a scale perspective, we process about 80 million events per second; that's about seven trillion events a day.

And that's pushing over two petabytes of compressed data through the pipeline. At the heart of this entire architecture is a publish-subscribe system: producers and applications generate and store data, the pub/sub system replicates it and provides durability, and consumer applications consume the data in real time, near real time, or in batch in some cases.

So we asked ourselves, we have to build this scalability; how do we do that better, and could we use the powers of Amazon S3 to do it? The answer was yes. In 2018, we started on a journey with an idea: could we use Amazon S3 to natively power a pub/sub system?

This means the pub/sub system would use S3 as the replicated data storage engine. And because the fundamental design decouples the storage and serving components, it allows us to have on-demand scalability, as the team covered previously in multiple slides. The idea is that because the data is sitting in Amazon S3, we don't have to worry about data replication or rebalancing; all of that is nearly instantaneous, all of that is a metadata operation. And we built a system that we call MemQ.

And with MemQ, we are able to power our machine learning pipeline, specifically our training data, which by volume is about 50% of the total pipeline we're talking about, in terms of both data volume and event volume. We're able to do that to satisfy business needs where machine learning engineers may be working on a new feature: when that feature gets added, the pipeline spikes to 2 to 3 times its normal volume, and they may be doing validation, which again causes on-demand spikes in the data. Doing that in a traditional system would be very challenging because you have to ramp up capacity, scale, and rebalance. With MemQ, it's literally a metadata operation that happens in seconds.

So with those powers and capabilities, since 2018 we have been able to scale this environment to run 50% of our total pub/sub traffic. But there is a trade-off; it sounds too good to be true, right? The trade-off here is latency, because we have to buffer data to be able to write efficiently to Amazon S3.

We buffer so as to use S3 more efficiently, because, as we saw, there's a difference in the request rates that Amazon S3 Standard can serve versus what Express One Zone can serve. That's the difference we're talking about. To add more color to it: this is totally fine if you're doing batch data analytics or offline training. You just want to make sure the data is durably stored in your environment and can be served to your offline processing applications on demand.

So what we tried to do here is say, okay, this is all good and great, but could we power sub-second storage with this? Well, the answer now is yes. With Amazon S3 Express One Zone coupled with MemQ, we're able to take the efficiency of MemQ, which is a 90% cost reduction, apply it in the Amazon S3 Express One Zone low latency case, and run sub-second pipelines through it.

How does that work? Well, all we had to do was apply code changes to accommodate the new session token, and pretty much the rest was literally a drop-in replacement.

So let's see some performance data. But before we get into the performance analysis, we should also look at what our test setup looks like. What we tried to do is use our standard logging pipeline, which is the most common use case for applications at Pinterest: a microservice generating log data. We treat pretty much all of our data as log data.

We take that pipeline and configure it with a one megabyte object size. The reason we use one megabyte in this case is that we were trying to compare to S3 Standard, and we had done tests in the past for S3 Standard with one megabyte objects, so we had good historical data to compare the results against.

The second thing is we ran it at a throughput of about three gigabytes per second, which was representative of our large use cases. And the KPIs we care about in this case were the latency distribution, specifically last-byte tail latency.

Your writes are not durable until you get an acknowledgement, a 200 response, from the storage system, and you can't acknowledge the data back to your client until that happens. Therefore, last-byte latencies are very, very important. And the second part is consistency of throughput: if the system is jittering everywhere, you cannot predict memory requirements because your buffering requirements will be all over the place.

So let's see what the results look like. Here's a graph of our throughput over time during our testing. What you're seeing is the inbound and outbound traffic on the MemQ cluster, and the brief spike can be ignored because that was client rebalancing.

What you can take away from this chart is the consistency: throughput is nearly constant and jitters very, very little. That speaks to the performance capabilities of Amazon S3 Express One Zone.

Let's look at the latency, the tail of the distribution, the P99 latencies. We see them hovering around 30 milliseconds for the test that we ran. There's a nuance here: because of certain constraints, we were not able to run the tests in the same Availability Zone, so this is actually going across Availability Zones, and you can still see the flat 30 millisecond latency curve here.

The other thing to mention here is that we care about PUT latencies; as I said, we really care about last-byte PUT latencies, so that we can tell the producer, hey, you're okay to keep moving forward and writing more data because we have written it durably to the end storage. For the same setup on Amazon S3 Standard, MemQ sees latencies around two seconds.

And as you can see, this is around 30 milliseconds. There's a remarkable difference in performance here.

Let's see what the end-to-end latencies look like, because that's what the application sees. The 30 millisecond latency for PUTs is great from MemQ's point of view, but what does the end-to-end latency look like?

We saw consistent sub-300 millisecond latency for this test, which again speaks to the consistency of performance that we see.

So what's the impact of this? Well, first, it eliminates the I/O and storage capacity planning that was mentioned before, and the time it takes to provision things: how long it takes your EC2 instances to come up is pretty much the time it takes for MemQ to scale up and start serving new workload.

In addition, you also see TCO improvements. In our case, for the test we shared and the use cases we're concerned about, we see a reduction of up to 40% in the total cost for the use case. On-demand scalability for the use case improves by over 50%, because there's less operational cost to have engineers go provision things, which lets them serve more workload through the pipeline.

So what's next? We want to roll out MemQ on Amazon S3 Express One Zone; as you know, the service was just announced. We also have a few other use cases. Since we came up with this design pattern in 2018, there are other areas where we can apply the same pattern of buffering and decoupling storage and serving.

The first one is shuffle service, the Spark shuffle service. You have shuffle data being generated, and as you saw, if there are out-of-disk errors you would have jobs retrying, and that's bad because it adds to your total cost of ownership.

The second one is dataset caching: when you have hot data and you want quick access to it, quick access to pre-materialized insights, we want to potentially leverage this technology.

And last but not least, when we have extra-large values in our key-value store, it makes total sense not to store those in memory; we can use this new technology to do that, and that's Amazon S3 Express One Zone. With that said, I'll pass it back to Shay for a quick recap of the entire presentation.

Shay: Thanks, Ambud. To wrap up, we want to cover three key takeaways. First, S3 Express One Zone is a new high performance storage class that delivers single-digit millisecond response latencies for your most compute-intensive applications and is up to 10 times faster than S3 Standard.

Second, to achieve the performance goals for this new storage class, we're introducing three new features: directory buckets for request-intensive workloads, the CreateSession API for faster authentication and authorization, and a single-zone architecture to co-locate your storage and your compute.

And third, so that your workloads can automatically start using the storage class, Express One Zone is natively integrated with a number of AWS and partner services.

Thank you for attending this session and taking the time to learn more about S3 Express One Zone. We really value your feedback, today and throughout the year; it helps us build these products. We hope you have a great rest of your re:Invent.
