AWS storage: The backbone for your data-driven business

Andy Warfield said:

"Please welcome Vice President and Distinguished Engineer AWS, Andy Warfield. Thanks for coming. I am so excited to be up here and to be giving the storage innovation talk this year.

My name's Andy, I'm an engineer, and I work across our storage teams. I've been at AWS for about 6.5 years. Prior to that, I was involved in a couple of startups and I've been a university professor, but the last 6.5 years have been really, really incredible for me. I love working with our storage teams, and today I get an hour to tell you about the work that our storage teams do and the ways that our customers use our storage products. I'm really excited to do that.

And so in today's talk, I thought what I would do is walk you through a set of stories about how storage works at AWS and how our teams build stuff.

One thing that I'd really like to start by emphasizing is that, unlike other aspects of our services, our storage services don't tend to get a lot of really big, exciting launches. We tend to be behind the scenes, underneath a lot of other product launches and a lot of other really cool features that are built over AWS. And working with our teams, I'll tell you, the teams really kind of like that. There is a humble pride, I think, in building services that continuously improve year on year.

And so internally, as storage builders, we really look at our services in terms of these fundamentals: durability, security, availability, and performance, and we're constantly focused on improving those aspects of our systems. And that constant focus, that continuous and largely invisible innovation inside our systems, is something that I think is probably one of the most wonderful and most interesting aspects of our storage services.

So that's a nice term to talk about, continuous innovation, but it's probably a lot easier to relate to if I make it concrete. And so I'd like to start with one of our services. Has anyone here ever heard of EBS? It's one of our storage services.

EBS, our block storage service, turned 15 this year. And so I wanted to spend some time in today's talk telling you about EBS, and I sat down with the team and we put together some content about EBS that I think you're really going to like. First of all, EBS is kind of an engineering marvel at this point, right?

EBS serves over 100 trillion I/O operations every single day, more than 390 million EBS volumes are created every day, and EBS transfers over 13 exabytes of data every day. The activity in EBS storage alone is just astounding. And so, in putting together a bit of a story about continuous innovation in EBS, I asked the team to go and look at how long the longest-lived volumes inside EBS have been in service. Anybody want to guess how long the longest-lived EBS volumes have been in service?

It turns out that there are a sizable number of volumes that were created the day after EBS launched that are still in service today. And so there are volumes still serving virtual machine workloads, still taking I/O, still actively running workloads on top of EBS, that were created 15 years ago. It's fun to think about the continuous innovation in the service through the lens of those teenagers, right? They're almost old enough to get their driver's license at this point.

And so as we go back early in the history of EBS, the product page kind of looked like this. If we think about the first few years of the service running and how EBS was built, all EBS volumes were hosted on top of hard drives. This is the inside of a hard drive. And this makes sense at the time: it's, you know, 2008 to 2012, and this is still kind of the dominant media for storage in data centers. And hard drives are kind of interesting, right?

Inside, they're like record players: you have this armature, and the armature moves back and forth. Virtualizing a hard drive is not the same as virtualizing a CPU. Virtualizing a CPU is reasonably straightforward, not trivial, but virtualizing a hard drive is different because of the mechanical motion of the arm. Hard drives go really, really well if you're reading in a straight line, but when you do a random access, the arm has to move, the disk has to spin, and you waste milliseconds waiting for your data to become available. Which is why, for like 40 years of file system design, file systems have worked really, really hard to lay out data on disk so that the arm moves as little as possible.

And so with a hard-drive-based block virtualization layer, which EBS was, as soon as you add a second workload, things stop going as well. It gets a lot harder. It's a more than linear decay, because those workloads interleave and you end up seeking like crazy. And so this was the reality approaching 2012 with EBS: the performance wasn't where the team wanted it to be, and it was a structural problem with the mechanics of these hard drives.

And so this is a great story of reinvention inside AWS, and it's one that has tons and tons of examples through the years. The EBS team sat down and thought about what they would do. They looked at deploying SSDs, but SSDs at the time, still relatively early days, were very expensive. It wasn't realistic to deploy SSDs for the entire EBS fleet. We had to be more surgical than that.

And so what the team did was make a software change to the design of EBS to move the entire write path first onto SSDs and then write from SSD to hard drives asynchronously after the fact. By moving those recent writes onto SSDs, we could deploy a smaller amount of SSD into the fleet and still get a lot of value out of it. But also, since storage requests tend to exhibit a lot of locality in time, a lot of temporal locality, we often hit the data before it was written out. So you get this reorganization where a lot of the traffic can move to SSD, and we can, at our leisure, delay writing it out to hard drives and make good decisions about where the data is placed on the hard drives. So a really good change.
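
To make the pattern concrete, here's a minimal toy sketch of the idea (not EBS's actual implementation): writes land on a fast staging tier and are acknowledged there, a background thread destages them to the slow tier in a layout-friendly order, and reads check the fast tier first so temporal locality pays off. All of the names here are made up for illustration.

```python
import threading
import time
from collections import OrderedDict

class WriteStagingCache:
    """Toy write-back staging tier: writes land on 'ssd', a background thread
    later destages them to 'hdd' in block order, and reads hit the fast tier
    first so recently written data is served without touching the slow tier."""

    def __init__(self, destage_interval=0.1):
        self.ssd = OrderedDict()      # recently written blocks (fast tier)
        self.hdd = {}                 # long-term home for blocks (slow tier)
        self.lock = threading.Lock()
        self._stop = False
        self._destager = threading.Thread(
            target=self._destage_loop, args=(destage_interval,), daemon=True)
        self._destager.start()

    def write(self, block_id, data):
        with self.lock:
            self.ssd[block_id] = data          # acknowledged once it is on the fast tier

    def read(self, block_id):
        with self.lock:
            if block_id in self.ssd:           # hot data is still on the SSD tier
                return self.ssd[block_id]
            return self.hdd.get(block_id)      # otherwise fall back to the HDD tier

    def _destage_loop(self, interval):
        while not self._stop:
            time.sleep(interval)
            with self.lock:
                dirty = sorted(self.ssd.items())   # pick a disk-friendly order
                self.ssd.clear()
            for block_id, data in dirty:
                self.hdd[block_id] = data          # sequential-ish writes to the slow tier

    def close(self):
        self._stop = True
        self._destager.join()

cache = WriteStagingCache()
cache.write(42, b"hello")
print(cache.read(42))   # served from the staging tier
cache.close()
```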

However, deploying this into the EBS fleet, the software was only the first step. We had to qualify the environmentals and make sure there were suitable thermals to be able to add new hardware into these servers. We had to make sure we had a plan to do it safely and transparently underneath the fleet of all these running virtual machines. There was tons and tons of preparation and planning that went into this. It's effectively changing the engines on a plane while it flies. And so over three months, we managed to install SSDs into tens of thousands of EBS servers.

Now, this is the bit that I'm really excited to share with you, because as I was talking to the team about this (it was before my time), incredible stories came out about the innovation that folks did, as they always do when you ask about this stuff at Amazon. This is one of the SSDs, and that's velcro. It's heat-resistant velcro, which is a thing I didn't know you could purchase prior to preparing this talk.

There weren't 3.5-inch drive slots inside these servers, and we had to figure out where to put these SSDs. So that's the SSD, and that's the SSD inside the server. And this was how EBS initially adopted solid state storage. Now, none of these servers are still in the EBS fleet; in fact, we're a few generations past these servers, but this is how EBS moved to solid state early on in its evolution as a product.

Now, that's EBS four years in, and there was a lot of innovation that happened prior to that. But since then, we've innovated across the stack. And this is true of all of our products; I could tell you a similar story for S3, and probably a similar story for EFS. But if you really want to see it in EBS, let's just look at a write request.

So there's a VM in the top left, and there's a write leaving the application. It flows through the operating system and hits the block device interface, which thinks it's talking to a local SSD, but it's not; it's talking to EBS. We use Nitro: the request moves out through Nitro, over the network. The request is replicated to a few other EBS servers, acknowledged quickly after being written to persistent storage, and then returned to the client. All of this happens in under a millisecond.

I mean, I would happily spend the rest of the afternoon telling you about all the pieces of this; it's so neat, all the levels of things that have happened inside EBS. But just to call out a couple: the Nitro system underneath EBS is so remarkable, and it gives us such a powerful posture for storage.

We never want to see customer data. It's absolutely critical, and we're ruthless about not ever seeing customer data. But the OS thinks it's talking to a local SSD, and it's writing in clear text over that wire. Nitro gives us a posture that is secure and local to the host, where we can encrypt that data before it ever leaves the host. It's a really, really powerful posture.

At the same time, Nitro is not stealing resources from the CPU that the instance is using, which means that all of those resources are the customer's to use. The SRD protocol is something that Peter talked about last year: it's a transport protocol that we use in our data centers in place of TCP, designed for low latency and to quickly route around failures.

Nitro SSD and Graviton, I'm sure you're familiar with. Nitro SSD is an SSD that we built from scratch at AWS, building the flash translation layer ourselves on top of commodity flash media. So over the course of those 15-year teenage volume lives, the amount of improvement that's happened in the experience of running an EBS volume is incredible: from 100 IOPS, 100 operations per second, at launch to over 400,000 IOPS per volume on our fastest instances today. It's just unbelievable, the level of improvement across the lifetime of these volumes.

In November, we made EBS io2 Block Express available on all Nitro-enabled instances. io2 Block Express is our fastest volume type; it's capable of incredible performance and latency, and it is our most durable volume type. It's a SAN-style volume.

And so that's the theme of all of our storage services, with EBS as an example: continuous, seamless innovation over the entire lifetime of these services.

I'd like to switch gears now and talk about performance a little bit, and I'm going to talk about performance through the lens of S3. S3 is our oldest storage service, 17 years old. S3 has been storing customer data, and has increasingly served as a basis for data lakes and applications, for almost two decades now.

And a thing that we've noticed, especially over the past three or four years, is that we're being pulled closer and closer to applications. Customers really want to use S3 as a primary data store. And so over that period of time, the performance asks that customers were making kind of changed.

Originally, S3's performance profile was that of an archival store; it then shifted to being a throughput-focused store where customers run huge parallel analytics jobs. And this is a pattern that we see very actively today: hundreds of terabytes a second into single workloads on S3.

But now, as we move further, customers are asking us to move beyond the shipping-truck kind of throughput to the bicycle-courier kind of low latency: fast, interactive performance from storage. This throughput-versus-latency aspect of storage is a little bit of a nuanced topic, so I want to explain why it really matters and how it's interesting.

Here's a simple example of querying a table. We can think about this as looking through a Parquet file. I've drawn my disk and my CPU. If I implemented this in a really basic way, I would read from the disk, take the data from the disk, process it on the CPU, then read my next data from the disk and process it on the CPU. And as you see, I'm idle half of the time. I'm not making great use of my resources, because the CPU is always waiting on the disk.

Now, if things were perfect, I would fully pipeline this: I would run all of my transfer from the disk in the background and process it as I went. This is kind of the ideal for making excellent use of storage. And if you think about machine learning workloads, which we'll talk about later, and expensive GPUs, this is absolutely what you want: you want to keep the GPU busy, because that's the most expensive part of the system.
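
Here's a rough sketch of that fully pipelined pattern: a small thread pool keeps a few reads in flight while the CPU processes the current chunk, so compute and I/O overlap. The fetch and process functions below are stand-ins, not anything from a real workload.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(chunk_id):
    """Stand-in for a storage read (e.g., a ranged GET); sleep mimics I/O latency."""
    time.sleep(0.05)
    return bytes(1024)

def process(data):
    """Stand-in for CPU work on the fetched chunk."""
    return len(data)

def pipelined(chunk_ids, depth=4):
    """Keep up to `depth` reads in flight so the CPU rarely waits on storage."""
    results = []
    with ThreadPoolExecutor(max_workers=depth) as pool:
        futures = [pool.submit(fetch, c) for c in chunk_ids[:depth]]
        next_idx = depth
        for i in range(len(chunk_ids)):
            data = futures[i].result()            # this read has usually finished by now
            if next_idx < len(chunk_ids):         # top the pipeline back up
                futures.append(pool.submit(fetch, chunk_ids[next_idx]))
                next_idx += 1
            results.append(process(data))         # CPU work overlaps the in-flight reads
    return results

print(sum(pipelined(list(range(32)))))
```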

And so the challenge with getting to that fully pipelined ideal is that there are data dependencies in the storage. If I go back to that example of a Parquet file: at the end of the Parquet file there's a footer, and the footer is the metadata for all of the rows and columns of data inside the file. So I have to read the footer first; actually, I have to read a pointer that's after the footer to figure out where the footer starts. Then I read the footer, and then I can go read my data. Those yellow blocks inside the data represent those pointers. And I can only pull forward that pipelining of requests to the point that I know which data I need to read next.
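
Here's a minimal sketch of that dependency chain against S3, using ranged GETs and the actual Parquet layout (the last 8 bytes of the file are a 4-byte little-endian footer length followed by the "PAR1" magic). The bucket and key names are placeholders, and parsing the footer itself is left to a library.

```python
import struct
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "tables/events/part-0001.parquet"   # placeholders

def ranged_get(bucket, key, start, end):
    """Fetch bytes [start, end] of an object with a ranged GET."""
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return resp["Body"].read()

# 1) Find the object size; we can't even locate the footer without it.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

# 2) The last 8 bytes are a 4-byte little-endian footer length plus the "PAR1" magic.
tail = ranged_get(BUCKET, KEY, size - 8, size - 1)
footer_len = struct.unpack("<I", tail[:4])[0]
assert tail[4:] == b"PAR1"

# 3) Only now do we know where the footer (the file metadata) lives...
footer = ranged_get(BUCKET, KEY, size - 8 - footer_len, size - 9)

# 4) ...and only after parsing it do we know which column-chunk byte ranges to read.
#    (The footer is Thrift-encoded; a library like pyarrow handles the parsing.)
```

Each step depends on the result of the previous one, so the only way to shrink the serial part of the chain is to make each read return faster.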

And so this is a challenge. To speed this up, the best I can do is make my reads faster: if I reduce the latency of my reads, I can get closer to that pipelined performance. If you look inside those reads, there are really two things happening. The second thing is the obvious one: transferring the data back to you on the read. The first thing is everything else: requests being queued waiting to be submitted to storage, metadata lookups, authentication and authorization, and network latency, all of the work of getting to the data. But if we can speed all of that up, what happens is (cue some cool animation) the whole thing pulls in and our workload goes faster, and the faster I can speed up those requests, the faster my workload goes.

And so reducing latency really, really helps workloads that have a lot of data dependencies, and it also really, really helps workloads that have humans involved in them. If you have creative professionals, folks editing video, clicking the mouse and waiting for data to come off of storage, they're very, very sensitive to latency.

And so the lower you can make latency, the more responsive your applications are for those folks. Those are the two things. A few years ago, one of the other engineers in S3, Seth Markle, and I started looking at this problem. We were being asked a lot by customers whether there was something we could do with S3. So we did a pile of prototyping, looking at what it would take to make S3 run in a better domain for latency. We learned all sorts of interesting stuff; it was surprisingly more complicated than I think either of us expected when we started. One thing we learned was that S3's regional design worked against latency. I guess this is intuitive, but it wasn't to us at the time: there are a lot of internal round trips inside S3 to make sure the system is functioning at a regional level. And so we realized that if we wanted to build something with really, really responsive latency, we were going to have to move to a single-zone product.

We found out that inside the request processing path, there was a lot of work that happened on every single request, and there was an opportunity to pull that out into a session-level set of state that we could establish with the SDK up front and make requests go faster. So all of those artifacts that I mentioned on the read side were things we could tackle. And so today, as you heard in Adam's talk this morning, we're delighted to tell you about this new offering from S3: S3 Express One Zone. This is a single-zone S3 bucket type that has about 10 times lower latency than regional S3, single-digit millisecond latency, and the ability to do millions of requests per minute.

There are three really big properties that I can tell you about for One Zone. The first one, and this is probably the biggest aspect of it, is that we're launching directory buckets. This is the first new bucket type we've ever had in S3. Directory buckets are a reinvention of how we work with metadata and of the session-level protocol for talking to the data in S3, and they're designed for very high-TPS workloads that need very low latency. As I mentioned, we're also delivering this as a single-zone product, which gives you the ability to co-locate compute next to the bucket in the zone where you place it. The bucket is still as available as anywhere else in terms of sending requests, but it's a single-AZ construct, so it's only as available as the AZ itself. And finally, One Zone is built on top of high-performance storage media.
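
As a rough sketch of what using this looks like from an application, here's a simple latency comparison between GETs against a regional bucket and a directory bucket. The bucket names (including the availability-zone suffix convention on the directory bucket name) and the key are placeholders, and this assumes a recent SDK version that handles the Express session-based authentication transparently.

```python
import time
import boto3

s3 = boto3.client("s3")   # recent SDKs handle Express session auth behind the scenes

REGIONAL_BUCKET = "my-regional-bucket"                     # placeholder
EXPRESS_BUCKET  = "my-express-bucket--use1-az4--x-s3"      # placeholder directory bucket

def mean_get_latency_ms(bucket, key, n=100):
    """Issue n GETs and return the mean latency in milliseconds."""
    start = time.perf_counter()
    for _ in range(n):
        s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return (time.perf_counter() - start) / n * 1000.0

key = "features/part-00000.parquet"                        # placeholder
print("regional:", mean_get_latency_ms(REGIONAL_BUCKET, key), "ms/GET")
print("express :", mean_get_latency_ms(EXPRESS_BUCKET, key), "ms/GET")
```

The application code is identical for both bucket types; only the bucket you point at changes.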

Let's talk about what this means from a workload perspective. This is a machine learning image training workload using the ImageNet data set. The workload runs for about 15 days if you do it end to end. On the left, I'm showing you the workload running against S3 regionally, and on the right, against S3 Express. The plot at the top of the graph is the CPU; the CPU is really busy in this workload because it's moving a lot of data back and forth. The GPU is the pink plot, and the valleys you see in the GPU on the left are the GPU waiting for storage: the GPU is outrunning the transfer of these images in for it to process. And so by moving to a lower-latency store, you see the GPU become more utilized, and it shaves a day off of that 15-day run time and 16% off the cost of this workload, because you're running the whole job for less time.

Now, we can answer really important questions with this kind of insight, like: is that a chihuahua or a muffin? The storage teams have this amazing set of SAs, and there's a team within them that does workload validation. They work closely with customers to understand the performance dimensions of storage workloads. They sat down and put together this S3 Express demo of running this classifier against both types of storage, and you can see very visibly how much faster we're able to work through the data without changing any of the software. We just changed the bucket that we were working with.

Pinterest has had early access to this, and they've been working with it for a recommendation engine inside their system. Their experience is that their tasks run about 10 times faster and they've been able to reduce costs by 40%. If I step back, I'm borrowing a visualization here from computer architecture texts: this is the memory hierarchy like you might see for a CPU, with registers and cache at the top and then main memory; I've just carried it down to cover the entire storage stack. As you can see, we've got a portfolio of storage products that you can think about from a latency perspective. And this is really how we're approaching things: we want to make sure that you have the right tool to bring to whatever job you're trying to do on top of our storage.

Now, of course, there are other dimensions we can look at this with. Glacier Deep Archive, at 12 to 48 hours retrieval, costs about a dollar per terabyte. It's an incredibly effectively priced product; there are durability tradeoffs and so on. But in terms of latency, there's a really rich set of tools you can bring to your storage workloads, and that's where we'll keep going: performance for every workload on top of our services.

Now, at this point, I'd like to give you a little break from me and invite Marianne Johnson out. Marianne is the Chief Product Officer at Cox Automotive. She and her team have been working at Cox, I think, for almost six years now. Her story, and the work that she's done at Cox in terms of cloud transformation and working with all of the various Cox businesses, is absolutely incredible. I was so happy that she agreed to come and tell you about their experiences. So let's have Marianne come out now.

I'm thrilled to be here today with all of you. I'm going to share some thoughts about the Cox Automotive journey in just a few minutes, and it is about our cloud journey and our data-first journey. I'm going to share a few things about the journey and the steps that we took, a lot of things that we did that we found were successful. I also want to tell you some of the tried-and-true measures that we've been able to use to get to where we are today.

But first, let me tell you a little bit about Cox Automotive. We are roughly a $9 billion company, part of Cox Enterprises, and we serve the entire automotive ecosystem through software, through services and many different types of capabilities. We've grown through acquisition; we are over 70 different acquisitions. And in those acquisitions, you can only imagine what came with that: we had different skills, different levels of maturity, different types of tech stacks, and it was quite an interesting ecosystem. In that environment, we were able to start a transformation journey, and we made a decision in 2018 to run like one Cox Automotive. And you say, what does that mean? Well, it really means that we went from a product-centric organization to an enterprise-type organization, meaning that we wanted to be solution-focused, and the journey that we've been on has enabled us to do that, to actually help our customers thrive in a marketplace that is changing very rapidly. And today we have, across all of our brands, solutions that are helping us do just that.

Now, our journey started before I actually even started with Cox. I started in 2018, and on my very first day on the job, I found myself in Seattle at the AWS offices, because just before I started, the company made three key decisions. One was that we needed to get out of the 53 different data centers that we had and go to the cloud. So there was a decision there. The second decision was that we needed to implement agile as a business methodology. And the third was that we needed to be way more intentional about our data. When I came back from Seattle, it became very clear to me that I had to resell the leadership team on why we needed to go to the cloud. It wasn't just about taking expense out; it really had to be about having a nimbleness, an agility to grow. So that was a big deal.

So we started on our journey. We talked about our data back in '16 and '17 being swampy, and we really had to get out of that and be super intentional about it. The levers that we used have really been able to get us to where we are today. What's been interesting is where we are now: four years later, we're all in on AWS. We have over 500 workloads in the cloud. We have our data in a marketplace, in one component that we call DriveQ, and it has ubiquitous access across our entire ecosystem. Those have been keys to unlocking value.

Just to give you a feel for the kind of productivity that's come out of these moves: on average, in our ecosystem, we launch anywhere from 20 to 30 new products a year, we do over 10,000 feature enhancements, and we do over 80 new integrations every year. We wouldn't be able to do that if we had not made this journey and gone all in on our cloud, and our data journey was key to that success. We had to do that not only with AWS, but with the partners that were part of the ecosystem there; that was key. And if we think about the data, you heard a lot this morning about how important it is to be able to go fast and create differentiation: you have to be super intentional about your data. We did all the things you have to do: how you ingest, how you acquire, how you take care of data exhaust, how you catalog it, index it and make it available. That changed even our model delivery. We do a lot of model delivery: we went from quarters to months to days, and now we're deploying new models even faster. That's a significant change. One of the keys to our success was being nimble in how we did it, asking ourselves where we were getting in our own way, and being willing to change those things and level up. I heard the phrase this morning, "race to the top." I wrote that down; that's sticking with me. How do we level up? That's been key to our journey.

Now, I want to bring it to life for you a little bit more. It isn't just about those data centers. Remember I said 53? We retired 50 of those data centers; we only have three now, and those are for the environments that are not fit for purpose for the cloud. We are all operating in the cloud, but we had to think about how to maintain the right amount of cost elasticity while turning on speed and innovation. And that's been a big part of what you see here. It's personalization that is now available through our data. It's also advanced imaging, which plays a part in our ecosystem. And it's also about conversational AI: how do we engage with our consumers? How do we engage with our customers? All of those things are happening now through recommendations, if you've ever been on Kelley Blue Book or Autotrader, and in how we help our retailers connect with their consumers. It's all happening now, and it's been made real. Those are just a few examples, but at the end of the day, we couldn't do any of them if we weren't that intentional about our data. It all starts with the data.

Now, once you have the data, you can apply your advanced techniques, whether it's machine learning, natural language processing or computer vision. Those are all forms of artificial intelligence, and they give you the right jump-off point to go after gen AI. Now I want to give you a couple of examples.

We talked about personalization a second ago. I don't care what business you're in, whether it's B2B, B2C, B2B2C, personalization is a big deal.

We created a product called MagicQ through our AI technology and our data, a feature that's embedded in those e-commerce platforms. If you were buying a vehicle, it would only need five pieces of information from you to do a soft credit pull, and we could tell you exactly what your monthly payment would be and that you could get financed for a loan for that vehicle, for sure, so you would have confidence in that. Then it becomes all about making it personalized to you: what vehicles are you interested in? That's all data driven.

And now we know what you can afford. So we have the ability to get a consumer to buy a vehicle and understand their affordability and their interest: those that engage with MagicQ are 10 times more likely to put that vehicle in their shopping cart, and those that do that are 2 times more likely to actually purchase the vehicle.

So it's a big deal when you're leveraging your data. Now, what do you do when you have access to millions and millions of vehicles? You image them and then you use our patented technology to turn that into a digital twin. And that is what we've done with our Fixed Imaging Tunnels and our Mobile Capture.

It's our patented Fusion technology that has set a new standard in the industry for vehicle imaging, and it's an industry first. Our FITs, as we call them, the tunnels, use 50 different cameras and over 90 different light panels to produce over 2,500 images, which allow us to produce 90 images that are assessed, from which we pick the 12 best images.

And what happens from that is those images are so high res you can see road rash on the vehicle, you can see a scratch on the tire, you can actually read the text on the side rim of a wheel. So it gives so much confidence - if you're the seller of the vehicle you know what you need to do to recondition it and what your return on investment should be for that vehicle. If you're a buyer, it gives you the confidence of the actual condition of the vehicle. It gives you comfort in the purchase you're making.

And just one more example to bring it to life, because we've done so much with AI. AI is not new to us, and with what we've done with our data, we've been able to step into what I call pragmatically aggressive. I'll say that again: pragmatically aggressive on gen AI. We've been able to do that, and we're already starting to see some differentiation in solutions that let us serve our market better.

And I will tell you that because we've done what we just talked about, the experimentation is happening rapidly. So we're really excited about the potential promise of what we're seeing there. So AI practices are now leveled up. I have a lot of engineers, I have over 100 here at the conference today, and they're all getting access to be able to play and use and learn rapidly.

So just a couple of examples, we do a lot of contact communication with our customers and consumers, but we also believe that gen AI is going to be able to make that communication more relevant, more timely and more personalized. So we're pursuing that.

We think about what we do to manage search engine optimization, SEO, and we believe that we can do that with higher quality, more efficiently, and generate revenue from it. Those are just two of many examples that we have in flight right now.

So my advice to all of you is just to get started. What I have shown up here, I am not going to unpack, but if you want to take a picture, I would suggest that you follow these principles we're sharing, which are our key success factors in how we went to the cloud, the buy-in from our business leaders, as well as our data journey.

We all know that the data space and the AI space are moving rapidly, and you cannot sit on the sidelines; you need to make decisions. So I would encourage you to lean in, be in it, be in motion. Even though I'm not unpacking this, I also want to tell you that these actions are very actionable. They're great tips for how to move forward in your organization to leverage your data and your position so that you can take advantage of these advanced techniques.

And I talked about being all in with AWS and partners, and I'll give a shout-out to RapidScale, the Cox company in that partner ecosystem that can help you. That is how we have leaned in: we've taken advantage of everything, and what I heard this morning excites me. So we are going to race to the top with what's next.

I would leave you with this: radical transparency, manage your expense along the way, show value as you go and let that build momentum, but focus on outcomes. If you haven't started, start. If you're early in, lean in and accelerate. And if you're all in like we have been, I want to talk to you, because I'm pretty sure we can learn something from each other.

So thank you very much, and I'm going to turn it back over to Andy.

How cool is that? The car video is really neat. This story is so wonderful, and it's actually representative of a lot of these enterprise migration stories. I mean, Cox has been so incredible in terms of execution, but it's a pattern that we're seeing in other places.

So in the second half of the talk, I'm going to spend some time talking to you about three directions that we see customers pulling us in from the perspective of storage services.

And the first one is migrations. Migrations have historically been a reasonably boring thing to talk to a room about when it comes to storage, right? It's usually a cost-focused thing, it's a bit tedious, and so on.

The reality though is we've seen a really interesting shift in migrations for storage, I think over the past few years. A couple of things have happened. First of all, a lot of customers, even enterprise customers that have incumbent data in enterprise managed storage, are just building in the cloud. So there's a workload transformation that's happening where people are building from scratch applications over these services while still maintaining that estate.

But the other thing that's really interesting to have seen happen is that there are forklift upgrades, right? Things like SAP HANA and VMware being moved over; this is probably what you'd expect. Similarly, there are data protection and recovery workloads: instead of building a second data center or getting colo space where you also have to provide compute, people are replicating storage to the cloud.

The third case is really interesting. The third case is we're seeing customers realize that there's an enormous amount obviously of value in the data that they're hosting on those enterprise storage targets. But there's an equivalently enormous amount of value available in the services that they can build on top of it and the scale that they can analyze that data with in the cloud.

And so we're seeing a lot of migrations actually being driven by a desire for a transformative change to be able to bring new workloads to data that has potentially been curated and built over decades, right? And this is a really, really interesting shift.

Now, the products that we build at AWS specifically for these migrations are the FSx series of storage products. These are single tenant products. What we hear from storage admins that are looking at doing migrations to the cloud is they really want to buy storage that looks a lot like the storage they already have, right? They want stuff that's integrated and easy to work with.

And so with FSx for NetApp ONTAP, where we partnered with NetApp, it is a NetApp product: an AWS and NetApp partnership offering NetApp ONTAP storage systems in the cloud. You can turn on SnapMirror and replicate data into ONTAP running over AWS.

FSx for Windows File Server is another popular FSx offering. And then we have two open-source-based offerings: FSx for OpenZFS and FSx for Lustre.

Today, we are announcing new support for scale-out file systems within FSx for NetApp ONTAP. This is a pretty sizable performance bump in terms of cluster capabilities: these clusters can now scale on SSD up to 1.2 million IOPS for the file target and 36 gigabytes a second.

And this feeds directly into that use case I was telling you about: you can mirror your data into a scaled-out ONTAP cluster, and suddenly you can use a GPU cluster to do training or fine-tuning on that data, or you can launch 10,000 Lambdas against that data and really draw a lot of analysis potential off of it, at a level that you probably wouldn't buy compute for as a single owner.

eHealth NSW is an example of an ONTAP customer that's done a migration like this. They migrated 1.3 petabytes of medical imaging data and saw $16 million in cost savings through that process. They were able to reduce image fetch time by 10x, and they shortened the delay for a negative COVID result from 10 days to one hour after this change.

So customers often look at migration as a starting point of a transformative change, right? They're really focused on the data and how a migration opens up new capabilities.

Let's shift and talk about similar examples of working with shared data, and data being at the center of a long-term investment in building. And so this is where we talk about data lakes and analytics services.

And so from a storage perspective this has been a remarkable area for well over 10 years now. It's like every year there's exciting new things happening in the way that we can work with data.

Anecdotally something that we see is that data growth inside customers, enterprise customers, is exponential. We see typically about a 10x increase in the amount of data under management over a five year period.

We see data coming from all sorts of new sources, increasingly diverse data formats on disk, and a really sweeping set of new tools. And this is where the data lake pattern comes in, and that term actually has meaning.

And so it's worth talking about it for a second. On S3 today, we have over 700,000 data lakes. The data lake term, when it was introduced, was introduced in contrast with the idea of a data warehouse, right?

Historically, the idea of data warehousing was that you would establish a schema up front, you would load your data, and you would interact with your data through that query engine. The idea of a data lake is that we still want those facilities, absolutely, but we're going to decouple them, and we're going to make a decision to store our data in a shared storage substrate.

We're going to establish file formats that our tools can work with. We may build a warehousing facility on top of it, but our teams are still free to bring other tools to work with that same data. And that flexibility to bring other tools to the data is the thing that lets teams have ownership and the freedom to decide to play with PyTorch, right, and do a bunch of machine learning, or experiment with new query engines or other tools.

They can go and stand up their own software directly working on that data, whether it's in a container or a Lambda or a VM.
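
As a rough sketch of that "same data, different tools" idea, here's one way two engines can work over the same Parquet prefix in S3 without copying or reloading anything. The bucket, prefix, and column names are placeholders, and this assumes the pyarrow and duckdb packages (DuckDB can query an Arrow dataset object referenced by name in local scope).

```python
import duckdb
import pyarrow.dataset as ds
import pyarrow.fs as pafs

# One shared substrate: a Parquet prefix in S3 (names are placeholders).
s3 = pafs.S3FileSystem(region="us-east-1")
events = ds.dataset("my-data-lake/events/", filesystem=s3, format="parquet")

# Tool #1: scan the data with Arrow directly, e.g. to feed an ML pipeline.
sample = events.head(1_000)
print(sample.num_rows, "rows pulled for feature work")

# Tool #2: point a SQL engine at the very same files -- no copy, no reload.
daily = duckdb.sql("""
    SELECT event_type, count(*) AS n
    FROM events
    GROUP BY event_type
    ORDER BY n DESC
""").fetchall()
print(daily)
```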

S3 is obviously at the center of a lot of data lakes. I think S3 is especially attractive as a storage substrate for data lakes, notably because there's no storage management involved: you don't provision, you don't plan for capacity, you just use the storage.

The scale of S3 and the ability to elastically scale performance up to hundreds of terabytes of throughput into a single dataset is a burst capability that is very difficult to replicate in a storage system that you built for a single tenant.

And so that's been something that's allowed people to get incredible performance for jobs, whether it's genomics analysis or querying data on top of S3. And the other facet of S3 that's incredibly powerful is all of the integrations and partnerships that we have, whether it's AWS analytic services and databases or partner services like Snowflake and Databricks that can be brought to bear.

So I'm gonna quickly tell you about two things that we've seen in data lakes over the past year. Starting around re:Invent last year, in almost every single conversation that I had with customers building data lakes on S3, we talked about Iceberg. Customers were bringing it up.

Iceberg is an example of what's being called an open table format. And the idea with these open table formats is to build a first class table abstraction for data stored on object stores like S3. It's really, really remarkable what's been achieved in here.

All three of these examples - Iceberg, Hudi, and Delta Lake - are formats to build tables, tabular content over existing storage. And they solve for a whole bunch of challenges that customers have had in the past, working with just plain Parquet or columnar formats on top of the object store, to pick an example.

A common use case here is log data. You're writing out log data to storage, and at some point in the future you want to retire that log data. So there's mutation; in this simple case it's append-only mutation, but I'm still adding and removing data from the table. If I'm issuing queries directly against the Parquet, I need to be resilient against those files getting deleted under my feet, I need to be listing all of the files to figure out what's there, and I need to be working across all of those files in my table. By moving to something like Iceberg, you get an interaction layer that provides serializable transactions against the data, so that a query is not going to interfere with mutations to the table.

You get improved performance, because it's managing naming and layout inside the object store. You get the ability to evolve the schema, by adding or removing columns for example. And because these systems are snapshot-based, you can travel back in time and issue queries against past states of your data. So it's a really profound change.
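
To make those capabilities concrete, here's a hedged PySpark sketch of the operations just described against an Iceberg table in the Glue Data Catalog. The catalog, database, and table names are placeholders, and the exact configuration keys and required jars depend on your Iceberg and Spark versions, so treat this as illustrative rather than a recipe.

```python
from pyspark.sql import SparkSession

# Rough Iceberg-on-Glue configuration; verify keys/jars against your Iceberg version.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-data-lake/warehouse/")
    .getOrCreate()
)

# Retire old log data transactionally: concurrent readers keep a consistent snapshot.
spark.sql("DELETE FROM glue.logs.requests WHERE event_date < DATE '2023-01-01'")

# Evolve the schema in place; no rewrite of the existing Parquet files is required.
spark.sql("ALTER TABLE glue.logs.requests ADD COLUMN region string")

# Time travel: query the table as it looked before today's mutations.
spark.sql(
    "SELECT count(*) FROM glue.logs.requests TIMESTAMP AS OF '2023-11-01 00:00:00'"
).show()
```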

There's an enormous set of tools evolving around these open table formats, and the pattern looks basically like this: there's S3 at the bottom, an open table format in the middle, and some analytics tool at the top. It's worth diving in for a moment and giving you a little bit of a view of how these things work internally, because it's helpful to have a mental model of this.

And so if we look inside Iceberg as an example, we can really split it up into three components. At the bottom, you've got your data files; these are Parquet or other columnar data files. At the top, there's a catalog, and the catalog is responsible for telling you about your tables, telling you about the snapshots that exist in the past, and handling updates and making sure they're serialized appropriately. In the middle, there's a bunch of metadata that also ends up being written into S3, and that metadata summarizes the views you have of your data across those files.

As an example, if I had a giant table existing in Parquet and I changed a few rows of it, I would write out a new Parquet file that just had those changes, and I would update the metadata to say: here's what I overwrote, and here are the ranges that are different. And so that's how a query engine, or anything interrogating the data, can understand what's changed.

Now, it's really, really powerful. One side effect that proves to be a little bit challenging, especially if you're doing lots of steady-state writes, is that you create a lot of small Parquet files, and the query performance off of those files is considerably worse than if you were doing big reads out of a single large object. And so last month, the Glue Data Catalog launched automatic compaction of Apache Iceberg tables. This is an example of the direction that we're watching these workloads evolve and shift in. What this compaction does is take all of those small Parquet files, coalesce them, write a single large Parquet file, and update the Iceberg metadata to say that there's this new view. You get about a 40% performance improvement for some workloads on top of this.
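
Here's a minimal sketch of the underlying idea (not the Glue compaction implementation itself): read many small Parquet files under a prefix as one logical table and rewrite them as a single large, well-laid-out file. The bucket, prefixes, and row-group size are placeholders.

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pyarrow.fs as pafs

s3 = pafs.S3FileSystem(region="us-east-1")
small_prefix = "my-data-lake/events/day=2023-11-27/"          # placeholder: many tiny files

# Read every small Parquet file under the prefix as one logical table...
small_files = ds.dataset(small_prefix, filesystem=s3, format="parquet")
table = small_files.to_table()

# ...and rewrite it as a single large file with big row groups, so queries do a
# few large sequential reads instead of many tiny ones.
pq.write_table(
    table,
    "my-data-lake/compacted/day=2023-11-27/part-00000.parquet",
    filesystem=s3,
    row_group_size=1_000_000,
)
# A real table format would also commit new metadata pointing at the compacted file.
```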

A second area where we're seeing a lot of interest and movement is around governance. The data lake style of development wants to use lots and lots of different tools, and we want developers to move as quickly and as safely as possible. So today, we're launching S3 Access Grants. Access Grants allow you to bring your own directory service, Active Directory or other directory services, integrate them with IAM, and use your own principals as part of access control checks on S3 objects. This is a really, really powerful thing in terms of integrating access to the object store with your own organizational directories.

One capability that's immediately been really popular with customers that have started to use Access Grants is the ability to perform fine-grained auditing: attributing accesses to S3 objects to the individual users that are working with the system.
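
As a rough sketch of how an application might use this, the flow is: ask Access Grants for short-lived credentials scoped to a prefix the calling identity has been granted, then use those credentials for the data access so every request is attributable to the end user. The s3control call and parameter names below follow my reading of the Access Grants API at launch and should be checked against the current SDK documentation; the account ID, bucket, and prefix are placeholders.

```python
import boto3

# NOTE: get_data_access and its parameters are my reading of the Access Grants API;
# verify against your boto3 version before relying on this sketch.
s3control = boto3.client("s3control", region_name="us-east-1")
ACCOUNT_ID = "111122223333"                        # placeholder

resp = s3control.get_data_access(
    AccountId=ACCOUNT_ID,
    Target="s3://my-data-lake/finance/*",           # placeholder prefix from a grant
    Permission="READ",
    DurationSeconds=900,
)
creds = resp["Credentials"]

# Use the vended, prefix-scoped credentials for the actual data access; requests made
# this way are attributable to the directory user behind the grant, which is what
# makes the fine-grained auditing story work.
scoped_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
obj = scoped_s3.get_object(Bucket="my-data-lake", Key="finance/q3/report.parquet")
```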

We heard from Marianne about how Cox went all in and built, starting from initial footprints, a really rich and robust set of services on top of S3. BMW is another example: they started in 2019 with an S3 data lake, and today that data lake holds three-plus petabytes of data and takes vehicle telemetry from 20 million vehicles. In Adam's talk this morning, we heard from Pfizer in a really similar tone. And so there are these really, really exciting examples of customers that are able to move so fast. And I think that's the thing I would leave with you on this topic of data lakes: it's a technical term, but the place where we see it being most successful is in its organizational implications. The ability to establish a center for your data, to think about governance, and to think about moving fast and experimenting by standing up new teams and using new technologies. It's a really, really compelling ability.

Now, the data lake topic, from a storage perspective, feeds really directly into AI/ML and generative AI, and so we can talk about that for a few minutes as well. I'm sure you're hearing a lot about generative AI in talks this week. It's something that all of our service teams are very excited about, and I'm sure you are too. From a slightly historical perspective, I think the rate of change that we've seen in these machine learning technologies is kind of amazing, and it hinges on three properties: the ability to build giant data sets, which was challenging even a decade and a half ago; the cloud's ability to bring enormous compute to bear on that data; and the renaissance that we've seen in computer science in terms of algorithms to work with this data, especially over the past five or so years.

And so if we look at the size of models that customers are building on top of this, it really speaks to the rate of change here. Even over the last two years, the dimensionality of these models, in terms of the number of parameters, is just skyrocketing, and commensurately, the things that people are doing with these systems are really, really incredible.

Our view, in terms of our storage services and the approach to these machine learning technologies, is that it maps very closely to the data lake pattern: we expect customers will want choice of models and choice of tools, just like they do in analytics, and that they'll want to bring those models and tools to their data, not the opposite. And that's really, from a storage services perspective, the way we're thinking about this.

When we look at large-scale customers, customers that are training large foundation models using thousands of GPUs over weeks of time, we typically see two patterns from a storage perspective. It's typically either Lustre, which is a file system designed for high performance computing, a large-scale, high-performance file system, or customers are building directly on top of S3. And it kind of depends on the heritage of your environment. The Lustre community, the customers that are using Lustre, tend to be folks that grew up with a cluster file system; they're using tools like Spark to do data prep, and they just want to go and hack on the data from Lustre. The S3 customers often are data lake customers, and they have a data lake ecosystem of tools around them. We're really excited about both of these paths, and we're investing in both of them from a storage perspective.

For AI/ML on the Lustre side, in November we launched throughput scaling on demand. The pattern that we see here is that a data science shop building a model will have a bunch of developers working on the data, and then they'll move into production training and want to scale up the throughput that's available from the Lustre cluster for the time that they're training. When they're done, they'll want to scale it back down and not continue to pay for that level of performance. So this is a really meaningful ability to move performance up and down. On the S3 side, there are two really interesting announcements in the context of ML frameworks.

The first one is that earlier this year, we launched a file connector for S3 called Mountpoint. This is a connector that wraps the GET/PUT verbs and allows you to access your data over local file APIs. We've launched support for this for Spark, Dask, and Ray, and we get a 6x performance improvement in terms of throughput to S3.
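
To show what "local file APIs" means in practice, here's a small sketch assuming a bucket has already been mounted with the Mountpoint CLI (for example, something like `mount-s3 my-training-bucket /mnt/dataset`; the bucket and path are placeholders). After that, plain file I/O just works, and byte-range reads on a file turn into ranged requests against the object.

```python
import os

# Assumes the bucket is already mounted at this path via Mountpoint (placeholder path).
DATA_DIR = "/mnt/dataset/images/train"

files = sorted(os.listdir(DATA_DIR))        # object listing, seen as a directory listing
print(f"{len(files)} objects visible as files")

# Ordinary file reads; only the requested bytes are fetched from the object.
with open(os.path.join(DATA_DIR, files[0]), "rb") as f:
    header = f.read(16)
print(header.hex())
```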

The other thing that we've just launched is an S3 connector for PyTorch. PyTorch is probably the most popular machine learning framework right now, and for both checkpointing and data loading, you can now directly access data in S3 from PyTorch with this connector.
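
Here's a minimal data-loading sketch with the connector. The package and class names are as I recall them from the connector's launch (installed with `pip install s3torchconnector`), so check the project's README for the current API; the bucket, prefix, and region are placeholders.

```python
# Package and class names as I recall them from the connector's launch announcement;
# verify against the s3torchconnector README before relying on this.
from s3torchconnector import S3MapDataset

# Map-style dataset over every object under an S3 prefix (placeholders below).
dataset = S3MapDataset.from_prefix(
    "s3://my-training-bucket/images/train/", region="us-east-1"
)

# Each item streams one object from S3; decode it however your pipeline expects.
first = dataset[0]
print(first.key, len(first.read()))
```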

The other dimension that I think is fun to think about in terms of storage and machine learning workloads is that, with those data lake customers moving into AI and generative AI, there's often a need for secondary data that they want to bring in and integrate with their own data. And AWS has been investing for years in the ability to gain access to curated, great data sets.

On the left, what you see is AWS Open Data. This is a huge collection of open scientific and otherwise public data sets, things like the Common Crawl data, which is used extensively for training large language models. On the right, we have AWS Data Exchange, and we've just launched a data category for generative AI. With partnerships with organizations like FactSet and Shutterstock, we bring you the ability to gain paid access to commercial data sets that are curated and of really, really high quality. This is also a place where, if your organization has data that you think is valuable for other folks to train off of, there's an opportunity to monetize that through this marketplace of data.

I'm just going to close on AI/ML with two quick points. The first one is that, anecdotally, when we talk to customers that have done a lot of work in this space, one thing I find surprising, although I guess it's a little bit intuitive in retrospect, is that about 80% (again, this is an anecdote) of the work involved in building a generative AI application or a sophisticated machine learning application is data preparation. It's curating and preparing your data to be used by the tools. And a really remarkable observation is that if you've already built a data lake on top of these storage services, if you're already doing analytics, you're a lot of that 80% of the way there. So a lot of these tools may be a lot closer and more accessible to you than you realize, and I'd encourage you to go and look at what's possible, because you may find that you're a lot closer to doing incredible things than you realize.

The other thing that I'll point out is that I've really tried to focus on the storage aspects of these technologies today. My colleague is doing a soup-to-nuts talk on generative AI on Thursday; I've seen a preview of it, and it's an incredible talk. So I'd encourage you to take the time to watch that one.

OK. The last thing that I would like to tell you about, because I couldn't resist throwing a little bit more nuts and bolts into the storage update, is to go back into how we build stuff a little bit. Across data lakes and generative AI, I've mentioned a bunch of these connectors that data lake customers and other folks are using to access data in S3. And there's a thing here that I think is really, really neat, that we've seen on the storage teams: the S3 team seems to be in the process of redefining what the team perceives the service boundary to be.

What I mean by that is: historically, requests arrive at and depart from the REST API, and that's kind of where storage starts and finishes for the team. What we've realized, as customers are starting to pull us closer to applications, is that we need to think about the actual application experience for these things. We need to get our sleeves rolled up and really work with the other side of the network to make sure that the experience is amazing, and it must be an amazing experience by default.

So when we started to talk to customers that were really driving a lot of performance out of S3, we learned that it was possible, I mean, the scale of S3 is enormous and it's totally possible, to drive incredible levels of throughput from the system, but it's a bit of work. And so a few years ago, we actually published a white paper and extended the S3 documentation with these performance best practices, and we talked about parallelizing flows and all sorts of things. To give you a quick sense of some of that advice:

The observation is that S3 has thousands and thousands of servers, and there are these IP addresses with load balancers and sets of servers behind them. But if you only talk to one of those, you're talking to a subset of what's potentially a very, very large fleet. And so if you're pinned and stuck on that one and some other clients come along, you risk getting into a spot where all the traffic is overloading that host; it's better to distribute it. And that's a client decision; we have to drive that from the client. As we move to GPU instances, these really, really beefy machines, on a P4 you're seeing 400 gigabits a second of access, and on P5s this rises to 3,200 gigabits. It's a lot of traffic off of a single NIC.

And so there's a picture of a P4; you can kind of see the cabling that we're putting into these GPU instances. And so it's far better, and the performance best practices advise this, to spread the traffic over the surface of the fleet, and you need to watch the behavior of the fleet, because it's a distributed system. Sometimes things don't go as well as we wish they would; sometimes there's traffic on some links. And so it's important to do this on the client side, because the client is the best vantage point from which to track that stuff.
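
To make that client-side advice concrete, here's a rough sketch of one piece of it: splitting a large object into ranged GETs issued in parallel over a larger connection pool, so the traffic fans out across many S3 front-end hosts instead of pinning to one connection. The bucket, key, and sizing numbers are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import boto3
from botocore.config import Config

BUCKET, KEY = "my-big-dataset", "shards/shard-0001.bin"     # placeholders
PART = 8 * 1024 * 1024                                      # 8 MiB ranges

# A larger connection pool lets requests spread over many S3 front-end hosts.
s3 = boto3.client("s3", config=Config(max_pool_connections=64))

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
ranges = [(start, min(start + PART, size) - 1) for start in range(0, size, PART)]

def fetch(rng):
    start, end = rng
    body = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{end}")["Body"]
    return start, body.read()

# Many parallel ranged GETs: higher aggregate throughput, and the client can
# notice and retry the slow or unlucky ones, since it has the best vantage point.
with ThreadPoolExecutor(max_workers=32) as pool:
    parts = dict(pool.map(fetch, ranges))

data = b"".join(parts[start] for start, _ in ranges)
print(len(data), "bytes fetched")
```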

And so we weren't satisfied with documenting this; we really felt that it needed to go further. The AWS SDK team has a bunch of absolutely amazing folks, and we worked with them really closely on making this the default experience. The SDK team produced a native code library called the CRT, the Common Runtime, and the CRT takes all of those performance best practices, wraps them up, and makes them automatic. It sets itself the goal of being able to automatically saturate the NIC with S3 transfer performance.

And what we've been doing is linking the CRT in underneath the CLI, the SDKs, and these various connectors that we've built. So today, by default, the CRT is enabled in the CLI and in Boto3, the Python SDK, for all of our machine learning instances, and we'll be rolling it out to other instance types. It's behind a feature flag, so you can turn it on today as we move forward.

Transfers with the CRT turned on under the CLI are 5x faster than directly accessing through the SDK without it.

Mountpoint, the file connector for S3 that I mentioned earlier, has also just gained caching. If you re-access data during training or other workloads, you now have the option to cache data on the local instance SSD, for example, reducing your request costs and further reducing latency.

There are other CRT-based clients. I mentioned the connector for PyTorch. We've also just launched a CSI driver to make Mountpoint available through the container interfaces under EKS.

And so the bottom line with all of this: Continental is an example customer that's used Mountpoint. Continental saw their simulation workload speed up by 20% and a 40% cost reduction just by moving to run directly against S3 with Mountpoint. This is something you should expect to see us continue to do with all of these connectors; it has been a really, really rewarding and powerful investment in working closer to the clients. We're doing almost all of this work in open source, so it's there for you to participate in or send feature requests to.

And all of that works with both S3 and with today's launch of S3 Express One Zone. So let's wrap things up. I started by talking about continuous, invisible innovation in our storage services; that's something that's absolutely our focus. I shifted to talk about how we aim to deliver performance for every single workload, and talked about the launch of S3 Express One Zone.

We heard from Marianne Johnson about Cox's journey into the cloud. And then we talked about enterprise migrations, data lakes, and AI/ML as directions that our teams are really focused on, anticipating customer needs and making sure that the storage performance and the storage experience are just incredible out of the gate.

And then we indulged a little bit more in some internals around the work that we're doing with clients for S3. If I can leave you with anything, it's this: what you should expect in building over AWS storage services is that our services, and the teams that build them, are absolutely focused on continuing to listen and to improve the behavior of these services over time. And like those 15-year-old EBS volumes, you should expect to continue to see massive strides in the capabilities that we're building for storage.

So thank you very much and enjoy the rest of the event.
