Deep dive on Amazon FSx for NetApp ONTAP scale-out file systems

Andy: All right, good afternoon, everyone. Thank you for joining at 5pm. I appreciate you all being here.

Today in this session, I'm really excited to spend some time doing a deep dive on a new capability we announced just a few days ago: a scale-out deployment option for Amazon FSx for NetApp ONTAP.

My name is Andy Crutch. I lead the product management team for the FSx for ONTAP service. I'm joined here by Mike, who leads engineering for FSx for ONTAP.

And later, we're going to be inviting David Miller, a senior director at ARM and a customer of FSx for ONTAP, up on stage to speak a bit about his journey using the service.

So before we dive into scale-out file systems and give you a better sense of what they are and how they work, I first wanted to start off with a quick overview of Amazon FSx for NetApp ONTAP.

This is a service that we launched in collaboration with NetApp just over two years ago, and our goal with this service is to provide customers with really two things.

One is a complete NetApp ONTAP file system that's deployed in AWS as a managed service, with the simplicity, the agility, and the resiliency benefits that come from an AWS managed service.

I've talked to a number of customers over the years who are migrating or building applications on AWS. A lot of customers use storage technologies like ONTAP specifically because of the rich set of capabilities it provides.

And our goal with this service is to give customers full access to that rich set of ONTAP capabilities in AWS, but to make it easy to leverage ONTAP without needing to worry about infrastructure management, software patching, and all of the management tasks that otherwise come with managing a storage system.

And so this service offers customers a number of the really rich capabilities that ONTAP provides, whether that's access to ONTAP's data management features and APIs, or ONTAP storage efficiency features that a lot of customers leverage, such as thin provisioning, deduplication, compression, and tiering.

All of these are capabilities that customers depend on to make sure they have the right TCO for their overall data sets.

And lastly, there are a lot of really rich capabilities built into ONTAP that make it easy for customers to replicate data, cache data, and protect their data, features like SnapMirror and SnapVault, and these are also built into the service. Again, our goal with the service is to make it easy for customers to either migrate or extend ONTAP deployments to the cloud, and to leverage ONTAP in AWS in a really easy way.

I talked a little bit about ONTAP's capabilities and wanted to share what these are. This is a subset, by the way, of the capabilities that ONTAP offers and that are available in FSx for ONTAP.

The reason I wanted to share this slide is to give you a sense, if you're not familiar with ONTAP as a storage solution, of just how powerful the data management capabilities are that ONTAP provides and that a lot of the customers we work with on a daily basis depend on. We'll get into some of these capabilities later in the session.

So if I take a look at FSx for ONTAP today: as I mentioned, we launched the service about two years ago. What are some of the common use cases we've seen customers run on the service over the past couple of years? In short, it's a pretty broad spectrum. Customers are storing their general purpose user shares and group shares on FSx for ONTAP, leveraging the native access from Windows and Mac clients and the integration with identity stores. And they're leveraging the high performance that ONTAP provides, plus the resiliency that we offer in the managed service, to power databases and high-performance applications.

And lastly, an interesting use case we see customers adopt: a lot of customers, before they bring their application over to AWS, make one of their first forays into the cloud by replicating their data over and using AWS as a DR, or disaster recovery, site.

And so for a lot of customers, the migration journey we see them take is one where they start by retiring their secondary data center on premises, replacing their DR site with an FSx file system in AWS. Then, over time, since they already have a second copy of the data in the cloud, they move the compute over and start running applications against that data.

So those are some of the use cases we've seen over the past couple of years on the service. Now I want to talk a little bit about performance, because one of the main value propositions of this new scale-out file system type is higher performance scalability.

So when we launched the service in September of 2021, about two years back, we offered a multi-AZ deployment option. This is a file system that automatically replicates data and is highly available across multiple Availability Zones.

And when we launched these file systems two years ago, we offered a maximum performance of two gigabytes per second of throughput and 80,000 SSD IOPS.

About eight or nine months later, in April 2022, we launched a single-AZ deployment option. Single-AZ deployments give customers who don't need the added resiliency of multi-AZ file systems roughly twice the price performance: single-AZ file systems are about half the cost, because they only need to replicate data within a single Availability Zone. For customers who don't need that resiliency, it's a great solution for running really high-performance, compute-intensive workloads.

And just last re:Invent, a year ago, we were excited to announce a 2x performance increase for the service. We increased the maximum performance from two to four gigabytes per second of throughput, and from 80,000 to 160,000 SSD IOPS.

This was a really exciting launch that we had about a year ago. One of the key pieces of feedback we hear from customers is that they love the rich capabilities that FSx for ONTAP provides,

and the simplicity and the resiliency that we offer as a managed service. But a key theme we hear when we engage with customers is that they're always looking to do more.

So while there are a number of improvements we've made over the past two years to improve performance and price performance, customers are always looking to do more.

So before I talk about scale-out file systems, I wanted to spend a couple of minutes explaining what we've offered before, and I'll use the term "scale-up file system" to refer to the file systems that have existed on FSx for ONTAP before today.

What does a scale-up file system mean? Again, I'm oversimplifying a lot of the architecture behind the scenes, but each file system is powered by a highly available pair of file servers: there's an active file server and there's a passive file server. In a single-AZ file system, both of these are in the same Availability Zone; in a multi-AZ file system, they're actually spread across two Availability Zones to provide high availability across AZs.

Each file system also has an SSD storage layer, a really fast, high-performance storage layer for active data, as well as a lower-cost capacity pool storage tier that's cost-optimized for colder data. We'll talk more about those tiers later in the presentation.

And so with a scale-up file system, as I mentioned, there are two file servers and one of them is active. What that means is that when you're accessing your FSx file system from a number of compute instances, every one of those compute instances is accessing that one active file server at any given time.

What this means is that the performance of a file system is, for the most part, governed by the performance you can drive from this single active file server. Again, that's how a scale-up file system works: all traffic is funneled through this one server.

And this is how file systems have worked up until this day, with that one active file server. As I mentioned, we've improved performance quite a bit over time, but historically we've offered up to four gigabytes per second of read throughput, about a gigabyte per second of write throughput, and again, 160,000 SSD IOPS.

So as I mentioned earlier, customers are always looking to do more. A lot of the workloads customers are migrating or building in the cloud are constantly growing.

Customers are looking to throw more compute at processing their data; they want to get results faster when they're running analysis. Also, one of the themes we're seeing is that sometimes, for customers who are migrating workloads to the cloud, the workload's performance requirements aren't changing much over time.

But as part of having data in the cloud, customers are looking to leverage the broad range of AWS services that we offer to do more with the data: run more analysis against the data, gain more insights from the data.

And so what that means is that in many cases, for customers who are bringing their data over to the cloud, the need for high-performance access to that data only increases over time.

So we spent a lot of time thinking about how we deliver higher performance than what we have, and there are really two ways of doing this.

The first is we can have even larger file servers. The performance improvement we launched last year, when we went from two to four gigabytes per second, was achieved at a high level by offering a larger file server with about twice the horsepower of the file servers that came before.

So as part of re:Invent, in conjunction with the scale-out deployment option, we're excited to announce that we're increasing the maximum performance of the file servers that we offer on FSx for ONTAP by about 50%.

Before, file servers offered about four gigabytes per second of disk throughput; that's now increased to a maximum of six gigabytes per second.

They offered a maximum of six gigabytes per second of network bandwidth; that's now doubled to 12 gigabytes per second. Again, one way of thinking about how you get better performance out of a file system is: use a file server with more horsepower. So that's the first announcement, and the first aspect of our scale-out file systems that lets us deliver even more performance to customers than we've been able to deliver previously.

The other way of getting better performance is to have multiple file servers. As I mentioned earlier, when you're accessing a scale-up file system, all of your clients, all of your I/O, goes through one file server, so you're limited by the performance that one file server can deliver.

Well, one way of getting more performance is to have multiple file servers, and that's exactly what we're doing with scale-out file systems. With a scale-out file system, your file system is powered by anywhere between two and six highly available pairs of file servers. Each pair has its own storage, each operates as part of a single cluster, but each can independently deliver, for example, up to the six gigabytes per second of disk throughput that we talked about on the previous slide.

And so this horizontal scaling of the file servers, which is where the scale-out name comes from, enables customers to have a single file system that under the covers is horizontally scaled, so that as your clients access data, those clients are distributed across the back-end file servers. This is what ultimately enables customers to deliver much higher levels of performance than you could with only one active file server.
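To make that concrete, here's a minimal, hedged sketch of inspecting a file system's HA pair count via boto3. The HAPairs and ThroughputCapacityPerHAPair field names reflect this scale-out launch as I understand it; treat the exact names as assumptions and check the FSx API reference.

```python
import boto3

fsx = boto3.client("fsx")

# List ONTAP file systems and print how many HA pairs power each one.
for fs in fsx.describe_file_systems()["FileSystems"]:
    if fs["FileSystemType"] == "ONTAP":
        cfg = fs["OntapConfiguration"]
        print(fs["FileSystemId"],
              "HA pairs:", cfg.get("HAPairs", 1),
              "MB/s per pair:", cfg.get("ThroughputCapacityPerHAPair"))
```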

So what does this all mean in terms of performance? Again, I mentioned performance is going to be a key theme in today's discussion. Well, with scale-out file systems, through the combination of offering larger file servers than we ever have and enabling customers to create file systems powered by multiple HA pairs of file servers, we're excited to increase our maximum read throughput on a file system by 9x.

We're increasing the maximum performance on FSx for ONTAP from four gigabytes per second of read throughput to 36 gigabytes per second of read throughput. Write throughput is increasing by 6x: we're going from one gigabyte per second to six gigabytes per second.

And the total number of SSD IOPS that you can provision is now 7x higher than it was before: a single scale-out file system can deliver 1.2 million SSD IOPS, a 7x increase from the previous limit of 160,000 IOPS.

So our goal here, again: customers are always looking to do more, and performance work is never finished on our roadmap, because customers are looking to do more with their data over time. We're really excited, as part of our scale-out file system offering, to give customers the same ONTAP capabilities that they rely on and depend on, capabilities that give customers really interesting ways of managing their data,

but with much, much higher performance than they could ever get in the cloud before. So with that, I want to hand off the mic to Mike, our head of engineering, who's going to give you a few more details on how these scale-out file systems actually work.

Mike: Thanks, Andy. So Andy just told us about all the awesome performance numbers. But as he said, we're going to talk a little bit about how this works: how do you actually take advantage, as a customer, of a scale-out file system?

We're going to start by recapping what an FSx for ONTAP file system looks like. There are components that make up every FSx file system, be it a scale-up or a scale-out file system.

And the first one is your file system. A file system is akin to an ONTAP cluster: it's all of the underlying infrastructure, the disks, networking, CPU, et cetera, that powers your data storage.

In every file system you'll have one or more storage virtual machines, aka SVMs. Think of an SVM as a virtual file server for your data. You can partition your data however you'd like with SVMs: lines of business can get their own partition, or however you choose to partition your data.

"Um, every single SVM has its own permissions, shares, exports, et cetera. Uh, and so it's a great way to again partition your data virtually.

And the last layer, the bottom layer I guess, or maybe the top, is the volume. Volumes are the actual virtual containers that you provision on your file system that contain your data. Every single volume is a mount point for your clients to access the data in your file system.

And we're going to talk a little bit about the types of volumes in FSx for ONTAP, and how you make use of those, especially in the new scale-out deployment.

So the first type of volume that exists in FSx for ONTAP is called the FlexVol. A FlexVol is a single volume that maps to one SSD storage tier on one single file server. The maximum size of a FlexVol is 300 terabytes. And again, it's backed by a single file server and a single SSD tier, so as you write data to a FlexVol volume, everything goes to that one SSD store. A sketch of creating one via the API follows below.
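Here's a hedged sketch of creating a FlexVol with boto3. The SVM ID, name, and size are placeholders; the field names follow the FSx CreateVolume API as I understand it.

```python
import boto3

fsx = boto3.client("fsx")

resp = fsx.create_volume(
    VolumeType="ONTAP",
    Name="vol_projects",
    OntapConfiguration={
        "StorageVirtualMachineId": "svm-0123456789abcdef0",  # placeholder SVM ID
        "JunctionPath": "/vol_projects",   # path clients mount under the SVM
        "SizeInMegabytes": 1_048_576,      # 1 TiB; FlexVol max is 300 TB
        "StorageEfficiencyEnabled": True,  # dedupe and compression
    },
)
print(resp["Volume"]["VolumeId"])
```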

The second type of volume is a FlexGroup volume. We've supported FlexGroups up to today, but they take on an additional sense of power when it comes to scale-out file systems. A FlexGroup is still just a volume that your client mounts; your client is mounting the one volume.

But under the covers, that volume can actually be mapped to multiple SSD tiers, or even multiple file servers. A FlexGroup is made up of multiple constituent FlexVol volumes. So think of it this way: a FlexGroup, under the covers, is essentially a bunch of smaller FlexVol volumes, all of which map into this one FlexGroup for your client access.

The FlexGroup maximum size is 20 petabytes. And again, the cool thing is you can actually spread your FlexGroup volume across multiple file servers. So you can see why, if we're talking about a scale-out offering with multiple file servers, more than one active file server, the FlexGroup becomes the key volume type. A sketch of creating one via the API is below.
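A hedged sketch of creating a FlexGroup spread across a scale-out file system's aggregates. The VolumeStyle and AggregateConfiguration fields reflect the scale-out API as I understand it; aggregate names, sizes, and IDs are placeholder assumptions.

```python
import boto3

fsx = boto3.client("fsx")

resp = fsx.create_volume(
    VolumeType="ONTAP",
    Name="fg_data",
    OntapConfiguration={
        "StorageVirtualMachineId": "svm-0123456789abcdef0",  # placeholder
        "JunctionPath": "/fg_data",
        "SizeInBytes": 100 * 1024**4,          # 100 TiB; FlexGroups scale to 20 PB
        "VolumeStyle": "FLEXGROUP",            # vs. the default FLEXVOL
        "AggregateConfiguration": {
            "Aggregates": ["aggr1", "aggr2"],  # one aggregate per HA pair here
            "ConstituentsPerAggregate": 8,     # constituent FlexVols per aggregate
        },
    },
)
print(resp["Volume"]["VolumeId"])
```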

So what does it look like when you write to a FlexGroup volume? Well, as mentioned, it still just looks like one volume, one mount, to your client. Your client just writes the data to whatever active file server it is currently mounted to.

Under the covers, the FlexGroup volume, again, is all these constituent FlexVol volumes. What ONTAP is doing is essentially hiding the complexity of data management: it puts the file you've written onto the FlexVol constituent that is most efficient for your workload, trying to optimize for balance and for the future efficiency of reads and writes, et cetera.

And so as you write, the files and directories you're writing are getting placed in different constituents, ultimately on different file servers under your file system.

And what if I want to read a file out? Well, same story. Your client is mounted to a single file server, and you're reading from that volume. When you try to read a file from that particular active file server, maybe that file is already on that file server, on its SSD store. Great, it's retrieved right back. But it could certainly be on one of the other file servers.

In that case, again, all of this is just taken care of by ONTAP. It'll retrieve the data from whichever aggregate your data is written to and serve it back out across the active file server to which you are mounted,

writing it back to the client. It's all still just reading and writing to a single volume, and all of that complexity of moving the data, managing the data, and optimizing where to place the data is taken care of by ONTAP.

The last bit of how this simplifies things for you, when you're using a FlexGroup and a scale-out file system, is around the actual mounting, using DNS.

An individual client, again, is only mounting to a single active file server in this situation. But you have multiple clients who want to access multiple file servers. As Andy mentioned, that's how we're going to maximize our throughput: by having our clients hitting different servers.

And so if you use DNS to mount your file servers, the DNS record we create is used to round robin and load balance your clients' access across the file servers, to again attempt to achieve the maximum throughput of your workload. The sketch below illustrates that round-robin behavior.
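A minimal sketch of observing that behavior from a client: list the A records published behind the file system's DNS name, then resolve it a few times the way a fleet of clients would. The DNS name is a hypothetical placeholder, and how lookups rotate depends on your resolver's caching.

```python
import socket

DNS_NAME = "svm-0123456789abcdef0.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"

# All file-server addresses published behind the one name (port 2049 = NFS)
records = {info[4][0] for info in socket.getaddrinfo(DNS_NAME, 2049)}
print("A records:", records)

# Each fresh lookup lands a client on one of the file servers
for client in range(4):
    print(f"client {client} ->", socket.gethostbyname(DNS_NAME))
```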

So let's actually look at creating a scale-out file system and what that looks like.

You go to the FSx console, and this experience has not really changed in the broad strokes. You go to create a file system and select the ONTAP file system type.

There are two types of creates in the form. There's the quick create, which is what we've recommended in the past for customers who are just looking for a simple scale-up file system. If you want to scale out, you need to use the standard create.

You enter how much SSD capacity you need, give it a name, select single-AZ, and then you choose your throughput. You can type in the throughput number you're trying to achieve, and it will autocomplete to a number like that.

And you'll see there are actually multiple options there, because you can choose how many nodes you want to map to your throughput. In this particular create that we're showing here, we're going to select the option for a six gigabyte per second throughput file system across two HA pairs: two pairs, each at a three gigabyte per second throughput scale.

Go ahead and create your SVM and your volume. We create a default SVM and volume for each file system; you can create more SVMs and more volumes after your file system is created, but you need a default for your initial data.

Go ahead, review all of the things you've entered, hit go, and your scale-out file system will be created. For those who prefer the API, a hedged equivalent is sketched below.
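Here's a sketch of the same create flow via boto3 instead of the console, matching the two-HA-pair, six-gigabyte-per-second demo above. The DeploymentType, HAPairs, and ThroughputCapacityPerHAPair values reflect the scale-out option as I understand it; the subnet ID and capacity are placeholder assumptions.

```python
import boto3

fsx = boto3.client("fsx")

resp = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=4096,                    # total SSD capacity in GiB
    SubnetIds=["subnet-0123456789abcdef0"],  # placeholder subnet
    OntapConfiguration={
        "DeploymentType": "SINGLE_AZ_2",     # the scale-out deployment type
        "HAPairs": 2,                        # two HA pairs, as in the demo
        "ThroughputCapacityPerHAPair": 3072, # 3 GB/s per pair = 6 GB/s total
    },
)
print(resp["FileSystem"]["FileSystemId"])
```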

All right. Now, we don't just create file systems to have a file system; we're going to read and write from them, right? So first, I want to look at what that looks like from a Windows instance.

Once you've created your file system, this is what the experience looks like. You get the file system page, and there are a number of sub-tabs here. Again, your volume is the data container that you are actually accessing.

So you're going to go to your volumes tab, look at the volume you already created, and the mount information is right there in that modal. Take that to your Windows client and go ahead and mount using that information.

And it's really this easy to get your new scale-out file system, with that FlexGroup volume that is again spread across multiple SSD tiers on multiple aggregates and file servers.

Set it up, read and write data from it, and access it from your Windows instance.

All right, what about Linux? Well, I'm sure you're not surprised: it's easy for Linux too. Back in the exact same experience, you look at the FlexGroup volume that you've created.

You go in and grab the mount information out of the attach modal, and you go ahead and mount this using NFS. Still just as easy.

So in addition to mounting over NFS, in this one we're actually going to drive some I/O to the file system from Linux, see what that looks like, and see what it looks like to monitor your file system while you're using it.

So go ahead and mount, and we're going to run just a basic I/O test on the file system. You'll be able to see on the screen a whole bunch of writes happening and how much time is left, and you can watch, as it's happening, the graphs that are available in the console.

These are your aggregate-level throughput, aggregate-level IOPS, et cetera, displayed here in the volume experience. And on top of being able to see what this looks like for that particular volume in aggregate, we're seeing this six gigabyte per second file system actually drive over seven gigabytes per second of throughput for that test.

And the latencies are sub-millisecond, about half a millisecond, for that particular test. But maybe you want to get more involved. You want to be able to see how the individual file servers are being used: not just that FlexGroup volume that's spread across all your file servers, but the individual file servers on their own.

So in the monitoring section of the FSx console, you go to your file system and you go to the performance monitor, and we can see in here, again, the aggregate data that was driven to your file system, the IOPS and throughput of that.

But we can also go and look at what that looks like for each individual file server. Again, this is just a very simple I/O test, so it's pretty even across all the file servers; we're not really seeing any differentiation.

But depending on your workload, and depending on how things have been laid out and partitioned over time, you might be able to find ways to improve and optimize your workload by looking at how the individual file servers and disks are being used. The same metrics can be pulled programmatically, as sketched below.
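A minimal sketch of pulling the file-system-level throughput shown in those console graphs from CloudWatch. The file system ID is a placeholder; the per-file-server breakdown in the console comes from finer-grained dimensions beyond the file-system-level query shown here, so treat this as the simplest starting point.

```python
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

stats = cw.get_metric_statistics(
    Namespace="AWS/FSx",
    MetricName="DataReadBytes",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-0123456789abcdef0"}],
    StartTime=end - timedelta(minutes=30),
    EndTime=end,
    Period=60,                # one datapoint per minute
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    # Sum of bytes over 60 s, converted to MiB/s
    print(point["Timestamp"], point["Sum"] / 60 / 2**20, "MiB/s read")
```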

And with that, I will turn it over to Andy to talk more about how customers are going to make use of FSx. Thanks, Andy.

Andy: And so as a result, customers need access to that data both from Linux clients as well as Windows clients. That's where the multi-protocol access that FSx for ONTAP provides makes it easy for customers to get the best of both worlds.

So some of the very specific examples of use cases where we see this combination of high-performance storage and ONTAP features being really valuable for customers are:

  • Analytics workloads like SAS Grid, and life sciences and genomic analysis applications.
  • And then also, in the oil and gas space, seismic data processing.

Again, a lot of these are high-performance, primarily Linux-based workloads, but ones where there's also a need to access data from either a Windows or a macOS client of some sort.

So that's one use case.

The second one is cloud bursting. For those of you who aren't familiar with it, cloud bursting is the idea that customers have on-premises footprints: they have storage and compute on prem. But those are often fixed infrastructure resources, and sometimes customers have periods where they want to drive even more performance, or use even more cores to run a certain analysis, than they have available on prem.

So because the cloud is pay-as-you-go, customers can spin resources up and down. A common pattern we see is that customers will deploy compute in the cloud and use that compute to run analysis on data that's been generated and stored on prem.

One of the challenges with this setup is that the cloud and on prem may not be adjacent to one another. Sometimes there may be latency in the tens of milliseconds, depending on where your data center and your AWS Region are. That means your compute instances accessing data from on prem may not be able to do so as quickly as they'd like, and that can ultimately slow down these applications and make compute jobs take longer to finish.

So a cool thing that we've been seeing with FSx for ONTAP, and David is going to talk about this a little more in the context of how ARM has been using the service, is for customers who use ONTAP on premises. One of the really unique capabilities that ONTAP provides is a feature called FlexCache.

What FlexCache enables you to do, at a high level, is take one ONTAP system and configure another ONTAP system as a cache for it. The idea here is for customers who want to run cloud bursting workloads: they have data on prem, but they want to run compute in the cloud.

What they can do with FSx for ONTAP, and this has always been supported with the service, is deploy FSx in the cloud and configure it as a cache for their on-prem data. Now the compute workload that's running in the cloud can access data on FSx with much lower latencies and much higher performance than if it had to go all the way back to on prem for each and every piece of data it needed to access. A heavily hedged sketch of what that configuration can look like follows.
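Note that FlexCache is configured through ONTAP itself (the ONTAP CLI or REST API), not the AWS API, and it assumes the FSx cluster and the on-prem cluster are already peered. The endpoint path, field names, hostname, and credentials below are illustrative assumptions; consult the ONTAP REST documentation for the exact schema.

```python
import requests

# Placeholder FSx ONTAP management endpoint and credentials
FSX_MGMT = "https://management.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com"

resp = requests.post(
    f"{FSX_MGMT}/api/storage/flexcache/flexcaches",
    auth=("fsxadmin", "PASSWORD"),      # fsxadmin is the FSx ONTAP admin user
    json={
        "name": "projects_cache",       # cache volume created on the FSx side
        "svm": {"name": "svm_cache"},
        "origins": [{                   # the on-prem origin volume to cache
            "volume": {"name": "projects"},
            "svm": {"name": "svm_onprem"},
        }],
    },
    verify=False,                       # sketch only; use proper TLS in practice
    timeout=30,
)
resp.raise_for_status()
```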

So how does scale-out help with cloud bursting workloads? Well, customers are doing cloud bursting because they want more compute; they want to run these jobs faster and at a larger scale. And the more compute you run, the more storage performance you often need as well. So a scale-out file system simply amplifies customers' ability to leverage cloud bursting, and the virtually unlimited compute that AWS offers, by ensuring that the storage you're configuring as a cache for your on-prem data can keep up with your compute and keep your CPUs or GPUs fully utilized. That's the goal at the end of the day: make sure your compute is not bottlenecked by storage, so that you can run your analysis as quickly and as cheaply as possible.

So the third use case I wanted to talk about is remote artist workstations. This is something we hear a lot, especially when we engage with customers in the media and entertainment space.

A common dialogue we have with customers is that in many of these creative workloads, customers may have remote artists in one part of the globe working on assets that are stored in a specific storage system. At the same time, many of these teams are global. Sometimes we talk to customers who say, "I have a team on the west coast, a team on the east coast, and teams in Europe and APAC," and these teams all need the ability to collaborate on a shared piece of data globally.

Some of these customers have a follow-the-sun model, where assets are generated and updated by a US-based team, and a few hours later a team in APAC needs to pick that work up and continue running with it.

And so the FlexCache capability that I talked about for cloud bursting also provides a really unique ability for customers in AWS to have a global file system that they can use for these remote artist workstation use cases.

So as an example, what customers with this type of workload often do is deploy an FSx file system in each of the geos where they have artists, and configure each of these remote FSx systems as a cache for the origin. What this gives customers is a globally accessible namespace where each artist is accessing an FSx system that's right next to them, so they're seeing low latencies, but the data is strongly consistent across all of them.

So if you have an artist updating one asset, the other artists are also able to see that updated data. This is a really powerful capability that gives customers global access to data, and with high performance.

And again, FlexCache has been available on FSx for ONTAP for the past couple of years, since we launched the service. It's a really popular and powerful capability, and it comes into play pretty commonly in these types of use cases where customers have globally dispersed teams that need shared access to data.

So how does scale-out help here? The other thing we commonly hear with remote artist workstations is that performance is paramount, and very commonly what this really means is performance when you're accessing data, when you're trying to stream data from your file system to an artist workstation.

As part of this launch, we've actually been working with some of our workload validation teams on the AWS side to benchmark different types of video editing and media editing workloads on FSx for ONTAP. And with scale-out, we're excited to announce that scale-out file systems offer customers a 10x increase in the maximum number of playback streams that you can stream from FSx.

So what this means is, if you have a video asset stored on FSx, this is a 10x increase in how many of those video streams you can stream concurrently. In short, with scale-out file systems, we've been able to demonstrate over 124 concurrent 4K streams from a single file system.

So this is 124 streams of a 4K video asset coming from a single file system. Again, this is a common ask we hear from customers in this space. A lot of these assets are large, and customers need high performance and high throughput to stream them, so that artists can quickly and effectively work on media assets from a shared file system.

So those are the three use cases I wanted to talk through, where customers are looking for the combination of some of the really unique capabilities ONTAP offers and high performance.

Lastly, I want to spend a couple of minutes talking about some best practices for optimizing performance and TCO on scale-out file systems.

So one of these is parallelism. For those of you who have been running your compute on AWS for a while, you may be familiar with this from EC2: if you establish a single TCP connection between two servers, by default that connection is limited to about five gigabits per second, or about 600 megabytes per second going from bits to bytes. And that's per TCP stream.

With scale-out file systems, we're talking about large throughput, and the largest file server we offer supports six gigabytes per second, about 10x more than this per-TCP-stream limit. So how does a single client drive that full performance if a single stream is limited to only about 600 megabytes per second?

The answer is parallelism. There's the ability in NFS, and also in SMB, to automatically make it the case that when you mount an FSx file system, the connection under the covers is powered by multiple concurrent TCP streams.

So for example, if you have 10 streams, that allows a single mount on your client to drive six gigabytes per second, a multiple of that five gigabit per second limit. These capabilities are called nconnect on the NFS side and multichannel on the SMB side.

If you slow-play the demo that Mike showed you, you'll actually see that when we mounted the NFS share, we passed in the parameter -o nconnect=16. You pass that in, and it tells your client how many TCP streams to spin up, which enables really fast performance. A sketch of that mount is shown below.
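Here's the nconnect mount from the demo, wrapped in Python for consistency with the other examples; the DNS name and mount point are placeholders. Sixteen TCP streams at roughly 600 megabytes per second each comfortably covers the six gigabytes per second a single large file server can deliver.

```python
import subprocess

# Mount the FlexGroup with 16 parallel TCP connections, as in the demo
subprocess.run(
    ["sudo", "mount", "-t", "nfs",
     "-o", "nconnect=16,nfsvers=4.1",
     "svm-0123456789abcdef0.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com:/fg_data",
     "/mnt/fsx"],                      # placeholder mount point
    check=True,
)
```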

SMB offers a similar feature called multichannel. It's actually enabled by default on the client side, and on the FSx server side, as long as you're creating an SMB share and joining it to your Active Directory. So on the SMB side it's even easier: it's enabled by default.

Another best practice is that there's a capability in ONTAP called pNFS or parallel NFS. At a high level, here's how pNFS helps:

In a diagram earlier, we showed what it looks like when a client is trying to access data that's hosted on a file server other than the one it's talking to. So in this case, the client is trying to access a file that's hosted on the file server on the very bottom, but it's communicating with the file server on top.

Well, what happens here without pNFS is that the client asks the file server it's connected to for the data, that file server contacts its peer, and then it replies back with the data. So there are two network hops: client to server one, then server one to server two. This is without pNFS.

pNFS is a capability, actually enabled by default on FSx for ONTAP scale-out file systems, where a client, when it opens a file, is told where that file lives, on which server.

What that means is that when a client goes to read a file, it knows exactly which server to go to. It directly asks that server for the data, so there's no need for the extra network hop.

So that's what pNFS does. Again, it's enabled by default; in the demo that Mike showed, it was actually running under the covers. It's a capability that gives you even lower latencies for data access, removes some of the extra network hops, and just overall gives you better performance. A quick way to check that your mount is pNFS-capable is sketched below.
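pNFS ships as part of NFSv4.1, so the mount needs to negotiate vers=4.1 (the nconnect example above requests it). This rough sketch greps nfsstat output for the negotiated version on a placeholder mount point; the exact output layout varies by distribution.

```python
import subprocess

# nfsstat -m prints each NFS mount with its negotiated options
out = subprocess.run(["nfsstat", "-m"], capture_output=True, text=True, check=True)

for chunk in out.stdout.split("\n\n"):
    if "/mnt/fsx" in chunk:            # placeholder mount point
        print(chunk)
        print("pNFS-capable (NFSv4.1):", "vers=4.1" in chunk)
```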

So lastly, I want to talk about optimizing TCO. For a lot of these high-performance workloads, making sure that you're paying the right amount for your storage and compute becomes increasingly important as these workloads scale in size.

And in the context of TCO, I want to talk a little bit about storage tiers. With FSx for ONTAP, we offer two storage tiers; I talked about this a little earlier. We have an SSD storage tier, which is high performance, offers sub-millisecond latencies, and is really optimized for active data.

We also have a lower-cost capacity pool tier that has historically been used for pretty inactive data sets. The SSD tier is priced at 12.5 cents per gigabyte-month; the lower-priced capacity pool tier is priced at about 2.2 cents per gigabyte-month.

And depending on the workload you're running, on average we're seeing customers store about 20% of their data on SSD. But how much of your data should be on SSD versus capacity pool really depends on the use case. The sketch below works through the blended economics.
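A worked version of the tiering economics above, using the quoted per-tier prices and assuming the roughly 20% SSD / 80% capacity-pool split Andy mentions; your workload's split will differ.

```python
SSD_PER_GB_MONTH = 0.125        # $/GB-month, SSD tier
CAPACITY_PER_GB_MONTH = 0.022   # $/GB-month, capacity pool tier

def blended_cost(total_gb: float, ssd_fraction: float = 0.20) -> float:
    """Monthly storage cost for a data set split across the two tiers."""
    ssd_gb = total_gb * ssd_fraction
    pool_gb = total_gb * (1 - ssd_fraction)
    return ssd_gb * SSD_PER_GB_MONTH + pool_gb * CAPACITY_PER_GB_MONTH

# 100 TB with 20% on SSD: about $4,260/month vs $12,500/month all-SSD
print(f"${blended_cost(100_000):,.0f} / month")
```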

So as I mentioned before, historically a lot of customers have been using capacity pool storage for pretty cold, inactive data sets. And a big part of the reason for this is that, historically, the performance you could get reading data out of this lower-cost capacity pool tier was measured in a few hundred megabytes per second of throughput.

Again, these are rough measurements we've done on our end, just to give you some guidance about what customers have been seeing.

Well, one of the really exciting aspects of scale-out is that having multiple file servers gives you a multiplier effect not just on the performance you can drive to SSD storage, but also on the performance you can drive to your capacity pool storage.

So as an example, and these are real numbers we've run, customers can drive over six gigabytes per second of read throughput to this lower-cost capacity pool tier on a file system that has six HA pairs.

And the reason I share this is because, again, a lot of customers today, with performance in the range of a few hundred megabytes per second, may be using the capacity pool primarily for inactive, super cold data.

The capacity pool does offer higher latencies than SSD storage, so in many cases SSD storage is really the right solution for active data sets. But there are a number of use cases out there that don't need the sub-millisecond latencies of SSD storage and really just need high levels of throughput, for example when streaming large data sets.

And so, as part of scale-out, by offering a pretty significant increase in the overall performance customers can drive to their capacity pool, this really opens the door for more use cases to leverage that lower-cost tier while still getting great performance.

One of the things we're expecting to see, and we've already talked to a few customers about this, is customers leveraging capacity pool storage on FSx for ONTAP to power some of these large-block, large-I/O streaming workloads: think large files that you need to read with large amounts of throughput. So again, it's another byproduct of this launch, having multiple HA pairs under the covers and giving customers higher performance not just for SSD storage but for any storage. So those are a lot of details about scale-out file systems.

I wanted to close by actually inviting David on stage. I'll let David introduce himself in a bit, but I wanted you to also hear from a customer of ours about their journey using FSx for ONTAP, and some of the testing and validation they've done of scale-out file systems as well. David, thank you for being here.

David: Thank you. Yes, so a little introduction to myself. I'm David Miller, I'm from the UK, from ARM's headquarters in Cambridge. For those that aren't aware, this isn't the Cambridge accent; it's from much, much further north in the UK. And this isn't my normal look either; this is in support of men's health in November. I've been at ARM since 2014, having just started my tenth year, and I look after our IT architecture teams, which I'd like to point out is very different from the ARM instruction set architecture. My teams look after our internal systems, which cover our engineering workloads, where about 75% of our people operate, and other internal systems that provide business productivity, and then further into external-facing systems, where ARM's partners, which we call them instead of customers because it is a true partnership, receive their IP, the hardware and software intellectual property that we produce for the semiconductor industry.

I'll just spend a few minutes covering a little bit about ARM itself. ARM was founded in 1990. Our first major success would have been in cell phones and mobile, and then further into the smartphone area. But today, and going forward, we pretty much power every computing-based technology revolution: whether that be personal computing, the next-generation mobile experience, smart devices, and further into the infrastructure realm, whether that be OEM hardware that you'll buy for your on-premises data centers in server, storage, and networking, or in the cloud, and we're in all the major hyperscalers already; IoT devices that we see in the world; automotive; and then, very prevalent at the moment, artificial intelligence, whether that be the back-end large language models and the learning, or the inference that we'll see on the end devices.

What that means to date is that ARM has shipped over 270 billion ARM-based devices through our partners over 30 years of history. We've balanced performance with low-power compute, and we will continue doing that going forward. In fact, just recently we've done some experiments with AWS, which we've published on our website, for the Graviton processors, showing a more than 67% benefit in carbon footprint compared to some of the other platforms available.

Andy: Speaking of Graviton, I know ARM and AWS have been collaborating for quite a while now on bringing quite a few different innovations to AWS. I would love to hear a bit more about our partnership and collaboration.

David: Sure, yeah. So the first Graviton processors were released in 2018, and there have been three generations since, with Graviton4 announced this morning. I do have a picture of the silicon; it was so new it doesn't even carry the new symbol yet. We've actually been in partnership with Annapurna Labs, who developed the Graviton processor, on the Nitro system. And for those that aren't aware of what the Nitro system is at AWS, it's essentially the hardware- and software-based secret sauce that provides the hypervisor for all of the EC2 compute, and that provides the real security boundaries across the EC2 instances we see out there.

Andy: Gotcha. So I'd love to hear a bit more about some of the workloads you're running on AWS as a customer.

David: Sure. So we started around about the same time as the first Graviton instances were launched. For our first look into the cloud, we didn't actually use NetApp in this case; we packaged up a number of our workloads in containers, sent them from on-prem to the cloud, and then received the results back, and we continue to do that. To give you some idea of the scale of where we are with that today: we execute north of 10 million jobs using about 470,000 CPU cores on around 28,000 EC2 instances, consuming Spot entirely. There's a case study for this, which you can find on amazon.com as well.

Andy: Give us more details on what you did; that's quite a volume of CPU you're bursting. I thought it was impressive until I attended a presentation yesterday and saw something with 1.7 million, but the bar keeps rising. So tell me a bit more about how FSx for ONTAP comes into play, what led you to use the service, and how it's been going so far.

David: Yeah. So I talked a little bit about the batch workloads, but not everything we do is batch. We have interactive workloads, and we have compound workloads, which are a mixture of batch and interactive. And the challenge was to take what we have on premises, which I'll describe briefly now: this is our HPC, or high-performance compute, high-throughput environment. One of the problems we have is that the nature of HPC, or AI, or anybody using HPC in general, is that it's spiky: we'll have times where compute isn't used, and we'll have times where we don't have enough compute to satisfy the demand.

So here's what the environment looks like on premises. We have our users using a remote X Windows session through a bit of technology called Exceed onDemand by OpenText. We use a scheduler, in this particular case IBM's LSF. We have a directory, as you mentioned in your presentation, that's LDAP-based, and that's for the performance that we need. The data set shown here, the tools, the project data, the scratch areas, the reference data, is tens of petabytes in size and tens of billions of files, for high performance. And then we have the licenses that are consumed by the compute nodes through the scheduler against that data to execute the jobs on premises. NFSv3 is very important here, and beyond that, we need the continuity provided by NFSv4.

Andy: Got you. And so that's your on-premises environment; how does this look in the cloud?

David: So in the cloud, this is the rough architecture, and the difference here is that we've been able to take the NFS and FSx for ONTAP capabilities from on-prem and use DataSync to move the data to the cloud. It has the LDAP support, it has the NFSv3 support, and we call it the Cloud Foundation platform. We've also been able to separate projects into a Cloud Foundation platform each, and this gives us better security and better governance. And most importantly for our users, the experience is essentially the same: they don't know whether they're running on-prem or running in AWS.

One concern remained, and that was scale-out performance. But with the announcements that we've seen today, let's hope.

Andy: Yeah, and it's really exciting to hear. As I mentioned in the very beginning, one of our goals with this service is really to give customers a first-class on-prem experience. And so hearing that you rely on NFSv3, rely on LDAP, that your users are used to a particular experience and you can replicate that in the cloud so easily, that's really exciting to see. It's definitely a big part of what we want to do with the service.

David: The problem we had previously, before scale-out, was that we had to do the splitting ourselves, always creating that horizontal scaling, and getting that right is difficult. So quite often we'd look at it: small projects were fine, they'd run on one FSx for ONTAP cluster; medium projects would maybe split the scratch out, with the project and reference data everywhere else; and on large projects, we'd have to split those again. And if you get it wrong, you've got to reconfigure it, which involves downtime. But what we're seeing now, with the announcements today, is that where we were once limited on the number of compute nodes we could put against the storage, we're now struggling to get enough compute nodes to drive the storage. And we're actually seeing some of our workload execution times improve by more than 50%. So it really is game-changing for us, and we'll keep pushing the boundaries.

Andy: That's really exciting. As I mentioned earlier, a big goal here is making sure compute is not bottlenecked by storage, right? Especially since a lot of the licenses you're using are very expensive. So to the extent that we can make sure your compute runs as quickly as possible and you can spin it down, and you mentioned you're using Spot, that's really our goal here. It's really exciting to see some of the results of your early testing.

David: That's exactly what we had hoped for.

Andy: Awesome. Thank you. Well, thanks so much for joining; I appreciate the time.

Andy: Before we wrap up, I just wanted to share a quick list of additional sessions. If you have any additional questions, we're going to be staying back at the end to answer them. But I also wanted to share a list of related sessions: sessions related to FSx for ONTAP scale-out, or similar storage products that we offer at Amazon.

With that said, I want to thank you all for your time being here. I know it's well past 5pm, but again, thank you, and we'll be sticking around closer to the podium for any questions folks may have.
