HPC on AWS for semiconductors and healthcare life sciences

Good morning, everyone. Welcome to Session CMP214 - High Performance Computing on AWS for Semiconductors and Healthcare Life Sciences. I am Riva Tali from AWS, and I lead our worldwide HPC go-to-market teams - basically our customer-facing teams for high performance computing, quantum computing, and accelerated computing workloads.

This morning, I have the distinct pleasure of being joined by two of our customers and guests - Mark from Arm and Sean from AstraZeneca.

Here's a high-level agenda of what we'll be talking about for the next 40 minutes or so:

  • I'll kick off with what high performance computing on AWS looks like - what different customers across multiple industries are running on AWS services, and how AWS has been innovating for the past 15-plus years specifically for the needs of HPC workloads.

  • Mark will then talk about Arm's journey on AWS for their chip design workloads.

  • Sean will then cover how AstraZeneca has been innovating on AWS for their healthcare and life sciences genomics workloads.

  • Finally, I'll wrap up with some closing remarks.

So without any further ado, let's get started.

As you can see here, high performance computing is really all around us. Take traditional modeling and simulation in the automotive and aerospace industries, where companies run product design, engineering, and simulation workloads. In weather and climate, many customers use AWS to run numerical weather prediction and climate workloads. Many customers use AWS for healthcare and life sciences workloads such as genome sequencing, cryo-electron microscopy, and modeling and simulation, and for semiconductor chip design across the front end and back end - which is what Mark will talk about.

Within the energy space, we have many customers running workloads such as seismic processing and reservoir simulation. Financial services is another key industry where many customers use AWS. And so on and so forth.

One of the common things we see across all these different workloads is a shared set of patterns - HPC workloads basically come in a few compute and throughput profiles. What do those really mean?

Some of these are what we call loosely coupled workloads. These are embarrassingly parallel workloads where each simulation runs on maybe 2 or 4 cores, sometimes more, but you're running tens of thousands, hundreds of thousands, or even millions of them in parallel.

Similarly, another classic HPC pattern is what we call tightly coupled workloads. A great example is computational fluid dynamics or weather simulation, where a single job runs on hundreds of thousands of cores using MPI and low latency networks in order to get the best price performance.

More and more over the last 5-plus years, we have seen the emergence of accelerated computing workloads - mainly large-scale machine learning training workloads that use special-purpose GPUs, ASICs, and so on.

And this is just a small snapshot of our customer base by industry and the types of workloads they are running.

As you can see here, Sean will talk about AstraZeneca's journey within the healthcare and life sciences industry. Within the energy space, we have customers like Commonwealth Fusion, a startup working on nuclear fusion simulation; they were here at re:Invent last year talking about their journey. Customers like Baker Hughes run modeling and simulation, and so on and so forth.

The other important aspect of running high performance computing workloads in the cloud is our customers' ability to scale, and thereby accelerate their work. Here are three examples:

  • In the healthcare and life sciences industry, the Dana-Farber Cancer Institute - they were actually here at re:Invent last year talking about their success - used an application called VirtualFlow and scaled to more than 5.7 million vCPUs to screen billions of compounds against cancer protein targets.

  • Similarly, Woodside and other customers based out of Australia have used AWS for their seismic processing workloads - essentially identifying natural gas below the subsurface. They were able to compress the cycle time from 10 days to 19 minutes - a 150x business acceleration.

  • And finally, another AWS customer in the geospatial domain: every year at the Supercomputing Conference, organizations submit their LINPACK results to be listed in the TOP500. The geospatial startup Descartes Labs was able to use a credit card, spin up a close-to-10-petaflop cluster on AWS, and get into the top 50 a couple of years ago.

So that's really the scale our customers are able to achieve by tapping into our HPC services.

And as you can see here, AWS has been continuously investing and innovating in the high performance computing and accelerated computing domains for over a decade. This is just a snapshot of the last 6 to 8 years.

For example, AWS Batch - and you'll hear a lot more about it in both Arm's and AstraZeneca's journeys - was one of our first services for scale-out workloads. And in just the last 18 to 24 months, you can see there has been a lot of innovation specifically around compute, storage, and networking.

Now, when it comes to running HPC or accelerated computing workloads on AWS, we take a system-level view. What does that really mean? It boils down to five fundamental pillars: compute, storage, networking, job orchestration, and results visualization.

So these are really the five pillars for HPC workloads. Let's spend a couple of minutes on each of these buckets:

  • For compute, EC2 - Elastic Compute Cloud - gives our customers a variety of choices: more than 600 EC2 instance types to address virtually any kind of workload you want to run. We have processors from Intel and AMD, GPUs from NVIDIA, and our own silicon - Graviton - for general purpose workloads. For accelerated computing, we also have Trainium and Inferentia.

  • When it comes to storage, customers have a variety of options to pick and choose from, but HPC workloads need high performance and fast I/O. So we have services such as FSx for Lustre, FSx for NetApp ONTAP, and FSx for OpenZFS, along with many other services such as S3, EBS, EFS, and so on.

  • The third and very important pillar for HPC workloads is networking. We have our own network interface called EFA - the Elastic Fabric Adapter - which provides low latency, high bandwidth, high throughput networking. On our GPU nodes, we offer 3.2 terabits per second of networking, which these large-scale generative AI and ML workloads need for the MPI communication calls across nodes.

  • The fourth pillar is job orchestration. You have your compute, storage, and network; now, to schedule and run your simulation workloads, you need some sort of job scheduler. AWS Batch is one of the oldest cloud-native schedulers on AWS and is very widely used across a variety of customers and workloads (a minimal job-submission sketch follows after this list). We also have AWS ParallelCluster, which supports Slurm as its scheduler.

  • Two weeks ago at the Supercomputing Conference, we introduced a new service called Research and Engineering Studio, which provides a portal for our customers to submit their jobs through a simple web page.

  • And finally, the last pillar for HPC workloads is visualization. We have a service called DCV - Desktop Cloud Visualization - that gives customers the ability to stream pixels over the network. Say you run a large-scale simulation that produces hundreds of terabytes or even petabytes of data: instead of moving the data locally for post-processing, DCV lets you stream the pixels over the network rather than moving the data back and forth.

So in summary, again - compute, storage, networking, orchestration, and visualization - those are the five pillars, and those are the services that address them.
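To make the orchestration pillar concrete, here is a minimal sketch of submitting a single job to AWS Batch with boto3. The queue and job definition names are hypothetical placeholders, not the setup of any customer in this talk; it simply assumes a job queue and a container job definition already exist.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Submit one simulation job to an existing job queue and job definition.
# "hpc-ondemand-queue" and "cfd-solver-jobdef" are hypothetical names.
response = batch.submit_job(
    jobName="cfd-case-001",
    jobQueue="hpc-ondemand-queue",
    jobDefinition="cfd-solver-jobdef",
    containerOverrides={
        "command": ["run_solver", "--case", "case-001"],
    },
)
print("Submitted job:", response["jobId"])
```

Batch then places the job on compute it manages; a ParallelCluster user would submit the equivalent work to Slurm instead.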

In just the last 15 to 18 months alone, we have launched five new EC2 instance types. As you can see here, AWS has purpose-built compute for virtually all HPC workloads.

For example, the first one is our HPC7a instance, based on the AMD Genoa platform. It has 192 physical cores and 768 gigabytes of system memory - about 4 gigabytes per physical core - and 300 gigabit EFA, the low latency network.

The other very interesting instance is HPC7g, which is based on the 64-bit Arm architecture. It has 64 cores, 128 gigabytes of system memory, and 200 gigabit EFA. It's a great instance if you're running applications like computational fluid dynamics, numerical weather prediction, or computational chemistry codes such as GROMACS, AMBER, and LAMMPS.
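As a rough illustration of what "an instance with EFA" means in practice, here is a hedged boto3 sketch of launching a single HPC7g instance with an Elastic Fabric Adapter in a cluster placement group. The AMI, subnet, security group, and placement group IDs are hypothetical and would need to be replaced; real MPI clusters are usually provisioned through AWS ParallelCluster rather than by hand like this.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one hpc7g.16xlarge with an EFA interface in a cluster placement
# group. All resource IDs below are hypothetical placeholders.
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="hpc7g.16xlarge",
    MinCount=1,
    MaxCount=1,
    Placement={"GroupName": "cfd-cluster-pg"},   # keeps nodes close for low latency
    NetworkInterfaces=[{
        "DeviceIndex": 0,
        "InterfaceType": "efa",                  # request an Elastic Fabric Adapter
        "SubnetId": "subnet-0123456789abcdef0",
        "Groups": ["sg-0123456789abcdef0"],
    }],
)
print(resp["Instances"][0]["InstanceId"])
```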

One of the interesting things we have been doing, especially around the Arm ecosystem, is working with a lot of our independent software vendor (ISV) partners. What you see here is only a small sample, but the full list is quite extensive, including the likes of Ansys, Cadence, Siemens, and Synopsys.

Similarly, when it comes to system integrator partners, we work very closely with the likes of Rescale, Ronin, TotalCAE, and many others. We understand it is super important for us to collectively build the ecosystem so customers can use these instances for their HPC workloads.

With that, let me now introduce Mark Galbraith from Arm, who will talk about Arm's journey on AWS for their HPC workloads.

Great, thanks Riva! Awesome.

Hi there, my name is Mark Galbraith. I'm VP of Productivity Engineering at Arm.

I'm going to talk a little bit about Arm's journey to the cloud. Through that, I'll cover why we use the cloud, and why we use Arm servers in the cloud for EDA workloads. I'll give a few case studies, and then I'll talk a little bit about the benefits.

Let me give you a bit more of an introduction to Arm before we get into that. 70% of the world's population uses Arm processor technology. But we don't make chips ourselves - we develop IP and software that our partners use to build their SoCs and develop their own software on our technology. More than 270 billion Arm-based chips have been shipped to date, with 30.6 billion of those in FY23 alone, which is quite incredible.

If you think about the sorts of products where you'll see Arm, it runs from the mobile revolution to the cloud disruption. You're seeing Arm-based products in the infrastructure space and of course the client space, in automotive and in IoT. So whether it's your tablet or your cell phone, servers in the cloud, or your car, you'll find Arm technology everywhere. As has just been mentioned, AWS is an Arm IP customer through Annapurna Labs, and you can see how Arm-based servers have made their way into AWS silicon over the last few years.

I also want to mention sustainability. We have a target to reach net zero carbon by 2030. In the chart on the right-hand side, the yellow lines are our emissions targets and the orange bars are what we've actually achieved already - an 87% improvement versus the baseline. So how did we do that, and how does it relate to cloud?

We started by using renewable energy for our offices and our non-cloud, on-premises servers - that's now 100% renewable. We have to work very closely with our supply chain, and then think about travel: setting carbon travel budgets and implementing them in our travel policy. But the usage of cloud has really been quite incredible as well. In recent simulations we've run, we see a 67.6% reduction in carbon emissions using Graviton servers versus non-Graviton alternatives. It's been really quite remarkable.

Let me talk a little bit about the chip creation process. We have a flow chart here: everything from requirements through design, verification, and implementation, and finally through to tape-out and fabrication. Making a chip costs multiple millions of dollars, so the design really has to be right the first time - and this is the sort of exercise a chip design house such as Annapurna Labs goes through. What Arm provides to fit into this process are pre-validated IP units - CPUs, GPUs, system IP - as well as pre-validated cell libraries and memory compilers. Ultimately this allows partners to take these pre-validated units and build their SoC, and that's exactly the sort of thing AWS, through Annapurna Labs, is doing.

When we think about those pre-verification steps, this is where tens to hundreds of millions of CPU hours are used in simulations to test those designs. By doing that up front - whether it's soft IP like CPUs and GPUs, or now chiplets, which are hardened versions of those - we're able to give our partners the quickest possible starting point for building their silicon. And with that huge compute requirement, the usage of cloud has become instrumental for us.

So why move this into the cloud? I see it boiling down to three basic areas. Take the first one, improving agility: if you've got a fixed compute farm and multiple projects all running at the same time, each with its own milestones, it becomes a real game of Tetris to fit all of those project profiles into a fixed compute capacity. Elastic compute capacity lets you run all those projects in parallel without disrupting any of them, and that leads to shortened development cycles, which is incredible.

Then think about the project teams themselves - this is where we unlock efficiency. Across all the different workloads involved in building a chip, we find that different compute is needed at different times. Sometimes it's a large-memory machine; sometimes you need many machines all at once. The cloud gives you the right machine, in the right number, at the right time for your project teams, which really lets them maximize their productivity. The third area is accelerating innovation.

This is where you can collaborate much more easily across all your project teams, whether within the same organization or with partners, and that works extremely well. It allows our teams to get to their quality milestones much more quickly, which leads to a much-improved time to market. So it helps tremendously.

Let's dive a little deeper into the design flow I showed at the beginning and talk about the EDA tools that are available. Looking at that overall flow, we split it into front-end design, back-end design, and then production and test. When we think about these different sections and our EDA partners - whether it's Synopsys, Cadence, Siemens, or Ansys - they are providing technologies, now available running on Arm servers, that address every single part of the design flow.

But more than that, think about the different characteristics of each of these workflows. For some of the front-end design tasks, a large part of the IP verification process involves unit-level random simulations. These are typically quite short-running jobs, but you're running thousands and thousands of them in parallel. That maps really nicely onto Spot Instances, so we can run them and scale out very, very quickly.

On the back-end design process, we're running long place-and-route or static timing analysis jobs. These jobs run much longer - many hours or even multiple days. For these processes you're often working with a full-chip database, so the data sets are much larger. That typically means larger-memory machines, jobs that run for a much longer time, and a lot more storage to take into account as well. So there are really different profiles at different stages of the design process.
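As a sketch of how that front-end fan-out can map onto Spot capacity, the following submits thousands of short random simulations as one AWS Batch array job to a queue assumed to be backed by a Spot compute environment. The queue and job definition names are hypothetical, not Arm's actual setup.

```python
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

# Fan out 10,000 short unit-level random simulations as one array job.
# Each child sees AWS_BATCH_JOB_ARRAY_INDEX and can derive its own seed.
resp = batch.submit_job(
    jobName="unit-random-sims",
    jobQueue="verif-spot-queue",          # hypothetical Spot-backed queue
    jobDefinition="unit-sim-jobdef",      # hypothetical container job definition
    arrayProperties={"size": 10000},
    containerOverrides={
        "environment": [{"name": "SEED_BASE", "value": "42"}],
    },
)
print("Array job:", resp["jobId"])
```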

Then, when we think about how to access the cloud, we have a number of different ways of doing that, and I'll share a few examples. One is technology we've developed called Cloud Runner. This is where we containerize our workloads, take those containers into AWS, and run them using AWS Batch - we can get access to the Spot fleet and to On-Demand servers.

We also have the Cloud Bridge technology we've developed, which allows our teams to access the AWS infrastructure directly. In that case we may be running AWS Batch, but ParallelCluster has also been very useful for us. The crucial thing here is that our projects are, if you like, still based on premises - all of our license servers and project databases remain on prem - but we are essentially bursting into the cloud for the extra capacity we need.
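A minimal sketch of what "containerize the workload and run it with AWS Batch" can look like: registering a container image as a Batch job definition with boto3. The image URI, resource sizes, and names here are illustrative only, and license checkout from the on-prem license servers is assumed to happen inside the container.

```python
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

# Register a containerized simulation tool as a Batch job definition.
# The ECR image URI and the resource sizes here are hypothetical.
jobdef = batch.register_job_definition(
    jobDefinitionName="containerized-sim-tool",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/sim-tool:1.0",
        "resourceRequirements": [
            {"type": "VCPU", "value": "4"},
            {"type": "MEMORY", "value": "16384"},   # MiB
        ],
        "command": ["run_sim", "--config", "Ref::config"],  # Ref:: is Batch parameter syntax
    },
)
print("Registered:", jobdef["jobDefinitionArn"])
```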

Now I'm going to talk a little bit more about how we run entire projects in the cloud. Before I do that, let me describe our on-premises environment. Our engineers use Exceed, and that's how they submit their jobs; the EDA tools, license servers, and repositories are all based on prem. We use an IBM LSF cluster, and you can run all your jobs there, whether they're interactive or batch.

When we decided to go to the cloud to run entire projects, a different mechanism was needed, which is why we developed our Cloud Foundation Platform - essentially LSF in the cloud. The front end looks almost exactly the same; engineers are still using Exceed. But because we've effectively built the LSF cluster in AWS, that's where we run our workloads. From an engineer's perspective it has a very similar look and feel to how they operate on prem, but now we're able to run in the cloud.

The important thing here is that instead of being on prem, where the compute is there at all times and you have to operate within that fixed size, in this case you can grow and shrink capacity as needed by the different workloads.

So where are we now? A couple of examples. The chart on the left shows our percentage of on-premises versus cloud usage - blue is cloud and orange is on prem. You can see it increasing over the years, and we're now running 73% of our compute workloads in the cloud.

What we're finding is that all of our product groups are now using the cloud daily as part of their development life cycle. Whether you're a hardware developer or even a software developer, cloud is available for any of your workloads - simulation workloads, place and route, or modeling.

On the whole, we've been using Spot Instances wherever we can for maximum cost efficiency, but of course we use On-Demand Instances as well. And with the multiple platforms we've built - Cloud Runner, Cloud Bridge, and our Cloud Foundation Platform - we're able to find the right platform for the right need at the right time.

The way we view the process of getting into the cloud, there are really three main steps: the first is enabling our teams to get into the cloud, and then of course migrating them into the cloud.

The final stage is the optimization piece, where you're mapping the best instance type to the given workload and looking at the whole stack end to end to see where you can be most efficient.

The chart on the right looks at our Graviton usage over time. This has been gradually increasing, in part thanks to the partnership with our EDA partners as they develop and optimize their tools to run on Arm-based technology, and we expect that to continue increasing over time.

Thinking a little bit about the benefits now: we've got the same workload on a couple of different tools. On the left-hand side, with the first tool, the green bars are the Graviton instances and the orange bars are the x86 instances. Lower is better here because we're looking at the cost to run an entire workload, and the Arm-based R7g has given us the best overall cost.

If we look at the lower chart on the bottom left, this is outright performance. You can see the performance improving over time with the various generations of Graviton, with the Graviton instance and the AMD instance being the two fastest. On the right-hand side is a different tool, again showing overall cost - Graviton is the cheapest - and comparable performance to the AMD instance as well.

Looking a little further ahead, I want to talk about our primary cloud concept and vision. We started off bursting into the cloud, then started thinking about how to run whole projects in the cloud. With primary cloud, this is one cloud account for the majority of the work in our projects, and the whole environment is provisioned in a fully automated manner using code.

First of all, we set up the virtual desktop infrastructure, and we make sure the data is co-located with it - this is primary cloud. That then allows us to set up our Cloud Foundation Platform environment in the primary cloud, but we can also access partners such as Rescale, whose technology allows us to reach back into on-premises resources.

When we run our project workloads, we typically can't run everything in the cloud today because we've got some bespoke hardware that really has to remain on prem - typically board farms, FPGAs, emulators, that type of thing. Having that technology allows us to run those workloads back on premises.

Whether it's our on-prem clusters, which will be getting more cloud-like in the future, or the bespoke hardware, it allows us to do this - and, in line with our multi-cloud goals, it also allows us to access other cloud infrastructures.

So to conclude: designing chips is very resource intensive and expensive, but the AWS cloud really helps us improve productivity and reduce costs while decarbonizing compute. Of course we leverage AWS Graviton where we can, for the better performance and lower cost it brings, and think about the sustainability angle: as I said earlier, our latest Graviton simulations show a 67.6% reduction in workload carbon intensity, which is really quite incredible.

Of course, with the cloud we can find the optimal instance at the best price for the job at hand, and we always have the scalability, which really improves throughput for our project teams. And when you give teams what they need, when they need it, you have happier engineers - who end up being much more productive engineers.

As I said earlier, the EDA partners and their tools ecosystem are ready for Arm-based technology today, and Arm is now available in all major clouds. With that, I'm going to hand over to Sean, who's going to tell us about AstraZeneca's journey to the cloud. Great.

Thank you, Mark. I'm Sean O'Dell and I work for AstraZeneca in the Centre for Genomics Research, currently based in Cambridge in the UK. Thank you for coming to this session so early in the morning - I know how you feel; I came from London last night. So thank you for coming so early today, we appreciate it.

What I'll talk about today is who we are, what we do, and why we want to get data into the hands of scientists. I'll put this in the context of scale - we came up with this term "petascale", and I'll explain what that means to us. And then I'll talk about how we do this with AWS to achieve outcomes, not only in sharing the data we create but also in relating it back to the business and to what AstraZeneca is trying to do.

So, the Centre for Genomics Research: this started in 2017, and I've been working with it in one way or another since 2018. If we start with the value - what is the value we're trying to derive? It's really around two things: life sciences and healthcare. In the area of life science, we want to use information about your DNA, about your genome, to help us drive a better understanding of the biology of disease.

In regards to healthcare, by using this information from your DNA or your genome across a large population of people, we can derive statistically valid information, which then helps us select patients for new clinical trials and also influences the type of therapeutics that might be applied to a patient with a given condition. In the industry, we also refer to this as precision medicine.

So we're trying to couple the information we have about your genome, analyzed across a huge and vast set of data, to personalize medicine and the selection of therapy for a person based on that information. That is the value we are driving, or attempting to drive, from the Centre for Genomics Research.

To do that, we need a set of capabilities, and the capability in the middle there - which is where I'll spend most of my time, and where my role sits - is building these informatics and analytics pipelines and capabilities. So I'm going to talk mostly today about a platform built on AWS that enables us to deliver this value to the business, give you some examples of the outcomes, and show how we do it with AWS.

This is who CGR is. I work in informatics, which is the one shown with the head-and-gears icon on the slide - that's where my focus is, and that's what we'll focus on today.

So how does this all start when we think about data? I said we want to get data into the hands of scientists. If you saw me in the elevator and asked, "Sean, what do you do?", I'd say that my job, and my team's job, is to get data into the hands of scientists so we can drive that value.

This is how it starts: with sequencing. This is going from biology - blood or saliva that comes from a person; we're focused on humans, not as opposed to aliens, but as opposed to mice or something else. What happens is that people contribute their biology, their blood or saliva, into what we call a biobank.

There are some large biobanks we work with. There's the UK Biobank, and there's a big biobank we work with in Mexico City - the Mexico City Prospective Study is the name of the project. These are very large biobanks: in the UK Biobank, 500,000 people have contributed what we call a sample.

That sequencing step takes the biology and unravels it into bits and bytes. At this point we have files, we have data. Then we need to extract, transform, and load that data into repositories where scientists can run analysis or analytics to drive the discoveries and innovation that feed the values we talked about.

It's really interesting when you look at our workflow here, because Mark has a very similar workflow. One of the points Mark made, and that I will make as well, is that what we talk about here - I'm assuming you're not all in healthcare today - you can apply to anything.

We call it the genomics workflow orchestrator and the genomics file store, but you can think of it as the widget workflow orchestrator or the widget file store. What Mark showed in his slides is almost exactly the same type of diagram. We have these steps we need to run to get data from the sequencing machine, as it comes off, through to what we call tertiary ingestion or tertiary analysis.

These are all workflows we need to run, and that's what the workflow orchestrator does: it coordinates and orchestrates running all of these workflows in an automated manner. But we also need to keep track of all the data produced by these workflows, and we refer to that as the genomics file store.

Each one of these workflows - primary analysis, validation, ingestion - is reading data from the genomics file store and writing data back to it. When we started this project, the genomics file store was actually the first thing we built. And what is it? It's S3 objects.

It's also metadata about those objects, because it's critical that we understand how a file was created and what tools were used to create all of this data. So this is our extract-transform workload. It's ETL - it has a fancy name, but it's really all about ETL.
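A minimal sketch of the "objects plus metadata" idea: write a workflow output to S3 and record its provenance (which tool and version produced it) as a DynamoDB item. The bucket, table, and attribute names are hypothetical, not AstraZeneca's actual schema, and the DynamoDB table is assumed to use sample_id as its partition key.

```python
import boto3

s3 = boto3.client("s3")
ddb = boto3.resource("dynamodb")

BUCKET = "genomics-file-store"          # hypothetical bucket
catalog = ddb.Table("gfs-metadata")     # hypothetical metadata table

def store_output(sample_id: str, local_path: str, tool: str, tool_version: str) -> None:
    """Upload a workflow output and record how it was produced."""
    key = f"samples/{sample_id}/{local_path.rsplit('/', 1)[-1]}"
    s3.upload_file(local_path, BUCKET, key)
    catalog.put_item(Item={
        "sample_id": sample_id,
        "s3_key": key,
        "tool": tool,                   # provenance: which tool created this file
        "tool_version": tool_version,
    })

# Example (hypothetical aligner step):
# store_output("S000123", "/tmp/S000123.cram", "aligner", "4.2.1")
```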

When we talk about the scale and the challenge - we came up with this term "petascale"; we could say large scale or big scale, but petascale refers to the fact that we're moving petabytes of data in and through these pipelines, and also storing petabytes of data in our genomics file store.

I mention 25 petabytes of data here. That's not necessarily what we're consuming in S3, but it is all the data that has passed through this pipeline at various stages.

We also need standardized pipelines - Paula and I were speaking about this earlier. I talked about 500,000 samples: we have to process the first sample in exactly the same way as we process the millionth sample. So you need standardized pipelines and standardized workflows, and to make sure we do that with code, and that we're always testing it in the same way, we of course use CodePipeline, as you would expect.

We use AWS data lakes to bring all of this data together in a place where scientists can access it. As part of our processing, we use software called DRAGEN, which runs on F1, or FPGA, instances - and there aren't thousands and thousands of those, so we also built our pipelines to be geo-distributed in an automated manner.

We can run our pipelines in three regions - London, Dublin, and the US. Those are the things you need to consider when you're building something like this in the genomics world.

In genomics, the claim to fame is: how fast can you do this? How many samples can you get through in a day? Using AWS, and the kind of technology AWS provides - where, as Mark also mentioned, you can burst and handle large spikes of workload - is what has enabled us to get to the point where we can run numbers like 38,000 whole exomes in a day, which is quite industry-leading.

For whole genomes, it's 11,000 per day. The point is that AWS allows us to run at this scale - we call it petascale; you can use that term if you want as well.

Here are some of the components we use - these are the bedrock components, and Riva talked about some of these earlier. These are the core pieces, and they are what you would expect.

We use AWS Batch - AWS Batch is really our workhorse. As I mention here, we've been running it in production since 2018. All those workflows I showed you on the previous slide run in AWS Batch, and each sample needs to run probably 10 Batch jobs. So with 500,000 samples, that's a lot of Batch jobs.

We also use AWS ParallelCluster, because ParallelCluster has Slurm - a nice interface that people who have used HPC are familiar with. There are times we burst from our on-premises Slurm cluster into AWS using ParallelCluster, and that works quite well.

To coordinate all this work we use AWS Step Functions. Of course we use Aurora for databases, and DynamoDB, and S3, as I mentioned, is where we store all of our objects.

We're also using a newer feature called Mountpoint for Amazon S3 - mount-s3 is the command. It's a really handy feature that lets you mount an S3 bucket as if it were a file system. There have been tools that could do that for a while, but this one is fully supported by AWS; we've been an early adopter and have been quite successful using it.
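As an illustration of what Mountpoint for Amazon S3 enables, here is a small sketch that mounts a bucket read-only and then reads an object through normal file I/O. The bucket name and object key are hypothetical, and the mount-s3 binary is assumed to be installed on the host.

```python
import subprocess
from pathlib import Path

# Hypothetical bucket and mount point; mount-s3 must already be installed.
bucket = "my-genomics-file-store"
mount_dir = Path("/mnt/gfs")
mount_dir.mkdir(parents=True, exist_ok=True)

# Mount the bucket read-only so downstream tools can use ordinary file I/O.
subprocess.run(["mount-s3", "--read-only", bucket, str(mount_dir)], check=True)

# Any tool can now read objects as if they were local files.
with open(mount_dir / "samples/sample-0001/sample-0001.cram", "rb") as fh:
    header = fh.read(4)
print(header)
```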

So how do we stitch all this together? I know this is a bit of an eye chart, and I won't go through every box here, but this is how we automate all of it. There's an end-user terminal there, and the end user doesn't think about "I need to run this on Batch, then store the data in S3, then load the data into a database."

What the end user - what we call a pipeline manager - thinks about is: "I have, for example, 500,000 samples that I need to get through my workflow." They don't care how that runs; to be honest, a pipeline manager doesn't even care whether it runs on AWS or on Google. That's how a pipeline manager thinks.

They just run a few commands to submit a processing set - a data set of, say, 500,000 samples. And then we do what you would expect. I don't think what I'm showing on the screen is really innovation; it's a standard core pattern that anyone can use.

It's your typical pattern: a job goes to a queue, is dispatched by a Lambda function, which then starts a Step Functions state machine that coordinates the work - running the Batch job, collecting the data after the Batch job finishes, keeping track of that data, storing it in the metadata catalog, and so forth. It's a very common pattern.

In terms of these types of diagrams, this is a fairly simple pattern, and the point I want to leave you with on this slide is that anyone can do this. It isn't specific to genomics - if you're processing widgets, you can use the same type of pattern.
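Here is a hedged sketch of the dispatch step of that pattern: a Lambda function triggered by the queue (assumed here to be SQS) starts one Step Functions execution per queued sample. The environment variable, message shape, and naming are hypothetical, not the actual CGR implementation.

```python
import json
import os
import boto3

sfn = boto3.client("stepfunctions")

# Hypothetical state machine that runs the Batch job and catalogs the outputs.
STATE_MACHINE_ARN = os.environ["SAMPLE_WORKFLOW_ARN"]

def handler(event, context):
    """Dispatch each queued sample message to a Step Functions execution."""
    for record in event["Records"]:          # SQS event source delivers a batch of records
        sample = json.loads(record["body"])  # e.g. {"sample_id": "S000123", "input_uri": "s3://..."}
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            name=f"sample-{sample['sample_id']}",   # one named execution per sample
            input=json.dumps(sample),
        )
    return {"dispatched": len(event["Records"])}
```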

So, some of the outcomes. What does this really mean - what kind of impact do these ETL jobs have? One of the things I want to mention here, which Mark and Riva also touched on, is that when we run, for example, billions of statistical tests, these are perfect examples of scatter-and-gather jobs.

We run all of that on EC2 Spot, and it's quite cost-efficient to do it that way. Being able to run these billions of statistical tests quickly gives real value to our business, because we can get data into the hands of scientists quickly.
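A minimal scatter-gather sketch in the same spirit, assuming hypothetical Batch queues and job definitions: an array job shards the statistical tests across Spot capacity, and a single dependent job merges the results once all shards finish.

```python
import boto3

batch = boto3.client("batch")

# Scatter: run the tests as an array job on a Spot-backed queue.
# Queue and job definition names are hypothetical.
scatter = batch.submit_job(
    jobName="assoc-tests-scatter",
    jobQueue="stats-spot-queue",
    jobDefinition="assoc-test-jobdef",
    arrayProperties={"size": 5000},     # each child handles one shard of tests
)

# Gather: a single job that merges the per-shard results; it only starts
# once every child of the scatter array job has completed.
batch.submit_job(
    jobName="assoc-tests-gather",
    jobQueue="stats-spot-queue",
    jobDefinition="merge-results-jobdef",
    dependsOn=[{"jobId": scatter["jobId"]}],
)
```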

We've also been able to identify novel targets for medicines, and this has had a big impact on clinical trials: we can now use the information we've created and analyzed to select people for clinical trials and get medicine into their hands, based not only on their clinical data but also on their genomic data.

It's also very important to AstraZeneca - and to many companies here, not just AstraZeneca - to return what we've gained from this work back to the scientific community. We do that by publishing papers, as is the practice, to make sure we're contributing back to the community.

I have one example here - we call it the PheWAS portal. What it does, to tie this back together, is show the linkage between a clinical observation and genes. I know the writing is hard to read, but you can reproduce this diagram yourself.

We're showing the linkage between a clinical observation - in this case breast cancer - and the genes in the yellow box that have some type of association with this clinical condition. This isn't a new discovery; I'm not sharing some secret we just came up with. But for a scientist who doesn't have the resources that AstraZeneca has...

...this is our way of giving back to the scientific community, so that independent researchers can leverage the data we have. And this is just one example of that.

All right, so that's what I had to cover today. Thank you again for attending. From our point of view, working with AWS is not just about what technology AWS can provide - it's just as much about the people we work with at AWS.

There's a group at AWS with a lot of knowledge in healthcare, and in the HPC world AWS has real experts. So I just wanted to say, in closing, that it's not just about the technology; it's about the people we've been able to work with.

We really look forward to working with AWS in the future. This is a long journey, and we look forward to continuing it with them, to impact people, society, and the planet. Thank you.

All right, thank you, Sean. So you've seen in the last 40 minutes or so how customers like Arm and AstraZeneca - and thousands of others - are able to use our HPC and accelerated computing services at AWS to accelerate their innovation and provide business value to their own customers.

One of the key reasons our customers are so successful running their advanced computing and HPC workloads on AWS is really the ecosystem of our partners. Here you can see, again, a very small subset of our independent software vendor (ISV) partners as well as our consulting partners.

It's very important for us to work closely with these partners to ensure their applications deliver the best price performance on AWS. Earlier you saw the set of EC2 instances I talked about; when we launch these new types of instances and services, our team works very closely with all of these partners to make sure their applications provide the best price performance for our customers.

Similarly, we work closely with the SI partners so they can successfully deploy these services for you. In summary, all of our innovations over the past many years were recognized again at the recent Supercomputing Conference two weeks ago in Denver.

For the sixth year in a row, AWS has been chosen by our customers as the best HPC cloud platform for running HPC workloads. With that, I thank you all for taking the time on a Monday morning at 8:30 to come all the way to the other end of the Strip. I really appreciate your time - thank you all.
