Hello and welcome to CMP313, which is our deep dive session on how Graviton enables the best price performance for your AWS workloads. By way of introduction, I'm Sudhir Raman. I lead product management for our core compute instance platforms at AWS, and co-presenting with me today are Ali Saidi, Senior Principal Engineer at AWS, and Ambud Sharma, Technical Lead at Pinterest.
So we have a full and exciting agenda lined up for you today. We're going to talk about our latest and greatest processor that we just announced at re:Invent a couple of days ago, which is Graviton4. We're going to discuss workload performance and customer stories. We're going to talk about how you can get started moving your workloads to Graviton, and Ambud will also walk us through Pinterest's Graviton adoption journey and some of the key takeaways.
So with that, let's get started with a little bit of background here. AWS has been investing in building custom chips over the last several years. These have included our Nitro cards, custom chips that we built to power our Nitro system, where we offload storage and networking off the main server to maximize resource efficiency and improve the overall security of our platforms. It's also included custom chips for AI, like Inferentia and Trainium, and then we have powerful and efficient core compute infrastructure with Graviton-based servers.
Why do we do that? There are multiple reasons why we have invested in building custom chips. First and foremost is specialization: this allows us to bring the right feature set to our products, optimizing for cost and power based on how our customers use the cloud on AWS. Second is speed: from defining the product to building the server to actually landing those in our data centers, all the teams that own the hardware and the software are under one roof, so we're able to work together and get the product from concept to definition to launch really quickly. Third is innovation: an opportunity for us to innovate across the stack, from the hardware all the way up to the software and application layers, by working together rather than optimizing each of these components in a silo. And finally, security, where Nitro continues to help us enhance the security of our servers.
So let's take a look at our Graviton journey, which started back in 2018 with the first Graviton chip that was announced powering the EC2 A1 instance at that time. Our goal really was to prove out that you could run cloud workloads on Arm-based servers. We quickly followed that up with Graviton2 in 2019, which delivered a leap in performance and capabilities versus the first Graviton: it came out with 4x the number of cores and 2x the performance per core versus what we delivered in that first generation.
We've continued that innovation vector with Graviton3, announced in 2021, which delivered another improvement in performance, up to 25% over Graviton2. And finally, the latest addition to that lineup of chips is Graviton4, which we just announced in Adam Selipsky's keynote on Tuesday, and it was characterized as the most powerful and the most energy efficient chip that we have built thus far at AWS.
But first up, let's start with Graviton2 and take a look at what we have in terms of offering. Graviton2, when we launched it, delivered up to 40% better price performance for a wide range of workloads, anywhere from web servers, video streaming, and gaming all the way to databases, analytics, and more. Today we have 13 instance types that are powered by Graviton2 chips, and they're available for nearly every workload in the cloud, with compute, memory, storage, and network optimized options, and we also have options with GPUs attached to them.
When Graviton3 came along, it added another 25% compute performance over Graviton2 and also delivered 2x the floating-point performance of Graviton2. Graviton3 was also the first time that we brought DDR5 memory to our data centers and to the cloud, and that delivers up to 50% more memory bandwidth versus the DDR4 memory that we used with Graviton2, really benefiting those memory-intensive workloads.
And today AI is front and center in everyone's minds. For the class of AI workloads that you can run on CPUs, Graviton3 brings very interesting architectural improvements, with 2x wider vectors and support for bfloat16 instructions. When you combine that with all the improvements that have happened in the ecosystem around PyTorch, TensorFlow, and the Arm Compute Library, customers can see up to 3x better performance for CPU-based ML workloads when they run on Graviton3.
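To make that concrete, here's a minimal sketch of what CPU inference in bfloat16 with PyTorch might look like on a Graviton3-based instance. The model and shapes are placeholders, and the actual speedup depends on your model and on your PyTorch build picking up the Arm Compute Library.

```python
# Minimal sketch: bfloat16 inference with PyTorch on an Arm (aarch64) CPU.
# The model and input shapes here are placeholders.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).eval()

x = torch.randn(64, 1024)

with torch.inference_mode():
    # float32 baseline for comparison
    baseline = model(x)
    # Cast model and input to bfloat16, which Graviton3 supports in hardware.
    bf16_model = model.to(torch.bfloat16)
    bf16_out = bf16_model(x.to(torch.bfloat16))

print(baseline.shape, bf16_out.dtype)
```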
In terms of offerings, we have eight instance types today that are powered by Graviton3 chips. This is part of our core portfolio, with compute, general purpose, and memory optimized instances. We also build variants for specialized instance types, like the network optimized C7gn, and we built an instance just for HPC, or high performance computing, given all the benefits that Graviton3 delivers, especially with Graviton3E, which adds another 35% improvement in vector instruction performance, directly benefiting those HPC workloads.
In terms of customer momentum, I just wanted to quickly give you a snapshot of where we are seeing our customers use Graviton today. More than 50,000 customers, including all of our top 100 EC2 customers, use Graviton-based instances, and we're seeing adoption across customers of every size, in every geo, across multiple verticals and industries.
Here's what some customers are saying as they transition to Graviton3. This is a theme that resonates across pretty much the broader customer base: improvements anywhere from 10 to 20% all the way up to 40 or 50% in some cases, like you see here from NextRoll, Sprinklr, and Stripe. Sustainability is a key pillar that many organizations are looking into today, looking at opportunities to lower their overall carbon footprint, and Graviton plays an important role in being able to achieve that objective.
Graviton uses up to 60% less energy to compute the same workload as other comparable processors. This really means that, in addition to the price performance benefits that Graviton enables, a lot of our customers are also looking at using Graviton to lower their overall carbon footprint. Case in point: here's an example with Snowflake. When Snowflake transitioned their workloads to Graviton-based instances, they were able to lower their carbon emissions per Snowflake virtual warehouse credit by 57%. Additionally, as part of the transition, they were also able to offer 10% faster performance on average to their customers. And it's not just external customers.
Graviton has been an integral part of our infrastructure internally within Amazon as well. An example here is how we've been using Graviton for our own Prime Day events. We started that in 2021 with 12 retail services deploying more than 50,000 Graviton instances. We continued scaling that infrastructure through 2022, and come Prime Day 2023, more than 2,600 services were running on Graviton, with tens of millions of normalized Graviton instances.
So that brings us to our next iteration of Graviton, which is Graviton4, the latest processor that we just launched. We've tried to continue to push the performance envelope further, even though Graviton2 and 3 have already offered multiple benefits. We've seen a lot of customers come to us and say, hey, as I'm bringing more workloads into the cloud, I have a need for more compute performance or larger instance sizes, or I want to scale up some of those workloads on Graviton.
All of that has led us to continue to innovate on this vector and come up with our next-generation Graviton4, which you will see first powering our eighth-generation EC2 instances. Graviton4 is first coming to the R8g instance, which offers the best price performance for memory-intensive workloads such as large databases, big data analytics, and in-memory caches. Compared to the Graviton3-based R7g, the Graviton4-based R8g offers on average up to 30% higher compute performance and larger sizes, with up to 3x more vCPUs and 3x more memory than Graviton3.
These instances are in preview today, so if you want to give them a try, you can sign up for the R8g preview on the Graviton web page. With that, I would like to invite Ali to come on stage and tell us all about Graviton4.
Thank you, Sudhir. So we're going to spend the next few minutes diving deep on what Graviton4 is and some of the performance that it brings. As Sudhir mentioned, we've been iterating at a consistent cadence here: every couple of years we're introducing a new Graviton with some substantial improvements in performance. With Graviton4, we increased the number of cores in every CPU by 50%; we have 96 Neoverse V2 cores. As we looked at real workloads, we saw that their working set just wasn't fitting in the caches that we had, and so every one of those cores also has a 2 MB L2 cache, which lets you put more of your data and instructions close to the cores so they execute faster. Just like Graviton3, it's a seven-chiplet design, with more cores, 12 channels of DDR5-5600, which is the fastest in EC2, and up to 96 lanes of PCIe Gen5. Much like Graviton3, the cores are all connected together with a mesh, portions of the last-level cache are spread out across that mesh, and there are those 12 channels of DDR5. That's over 500 gigabytes per second of theoretical memory bandwidth, and the efficiencies we get there mean that you get pretty close to 500 gigabytes per second of actually achievable memory bandwidth in one single chip.
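As a rough sanity check on that figure, the theoretical peak falls straight out of the channel count and transfer rate; this little calculation assumes 12 channels of DDR5-5600 with 64-bit (8-byte) transfers per channel.

```python
# Back-of-the-envelope check of the ~500 GB/s figure quoted above.
channels = 12
transfers_per_second = 5600e6   # DDR5-5600: 5600 MT/s
bytes_per_transfer = 8          # 64-bit channel

theoretical_gb_s = channels * transfers_per_second * bytes_per_transfer / 1e9
print(f"Theoretical peak: {theoretical_gb_s:.0f} GB/s")  # ~538 GB/s
```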
Now, as Sudhir mentioned, when we started with Graviton1, a lot of it was about proving that you could have another architecture in EC2: you could run a variety of AMIs, you could configure instances the same way, security groups and everything just worked as you expected, and some scale-out workloads saw pretty good price performance there. With Graviton2, we increased those workloads substantially; we saw people running Java applications, Go applications, key-value stores, databases, and tons of other workloads. Graviton3 increased the performance of most apps by up to 25%, but we also targeted it at workloads like machine learning, where we added instructions and increased the SIMD width to make it applicable to machine learning and HPC. And with Graviton4, we're really increasing the applicability again. We've had customers come to us and say, I moved all my databases to Graviton, it's great, I currently use an 8xlarge with 32 vCPUs, and I think in the next one or two years I'm going to end up using 64 vCPUs as my business grows. But then what? You don't have an option that goes bigger. So with Graviton4, we have an option: every single socket has 50% more cores, so a 24xlarge. And we've gone one step further for the small number of workloads that need to scale up, can't be scaled out, and support coherent multi-socket: two Graviton4 CPUs can connect together and deliver a single system that has 3x more cores and 3x more DRAM than Graviton3.
When we first announced Graviton2, we talked about how we encrypted the DRAM interface on the Graviton2 processor. We carried that over to Graviton3, and with Graviton4 we're expanding it. We're not only encrypting the interface between Graviton and the DRAM, we're also encrypting the interface between the Nitro card and Graviton, and we're encrypting the coherent links if they're being used. As Sudhir mentioned earlier, we are building Nitro and we're building Graviton, and that gives us an opportunity to co-design them. So when we sat down to build the fifth generation of the Nitro card, we wanted it to be able to power this server in a couple of different modes: two non-coherent virtual systems, one coherent virtual system, two metal systems, or one metal system. And one of the reasons we do this is that if you're not using the coherence, we can turn it off and save some power.
When we developed Graviton4, we paid a lot of attention to real workloads and how they performed, and as we collaborated with our partners like Arm on the core we would use in Graviton, we wanted to understand how real workloads ran on this system. To visualize this, we used radar graphs that look like this. A single plot can give you some idea of how workloads are going to perform and what they're sensitive to. Each one of these axes corresponds to a different aspect of the chip's design, with the value corresponding to how sensitive the workload is to it. If you split a CPU in half, there's a front end: this is the part that receives instructions, predicts branches, and finds branch targets, and when it isn't working as well as it could, that shows up as front-end stalls. And then there's the back half of the CPU, which executes the instructions.
It's got the adders and the multipliers, the load and store units; it's what interacts with the L1, L2, and L3 caches, and that ultimately shows up as back-end stalls. So we can take workloads, look at them, and ask: how well does this workload work on the processor, and what is it sensitive to? The higher the value, the more sensitive it is; smaller is better here.
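If you want a rough feel for this front-end versus back-end split on your own workload, one informal approach, assuming a Linux host with `perf` installed and these counters exposed, is to look at the generic stall counters. This is only a sketch; exact counter support varies by CPU and kernel.

```python
# Sketch: collect front-end vs back-end stall ratios with `perf stat`.
import subprocess

events = "cycles,stalled-cycles-frontend,stalled-cycles-backend"
# Profile the whole system for 5 seconds; point this at your real workload instead.
cmd = ["perf", "stat", "-a", "-x", ",", "-e", events, "--", "sleep", "5"]

result = subprocess.run(cmd, capture_output=True, text=True)

counts = {}
for line in result.stderr.splitlines():
    fields = line.split(",")
    # CSV format: count,unit,event-name,...  Skip "<not supported>" rows.
    if len(fields) > 2 and fields[0].strip().isdigit():
        counts[fields[2]] = int(fields[0])

cycles = counts.get("cycles", 1)
print("front-end stall ratio:", counts.get("stalled-cycles-frontend", 0) / cycles)
print("back-end stall ratio: ", counts.get("stalled-cycles-backend", 0) / cycles)
```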
So what can these kinds of graphs teach us? Well, we can look at different workloads and ask: how sensitive is each one, and how well are we going to do with a new processor? Here I'm showing a traditional benchmark. You can see it's stressing the last-level cache but not a whole lot else; it's very back-end sensitive.
Now, when you step back and think about it, that makes a lot of sense, because when people build benchmarks, they usually find a hot kernel of code, extract it out of a real application, and then loop through that hot code many, many times to get some stable performance they can measure. By doing that, the benchmark ends up sensitive to not that many things.
When we look at real workloads, we see a completely different set of graphs. Here I'm showing Cassandra, Groovy, and Nginx - some common workloads we find in the cloud - and when we compare them, you can see they're bottlenecked by an entirely different set of things: more by the front end and various other portions of the metrics. For example, the branch predictor is missing more, there are more instruction misses in the L1 and L2, and there are more TLB misses.
And so we thought, well, how can we solve some of these problems? One thing we did was double the L2 cache size, which means more instructions are closer to the cores. And for the core, we've chosen Neoverse V2. It's better at predicting branches, it's got larger BTBs, and it can fetch more instructions into the front end and execute faster. Graviton4 also has a number of architecture improvements: it supports Armv9, it supports SVE2, and it has new control-flow integrity instructions like BTI.
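If you're curious whether the instance you're on exposes these features, one quick, informal check on an Arm Linux system is to read the kernel's reported feature flags. Treat this as a sketch rather than a definitive capability test; the flag names follow the usual hwcap strings.

```python
# Sketch: check which Arm features the kernel reports on this instance.
features = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("Features"):
            features = set(line.split(":", 1)[1].split())
            break

for feat in ("sve", "sve2", "bf16", "bti"):
    print(f"{feat}: {'yes' if feat in features else 'no'}")
```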
So with all of that, how does it perform? For the next couple of slides, I'm going to compare workloads on Graviton 3 versus Graviton 4 - an r7g versus an r8g. In all these cases, I'm using the same number of vCPUs, so it's a like-for-like system except that I've taken out the Graviton 3 system and put a Graviton 4 in its place.
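Setting up that kind of like-for-like comparison yourself mostly means keeping everything else identical: same size, same AMI, same subnet, and only the instance family changes. Here's a hypothetical sketch using boto3; the AMI, subnet, and key name are placeholders, and R8g requires preview access at the time of this talk.

```python
# Sketch: launch a Graviton3 and a Graviton4 instance that differ only in family.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch(instance_type):
    return ec2.run_instances(
        ImageId="ami-xxxxxxxx",       # same arm64 AMI for both (placeholder)
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        SubnetId="subnet-xxxxxxxx",   # same subnet / AZ (placeholder)
        KeyName="benchmark-key",      # placeholder key pair
    )["Instances"][0]["InstanceId"]

old = launch("r7g.4xlarge")   # Graviton3
new = launch("r8g.4xlarge")   # Graviton4 (preview)
print(old, new)
```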
In this case, we're running MySQL and using HammerDB as an open source load generator. HammerDB is meant to mimic a company that has inventory, keeps stock in warehouses, sells items, processes payments, invoices, delivers orders, things like that. It's typically measured in new orders per minute, or how many new orders can the database sustain.
Comparing a Graviton 3 to a Graviton 4, here we see a 40% increase in performance. So a really substantial increase in a single generation.
Similarly, here we're doing load balancing with Nginx. Nginx is a web server, but it can also be configured as a load balancer. We're keeping the system otherwise the same: we've got a load generator running work, we've got a number of backend web servers that we're load balancing to, and those are static as well. We've just swapped in the Graviton 3 or the Graviton 4 system with the same number of vCPUs, and you can see a 30% increase in performance here - a really big number.
Here we're showing a Grails application. Grails is an open source web application framework, and it runs on Groovy, which is a JVM language. We found these to be more representative of the performance you get from Java workloads than traditional Java benchmarks. Here too, with the same number of vCPUs, we see a 45% increase in performance going from Graviton 3 to Graviton 4.
And lastly, I have Redis. Redis is a popular key-value store, key-value stores let you look up data much faster than you can from a traditional database. So it improves interactivity. And here we have three load generators, two are generating load and one's just measuring latency as we do it. And comparing again, an r7g to an r8g, we see a 25% improvement in performance.
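For context on how that latency measurement works, here's a minimal sketch of the kind of client-side probe the measuring load generator might run against Redis. The host and key names are placeholders, and a real test would use a dedicated benchmark tool rather than this simple loop.

```python
# Sketch: simple GET round trips against Redis, measuring client-side latency.
import time
import redis

r = redis.Redis(host="redis-under-test", port=6379)
r.set("probe-key", "x" * 100)

samples = []
for _ in range(10_000):
    start = time.perf_counter()
    r.get("probe-key")
    samples.append(time.perf_counter() - start)

samples.sort()
print("p50 (us):", samples[len(samples) // 2] * 1e6)
print("p99 (us):", samples[int(len(samples) * 0.99)] * 1e6)
```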
So we've got 30%, 40%, 25% - really big numbers going from generation to generation. In fact, if you put those together and look from the A1 instance that Sudhir talked about at the top of this talk back in 2018, through Graviton 2 in 2019, Graviton 3 in 2021, and today Graviton 4 in 2023, on different workloads you see almost a 4x, or even more than a 4x, increase in performance across those four generations in those five years.
So with that, we have our eighth generation of EC2 instances. They're powered by Graviton 4, they have up to 3x larger instance sizes, they've got DDR5-5600, and they deliver the best price performance in EC2. And there are also the substantial energy savings that Sudhir mentioned.
Now, I've told you what we found by measuring workloads internally in AWS. But I always find what customers have to say about our products is a lot more interesting than what we have to say. And we've had a few customers look at R8g over the last few days and give us some feedback.
There's Datadog - Datadog is an observability and security platform. They run tens of thousands of nodes in AWS, and already half of those run on Graviton today. What they found when they tested R8g was that it was seamless to switch and it gave them an immediate performance boost.
Epic Games is the maker of Fortnite. When they tested R8g, they said it was the fastest EC2 instance they've ever tested.
And lastly Honeycomb - Honeycomb is another observability platform that enables engineering teams to find and solve problems they couldn't before. With their testing of a Go-based workload for OpenTelemetry ingestion, they saw 25% better performance, 20% lower median latency, and 10% improved p99 latency. Those are all pretty big numbers.
Now, to talk to us about the journey of moving to Graviton, we've got Ambud from Pinterest.
So, a quick recap of Pinterest - it's a visual inspiration platform for discovering and shopping the best ideas in the world. And a quick overview of our infrastructure - we have over 300 services that are built using open source technologies and a shared services framework, and that run on tens of thousands of instances.
We are an exabyte-scale platform on Amazon S3. Why we cared about Graviton, and why it caught our attention, comes down to two things: one, price performance, and two, energy footprint. In both areas we saw an opportunity that we could leverage to deliver value to Pinterest as well as our customers.
But given that we have 300+ services, it's important to look at how we methodically evaluate them and how we prioritize them through an architectural change like this.
So what we did was take an ROI-driven approach, which looks not just at the cost of the service but also at how much effort it is to evaluate it on Graviton and what the effort would be to migrate it to Graviton.
We also did root cause analysis for compatibility issues. Some of these things were known up front - for example, JDK changes, and recompiling native libraries for the Arm architecture - as well as things that we discovered at runtime, where we would go in, convert that into a unit test, and put it in the pipeline so that we could repeatably check these compatibility issues in the future.
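As a hypothetical illustration of that kind of check, assuming a Python service with a few native-extension dependencies, a test like this would fail in CI on arm64 if any of them only ships x86_64 builds. The module names are placeholders, not Pinterest's actual dependency list.

```python
# Sketch: compatibility unit test that catches arm64 packaging regressions in CI.
import importlib
import platform

import pytest

NATIVE_DEPS = ["lz4", "zstandard", "cffi"]  # placeholder native-extension packages

@pytest.mark.skipif(platform.machine() != "aarch64", reason="arm64-only check")
@pytest.mark.parametrize("module", NATIVE_DEPS)
def test_native_dependency_imports_on_arm64(module):
    # Fails if a dependency lacks arm64 wheels or was built without arm64 support.
    assert importlib.import_module(module) is not None
```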
How did we do the testing? We ran A/B tests. As you can see in the pipeline, we have ARM builds and x86 builds running in canary and prod environments, and then we compare the KPIs, including resource utilization at the hardware level. We also look at the service SLOs - are we seeing any degradations or any unintended consequences on our services?
And finally, we established decision criteria: what happens when we see better performance? What happens when we see equal performance? Do we have a go/no-go decision, and who makes that call? In a lot of cases, that was the service owners themselves.
And lastly, we also had to do profiling and optimization as we went. An important point I'll make here is that this is an opportunity to right size your workloads and your hardware. As you find optimizations, and as hardware matures generation over generation, the shape of the instances you're using potentially changes. So it's an important opportunity to do these tests in a methodical way and have the validations in place.
And similarly, what was shown earlier, we also use radar charts for resource utilization and visually comparing them. And it quickly shows us how well an instance fits into a given workload.
Now let's go through individual services and our tests specifically. I'll spend a little bit more time on this slide to explain the architecture diagram we have here. It's a quick overview of Pinterest. Obviously we are running 300+ services, and if we started drawing the call graphs of everything, you can imagine it would be very, very unreadable for the audience. So we have a very simplified version. We're going to traverse the stack bottom up, and for each service we'll see what the KPIs and scale are, as well as what the results were.
Memcached - a very popular open source distributed caching platform, used very, very widely at Pinterest. Profile metadata, and really any kind of metadata and information that is delivered to our apps, is cached in Memcached. It's interfaced to the clients via the Moxi router, which is another open source project. For scale: around 200 million requests per second - often much higher than that - about 200 gigabytes per second, and over 500 terabytes of data in the cache. As you can imagine, this is a memory bound service, so more memory is better.
The KPIs here are latency, QPS (which is throughput), and the hit rate - making sure that you're able to find the data in your cache. For this case we evaluated R5, which was our baseline, then R6g, and also R7g as it rolled out while we were rolling out R6g with Moxi. You can see the performance numbers here.
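For a sense of where the hit-rate KPI comes from, here's a small sketch of pulling it straight from memcached's own counters using the pymemcache client. The host is a placeholder, and in practice these numbers come from the fleet's metrics pipeline rather than an ad hoc script.

```python
# Sketch: compute the cache hit rate from memcached's stats counters.
from pymemcache.client.base import Client

client = Client(("memcached-under-test", 11211))
stats = client.stats()

hits = int(stats[b"get_hits"])
misses = int(stats[b"get_misses"])
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"hit rate: {hit_rate:.2%}")
```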
I'll quickly describe how to read these charts. When you see bars that are equal, it means those are constants - things we keep constant so the comparison is fair. The variables are the bars that move. So you're trying to figure out, for constant latency and constant memory, what is the CPU utilization and what is the cost? Because that's how you do an apples-to-apples comparison.
In this case, for constant latency and constant memory, these are what the numbers look like, and our result was a 30% reduction in cost for this workload. We originally picked R6g, and then moved to R7g over time as the new machines launched.
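The apples-to-apples math itself is simple; here's a toy illustration of comparing cost per unit of delivered throughput once latency and memory are held constant. All the numbers are made up for illustration, not Pinterest's or AWS's actual figures.

```python
# Toy illustration: hold latency and memory constant, compare cost per throughput.
baseline  = {"hourly_cost": 1.00, "qps": 100_000}   # e.g. the x86 baseline fleet (made-up)
candidate = {"hourly_cost": 0.85, "qps": 115_000}   # e.g. the Graviton fleet (made-up)

def cost_per_million_requests(fleet):
    return fleet["hourly_cost"] / (fleet["qps"] * 3600) * 1_000_000

savings = 1 - cost_per_million_requests(candidate) / cost_per_million_requests(baseline)
print(f"cost per million requests drops by {savings:.0%}")
```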
Let's talk about another workload, again fairly popular and used pretty much everywhere, which is a time-series database. Ours is called PinDB.
It powers our internal application performance monitoring, and it's built on top of Meta's Beringei and Gorilla. For scale: over a billion time series stored, and 55 million data points ingested per second. Once again, a very memory bound service, because we want to deliver charts and graphs to our engineers as quickly as possible.
The KPIs here are throughput - data being ingested, where we should be able to ingest as much as possible - and read latency.
What we did here was evaluate R5d, R6gd, R7gd, and X2gd. When we considered these instances, we first did a theoretical evaluation: what's the price performance based just on our memory requirements? And then we pruned that list down to what we would actually test.
So we tested R5d, R7gd, and X2gd. With R7gd and X2gd, you can clearly see there's a 2x difference in memory, so we pruned R7gd out and stuck with X2gd.
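Here's a rough sketch of that theoretical screening step for a memory-bound service: rank candidate instance types by memory per dollar before spending effort on real tests. The prices and sizes below are illustrative placeholders, not current AWS pricing.

```python
# Sketch: screen candidates for a memory-bound service by memory per dollar.
candidates = {
    #                  (GiB of RAM, $/hour) - placeholder numbers
    "r5d.4xlarge":  (128, 1.15),
    "r7gd.4xlarge": (128, 1.09),
    "x2gd.2xlarge": (128, 0.67),
}

ranked = sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)
for name, (mem_gib, price) in ranked:
    print(f"{name}: {mem_gib / price:.0f} GiB per dollar-hour")
```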
There was also an opportunity for us to do right sizing here, because as you can see, instead of a 4xlarge we can actually run the entire workload on a 2xlarge, which is a dramatic reduction in cost and an improvement in net resource utilization.
So as you can see, with half the number of cores the CPU utilization doubled and the cost dropped, for constant memory and constant latency. In this case, we observed about a 40% reduction in cost.
Let's talk about Java microservices - this is a subset of our microservices that we're testing here, specifically our core microservices, which serve things like Pins, Boards, the Follow graph, and metadata to our app. They're built using our shared services framework, comprising Finagle, Thrift, Netty, JSON, the JDK, and Java itself, which are again very common in the industry.
The scale is over 5 million events/requests per second. It's a CPU and memory bound service. These are written in Java, so there's a heap, and you want to compare, with a constant heap, what the performance is, what the GC is doing, and how frequently it has to run.
The KPIs are latency and success rate. Because these are core services, we don't want request failures - that's our top priority, because failures start impacting service availability for people. Latency is secondary here.
We evaluated C5, C6i, and C7g, and we picked C7g because we saw a 15% reduction in cost compared to the baseline of C5.
Another one of the workloads is Python microservices - this is actually our core API. It serves GraphQL and REST APIs at over a million requests per second, and it's a CPU and memory bound application.
The KPIs here are success rate and latency once again. We evaluated C5 and C7g, and one nuance you'll notice is that I don't mention sizes, because we did some very complex right sizing here and it's difficult to depict constant memory, constant latency, and everything else.
So we're deliberately not showing that information, but we saw about 25% better price performance here.
Key takeaways:
- We're able to run Graviton at scale in production at Pinterest.
- We see wins across various workload types - microservices, data systems. Specifically, we see a clear price performance advantage in memory bound and task bound workloads. By task bound, I mean where you need the core - not a vCPU, not a hyperthread - you actually need a full core to do the work. We see a clear price performance difference there.
- And the last thing is that what might help you in your Graviton journey is to clearly create a scientific methodology, have the guardrails in place so that you're able to do scientific evaluation. Figure out what works, what doesn't work, what are the opportunities for your teams and then make a decision.
And lastly, I also want to thank all our teams at Pinterest who have been an integral part of doing this Graviton evaluation and migration.
We have one more thing for you, which is a Graviton4 performance preview from Pinterest. These are initial results, so big disclaimer here that these are just initial performance tests. These are the two services that I mentioned earlier, our Java and Python microservices.
What we did was run them on R7g and R8g instances just to see what the CPU would do. So this is purely a CPU test, and we saw a 30% reduction in CPU usage here.
So with that, I think we get to the summary:
- For best price performance for a variety of workloads in EC2, Graviton is going to be a great starting point.
- The newest Graviton 4 powered instances are now in preview. Anyone interested in signing up, the link is on the Graviton webpage - it's a pretty lightweight, easy form you can fill out and we can enable access.
I want to thank you all for taking time out and attending this session. We really appreciate it. Please do fill out the survey in the mobile app when you get a minute and hope you have a great rest of re:Invent. Thank you!