My name is Graham McAllister. I'm a Senior Principal Engineer at Amazon Web Services. I wear shorts; it's kind of one of my defining characteristics.
Now, I work on Aurora, and today we're going to deep dive into Aurora. The reason we're going to do that is I'd like to explain how Aurora works, so you can understand how it can benefit the way you run your databases and your applications. We're also going to dive into the innovations we've done over the last couple of years with Aurora, the new features; some of them you may have heard in other announcements, but we'll get into those.
So what is Aurora? I'm not going to go through this whole slide; it's a lot of things, and you might have seen it before. But the thing to understand about Aurora is that it's a purpose-built, cloud-native database for MySQL and PostgreSQL. That gives us a lot of advantages because it's built for the cloud, and we're really going to get into the architecture of how that works.
So let's go right into the architecture.
So the first thing you'll notice about Aurora is that its storage is quite a bit different than most databases. That big blue box is what I think of as an Aurora volume, and it's spread across three availability zones, right? So you have durability by default across the availability zones. Those yellow boxes are storage servers. Now, I'm showing nine, but in production there are thousands, right? It's a multi-tenant system that we built to support this.
So when you start an Aurora cluster, you get that volume and then you can add one instance. The first instance you get is the read/write instance. That's where you're going to connect to; that's what your application is going to go to. When it writes, it writes log records, not blocks. This is one of the fundamental differences, and you'll notice that I have six circles there because we write to six locations, two in each availability zone. This is a four-of-six quorum, so we need four of those to complete before we'll acknowledge the write. And this is to provide that durability, so that even in the failure of an AZ, we're still fine, right? We can continue, and we're actually durable even if another storage node fails.
But we read blocks just like a regular PostgreSQL or MySQL database, so there's no difference there, and we only have to do one read. We don't have to do a quorum read because our system actually understands what the sequence numbers of the writes are. So it can go to a copy that's up to date, and it will always try to go to the closest one to give you the fastest read performance.
Now, let's say one of those segments, and each of those is 10 gig, missed a write because it got lost in the network. Well, that's bad, right? But we have peer-to-peer replication, so we'll just get that from one of the other copies. And this all happens behind the scenes; nothing has to be done. And if a whole storage node has a problem, we'll just make a new copy of that segment on a new storage node. So we do this detection and repair all automatically. We also do heat management this way. And the nice thing is, because it's per segment, we can do this at scale, right? It doesn't matter how big your volume is.
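Just to make the quorum idea concrete, here's a tiny Python sketch of the write and read paths. It's purely illustrative, not how the storage fleet actually works: it just shows the four-of-six write acknowledgment and the single read from an up-to-date copy.

```python
import random

WRITE_QUORUM = 4          # acks needed out of six copies
NUM_COPIES = 6            # two per AZ, three AZs

class Segment:
    """Toy stand-in for one 10 GB protection-group copy."""
    def __init__(self):
        self.highest_lsn = 0

    def apply(self, lsn, ok=True):
        if ok:
            self.highest_lsn = lsn
        return ok

def quorum_write(copies, lsn):
    # A write is acknowledged once any four of the six copies are durable;
    # stragglers catch up later via peer-to-peer gossip.
    acks = sum(c.apply(lsn, ok=random.random() > 0.05) for c in copies)
    return acks >= WRITE_QUORUM

def read(copies, lsn):
    # No read quorum: the client tracks the LSN it needs and reads one
    # (ideally the closest) copy that is already at or past it.
    up_to_date = [c for c in copies if c.highest_lsn >= lsn]
    return up_to_date[0] if up_to_date else None

copies = [Segment() for _ in range(NUM_COPIES)]
print("write ok:", quorum_write(copies, lsn=101))
print("read from:", read(copies, lsn=101))
```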
The other major difference is if you want a replica, some place to read your data. Typically with MySQL or PostgreSQL, you'd have to spin up a whole other copy. But with Aurora, you're just attaching to that shared storage. Now, that's great from a read perspective, but how would your database know what's changing underneath it? So we forward messages from the read/write node to the read-only nodes that either update or invalidate the blocks in memory to keep them up to date, and also about what's going on with transactions. It's asynchronous, so it's a little bit behind, but typically it's pretty good for a lot of read-only applications, and you can have up to 15 of these read-only nodes in multiple availability zones, and they can be of different types.
So on one side I'm showing a Graviton instance, in the middle an Intel instance, and on the far side a Serverless instance. You can mix and match sizes and types and families for whatever you need. And this storage grows and shrinks on demand. You don't have to do anything, you don't have to think about how big your storage is, and you only pay for what you use. So if you go up to 10 terabytes, but then you drop a bunch of stuff and you go back to five, you'll be paying for five, right? So this is great flexibility.
Now, if we have a failure in one of the availability zones or just one of the instances, we'll fail you over to one of your read only nodes. And at that point, your application has to figure out how to connect to that. Now, you're connecting through an endpoint which is a DNS. So you're going to go to Route 53 and you're going to get that new DNS record, right? That works pretty well. But it takes about 20 to 30 seconds for that propagation through the DNS system.
So to help that out, we actually wrote a wrapper for JDBC that helps make that quicker. It knows the topology of the cluster, so when a failover happens, it doesn't have to wait for DNS, and this reduces the failover time by about 66%. So this is a nice thing to implement if you really want quick failover. This basic foundation allows us to do really neat features like global database.
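The actual wrapper keeps the cluster topology in memory inside the driver. As a rough illustration of the same idea in Python, you could ask the RDS API for the current writer and connect straight to its instance endpoint instead of waiting on the cluster endpoint's DNS record; cluster and instance names here are made up.

```python
import boto3

rds = boto3.client("rds")

def current_writer_endpoint(cluster_id):
    """Discover the writer from the cluster topology instead of waiting
    for the cluster endpoint's DNS record to propagate after a failover."""
    cluster = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    writer_id = next(m["DBInstanceIdentifier"]
                     for m in cluster["DBClusterMembers"] if m["IsClusterWriter"])
    instance = rds.describe_db_instances(DBInstanceIdentifier=writer_id)["DBInstances"][0]
    return instance["Endpoint"]["Address"], instance["Endpoint"]["Port"]

host, port = current_writer_endpoint("my-aurora-cluster")
print(f"reconnect to {host}:{port}")
```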
So here I'm showing in region A the exact same setup we had. And if I want to have global resilience, I can just now enable global database. And what you're going to get there is you're gonna get another Aurora volume in a different region of your choice and you get these replication servers and agents, but that's behind the scenes, you don't have to manage them. That's all just stuff that we do, right?
So the nice thing about the segment based architecture is it allows us to do all of this work in parallel. So instead of a single replication stream like traditional MySQL or PostgreSQL, you now have a lot of parallelism so we can run a lot of throughput through this system.
So when your write comes in, it's going to get forwarded out to all the normal places it does in the local region plus to the replication server, which is going to send it across to the agent and then from the agent, it's going to go to storage. And at that point, you're good from a DR perspective, right? So your boss is happy, you know, you've got global resilience, but you can also add read only nodes. So you can do reads in that local region.
And the major difference here now is that the replication agent is also doing those asynchronous invalidations to the new RO nodes, right? So exactly the same as the in region ones. And of course, you can have multiple RO nodes and you can have your application connect to them and do read only operations. So this is very handy, we also do storage to storage work because if we need to do any repair operations, we can just do that directly from storage to storage. And this again just cuts down on the amount of work that has to happen for the system.
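If you're curious roughly what enabling this looks like through the API, here's a hedged boto3 sketch: promote an existing regional cluster into a global cluster, then add a secondary cluster in another region. All the identifiers are hypothetical, and check the docs for the full set of required parameters.

```python
import boto3

# Promote the existing regional cluster into a global cluster.
rds_primary = boto3.client("rds", region_name="us-east-1")
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="my-global-db",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
)

# Add a secondary cluster in another region; the replication servers and
# agents shown on the slide are created behind the scenes.
rds_secondary = boto3.client("rds", region_name="eu-west-1")
rds_secondary.create_db_cluster(
    DBClusterIdentifier="my-aurora-cluster-eu",
    Engine="aurora-postgresql",
    EngineVersion="15.4",
    GlobalClusterIdentifier="my-global-db",
)
```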
Now, this is all great, but you need to test it, right? If it's for global resilience, you need to be able to prove to your regulators that you're actually able to do this. So that's why we have the switchover command. Now, we used to call this managed failover; that caused a lot of confusion, and I didn't actually like that name either, so we've changed it to switchover. The command now is switchover-global-cluster. When you run this, the very first thing we do is make region A read-only. So we stop taking writes there and we let the replication drain out of the system, and then we verify that the volumes match. At that point, everything is symmetrical and correct, and then we can promote region B and you're done, right? Except for what's happening with region A.
Well, we want to get back to our same configuration. So automatically we'll rebuild the replication flowing in the opposite direction, and we don't need to do any verification or anything because these volumes match; we know that. We have customers who do this to test monthly: they do the switchover to a region, run their stuff, and switch back. We have customers who do this just to move regions, which is really handy if a new region we've launched is more centrally located for you; you can move your stuff to the new region. And we actually have a customer who rotates this three times a day to follow the sun for read latency. So that's kind of a cool application of this.
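Roughly what that looks like with boto3; the identifiers are made up, and the call kicks off an asynchronous workflow, so in practice you'd poll the global cluster status afterwards.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Planned, zero-data-loss role swap: region B's cluster becomes the writer
# and replication is automatically rebuilt in the opposite direction.
rds.switchover_global_cluster(
    GlobalClusterIdentifier="my-global-db",
    TargetDbClusterIdentifier="arn:aws:rds:eu-west-1:123456789012:cluster:my-aurora-cluster-eu",
)
```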
But of course, the main reason you have global database is in case of an issue, right? So that's why we have failover. If there's a problem with that region, you can fail over to region B. This command is failover-global-cluster, but it's got an extra flag called allow data loss. And this is because this is asynchronous replication, and there will probably be some small amount of data in flight that will not make it over to region B, so you have to acknowledge that there's going to be a difference. Once you do that, this takes about one to two minutes and you're failed over, and that's your RTO time, right? So this is very quick to get up and running during an issue.
Now, what happens with the original region? As soon as that region is contactable from a network perspective, we're going to make it read-only. So no more writes happen in that region, which is very important so you don't get into a network partition case. Then we're going to re-establish the replication. But the first thing we do is take a snapshot of region A and make that available to you, so you can go look and see if there were any data differences that didn't get captured during failover. Then we do the synchronization between the two regions and set up replication again. So you're back going, and this all happens automatically; you just had to run that one failover command, and outside of looking at the snapshot to see what's different, it's all automatic.
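The unplanned path is a separate call, and the allow-data-loss flag is your explicit acknowledgment that some in-flight, asynchronously replicated data may not have made it to region B. Again, a hedged sketch with made-up names:

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")

# Unplanned failover during a regional issue: promote region B now and accept
# that a small tail of un-replicated changes may be lost (RPO > 0).
rds.failover_global_cluster(
    GlobalClusterIdentifier="my-global-db",
    TargetDbClusterIdentifier="arn:aws:rds:eu-west-1:123456789012:cluster:my-aurora-cluster-eu",
    AllowDataLoss=True,
)
```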
Now, what I've shown so far is just two regions, and that's what I'm showing here. But you can have up to five destination regions, and they don't have to be the same. So region C could be using serverless nodes and region D could just be storage only, right? You can mix and match the configurations. We have a lot of customers who use multi-region to have local copies closer to their customers, to get better read latency for their application. So that's a common use case for global database.
But doing that in your application is kind of tough, right? What we're talking about is changing your application and separating out the read and write portions of it. And if you have a mostly-read application with a little bit of writes, well, if you wanted to try to run that in region B, that's not going to work, because the database is read-only; it's going to say no, it's going to fail. So you could build yourself a VPN tunnel and, as part of your application, send those writes over to the application in the other region, but that's kind of painful, right?
So based on customer feedback, we added a feature called write forwarding. You enable global write forwarding on an instance, and now when you do a write, that write gets forwarded over to the read/write node in region A, and it looks just like the write came from another client; it just happens to be another one of the databases. Then it's going to flow back, of course, to all of the places that a normal write goes, including yourself. And the basic mode here is that you get read-after-write consistency: there's a session-level wait, so your select will wait for that write to get back to you, and you don't have any weirdness with your semantics. That works quite well.
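Here's a hedged sketch of what that looks like from a client connected to a reader in the secondary region, in the Aurora MySQL flavor. The session consistency variable name is from memory, so treat it as an assumption and verify it against the docs; the endpoint, credentials, and table are all placeholders.

```python
import pymysql

# Connect to a read-only instance in the secondary region that has global
# write forwarding enabled (endpoint and credentials are placeholders).
conn = pymysql.connect(
    host="my-aurora-cluster-eu.cluster-ro-xyz.eu-west-1.rds.amazonaws.com",
    user="app", password="***", database="shop", autocommit=True)

with conn.cursor() as cur:
    # As I recall, a read-consistency level has to be chosen per session before
    # forwarded writes are accepted; SESSION gives read-after-write semantics
    # for this connection. Treat the variable name as an assumption.
    cur.execute("SET aurora_replica_read_consistency = 'SESSION'")

    # This write is forwarded to the read/write node in region A ...
    cur.execute("UPDATE orders SET status = 'shipped' WHERE id = 42")

    # ... and this select waits until that change has replicated back here,
    # so the session sees its own write.
    cur.execute("SELECT status FROM orders WHERE id = 42")
    print(cur.fetchone())
```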
So let's dive a little deeper into the storage internals, because that's some of the special sauce in Aurora.
One of the keys is that we do a lot less work in the head node when it comes to making changes to the database. I'm showing Aurora PostgreSQL here, but this is applicable to both Aurora MySQL and Aurora PostgreSQL; the technique is slightly different, but it's pretty much the same.
So when you go to do an update in memory in PostgreSQL on the left, you get a new tuple, a new row, and that's going to get sent to the WAL stream. But you'll see that we've got a full copy of the block as well on the WAL stream. PostgreSQL does this for crash recovery reasons, and I'll show you that. But if you do another update, it doesn't send the block, and this is because it only does a full page write, or full page image, the first time you touch a block after a checkpoint. Since we only did one checkpoint and then these two updates, we only get one full block image, right?
So at some point later, we do the checkpoint. This is work we have to do to write all the changed data blocks out to the data files. We also have to archive the WAL out, and we have to back all this up to S3. Now, that checkpoint is writing 8K blocks on PostgreSQL and 16K on MySQL, but the operating system is doing things in 4K chunks. If the system crashes in the middle of that checkpoint, you're going to get what's called a torn or split page, and this would be silent corruption, and that would be really bad.
So PostgreSQL has this great feature where it'll take that full page image from the WAL and put it back where that block in the data file was, and you no longer have any corruption. So that's great, it works wonderfully, but it's a lot of extra action you can see happening. And of course, because we're generating WAL, when we crash, we have to recover, and that takes minutes, many minutes in some cases.
So let's look at what Aurora does instead. We do that same update on Aurora, and the only thing going to storage is the log record, the change. Whether it's the first tuple, the second tuple, or the third tuple doesn't matter; it's always the same.
So we don't actually do any checkpointing and no full page writes. So a lot less work, right, we continuously and in parallel coalesce those log writes into blocks. And I'll show you more detail in the next slide. And the key to that is the recovery is now happening in seconds instead of minutes.
So we're doing less work and we have better recovery time. On the storage node, we have a lot of different components, and we'll walk through each of them. We have the read/write node that's talking to storage, and it will send a change down; call this change A. It comes into the incoming queue. That's an in-memory queue, so we don't acknowledge it yet because it's not durable. We have to wait for it to get to the hot log; that's the durable portion.
Then we acknowledge it to the read/write node. So once that's there, let's say another change flows through into the incoming queue and into the hot log exactly the same way. But we're keeping track of where we are from our sequence numbers, and we notice something's missing. So we ask our peer node, where is that? It says, here it is, and it delivers B to us, and now we have the entire sequence of the changes that have happened.
We can then merge and coalesce that into a block, and those blocks look exactly the same as a PostgreSQL block. So when we do a read, we're just reading a PostgreSQL or MySQL block back to the regular system. It's really only the internals of the storage system that are different here, right?
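As a purely conceptual sketch, nothing like the real storage node code, here's that flow in Python: records land in an in-memory incoming queue, become durable and acknowledgeable in the hot log, gaps get filled from a peer, and contiguous records are then coalesced into the block image.

```python
class StorageNodeSketch:
    """Toy model of one storage node's log path."""

    def __init__(self):
        self.incoming = []        # in-memory, not yet durable, not yet acked
        self.hot_log = {}         # durable log records keyed by LSN; acked from here
        self.block = {}           # coalesced block contents: {row_id: value}
        self.applied_lsn = 0

    def receive(self, lsn, change):
        self.incoming.append((lsn, change))

    def make_durable(self):
        acks = []
        for lsn, change in self.incoming:
            self.hot_log[lsn] = change
            acks.append(lsn)      # acknowledge to the read/write node
        self.incoming.clear()
        return acks

    def fill_gaps_from_peer(self, peer):
        # Gossip: ask a peer for any LSN we noticed is missing.
        for lsn in range(self.applied_lsn + 1, max(self.hot_log, default=0)):
            if lsn not in self.hot_log and lsn in peer.hot_log:
                self.hot_log[lsn] = peer.hot_log[lsn]

    def coalesce(self):
        # Apply contiguous records, in order, into the block image.
        while self.applied_lsn + 1 in self.hot_log:
            self.applied_lsn += 1
            row, value = self.hot_log[self.applied_lsn]
            self.block[row] = value

node, peer = StorageNodeSketch(), StorageNodeSketch()
peer.hot_log[2] = ("row1", "B")               # peer got the record we missed
node.receive(1, ("row1", "A")); node.receive(3, ("row2", "C"))
node.make_durable()
node.fill_gaps_from_peer(peer)
node.coalesce()
print(node.block, "applied through LSN", node.applied_lsn)
```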
But this also allows us to back up those log records and the blocks from storage, not through the head node, in an intelligent way. And this enables nice features like our fast clones feature. Let's say you had a reporting application that you wanted to run, but it needs some materialized views, it needs some new indexes. You might restore a snapshot. But let's say this is a 10 terabyte database: you'd have to restore 10 terabytes, and you're paying for 10 terabytes for a couple of hours while you do this work.
Instead, what you can ask for is a clone, and you're going to clone the storage so that you can attach an instance to it. That instance is essentially looking at the clone storage, but to start with, it's all pointers to the actual primary storage. When you do a read, the pointer redirects you to the original storage, and it happens just like that. A write, on the other hand, if it touches a block that's shared, now has to do a copy-on-write and make your own copy, because it's different.
These are not replicated; they're separated at the time you do the clone. If you do your own write, it goes just to your volume; nothing touches the primary volume. If the original database is doing writes, again, that's just its own volume, and if it touches one of the shared blocks, again, we do the copy-on-write. And so in this case, you're only paying for three blocks' worth of changes, right?
So this dramatically changes things if you're doing, let's say, testing of a very large database; this is a great way to do it cost effectively. And because we do all this stuff in parallel in storage, it doesn't affect your performance. You can have a number of clones on your primary database without any effect. So it's very nice.
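Through the API, a clone is just a point-in-time restore with a copy-on-write restore type. Here's a hedged boto3 sketch, with placeholder identifiers and instance class, that clones a cluster and attaches an instance for the reporting work.

```python
import boto3

rds = boto3.client("rds")

# Create a copy-on-write clone of the source cluster's storage.
rds.restore_db_cluster_to_point_in_time(
    DBClusterIdentifier="reporting-clone",
    SourceDBClusterIdentifier="my-aurora-cluster",
    RestoreType="copy-on-write",
    UseLatestRestorableTime=True,
)

# Attach an instance to the clone so you can build indexes and
# materialized views without touching the primary's storage.
rds.create_db_instance(
    DBInstanceIdentifier="reporting-clone-1",
    DBClusterIdentifier="reporting-clone",
    DBInstanceClass="db.r6g.2xlarge",
    Engine="aurora-postgresql",
)
```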
So again, building features upon features. Now we're going to talk about a feature we built on top of clone. We have an export to S3 feature, and the original one works from snapshots. So when we have storage, we make a snapshot, then we have to restore that snapshot to a new volume, we have to attach an instance to it, and then we do an export. This worked perfectly fine, but as you can imagine, it takes a while to take a snapshot and restore a snapshot and do all this work.
Now with exporting via clone, what we can do, we can just do a clone very quick, set up an instance on it and then we can export it. And with Aurora MySQL, we actually have parallel export now that allows us to go even faster in parallel from that system.
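The user-facing call is still an export task; the clone plumbing happens behind the scenes. A hedged boto3 sketch, with placeholder bucket, role, and key names:

```python
import boto3

rds = boto3.client("rds")

rds.start_export_task(
    ExportTaskIdentifier="orders-export-2023-11",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
    S3BucketName="my-analytics-bucket",
    S3Prefix="aurora-exports/",
    IamRoleArn="arn:aws:iam::123456789012:role/aurora-s3-export",
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
    ExportOnly=["shop.orders"],   # optionally limit to specific tables
)
```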
So, speaking of MySQL, what are the changes that have happened recently? We now have support for 8.0.32, and on our 5.7 version we're at 5.7.12, so we're up to date on the minor versions. We added support for the latest Graviton, the R7g; I should say the almost-latest, now that they've announced the R8g. We added support for Percona XtraBackup, and local write forwarding.
Local write forwarding is a really keen feature. It works exactly the same as the global write forwarding I just talked about, but locally in your cluster: your read-only nodes can now forward writes to the read/write node without you having to do read/write splitting. So this is very handy. Parallel export to S3, as we just talked about, is something new, and enhanced binlog is another feature; let's dig into that.
So Aurora by default doesn't need the binlog, because that's not how we do crash recovery; we have that built into our system. But the binlog is still used for CDC replication, so if you want a logical stream out of your database, turning this on used to require a lot of extra work, because in classic MySQL you're writing to InnoDB and you're also writing to the binlog, and they are two separate things.
So when you say commit, you actually have to gather all these things together and you have to do a two phase commit. So that's a lot of extra work and extra time to do that. So it adds overhead. So to get around this, we built purpose built storage for the bin log so that we can actually run these in parallel as the workload is going. And this means a lot less overhead.
We've got it down to about, I think, 5 to 7% overhead to have the binlog on, and it shortens crash recovery time because we no longer have to do crash recovery on the binlog if you had logical replication going. So it's a very nice feature, and we'll see how that's used later on.
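Turning enhanced binlog on is a cluster parameter group change rather than new syntax. The parameter names below are how I remember them from the docs, so treat them as assumptions and verify before using:

```python
import boto3

rds = boto3.client("rds")

# Parameter names as I recall them from the enhanced binlog documentation;
# verify against the current docs before relying on them.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-mysql-params",
    Parameters=[
        {"ParameterName": "binlog_format", "ParameterValue": "ROW",
         "ApplyMethod": "pending-reboot"},
        {"ParameterName": "aurora_enhanced_binlog", "ParameterValue": "1",
         "ApplyMethod": "pending-reboot"},
        # Enhanced binlog reportedly requires these two to be turned off:
        {"ParameterName": "binlog_backup", "ParameterValue": "0",
         "ApplyMethod": "pending-reboot"},
        {"ParameterName": "binlog_replication_globaldb", "ParameterValue": "0",
         "ApplyMethod": "pending-reboot"},
    ],
)
```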
On PostgreSQL, we've also been very busy. We added support this year for PostgreSQL 15, and got a couple of new extensions from the community, which is very nice. R7g support is here as well. We've been spending a lot of time on our upgrades; we really wanted to shorten the time and make it more consistent for both our minor and patch versions.
We've improved the availability of the read replicas, so you have better availability there. QPM, or Query Plan Management, which is our method of allowing you to fix plans so that they stay the same no matter what on a system, can now be captured and used on replicas, which is very handy.
We added a logical replication cache to help speed up replication when you're doing CDC. And we've added support for pg_vector uh which I'll talk about more in detail as well as pg_active, which allows you to start being able to build uh active active database setups uh with PostgreSQL.
And of course, the last thing that I'm really excited about is we now have PostgreSQL 16 already in preview in our preview environment, and that means we're much closer to getting 16 out in production. So you can get an idea that we're moving much faster on our merges now with Aurora; we're really very close to the community timeline. That's a very exciting thing, and I was very happy with the team on that.
So PostgreSQL continues to innovate and one of the areas that they're innovating a lot is on extensions and pg_vector is kind of one of the big extensions that many people have heard about lately.
So what is pg_vector? It's open source, obviously. It adds support for storing vectors, it currently has two different index types to index that data, and it has search and distance functions. So this adds the things you need to support your generative AI use cases, right?
So this allows you to use your favorite database, PostgreSQL. Now with your new demands from your business around generative AI and you can see the versions that are supported there. Basically all of our latest versions support it. And if you want more details on that project, because it's evolving very quickly, you can look at the GitHub page.
Now, how this works in practice is you can do a create table, and the main difference is you can see this line that says embedding vector. That allows you to put the vector in storage, and then this is the index that you can create on it. And then you can use your regular selects to find the right items, right? So this is super powerful.
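Roughly what that looks like, sketched with psycopg2. The table and the three-dimensional toy vectors are made up (real embeddings usually have hundreds or thousands of dimensions), and the index shown is IVFFlat, one of the two types pg_vector currently offers.

```python
import psycopg2

conn = psycopg2.connect(host="my-aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
                        dbname="app", user="app", password="***")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")

# The key piece is the vector(n) column that stores the embedding; three
# dimensions here just to keep the example readable.
cur.execute("""CREATE TABLE IF NOT EXISTS products (
                   id bigserial PRIMARY KEY,
                   description text,
                   embedding vector(3))""")

# One of the two index types pgvector currently offers (IVFFlat shown here).
cur.execute("""CREATE INDEX IF NOT EXISTS products_embedding_idx
               ON products USING ivfflat (embedding vector_l2_ops) WITH (lists = 100)""")

cur.execute("INSERT INTO products (description, embedding) VALUES (%s, %s::vector)",
            ("hiking shorts", "[0.12, 0.98, 0.33]"))
conn.commit()

# A regular SELECT with a distance operator finds the nearest neighbours
# to a query embedding.
cur.execute("""SELECT id, description FROM products
               ORDER BY embedding <-> %s::vector LIMIT 5""", ("[0.1, 0.9, 0.3]",))
print(cur.fetchall())
```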
There's a great blog with a very, very long name. That has a lot more details about how to use this in a practical application.
So we've been working a lot on manageability. That's one of the things that we're really keen on is trying to make things easier for customers. And one of the big areas is Aurora Serverless.
So Serverless is, you know, a nice concept. But what does it mean in terms of databases? So if you have something like a Lambda and you have a database there on the right side, right? Like these are logical concepts. But in reality, you have actual Lambdas and an actual database server with a specific amount of CPU and memory on it, right?
And that's working fine when you have just one thing hitting it. But as your workload changes and more Lambdas fire up, and more types of Lambdas fire up, the workload changes and you're going to need to increase the amount of RAM and storage on that machine.
And this just will keep increasing, right? But guess what, there's probably periods where that backs off and you need this, right? A small instance again. And that's what serverless is all about. It's all about scaling in place in under a second and it's pay per use on a second basis, right?
So it really is this dynamic system, and it works really well with Lambdas, but it works with just general usage of Aurora PostgreSQL and Aurora MySQL as well. To kind of show that, I wanted to take a challenging workload, not just a typical serverless example. This is simulating day-to-night work, where over the course of a day your workload changes, and then that one piece looks a little odd, but we'll get into that.
So right off the bat, we look and we see, hopefully you can see, that the CPU is about 59% on this system when we're looking at one-minute averages. Remember, averages lie. That's not bad, and the load is at, you know, nine on 16 CPUs. So it looks like this system is fine, right?
What we're looking at on the bottom is our Performance Insights view, which is a great tool; if you haven't used it, really check that out. So that looks fine, but this blue bit looks a bit weird, doesn't it? Like something's happening here. As it turns out, someone's run a report, or a series of reports, that's accessing a bunch of data that we normally don't use during the day, and this is causing some differences.
So let's drill into this first peak, middle of the day. It looks like we're doing fine, right? But when we zoom in, what we see is we're not actually doing fine. We see that the load is actually at 20, which means 20 things want to be on the CPU and we only have 16 CPUs. So we're four short, and this is going to cause some latency problems. Now, this is on a provisioned instance, a 4XL, so that's it: you're just going to get latency. If you're doing serverless, you'll see the exact same spike, right?
But if I zoom in here and look at the ACUs that we're allocating, and an ACU is an Aurora Capacity Unit, that's how we measure how much capacity a serverless instance has, we'll see that we continually scale up each time this spike happens. And remember that each of those lines is one second, so that's a five-second spike. We're going up and down very, very quickly, and this means we don't add any latency when this kind of jitter happens.
So, even though this system looked fine, customers were experiencing pain and with serverless, they wouldn't. Now let's look at that next section, right, that blue chunk. So let's dig into what's going on there.
So when we look at PI, what we see is that it's doing IO. We've got a lot of IO waits because we've asked the database to read in a whole bunch of blocks we weren't using, and that's going to cause some problems. So again, our load is almost at 16, it's not overly high, but that's probably not the whole problem. We need more CPU, but we also need more memory.
So let's take a little closer look at that, because it's not obvious how this is hurting your application. I wrote a little canary that, the entire time, all it does is a point select of something we want to display on the website. So this is a measure of our website latency, as a canary.
So we're running along fine, the latency is pretty good. And then as soon as this workload starts, our latency goes 10x higher, right? This would be very bad for your website. You're probably gonna get a call from your boss when this happens and then you're gonna be tracking down the person that ran that report, right? That's not great.
So with serverless, we can do something completely different. We can actually figure out how to scale this to not have this problem. Looking in serverless, we see the exact same spike, right? But it's going to be a little different.
On serverless, look at how we don't have that much blue anymore; we have green, we have CPU usage. And that is because in that section, you can see how the number of virtual CPUs has gone up. We've scaled it up both in terms of memory and CPU here, so we're solving for the demand that's there, right?
And we go back and look at this latency chart, right? So the blue one is the regular instance and the orange one is our serverless. And what we can see is that when that workload starts, we see a little bit of blip for both of them. But then once serverless adds the capacity, the latency goes back down versus the latency stays up for the regular one, right? And we get an eight plus times reduction in latency by using serverless.
So this isn't just about using serverless with Lambda, although it's a great fit. This is for using it with every kind of application, right? That has demanding workloads.
So how do we do this? Well, one of the ways we do this and this was some of the really interesting technology that the team came up with is dynamically resizing the buffer pool based on demand.
So typically I tell folks about how you have to think about your working set; I'll actually talk about that more later in the talk. That's a hard problem. But with serverless, we're trying to make that easy.
So you have your buffer pool; 75% of RAM is our typical default for provisioned as well as serverless, and that's where we start you off. As you do reads, that's going to heat up the buffer pool, right? That's great. We kind of watch those metrics, and then as more reads come in, we'll actually make more room in the buffer pool, we'll expand it, and then we'll fill that with the blocks.
So now we've got the buffer pool where we want it like when that workload was happening, right? We've scaled up, everything's good that workload stops and now that memory is going to cool off, right. And so we're going to evict those cold pages and we're going to shrink this back down and we can do this very quickly.
So we're moving that memory around, you know, on demand. So this works really well for kind of when you don't know how big your working set size is. This is another advantage of serverless. It's a really nice technology.
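As a very rough mental model, nothing like the real implementation, the policy is something like: grow the buffer pool target while the cache is hot and misses are driving storage reads, and evict cold pages and shrink when the heat dies down.

```python
def resize_buffer_pool(current_gib, hot_fraction, miss_rate,
                       min_gib=8, max_gib=256):
    """Toy scale-in-place policy for a serverless buffer pool.

    hot_fraction: share of cached pages touched recently (0..1)
    miss_rate:    cache misses per second hitting storage
    """
    if hot_fraction > 0.9 and miss_rate > 1000:
        return min(current_gib * 2, max_gib)      # demand is high: grow
    if hot_fraction < 0.3 and miss_rate < 50:
        return max(current_gib // 2, min_gib)     # pages are cold: evict and shrink
    return current_gib                            # steady state

print(resize_buffer_pool(96, hot_fraction=0.95, miss_rate=5000))  # -> 192
print(resize_buffer_pool(192, hot_fraction=0.1, miss_rate=10))    # -> 96
```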
So the other area of manageability that we've been working on is really around how to stage changes and we use the blue green deployment model, which isn't really necessarily the deployment of code, but it's really around making changes to your database.
So I've got a database here. I'm showing MySQL, but we now support this on both MySQL and Postgres. You've got your application connected to your database. In this case, it's a 5.7 MySQL database that you'd really like to upgrade to 8, right?
So you can create a blue green deployment. And what that's going to do is it's going to go and make a copy of the entirety of your setup, right? So it's not just the volume, it's not just one of the instances, it's all of the instances, all your parameters, everything. And you'll notice it has a similar name on the target on the green side as on the blue side right now at the moment, these are both 5.7.
This could also be Postgres, as I said; you could be moving from 14 to 15. But in this case, we're going to upgrade from 5.7 to 8. So the first thing this workflow is going to do, because you've told it that you want to go to 8, is automatically upgrade that side. And you can do various things with this feature, like schema changes and parameter changes.
But one of the main things we put in as base automation was doing major version upgrades, because we know a lot of people are going to want to do that. So the workflow will change this to an 8.0 database. And that's great, and you're like, well, now I have another database, it's 8.0, how does that help?
Well, we start replicating from the 5.7 database to the 8.0 database and we catch it up and once the replication lag is down to a minimal amount, we know they're almost in sync. What we're going to do is make that available for switching.
Once it's available, you can run the command to do the switchover. At that point, the workflow will essentially go to switchover-in-progress. It will verify everything's good from a configuration standpoint, that nothing has changed. Then we're going to stop the writes that are going to the blue side, finish replicating all the changes that are in flight, and then we're actually going to do the switchover. You'll see switchover-complete as the status, and you'll notice that your application, because we're changing the endpoints, automatically moves to the new cluster, right?
And not only that, but we rename the target to the same name. So you don't have to change any of your other scripting or handling because everything has moved, it's not just partial, right. So this simplifies a lot of the change that you have to do and makes it possible to do complex changes. And this is all stuff that you could do manually but is a lot of work to schedule and do. So we've just automated that down to making this very easy.
And when you're done with your blue/green deployment, you do a delete, right? But that doesn't necessarily delete the original source. What it does is just decouple these two things, and it allows you to go do verification of the original source, so you can say, is everything working the way I expected it to? And once it's done, then you can go manually delete that original source cluster, and then you're done, right?
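Sketched with boto3, with all identifiers and the exact target engine version as placeholders: create the deployment, switch over once it's available, then delete the deployment object while keeping the old blue cluster around for verification.

```python
import boto3

rds = boto3.client("rds")

# 1. Create the green copy; the workflow upgrades it to the target version
#    and starts replication from blue to green.
bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="shop-57-to-80",
    Source="arn:aws:rds:us-east-1:123456789012:cluster:shop-cluster",
    TargetEngineVersion="8.0.mysql_aurora.3.04.1",   # placeholder version string
)
bg_id = bg["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# 2. Once the status shows available (replication lag is minimal), switch over.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=bg_id,
    SwitchoverTimeout=300,   # seconds to wait before giving up
)

# 3. Deleting the deployment decouples the environments; the old blue cluster
#    stays around so you can verify it and drop it manually later.
rds.delete_blue_green_deployment(BlueGreenDeploymentIdentifier=bg_id)
```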
We wanted to decouple it that way just so that it's safe. Now, as I said, we now support PostgreSQL, but PostgreSQL logical replication still has a few caveats with it, so you have to fit within those boundaries for this to work. One is you can't do DDLs, because they're not supported by logical replication.
So during the time you're doing blue/green, just don't do DDLs. It doesn't support large objects today, and it requires a primary key. PostgreSQL logical replication also doesn't support sequences, but we handle that for you, and we're working with the community, both through our committers and others, to try to remove these restrictions in future versions.
So it's going to continue to get better and, and you know, better for each use case, but it's pretty useful right now. If you can fit within those restrictions.
The next area, really around manageability, that folks were telling us they wanted to see was help around ETL. So we have this concept called Zero-ETL. What is that?
So on the left, I show you the traditional sort of setup: you've got your OLTP talking to Aurora, maybe some light analytics going on, and on the far side you have Redshift, where you're doing your deep analytics, your major work. You want to connect these two systems.
Well, we have a lot of services that can help you do that. But guess what? That's a lot of stuff to set up, maintain and run. And if the only thing you want to do, if you're not really doing a lot of transformations, you just want the data from one side to the other, that's more complexity than you want to deal with.
So hence Zero ETL. So we're trying to make something that's easy, secure and near real time, right? So it's very quick, it's not batching, it's streaming changes over between Aurora and Redshift.
So one of the keys here is it's storage to storage, and I'll get into a little more detail on that. We do the initial seeding for this, we do the ongoing replication, we detect changes, we do repair. So all of that is automated, and we have metrics available for you to see how this replication is functioning.
So you can see what the lag looks like and if there's any issues, right? So it really takes a lot of the handling out of the system. And so again, the special sauce is using our storage system, right?
We're doing both a full seed and CDC, and in the case of MySQL, the CDC comes from the binlog. For MySQL, we have parallel direct export again, so we're using that from Aurora to go across to Redshift, and now you have the base seeding done. Then we can use that enhanced binlog I was talking about to feed our CDC pipeline and do those ongoing changes to Redshift.
So that works really well and you get a very low lag for your updates to Redshift. Now this is available, it's GA for Aurora MySQL and it's now in preview for PostgreSQL. Um the technology is a little different because of course, how logical replication works with PostgreSQL is different again. So there is a little bit of difference, but the base functionality is the same.
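Setting one of these up is essentially a single integration resource between the Aurora cluster and the Redshift target. A hedged boto3 sketch, with placeholder ARNs (and the PostgreSQL preview may have extra requirements):

```python
import boto3

rds = boto3.client("rds")

# Create the zero-ETL integration; seeding, ongoing replication, change
# detection, and repair are all handled by the service after this.
rds.create_integration(
    IntegrationName="shop-to-redshift",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:shop-cluster",
    TargetArn="arn:aws:redshift-serverless:us-east-1:123456789012:namespace/analytics-ns",
)
```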
One of the other innovations this year: we got a lot of feedback on a few things about our storage that we really wanted to do something about. So we introduced a new storage type, and this is really to solve two issues, one around price predictability and one around a few things around performance.
So around price predictability. Our original storage, which we're now calling Standard, I'm gonna walk you through how that pricing works.
So, if you do a read from memory, from the cache, of course it doesn't cost you anything in IO, because you didn't do any IO. But if you do a read from storage, an 8K page for Postgres or 16K for MySQL, you're going to get charged that fraction of a penny for each read. That's pretty easy to understand, right? That's not too tough.
But if you do a write, we bill this in 4K increments. So one log write of 4K costs the same as a read, and you can see that even though we're doing those six copies, we only charge for the one logical write. Now, if you do a single 1K write, you still get charged the exact same amount; OK, so less than 4K, that makes sense. But if for some reason those rows have moved around into more blocks across the storage system, that one 4K write might have turned into four 1K writes, and your price just went up by 4x, right?
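A small worked example of that billing arithmetic, assuming standard-storage reads are billed per page fetched and writes per 4K of log; the per-I/O price here is just a placeholder.

```python
import math

PRICE_PER_IO = 0.20 / 1_000_000   # placeholder: $0.20 per million I/Os

def read_ios(pages_fetched):
    return pages_fetched            # one I/O per page read from storage

def write_ios(log_write_sizes_bytes):
    # Each log write is billed in 4 KB increments.
    return sum(math.ceil(size / 4096) for size in log_write_sizes_bytes)

# One 4 KB log write: 1 I/O.
print(write_ios([4096]) * PRICE_PER_IO)

# The same logical change hitting rows spread over four blocks becomes
# four 1 KB log writes: 4 I/Os, so 4x the cost.
print(write_ios([1024, 1024, 1024, 1024]) * PRICE_PER_IO)
```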
So this was the thing that customers didn't like because if you're doing a lot of IO it meant a lot of unpredictability about the cost. So a lot of feedback on that, we spent quite a while figuring out what we wanted to do and that's where we came up with the Aurora IO Optimized uh storage type.
So this is predictable pricing and better price performance for Aurora when it comes to high IO workloads. This is just a shot from the console showing that it's just a simple selection between the two when you're doing a create or modify.
The big thing this is aimed at is folks that are doing a lot of IO. So if you're spending more than 25% of your total Aurora spend on IO costs, then this is probably going to be super helpful for you. We've seen, you know, 40% as the number there, but we've actually had customers save more than 50% on their costs by using this feature.
It's available for both Postgres and MySQL, Serverless, On-Demand, and use with our RIs; it works with all of them. So let's get a little more detail on that pricing. Standard is fairly simple: just our on-demand pricing as it's always been. For IO-Optimized, you have an increase on your compute costs of 30%, and on storage it's 125%, but there's no IO cost.
So you're paying more to do the work up front, but it's all predictable. This is set at the cluster level; it's not an instance-level configuration, it's at the cluster, so it applies to all the nodes. You can switch back and forth; it says you can switch once a month, but that means you can switch to it and back within a month, so if you want to try it, that's perfectly fine.
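A back-of-the-envelope comparison with made-up monthly numbers, just to show the shape of the trade-off: IO-Optimized drops the I/O line but marks compute up by 30% and storage by 125%.

```python
def monthly_cost(compute, storage, io, io_optimized=False):
    if io_optimized:
        return compute * 1.30 + storage * 2.25   # no separate I/O charge
    return compute + storage + io

# Hypothetical I/O-heavy month: 40% of the bill is I/O.
compute, storage, io = 550.0, 50.0, 400.0
standard = monthly_cost(compute, storage, io)
optimized = monthly_cost(compute, storage, io, io_optimized=True)
print(standard, optimized, f"{(1 - optimized / standard):.0%} saved")
# -> 1000.0 827.5 '17% saved'
```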
It's supported for PostgreSQL 13 onward, as well as MySQL 8.0.3 onward, and it's on all the current instance types, so you can pretty much get this on anything. And one of the interesting features is you can have a different configuration for global database if you really want; I'm not sure why you'd want that, but it's available.
So let's go back to that same shot of the cost model, right? And this is just simplifying saying if you're on IO optimized, none of those things cost you anything. You don't have to worry about if you increase your read rate or your write rate, your costs aren't going to change very predictable, right?
But I told you we also did some performance things, because we're aiming this at workloads that do a lot of IO, so we actually made changes to batching and some other optimizations to improve performance. At sort of a midrange scale, on a 2XL, what we see on a write-heavy workload is up to about a 30% improvement at best. But you'll notice that the yellow line for IO-Optimized is always higher than the standard line, right?
So in general, we have confidence that this will get you a performance benefit; it might be small, but it's there. On larger instances, the benefit is bigger because of the way some of the changes work, so we saw up to a 50% performance improvement there. That's super helpful if you do a lot of IO, right?
But we also have another new feature if you're doing a lot of IO, and that's called Optimized Reads. There are two sort of sub-features inside of it. The first one's for temporary objects, and when I say temporary objects, I mean when you're doing sorting, spilling to disk, anything like that.
I'm just showing Postgres here: a Postgres instance, the regular thing, with Aurora storage. If you do a sort in Postgres and you go larger than work_mem, or maintenance_work_mem, you're going to spill to disk, right? And on Aurora today, when that spill happens, it goes to EBS, and then at some point gets read back. We provision 2x the memory in EBS for you to do that kind of work, and that works fine.
But now you have the option to run with Optimized Reads. And how you get Optimized Reads is you order an instance type of one of the D's that we support. Currently that's the R6GD, extra large through 16XL, as well as the R6ID up to the 24XL and 32XL; it's always a little confusing, those names.
So when you order that, what you get is NVMe storage that comes with it, just like if you bought one from EC2, and we're going to allocate 90% of that NVMe to temporary objects if you're on the Standard IO plan for Aurora. We reserve 10% of it for wear leveling and to help the performance of the SSD.
So now when you do a spill, it's going to the NVMe, and there are two advantages here. One, reduced latency, because the NVMe is faster. Two, you get 6x the memory in NVMe, so you can do larger sorts. So if you were ever increasing your Aurora instance size because you needed more area to sort, to build big indexes, or to do a lot concurrently, you can now use this feature instead.
So that's if you're on standard IO, if you're on IO optimized, we're giving you 2x the memory. So basically the same as what you get on EBS for doing spill, right? For doing temporary objects.
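You can see whether a query is spilling, and how much, with a plain EXPLAIN ANALYZE; when it does, those temp files are what land on the local NVMe with Optimized Reads. The connection details and table here are made up.

```python
import psycopg2

conn = psycopg2.connect(host="my-optimized-reads-instance.xyz.us-east-1.rds.amazonaws.com",
                        dbname="app", user="app", password="***")
cur = conn.cursor()

# Force a small work_mem so the sort spills, then look for
# "Sort Method: external merge  Disk: ..." in the plan output.
cur.execute("SET work_mem = '4MB'")
cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders ORDER BY created_at")
for (line,) in cur.fetchall():
    print(line)
```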
So what's the rest used for, you're asking? Well, that's the second feature: tiered cache. Regularly in Postgres, if you're doing a read from the buffer, like if it's in the shared buffers, you do this read, you get the answer back, easy, right? If it's not in the buffer, it has to come from storage, which takes a little longer, and then comes back to Postgres. That's how it's always worked.
But now, if you order IO-Optimized with tiered cache, we're still going to use part of the NVMe for the temporary objects, but now you also have 4x the size of your total memory as an extra cache layer, right? So this means you can do a lot more with this instance, and I'll dig into that.
So we do reserve a little bit of the shared buffer about 4.5% as metadata for this cache. So how this works is when you go do that read with Postgres, it's first going to go check the tiered cache metadata to see if that block is in the tiered cache. And if it's not, then your normal read happens, right?
So let's assume that block gets read into memory, it gets read a few times, and then those reads stop and that block ages out, right? What we're going to do is update that metadata saying this block is now in tiered cache, and we're going to destage it, and that happens asynchronously. So it doesn't block anything; it doesn't block page evictions. It's really nice.
The other thing that you'll notice is that this is how we get blocks into the tiered cache. Notice we didn't put them in when we did that basic read, so we didn't force a write to happen during our read request. That's always been one of the challenges with these kinds of caches: they tend to have weird side effects, and in this design we've tried to really avoid having any of those, right?
So now the block's in cache, so the next time you read it, we'll know from the metadata that it's there and we'll just read from tiered cache. That's going to be much quicker than going to Aurora storage, because the latencies are lower; it's on-box, right?
Now, what happens if you're going to do an update? So you've updated that block, and this usually is a problem for caches, because you have to do invalidation or update the cache. What we do here is go and update just the metadata, so it's super quick and adds almost no latency to our updates. All we're doing to that metadata is saying: remember that block you have in cache? It's now free, don't reference it. So it's super quick, and again, we've avoided the complication of destaging or making changes right then.
Speaking about how things get evicted from the cache in general: at the moment, and I say at the moment because we're always making changes, we use a random eviction algorithm. The benefit of the random eviction algorithm is that it's very fast and efficient, because it doesn't have to go through an LRU and you don't have to maintain an LRU. The downside is that, for at least some use cases, you have to be careful about trying to cache more than the size of your NVMe, because of what you're going to evict, and we'll talk about that in a little more detail.
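Purely as a mental model, not the actual cache code, the read, eviction, and update paths look something like this in Python: a buffer pool eviction destages to NVMe asynchronously, a plain read never writes into the cache, and an update only flips a metadata bit.

```python
import random

class TieredCacheSketch:
    """Mental model only: buffer pool in RAM, tiered cache on local NVMe."""

    def __init__(self, nvme_capacity):
        self.buffer_pool = {}     # block_id -> data (RAM)
        self.nvme = {}            # block_id -> data (local SSD)
        self.metadata = set()     # block_ids currently valid in the NVMe tier
        self.capacity = nvme_capacity

    def read(self, block_id, read_from_storage):
        if block_id in self.buffer_pool:          # 1. shared buffers
            data = self.buffer_pool[block_id]
        elif block_id in self.metadata:           # 2. tiered cache hit
            data = self.nvme[block_id]
            self.buffer_pool[block_id] = data
        else:                                     # 3. Aurora storage
            data = read_from_storage(block_id)    #    note: no cache write here
            self.buffer_pool[block_id] = data
        return data

    def evict_cold_page(self, block_id):
        # A page aging out of RAM is destaged to NVMe; in the real system this
        # happens asynchronously, so it never blocks the eviction itself.
        data = self.buffer_pool.pop(block_id)
        if len(self.nvme) >= self.capacity and self.metadata:
            victim = random.choice(list(self.metadata))   # random eviction: no LRU to maintain
            self.metadata.discard(victim)
            self.nvme.pop(victim, None)
        self.nvme[block_id] = data
        self.metadata.add(block_id)

    def update(self, block_id, data):
        # Updates don't rewrite the NVMe copy; they just flip the metadata bit
        # so the stale cached block is never referenced again.
        self.buffer_pool[block_id] = data
        self.metadata.discard(block_id)
```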
But everything I've talked about requires you to kind of think about how big your working set size is, right? So I want to talk a little bit about working set sizes just so you have an idea of what I meant by that.
So if you have this database that has orders, and it's all nicely partitioned, a really well done database, and it's 800 gig, you might think, my working set size is 800 gig, I need a huge box. But in reality, old orders are not referenced that often, right? You're mostly working with the current year, maybe the past year, maybe even just the current month. So this database might have a working set size of 80 to 100 gig, not 800 gig.
On a different database that has, say, inventory information by location (and can you guess where the last one is? That's where I live), the heat is different. Each inventory location will have some heat and some coldness, so that one's not going to be the same as your orders. In this case, it might be 200 to 400 gig of working set that you need, right?
So you kind of need to understand a bit about how your application works to know that. Let's examine how this works in practice. I'm running a sysbench read-only point-select workload with a uniform distribution, so this is fully random, on a 4XL, so that's 128 gig of memory; at 75% of that, you can kind of see how much we're going to get there, 80-some gig. On the vertical axis is the number of queries that are at that latency point, and the latency goes left to right from smaller to bigger, so you want to be on the left-hand side with your queries, right?
So I run with an 85 gig working set, and it doesn't matter whether I'm using the tiered cache or not, because this all fits in memory, so we're not using the tiered cache anyway. What we see is about a 0.23 millisecond average latency, and everything that's happening is on that blue side that I'm labeling memory; it's happening in memory, right?
So we like this; this is how we like to run our databases. Now, one word of explanation: when I say random, just because the workload is random doesn't mean all the block heat is going to be uniform, because we have indexes. When we access a random index entry, we're going to go through the tree, right, and we're going to get to the heap and read a row, and the next one's going to go through a different part of the tree. But you'll notice that we're starting to heat up the blocks at the top. As we continue, more and more blocks heat up, until the root node is super hot and the next level down is hot, right?
So even with a random workload, you do have some distribution of heat. Back to our workload: now I'm doing a regular one, no tiered cache here, just our standard setup, uniform, but I've increased to 340 gigabytes. So this is four times as large, way bigger than memory, right? And what we see is the amount of results we're getting from the memory side of the graph has gotten a lot smaller, and we're starting to see latencies that are coming from storage. So that's slower.
We actually see a 4x increase in latency on this workload, which means a lot less performance, right? What you can do here is live with this latency, or buy a box that's four times larger, and that's not a very fun conversation to have with your leadership, because that's awkward. You're like, would you like performance, or would you like a lot of cost, right?
Now, what happens if we go even larger? Now we're out at 8x, and we see again that the number being done in memory is smaller and the number going out to storage is even larger. This just pushes our latency out even further, and we're now at 5.5x. This is a dramatic increase, right? It might be OK for your application, but if it's not, this is a pretty big deal.
So now let's try it with the tiered cache; that's the one in pink. What you see is we still don't have a lot happening in memory, because this is a very big workload at 340 gig; maybe only the index is fitting in memory, and most of the data, most of the heap, isn't. But what we see is this next section: this is the NVMe piece, where we're getting latencies from the NVMe that are much better than from Aurora storage. And so now we're at 0.35 milliseconds, which is only 1.5 times higher than our base in-memory case, right?
So we went four times larger. This means that today, if you're running on a box where you don't need all the CPU, you can possibly shrink down by one or two instance classes and save a lot of money by going to this instance type. Now, again, this is a uniform workload. I went to 8x, and the 8x doesn't actually fit in the NVMe anymore, so now we're going to end up doing reads from storage. You see the amount of latency I'm getting from the NVMe has moved, and we're starting to see some of it coming from storage, and my latencies went up. So I'm at 3x. It's not great, it's not bad, but, you know, it could be better, right?
But again, this is a uniform workload. How many people think they have a completely uniform workload at their job? Yeah, exactly. So it's a good benchmarking thing, it's a good thing to illustrate things, but it's not how things actually work. And part of the reason is, I'm just going to show you a right-leaning index. If you have an index on an order date, then as the orders keep coming in, they're all going to be on the right side, because that index will just keep growing to the right. And so you will be heating up a set of blocks that is much smaller than, you know, all of the database, right?
Now, again, you may get some requests for some old orders, but that's going to be less frequent than what you're going to get for the current stuff and the rows that are being inserted. So to show how this will work, I ran a Pareto distribution on sysbench, which is really focused on a small number of blocks but still touches all the blocks. That's the same workload at 680 gig, in yellow, and we now see that we're getting a lot more in RAM, we're getting almost all of the rest of it in NVMe, and we're only 1.4 times worse, right?
So with this, on this kind of workload, we've managed to support an 8x larger working set size and not have a real material change in our latency. This is where the tiered cache feature is a real advantage. I have one other example here, since we talked about pg_vector; this is from one of my colleagues, where they ran on a 12XL a very large 1.28 terabyte database with a very large index of over 700 gig, because vectors tend to have very large indexes compared to the data, right? And we see a 9x performance improvement by using tiered cache on this system. So if you have a lot of data on the biggest boxes that doesn't fit in RAM, this is the other way to get the performance you need.
Now, I have one last feature to talk about that probably no one's heard about: Limitless Database. Probably no one saw Peter's announcement on Monday. We've been working on this, we're very excited about it, and the team has done just a great job getting this to preview. What is Limitless Database, though? Well, it's about scaling, really. This is why we built it: we built it to help scale databases, and that roughly equates to sharding. And what do I mean by sharding? What we're talking about is you have your database, and over time your utilization grows, so what you're going to do is keep breaking it into more and more pieces, and it'll continue to scale. This works great, but you'll notice at the end you have a lot of things to manage, right? That makes a lot of complexity.
So I actually don't believe that scale equals sharding; I think scale equals managing sharding, because there's a lot of complexity to sharding. One of them is resharding. You start off with your database originally and you figure out how to do the initial sharding, and that takes a bunch of work, but it's a one-time thing, and you think, I did sharding, I'm done, right? But then over time, what happens is one of those shards gets too large and now you need to reshard, and that's problematic, because this has to happen while you're running, and it probably has to happen on a somewhat frequent basis. So that's a challenge you have to deal with. One of the other ones, and I think the biggest one, is consistency.
So let's say you want to add a column to a table; you're going to do a DDL. Well, that means you've got to run it on all these things. One, you've got to figure out how to run that in a way that's going to roll out for your workload. But also you might miss one, you might have a configuration change, you might be missing a parameter; all these kinds of things you have to keep synced up. That's a challenge, right?
Let's say someone wants to know the number of open orders at midnight on these databases. Well, you can run that query at midnight, but it's not actually going to execute exactly at midnight on every one of those databases, so you're going to get a slightly different answer. And the same thing for backups: if you want to do a backup at two in the morning, you'll run a job, but how it gets run will mean that you get slightly different backups. So if you have any cross-shard transactions, a transaction might be in one of them but not the other, and that's hard to explain the next day, right?
And the other one that's sort of challenging, that's a little harder, is capacity management. When you have that one big database, you can think of it as a very big container, so as the heat goes up and down during the day, or fluctuates seasonally, you have this big container to kind of absorb that load and fit those waves of demand together; even when it gets smaller, it's fine. But when sharded, you'll see that you have heat in different places, and that heat's going to move around during the day or during the week or during the month. And that causes a lot of problems, because you're in a small bucket, so it's really easy to overwhelm one of those.
So what happens? People end up with bigger buckets, bigger containers, and that costs you money, and you're not utilizing that system, so that's not great. With Limitless, we've really tried to solve some of those problems. You start with your regular Aurora cluster; this is not a different thing, it's part of that. You have your storage, and we have this shard group, which is basically a container that collects these things. The first thing you get is the distributed transaction routers; I call them just routers for short. That's where you connect your application to, so your queries go through there and get routed to the right shard. It handles the distributed queries, distributed transactions, DDL, DML, all that stuff. And your data is stored down on the shards, and you can see that there are different sizes; we do hash sharding, and so depending on the distribution of your data, we'll cut it up to match that.
Now, this does scale. One of my colleagues ran a benchmark up to 2 million TPS just recently, just to kind of show that it works well, so we know we get scale out of this system. We know that when we have hot shards, like this shard I'm showing that's gotten very hot and is too big, we have automated resharding, so we break those apart. We have serverless capacity, so we're using that serverless technology I was talking about for both the routers and the shards. So during the day, as all that load moves up and down, we handle that without having to do splits or add anything, because we have this serverless capacity. And we handle the consistency.
So when you do a DDL, all the routers know about all the DDL, and so do the shards; we're doing a distributed transaction across the entire system. And the reason we can do this is because we have a global clock, and that allows us to do a bunch of things that are different from most sharded systems, in that we do have a consistent system. We can do a backup that is consistent, so you won't have that split-transaction case I was talking about in the previous example.
So let's walk through a little bit of how that works in detail. We're using a global clock now, and I'll walk through a set of transactions. The first transaction is just doing a select of a particular bid from pgbench. What happens is the router assigns a time, and then that time is what's used on the shard to set what would be the snapshot, in PostgreSQL terms, to run it. So it's going to run that query and give you a result, right?
Transaction two, which starts just a little bit later, runs a select for a different bid, 801. It's going to get a later time, T110, and it's going to run, and it might be the same shard or it might be a different shard, and it's going to get an answer of one. But let's go do an update to that, so we're updating that one now to the value of 1001, and this is going to commit. This works through the router, and the router goes: this is a single-shard transaction, so we don't have to do two-phase commit, we can do single-phase, we can assign a time, and once the time is assigned, we can make sure that we do this correctly so that everything remains consistent. So that commits. Let's say we do transaction three after that, and it's going to look at the same bid of 801. It's going to get assigned a later time, T125, and just like how Postgres works, because T125 is greater than T110, we're going to see the current value of 1001, right?
But let's go back to transaction one; it hasn't committed, so it's still in the same open transaction, and it does that same select of bid 801. This gets assigned T100, because that's our snapshot time for that transaction. So it's going to apply the same semantics we normally do with Postgres and say: what was the value of that thing at T100? Which was one.
So we get the result we're expecting, because we're using these clocks, and we use this, like I said, for backups, for DDL and DML, for consistent reads, all the things. And this is sort of a fundamental difference with Limitless Database: you're not giving up consistency by going with sharding, which is usually one of the downsides.
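To make that timeline concrete, here's a toy Python model of snapshot reads against a global clock: every committed row version carries its commit time, and a statement sees the newest version committed at or before its assigned read time. The values mirror the walkthrough above.

```python
class VersionedRow:
    """One logical row; versions are (commit_time, value), oldest first."""
    def __init__(self, initial_value, commit_time=0):
        self.versions = [(commit_time, initial_value)]

    def read(self, read_time):
        # Visible version = newest one committed at or before the read time.
        visible = [v for t, v in self.versions if t <= read_time]
        return visible[-1]

    def write(self, commit_time, value):
        self.versions.append((commit_time, value))

bid_801 = VersionedRow(1)

t1_snapshot = 100                            # transaction 1's assigned snapshot time
print(bid_801.read(t1_snapshot))             # -> 1

bid_801.write(commit_time=110, value=1001)   # transaction 2 commits at T110

print(bid_801.read(125))                     # transaction 3 at T125 sees 1001
print(bid_801.read(t1_snapshot))             # transaction 1, still at T100, sees 1
```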
So that's all the time I have for today. There are a lot of great related sessions, and I'll leave this up for a little bit. The Limitless Database session was the embargoed one; that was done yesterday, so it's a great one, and you'll be able to see it on YouTube to get a lot more details about a lot of these things. The hyperscale one talks about some of these other challenges around sharding as well. And with that, thank you very much. I'll take a few questions in the last couple of minutes, and remember to fill out your surveys.