What’s new with Amazon RDS?

Suresh (General Manager for Amazon Relational Database Service): Hey everyone. I'm Suresh, I'm the general manager for Amazon Relational Database Service. I hope you've been having a fantastic and productive week so far. And thank you all for coming to this session.

Over the next 60 minutes, I'm going to talk through some of the major capabilities that we have introduced in RDS this year. It's a 300 level talk, so I'll assume that you know what RDS is, perhaps you even use it, but not necessarily that you know how it works under the hood. And that's a little bit of what we get into today.

To level set, RDS is a fully managed database service that makes it easy to set up, operate and scale relational databases in the cloud. We offer customers a wide choice of databases including cloud native Amazon Aurora, open source databases, and as of this week, we're excited to say three commercial databases including DB2, which is our newest.

And for customers who bring these databases to RDS, what's our core value proposition? The main thing is simplicity. Customers come to us because it's simpler than self managing your database, whether on premises or even in the cloud. And so the first thing that we look for is are we doing what we need to do in terms of making it easy to develop on RDS? For example, if you had a generative AI use case, what are we doing to help make it easier for you?

The second thing is, are we making it easy to operate and administer a database in the cloud? Right? And this is a big part of what we do. We offload the undifferentiated heavy lifting associated with operating databases in the cloud - backups, patching, all that stuff. So that's the core value prop.

But in our quest for simplicity, we also wanna make sure that we're consistently hitting and improving and raising the bar on the core things that anyone looks for from us when you're operating a transactional database. Any mission critical database has to be up, right? So availability and durability are paramount. We also focus a lot on performance and scale for two reasons - as your business grows and as your workloads grow, we want you to have confidence that RDS headroom is also growing, that our ceiling is rising with you. And second, frankly, we want to deliver better price performance every single year so that you can do more with less.

And then last but definitely not the least, security and compliance almost goes without saying - data and security are two peas in a pod, right? You have to have secure data for your enterprise workloads, and then compliance is critical because you need compliance both for your own internal business operations, but also you may have products where your customers expect specific compliance of you.

So that in a nutshell is RDS. And so when we think about what we're going to deliver for our customers - and most of this comes from conversations with customers such as yourself, whether at re:Invent or through the year - we look at innovating across these six different dimensions: increasing customer choice, availability, durability, performance, scale, and security and compliance.

So let's get right to it. We're going to organize our conversation today along these different dimensions starting with customer choice...

[Transcript continues]

It costs $19.99, right? So let's click one level deeper and look under the hood and see how this happens.

So at its core, an important concept is called vector embeddings. So what we do in this case is you take your product catalog and you chunk it up into smaller pieces - we call it tokenization - and you run those smaller pieces through the foundation model.

And what comes out on the other side are these vector embeddings. So a vector is nothing but a list of numbers, right? An ordered list. If it were a two dimensional vector, you'd call the dimensions the x and y axes. If it's 1,536 dimensions, it's the same thing, just a much higher dimensional space.

And the key is that in these higher dimensional spaces, the closer two vectors are clustered, the more similar they are. This is the fundamental core concept, right? Similarity in a higher dimensional space.

So let's bring it back to your example. You have this catalog application, right? So you have your catalog, you break it up, you put it through a foundation model. And what comes out is a bunch of embeddings, which are stored in a database and linked back to IDs. That's a one-time thing, or maybe an infrequent thing where you're updating your metadata. Then in the actual application, your user asks a question, and the question itself is tokenized.

You create a vector embedding out of it, and you use that vector embedding to search what you've already stored in the database to find similar data, which then gives you not just the user's question, but also additional context that's specific to your enterprise, which you then run through another LLM.

And then what comes out is a more specific answer like, hey, this is $19.99. So this is the transaction part of it. And when you look at the intersection of all this, it's the database that stores the vectors.
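To make that query-time flow concrete, here is a minimal sketch in Python, assuming a PostgreSQL database with the pgvector extension and a hypothetical embed() helper that calls your foundation model; the table, column, and connection details are illustrative, not from the talk.

```python
import psycopg2

def embed(text: str) -> list[float]:
    """Hypothetical helper: call your foundation model and return a 1536-dim embedding."""
    raise NotImplementedError

def retrieve_context(question: str, top_k: int = 5) -> list[str]:
    """Embed the user's question and pull the most similar catalog chunks."""
    query_vec = "[" + ",".join(str(x) for x in embed(question)) + "]"
    conn = psycopg2.connect("host=my-rds-endpoint dbname=catalog user=app")  # illustrative DSN
    with conn, conn.cursor() as cur:
        # '<=>' is pgvector's cosine-distance operator; smaller distance means more similar.
        cur.execute(
            """
            SELECT chunk_text
            FROM catalog_embeddings
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_vec, top_k),
        )
        return [row[0] for row in cur.fetchall()]

# The retrieved chunks plus the original question then go to the LLM,
# which produces the grounded answer (e.g. "this is $19.99").
```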

So the question is, why is this a big deal? Right? Turns out there are several challenges with storing vectors. The first is it takes time to generate these vectors and to actually index them. Second is, you know, if you're storing 1,536 numbers with every single column, that starts adding up, and you can't easily compress these things because they're mostly random numbers.

And the third is the query time is actually complicated because, again, when you're trying to do a similarity search, it's not like you're comparing one number to one number - you're actually comparing 1,536 numbers on one side to 1,536 on the other side, right? And so it certainly takes longer than the animation suggests.

Um, and so there's this concept of approximate nearest neighbor, and effectively what you're doing is you're not looking for exact matches, but you're looking for stuff that's close enough. I mean, think of a search that you do on an internet search engine. You don't need all 50 pages. What you care about is good enough answers on the first page. And that's kind of what this is, right? It's faster than getting the exact comprehensive list, and you get a share of the total true results, which is what we call recall. So that's what you want out of the vector database: fast builds, fast searches, and something that's also efficient from a data and cost perspective.
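As a toy illustration of the recall idea (my own example, not from the talk): recall is just the fraction of the true nearest neighbors that the approximate search managed to return.

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Exact top-5 neighbors vs. what the approximate (ANN) index returned.
exact = [101, 205, 307, 412, 518]
approx = [101, 205, 307, 999, 518]
print(recall_at_k(approx, exact))  # 0.8 -- "good enough" answers, returned much faster
```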

So the question becomes, what do you use as your vector database, right? Do you create something bespoke, something like Pinecone, which we support through Bedrock - again, we're all about choice - or do you want something in PostgreSQL itself, as an example, or OpenSearch? Then you have to question how much data are you storing? What do you really care about? Is it the performance or is it the relevance of the answers that come back, right? Speed or relevance.

And then last, what kind of application do you have? Is your application one where you're constantly updating your metadata, so you need better indexing? Or is it very high throughput, so you favor query time and your ability to iterate your apps through new schemas?

One thing we hear from our customers a lot is, again, they want choice. What they want is the ability to use vectors within their existing workflows without having to go create something else like a new vector database, and without having to set up ETL. And that's where, for RDS for PostgreSQL, pgvector comes into the story. pgvector is an open source extension that allows you to store, index, and search vectors.

It's co-located with your metadata, so it's easier to create, and it gives you a range of options in terms of distance functions - things like cosine distance or Euclidean distance. It's fairly full featured, and when you use pgvector with RDS for PostgreSQL as a builder, it comes out of the box.

A key decision you'll have to make is what kind of index scheme you want to use. When pgvector was launched, it had one index type called IVFFlat. After that, it added a second index type called HNSW. So what are these things, and why do you care as a builder?

With IVFFlat, the index is essentially organized as lists of vectors, while HNSW is more like a graph model. This has implications for performance in terms of recall and search performance. And then both of these have different characteristics in terms of build: IVFFlat requires you to have prepopulated data, but HNSW is able to iteratively add more data, and it's also much faster in terms of searches.

So as a builder, when you think about using pgvector with RDS for PostgreSQL, your choice is going to be between the two: if fast indexing is your main priority, use IVFFlat. But if you're interested in search and recall - higher performance, higher recall - or ease of management, use HNSW.
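As a rough sketch of what those two choices look like in practice with pgvector (the table and the tuning parameters `lists`, `m`, and `ef_construction` are illustrative, and in reality you would pick one index type for a column rather than building both):

```python
import psycopg2

conn = psycopg2.connect("host=my-rds-endpoint dbname=catalog user=app")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS catalog_embeddings ("
        " id bigserial PRIMARY KEY,"
        " chunk_text text,"
        " embedding vector(1536))"
    )

    # Option 1: IVFFlat -- fast to build, but works best on a prepopulated table.
    cur.execute(
        "CREATE INDEX ivfflat_idx ON catalog_embeddings "
        "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)"
    )

    # Option 2: HNSW -- builds incrementally and gives better recall/query throughput.
    cur.execute(
        "CREATE INDEX hnsw_idx ON catalog_embeddings "
        "USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64)"
    )
```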

Couple of charts: on the left side, yellow is IVFFlat - it's faster to build, and lower is better. On the right side, higher is better, and HNSW wins on query throughput. But the most important thing to note here is this is a very rapidly moving space.

So I wouldn't take anything as a fixed point in time. When we first introduced IVFFlat, a lot of our customers said, hey, you have to be better in terms of search performance. So we worked with the maintainer to get HNSW, and when we did that, they said, this is great, but indexing speeds need to be faster.

So 0.5.1 is even faster - a very fast moving space, and there's a lot more coming. So as a builder, how are you going to keep up with this? And this is again where one of the benefits of using a fully managed service is you don't have to think about it. Just stay current with the latest minor version of RDS for PostgreSQL, and we will keep making sure that you get the latest and greatest in pgvector. And it's not just the latest minor version - also stay current on your instance types.

So we have an example here of Graviton3 versus Graviton2 - higher is better - and Graviton3 outperforms Graviton2. So again, as a builder, if you're looking to build generative AI use cases, one of the reasons to consider using RDS for PostgreSQL is you can stay current without much work. Then you get to focus on the things that are specific to your business.

So that's on generative AI. Switching gears a little bit, another thing a lot of our PostgreSQL developers have asked is, hey, how do I get more flexibility in terms of innovating in and around RDS for PostgreSQL? Right? And so last year we introduced something called trusted language extensions, which gives customers a lot of flexibility in the ways they can use PostgreSQL.

But let me just start with a recap of what extensions are. Extensions are a very powerful mechanism in PostgreSQL. They let you take literally any part of the system which has hooks and alter its behavior. So pgvector itself, as an example, is an extension. The reason this matters is that it allows you as a builder to iterate a whole lot faster than core PostgreSQL does.

Core PostgreSQL releases a new major version every year. And frankly, if you're waiting a year to get a new index type into pgvector, the world just moves faster, right? So you get a lot of speed, a lot of innovation, by using extensions. But there are challenges, right? And the challenge is many of these are written in C, and they require deep access into the system.

And when you do a major version upgrade, you know, it doesn't always work out of the box. And so for someone like us as a managed service provider, or for someone like you using a managed service, it becomes a little problematic, because we don't want to give deep access to extensions - there could be security issues or, you know, stability issues.

And then of course, when we offer a new version, we want to make sure that it just works - we don't want you to choose your own adventure, right? If you use an extension like pgvector and the next major version comes, it should just work. The way we have historically solved for this is we do a lot of the qualification and certification ourselves, and we support 90-plus extensions. Works great.

But the truth is there are thousands of extensions out there. And so that's what trusted language extensions is meant to solve: to take away the dependency on RDS when you're a builder or a customer, so that you can innovate faster than a new major version or a new minor version.

So as a builder, you build an extension using the TLE framework, the trusted language extensions framework. As a customer, you say, hey, I want to use it. As a DBA, you say which extensions your users are allowed to use versus not. Easy.
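Here's a minimal sketch of that builder-and-customer workflow with the pg_tle extension, assuming pg_tle is already enabled on the instance; the extension name and function body are made up for illustration, and the exact pgtle call signatures are worth double-checking against the documentation.

```python
import psycopg2

# A trivial extension body written in PL/pgSQL; in practice this could also be
# written in PL/Rust or another trusted language. Everything here is illustrative.
EXTENSION_SQL = """
CREATE FUNCTION hello_tle() RETURNS text AS $body$
BEGIN
    RETURN 'hello from a trusted language extension';
END;
$body$ LANGUAGE plpgsql;
"""

conn = psycopg2.connect("host=my-rds-endpoint dbname=appdb user=admin")  # illustrative DSN
with conn, conn.cursor() as cur:
    # Builder step: register the extension with the TLE framework.
    cur.execute(
        "SELECT pgtle.install_extension(%s, %s, %s, %s)",
        ("my_hello_ext", "1.0", "Demo trusted language extension", EXTENSION_SQL),
    )
    # Customer step: install it like any other extension and use it.
    cur.execute("CREATE EXTENSION my_hello_ext")
    cur.execute("SELECT hello_tle()")
    print(cur.fetchone()[0])
```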

Last year when we launched, we supported three of the trusted languages. These are the three most popular except for C, which a lot of extensions are written in - again, for the reason that C is not really a safe language. Our customers said, we love this framework, but we need something that's faster.

And so we now support PL/Rust - it's fast and safe. And when we ran a test - this is an example of normalizing, I think it was 100,000 vectors, 768-dimensional vectors - the PL/Rust implementation is between 4x and 1,817x faster than the alternatives.

So that's on making it simpler to develop with RDS as a builder. What about the operational side of this? As I mentioned earlier, a big part of what we want to do for you is offload the undifferentiated heavy lifting associated with running a database. It's all the stuff on the right - we want to take it off your hands so you can focus on the left: schema design, query construction, query optimization, the stuff that's more specific to your business.

And our customers love this, right? A lot of customers use RDS, and for something like updates, we provide one-click updates. But our customers have said, hey, as good as this is, it still causes trepidation every time a customer updates. And why is that? First, there's downtime, right? No one likes to take downtime. You want to be in business 24/7.

And the second thing is if it's a major version upgrade, you have to qualify your application because the new major version of a database may or may not work exactly the same as before, right? And so that's a heavy investment of your resources to make sure that it works for the new major version of the database.

So I'm excited to announce three new improvements we've made this year. The first is extended support for PostgreSQL and MySQL. We now support both of these databases three years past community end of life. And so if you are a customer that loves your application and loves your major version of PostgreSQL or MySQL and just doesn't see a reason to move,

we will help you stay on it and make sure that you're getting all the critical patches for CVEs and any major bug fixes. You will be covered under an SLA, you will have Multi-AZ support, everything that you'd expect. And this essentially gives you three additional years where, on your own time, you can figure out when you're going to qualify for the next major version. So that's one.

But what if you're doing a minor version upgrade? These come more frequently, and they're really a good idea to apply because they often have security patches. In the case of Multi-AZ DB clusters, which again are available for MySQL and PostgreSQL, we've made a major innovation, which is to support one-second rolling upgrades.

So how do we do this? Let me tell you a little bit about what Multi-AZ DB clusters are, and then it'll make more sense. So essentially, for both PostgreSQL and MySQL, we now have a clustered model of deployment across three AZs with one primary and two replicas, and they use native database replication under the covers.

It's got lots of benefits. It's about 2x faster from a write standpoint, you get readable replicas, you can lower cost, and it's faster from a failover perspective - it takes about 35 seconds versus, let's say, one to two minutes in the classic Multi-AZ design. Customers love it.

But in the context of upgrades, when you do an upgrade, we first upgrade read replicas. During this time, your database is still available because your primary can still take both read and write workloads. Then you do the switch over. This is the period of unavailability.

Now, if you use a proxy - and this proxy could be RDS Proxy, it could be something like PgBouncer or ProxySQL, it could even be an advanced JDBC driver that knows how to handle DNS propagation - the downtime during the switchover is less than a second, or typically around a second.

This is the period of unavailability for a minor version upgrade with RDS for PostgreSQL and MySQL when you use Multi-AZ DB clusters. And then once you switch, your old primary is now a replica. It's still running the old version, so you've got to go upgrade it. But you can do that in a way where you're still available, because your new primary is now running the new version.
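The reason the proxy (or a smart driver) matters is that your application only has to ride through one brief connection reset during the switchover. Here is a minimal client-side sketch of that behavior, assuming you connect through a single proxy endpoint; the endpoint name and retry policy are illustrative.

```python
import time
import psycopg2

PROXY_ENDPOINT = "my-app-proxy.proxy-xxxxxxxx.us-east-1.rds.amazonaws.com"  # illustrative

def query_with_retry(sql: str, params=None, attempts: int = 5):
    """Retry briefly on connection errors so a ~1 second switchover is invisible to callers."""
    for attempt in range(attempts):
        conn = None
        try:
            conn = psycopg2.connect(host=PROXY_ENDPOINT, dbname="appdb", user="app",
                                    connect_timeout=2)
            with conn, conn.cursor() as cur:
                cur.execute(sql, params)
                return cur.fetchall()
        except psycopg2.OperationalError:
            # Connection dropped mid-switchover; back off briefly and try again.
            time.sleep(0.5 * (attempt + 1))
        finally:
            if conn is not None:
                conn.close()
    raise RuntimeError("database unavailable after retries")
```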

So this is also something we're very excited about. Our whole goal here is to make updates a non-event for you, so you can do them more frequently and get access to the latest security patches, latest bug fixes, latest features. But what if you're not doing an upgrade? What if you're doing something like a schema update, or a major version upgrade, or a configuration change?

Historically, there are two options:

You can do an in-place upgrade or you can do a switch over.

The in-place upgrade is super simple, but you're making a change to your live production system, right? So it's not safe, it's not fast.

The other option is you create a staging environment and cut over. Many of our sophisticated customers do this, but it requires a fair degree of sophistication and care. And so it's not simple.

So last year, we announced Blue/Green deployments, which gives you a fully managed experience for switchover. It's simple, fast and safe. We started with support for MySQL and MariaDB, and this year we now support PostgreSQL. And again, tying back to the work we're doing with the PostgreSQL upstream community, a big reason we waited an extra year was that we were not satisfied with the level of logical replication that was available in PostgreSQL.

We worked this entire year with the community to make sure that it's up to where we need it to be, where our customers need it to be. And now it's available as part of RDS.

And it's not just for minor version upgrades - you can do any sort of change to your database, and very quickly: you start with a Blue environment; with one button click you get to a Green environment; we seed it and keep it in sync; and when you've made your upgrades or whatever changes you want to make, you cut over - again, that's your third button click and you're good to go.
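For reference, here is a hedged sketch of driving that same Blue/Green flow through the API with boto3 instead of the console; the deployment name, source ARN, and target version are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Click one: create the Green environment from the Blue (production) database.
bg = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="pg-major-upgrade",                   # illustrative name
    Source="arn:aws:rds:us-east-1:123456789012:db:my-postgres",   # placeholder ARN
    TargetEngineVersion="16.1",                                   # placeholder target version
)
deployment_id = bg["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]

# Test against the Green endpoint while RDS keeps it in sync, then cut over.
rds.switchover_blue_green_deployment(
    BlueGreenDeploymentIdentifier=deployment_id,
    SwitchoverTimeout=300,  # seconds; the switchover is abandoned if it can't finish in time
)
```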

So that's one on ease of upgrades. The next thing I'd like to talk about is data integration. And why is this relevant and even more relevant now? Amazon provides you a lot of choice again in terms of transactional systems, analytics, ML services. And then going back to what I said earlier, the best way to get more out of your ML models is to have your enterprise data make its way into the analytics systems and into the ML models.

So really what you're looking for is you're looking to shorten the loop, the learning loop from your transaction systems to your ML models so you can then come back and have outcomes for your customers.

The traditional way to do this is ETL (extract, transform, load). So let's say you had an RDS for MySQL database, you have Redshift. You want to get data from one to the other. You build a complicated pipeline and frankly it's a lot of work and it can be fragile, but in many ways, it's overkill for the use case of just getting your data without transformations from your transactional database into the warehouse.

So we now support zero ETL integration between RDS MySQL and Redshift. And for the simple case where you don't have a ton of transformations going on, it's easy.

So first reason we built it, it's easy. Second is it's fast. The replication happens in seconds. And the third is you can use it for more complicated setups where you have many transaction systems all consolidated into one data warehouse. We just make it very simple again for you.

So let's go through a little bit of what this is. When you think about zero ETL there's really four steps, you gotta set up, you gotta do the initial sync, you gotta keep the data in sync and then any analytics you want to do on the back end. If you're doing this yourself, lots of steps. What we want to do is fewer steps. So we want to get you from weeks or months of work to getting started in minutes.

And so how does this actually work? Right. So let's go through a few console screenshots. You go to the RDS console, you hit get started, just give the integration a name, you pick your source database, you pick your target data warehouse. If you have to configure some policies to make sure that you have the right permissions, go ahead and do that, then you add tags and you're done. Literally in minutes, you can get started, and all the heavy lifting we take care of under the covers.
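If you'd rather script it than click through the console, the same setup is roughly one call with boto3; the names and ARNs below are placeholders, and I'm assuming the RDS CreateIntegration API here.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

integration = rds.create_integration(
    IntegrationName="orders-to-redshift",  # illustrative name
    SourceArn="arn:aws:rds:us-east-1:123456789012:db:orders-mysql",          # placeholder
    TargetArn="arn:aws:redshift:us-east-1:123456789012:namespace:analytics", # placeholder
    Tags=[{"Key": "team", "Value": "analytics"}],
)
print(integration["Status"])  # e.g. "creating" while RDS seeds the warehouse and starts CDC
```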

So what is this heavy lifting? You have your RDS MySQL database, we first have to seed the data into the warehouse. So this is very much like what we do in the Blue/Green setup, we will go and export the data from RDS onto Redshift. And as part of this export, if you have to do any transformations, we do all those transformations for you.

So if you have a data type like VARCHAR, which exists on both sides - hey, no biggie, just ship the data. But if it's something else that needs a slightly different form in Redshift, we go do all those transformations for you.

So once you've seeded your data, the next step is staying in sync. So we have a CDC stream that we manage under the covers. And essentially what we're doing is we're getting deltas on your RDS MySQL side and we're shipping them over to Redshift. And again, we've done all the heavy lifting to make sure that over 80+ DML and DDL (data modification and schema modification) types, we take care of all the translation for you.

So you don't have to do anything of it and we do it in a way that's transactionally safe. So you can be assured that the data you see in Redshift is transaction consistent. And we also do it in a way that it's resilient because you know, sometimes databases fail or you may want to change, you know, a configuration in your database, you may want to upgrade your database, change your instance type. Any of these changes, we make sure that the pipeline stays alive in the sense that even if we need to re-seed the data, we do it for you.

So again, a lot of what we focused on is just click those three buttons and then you're good to go. And so just as a summary, it's secure, it's correct, it's resilient and efficient and performant. The replication is literally seconds from RDS MySQL to Redshift.

And again, as I mentioned before, you can have multiple integrations which all have different sources, but all target the same Redshift data warehouse for data consolidation and analytics use cases.

And the last thing is, while we take care of all of this for you, we also expose the monitoring metrics, so you can be sure that the replica lag is not too long and you can have peace of mind about how this thing works.

The last thing I want to cover in terms of ease of administration is around observability. We offload a lot of the undifferentiated heavy lifting for you. But at the same time, you're running applications on the database and you want to make sure that you know what your application performance is like, right?

And so we offer something called Performance Insights. It's a feature we have offered for years and customers love it. But one piece of feedback we received is, hey, the observability experience for RDS is fairly fragmented, because the data could be in Performance Insights, it could be in something called Enhanced Monitoring, which is more lower-level metrics, or frankly, if you look at the whole stack, it could also be in CloudWatch, right? Something about your application.

So we heard your feedback, we acted on it and now it's a single pane of glass integrated with CloudWatch. And so you can correlate the changes happening in your app with your database. You can configure your dashboard as you see fit. So again, thank you for the feedback. Keep feedback coming. When we hear it, we will fix it.
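Beyond the console, the same Performance Insights data is queryable programmatically, which is handy if you want to pull it into your own dashboards; a small sketch with boto3, where the instance resource ID is a placeholder.

```python
from datetime import datetime, timedelta, timezone
import boto3

pi = boto3.client("pi", region_name="us-east-1")
end = datetime.now(timezone.utc)

# Average database load over the last hour, sliced by top wait events.
resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-ABCDEFGHIJKL1234",  # placeholder DbiResourceId of the instance
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    PeriodInSeconds=60,
    MetricQueries=[{"Metric": "db.load.avg", "GroupBy": {"Group": "db.wait_event"}}],
)
for series in resp["MetricList"]:
    print(series["Key"], len(series.get("DataPoints", [])), "datapoints")
```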

The other thing the customer said is observability is great. But do you have any tools that can make troubleshooting faster for us? And so we took again your feedback and we said, ok, let's provide analysis on demand. This is a new feature where you can essentially select a time range. And when you select the time range, what we do then is say was the database behavior different in this time range than your normal database behavior in a longer time window.

So we're looking to see if there's an outlier. If there is, the next step that we do under the covers is figure out why there was an outlier. And once we know why there's an outlier, the third step is we want to tell you what the recommendations are.

So for what could typically take a lot of work on your side, your team's side, to go troubleshoot an issue - again, we want to get from hours down to minutes by doing as much of the lifting as we can, because frankly, we're sitting on a lot of the data, right? And so we want to make sure that we go from just data to insights.

The third thing is, can we go one step further? Because even this is reactive - the issue happened, then you troubleshoot. Can we proactively tell you what is happening with your database? And so last year we announced something called DevOps Guru for RDS. It was available for Aurora, and now it's available for RDS for PostgreSQL.

And on the console, you can proactively see if there are any issues with any of your databases - in this case, three. You can click on it, you can see which database is affected. Then you can see if it's an ongoing issue - in this case, it is. You get the metrics that suggest where the issue is, and just like before, we tell you what the root cause is, where we know it, and what the mitigation step would be. It's available for RDS for PostgreSQL now.

So that was a lot to cover, right? We talked about customer choice, we talked about ease of development, ease of administration - the main reasons customers use RDS. Now I want to pivot a little bit and go to the other part of what RDS does, which is to make sure that what you expect from a mission critical transaction system, you get: things like availability and durability, performance and scale, and security and compliance.

So talking about availability and durability, I'm going to focus on active-active for both RDS for PostgreSQL and RDS for MySQL. A lot of our customers said, hey, RDS historically runs in an active-passive configuration, right? Even if you have read replicas. What we really need is active-active deployments.

And as we started talking to customers, we realized that active-active is an overloaded term. Every customer has a different conceptualization of what they mean when they use the term active-active. So we dug into it with our customers and we discovered there were three different but related use cases.

The first is when you want continuous write availability through host failure. So this is in a single region, and I talked about rolling upgrades, which is great, but a rolling upgrade is a planned event. What if you had an unplanned failure? Customers want the ability to keep going without a blip. And one phrase we actually coincidentally heard from multiple customers is "no moving parts" - they don't want to fail over, they just want to keep writing, and the fewer moving parts, the more likely you just keep working. So that's one; that's an in-region setup.

Another set of customers have a multi-region ask. It's sort of an extension of what we talked about before. So think about, let's say, you are a retailer with an inventory setup and you have two inventory systems, one on the west coast, one on the east coast of the US. You're simultaneously writing to both because you have business occurring simultaneously in both places. But you're probably not touching the same warehouses - sorry, the real, physical warehouses - and so you have very low or no write conflicts. So simultaneous writes, but very few conflicts. And in the case that you do have some catastrophic event, you want to be able to redirect your writes to the other side of the continent, right? So this is a second use case.

The third one goes one step further still and says, what if you have a follow-the-sun model? These are customers where you may have a global workforce or a global customer base, and your workload patterns tend to follow the sun, and there tends to be very low conflict in terms of overlapping writes to the same data. But just as before, if you have a catastrophic failure, you still want to keep running your business by failing over to a different region.

So these were the three use cases we discovered. And when you think about active-active from a technology standpoint, there's a fundamental choice you have to make: do you want to synchronously replicate between these nodes, or do you want asynchronous replication? And it's a trade-off between latency and conflicts.

If you have synchronous replication, you avoid conflicts because all your nodes are saying the same thing at all times. And this makes application development very simple, right? It's kind of like the same as writing to one node, but you do pay the cost of round trip latency.

If you have async replication, it's the reverse. Your application has to be aware of the semantics - hey, I thought something was done, but I may have to undo it, right? Or if you have a catastrophic failure, you may actually lose some of your data, because with asynchronous replication, by definition, the other side doesn't have all of the data.

So in addition to this choice that you have to make as a builder, I also want to highlight that failure handling in active-active scenarios can be tricky, right? You don't want to be trigger happy, because if you fail over and you realize it was just an intermittent failure or a gray failure, there's a lot of work to go fail over across nodes or across regions. And you also want to be able to do things like fail back. So this requires a fair degree of sophistication to be able to write applications that work well in an active-active deployment.

But many of our customers do this today - many of our customers have active-active setups and they self-manage them. And the ask for RDS was: we get that this requires a degree of sophistication, but we know how to do this. What we don't want to do is all the other stuff that RDS already offloads, like backups and patching.

So can you make this capability available in RDS? And we understand that as application developers, we also have to, you know, have a shared responsibility in making sure that the application works correctly. And so for these customers, we now support active-active deployments with an extension called pg_active for PostgreSQL, and MySQL Group Replication.

So the pg_active extension is derived from BDR v2 which is an open source extension.

We've made some changes to it to improve its performance and accuracy, and the MySQL capability is based on a community plugin. pg_active uses async replication by default with last-update-wins conflict resolution, while the latter is a synchronous solution, and they're available for PostgreSQL and MySQL respectively. So it's very easy to get started.

So in the case of pg_active, you create a database parameter group and you enable pg_active. When you create a new instance, or you modify an existing instance, you associate it with this parameter group. Then installing pg_active is just a CREATE EXTENSION statement, just like you create any other PostgreSQL extension. Then on the first endpoint, you set up all the connection information about the entire cluster and you initialize the replication. Then you do the same thing - steps three and four - on the other parts of the cluster, and then you join the replication group, and that's it.
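A rough sketch of the first couple of those steps, using boto3 for the parameter group and plain SQL for the extension; the parameter name rds.enable_pgactive, the group and endpoint names, and the parameter group family are assumptions here, and the cluster-join calls that follow are best taken from the RDS documentation.

```python
import boto3
import psycopg2

rds = boto3.client("rds", region_name="us-east-1")

# Step 1: a custom DB parameter group with pg_active enabled (parameter name assumed).
rds.create_db_parameter_group(
    DBParameterGroupName="pgactive-params",        # illustrative
    DBParameterGroupFamily="postgres15",           # illustrative family
    Description="Parameter group with pg_active enabled",
)
rds.modify_db_parameter_group(
    DBParameterGroupName="pgactive-params",
    Parameters=[{
        "ParameterName": "rds.enable_pgactive",    # assumed parameter name
        "ParameterValue": "1",
        "ApplyMethod": "pending-reboot",
    }],
)

# Step 2: associate the parameter group with a new or existing instance, then:
conn = psycopg2.connect("host=node1.example-endpoint dbname=appdb user=admin")  # illustrative
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pgactive")
# Steps 3-5 (registering each node's connection info and joining the replication
# group) use the pgactive setup functions described in the RDS docs.
```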

There are limitations you should be aware of. I won't go into detail on this, but I'd encourage you to read our blogs or documentation - things like DDL and large objects not being replicated.

And on the MySQL side, again with Group Replication, it's a very similar workflow. You create the parameter group, you provision an instance with the parameter group, you configure the replication, and then you get going. And again, on the MySQL side, we have learned a lot of best practices talking to our customers, so again, I would encourage you to read the documentation and our blog posts on this. An example would be: make sure that your transaction size is not too large.

So that is on availability and durability. The next area is around performance and scale. As I mentioned, this matters because we want to make sure that we're raising the ceiling and keep giving you headroom as your business grows and your workloads grow. And second, we want to keep delivering price performance improvements, so you can do more with less every single year.

And so to this end, we are always supporting the latest instances from EC2. You can see some of the instances that we have launched with the open source databases. I'd particularly call out that the newest generation of Graviton offers you plus-27% price performance. I'd also call out the 'n' instances - n is for networking, high-throughput networking. It is particularly valuable in a Multi-AZ setup when you're communicating across AZs, and also when you have network storage, which we do in RDS. So that was on the compute side. And really our strategy is to make sure that we offer you the best instances that are available on EC2 that are relevant to database workloads.

This year, we also took on an effort to make our storage performance better and more stable. We don't often talk about storage, so I'm going to explain in a little more detail than otherwise. So the way RDS storage works: you have the compute instance, which is the chip icon, and under the covers you have between one and four volumes of network storage. The data is striped and mirrored across the storage, and each of these volumes is using reliable disks. So this is in one AZ, right? If you have a Multi-AZ system with classic instances, you'd have two sets, so you'd have between two and eight volumes. If you use the Multi-AZ DB cluster, you'd have between three and 12 volumes across these AZs.

We also synchronously replicate the data, right? And the thing that is important to note here is that each of these volumes has both the transaction log records and the data pages. Now, why am I saying all of this? Transaction log records have different performance needs than your data pages. The writing of transaction log records is in your transaction commit path - you want it to be blazing fast. The database flushes from your buffer pool can happen asynchronously.

And so when we have as many volumes as we do - and why do we have so many volumes? It gives you more IOPS; by the way, it's additive - but when you have as many volumes as we do, and when they're synchronously replicated, and when you colocate your transaction logs with the data pages, a couple of things happen. You have contention for resources between the transaction logs and the data pages - that's one. And the second is that if you have any I/O jitter or latency in any one of these volumes, it starts cascading and backs up into the database, and databases typically don't do well with jitter, right? It ends up backing up into your application.

So customers asked us to fix this, and the way we are fixing it is through this new feature called dedicated log volumes. Essentially, what we do under the hood is we take all the transaction logs and we place them in their own network storage volume, separate from the data pages. So what does this give us? First, logs are written sequentially, so you get better performance because you can do I/O merging. You also are not contending for resources, so that also gives you slightly improved performance.

And then, because the logs are not spread across as many machines under the covers, you have a lower probability of encountering jitter. And so you get better performance and more stable performance at the database level, which is crucial for your apps. It's available for the PostgreSQL engine, and we plan to continue extending it to the other engines.
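At provisioning time this is essentially a flag on the instance; a hedged boto3 sketch, where the instance name, class, and storage sizing are placeholders and the DedicatedLogVolume parameter is the API-level assumption.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-pg",     # illustrative
    Engine="postgres",
    DBInstanceClass="db.r6g.2xlarge",     # illustrative
    MasterUsername="dbadmin",
    ManageMasterUserPassword=True,        # let RDS manage the master password in Secrets Manager
    AllocatedStorage=5120,                # GiB, illustrative
    StorageType="io1",
    Iops=20000,                           # provisioned IOPS for the data volumes
    MultiAZ=True,
    DedicatedLogVolume=True,              # put transaction logs on their own volume (assumed flag)
)
```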

So here's just a quick example. We ran pgbench - the concurrency level was 64, and we ran it for a full week at about 1,000 TPS. We used data volumes provisioned for 20K IOPS, and the dedicated log volume was only 3K IOPS. The orange lines that you see here are for the baseline, and in fact, you cannot see the dedicated log volume results at all because of how effective it is.

So what this shows you is the number of outliers greater than 50 milliseconds we saw over the week-long period, and it's a histogram - as you go to the right, the outliers get even bigger. We saw about 441 outliers at the 50 millisecond level, but with dedicated log volumes, it came down to just one. So we buffered most of these outliers from the database and from the application.

And then when you look at a more rigorous threshold of 20 milliseconds, the number of outliers was slightly north of 2,000 on the base storage, but only three of these manifested in the application - two were under 50 milliseconds and one was above. So that's on more consistent performance.

So we've talked about improvements we make in the compute area, which is frankly just following EC2, and we've talked about an improvement we made in the storage, but we also do work to optimize the entire stack.

So last year, we introduced two new features, one called Optimized Writes, the other called Optimized Reads, and essentially they double your write and read performance respectively. Let me talk about Optimized Writes first, as a quick recap.

So in the case of MySQL, it uses 16 kilobyte pages, and historically what we have been able to write to persistent storage is with 4K atomicity. So your 16K page, when it gets flushed to persistent storage, gets written in 4K chunks, four times. And what happens is, let's say that you have some sort of a failure after you write two out of the four - you have what's called a torn page. So if you have a crash and your database recovers, you don't actually know whether you have a consistent 16K page.

The way MySQL solves this traditionally is you have something called a doublewrite buffer: you write to storage twice, and then you verify that both copies are the same. So it works, it's durable, but you're writing twice, right? You're paying the penalty. With Nitro instances, we can now write with 16K atomicity, so you don't need that extra write - you can do away with it - and you're effectively doubling write performance.
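One way to see this from the client side is to check whether the InnoDB doublewrite buffer is still in play on your instance; a small sketch with PyMySQL, where the endpoint and credentials are placeholders.

```python
import pymysql

conn = pymysql.connect(
    host="orders-mysql.xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder endpoint
    user="dbadmin",
    password="placeholder-password",
    database="mysql",
)
with conn.cursor() as cur:
    # With Optimized Writes, 16K pages are written atomically, so the doublewrite
    # buffer (the "write twice" penalty) should no longer be needed.
    cur.execute("SHOW VARIABLES LIKE 'innodb_doublewrite'")
    print(cur.fetchone())  # e.g. ('innodb_doublewrite', 'OFF') on an Optimized Writes instance
conn.close()
```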

Last year, we supported MySQL; this year, we added support for MariaDB. That's on the write side. On the read side, again, we have persistent storage on network storage. And when you do a read - let's say you're doing a large join or sort, or you have a common table expression - you end up creating temporary files which don't need to persist through failure. Like if you're running a join and you have this temp table and your database crashes, when it comes back up, you basically wipe all of that out anyway for consistency.

So there's no reason to fetch that all the way from network persistent storage. You can actually place them on local instance storage right next to your processor, right? So with these instances, you can put temp storage and any temp objects on the compute instance itself, and you essentially get extra-fast queries.

So last year, we added support for MySQL and MariaDB, and this year we have added support for PostgreSQL. So again, if you run a HammerDB benchmark, whether it's a TPC-style benchmark for writes or a TPC-style benchmark for analytics, in both cases we have seen up to 2x performance improvement. And then this next one is not quite a feature - it's maybe a look under the hood at what we do. We're continuously working on optimizing our stack, even in cases where it doesn't manifest as a feature to you, right?

So you may not see it in a what's new launch, but it ends up being something that benefits you when you upgrade to the latest version. And these come from two sources. One is changes in the community database or the third-party database itself - in the case of MySQL, they introduced the use of a new system call, fdatasync, instead of fsync, which is more efficient in terms of flushing data to disk.

And then across both RDS MySQL and MariaDB, we updated the toolchain - a new compression library, a new GCC. And effectively what happens is, when you use RDS MySQL 8.0.35, you get up to 3x higher write throughput versus earlier minor versions, both on the basis of the community change but also the changes we have made internally.

And in the case of MariaDB 10.11, you see a 40% increase in transaction throughput, again just based on the improvements we've made under the covers.

So that's on performance and scale. And last but not the least, security and compliance.

So if you ask any Amazonian what their job is, they'll say my job number one is security. So I obsess - all of Amazon is obsessed - over keeping your data secure. It is our job number one. And so a lot of what we do from an engineering standpoint is under the covers: we're hardening the systems, keeping you safe, and it never manifests as a feature, right? And so while we don't talk a lot about it, it is a big part of what we do under the covers. That said, there are times when we do announce features that are directly visible to you.

So I'll talk through a couple quickly. But this is only a small sliver of all the work we do in terms of security and compliance on behalf of our customers. Both of these happen to be in the SQL Server space, and they're just to highlight - not a complete list.

The first is database activity streams. Essentially, what this lets you do is capture and audit any event in your database, like invalid login attempts, and you can export it using Kinesis into OpenSearch or Redshift so you can analyze it, right? So it's a good audit mechanism for you.
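Turning it on is a single API call per instance; a hedged boto3 sketch, where the ARN and KMS key are placeholders.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.start_activity_stream(
    ResourceArn="arn:aws:rds:us-east-1:123456789012:db:my-sqlserver",  # placeholder
    Mode="async",                        # don't block database activity on auditing
    KmsKeyId="alias/rds-activity-key",   # placeholder key used to encrypt the stream
    ApplyImmediately=True,
)
# Events land in a Kinesis data stream, from which you can fan out to
# OpenSearch or Redshift for analysis.
```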

And the other one is self-managed Active Directory. A lot of customers who use SQL Server use Active Directory, either because they're required to or because it gives them the permissions they need. And it's now available for RDS for SQL Server.

So we've talked about a number of different types of innovations. We started with customer choice. We talked about simplicity, both for developers and administrators. And then we also talked about some of the core value propositions around availability, durability, performance, scale, security and compliance.

So when you look at the long arc of RDS, we have innovated on all these dimensions over the last many years. As you can see from the list, our rate of innovation is accelerating, and I hope that you find value in this innovation. As I mentioned earlier, 90% of our roadmap really comes from conversations with customers - we work backwards. And so keep the feedback coming, let us know what would be valuable for you. And next re:Invent, I hope to have another set of exciting announcements for you.

Thank you, everyone.
