Take a load off: Diagnose & resolve performance issues with Amazon RDS

Good morning, welcome. Thank you for joining us. Before we get started, I'm curious to know how many of you have faced database performance challenges. If you could raise your hands... Alright, so basically all of you, which is great because it means that you came to the right place.

In today's session, we are going to cover performance monitoring and diagnostics in Amazon RDS and Amazon Aurora. The goal and outcome is that you will be equipped with knowledge about the different AWS monitoring tools and services, and you will learn how they can help you overcome even the most complex database performance challenges.

I'm Pin, a senior database solutions architect here at AWS, and I will be joined later by Maxim Kachka, a principal database engineer at AWS.

Before we jump into the agenda, I would like you to take a moment and think about what these three pictures have in common: a flashlight, a radar, and a magnifying glass. Exactly - you basically all guessed it right: these are all tools that improve visibility and detection. And the reason I'm showing this is because I believe it is a very good analogy for today's session - when it comes to database performance, you can't fix a performance bottleneck that you can't see.

Let's go to the agenda. We are going to start out slowly by setting some context, telling you a little bit about Amazon RDS. Then we will dive deeper into the various monitoring services and tools that help you overcome database performance challenges.

In the second part, Maxim will show you some of the latest innovations when it comes to database performance diagnostics and troubleshooting in Amazon RDS. So you will learn about the newest features related to performance.

We will finish with a real world example of an online web store which faces performance degradation, and you will see the steps we took to address the performance bottleneck and fix the issue. So it's going to be an interesting session - stay tuned for that.

I assume many of you are already familiar with Amazon Relational Database Service, or Amazon RDS in short, but just to level set - it is a fully managed relational database service. With Amazon RDS, you are able to spend time innovating and building applications rather than spending time on all the operational tasks of database management such as backups, upgrades, disaster recovery, provisioning, and more. So you can focus on innovating while AWS can automate and manage all these operational tasks.

Amazon RDS supports various popular engines - Oracle, SQL Server, PostgreSQL, MySQL, MariaDB, and Amazon Aurora. Amazon Aurora is unique because it is a cloud native database engine designed to provide enterprise grade security, availability, and reliability, but with the cost effectiveness and simplicity of open source databases due to some of its unique features and characteristics. It is actually the fastest growing service in the history of AWS and tens of thousands of customers use it today.

Now let's start to dive a bit deeper. There are actually several different levels of monitoring in Amazon RDS - instance level, operating system level, and database engine level.

For instance level monitoring, we encourage you to use Amazon CloudWatch, which I'm sure many of you are familiar with. Amazon CloudWatch is the main AWS service for monitoring and observability, and it allows you to monitor all of your AWS resources and your entire application stack. With Amazon CloudWatch, you can also monitor instance level metrics in Amazon RDS such as CPU utilization, storage, IOPS, network utilization, and dozens of other instance level metrics.
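
As a quick illustration, this is roughly what pulling one of those instance level metrics looks like through the CloudWatch API with boto3. The instance identifier is a placeholder.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Average CPU utilization for one RDS instance over the last three hours,
# one data point every 5 minutes. "my-rds-instance" is a placeholder.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")
```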

However, sometimes customers require deeper visibility into granular operating system metrics such as Linux huge pages and swap activity, or deeper visibility into operating system processes and threads. This is where Amazon RDS Enhanced Monitoring comes into play. We will also talk about this tool.

But when it comes to deep-dive, database engine level monitoring and diagnostics - SQL analysis, SQL statistics, and wait events - this is where RDS Performance Insights comes into play. RDS Performance Insights is going to be the main solution that we talk about throughout today's presentation.

For more proactive and automated performance diagnostics, we have Amazon DevOps Guru for RDS, which uses machine learning to analyze past metric behavior and provide recommendations on how to remediate performance bottlenecks. So it allows you to be even more proactive when it comes to performance monitoring.

I would like to start with the fundamentals. Amazon CloudWatch, which we just mentioned, also allows you to have monitoring and alerting so you can get alerted and notified if there is any issue. It has an intuitive alarm wizard which allows you to define things like:

  • Metric threshold
  • Evaluation period for alarm
  • What statistic to use to calculate the metric over that evaluation period (e.g. minimum, maximum, count, average, sum, etc.)
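
If you prefer to script the same three choices rather than use the console wizard, a minimal sketch through the CloudWatch API could look like this; the alarm name, instance identifier, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Static-threshold alarm: alert when average CPU stays above 80% for
# three consecutive 5-minute evaluation periods. Names and ARNs are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="rds-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}],
    Statistic="Average",          # statistic applied over each evaluation period
    Period=300,                   # evaluation period length in seconds
    EvaluationPeriods=3,          # periods that must breach before alarming
    Threshold=80.0,               # the metric threshold
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts-topic"],
)
```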

One thing that is important to bear in mind is that the threshold doesn't have to be static. A static threshold would be something like "notify us if IOPS exceed 100,000 on our Oracle instance, for example." So we give the alarm a specific, literal threshold value.

But alarm thresholds can also be dynamic. Anomaly detection uses machine learning behind the scenes to analyze past metric behavior and create a model of the expected baseline range based on typical hourly, daily, and weekly patterns. Essentially, we know at any point in time what the expected range for your workload is.

Let's look at an example in this screenshot. Here we can see the Anomaly Detection metric for RDS IOPS. We can see the band which represents the range of possible values. As you can see, we know at any point in time what is the expected range. And by the way, you can make it more or less sensitive - so you can modify it to be more or less likely to be triggered based on business needs.
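
If you want to script a dynamic threshold like the one in the screenshot, a sketch under the same placeholder assumptions is shown below: first train an anomaly detection model on the metric, then alarm on the resulting band. The band width (2 standard deviations here) is the sensitivity knob just mentioned.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
dimensions = [{"Name": "DBInstanceIdentifier", "Value": "my-rds-instance"}]

# Train an anomaly detection model on the past behavior of the IOPS metric.
cloudwatch.put_anomaly_detector(
    Namespace="AWS/RDS",
    MetricName="ReadIOPS",
    Dimensions=dimensions,
    Stat="Average",
)

# Alarm when the metric leaves the expected band; the band width (2) controls
# how sensitive the alarm is.
cloudwatch.put_metric_alarm(
    AlarmName="rds-read-iops-anomaly",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "ReadIOPS",
                    "Dimensions": dimensions,
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": True,
        },
        {
            "Id": "band",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "Label": "ReadIOPS (expected range)",
            "ReturnData": True,
        },
    ],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts-topic"],
)
```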

While Amazon CloudWatch is great for instance level monitoring, sometimes customers require deeper visibility into operating system level metrics such as Linux huge pages and swap activity. Also analyzing processes and threads running at the OS level could be very useful sometimes for diagnostics and troubleshooting. This is where RDS Enhanced Monitoring comes into play.

It's a fully integrated feature in Amazon RDS - in other words, you don't need to install anything, you just need to enable it. Behind the scenes with RDS Enhanced Monitoring, we actually deploy a local agent so we can collect the data directly from the instance level, not from the hypervisor.

It has a default retention of 30 days, but you can extend it. It also has a default resolution of one minute; if you require more frequent data points, you can increase the granularity down to as fine as one second. So that's also something you should be aware of.
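
Enabling Enhanced Monitoring, or changing that granularity, is just a modification of the DB instance. A minimal sketch with boto3, assuming you already have a monitoring IAM role; the instance name and role ARN are placeholders.

```python
import boto3

rds = boto3.client("rds")

# Turn on Enhanced Monitoring at 1-second granularity.
# MonitoringInterval accepts 0 (off), 1, 5, 10, 15, 30, or 60 seconds.
rds.modify_db_instance(
    DBInstanceIdentifier="my-rds-instance",                                   # placeholder
    MonitoringInterval=1,
    MonitoringRoleArn="arn:aws:iam::123456789012:role/rds-monitoring-role",   # placeholder
    ApplyImmediately=True,
)
```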

So far we've covered instance level and operating system level monitoring. But when it comes to deep-dive, database engine level analysis - SQL analysis, including wait events and statistics for each SQL query - this is where RDS Performance Insights can be very handy.

Performance Insights helps you measure database load. What we have found, which is very common, is that many developers and DevOps engineers who don't necessarily have deep database expertise still find the tool to be very effective and efficient for diagnosing and resolving database performance challenges.

Similar to RDS Enhanced Monitoring, it is also fully integrated and managed in Amazon RDS, so you don't need to install anything - just enable Performance Insights and you can benefit from deep diagnostics and analysis.

We just mentioned that Performance Insights measures database load, but what exactly do we mean by that? Behind the scenes we are collecting average active sessions, and an active session is a session that is running a SQL query and waiting for a response back from the database.

In order to have an informative picture of your database load, we collect the sessions and the SQL queries. We also collect other associated information like dimensions - and dimensions could be for example the database user who executed the SQL, the application that executed it, the program, the application server, etc. So we can collect all those additional dimensions.

One of the key dimensions I want to mention is wait events, because wait events allow you to understand what is the root cause for a performance bottleneck. Put simply, wait events tell you why the SQL query is waiting. This is something we'll talk about more throughout the presentation.

There could be many reasons for wait events - it could be I/O latency when reading data from data files, it could be commit latency while waiting for acknowledgement from the storage subsystem, it could be table locks. So there can be various reasons for wait events, and Performance Insights allows you to identify and analyze them.
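
To make the database load and wait event breakdown concrete, here is a hedged sketch of pulling the same data programmatically through the Performance Insights API (the boto3 "pi" client); "db-EXAMPLE123" is a placeholder for the DbiResourceId of your instance.

```python
import boto3
from datetime import datetime, timedelta

pi = boto3.client("pi")

# Database load (average active sessions) over the last hour, sliced by wait event.
response = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-EXAMPLE123",                     # placeholder DbiResourceId
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    PeriodInSeconds=60,
    MetricQueries=[
        {
            "Metric": "db.load.avg",
            "GroupBy": {"Group": "db.wait_event", "Limit": 10},
        }
    ],
)

for metric in response["MetricList"]:
    dims = metric["Key"].get("Dimensions", {})      # empty for the total load series
    values = [p.get("Value") for p in metric["DataPoints"][:5]]
    print(dims, values)
```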

As we sample the database every second, some engineers might be concerned about missing data in between samples. But what we have found is that the one second frequency of sampling actually provides a very accurate representation of your database workload. One analogy is that in movies, you watch 24 frames per second, yet you still get a very coherent picture. So in Performance Insights, we sample the data every second, but we allow you to get a very clear and coherent picture of your database workload.

When it comes to performance, we always encourage customers to use Performance Insights to understand the deep dive analysis of the SQL queries. We also want you to think about the database as this big ocean of fish, and each fish is a SQL query in the ocean. Typically we would like to focus on the "big fish" in the ocean - those that contribute most to our database workload.

So for example, let's assume you have a SQL query that runs for 10 milliseconds once a week. It's probably not going to be the one that is most impactful for your database workload. So we encourage you to focus on the queries that really contribute most of the database workload. The reason I mention this is because it's important to understand that Performance Insights is not an auditing tool. Our goal with Performance Insights is not to collect and track every single statement that is running, but rather to capture the statements that are impactful for your database workload, so that you can really understand how to resolve performance bottlenecks.

Capturing database load for multiple sessions is essentially what we're doing in Performance Insights: we are stacking concurrent database load from all of the SQL queries that are running. In this case, you can see a lot of sessions - we have four users, each one running SQL queries. By stacking the workload, we get a graph that looks like this: the database load graph. Using this graph, we know exactly when the database had a high level of concurrency and when the database was idle. But from this graph alone, you cannot tell why. This is where the wait event breakdown becomes very useful, because, as we mentioned earlier, throughout each query execution we also collect the various wait events. By assigning a different color to each wait type or wait category, we can provide you a very clear picture of database load over time. And this is how it looks in Performance Insights.

You can see a screenshot here from Performance Insights with two spikes: the first spike in green, which represents CPU utilization, and the second spike in red, which represents blocking locks in the average active sessions. The database load graph can also be very useful for understanding the state of the database, so let's look at some examples. If the database load graph shows zero average active sessions - no active sessions are running, you look at Performance Insights and you see nothing - it means that the database is pretty much idle. So if a user is complaining about performance challenges, they are unlikely to be caused by database performance; it's probably something else.

When the average active sessions value is below one, it means that there are no major blocks. Imagine a user who is waiting on a lock for a couple of minutes. You would expect, by looking at Performance Insights, to see at least one average active session waiting, because there is one session that is blocked for a long time. But if you see that the average active sessions value is below one, it gives you an indication that there are probably no major blocks in the database.

Next, consider when the average active sessions value is below the number of vCPUs in the instance. In this example, you can see the dotted horizontal line: we have eight vCPUs in this specific instance type. At 12:20 there were roughly two average active sessions, while the total number of vCPUs in the instance is eight, so we know that there is CPU available in the machine. But when we see that the average active sessions value is higher than the number of vCPUs - like the case at around 12:05 in this example, where we can see 10 average active sessions but only eight vCPUs in the instance - this means we have a possible bottleneck, and we need to either right-size the instance or tune the application or the database, because this is a flag telling us we have a performance bottleneck to address.

One of the benefits of the cloud, which again I'm sure many of you are familiar with, is elasticity, because it allows you to scale your resources as needed. Typically, customers would like to make data-driven decisions before right-sizing, before scaling the resources. What we have found is that many customers also use Performance Insights for right-sizing. In this example, we can see that the CPU load is significantly lower than the number of vCPUs in the instance. Looking at the database load graph, which is mostly CPU (the green color), Performance Insights gives us an indication that we have a potentially oversized instance, meaning we have a cost optimization opportunity, because the instance is much bigger than the required database load.

The reason I say potentially, by the way, and not definitely, is because there could be other factors like caching, memory, and so on. But it definitely gives us an indication that we have a cost optimization opportunity. We will need to run the numbers and do the testing, but this gives us a good understanding that we have an opportunity to reduce costs.

On the other hand, if we see a graph that looks like this, where the CPU load is significantly higher than the number of vCPUs in the instance, then we have a potentially undersized instance. This again means that we need to think about right-sizing, tuning the application, and tuning the database to make sure that we don't have a major performance degradation or a performance bottleneck.

One of the key benefits of Performance Insights is that we collect a lot of dimensions, as we mentioned earlier - the application server, the program, the database user, the execution plan, and so on. The reason this is powerful is that it allows you to drill down and slice and dice the activity by various dimensions, and by doing so understand the root cause of a performance bottleneck.

So for example, let's say we want to isolate the SQL queries which were executed by user David and were associated with blocking locks. These are some of the performance diagnostics activities we can do with Performance Insights very easily. We can also see here a screenshot from the documentation showing the different dimensions which are supported for the different engines. As you can see, some dimensions are supported for several of the engines, some for most of the engines - it really depends. But we will continue to improve feature parity based on market demand where it makes sense to do so; we are very much driven by customer feedback.
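
For instance, the slicing just described (load for one database user, broken down by wait event) can be expressed through the same GetResourceMetrics API. A sketch, with the resource ID and the user name as placeholders; which dimensions you can filter and group by varies per engine, as the documentation table shows.

```python
import boto3
from datetime import datetime, timedelta

pi = boto3.client("pi")

# Database load for a single database user, broken down by wait event.
response = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-EXAMPLE123",                     # placeholder DbiResourceId
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    PeriodInSeconds=60,
    MetricQueries=[
        {
            "Metric": "db.load.avg",
            "GroupBy": {"Group": "db.wait_event", "Limit": 10},
            "Filter": {"db.user.name": "david"},    # placeholder user
        }
    ],
)

for metric in response["MetricList"]:
    dims = metric["Key"].get("Dimensions", {})
    total = sum(p.get("Value", 0) for p in metric["DataPoints"])
    print(dims.get("db.wait_event.name", "total load"), round(total, 2))
```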

So in general, if you have any feedback about specific dimensions that you would like to see added, or any other general feedback about Performance Insights, we would love to hear from you after the session.

Now I would like to hand it over to Maxim, so you can learn about the latest and greatest features when it comes to performance diagnostics, and also see a demo utilizing some of these newest features. Maxim, the stage is yours.

Thank you, Pin. Over the last year, we've been working on a number of improvements to Performance Insights, and the underlying theme of these improvements is that we want to make database performance simpler to understand. To achieve that, we are focusing on three separate things.

The first one is that we are unifying the user interface, so that if you ever need to work with a database metric, you can find everything available right at your fingertips. Second, we are highlighting important pieces of the telemetry, putting forward the ones that have a higher chance of telling you something interesting about your database performance. And finally, we are introducing automated advisors that can highlight problems that might otherwise be difficult to see.

So let's dive right in. The first project that I want to talk about is what we call a single pane of glass. In a nutshell, single pane of glass is an evolution of the Performance Insights user interface that is designed to combine all of the database telemetry together. As my co-speaker mentioned before, RDS database telemetry is made up of three separate areas: CloudWatch, Enhanced Monitoring, and of course Performance Insights. Up until now, these areas not only had their own metric repositories, but they also had their own display consoles, and that obviously created a great opportunity for us to improve things.

So starting this August, there is a new way to access database telemetry in Performance Insights. Now, if you need to search for, work with, or access a metric in any of the database telemetry repositories - be it CloudWatch, Performance Insights, or Enhanced Monitoring - you can find all of the telemetry in the same centralized location, and that location is Performance Insights.

There is still a slight problem with the telemetry, though, namely its size. On a typical database, RDS collects about 100 different metrics, if you don't count the SQL-level metrics. And pretty often, especially when you are in crisis mode, it might not be immediately obvious which metrics you actually need to look at to understand what your database is doing.

So to help with that, we worked with a number of experienced database engineers at Amazon and AWS, and we've crafted a curated subset of metrics which are the most illuminating for describing database performance. These metrics are now organized in what we call a database health dashboard, which we prebuilt, and these dashboards are now available for all RDS engines that we support.

Of course, you can also build your own dashboards, now using all of the available database telemetry from all RDS telemetry sources. Another new feature, which was just released - I believe less than three weeks ago - is that you are now able to take your Performance Insights dashboards and publish them to CloudWatch. There are, of course, multiple reasons why you might want to do that.

One of those reasons is that CloudWatch has a very rich and well designed graphical interface that you can use to tune your metric charts to the exact visual specification that you want.

I'm just going to highlight a few potential capabilities here. First of all, and this is fairly trivial, you can set thresholds on your metrics and annotate them. You can set regular rule-based thresholds as well as the more sophisticated anomaly detection thresholds.

You can also change chart types, and CloudWatch right now supports many of the contemporary chart types that are available out there.

This is my personal favorite: you can annotate your metric charts with relevant information. For example, if you are working on a performance issue and you have a big team working on it, you can clearly communicate with the team, and maybe across teams, how exactly that performance problem progressed in time, as well as what actions the team took to solve it and when. This makes communication within the team and across teams very simple and clear.

You can also construct derivative metrics. As I mentioned, on a typical database RDS collects about 100 different pieces of telemetry, but obviously it's not going to collect all of the potential ones that are possible. However, you can use math expressions to construct new metrics from the underlying metrics.

In this specific example, the database engine itself does not emit a buffer cache hit ratio metric, which we were interested in, but it does emit the components for it, and we can use metric math to construct this metric.
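
To illustrate what such a metric math expression looks like, here is a sketch using GetMetricData. The two input counters and their namespace are stand-ins for whatever components your engine actually publishes; only the math expression itself is the point.

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

def counter(metric_id, metric_name):
    """One of the underlying counters; namespace and names are placeholders."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "MyApp/DatabaseCounters",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "DBInstanceIdentifier",
                                "Value": "my-rds-instance"}],
            },
            "Period": 300,
            "Stat": "Sum",
        },
        "ReturnData": False,   # only the derived ratio is returned
    }

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        counter("hits", "BlocksHit"),
        counter("reads", "BlocksRead"),
        {
            "Id": "hit_ratio",
            "Expression": "100 * hits / (hits + reads)",
            "Label": "Buffer cache hit ratio (%)",
        },
    ],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
)

print(response["MetricDataResults"][0]["Values"][:5])
```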

And the final thing is that you can share the dashboard with your teammates, or perhaps with a leadership team.

So being able to work with the metrics and build dashboards is obviously great, but the hallmark of any well designed observable system is that you should be able to get alerts when things turn sour. What this means is that you probably want to set alarms on your metrics.

CloudWatch already includes pretty sophisticated machinery to set all kinds of alarms on the metrics that it supports, but up until now, you could only set alarms on the CloudWatch native metrics - not the ones coming from Performance Insights or Enhanced Monitoring.

Well, this changed starting this September. Now you are able to set alarms - again, all kinds of alarms - on all RDS telemetry sources, including Enhanced Monitoring and Performance Insights.

Probably the easiest way to do it is to zoom in on the metric in the chart in the dashboard that you built, select it, and click the edit in CloudWatch button. Once you do that, it will bring the metric into the CloudWatch realm. If it's a Performance Insights or Enhanced Monitoring metric, for the time being it will bring it in the form of a metric math expression that uses the DB_PERF_INSIGHTS function.

No matter what the form of the metric is, you can still set an alarm on it, just like you do on any other regular CloudWatch metric. For example, you can click on the alarm button and just follow the standard workflow to set the alarm, very similar to what you would do for any, quote unquote, regular metric.
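
As a rough sketch of the same thing in code, an alarm on database load coming from Performance Insights via the DB_PERF_INSIGHTS metric math function might look like the following. The resource ID, threshold, and SNS topic are placeholders, and the expression the console generates for you may differ slightly.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when average active sessions (db.load.avg from Performance Insights)
# exceed the instance's vCPU count. Resource ID, threshold, ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="rds-db-load-above-vcpus",
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=3,
    Threshold=8.0,   # e.g. the number of vCPUs on the instance
    Metrics=[
        {
            "Id": "db_load",
            "Expression": "DB_PERF_INSIGHTS('RDS', 'db-EXAMPLE123', 'db.load.avg')",
            "Period": 60,
            "ReturnData": True,
        }
    ],
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:my-alerts-topic"],
)
```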

OK, so let's switch gears here and talk about a different project that we call performance analysis on demand.

Now, Performance Insights is already a pretty decent tool that can show you what your database performance looks like, but the interpretation of what you see was always up to the customer.

We will show you what the database load looks like, but it would be up to you to interpret whether the database is healthy or unhealthy.

In RDS, we wanted to democratize the interpretation of this information. We want to make it easier for everyone to understand what these performance metrics are telling you, and that's why we came up with the idea of performance analysis on demand.

So now you can go to your Performance Insights timeline and select any time range that you care about. Once you select it, a new button will appear with the name Analyze performance. You can click on that button and within a few seconds receive a performance analysis report generated for that time range.

That report will tell you several things. First of all, it will tell you whether the database is healthy or unhealthy, and if it's unhealthy, what the level of that unhealthiness is and how unexpected the situation is. It will also show you the nature of the problem that the database is experiencing in terms of wait events, and it will provide descriptions of what these wait events actually mean.

Another thing it will point out is the major contributors, in terms of SQL queries, that are contributing to this problem. And finally, it may find correlated anomalies in major resources and tell you about them.

So, for example, it can pinpoint that CPU has been exhausted, or memory was under pressure, or there was an unexpected and high spike of database connections that is correlated to the time of the problem.

And finally, for most of the problem types that we recognize, we also link a contextual document that describes the next steps you can take, how you can troubleshoot the problem further, and how you can potentially address it.
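
The same on-demand analysis can also be driven programmatically. A hedged sketch, assuming the Performance Insights analysis report operations in the boto3 "pi" client and a placeholder resource ID and time range:

```python
import time
from datetime import datetime, timedelta

import boto3

pi = boto3.client("pi")

start = datetime.utcnow() - timedelta(hours=2)
end = datetime.utcnow() - timedelta(hours=1)

# Kick off an analysis report for the suspicious time range.
report_id = pi.create_performance_analysis_report(
    ServiceType="RDS",
    Identifier="db-EXAMPLE123",          # placeholder DbiResourceId
    StartTime=start,
    EndTime=end,
)["AnalysisReportId"]

# Poll until the report is ready, then print its insights.
while True:
    report = pi.get_performance_analysis_report(
        ServiceType="RDS",
        Identifier="db-EXAMPLE123",
        AnalysisReportId=report_id,
        TextFormat="PLAIN_TEXT",
    )["AnalysisReport"]
    if report["Status"] != "RUNNING":
        break
    time.sleep(10)

for insight in report.get("Insights", []):
    print(insight.get("InsightType"), "-", insight.get("Description"))
```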

So performance analysis on demand is an interactive experience, but there's also an automated way to detect database performance issues, and that is by using the Amazon DevOps Guru service.

DevOps Guru is a separate service under the Amazon umbrella, and it monitors system telemetry continuously. That includes all of the system telemetry that you register, not just database telemetry, and it will alert you if it finds anomalous patterns in any of the telemetry.

Now, if the database is part of that anomalous pattern, then DevOps Guru will also link a detailed database deep dive report, which actually looks very similar to the performance analysis on demand report.

So it will highlight whether the database has a performance issue and how big it is, it will record the nature of that issue, and it will also highlight the main contributors to that issue in terms of SQL. Finally, it will also link similar documents that describe the next steps you can take to address the problem.

All right, so let's do something fun. Let's take everything that we've seen during this presentation and use it to troubleshoot a real database performance issue.

We are going to look at the metrics, dive deep into some of them, and hopefully find a solution for our problem, or maybe several of them.

All right, so let's set the context first.

Imagine that you are an on-call engineer for an online web store. Your team, among other things, is responsible for the recent orders web page, where customers go and see their past orders.

Now, when customers query their recent orders, the reasonable expectation is that these queries return fast, and your team's job is to make sure that this happens.

To get technical, the main driver for this page's response time is a SQL query that goes to the back-end RDS database and extracts the data for the page. Because this query is so crucial, you obviously monitor its performance very extensively and have all sorts of alarms in place.

OK, well, it's one of those days, and some of your alarms fire.

This is a pretty big deal, because many customers are complaining that their page refreshes became excruciatingly slow, and there are even reports that some of those requests are timing out.

So clearly, that's not a good situation to be in, and as the on-call, you need to do something very fast to make customers happy.

Whenever we deal with a performance issue, whether it's database or system performance, you typically need to answer these three questions.

First, what's going on, a.k.a. what is the root cause of the problem?

The second question is who is responsible for that - if it's a human, so that we can talk to them, and if it's a machine, so that we can find other ways to reason with it. But ultimately, the answer you want to find is to the question of how can we fix it?

Let's take these questions one by one. To answer the first question, what's going on, we are going to look at the main Performance Insights page that shows the database load chart.

As you can probably see, there are some indications that the database is not behaving as well as it should - that big blue spike is the evidence of that.

But we also have the new Analyze performance button, so let's see what it can do for us in this case.

We press the button and within a few seconds we get a report that highlights three things.

The first one is that the database load spike that we see on screen is both high and unusual, which confirms our suspicion that the database is not exactly healthy.

The second finding is that the database load spike is dominated by active sessions that are waiting for I/O. We can deduce that by looking at the actual wait event, which in this case is IO:DataFileRead.

But the big picture is that the sessions are waiting for I/O, and that's the problem. The third finding is also interesting.

The system found that there was a correlated and unexpected spike of database connections that happened right at the same time as the I/O problem.

So all of this is useful information; let's see what we can do with it.

First things first, let's take a look at our I/O, because we noticed that lots of active sessions are waiting for I/O. To do that, we are going to switch to the new metrics view that shows the database health dashboard I mentioned before.

This dashboard has multiple different parts, and some of those parts are dedicated to I/O metrics. In this case, a couple of I/O metrics display pretty interesting behavior.

The first one that I'm highlighting here is I/O latency, or the response time for I/O requests.

Now, this is a typical OLTP system - it's a transactional system - and within the bounds of such a system, whenever you see I/O latency on the order of 500 milliseconds per request, you can probably say that it's pretty unreasonable. But the second chart is even more telling in this case.

This chart measures the number of I/O operations that the database issues on behalf of our queries. As you can see, this metric kind of plateaued, or flat-lined, at a rather curious number of about 1,000.

Now, I'm going to zoom this graph in so that everybody can clearly see what's going on here, and I'm pretty sure some of you already have a good idea of what the problem is. Any guesses?

Yes, thank you, sir, you're exactly right. In this case, our database is actually using an io1 EBS volume behind the scenes, and this volume has a set number of I/O operations allocated to it. It looks like we are essentially trying to cross that limit, and as a result our requests are throttled, which is very likely why we are seeing the high latency.

OK, so is that the only interesting metric pattern we can see on this health dashboard? The answer is no, we can actually see more patterns.

As you look at multiple metrics, you can see that many of them show very similar spikes that correlate very well with the time of the problem.

All right, so I'm going to zoom some of them in to, again, make the patterns clearer to everyone. It's a little bit subtle, but you can also probably guess what the main driver of the degraded performance is.

Now, the way to understand these charts - I think the best way is to just read the names of the metrics and try to construct a sentence about what is happening with them.

So I'll start with the first one: the number of active sessions spiked during the time of the problem.

Next: the number of database transactions spiked during the time of the problem.

And I think you can see where I'm going with this. Essentially, we can say that the database workload spiked during the time of the problem.

And most likely that's what caused the problem in the first place: the database workload spike. We started issuing more IOPS to the EBS volume, and that's why the system became slow.

All right, so we have the answer to question number one, what's going on. But who is responsible for that increased load on our database? To answer that, I'm going to go back to the performance analysis report and look at the query that was the major contributor to the problem - or, alternatively, the query that is suffering the most from the problem.

So this is the query that we are looking at, and here comes our first big surprise: it's actually not our query. It's not the query that refreshes the orders page - that query is shown on the right, and as you can see, it is quite different from the one that is actually causing the issues.

So we have this extra workload, which we were not aware of, that seems to be making our database performance bad. But where does this new workload come from? Fortunately, in our system, our applications are well instrumented, and in particular, every time a database connection is established, we also pass along an application name tag with the connection request. This is important because Performance Insights knows how to use those tags.

In fact, one of the database load dimensions that we support is application name, which is constructed from these tags, and we can use that to identify which applications are causing increased load on our database. Once we switch to that dimension, it becomes very clear that there is a new call center application which has essentially taken over the load on our database - the majority of database load is coming from that application.
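
As a concrete illustration, the instrumentation described here is just a connection attribute. A sketch of how an application might pass the tag, assuming a PostgreSQL backend and the psycopg2 driver; host, database, and credentials are placeholders.

```python
import os

import psycopg2

# Tag every connection with the owning application so Performance Insights
# can attribute database load to it via the application name dimension.
conn = psycopg2.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",  # placeholder
    dbname="webstore",                                          # placeholder
    user="app_user",                                            # placeholder
    password=os.environ["DB_PASSWORD"],
    application_name="call-center-app",                         # the tag
)

with conn, conn.cursor() as cur:
    # The tag is visible to the engine, and therefore to Performance Insights.
    cur.execute("SHOW application_name;")
    print(cur.fetchone()[0])   # -> call-center-app
```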

So we found who is responsible for the problem. Just to give a quick summary of what we know so far: what's going on is a pretty acute I/O issue, related to the fact that the EBS volume IOPS have been exhausted. We also found the responsible party: an extra call center application that started doing a lot of activity in our database, which most likely caused the I/O spike to begin with.

So how can we address it? Well, whenever you deal with a noisy neighbor situation, there are usually three ways to address that extra workload's requirements - assuming, of course, that the extra workload is actually needed. In this case, we talked to the downstream team, the call center developers, and they confirmed that yes, it's a valid workload and they want to run it.

So our options are: we can either tune the new workload, or we can try to coexist with this new workload somehow, or, if everything else fails, we can move this workload elsewhere. Let's go through these options one by one and see what kind of solutions we can find.

All right, so the first question: can we tune the new query? This is probably the first question on any engineer's mind - that's what engineers do. Fortunately, Performance Insights does provide SQL statistics that you can use to evaluate whether your queries are performing well or underperforming.

The bad news here is that the numbers we're looking at are quite small, which means this query is probably about as efficient as it can be. So no major improvements there. We can dive a little deeper and look at this query's execution plan, and if there are any PostgreSQL experts here, they can confirm that for this type of query, this is pretty much the perfect plan it can use.

So, no major opportunities for improvement. I have to mention that execution plans for open source databases are the only part of this investigation that is not yet instrumented in Performance Insights, but we do already have it instrumented for other engines, such as Oracle.

All right, so option one, tuning our extra query, is a no-go. Let's switch and try to execute option number two: can we tune our system instead, so that we can run both the original workload and the new workload on the same box? There are several options available to us here.

The first option is fairly obvious. The original problem was caused by the fact that the I/O capacity on the volume was exhausted. Well, maybe we can just add additional IOPS to that volume, and this will solve our issue. This sounds very trivial, but that is the beauty of running your database in the cloud: these operations are very simple to execute, because the cloud gives us the flexibility to add capacity on demand.
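
For illustration, the "add IOPS" lever is a single API call (or console change). A sketch with boto3; the instance name and target IOPS are placeholders, and you would validate the target against your workload numbers and volume limits first.

```python
import boto3

rds = boto3.client("rds")

# Raise the provisioned IOPS on the instance's io1 storage.
rds.modify_db_instance(
    DBInstanceIdentifier="webstore-db",   # placeholder
    Iops=3000,                            # placeholder target
    ApplyImmediately=True,
)
```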

Now, obviously, we'll need to run some numbers here, but that's potential solution number one. Solution number two: let's look at our query specifically, the query that refreshes the orders page. In Performance Insights, you can reconstruct your database load chart to show only activity that is scoped to a single query. You can do it simply by clicking on the query in the bottom tab in Performance Insights, and this will rearrange the graphs for that query only.

What you can see here is that the database load spike for our query looks almost exactly like the database load spike for the whole database: a bunch of sessions started waiting for I/O when the problem began. But the important piece is that our query was not waiting for I/O before the problem began. And once again, the query statistics confirm that the execution rate for our query actually did not change.

So it's just the effect of the noisy neighbor that caused this behavior to change. We can confirm what our query's load profile looked like before the problem by zooming in Performance Insights on the time before the problem and looking at the resulting wait event distribution. In this case, you can see that the query was clearly running from memory - there are barely any I/O related waits in this picture.

OK, so what changed, and what conclusion can we make by looking at this graph? The rather obvious conclusion is that our database box was sized pretty well for the resident set size of our query before the extra workload came in.

The query was running mostly from memory just because the box was finely tuned to handle that workload. Then, once the new workload came in, it pushed both itself and our query out of memory, and both started issuing disk requests.

So if that's the explanation, then the rather simple solution here is to potentially just use more memory - to put it simply, we can put this database on a bigger box, and that may solve our performance issue. Obviously, we need to run more numbers here and perhaps do some experiments, but that's potential solution number two.

So we have our two solutions, but let's still be thorough and also explore whether we can move the new noisy neighbor workload elsewhere. To do that, we are going to look at our neighbor query again. What we can notice is that, first of all, it's just a single query, which is great, and it's also a SELECT query, which means it's read-only.

So that's even better, and what it means is that there is a good opportunity here to run this new extra workload from a dedicated read replica, or several. We do need to again talk to the downstream team and see how this can be done, and whether they have a dependency on the timing of this query in different parts of their application. But it's potential solution number three.

And there is even one more potential solution here: instead of using read replicas, we can perhaps cache the results of this query in something like Amazon ElastiCache. We can apply this logic to both the new query and our original query, to make sure that the response time of these queries remains in the acceptable range.
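
As a rough illustration of that last idea, a read-through cache in front of the recent orders query might look like the sketch below, assuming an ElastiCache for Redis endpoint, the redis-py client, and a hypothetical run_orders_query() helper for the actual database call.

```python
import json

import redis

# Placeholder ElastiCache for Redis endpoint.
cache = redis.Redis(host="my-cache.abc123.use1.cache.amazonaws.com", port=6379)

CACHE_TTL_SECONDS = 60   # how stale a cached result is allowed to be


def get_recent_orders(customer_id, run_orders_query):
    """Read-through cache: serve from Redis when possible, otherwise run the
    database query (run_orders_query is a hypothetical helper) and cache it."""
    key = f"recent-orders:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    orders = run_orders_query(customer_id)           # hits the RDS database
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(orders))
    return orders
```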

And that concludes our investigation. We looked through multiple options, and we were able to find some solutions that we can apply very quickly, like increasing the IOPS of the volume, to perhaps stop the pain. We also found some other solutions which are more elaborate, but which can potentially address the problem better long term.

So we do need to talk to the downstream team and run some experiments, run some numbers, but there are at least several ways we know we can address the performance issues that we are seeing. And as you can see, Performance Insights helped us run through this investigation and provided valuable insights.

We were able to look at the system holistically, we were able to dive deeper and look at specific metrics, and we were eventually able to find solutions based on the evidence we saw in Performance Insights. So the bottom line is that Performance Insights was a big help, and here is hoping that the same Performance Insights tool set can help you in your own performance investigations and performance journeys.

And with that, I'm going to give the mic back to Pin.

Thank you very much, Maxim, for showing some of the latest and greatest performance diagnostics features. It shows how utilizing these features can simplify performance diagnostics and troubleshooting, so you don't need to work too hard to identify performance bottlenecks and resolve them.

I will just summarize by saying that database performance is all about visibility. As we said at the beginning of the presentation, you can't fix a performance bottleneck that you can't see. So we encourage you to take advantage of the features, services, and tools that are available to you, and aim to be more proactive.

For example, we talked about dynamic thresholds in Amazon CloudWatch, which analyze past metric behavior, know the expected baseline range at any point in time, and can notify you about deviations from that baseline.

Maxim showed you some features which allow you to be more proactive, like exporting Performance Insights metrics into Amazon CloudWatch. So now you can be alerted on Performance Insights metrics, which is a very significant announcement for performance monitoring.

We also discussed exporting Performance Insights dashboards into CloudWatch, so now you can use CloudWatch dashboards, share them with your team, and be more proactive.

So the bottom line is take advantage of these tools, find the right tool for the right job and aim to be more proactive.

One last note before we finish the session - as you know, AWS is a data driven company. So getting your feedback would be extremely valuable to us. If you can spend a minute and provide us feedback in the session survey, that would be great.

Thank you so much for attending. Maxim and I will be outside of the room, just to respect the next speaker, so if you have any questions for us, we would love to hear them. Enjoy the rest of the conference.
