Enabling operational analytics with Amazon Aurora and Amazon Redshift

All right, thanks everybody for coming to this session today. We are going to talk about how to enable operational analytics with Aurora and Redshift. My name is Neeraja Rentachintala, and I lead product management for Amazon Redshift. We have another speaker, Adam Levin, a senior product manager on the Amazon Aurora team.

Today, we are very excited to talk to you about a brand new capability that simplifies operational analytics with Aurora and Redshift. Before we dive in, can I get a quick show of hands on how many of you use either Aurora or Redshift, or both? Ok. And how many of you are in a data engineering role, or a user such as an analyst, who has to worry about data pipelines or do analytics on operational data? All right, hopefully you'll get something out of the session.

This is the agenda for today:

  • I will briefly cover the challenges and opportunities with operational analytics
  • Then we'll talk about the new capability that we just launched to simplify analytics with Aurora and Redshift
  • We'll cover briefly some use cases
  • Then Adam will show us a demo
  • And we'll dive into the actual capability in a little bit more detail

As customers become more data driven and start to see data as an asset or a competitive advantage, they want more near real-time analytics. The use cases for real-time analytics are plentiful, as you can see here: personalization, being able to provide the best customer experience based on specific preferences and activity, fraud detection, churn detection, customer relationship management, and then vertical-specific use cases like inventory optimization, IoT, location optimization, and gaming leaderboards.

But the key enabler for all these use cases is getting access to the data quickly, gaining insight, and taking action at the right moment. Today, to enable operational analytics, you have to build and manage pipelines to take the data out of operational databases and load it into an analytics system. Building and managing pipelines is expensive and cumbersome, and most importantly, pipelines can be error prone, leading to delays in getting access to the data. This can also result in incomplete and inconsistent data. Another common problem is that any time the source system schema changes, reflecting that change on the analytics side is a complex task that again requires manual intervention.

Some customers use a single-database solution that offers both analytics and transactions in the same database. These kinds of databases have existed in the market for a while, and new databases with a similar architecture keep coming up. But most of the time, these systems are limited to purpose-built, very targeted analytic applications. They are not useful if you want a general-purpose solution for operational analytics, and they also tend to become expensive as you add more data and try to scale.

Another very important factor we hear from customers is that they often want to analyze data across more than one transactional database, and you cannot do that with a single database.

So how do we at AWS think about this problem? At AWS, we believe in offering purpose-built databases for specific use cases. Amazon Redshift is the data warehouse in the AWS stack. Broadly, Redshift offers three key benefits:

First of all, it's easy, secure and reliable so that you can focus on getting insights from the data rather than having to deal with infrastructure.

The second benefit is that Redshift lets you analyze all your data, whether it's in the data warehouse, a data lake, operational databases, streams, or even third-party data sources. You can do SQL-based analytics, and you can also do advanced analytics like machine learning directly on top of the data warehouse.

The third benefit is that it offers the best price performance for complex workloads at any scale. At the core of Redshift innovation is its superior performance: Redshift offers up to 5x better price performance compared to alternative data warehouse systems. There are many performance features in Redshift, such as columnar storage, advanced query optimization, distributed data processing, and vectorized query execution, that contribute to this superior performance.

And the performance with Redshift also scales as you increase the number of users and queries and as data volumes grow. On the left side of this graph is a real-life dashboarding application where a lot of users come in and query the data warehouse. These are very low-latency queries at very high concurrency, so throughput matters in this kind of workload. Redshift offers up to 7x better price performance than alternative systems here.

And as you can see on the right side of the graph, Redshift allows you to go from gigabytes to terabytes to petabytes of data, scaling linearly while maintaining price performance.

Tens of thousands of customers use Redshift every day collectively processing exabytes of data. And again, the use cases vary. This could be business intelligence, ad hoc queries, machine learning, operational analytics, data exploration and a wide variety of others.

And if you take Aurora, this is the relational database service in our stack, which combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases.

Aurora offers MySQL and PostgreSQL compatibility, so you can move to Aurora without having to change your applications. It's also a fully managed service, in the sense that you don't have to worry about hardware provisioning, database setup, backups, recovery, patching, and all these complex administrative tasks.

Aurora provides the security, availability, and reliability of commercial databases at one-tenth of the cost. Aurora is the fastest-growing service in AWS history. It's used by almost all of the top 1,000 AWS customers across industries including financial services, telco, gaming, and retail. Aurora is a very popular database.

So with Aurora and Redshift widely used by so many customers, what does the typical transaction-plus-analytics architecture look like? The most common pattern we see is transactional applications writing data to Aurora, with one or more Aurora readers serving specific applications.

And as demands grow and customers start to look for better performance at scale, or they want to analyze data across multiple Aurora databases, they move to a slightly different architecture that looks something like this: transactions plus analytics at scale.

In this architecture, they bring data from one or more Aurora databases and load it into Redshift, but in between they have a pipeline consisting of one or more AWS services. In this particular picture, for example, they take data from Aurora with the Database Migration Service (DMS) and put it in S3, optionally do some transformation and enrichment using EMR or Glue, write the results back to S3, and then load them into Redshift.
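To make that last hop concrete, here is a rough sketch of the kind of load statement such a pipeline typically ends with; the bucket, IAM role, and table names are placeholders for illustration, not taken from the talk.

```sql
-- Hypothetical final step of the DMS/Glue pipeline described above:
-- load the staged files from S3 into a Redshift table.
COPY sales.orders
FROM 's3://my-etl-staging-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
```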

There's nothing wrong with this kind of pipeline if your intent is to transform and enrich data. But if your intent is really to get to the data faster, the pipelines are going to get in your way and create latency.

So to deal with this challenge, we have introduced at re:Invent this week Amazon Aurora Zero ETL integration with Amazon Redshift. We are very excited about this capability because it gives you an easy and secure way to enable near real-time analytics on petabytes of transactional data across Aurora databases.

And with this integration, data is available in Redshift within seconds of being written to Amazon Aurora.

So what are the benefits of this integration?

First, it's easy and reliable. You don't have to worry about creating and managing pipelines, or dealing with any errors that might occur as part of pipeline management.

The second benefit is low-latency data integration. That means you can run near real-time analytics and machine learning on transactional data without having to deal with stale data issues.

And finally, it gives you unified insights. You can combine data from multiple Aurora databases and do analytics to get a holistic view of the business.

This integration is designed for simplicity and performance. As a customer, all you need to do is specify your source Aurora database and your target Redshift data warehouse, and that's it. Data seeding starts immediately and the data appears in Redshift. From then onwards, any time changes happen in Aurora, they are reflected in Redshift within seconds, with the replication happening at the storage layer.

We'll talk about some of those details in later slides. Another benefit of this integration is that you can start running queries and analytics alongside the ingestion; you don't need to wait for it to finish, because ingestion is continuous. You will be able to monitor the replication lag, and this is also a resilient integration in the sense that it automatically recovers from errors.

Once the data is in Redshift, you can leverage all of Redshift's rich analytics functionality. For example, Redshift supports complex joins, data sharing, and semi-structured data. You can combine the data with the data lake and other data sources, apply granular security, and connect it to your analytics tools.

So whatever you could do with Redshift, all of that functionality is available on the transactional data within seconds. The ease of use and efficiency of the integration are further amplified if you use Redshift Serverless and Aurora Serverless alongside the Zero ETL integration.

Aurora Serverless v2 instantly scales to hundreds of thousands of transactions in a fraction of a second. As it scales, it adds resources to meet your application's demands; you don't have to manage any infrastructure, and you pay only for what you use.

Similarly, with Redshift Serverless you don't have to manage any clusters. Redshift automatically provisions compute and scales it to maintain performance, and you pay only for what you use.

The idea is that by using serverless with the Zero ETL integration, not only do you avoid the complexity of data pipelines, you don't have to think about infrastructure at all on either the Aurora or Redshift side. Both scale automatically with incoming data and query demands while maintaining performance.

Briefly touching on the use cases: as I mentioned before, the main use case is being able to analyze data across multiple operational databases. Here is one customer example where they have 70-plus Aurora databases with a lot of data in them, and a latency SLA of under 12 seconds to get the data and run recommendations. This is the kind of use case the Zero ETL integration enables.

The other very common use case we see emerging is being able to use and share near real-time data. Within Redshift, you can integrate data from Aurora, but Redshift also supports streaming ingestion, so you can bring in data from streams, and it can query data in the data lake. You combine all this data and make it available to other data warehouses through Redshift's data sharing capability. You can even offer this data as a service to your customers, partners, and business ecosystem. These are the kinds of real-time data flows we believe are critical for an organization's agility and innovation.
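As a rough illustration of those two pieces, here is a sketch of Redshift streaming ingestion from a Kinesis stream and of exposing data through a datashare. The stream, role, schema, and datashare names are made up for the example.

```sql
-- Streaming ingestion: map a Kinesis stream into Redshift and materialize it.
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';

CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_schema."clickstream";

-- Data sharing: make near real-time data available to another Redshift
-- data warehouse. The consumer namespace ID is a placeholder.
CREATE DATASHARE operational_share;
ALTER DATASHARE operational_share ADD SCHEMA public;
ALTER DATASHARE operational_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE operational_share TO NAMESPACE '<consumer-namespace-id>';
```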

So with this brief introduction, I want to hand it over to Adam to show us the demo and dive into the capability.

Adam: Thanks, Neeraja. We're going to take a look at a simple demo of Aurora Zero ETL integration with Redshift in action. We'll create the integration, data will start flowing to Redshift, we'll make some updates in Aurora and see how they flow through to Redshift, and then we'll take a look at what we can do with that data once it's in Redshift.

So first we're going to go to the RDS console and we are going to take a look at an Aurora MySQL database that's already been created called demo_cst_sales, which you can see here.

Next, we're going to go over to the Redshift console, where we'll see that we've already set up a Redshift Serverless data warehouse called redshift_demo. We're going to connect to that data warehouse using the Redshift query editor, and as you can see on the left-hand side, you've got the standard default dev and sample databases.

We'll go back to Aurora and connect to our Aurora MySQL cluster using the command line interface. We take a look at what's in that database, and you'll see there are seven tables already in this MySQL database.

Next, we're going to set up the Zero ETL integration. We go back to the console, and in the left-hand nav there's an option for Zero ETL integrations. We click the orange Create Zero ETL integration button, specify the source MySQL cluster we were just looking at, and then specify our target Redshift data warehouse. What you have access to is determined by the permissions you have in your account and with your user.

Once we've selected our source and target, we click Create. The integration gets set up, and once it switches to Active, we can take a look at its details if we want.

Next, we go back to Redshift, and we can see that the data hasn't shown up yet because there's one last step. We need to tell Redshift to accept the data that's now flowing from Aurora and to use the schema and data from Aurora to build out the integration in Redshift.

The first step is to get the integration ID, so we query the SVV_INTEGRATION system view. Then we use that integration ID to create a new database in the Redshift data warehouse; we'll name it mysql_db_one for simplicity. Redshift uses the schema and data structure from Aurora to set up that database.
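The two statements for this step look roughly like the following; the integration ID below is a placeholder you would copy from the first query's result.

```sql
-- Find the ID of the Zero ETL integration that was just created.
SELECT integration_id FROM SVV_INTEGRATION;

-- Create a Redshift database from that integration; Redshift mirrors the
-- source schema and data into it. '<integration-id>' is a placeholder.
CREATE DATABASE mysql_db_one FROM INTEGRATION '<integration-id>';
```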

We refresh the list of databases in Redshift, and you can see mysql_db_one there. Next, we switch to mysql_db_one by selecting it from the dropdown and run a query to see what's there, and we see all seven tables replicated from our Aurora cluster and seeded into Redshift.
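One way to confirm the seeded tables, assuming the SVV_ALL_TABLES system view, is a query like this:

```sql
-- List the tables that were seeded into the integration database.
SELECT schema_name, table_name
FROM svv_all_tables
WHERE database_name = 'mysql_db_one'
ORDER BY schema_name, table_name;
```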

But this is not just about seeding data; we also want to capture changes as they flow into Aurora. So we go back to Aurora, add a table, and commit that. Then we go back to Redshift, refresh the list of tables, and we can see our new table is now available in Redshift.

A similar thing works for adding a row of data. We go back to the MySQL command line, add John Doe from Texas, and commit. Back in Redshift, the data is replicated over in near real time, and you can see we now have John Doe from Texas in our Redshift data warehouse.
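Here's a minimal sketch of those two change steps; the table definition and the schema names are illustrative, not the exact ones used on stage (the Redshift schema name mirrors the source MySQL database name).

```sql
-- On the Aurora MySQL side: create a new table and add a row.
CREATE TABLE customers (
    id    INT PRIMARY KEY,
    name  VARCHAR(100),
    state VARCHAR(50)
);
INSERT INTO customers VALUES (1, 'John Doe', 'Texas');
COMMIT;

-- A few seconds later, on the Redshift side, the table and row show up
-- in the integration database (three-part name shown for illustration).
SELECT * FROM mysql_db_one.demo_cst_sales.customers;
```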

This integration isn't just about replicating data from Aurora to Redshift. It's also about combining and analyzing data across multiple data sources, whether that's multiple Aurora databases, Aurora data combined with existing data in your data warehouse, or data coming from somewhere else.

So we're going to flip over to an existing database in this data warehouse called demo_customer_dw, and you can see that it has a similar schema to our Aurora data. We want to combine this data into a single materialized view. You could do this with federated query, but in a lot of cases the performance of federated query isn't what you need it to be to run true operational analytics.

So we create a materialized view that takes the data coming in from Aurora and combines it with the data in the demo_customer_dw database, giving us a single view of that data coming together. We're also doing a little bit of cleanup and labeling in the background to keep things tidy.
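A minimal sketch of what such a materialized view could look like follows, assuming a customers table on both sides; the object names and the cleanup/labeling are illustrative rather than the exact SQL from the demo, and if cross-database references aren't available for materialized views in your environment, a regular view works the same way.

```sql
-- Combine near real-time Aurora data with existing warehouse data,
-- labeling each row with its source and tidying up the name column.
CREATE MATERIALIZED VIEW combined_customers AS
SELECT 'aurora'   AS source, id, TRIM(name) AS name, state
FROM mysql_db_one.demo_cst_sales.customers
UNION ALL
SELECT 'existing' AS source, id, TRIM(name) AS name, state
FROM public.customers;

-- Query the combined view.
SELECT * FROM combined_customers;
```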

We run the materialized view SQL command, and once that's ready, we run a select star on the view. You can see that we now have data from the Redshift database and the Aurora database together; John Doe is now with his family and friends in that materialized view.

From here you can use all of the features available in Redshift: dashboarding, BI, and all the other advanced analytics use cases you use Redshift for.

So in this demo, we took a look at an existing Aurora database. We took a look at an existing Redshift data warehouse. We showed how to create that integration between Aurora and Redshift. And we showed how data flows through and how changes are reflected in Redshift once they're made in Aurora.

We also took a look at creating a materialized view that combines data; you could use this to combine data coming from multiple Aurora databases, or to combine data coming from Aurora with other data already in your Redshift data warehouse.

This is a simple demo, but in steady state we are targeting single-digit-second latencies for data flowing from Aurora to Redshift once it's been committed in Aurora.

Customers like Adobe and partners like Qlik are excited about the speed at which the Zero ETL integration can move data, so they can provide operational analytics faster. They're also excited about the new analytics use cases opened up by being able to run operational analytics across multiple databases and get access to operational data so quickly.

So we've talked a lot about the benefits of the Aurora Zero ETL integration with Redshift. Now let's talk about how it all works: when you click that Create Zero ETL integration button, what happens behind the scenes?

For the rest of this section, we're going to focus on Aurora MySQL as an example, but we are working to support Aurora PostgreSQL as well.

On this first slide, we've got an existing Aurora MySQL database and an existing Amazon Redshift data warehouse, which parallels the flow of the demo. We're not going to go deep into the internals of Aurora or Redshift in this session; there are plenty of other sessions you can attend to get really deep into how these services work.

But there is one key element worth pointing out, and that's Aurora storage and Redshift storage, because that's what enables some of the key components of the Zero ETL integration.

Double-clicking into Amazon Aurora storage a little: it is purpose-built, log-structured storage optimized for database transactions and interactions. When a write comes to your Aurora database, we write it efficiently to six copies within an Aurora storage cluster. We don't wait for all six storage nodes to acknowledge; we just look for a quorum, commit the transaction, and respond back to the application.

Storage is distributed across three Availability Zones for high availability. This is really nice because the database engine doesn't have to concern itself with the underlying hardware characteristics and processes; the storage is optimized for storing, reading, and writing database transactions.

The storage nodes themselves use local SSDs for fast reads and writes, and they continuously stream data changes down to Amazon S3 for durability and backups.

Likewise, on the Redshift side we have Redshift managed storage, which enables you to scale and pay for compute and storage independently, so you can size your cluster based on your needs.

The Redshift storage nodes are built on the AWS Nitro System and feature high-bandwidth networking and performance. They use large, high-performance SSDs as local caches and intelligently push data down to S3 as well, so you get the best of both worlds.

In both Aurora and Redshift, storage and compute scale independently, so you can size your cluster to exactly what you need.

Going back to our diagram: when you start writing data to your Aurora cluster, we write the transaction log down to Aurora storage. Now, when you create your integration, just as we saw in the demo, you specify your source, specify your destination, and click Create.

On the back end, we do a number of things. First, if you don't have logical replication or change data capture turned on, we turn it on. In Aurora MySQL, this is implemented through binary logging (binlog); in Aurora PostgreSQL, it's implemented by decoding the write-ahead log.
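If you want to check the change data capture settings yourself on Aurora MySQL, something like the following works; binlog_format is a standard MySQL variable, and aurora_enhanced_binlog is the cluster parameter documented for Aurora's optimized binlog (both names are assumptions drawn from the Aurora MySQL parameter documentation, not from this talk).

```sql
-- On the Aurora MySQL cluster: confirm row-based binary logging is enabled.
SHOW VARIABLES LIKE 'binlog_format';          -- expect ROW for change data capture
SHOW VARIABLES LIKE 'aurora_enhanced_binlog'; -- 1 when enhanced binlog is on
```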

For those of you familiar with MySQL, the mention of binlog can send shivers down your spine, because people often associate turning on binlog with a performance hit on the database. This is common across all MySQL instances, whether on Aurora or elsewhere; there's a wide range of impact that turning on binlog can have on performance, and it's a concern I've heard customers raise time and time again.

So we've listened to that feedback and made some optimizations to how binlog works on Aurora MySQL. At the top, you see how binlog works in community MySQL, where the performance impact happens for a couple of reasons.

First, the transaction log and the binlog are written serially: MySQL waits for the transaction log to be written, then writes the binlog, and then a two-phase commit commits both. So you have to wait for one, then the other, and then the commit.

We've rethought that a little. On Aurora MySQL, we can now write the transaction log and the binlog in parallel and then commit both at the same time at the end, which saves a lot of time and a lot of performance overhead on the database instance.

Once we've got change data capture going and we're collecting the log of data changes, we seed the data into Redshift, and we do this intelligently at the storage layer. We take an export of the database cluster, and we've recently improved the performance of this by about 10x: multiple worker agents run in parallel at the storage layer, taking that data and pushing it over to Redshift for the initial seed.

So we've got change logs flowing, and we've pushed the seed data over to Redshift. We've also built an optimized fleet of streaming servers that read the change data directly from Aurora storage and push it over to Redshift. Because this happens at the storage layer and doesn't go through the writer instance of the database, there's no performance impact on the writer instance.

This streaming fleet enables gapless delivery of the data. And because it operates at the storage layer, and Aurora storage is shared across reader and writer instances, if your Aurora writer instance fails over to a reader, nothing happens to the integration; it just keeps on streaming.

So, taken all together: when you create an integration, we turn on this enhanced change data capture, we seed the data using a very efficient and fast cluster export mechanism, and we start streaming the data changes directly from storage using a specialized fleet of streaming servers, with no impact on the writer instance.

Throughout all of this, we're monitoring the system to ensure we're meeting its performance and latency requirements. If something falls behind, we can reseed the data as necessary. Schema changes and data changes flow through as they're made.

What this allows us to do is enable near real-time analytics on your operational data in Redshift.

So Amazon Aurora Zero ETL integration with Amazon Redshift is a fully managed capability that enables you to run near real-time analytics on petabytes of transactional, operational data.

You can combine multiple Aurora databases and create views across them in Redshift, all within seconds of the transactional data being written to Aurora, and you can do all of this without having to create complex and brittle data pipelines.

It takes just a few clicks, and we report the lag, latency, and other important metrics so you can monitor the performance of the system and have confidence that all of the data being written to Aurora is being replicated into Redshift.

At re:Invent, we announced the limited preview of Aurora Zero ETL integration with Redshift, and you can sign up to be part of the limited preview at this URL (apologies that it's a bit long; I'll leave it up for a few more seconds).

Overall, thank you very much. My name is Adam Levin, and Neeraja is here as well. Amazon is a data-driven company, so please fill out the survey.

We do have some time for questions, so if you'd like to ask one, you can come up to the mic or stay where you are, and we're happy to answer any questions about this new feature.
