Analyze Amazon Aurora PostgreSQL data in Amazon Redshift with zero-ETL

All right, good afternoon everyone. Thank you for joining our session. My name is Niraj Rintala, and I lead product management for Amazon Redshift. I have a co-speaker, Adam Levin, who is a Principal Product Manager on Amazon Aurora.

Today, we are super excited to share with you more details about a new capability that we launched at Re:Invent: zero-ETL integration between Aurora PostgreSQL and Amazon Redshift.

So this is our agenda:

First, we'll start with a very brief discussion of operational analytics - what it is and what some of the use cases are.

Then we'll dive into how AWS powers operational analytics - what the foundational pieces are, what challenges customers have today, and our vision for zero-ETL.

Then I'll briefly cover some of the recent developments around zero-ETL at AWS.

And then Adam is going to dive into the Aurora PostgreSQL zero-ETL integration, including what the features are and how it works. He also has a couple of nice demos to show.

All right. So let's start with what operational analytics is. As customers become more and more data driven and start to use data as a strategic advantage, they increasingly want access to near real time analytics. This helps them be much more proactive in understanding their core business drivers: how to increase sales, reduce costs, and optimize their businesses.

And there are several use cases for operational analytics. For example, personalization, so that you can offer the best customer experiences using relevant data based on customers' own preferences. There are things like detecting fraud, preventing churn, and product insights - horizontal use cases that are applicable across different industries.

Then there are industry-specific use cases like gaming leaderboards or Internet of Things analytics. The key to enabling all these use cases is to get to the data fast, be able to analyze it, gain insight, and really take action.

So this is why we at AWS are working towards a zero-ETL future. The idea is that we want to make it very easy to analyze the data that you have in one system using another system, without having to deal with the whole complexity of pipeline management - that's the zero-ETL theme.

You probably have heard that we have launched multiple other zero-ETL integrations. Here we are focusing on Aurora PostgreSQL to Redshift.

So before I dive in, can I get a show of hands on how many of you have a need to analyze near real time data from some operational database? Ok. And how many of you directly or indirectly are responsible for building and maintaining data pipelines? Ok. So I have an audience that can connect to the problem that I'm going to talk about.

All right. So let's talk about how AWS powers operational analytics. To start with, AWS's approach is to provide purpose-built databases for transactions and analytics.

Amazon Aurora, as many of you know, is our relational database service that combines the speed and availability of commercial databases with the simplicity and cost effectiveness of open source databases.

Many of our customers use Aurora as the back end for their transactional applications. It's fully compatible with MySQL and PostgreSQL, so you can easily write applications against it, or move your applications to it, without having to make any changes.

It also comes with a broad tool set so that you can build applications easily - whether serverless applications or, increasingly, generative AI applications, you can easily build against Aurora. Aurora is also a fully managed service, which means you don't have to think about hardware provisioning, database setup, patching, or backups; all of these things are fully taken care of.

And again, you get the security, availability, and reliability that you expect from a commercial database, but at one-tenth of the cost. We have hundreds of thousands of customers using Aurora, and it is one of the fastest growing services on AWS.

Out of our top 1,000 customers, more than 950 use Aurora. So it is a very, very popular database.

Redshift is purpose-built for analytics. In 2013, AWS introduced Redshift as the first fully managed cloud data warehouse that makes it easy to analyze large amounts of data using familiar tools.

This was a significant leap, as traditional data warehouses were not elastic, were expensive, and were much harder to tune and operate.

Since then, over the last decade, we have added a lot of innovations to Redshift so that we can keep pace with customer demands but also expand the horizons of data warehousing use cases: high performance analytics at scale, machine learning capabilities, Spark integration, near real time analytics, data sharing, and data monetization.

So there are a lot of capabilities that a data warehouse can offer today. A quick look at some numbers on Redshift: we have tens of thousands of customers, and collectively Redshift processes exabytes of data per day, executing on the order of tens of billions of queries per week.

Redshift also offers the ability to query data in open formats like Parquet, ORC, and JSON. On the data lake alone, approximately 50 exabytes of data are scanned annually.

We have a capability called Redshift ML, with which you can use SQL to create, train, and invoke models. We see over 10 billion inferences per day using Redshift, so machine learning is a popular use case.

Then there is streaming ingestion, where from a single stream Redshift can sustain around 2.5 gigabytes per second of throughput. And the most differentiating capability of Redshift is really its analytical performance.

Redshift offers up to 6x better price performance than alternative systems. It is built for analytics: it has columnar storage, massively parallel processing, compilation caches, result caches, materialized views - a bunch of performance features, really, to enable analytics - and Redshift scales with the number of users and the amount of data.

Here is just an example of a dashboarding application, where you have high concurrency - a large number of queries, but each query is very small, so it's low latency. For these kinds of operational analytics and dashboarding applications, Redshift offers up to 7x better price performance compared to alternative systems.

Again, Redshift is used across different industries - healthcare, financial services, telecom - all sorts of industries and customers use Redshift.

So we have a database service that can do transactions, and we have a very useful product, the data warehouse, that can do analytics.

So what is the problem that we are trying to solve? As you can see here, and as many of you have experienced, you have to use a lot of components and services in order to bring data from Aurora to Redshift today.

For example, you would use something like AWS Database Migration Service to take the data out of Aurora and put it in S3, use something like EMR or Glue to process the data, put the data back in S3, and then use something like COPY commands to load the data into Redshift.
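
To make the last leg of that manual pipeline concrete, here is a minimal sketch of what the Redshift loading and merge step could look like; the bucket, table, and IAM role names are placeholders I made up, not anything from the talk.

```sql
-- Final step of the traditional pipeline: bulk-load the transformed Parquet
-- files that Glue/EMR wrote back to S3 into a Redshift staging table.
-- All object names and the IAM role ARN are placeholders.
COPY staging.orders
FROM 's3://my-analytics-bucket/processed/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
FORMAT AS PARQUET;

-- You would still own the dedupe/merge logic yourself, for example the
-- classic delete-then-insert upsert pattern:
BEGIN;
DELETE FROM analytics.orders
USING staging.orders s
WHERE analytics.orders.order_id = s.order_id;
INSERT INTO analytics.orders SELECT * FROM staging.orders;
COMMIT;
TRUNCATE staging.orders;
```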

Another thing we repeatedly hear from customers is: Aurora is a database, Redshift is a database, but because of these pipelines in between, you have to actually pay for the compute to do things like deduplication, merges, and updates to records - you have to pay for all that compute in between.

So what we are saying is that this kind of architecture is really not going to help if your goal is to get to the data faster. You want to get to the data faster and start gaining insights. Maybe at that point you might realize a need for transforming the data into a different form, but to be able to do that, you need to be able to get to the data first.

This is what we are trying to do with zero-ETL: we want to make it simple and secure to enable near real time analytics on petabytes of transactional data. And as a customer, you don't have to build and manage any data pipelines.

A zero-ETL integration is a continuous, change-data-capture-based replication. As data is updated on the source database, it automatically appears on the Redshift side, within seconds of the transactions being committed on the Amazon Aurora side.

One of the benefits of zero-ETL integration: many of our customers have one database for each of their applications. They have an Aurora database instance for one application, a second application has a different instance, and many of our large customers have hundreds or thousands of these Aurora instances.

So one key benefit of zero-ETL integration is that you are able to bring the data from multiple Aurora databases into a single data warehouse, so that you can get holistic insights across all of your transactional data.

And once the data is in Redshift, Redshift has a very nice feature called data sharing, with which you can share live and transactionally consistent data across multiple data warehouses. These data warehouses could be within the same account, they could be in different AWS accounts, they could be in different regions, and they could even be across different organizations, like your customers and your partners.

This live and transactionally consistent data sharing is very handy because once you have landed the near real time data in Redshift, you are able to share it seamlessly across your organization without having to worry about copies and movement of the data.

Combining zero-ETL with data sharing can be a very powerful mechanism to share data easily across the organization.
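
As a hedged sketch of how that combination could look in SQL (share names, schemas, and namespace GUIDs below are placeholders, and whether a particular schema can be added to a datashare depends on your setup):

```sql
-- On the producer warehouse that receives the zero-ETL data:
CREATE DATASHARE ops_share;
ALTER DATASHARE ops_share ADD SCHEMA public;
ALTER DATASHARE ops_share ADD ALL TABLES IN SCHEMA public;

-- Grant the share to a consumer warehouse (consumer namespace GUID is a placeholder):
GRANT USAGE ON DATASHARE ops_share TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';

-- On the consumer warehouse, surface the shared data as a local database
-- (the GUID here is the producer's namespace, again a placeholder):
CREATE DATABASE ops_shared
FROM DATASHARE ops_share OF NAMESPACE 'ffffffff-1111-2222-3333-444444444444';
```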

Redshift of course has other capabilities. For example, on the left side here, zero-ETL is one way to bring in data from operational databases, but you probably also have some data in an S3 data lake, some data coming from a different system, and maybe some events data coming from a streaming service.

The nice thing is that in Redshift you can join all of this data - your data lake data, your streaming data, your operational data coming from Aurora. Redshift is even integrated with AWS Data Exchange, so you are able to access third-party data sets and join them with your first-party data sets.
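
As a hedged sketch of what such a cross-source query could look like (every database, schema, table, and column name below is invented for illustration; the zero-ETL database is queried with Redshift's three-part cross-database notation):

```sql
-- Join near real time operational data (the zero-ETL database), historical data
-- in the S3 data lake (a Spectrum external schema), and recent events from a
-- streaming-ingestion materialized view. All identifiers are placeholders.
SELECT o.customer_id,
       SUM(o.order_total)    AS orders_today,       -- from Aurora via zero-ETL
       MAX(h.lifetime_spend) AS lifetime_spend,     -- from the S3 data lake
       COUNT(c.event_id)     AS clicks_last_hour    -- from streaming ingestion
FROM apg_etl_db.public.orders o
JOIN spectrum_schema.customer_history h ON h.customer_id = o.customer_id
LEFT JOIN clickstream_mv c ON c.customer_id = o.customer_id
GROUP BY o.customer_id;
```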

And then on the right side, you have all types of analytics. You have business intelligence - BI tools like Tableau and MicroStrategy, different kinds of BI environments from which you can access the data - you can build data applications, you can do machine learning, and you can run Spark using EMR and Glue directly on top of the Redshift data.

The key point is that once you have the data in Redshift, you have all the power and all the rich analytics of Redshift accessible for your users.

So let's talk about some of the recent developments in zero-ETL.

We launched zero-ETL last year as a beta at Re:Invent. Later we opened it up for a public preview, lots of customers tried it out and provided feedback, and then we launched it as a generally available capability - I'm talking about Aurora MySQL here - which went generally available a few weeks back, I think five or six weeks back.

In the Aurora MySQL zero-ETL integration, we have added a number of capabilities, for example expanded region support. We have made the APIs available - in the preview time frame it was primarily accessible through the console - and we have improved the getting started experience, so it's much more seamless now to try out the zero-ETL integration. We have also added things like events.

So whenever an integration gets created, you can subscribe and say, ok, tell me when the integration got created or when it got changed - all these things you can subscribe to. We also got a lot of requests for Aurora MySQL around JSON support, so that has been added as well.

So Aurora MySQL zero-ETL is generally available, and we have a number of customers that are using the integration successfully.

Just a couple of examples here. Money Forward is a financial management platform, and as you can see here, they talk about the real time data synchronization between Aurora MySQL databases and Redshift. Previously it could have taken them months to set up an analytics environment; they were able to do it very, very quickly.

Another important point is that with zero-ETL integration, the replication happens at the storage layer, so there is no impact on your actual production environment. That is an important benefit.

Woolworths is another customer with similar feedback: it could have taken months to develop the ETL pipelines, and this is a much simplified experience. Also, because of the near real time data, they were able to analyze their event data much, much faster.

So at Re:Invent, we have extended zero-ETL integration beyond Aurora MySQL to more data sources. Now we have added support for Aurora PostgreSQL.

The nice thing with the zero-ETL integrations is that they all have a consistent experience. Whether you are using Aurora MySQL or Aurora PostgreSQL, the zero-ETL experience is pretty similar: you just create an integration - we are going to show that later in the demo - specify a source, specify your target, and that's it. That experience is uniform across the different data sources. All the benefits that we talked about from a zero-ETL perspective apply to PostgreSQL as well: simplifying the operational burden, near real time access, and, once you have the data in Redshift, being able to do analytics at scale. The PostgreSQL zero-ETL integration is also storage-level replication, so the benefits we talked about around not impacting the production cluster are applicable here as well.

We support DML, which is insert, update, and delete operations, as well as metadata operations - adding new tables, removing tables, adding columns. So both data and metadata operations are supported, and we automatically map the data types. All the standard things that you would expect in a replication capability are there and taken care of.

The preview is available for Aurora PostgreSQL version 15.4 onwards. To start with, we have it available in us-east-2, so you can try it out there, and we already have many customers that are getting ready to try it out. Here are just a couple of examples, and the patterns are pretty similar.

In the case of iCIMS, a talent acquisition company, they are currently doing nightly one-time batch ingestion; now they are able to move to near real time. So the key points are near real time, eliminating the data pipelines, and not having to impact anything on your production clusters.

So with that brief intro, I want to hand it over to Adam to dive into the Aurora PostgreSQL-specific details.

Thanks. So now that we've talked a little bit at a high level about the value of getting data into Redshift, the capabilities of Redshift, and a little bit about the capabilities of the Aurora PostgreSQL zero-ETL integration, I wanted to dive in a bit and look at some specifics: what does it take to set up an integration, what are the challenges that are specific to PostgreSQL, and then I've got a few demos to show in terms of the setup experience, near real time replication lag, and what you can do with the data once it's over in Redshift.

We'll break this up into three sections. First we'll talk about how it's simple to set up an integration. Then we'll talk about how it's simple to manage that integration, and we'll take a peek under the covers a little bit and look at what's going on behind the scenes once you click that create integration button. And then we'll talk a bit about what you do with that data once it's in Redshift. Sound good? Cool.

So, just double clicking on that chart that we were looking at earlier: you've got a bunch of pieces that you need to go and spin up when you want to get data out of an operational database and into an analytical store. You need to put a team together, you need to get infrastructure, you need to figure out: is this going to be batch, or do we need near real time? How do we deploy this? How do we monitor it? How do we ensure that it is healthy? Who's getting paged in the middle of the night if something breaks?

All of these things can take days, weeks, months, even a year to really sort out, and that's because there are a bunch of different components that are needed. There may be reasons that you want to go and build all of this out - if the level of complexity of your workload isn't a good fit for a zero-ETL integration, then you may still need to do this. I'm not trying to say that a zero-ETL integration is a fit for every single use case. But for getting data from Aurora to Redshift, it works very well, and that's because we take all of the stuff at the top of this list and manage it for you. You basically come in and tell us where you want the data to come from, where you want it to go, and with what types of permissions, and we take care of all of that for you. And this lets you get started in minutes.

So even if this is just a proof of concept - what could I do if my data were in Redshift? - it's a lot easier to go and create a zero-ETL integration in a few minutes than to go build a whole data pipeline to move your data.

When you come to create a zero-ETL integration, you first need to figure out: do your Aurora cluster and your Redshift data warehouse live within the same account? Are these different teams? Are these resources in different accounts? The good thing is that we support all of these modes, so you can specify a resource policy by which data, on the Redshift side, is accepted or not accepted from certain accounts or certain clusters. You can have really fine grained control over the authorization and the policy when it comes to where data is moving. This is particularly important if you've got two separate teams managing operational databases and analytics databases, where one team can't just say, oh, I'm going to put my data over here.

You can set these things up through IAM. The next thing - and we'll go into a bit more detail here - is really the simple setup experience. You go and create a database cluster, and there are a few parameters that need to be set on that cluster. As Niraj mentioned, this is a change data capture based system, so on the Postgres side we're using Postgres logical replication. But we've made a number of improvements to how logical replication works, and we've pushed a lot of that processing down to the storage layer to free up compute on the compute node. Then there's ongoing monitoring and error reporting: we provide fine grained reporting of events and issues, and we'll tell you exactly which table has an issue if there is an incompatibility or some sort of unsupported data type. One of the benefits of Postgres is that it is extensible, so for things like extensions and custom data types, if they aren't able to be replicated over, we'll come and tell you, so it's no surprise when something may or may not be there.

Ok. So let's take a look at the initial setup experience. What we've got here on the right hand side is an Aurora PostgreSQL 15.4 cluster that's been created in the us-east-2 preview region. When everyone goes and tries this out after this session, you open up us-east-2, click on the RDS console, then scroll down to the bottom and click Preview. This was just launched in preview at Re:Invent, so it lives in the preview region - I just don't want anyone to be confused if they go looking for this in us-east-2 for Aurora PostgreSQL.

So we've got the cluster set up here on the Postgres side, and similarly we've set up a Redshift cluster in the preview track in us-east-2 as well. We are going to go and fill this cluster with some data. The first step is to create a database; we'll call it apg_source_etl. Then we're going to create a few tables and fill those tables with some sample data. This is a fairly simplistic example, but it shows that when we create the integration, all of the existing data is initially copied over, and then any changes that happen after the integration is set up are replicated over as well. And it's really just that simple.

So, while we're creating a few tables and inserting some values into those tables, I'd like to welcome Bob and Alice to the database. It's worth mentioning that because this is a logical replication based system, you do have to enable logical replication on the cluster when you set it up. Specifically, for zero-ETL for Postgres we've introduced enhanced logical replication. What that instructs the cluster to do is to use a different pathway, where we push the processing of those logical replication files and the decoding down into storage, removing a lot of the compute-intensive work that's required. As much as possible, our vision is to offload as much as possible to storage.
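
If you want to sanity-check those settings from a SQL session on the writer instance, something like the sketch below should work. The exact parameter names - in particular the enhanced logical replication one - are my assumption about how the DB cluster parameter group exposes them, so verify them against the documentation.

```sql
-- Logical replication has to be enabled in the DB cluster parameter group
-- before creating a zero-ETL integration. These SHOW commands only verify the
-- values; the rds./aurora. parameter names are assumptions, not confirmed here.
SHOW rds.logical_replication;              -- expect: on
SHOW aurora.enhanced_logical_replication;  -- expect: on (if this is the parameter name)
SHOW wal_level;                            -- should report 'logical' once enabled
```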

So we've inserted some data into our resources table: we've got a keyboard, a mouse, and a display. The last table we're going to create is a rented resources table. It's important to note that each of these tables requires a primary key - that's good general hygiene anyway - but it's important to create your database and your tables with a primary key for zero-ETL integration.
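
A minimal reconstruction of the demo tables might look like this; the column names and sample values are pieced together from the narration rather than copied from the actual demo script.

```sql
-- Every replicated table needs a primary key so zero-ETL can track row changes.
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE resources (
    resource_id       INT PRIMARY KEY,
    resource_name     TEXT NOT NULL,
    quantity_in_stock INT
);

-- Sample rows matching the narration: Bob and Alice, plus some inventory.
INSERT INTO employees VALUES (1, 'Bob'), (2, 'Alice');
INSERT INTO resources VALUES (1, 'keyboard', 10), (2, 'mouse', 5), (3, 'display', 3);
```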

So we're just inserting the last bit of data, and then we're going to go over to the console. Almost there - one last table: we're going to create our returned resources table and set that one up as well. So we're just creating a few example tables that we'll play with in Redshift. One other thing to note is that today this works just with provisioned clusters. If you take a look at the Aurora MySQL side, we support both serverless and provisioned, so we'll get there on the Postgres side as well.

Now that we've set up our database, we click on the zero-ETL integration tab, scroll down, and click Create zero-ETL integration. We're going to name that integration whatever we want and select our source database cluster. And because of the Postgres resource model - you can create multiple databases within a cluster - you need to tell us what that database is called. The mapping is one database per integration.

Then you select the target destination data warehouse, you can optionally assign tags, and you can specify your own encryption key. Then we click Create zero-ETL integration, and as you can see on the final screen here, under Status it shows Creating. Typically the process takes 20 to 30 minutes, so we're not going to sit around and wait for that.

But I will tell you a bit about what's going on behind the scenes while the integration is creating - what happens once you click that orange button. You've got your source cluster; in this case it's a 15.4 cluster. As has been said many times about Aurora, one of the key innovations is the separation of compute and storage - really, the log is the database. So we write the write-ahead log, the WAL, down to storage, and we push all of the processing of that WAL and the writing of the database page files down to Aurora storage to offload that from the compute node.

So you've got your transaction log going down to Aurora storage, and then you go and enable logical replication. With logical replication in Postgres, there are replication slots to manage, and depending on what's going on, there's resource contention: with any system that you're building yourself, you're going to have to go through the compute cluster to get your data in and to get your data out.

So we thought, well, why can't we get the data directly out of Aurora storage? Why do we need to go through the compute cluster when it's time to do any of this work? We've got storage that's managed by AWS for Aurora, and we've got storage that's managed by AWS on the Redshift side - why can't we have these things work together on your behalf?

So when you go and create that integration, we take an export of everything that's in your database - we use a performance optimization that we've made to the export function in RDS - we get that data out, and we load it into Redshift. While that was happening, we've been collecting the change data logs, so anything that changed while we were exporting and loading that initial seed data, we're able to catch up based on the changes that have been stored using logical replication.

And we've built a customized set of streaming compute servers to read the data directly from Aurora storage and push it over to Redshift. This is a fully managed set of functionality that you don't have to worry about - you don't have to think about spinning up that compute or managing it just to move the data out of Aurora storage.

And while all this is going on, we're monitoring the entire system. Logical replication in Postgres has some limitations, particularly around DDL. With zero-ETL integration we're able to support some DDL, and when we encounter a DDL statement that isn't supported via logical replication, we intelligently go and reseed the data.

So we take another export and load that into Redshift to make sure that the data and the data structure - the schema - on the Aurora side are the same as on the Redshift side. All of this is happening behind the scenes, automated; things are recovering as needed, which brings me to the next point: resiliency.

There are multiple points in this system where we have built in resiliency and automated recovery from issues, so you don't have to worry about it - you're not worried about getting paged in the middle of the night, and you're not worried about what happens if something breaks. I'll just touch on a few of these.

First, on the Aurora PostgreSQL side, you can create multiple read replicas of your Aurora PostgreSQL database, and they are all connected down to the same storage layer. So if something happens to one of those instances, you can seamlessly fail over to a replica, and this has no impact on the zero-ETL integration because everything is done at the storage layer.

Similarly with Aurora storage: when you write data to Aurora storage, we copy it and write it to six different places across multiple Availability Zones. So if something happens to one of those storage nodes, you've got five other copies of your data. For every write that comes in, we wait for a quorum of at least four of those writes before we acknowledge the write to the compute instance. Similarly, on the Redshift side, Redshift and its managed storage have capabilities to enable resiliency and high availability.

So even in the face of issues, you're able to fail over and get back up and running very quickly. And then in the middle, there's that CDC streaming fleet I spoke about - that's fully managed for you and built to be resilient. If something happens to the compute, we switch over to another compute instance, and because the change data capture log - the logical replication log - is stored in Aurora storage, we just catch up on whatever files we didn't read during the few seconds it took to switch over to a different compute instance that's streaming data out.

We just go and catch up those changes and push them over to Redshift. And we're always able to reseed the database as necessary, so we're able to re-export the database and the tables and push those over to Redshift. As I mentioned, all of this is happening behind the scenes automatically and without any intervention. We do report when things happen, so you can see lots of details around what the system is doing and where it encountered issues, and you can report on lag.

For instance, with the streaming server, if it needs to fail over to a different server, you may experience that as an increase in lag. But you can see, in the Redshift console, fine grained metrics around lag and how it is shaped over time.
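
Besides the console charts, you can also inspect integration state with SQL on the Redshift side. The system view names below are my best guess at the relevant views and may not be exact, so treat them as assumptions and check the Redshift documentation.

```sql
-- List the zero-ETL integrations known to this warehouse and their state.
-- View names are assumptions; confirm them against the docs.
SELECT * FROM svv_integration;

-- Recent replication activity, useful for spotting a temporary increase in lag
-- (for example, after a failover of the managed CDC streaming fleet).
SELECT * FROM sys_integration_activity LIMIT 20;
```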

So going back to the console, we've got our integration, and you can see the status has changed to Active. Let's play around with this a little bit and take a look at the system. The first thing to note, though, is that the integration is not quite complete - going back to what I was saying before around IAM policy and having fine grained control over data coming in and data going out.

You need to accept the data: the Redshift administrator or the Redshift user needs to accept the integration into the Redshift data warehouse and create a database that is connected to that integration. So we're just going to grab our integration ID, which is just a unique identifier, and we're going to say: create database rrs_redshift_target_etl from that integration.
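
That step looks roughly like the sketch below. The integration ID and the database names are placeholders, and the clause that names the source Postgres database reflects my reading of how the PostgreSQL variant works, so confirm the exact syntax in the docs.

```sql
-- Run on the Redshift side as an administrator after authorizing the integration.
-- '<integration-id>' is the unique identifier copied from the console;
-- apg_source_etl is the source database named when the integration was created.
CREATE DATABASE rrs_redshift_target_etl
FROM INTEGRATION '<integration-id>'
DATABASE apg_source_etl;
```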

And now we've got our database over on the Redshift side. So we'll connect to that database and take a look at what's there. Select star, hit run, and there's Bob and Alice - they've made it over from Aurora to Redshift, and as you can see, it matches exactly what we inserted on the Aurora side.

Now, I know this is a video - this is a recording - but it was recorded live, so this is real data happening in real time as we're watching it. Let's take a look at another table, the resources table. It's matching: keyboard, mouse, display, and the quantities in stock are 10, 5, and 3. So that matches as well.

Now let's see what happens if we insert some new data. We're going to welcome Noel into our store - we have a new hire - and there Noel is on the Aurora side. A couple of seconds have gone by; let's see if he's made it over to the Redshift side... and there he is.

So when we talk about near real time operational analytics, it's a few seconds to get data on an ongoing basis - not a batch, not a daily batch or a weekly batch, but on an ongoing basis from Aurora over to Redshift.

A similar thing works when you want to create a table. So let's go create a new table - and this is DDL, which is getting processed through logical replication running in Aurora PostgreSQL using the optimizations that we've made for zero-ETL.

So we'll create that table and insert a couple of records into it, and give that a chance to finish. We're going to review the mouse - everyone has their favorite mouse; this one worked well - and we've inserted that. So we've created a brand new table and inserted some data into it, and a few seconds later that shows up in Redshift.
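
Reconstructed from the narration, that step might look something like the following; the table and column names are my invention.

```sql
-- DDL issued after the integration is active is replicated too: the new table
-- appears in Redshift, followed a few seconds later by its rows.
CREATE TABLE resource_reviews (
    review_id   INT PRIMARY KEY,
    resource_id INT REFERENCES resources (resource_id),
    review_text TEXT
);

INSERT INTO resource_reviews VALUES
    (1, 2, 'Everyone has their favorite mouse - this one worked well.');
```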

And is it there? Not yet. So, just to reemphasize the point that this was recorded live: lag is variable depending on what is getting inserted into Aurora and the processing that needs to happen. The lag can vary from low single digit seconds up to double digit seconds, so there is some variability there. But now that we have waited a couple of seconds, the data is there.

So we will create another table, test_resource_cleanup, and go through a couple of steps to add data. We will show a transaction that adds data to an existing table and then truncates a table, and we will show that these things are all getting replicated over.
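
A hedged reconstruction of that sequence (table names and values are illustrative only):

```sql
-- A scratch table plus a single transaction that touches multiple tables:
-- insert into an existing table, then truncate the scratch table.
-- The committed result is what gets replicated over to Redshift.
CREATE TABLE test_resource_cleanup (
    cleanup_id INT PRIMARY KEY,
    note       TEXT
);

BEGIN;
INSERT INTO resources VALUES (4, 'docking station', 7);
INSERT INTO test_resource_cleanup VALUES (1, 'temporary row');
TRUNCATE test_resource_cleanup;
COMMIT;
```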

This is really powerful when you want to have all of these changes and support application development with Aurora as an operational data back end, but then you want to be able to get all that data out and analyze it on the Redshift side.

Now, this is a mirror image: all of the data that is in your database within Aurora PostgreSQL is getting copied over. And we have heard lots of feedback around the ability to filter out particular tables from a database, or particular columns.

So that is one area where we've heard lots of feedback. We appreciate that feedback, and it's something that we are researching.

So all of this is happening on the back end. We've now updated an existing table - we have incremented some of the stock based on new inventory - and we're going to drop a table. So drop changes are also replicated over to Redshift.

The reason we're going through so many of these small examples is because this is hard to do: PostgreSQL logical replication doesn't support a lot of these things right out of the box. So as much as possible, we're trying to build in functionality to support the breadth of DDL commands that you can throw at PostgreSQL, and to replicate the impact of those changes over to Redshift in a performant way.

So we've dropped a table - it doesn't exist anymore; it's removed from the database on the Redshift side. And then lastly, we ran the transaction to set the stock in our resources table and drop that other table, and now we've got our newly updated resources table on both the Aurora side and the Redshift side.

And now we're going to drop the returned resources table, and you can see that it's replicated over on the Redshift side. So that's a variety of examples of common things that you might do to a database: inserting data, updating tables, adding a table, dropping a table, and a transaction that touches multiple tables at one time.

All of these things are getting replicated in near real time over to Redshift. And that's great and all, but the point is: what do you do with the data once it's in Redshift? As Niraj was talking about earlier, Redshift supports a wide range of really powerful analytics capabilities, from being able to share data out to other parts of your organization, to integrations with other AI/ML services, to materialized views and being able to aggregate data from multiple data sources into a single view.

So let's take a look at a couple of those briefly. In this next demo, we're going to use a different data set - just a sample ticket sales demo set. We're going to create a materialized view over the data, and then we're going to create an AI/ML model based on that data that forecasts sales for the next month.

Initially, we'll create this materialized view, daily sales by event. We input a number of columns there: quantity sold, price paid, the event, sales amount, and the dates. And you can see that we've got data from the beginning of 2022 out to the end of 2022. But say it's the end of the year and we want to know what our potential sales in January are going to be.

We can use this type of pattern to generate some forecasts that we can then use as the basis for decision making. So we created our view, then we use the forecast function in Redshift ML to create a model, and using that model we pump this data in and generate a number of forecasts across a number of different runs of the data at different confidence levels.
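
Pieced together from the narration, the pattern looks roughly like the sketch below. The materialized view is standard Redshift SQL over a TICKIT-style sample schema, but the forecast model settings and the FORECAST invocation are assumptions about the Redshift ML forecasting syntax, so double-check clause order and setting names against the documentation before relying on them.

```sql
-- 1. Aggregate raw ticket sales into a daily series per event.
--    Table and column names follow the sample ticket-sales data set.
CREATE MATERIALIZED VIEW daily_sales_by_event AS
SELECT eventid,
       saletime::date AS sale_date,
       SUM(qtysold)   AS quantity_sold,
       SUM(pricepaid) AS sales_amount
FROM sales
GROUP BY eventid, saletime::date;

-- 2. Train a forecasting model on that series with Redshift ML.
--    The SETTINGS keys (horizon, frequency) and clause order are assumptions.
CREATE MODEL forecast_daily_sales
FROM (SELECT eventid, sale_date, sales_amount FROM daily_sales_by_event)
TARGET sales_amount
IAM_ROLE default
AUTO ON
MODEL_TYPE FORECAST
SETTINGS (HORIZON 30, FREQUENCY 'D');

-- 3. Materialize the forecasts (per-event predictions at several confidence levels).
CREATE TABLE january_sales_forecast AS
SELECT FORECAST(forecast_daily_sales);
```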

And that's what you see down below here, where we've generated a table based on the model and we are displaying the list of potential outcomes. This is just a small example of the capabilities: you've got operational data coming in, things are changing quickly, and you can create this type of system to constantly update a forecast or whatever is relevant to your business. You can keep feeding these analytic systems with operational data in near real time. And that's really the power of this system: leveraging what each of these systems is built for and being able to move data between them in a very seamless, easy way.

Just taking a step back here: zero-ETL is part of a vision. We want to enable you to perform analytics on your transactional data without building complex data pipelines, and we're at the beginning of this journey. By freeing up your teams to focus on getting value out of the data instead of operationalizing the data, you can take leaps forward in your cloud data strategy, get to better data driven decision making, and use this data to feed other applications and other use cases. And as you saw at Re:Invent this week, we've announced a number of new zero-ETL integrations.

We're really just getting started here. There are a number of other zero-ETL sessions that cover other source and target pairs, such as RDS MySQL. This session has been about Aurora PostgreSQL, and there are a couple of Aurora MySQL sessions as well. But this just gives you a glimpse into the vision.

We think this approach is working well for customers, but we'd really invite you to take the public preview for a test spin. Let us know what you think, let us know if it works for your use case, and let us know what else you want to see out of the Aurora PostgreSQL zero-ETL integration with Redshift.

It's in preview today, so you can go try it out, and as we work towards GA, we'd love your feedback. Thanks for listening, and please don't forget to complete the session survey in the app. We've got about 10 minutes, so we can take a few questions if you'd like.
