Set up a zero-ETL-based analytics architecture for your organizations

Good afternoon. Welcome to this lunchtime session where we're gonna be talking about the Aurora MySQL Zero ETL integration with Amazon Redshift.

Thanks to you all for joining. If you've got your lunch, I hope it's delicious, and that we can give you some useful information while you're enjoying it, digesting it, or looking forward to it.

My name is Adam Levine, and I'm a product manager on the Amazon Aurora team. Joining me today is Jyoti Aggarwal, a product manager on the Redshift team. And then we have two speakers joining us from Intuit, Smits Shaw and Rakesh Ra, who are going to share how they are using the Aurora MySQL Zero ETL integration with Amazon Redshift.

So we've got a pretty packed agenda today. We will start off by talking about, what do we mean when we talk about operational analytics? Like what's the point of that? We'll talk a little bit about specific to this Zero ETL integration, how AWS powers operational analytics, we'll talk a little bit about recent developments, recent announcements. We'll hear from Intuit and then we'll do a little deep dive, talk about how the integration works, show some demos of what you do with the data once it's in Redshift.

The goal here is that you leave with a good understanding of what the Zero ETL integration can do and how it can help your business. And then hopefully you go home, try it out, and make your lives easier.

So let's start off by talking about what we mean. I think we can all agree that data is a source of competitive advantage today. There are multiple ways to use data: there are horizontal use cases and industry-specific vertical use cases. But the key enabler here is data, and being able to access that data to drive insights faster and in near real time, so that you can take action on changes in the data and leverage it to drive business decisions and business outcomes.

So just by a show of hands, who here in this room needs to analyze data from operational databases? Ok, so most of you. And by a show of hands, who here is responsible for building, maintaining, and integrating with data pipelines? Ok, so fewer hands. And who enjoys the complexity of keeping all that stuff up and running? Even fewer hands.

So this is really why we're investing in Zero ETL integration. We want to make it simpler and more reliable to take data that's generated or stored in one system and use it in other systems within AWS.

We have a purpose-built data strategy: we want you to be able to use the right tool for the job when storing and processing data. When it comes to operational data and transactions, Amazon Aurora is a high-performance, highly available operational database that is both MySQL and PostgreSQL compatible, at 1/10th the cost of commercial databases.

And the key innovation with Aurora is the separation of compute and storage, so these two layers can scale and operate independently to provide high availability, high-throughput transaction processing, and durability. When data is written to Aurora, it's actually written six ways with high durability in Amazon Aurora storage. And the separation of compute and storage enables a number of really powerful features for online transaction processing, such as automatic and continuous backups, fast database cloning, serverless, and the ability to replicate data across regions.

And this concept of having the storage layer do work is key not just for Aurora but for Zero ETL integrations as well. Amazon Aurora is the fastest growing service in AWS history and is used by over 95% of the top 1000 AWS customers. And you can see just a few examples of some customer names here.

Now on the other end with analytics, Amazon Redshift sits at the center of your data journey. It's a fully managed petabyte scale data warehouse that's deeply interconnected to our data offerings to deliver the best price performance in a cloud data warehouse. And Redshift takes a similar approach with the separation of compute and storage as well.

And just a few examples of Redshift by the numbers: tens of thousands of customers use Amazon Redshift today. With Zero ETL integration, the time from when data is written to Aurora to when it's available to query in Amazon Redshift is about 15 seconds. Exabytes of data are processed each day. And with integrations with AI/ML tools within AWS, Amazon Redshift contributes to 10 billion predictions per day with Amazon Redshift ML.

But the key reason people are choosing Amazon Redshift is performance. So Redshift delivers up to 6x better out of the box price performance compared to competitors. And so on this side, lower is better. And then for more common short queries, which form the majority of customer workloads that require concurrency scaling, Redshift delivers up to 7x better price performance for workloads such as BI dashboards. And this is a really compelling, powerful number for those that are looking to save on costs. This can really change the game.

And then once your data is in Redshift, it opens up a number of other opportunities to share data within your organization without having to copy the data multiple times, whether that's a data mesh architecture or a hub-and-spoke architecture. Amazon Redshift can support your needs in whatever way is required.

I'd actually invite you all to attend the What's New in Amazon Redshift session immediately following this one; I believe it's also in this building. So go take a look and learn about some of the recent innovations with Amazon Redshift.

Similarly, tens of thousands of customers are analyzing exabytes of data every day in Amazon Redshift. And so you can see, just some of the names as an example here.

So let's turn our attention back to Zero ETL and integrations. There are many reasons to build a data pipeline, right? So if you need to process data changes, you may still want to build a data pipeline. But maintaining and managing all of those components when all you want to do is get your data, your operational data from Aurora into Redshift can be overly complex.

And that's why we built the Zero ETL integration. We want to make it really simple and secure to get data out of Aurora and into Redshift so you can analyze that operational data in near real time and build on that. We support multiple source integrations to the same data warehouse.

And so you can create an aggregate view across all of your Aurora database clusters in the same data warehouse to get a single unified view of petabytes of operational data once it lands in Redshift. That's what we mean when we talk about operational analytics and how we're approaching analytics.
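To give a concrete picture, a unified query across two integrations landing in the same warehouse might look roughly like the sketch below. The database, schema, and column names are placeholders, not ones from an actual setup; each Zero ETL integration surfaces as its own database in Redshift, and cross-database queries use three-part names.

```sql
-- Rough sketch: two zero-ETL integrations from different Aurora clusters land as
-- separate Redshift databases (orders_east_db, orders_west_db are placeholder names).
SELECT 'east' AS source_cluster, order_id, order_total, created_at
  FROM orders_east_db.appdb.orders
UNION ALL
SELECT 'west' AS source_cluster, order_id, order_total, created_at
  FROM orders_west_db.appdb.orders;
```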

As you've seen, at re:Invent we announced other source and target Zero ETL integrations. Specific to Aurora and Redshift, I just want to review a few of those announcements and recent news.

The first is that recently Aurora MySQL Zero ETL integration with Redshift reached General Availability. And the difference between preview and GA is that we added additional regions, we improved performance, we added support for the API and AWS CLI, we revamped the getting started experience.

And actually there's a sort of a follow on session to this breakout a little bit later today where we'll go through that improved getting started experience to help you get the Zero ETL integration set up and running faster.

We also improved events and notifications so you can subscribe to events to know exactly what's going on with your Zero ETL integration as it happens. And then we're also continuously working to expand the support for different data types. So in preview, we did not support JSON, and now in GA we support JSON.
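As a rough illustration of what that JSON support enables, here is a sketch with made-up table and column names; it assumes the MySQL JSON column surfaces as a SUPER value on the Redshift side, which you should verify for your own integration.

```sql
-- On the Aurora MySQL source: a table with a JSON column (hypothetical names).
CREATE TABLE app_events (
  event_id BIGINT PRIMARY KEY,
  payload  JSON
);

-- On the Redshift side: assuming the JSON column is exposed as a SUPER value,
-- it can be navigated with PartiQL-style dot notation.
SELECT event_id,
       payload.country::varchar AS country
FROM   zeroetl_db.appdb.app_events
WHERE  payload.event_type::varchar = 'signup';
```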

And we have a number of customers that are happily using Aurora MySQL Zero ETL integration mostly to reduce operational overhead when it comes to building these data pipelines and integrations.

So one example - Woolworths in South Africa, a leading retailer, they're producing the same results within a day that would have otherwise taken them two months to develop. So think about what you can do with that time that you're saving from instead of building and maintaining complex data pipelines, focusing on what you can do with those insights and how you can better derive more comprehensive and sophisticated insights based on the data instead of just focusing on schlepping that data around.

And then at re:Invent, we announced support for Aurora PostgreSQL. There's a separate breakout session that happened yesterday, that recording will be available online that deep dives the Aurora PostgreSQL Zero ETL integration with Redshift. But we're taking a similar approach for both of these integrations, whether the data is coming from Aurora MySQL or PostgreSQL.

We want to reduce the operational burden of building data pipelines to provide you near real time data access so you can leverage all that Redshift has to offer to drive analytics at scale.

To bring this to life a little bit more, I'd like to invite Smits Shaw from Intuit up on stage to talk about how Intuit is using the Zero ETL integration.

And the primary goal for us was to deliver features quickly to our customers while also making sure that our architecture is scalable enough to handle the requirements for the next decade or even more.

And when I talk about the migration, it was not just a straightforward move from one relational database to another, or just writing new services that access the existing database. It was a complete overhaul of the data model and of how we decomposed our architecture.

And that is why it was very important for us to develop a migration framework that was seamless and robust. We were very clear on our principles: we could not afford any downtime for our consumers, and data consistency was paramount; we could not compromise the data. We also wanted to make sure that with whatever stack we developed, we were operating in this double bubble, so that if for any reason things didn't work out on the new stack, we could go back to the better-tested stack we already had.

So we wanted to make sure we could get real-time insights into our whole data migration journey, and that we could provide crystal-clear transparency for our stakeholders.

Now, if you really look at this, what we're talking about is migrating the hundreds of millions of customers and small companies that we have from our legacy stack to the new stack. All our customers are operating around the globe, so we had to slice and dice the data to make sure that we did not impact them.

And when I talk about ensuring clear transparency for our stakeholders, this is something we did not figure out on day one. It was when we started doing the migration in our pre-production environments that we realized our team was spending a lot of time and energy updating all our stakeholders, all our business units, on the migration status: where are we with the migration, what errors are happening, are we done yet? They also wanted the confidence that the migration was going well for the other products before kick-starting it for the more critical products. That is why we said we needed to figure out a way to provide some asynchronous mechanism for them to look at the data in real time and gain that confidence. And not just for our external stakeholders; even for us, we wanted data in real time so that we could see how the migration was going and what errors were happening, and use that as a feedback loop to improve our migration journey.

So this is a little bit of a high-level architecture of how we looked at this problem. On the left is the legacy platform that we have, and on the right is the more modern microservices-based architecture.

And if you really look at it, the overhaul I was talking about is that we adopted a completely new API strategy through GraphQL APIs, and we completely changed our database, going from legacy relational databases to a more modern NoSQL DynamoDB database. It was a complete shift in the way we were doing things as far as identity was concerned.

And as I said, we couldn't afford any downtime. So what we did is we adopted an adapter pattern and developed an anti-corruption layer, so that regardless of which API you come through, the traffic gets routed either to the legacy stack or to the new identity stack, depending on where the user's data lives.

And what we did is we developed a Spring Batch based framework that would take the data from the legacy stack and migrate it to the new stack. The backend database for that Spring Batch application was Aurora Serverless. So we had built this pipeline from our legacy system to seed the data into Aurora Serverless.

And I wish we had had Zero ETL there; we wouldn't have had to build that pipeline. But unfortunately it wasn't available yet, so we had to build an AWS Glue based pipeline to seed all that data into Aurora Serverless. And then the batch applications would pick up the jobs and execute the migration.

Now, all of that data was sitting in Aurora Serverless, and as I said, we wanted to power our dashboards so that all the stakeholders could look in real time at what's happening with the migration. We also wanted to run all sorts of real-time analytics and ad hoc queries.

And if you really look at that connection from Aurora Serverless to Redshift, it's just a simple arrow, right? If that integration were not there, we would be looking at four or five more boxes in between to get the data from Aurora Serverless to the Redshift cluster.

Let's talk about the outcomes. Like I said, what this has enabled for us is that all of our stakeholders can look at the data in real time and get that confidence that the migration is going really well. We've been able to identify patterns in the failures that are happening, either through our queries or through looking at our dashboards.

And we've been able to use that as a feedback loop to improve our overall migration process. At a high level, with the earlier approach we were looking at, it would have taken us at least 4 to 5 hours to get that data into Redshift, and our dashboards would have been stale. With Zero ETL, we were able to get all the data into the Redshift cluster almost instantly, and we were able to do our analytics pretty much in real time.

I would say that something that would have taken us at least a couple of months to develop, with Zero ETL we were able to deliver in production in a couple of days. So this was amazing. And we are a big Redshift shop, so this was just the first use case. I'm hoping that as we go along, there are going to be a lot more use cases where we'll be able to leverage Zero ETL to get data into Redshift and do a lot more real-time analytics with it.

I'm going to hand it over to Rakesh to give a quick demo. Rakesh, all yours.

That was awesome. Thank you, Smits. Good afternoon, everyone. I'm Rakesh, and I'm part of the identity team at Intuit. Today, I'm excited to share a demo of the migration process, which uses Aurora Serverless v2 with analytics backed by Redshift.

We initially started with a complicated pipeline that replicated the data from Aurora to Redshift. But thanks to Zero ETL, it became very simple, and it reduced the replication lag from hours to seconds.

The setup is pretty simple: you just have to launch the source and the destination. In my case, the source is Aurora Serverless v2 and the destination is Redshift.

I'll be running a couple of queries to get counts on both Aurora and Redshift. And this is the dashboard built on top of Redshift. It has a few filters, it shows the test user count, it has a bar chart indicating the various migration statuses, and it has a table that gives you a clear account of what was migrated.

So let me go ahead and update the Aurora table. I'll be mocking the migration process: I'm going to select a few records from the table, and as you can see, the migration status for all these records is "not migrated". I'll mock the migration so that, say, 10 records are updated to "migrated", and then 10 more records, let's say, fail and get marked as "skipped" for whatever reason.

Once these records are updated, let me just go back and select them. Ok, now the update is complete. Let's go back to the dashboard built on Redshift. We anticipate the number of migrated users to increase by 10, the number of users still to be migrated to decrease by 10, and the skipped records to increase by 10.
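As a rough sketch of what that mock migration and the dashboard's rollup query could look like in SQL, with table and column names that are placeholders rather than Intuit's actual schema:

```sql
-- On the Aurora MySQL source: mark 10 users migrated and 10 skipped (mock update).
UPDATE user_migration SET migration_status = 'MIGRATED'
 WHERE migration_status = 'NOT_MIGRATED' LIMIT 10;
UPDATE user_migration SET migration_status = 'SKIPPED'
 WHERE migration_status = 'NOT_MIGRATED' LIMIT 10;

-- On Redshift: the dashboard's rollup query sees the new counts within seconds.
SELECT migration_status, COUNT(*) AS user_count
FROM   identity_db.migration.user_migration
GROUP  BY migration_status;
```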

So let me just go ahead and refresh the dashboard. Wow, this is amazing: the counts got updated. The updates I made were on Aurora MySQL, but they got reflected on the Redshift dashboard. As a business unit owner, this gives me a lot of confidence in the data; I don't need to go back to an engineer and ask whether the data is accurate.

And as an engineer, it's invaluable to me: I can directly dig in, find out what caused a failure, and go rectify it. This is also a win-win situation: the source and the destination are both serverless, so they can scale efficiently. And to top it off, the integration between them is near real time. It has saved a lot of engineering effort, and not only that, it has also saved a lot of cost in maintaining the pipeline.

So this concludes the demo. Thank you, AWS, for providing this amazing feature to empower Redshift and thereby helping Intuit accelerate the migration. I will hand it over to Jyoti.

Awesome. Thanks, Rakesh and Smits, for taking us along on your Zero ETL journey.

Hi everyone, good afternoon. My name is Jyoti Aggarwal, and I'm a senior product manager with Redshift.

Now I wanted to take a few moments to talk about why Zero ETL is different than the other solutions, how it works and how you can use it.

Zero ETL is a means to an end and that end is getting all this data into Redshift so that you can generate more value out of it.

Zero ETL integrations are easy to set up. You can use either the console, AWS CLI or the APIs to get started within a few minutes. They are easy to manage with our system handling most of the DDLs and DMLs for you and with inbuilt monitoring and observability so that you have more confidence and visibility into the system. And it gives you access to powerful analytics as soon as your data lands in Redshift.

In the upcoming slides, we’ll dive a bit deeper into each of these categories.

Now, building a secure and reliable data pipeline takes time, effort and talent. Engineers usually maintain a checklist or an Excel of some sort with a laundry list of items that they need to take care of while building a pipeline.

One such checklist is up on the screen right now, and needless to say, these tasks also have their own subtasks that need to be taken care of.

For example, let's talk about building out a solution, which would include picking the warehouse that works for your needs and defining the data extraction process, from writing scripts and configuring APIs to ensuring there are end-to-end correctness checks in place.

There are validations in place so that the data in the warehouse is complete. And while all this is going on, a process that could take days, weeks, or even months, you cannot use this data for any data-driven decisions in your organization.

Well, look no further. Instead of spending these many undifferentiated hours building the pipeline and going through this frustrating process, use Zero ETL integrations and get started within a few minutes and focus on generating value out of your data.

We use simple IAM based authorization policies so you can easily manage users and accounts that have access to your Zero ETL integration. And we also support cross accounts so you don't have to worry if your Aurora user is in a different account than your Redshift user.

Then, by choosing the Aurora database you want to run analytics on, the target Redshift warehouse, and the method of your choice (console, AWS CLI, or API), you can get the integration up and running within a few minutes.
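For the Redshift side of that setup, the step of attaching the integration to a database looks roughly like the sketch below. The syntax and integration ID here are illustrative; use the values shown on your own console after the integration is created.

```sql
-- Sketch of the Redshift-side step after the integration is created
-- (the integration ID below is a placeholder, not a real one).
CREATE DATABASE zeroetl_db
FROM INTEGRATION 'a1b2c3d4-5678-90ab-cdef-EXAMPLE11111';
```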

We even support multiple Aurora databases streaming near real time data to the same Redshift warehouse so that you can run unified analytics across your applications and break data silos.

We've built a resilient system that monitors the health of your integration in the background and in most cases is able to keep it up and running.

We've also added comprehensive monitoring and observability capabilities through Redshift system tables, Redshift console and AWS CloudWatch.
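As a rough example of what that looks like from SQL, with system view names that should be treated as assumptions to verify against the documentation:

```sql
-- Sketch of checking integration health from SQL. The view names are assumptions
-- based on the Redshift system tables mentioned above; columns vary, so SELECT *.
SELECT * FROM svv_integration;            -- one row per zero-ETL integration
SELECT * FROM sys_integration_activity;   -- recent replication activity and errors
```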

Please join us later today in session T218, where Adam and I will go through the end-to-end demo.

Now before we dive into how we made our system easy to manage, let's talk about some of the optimizations we made to the binlog.

Now, both Redshift and Aurora have a separation of compute and storage which allows us to push down a lot of work to the storage layer without impacting the compute.

The Aurora MySQL Zero ETL integration relies on various existing components that have been optimized for performance such as the Enhanced Binlog.

Enhanced Binlog is a standalone feature of Aurora MySQL 3 and is a performance optimization on binlog. When you turn on binlog on any MySQL cluster running anywhere, you typically experience a performance hit. In Aurora MySQL, customers experienced up to a 50% performance degradation, as measured by throughput, when they turned on binlog.

But with Enhanced Binlog turned on, this is brought down to an average of 13%, and as low as 5% in some cases. We do this by rethinking the way we process the binlog and pushing as much work as possible down to the storage layer.

In community binlog, which is the top picture, the transaction log files and binlog files are written serially, and then a two-phase commit is required to commit the transaction. But with Enhanced Binlog, we write the transaction log files and binlog files in parallel on a per-transaction basis, and we even push the processing and ordering of binlog files down to the storage layer to speed things up.
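If you want to check what is in effect on your own cluster, a quick sketch is below. The Aurora-specific parameter name is an assumption, and the actual settings are managed through the DB cluster parameter group rather than from SQL.

```sql
-- Verify the binlog configuration on the Aurora MySQL source cluster.
SHOW VARIABLES LIKE 'binlog_format';            -- zero-ETL expects ROW
SHOW VARIABLES LIKE 'aurora_enhanced_binlog';   -- assumed parameter name; 1 when Enhanced Binlog is on
```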

Such enhancements allow us to seed your data, and the ongoing changes you make to it, from Aurora storage into Redshift, unlike some other solutions that use the source compute and lead to resource contention.

The system handles most of the DDLs and DMLs for you, such as a column drop, a column add, or a table update, and the changes get reflected in near real time in Redshift.
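A hypothetical example of that propagation, with placeholder database and table names:

```sql
-- On the Aurora MySQL source: add a column.
ALTER TABLE appdb.orders ADD COLUMN coupon_code VARCHAR(32);

-- A few seconds later on Redshift, the new column appears without any manual DDL.
SELECT column_name, data_type
FROM   svv_all_columns
WHERE  database_name = 'zeroetl_db'
  AND  table_name    = 'orders';
```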

And while all this is going on, we monitor the health of your integration in the background and detect whether there are some tables that need to be reseeded or the integration needs to be fixed and are able to keep it up and running automatically.

We've also built observability into the system, with detailed metrics around lag and integration health down to individual tables, so that you have more confidence and visibility into how it's working.

Next, I want to show you a simple demo which shows Aurora data replicating within a few seconds to my Redshift warehouse.

I have a Zero ETL integration set up between my Aurora database and Redshift and currently my warehouse has six tables. So let's just go back to the Aurora DB and add a table to it. We named it category_copy.

So let's go back to Redshift now and refresh my table list to see if this DDL made its way over. Yes, it has: my table list got refreshed, and the new table's contents are empty.

So let's just go back to Aurora and add some data. We inserted the contents of the category table into it, and 11 rows got added.

So I'm rerunning this query here on Redshift, and the table is populated here as well. In this query, what we are trying to do is find the category description for musicals.

So let's go to Aurora and update this category description to something else. Yes, we've updated one row. Now I'll rerun this query on Redshift, and as you can see, the description has been updated. Pardon the spelling mistake.
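Putting those demo steps together, the statements look roughly like this. The TICKIT-style column names are assumptions, since the exact queries were not shown on screen.

```sql
-- On Aurora MySQL: create the copy, seed it, and change one description.
CREATE TABLE category_copy LIKE category;
INSERT INTO category_copy SELECT * FROM category;      -- the 11 rows from the demo
UPDATE category_copy
   SET catdesc = 'Musical theatre performances'
 WHERE catname = 'Musicals';

-- On Redshift, a few seconds later, the change is visible.
SELECT catdesc
FROM   zeroetl_db.demodb.category_copy
WHERE  catname = 'Musicals';
```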

To monitor this integration, you can go to your Redshift console, click on the integration ID, and see all sorts of metrics around lag, tables replicated, and tables failed. I don't have any failed tables in this integration. As you can see, the lag has consistently been below 25 seconds and is currently at 18 seconds.

Once this data is in Redshift, you can leverage all the Redshift functionality including high performance complex joins, joining it with data from other data lakes or sharing it with other consumers in your organization by using Redshift data sharing or connecting it with other BI tools and building ML models on top of it.
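As a generic sketch of that data sharing flow, with placeholder schema names and a placeholder consumer namespace GUID (check the data sharing documentation for what can be shared from a given database):

```sql
-- Share a schema of operational data with another Redshift consumer.
CREATE DATASHARE ops_share;
ALTER DATASHARE ops_share ADD SCHEMA reporting;
ALTER DATASHARE ops_share ADD ALL TABLES IN SCHEMA reporting;
GRANT USAGE ON DATASHARE ops_share TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';
```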

In the next slide, I'll show you one such example.

So in this demo, I will use Redshift ML to build a machine learning model on top of my operational data that came using Zero ETL in Redshift.

We are using a dataset that represents ticket sales for different concerts, shows and events in the year 2022 for a fictional website called TicketCo.

Now, using Redshift query editor v2, I create this materialized view to extract total sales for each event that happens every day. The sales table and event table are both my Zero ETL tables that came in from my Aurora MySQL DB.

As you can see in this query here we are aggregating over the sales data to produce the total sales amount that we need. The materialized view has data from January through December for year 2022.
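The aggregation behind that materialized view looks roughly like the following, assuming TICKIT-like table and column names for the Zero ETL sales and event tables; wrap it in a CREATE MATERIALIZED VIEW ... AS or a CTAS depending on where you want the result to live.

```sql
-- Rough sketch of the daily total-sales aggregation (names are assumptions).
SELECT e.eventname,
       TRUNC(s.saletime)  AS sale_date,
       SUM(s.pricepaid)   AS total_sales
FROM   zeroetl_db.demodb.sales  s
JOIN   zeroetl_db.demodb.event  e ON s.eventid = e.eventid
WHERE  s.saletime BETWEEN '2022-01-01' AND '2022-12-31'
GROUP  BY e.eventname, TRUNC(s.saletime);
```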

Now, on this aggregated data, you can perform all sorts of predictive analysis using Redshift ML. I'm using it with a FORECAST model to forecast the sales for the next month, which is January 2023, using a simple CREATE MODEL statement from my query editor.

You can see the contents of this model using a SHOW MODEL command and we can see that the model is now ready for use.

So let's just go ahead and create a table from this model and I named it as forecast_sales_output_2023_01.

And when we run the SELECT statement on this table, we can see that it has the predicted sales for the month of January for each event with different percentiles.

And this is just one example; you can use this data for all sorts of analytics use cases in Redshift.

Now to sum everything up. Redshift allows you to analyze data from files, databases and streams and makes it available for different types of analytics for running SQL queries, for connecting to your Spark based applications or building ML models as we just saw and BI apps.

And you can even seamlessly share it with consumers across your organizations, across different regions, across different accounts and build a multi cluster sort of an architecture without going through the hassle of maintaining ETL pipelines.

We also integrate with AWS Data Exchange so you can subscribe to and access third party data without going through the hassle of licensing.

By using Zero ETL integration, we enable you to unlock this broad class of analytics use cases and use data for your competitive advantage.

Now, in addition to having the right tool for the job, you need to be able to integrate the data that is spread across your organization to unlock more value out of it. That is why we are investing in a Zero ETL future where data integration is no longer a tedious manual effort and where you can easily get your data, where you need it the most.

We started today by giving you a glimpse into our vision; we have been on this path for a while, and it has been working well for our customers. With each step, we move closer towards a future that removes complexity and increases your productivity.

Please go check out all the other Zero ETL sessions and recordings for the different integrations we announced this year at re:Invent.

With that, you've just completed an AWS Analytics Superhero session. Please scan this QR code to learn more about them.

Thank you so much for taking the time today to learn more about Zero ETL. We have...
