Good evening everyone. Thank you for joining the session at this late hour. Before we dive in, can I get a quick show of hands on how many of you already use Redshift or are familiar with Redshift? OK. As many of you might already know, Redshift was introduced as the first cloud data warehouse in 2012. Since then, we have been closely listening to our customers, anticipating the changing demands of the landscape, and looking across a fleet of tens of thousands of customers to understand what is working and what is not, reinventing data warehousing over the past 10 years to meet the growing data and analytics demands of our customers today.
We are very excited to share with you some of these key innovations, including some of the recent re:Invent announcements, and we have a demo for you that includes the zero-ETL integration and the Apache Spark integration. Let me introduce myself. I'm Neeraja Rentachintala. I lead product management for Amazon Redshift. We have another speaker, Ippokratis Pandis. He is a VP and Distinguished Engineer, primarily focusing on Redshift.
When we think about innovations in Redshift, there are three broad benefit areas we look into:
- How do we make Redshift easy, secure, and reliable? As a customer, you should be focusing on getting insights from the data rather than having to worry about infrastructure.
- The second area is analyzing all your data. We want you to be able to analyze your data across the data lake, operational databases, streaming data, and third-party data, and to do not just the traditional business intelligence style of analytics but also machine learning and newer advanced analytics use cases.
- And last but not least, the best price performance at any scale. Redshift has fundamentally differentiated performance capabilities, which Ippokratis Pandis is going to walk through in his slides. But broadly, Redshift has up to five times better price performance than alternative cloud data warehouse systems.
So in today's session, we have two sections. In the first section, we want to focus on the core data warehouse, the platform side of innovations, which includes security, availability, performance, elasticity, serverless, and the autonomics of the core platform. In the second section, we are going to talk about how Redshift deeply integrates with the AWS ecosystem so that we can enable you to analyze all your data and open up analytics to all users in your organization.
With that brief intro, I would like to ask Ippokratis Pandis to walk us through the first section here. Thank you.
Alright. Hi, my name is Ippo. I've been with AWS and the Redshift team for the past seven plus years, and I'm very excited with the progress on the service and the future we have ahead of us. In this section of the talk, I'm going to talk about the core data warehousing capabilities of Redshift, the pillars of our service. If you had to categorize where we are spending our energy and resources, you could arguably break it into five thematic areas.
The first, and probably the most important, is security and availability. You folks give us exabytes of data to store and manage; we make sure that data is safe and available for you to access. Redshift offers industry-leading security and access controls out of the box without any additional costs. For example, this year we introduced role-based access control, we introduced row-level security, we introduced the integration with Lake Formation, and we run your compute environments in your own VPC. We provide the same industry-leading capabilities when it comes to availability.
With the introduction of managed storage, we have been able to provide you a recovery point objective (RPO) of zero, meaning we can ensure no data loss. And in case of availability zone issues, we give you the ability, at the press of a button, to relocate your computing environment to another availability zone and continue your business. This has been in production for some years now, and customers are able to deal with cases where there is an issue in some availability zone.
This morning, Swami announced the preview of the Amazon Redshift Multi-AZ feature. Previously, with cross-availability-zone relocation, we could not ensure that there would be enough hardware in the availability zone you were failing over to for your computing environment to start up. With Multi-AZ, we run your environment in two separate availability zones, in two separate computing environments, and we offer you double the throughput for your workload. In the rare case where there is an issue with one of the availability zones, we fail over to the other and your business continues. We are very excited about this feature.
The second most important area where we are spending a lot of our resources is performance. At its core, Redshift is a very fast system. In the next few slides I'm going to tell you how we achieve this performance; we're going to open up Redshift a little and take a look under the hood.
Redshift is a very simple system. It consists of two layers, the compute layer and the storage layer. If you look at the compute layer, a Redshift environment consists of two types of nodes. The first one is the leader node, the node shown in green at the top. This is where you connect with your favorite tool to perform analytics, so JDBC or ODBC, the Data API, what have you. And then there are a bunch of compute nodes, the worker nodes, that are responsible for executing your work.
The data lives in a separate layer, Redshift managed storage, and it is partitioned; at runtime, we assign portions of the data to individual compute nodes. When a query comes in, we parse it and understand what you want to do. We try to do some rewrites in order to produce an equivalent SQL query that is more efficient to execute. And then we do cost-based distributed query optimization in order to generate an optimal distributed query execution plan. We have a lot of tricks there that we take advantage of.
For example, we take advantage of the physical design of your database and tables. If you are joining two tables that are very large and both tables are distributed on the join key, we end up doing a join that does not have to shuffle data over the network; it's what we call a collocated join, and it is very, very fast. Once we decide on this optimal distributed query execution plan, one thing Redshift does that is fairly unique among data warehouses is that we take all the fragments of a query and generate C++ code; we essentially hand-write C++ code for the individual query you wrote. We compile this C++ code, take the executable, send it down to the compute nodes, and off they go to execute your request.
The compute nodes take this executable, start scanning the data they are assigned, and do whatever your request asked for. We do a lot of tricks there. For example, we keep min/max values on the data blocks and try to minimize the number of data blocks we need to open when we are scanning data. We store the data in a columnar fashion, in encodings that are friendly for us to operate on, and we use vectorized instructions such as AVX2 to operate on top of encoded data. And most of the time we are operating on data that is local, in the memory of those compute nodes or in their locally attached storage. That's it; that is Redshift in a very simple, high-level view. It's a very simple system and it's a very, very efficient system.
To give you a little more understanding of what I mean by code generation, let's take an example. Take this query: a simple query which joins two tables, applies a filter on one of them, and then calculates an aggregate on top of the join. What Redshift ends up doing is generating C++ code that looks a bit like the code I have here. If you spend some time on it, you can see it's very simple and you can read it: it starts reading one table, applies the filter, calculates a hash function, probes a hash table, and that's the aggregation. And there you go; it is very, very efficient.
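To make the shape of that generated code a bit more concrete, here is a minimal hand-written sketch in the spirit of the scalar plan fragment described above. It is not Redshift's actual output; the query, table names, and columns are hypothetical, and a real fragment would operate on encoded, columnar data blocks rather than an in-memory vector.

```cpp
// Illustrative only: scan one table, apply a filter, probe a join hash table,
// and update a hash aggregate -- the rough shape of the scalar generated code
// for a query like:
//   SELECT c.region, COUNT(*) FROM sales s JOIN customers c ON s.cust_id = c.cust_id
//   WHERE s.amount > 100 GROUP BY c.region;
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct SalesRow { int32_t cust_id; int32_t amount; };  // fact table rows

int main() {
    // Scanned table (in Redshift this would be encoded blocks in memory or local storage).
    std::vector<SalesRow> sales = {{1, 150}, {2, 50}, {1, 300}, {3, 120}};

    // Join hash table built from the dimension table customers(cust_id -> region).
    std::unordered_map<int32_t, std::string> cust_to_region = {
        {1, "EMEA"}, {2, "AMER"}, {3, "APAC"}};

    // Aggregation hash table: region -> COUNT(*).
    std::unordered_map<std::string, int64_t> counts;

    // Tight per-row loop: filter, hash-join probe, aggregate update.
    for (const SalesRow& row : sales) {
        if (row.amount <= 100) continue;              // WHERE s.amount > 100
        auto it = cust_to_region.find(row.cust_id);   // probe the join hash table
        if (it == cust_to_region.end()) continue;     // inner join: drop non-matches
        ++counts[it->second];                         // GROUP BY region, COUNT(*)
    }

    for (const auto& [region, cnt] : counts)
        std::cout << region << " " << cnt << "\n";
    return 0;
}
```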
Yes, we do a bunch of tricks, for example, to minimize data access costs. We use the vectorized AVX2 version of this code; this is the scalar code, and I could print the AVX2 version, but it would be a little more difficult to parse. We use min/max pruning and late materialization. But you get the idea: it is a piece of code that is very efficient and makes very good use of the underlying hardware, the EC2 Nitro hardware we are using, and we keep on innovating in this space.
I personally have a PhD in databases, and with the things we are doing with databases in the cloud, we are writing new chapters in the databases book. For example, here is one of the very interesting things we did recently. Whenever I give this talk and talk about code generation, somebody asks me: hey, don't you worry about the cost of generating the C++ code and compiling it as part of your query? Doesn't that overhead slow you down?
We are very worried about any latencies introduced by code compilation, and those latencies sometimes show up in workloads that have very high throughput and low latency requirements. In order to address this, what we have been doing is very aggressive caching. In particular, just by putting a small cache on the leader node of every Redshift compute environment, and because your queries tend to be very repetitive, we were able to get a cache hit rate of 99.5%. So only five out of 1,000 queries you execute, on average, had to be compiled. But that was not enough for us.
So the next thing we did is introduce a global compilation-as-a-service microservice. The way it works is the following: every time we see a query fragment that we have not seen before, we push it to this compilation service, it gets picked up by some machines, they compile it optimally, and they put it in a global cache. The next time any Redshift cluster has to run this query fragment, it picks it up from there. By doing this very simple trick, we were able to improve the cache hit rate from 99.5% to 99.96%. Those are the kinds of tricks you can do in the cloud world that you could not do back in the traditional on-premises world. And our team is very much focused on performance.
There is not a single trick that makes Redshift fast. For example, in this graph, I'm plotting the throughput of Redshift running the cloud data warehouse benchmark, modeled after TPC-DS, over the course of a year. We did all sorts of things: we looked at the fleet, we looked at what you are doing and where we were spending our time, and we fixed and improved the code we have. You can see in this graph that we were able, in that year, to improve the throughput on this benchmark by over 3.5x, and the results speak for themselves.
So this morning, Swami talked about Redshift's price performance advantage. In particular, when we run out-of-the-box workloads like TPC-DS, where you, the user, just define some tables, load some data, and start running your workload, we see that we are up to five times better against other popular cloud data warehouses.
What is even more interesting is that over the past few years, we have noticed a tendency from our customers to move toward more real-time, higher-throughput, lower-latency workloads: the type of workload which traditionally was not the target workload for data warehouses. What we see is that because Redshift makes such efficient use of its resources, because the code paths are very, very short and very efficient, Redshift achieves much higher query throughput for short queries.
In this plot, we can see that Redshift achieved up to 57 times better throughput price performance against popular cloud data warehouses. So we are very focused on utilizing the compute environment you are getting from our service very efficiently. Once we did that, the next requirement from our customers was: we really like the performance of Redshift, but we would like to put more data into it and have more concurrent users on the system. And that's what we did.
The most important architectural change we made in Redshift over the past 10 years we have been in production was the introduction of Redshift managed storage. Enabled by the high-bandwidth networking of EC2 Nitro instances, we were able to disaggregate storage and compute and move the storage into a separate layer outside of any Redshift computing environment. By making this big architectural change, we have been able to move very fast and offer value-adding services to you. And what we see is that we can scale the storage size very well.
For example, in this plot we ran an experiment with the cloud data warehouse benchmark modeled after TPC-DS, from a 10 terabyte data set all the way up to a one petabyte data set, and we increased the size of the computing environment proportionally. What you can see is that the time it took to run this workload stayed pretty much the same: we increased the volume of the data by two orders of magnitude without changing the run time of the workload, which is a very nice property.
This predictability in the performance of the system is a very good property for you, because you can predict your cost and forecast your needs. So with the introduction of Redshift managed storage, we were able to address the need for storage elasticity. When it comes to compute elasticity, we made two main changes.
The first was the introduction of elastic resize. Any Redshift computing environment can grow up to four times bigger or shrink up to four times smaller, a 16x range, which you can use to fine-tune the latency of your individual queries. And once you find the optimal size for your computing environment, there will still be cases where you have many concurrent users submitting requests at exactly the same time.
A typical example: your employees get to the office at nine o'clock in the morning and start running their reports to see what's going on. In these cases, there is physics; at some point we run out of resources and cannot admit more concurrent requests into the system. In order to address this need, we introduced an auto-scaling capability in Redshift that we call Redshift Concurrency Scaling.
The way it works is that it is an auto-scaling feature. Requests are submitted to the main Redshift cluster, and if there is some queuing, we spin up another similarly sized Redshift cluster and start running requests there. When the burst of activity goes down, the system contracts, the additional clusters go back to the main pool, and the system returns to operating at the original size.
And last month we announced that concurrency scaling clusters can now run not only read-only queries but also writes such as inserts, deletes, and copies. So using elastic resize, you can size your individual environment to meet the latency requirements of your queries, and then using auto scaling we can increase the throughput of the system. By doing that, customers are able to get substantially higher throughput without having to make a single change in their application.
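As a rough sketch of how you would exercise these two knobs from the AWS CLI (the cluster and parameter group names here are hypothetical): elastic resize changes the node count of a provisioned cluster, and the number of transient concurrency scaling clusters is capped through a workload management parameter.

```bash
# Elastic resize: change the node count of an existing provisioned cluster.
aws redshift resize-cluster \
    --cluster-identifier analytics-cluster \
    --number-of-nodes 8

# Concurrency scaling: cap how many transient clusters Redshift may add
# when queries start queuing (set on the cluster's parameter group).
aws redshift modify-cluster-parameter-group \
    --parameter-group-name analytics-params \
    --parameters ParameterName=max_concurrency_scaling_clusters,ParameterValue=4
```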
In this graph I'm plotting a controlled experiment we did in the lab, where we had an increasing number of concurrent users submitting queries from the cloud data warehouse benchmark with zero think time. It is a stress benchmark that emulates a very large number of concurrent users, and we are measuring the throughput one can get from a Redshift environment. You can see that at the beginning of 2019, in January, the maximum throughput was around 200 queries per hour from five concurrent users. When we were done with this feature at the end of the year, you were able to get over 12,000 queries per hour from 210 concurrent users; I think at that point we simply stopped adding more hardware. And 210 concurrent users with zero think time emulates thousands of concurrent users in a realistic scenario. All this happened by just enabling concurrency scaling in your environment: a 60x improvement in throughput without having to make a single change in your application.
Again, those are the great things you can do in the cloud that you were not able to do in the traditional on-premises world. With these changes, we addressed the need for compute elasticity. Then our customers came back and said: we really like the storage and compute elasticity, but we also have isolation needs. We would like to spin up individual environments for our business units so that they are able to run their workloads in isolation and we are able to charge back, and so on. In order to address this demand, we introduced Redshift data sharing.
The idea is very simple. You have one Redshift cluster that produces some data and has ownership of it, and it gives access to separate consumer environments, either provisioned clusters or serverless endpoints, to consume this data in a transactionally consistent fashion. This can obviously be combined with auto scaling, so you can have these environments, each of them isolated and each auto-scaling. You can do that with clusters in the same account, across different accounts, or even across different regions, so you can have transactionally consistent sharing of that data across the globe in a very efficient fashion. These are things that make the life of our customers easier.
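For those who haven't seen it, a minimal data sharing sketch looks roughly like this (object names and namespace IDs are placeholders): the producer creates a share and grants it to a consumer namespace, and the consumer surfaces it as a database and queries it like local data.

```sql
-- On the producer cluster: create a share, add objects, grant it to a consumer namespace.
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA sales;
ALTER DATASHARE sales_share ADD TABLE sales.daily_revenue;
GRANT USAGE ON DATASHARE sales_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster or serverless endpoint: surface the share as a database
-- and query it, transactionally consistent, with no copies made.
CREATE DATABASE sales_from_producer FROM DATASHARE sales_share
    OF NAMESPACE '<producer-namespace-guid>';
SELECT * FROM sales_from_producer.sales.daily_revenue LIMIT 10;
```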
So by introducing elastic resize, auto scaling, and data sharing, we have been able to address the needs for compute elasticity and isolation. Then the customers came back and said: we really like what we see in performance and security; we would like you to also make Redshift easier to use. We don't want to need database administrators to manage these systems. And this is where we're spending a lot of our energy in Redshift.
For example, we have a very active system which monitors the usage in your environment, sees the patterns in your workload, and uses machine learning to decide on and execute, in the background, mundane maintenance operations that improve the health and performance of the system.
For example, we may see that some tables do not have statistics, and go analyze them in order to make queries on those tables more efficient; or some table may require a vacuum operation; or we may even decide to change the physical design of some tables if we think that is the right thing to do. There is a lot of science behind these decisions. For example, there is a paper we wrote back in 2020 discussing some of the algorithms we use to decide the physical design, the distribution keys, of the tables in a workload. We have had these things running in production for several years now, and as we gain more confidence, we become even more aggressive in what we can do.
For example, a couple of years ago we introduced a new type of table in Redshift which we call the auto table, where you, the customer, tell us: hey, Redshift, please manage these tables for me and decide about their physical design; I don't want to think about it. And that's what we do: we monitor the workload, the usage, the access patterns, and we decide about the physical design of your tables.
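In SQL terms, an auto table is simply one whose distribution style and sort key are left to the service; a minimal sketch (with a hypothetical table) looks like this.

```sql
-- Redshift picks, and later adjusts, the distribution style and sort key
-- based on the observed workload.
CREATE TABLE web_events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_time TIMESTAMP,
    page       VARCHAR(256)
)
DISTSTYLE AUTO
SORTKEY AUTO;
```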
In this example, I'm plotting how much time it takes to run the cloud data warehouse benchmark derived from TPC-DS at 30 terabyte scale. The user just decided to define some tables, load 30 terabytes of data, and start running this workload. What you can see is that within some hours, we were able to improve the performance of this workload by almost 2x: initially it was taking around 100 minutes to run the workload, and at the end of this process it was taking just 60, without you having to do anything. And we are becoming even more confident and more aggressive.
For example, last year we introduced automated materialized views. We look at your access patterns and may decide to create materialized views because we predict that you are going to run queries that will find them useful; when you stop using them, we decide to drop them. This is the active monitoring we do in Redshift.
Another area where we use machine learning is improving the utilization of the resources you are paying for. For example, for many years now we have had a classifier which allows us to decide which jobs being submitted to the system are small and which ones are big. If you remember your queuing theory classes back in school, the best way to increase the throughput of a system with large jobs and small jobs is to take the small jobs, admit them immediately without any admission control, get them in and out of the system as fast as you can, and then use admission control for the large jobs. This is something that has been running in production for several years.
The next thing we did is improve our classifiers so that we are able to predict not only whether a job is small or big, but also approximately how much CPU, how much memory, and how much I/O it will need. We use this as input to schedule requests in a way that maximizes the utilization of the resources you are paying for.
So with this section, I hope I have convinced you that we are putting a lot of smartness, a lot of autonomics, into the system in order to make the lives of database administrators much easier, or the lives of individuals who do not even want to know about database administration. The culmination of our efforts has been the introduction of Serverless, which was announced at this event last year and went GA this summer. Redshift Serverless takes all this smartness I described earlier and offers an even simpler-to-use Redshift interface, where we make the decisions about the compute you would otherwise have to pick.
For example, in the provisioned world, you have to decide the number of nodes and the instance types you are going to use, whereas in Redshift Serverless we take that decision onto our side as well. We decide when to auto scale, and you pay only for the seconds you are using the environment. At the same time, you get the full SQL functionality you usually get with Redshift, and it keeps improving as we roll out new software in the fleet.
So that's it. In the past 30 minutes, I gave you a description of the pillars of Redshift, and we feel that we have a very, very strong data warehouse with industry-leading performance. Now Neeraja will talk about how we take this core data warehousing technology and integrate it with the rest of the AWS ecosystem.
Thank you. So as I mentioned, we spend a significant amount of time on the platform innovations, and we spend an equal amount of resources and effort on integrating Redshift deeply with the rest of the AWS ecosystem. We understand that customers come to AWS to solve their end-to-end data management challenges, so we want to provide a complete solution. This is where the deeper integrations with the AWS ecosystem are critical.
Just to give some perspective on how customers are using this: I talked about Redshift data sharing capabilities, and we see that customers are able to form flexible data mesh architectures using Redshift data sharing. This data mesh can contain provisioned clusters and Redshift Serverless endpoints. At re:Invent this week, we announced the integration of data sharing with AWS Lake Formation.
With Redshift data sharing, you share directly from one data warehouse to another. With the Lake Formation integration, as the producer of the data you share it with Lake Formation, and all the consumers discover the data from Lake Formation and consume it. The benefit is that you can centrally set up, manage, and govern data access across all your data sharing consumers, so you don't have to manage point-to-point sharing. It's a complementary feature to the direct data sharing capability.
Another capability we have introduced in Redshift is third-party collaboration. Redshift integrates with AWS Data Exchange, with which you can go to Data Exchange, discover Redshift data products, subscribe to them, and consume the data directly from your Redshift data warehouse, joining the third-party data with your first-party data in the data warehouse.
Similarly, you can take the data in Redshift and make it available as a service to your customers and partners, so think of it as extending the data mesh architecture to your ecosystem. The Data Exchange integration is built on data sharing, so this is all live data sharing; there is no ETL and no data copies are made on either the provider side or the subscriber side. That brings us back to our core themes of analytics on all data and analytics for all users, which is why we integrate with the rest of the AWS ecosystem.
On analytics on all data: the goal for Redshift is to allow you to do analytics across the data lake, operational databases, and streaming data. We already have capabilities to directly query data in open formats in the S3 data lake, and a couple of years back we introduced federated query to directly query operational databases such as Amazon Aurora and RDS MySQL and PostgreSQL databases; a quick sketch of both is below.
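As a rough sketch of what querying in place looks like (schema, table, role, and endpoint names are placeholders): an external schema over the Glue Data Catalog covers the S3 data lake case, and an external schema over an Aurora PostgreSQL database covers the federated query case.

```sql
-- Data lake: external schema over the Glue Data Catalog, then query S3 directly.
CREATE EXTERNAL SCHEMA lake
FROM DATA CATALOG DATABASE 'analytics_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

SELECT COUNT(*) FROM lake.clickstream WHERE event_date = '2022-11-01';

-- Federated query: external schema over an Aurora PostgreSQL database.
CREATE EXTERNAL SCHEMA ops
FROM POSTGRES DATABASE 'orders' SCHEMA 'public'
URI 'aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-db-secret';

SELECT order_id, status FROM ops.orders WHERE created_at > current_date - 1;
```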
At re:Invent this week, we introduced new capabilities to simplify data ingestion into Redshift even further. The first capability here is auto-copy from Amazon S3, where you can continuously ingest data into Redshift from S3 without any pipelines. The second one is that we are making generally available the real-time streaming ingestion capability, with which you can ingest data into Redshift from streaming sources like Amazon Kinesis or Amazon MSK (Managed Streaming for Apache Kafka). And the third integration we are also excited about is the zero-ETL integration with Amazon Aurora, so from operational, transactional databases we are able to easily ingest data into Redshift.
On the consumption side, Redshift has always had the ability to query the data from BI and analytics tools through JDBC and ODBC. We also have a very simple Data API, an application development API that developers can use to build applications very easily in Python, JavaScript, or whatever their language of choice. A couple of years back we introduced Redshift machine learning, in-database machine learning by integrating with SageMaker, and at re:Invent this week we announced the Apache Spark integration.
With this, you can simplify and speed up Spark applications running on the data warehouse data. So why did we spend so much time simplifying data ingestion? Because we know that data ingestion can actually be pretty hard. We see customers set up complex pipelines to bring data from all these sources into Redshift, and those pipelines can be expensive, cumbersome, and sometimes error prone, leading to delays in getting access to the data.
If your goal is to get the data into Redshift as quickly as possible, pipelines get in the way. That is the reason for this series of features to simplify data ingestion. The first feature, as I mentioned, is auto-copy from Amazon S3, which gives you very simple, automated file ingestion. You don't have to worry about whether you ingested a particular file or not, or manually track what is happening in S3 and which new files have arrived; Redshift takes care of it for you. Auto-copy is a very simple feature: you specify the source S3 location you want to ingest the data from, and you specify the target Redshift table. Redshift already has the COPY command to load data into Redshift.
Now we have a new attribute to it called JOB. You convert your COPY statement into a copy job, give it a name, and set it as an auto job, which means that you don't have to track it: Redshift detects new data as it lands in S3 and automatically and continuously ingests it into Redshift.
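A hedged sketch of what that looks like (the feature was announced in preview, so the exact syntax may differ; the bucket, table, role, and job names are placeholders):

```sql
COPY sales.orders
FROM 's3://my-ingest-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS CSV
JOB CREATE orders_ingest_job
AUTO ON;  -- Redshift keeps watching the prefix and ingests new files as they land
```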
The other capability we have made generally available at re:Invent is streaming ingestion. With this, you can ingest high volumes of streaming data into Redshift and make it available for analytics within seconds. It is a very straightforward integration using SQL: again, no data pipelines, no manual management to deal with data ingestion. You can bring the data in with a schema, or just load it using our semi-structured data types such as SUPER and VARBYTE, do some ad hoc queries, figure out which columns you really want to materialize, and then create a table out of it.
With streaming ingestion, it's hard to see here, but broadly there are three steps. First, you already have Kinesis or MSK, so you have some streams or Kafka topics that are publishing data. On the Redshift side, it's a very simple two-step process: you create an external schema pointing to Kinesis or Kafka, and you create a materialized view pointing to that stream. The moment you create this materialized view, within seconds the data is in Redshift and the materialized view is populated. As and when new data lands on the stream, the materialized view is automatically refreshed with the new data.
So as you can see here, you are literally working with SQL; there are no data pipelines and no complex setup that you need to do. The third step shown here is that you start querying the data, and as you refresh, if there is new data on the stream, you will start to see the latest results.
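A minimal sketch of those steps for a Kinesis stream (stream, view, and role names are hypothetical; the MSK variant is analogous):

```sql
-- Step 1: external schema pointing at Kinesis.
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftStreamingRole';

-- Step 2: materialized view over the stream; records land as SUPER for ad hoc queries.
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."click-events";

-- Step 3: query it; each refresh reflects new stream data within seconds.
SELECT payload FROM clickstream_mv LIMIT 10;
```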
The third capability is ingestion from transactional databases. This is a typical transaction analytics architecture at scale that many of our customers use: if they want to bring data from Amazon Aurora into Redshift, they typically use one or more services in the pipeline. Some of the common ones are AWS Database Migration Service, Glue, and EMR, and they build these pipelines to load data into Redshift.
There's nothing wrong with these pipelines, but the point is, if you're trying to get to the data faster, the pipelines get in your way. If you want the data available for analytics as soon as possible, that is what we are trying to simplify with the zero-ETL integration.
With the zero-ETL integration, that pipeline is eliminated. Essentially, you are able to bring data from Amazon Aurora, starting with MySQL support, and the data automatically lands in Redshift within seconds after being written to Aurora. This Aurora integration is designed for simplicity and performance.
As a customer, all you need to do is specify which Aurora database you want to ingest data from and what your target Redshift cluster is; you can ingest data from multiple Aurora databases into the same Redshift. Once you specify this, the integration automatically triggers the initial data seeding and data starts showing up in Redshift.
As new data is added, both metadata changes, like new tables being added, and new data added to existing tables, Redshift reflects the new data within seconds of the write to Aurora. You can run your analytics alongside the data ingestion that is happening, doing real-time analytics, and you can monitor progress. This integration is also resilient.
If there is any failure, we automatically recover and resync. On the consumption side, as I mentioned, one of the capabilities we added is Redshift machine learning. We integrate with SageMaker behind the scenes, but the idea is that you just use SQL to create, train, and deploy models; you don't need to be familiar with any other skill set.
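A minimal Redshift ML sketch (table, column, bucket, and role names are hypothetical): CREATE MODEL hands training off to SageMaker behind the scenes and generates a SQL function you call for inference.

```sql
CREATE MODEL churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM analytics.customer_history)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Inference is then just SQL:
SELECT customer_id,
       predict_churn(age, tenure_months, monthly_spend) AS will_churn
FROM analytics.current_customers;
```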
One of the nice features of Redshift machine learning is that you can bring your own model. You might have created models outside Redshift, for example a SageMaker model, and you can import it and run in-database inference. Without this, customers with big data sets in Redshift who wanted to apply machine learning models had to take the data out of Redshift, run a batch inference on it, and then bring it back into the data warehouse, which again is time consuming and error prone.
With in-database inference, you don't need to take the data out; you can run your inference using simple user-defined functions that are automatically generated. Our latest capability is the Redshift integration with Spark. The idea is: how do you simplify and speed up Spark applications running on Redshift data?
With this integration, you will be able to get started very easily without having to worry about any manual setup. You can use Amazon EMR or Glue and start building Spark applications using the native Spark programming model, like DataFrames or Spark SQL, and start working with the data warehouse data.
We also improved ease of use and security, so it fits nicely within the overall ecosystem that you have. And we improved the performance of this connectivity: in our benchmarks, we see that Amazon EMR applications accessing Redshift data are sped up by almost 10x, so there is a huge improvement in the performance of running Spark on the data warehouse data.
With this, I want to show a very brief demo; let me just introduce it. The two capabilities we are going to show are the zero-ETL integration from Amazon Aurora to Redshift, and from there we are going to use, in addition to SQL, Spark as well: run a couple of Spark jobs and showcase the integration. We picked a very simple scenario. You have a demo database on Aurora and a Redshift instance onto which the demo database will be replicated. We add a table and some data on Aurora and see it reflected in Redshift, then we join it with the data in the data warehouse, build a materialized view, and access it from Spark. Let's get started.
You start with the AWS Management Console. This is the RDS console; you can click on the databases. We have an Aurora cluster called demo-sales, which will be our source database. On the Redshift side, we have a Redshift Serverless endpoint called demo-cluster, which is going to be our target for this particular integration.
This is the Redshift query editor. We are connecting to the serverless Redshift, and there are a couple of databases here, dev and sample_data_dev; these are existing default databases. What we want is to take some tables in Aurora, set up an integration, and bring them into Redshift. So we connect to the MySQL CLI for Amazon Aurora; the integration right now is with MySQL. We look at some of the tables: we have around seven tables here that we want to bring into Redshift with the zero-ETL integration.
We quickly check the data, and the next step is to create a zero-ETL integration. In the RDS console, there is a new left-side navigation entry called zero-ETL integrations. You start by creating one and give it a reasonable name, some demo integration here. You choose your source Aurora cluster, which is demo-sales, and then you choose your target Redshift, which is the Redshift Serverless endpoint we saw before, and that's it.
So you go ahead and create this integration, and this is simply metadata; we haven't started replicating the data yet. The integration gets created, and you can see the integration properties such as tags and other information. That's the first step: creating a basic integration object between Aurora and Redshift.
Now I'm on Redshift. This is the Redshift query editor tool, which comes out of the box with Redshift. We want to see whether the database is already showing up on Redshift; it isn't yet. There is one more step you need to do on the Redshift side, which is to create a database from the integration that we just created.
On the Redshift side, you can get the integration ID from a system table, or you can also get it from the console. So we get the integration ID and then create a database from this integration. Really two steps: create the integration, and on the Redshift side have an explicit accept step, saying yes, I want to accept this replication, before the data starts copying.
There is an explicit create-database-from-integration step on the Redshift side, and that finishes our setup. Once the database is created, if you come back and start querying on the Redshift side, within a few seconds you start to see a new database. Here it is, mysql_db_1, the replicated database that we just created, and you can do all the standard metadata queries.
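The Redshift-side step from the demo boils down to roughly the following (hedged: the feature was in preview, the integration ID is a placeholder, and the system view name may differ).

```sql
-- Look up the integration ID (it is also visible in the console).
SELECT integration_id FROM svv_integration;

-- Accept the replication by creating a database from the integration.
CREATE DATABASE mysql_db_1 FROM INTEGRATION '<integration-id>';

-- Within seconds the replicated schemas and tables are queryable.
SELECT COUNT(*) FROM mysql_db_1.demo.customer;
```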
Now we check how many tables there are within this particular database; it has all seven objects that had been created. There are a couple of scenarios we want to illustrate from a demo perspective. One is: if you create a new table in Aurora, does it show up in Redshift? We create a simple table, and when you come back to Redshift, within seconds you are able to see that table created on the Redshift side.
So we are able to synchronize schema changes. Similarly, if you want to add some data to the table you just created: we add a John Doe row to the customer table, and once you go to Redshift, within seconds the new data starts showing up. We are basically illustrating the idea of how this works in the application.
Now, once the data is in Redshift, you can use all the power of SQL: you can do machine learning, you can do all the SQL analytics that come with Redshift. One of the capabilities of Redshift, as I mentioned before, is querying across data sources, so we can query the data lake and we can query operational databases.
In this particular scenario, on the data warehouse side we also have a customer table; it happens to have the same schema, just for demo purposes. So we have the replicated table that came from Aurora and a local data warehouse table, suffixed _dw just to differentiate the demo customer table from demo customer_dw.
What we want to do is combine that data and create a materialized view, and these materialized views can be arbitrarily complex. Here we are illustrating a simple scenario where we union the data from the replicated table and the local table, but you can have transformations, you can add other data sources into the mix; you can create as much complexity as you want with respect to the analytics you want to do.
So here we are creating a materialized view, as I mentioned, which combines the data that we got from Aurora with the data that is already there on the Redshift side, unioning them together. The data is now in the materialized view, and you can use this object for downstream analytics: you can access it from BI tools and do additional types of analytics on top of this data.
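In SQL, the materialized view from the demo looks roughly like this (table and schema names are approximations of the ones shown on screen):

```sql
CREATE MATERIALIZED VIEW demo.customer_all AS
SELECT customer_id, customer_name
FROM mysql_db_1.demo.customer          -- replicated from Aurora via zero-ETL
UNION ALL
SELECT customer_id, customer_name
FROM demo.customer_dw;                 -- local data warehouse table

SELECT * FROM demo.customer_all ORDER BY customer_id;
```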
For the purposes of the demo, one of the other scenarios we want to demonstrate is: can I access this data from Spark? This is our other announcement, the Redshift integration with Apache Spark. So we take this materialized view and try to access it from a Spark application.
In this example, we have an EMR cluster that is pre-created using EMR 6.9, which is where the deeper integration with Redshift starts; EMR 6.9 onwards has the Redshift integration. The EMR cluster here is just representative: it could be any provisioned EMR cluster or EMR Serverless, or it could even be Glue, anywhere we have the integration.
So here we have the EMR cluster; as I mentioned, it is created on the latest version. Once you have it, we connect to it from the EMR notebook interface. The first step here is loading the jars; the two jars of interest are the Redshift JDBC driver and the Redshift Spark connector.
Both of these jars are available as part of EMR and Glue, so you can just point to their local locations as part of the loading configuration. The second step is that once you have that, you can connect to Redshift from EMR and start to work with the Redshift data.
Here we are establishing JDBC connectivity to the serverless workgroup we created before, which is where we replicated the data into; we are just connecting to it using JDBC. We also specify a temporary directory, which can be used for staging any intermediate data, and we have some IAM roles; we are telling EMR to use these roles to connect to Redshift. And just for demonstration purposes, we also set something called a query group.
What we want to show you later is the last query that executed from this Spark job on Redshift. To do that, we use the query group, an identifier that tags all the queries coming in from this particular notebook. That's it, and once you have it, you can do all the usual things with Spark.
Here we are creating a DataFrame that connects to the materialized view we created in the previous step, and we run a Spark SQL command to group the data, sort it, and select specific columns. This is a pretty simple example, but you get the idea; the point is that everything is packaged for you. You just come into the EMR or Glue environment and start working with it.
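Putting the pieces together, the notebook code follows a pattern roughly like this PySpark sketch (the JDBC URL, role ARN, tempdir, and view name are placeholders; the format string is the connector that ships with EMR 6.9 and Glue 4.0):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-spark-demo").getOrCreate()

# Read the materialized view through the Redshift connector.
df = (spark.read
      .format("io.github.spark_redshift_community.spark.redshift")
      .option("url", "jdbc:redshift://demo-workgroup.123456789012.us-east-1"
                     ".redshift-serverless.amazonaws.com:5439/dev")
      .option("dbtable", "demo.customer_all")                  # the materialized view
      .option("tempdir", "s3://my-spark-staging/redshift/")    # staging area for intermediate data
      .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftSparkRole")
      .load())

# Standard DataFrame / Spark SQL from here; eligible operations are pushed down to Redshift.
df.createOrReplaceTempView("customers")
spark.sql("""
    SELECT customer_name, COUNT(*) AS cnt
    FROM customers
    GROUP BY customer_name
    ORDER BY cnt DESC
""").show()
```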
One of the things we want to show is whether these queries are actually being pushed down. There are advanced pushdown capabilities that we added as part of this integration to speed up performance; we don't want to bring all the data, billions of records, from Redshift into the Spark environment and process it there. We want to push down as much processing as possible into Redshift.
Here we are showing the query that actually executed on Redshift, and as you can see, the ORDER BY and GROUP BY clauses are pushed down. We also use Parquet for the intermediate staging location for high performance as part of the Spark job processing. This just gives you an idea, but the point is you are able to do advanced SQL analytics, and similarly you can do machine learning on the data.
So within seconds of data being generated from transactions, you are able to bring the data into Redshift, do SQL analytics, use Spark on top of it, and leverage machine learning on top of it. And you can share this real-time data across your organization with data sharing. Having these real-time flows throughout the organization makes it more agile and more innovative, and our goal is to enable those kinds of end-to-end pipelines.
So this is our summary. We have tens of thousands of customers processing exabytes of data on Redshift. Redshift offers industry-leading security out of the box at no additional cost. Our continuous top focus is performance and scalability, and we elastically support tens of petabytes of data and thousands of users with little or no work from you at all.
We have autonomics in place that continuously learn and improve the system, and we distill all of this into Serverless: Redshift Serverless leverages all this innovation and automates it further for you. Redshift also deeply integrates with the broader AWS ecosystem so that we can support your end-to-end data management needs, not just data warehousing. And finally, we see Redshift customers moving to data mesh architectures, which are a much more scalable way to make analytics broadly available in organizations. So just to wrap up: you can check out Redshift on the AWS site. Whether you're an existing customer, a new customer, or thinking of a migration, there is a lot of information available for you to get started with Redshift. That's it. Thank you.