Simplifying modern data pipelines with zero-ETL architectures on AWS

Well, hello everyone and welcome to our session today. I know we are literally eating into lunch time. I'm sure a lot of you have skipped lunch for the session or already had lunch or actually having lunch. I see a few people having your lunch right now, which is perfect. Rest assured you're gonna learn a lot from this session today right now.

Before I start introductions here, I just want to get a few questions in to get the pulse of the audience. How many of you guys in this room are data engineers or data architects or have similar roles? Perfect. I see a good number of hands. Anyone, a recovering data engineer. Ok. That's me. Fine. So you're in the right session. I can guarantee you that.

Has anyone faced issues with monitoring data pipelines? Or let's say you have a complex ETL job, you don't know where it's failing or what the problem is, and you're struggling with that. It seems like a good number of you as well. Ok, perfect. You're at the right session, because we're going to teach you how to build data pipelines and how at AWS we're able to simplify that.

With that being said, I'm Anthony Prasad Raj, Senior Partner at AWS and I'm joined by my colleague.

Hello, I am very excited to be here. I flew in from Seattle, and we also have our partner Sanjit Jain, who is also an AWS Hero.

Hello, everyone. This is my third re:Invent. I'm really excited for this week.

Awesome. Now that we're all here, we have a packed agenda, but we have 60 minutes and we'll try to do our best, right? So the first important thing is we're going to help you understand what data pipelines are. Sanjit is going to talk about some of the challenges that he's noticed with customers in the field. I'm gonna help you understand how you can actually simplify those data pipelines using some of the connectors that we have from our prebuilt services.

Then Sudhir is going to go into how we're innovating around zero-ETL, and Sanjit, again, is going to talk to you about customer success stories. He has some really good stories that he can share details about. I'm going to walk you through a demo. I know a lot of you guys in this room are soccer fans, so stay tuned, because we have a soccer analysis demo for you.

And last but not the least, a ton of resources. With that being said, I'll hand it over to Sanjit.

Thanks, Anthony, for setting the stage, and thank you so much to all the lovely audience. As Anthony mentioned, I'll talk about why exactly you need data pipelines: why it is important for you and your business to understand data pipelines and how they are going to help you get that competitive edge in your business and understand your customers better. So with that, let's start our session for today.

So we live in a world which is full of data: different varieties of data, structured data, unstructured data, social media feeds, real-time data, right? And what sense does it make if you can't join or consolidate that data to get some business outcome out of it? That's where you see the real importance of data integration, or ETL pipelines, as we generally refer to them. Let's zoom in a bit and understand what that process looks like in terms of ETL, and what are the different things you do that help you break those data silos and consolidate the data.

So if you see on the left, there are different varieties of sources; to name a few, you will see some transactional databases, some real-time data, some log feeds, right? On the right extreme, you will see your business stakeholders, different kinds of analysts who are really eager to get those insights. And at the center, you will see your entire data integration process and its different layers.

If I start from the left, you will see an ingestion process. That's where exactly you see your application code running, and this application code varies across the different sources, so you can extract the data from them and store it into some landing zone. This landing zone is nothing but your storage service, whichever storage service you are using.

Then you will see a center box which is called data processing. This processing does two major jobs. One is to cleanse your data, to make sure your data is in the right shape and form. The second is to make sure you apply the business rules, so you make the data actionable and can serve it as an outcome to your business stakeholders.

Then the following layer is consumption, via which you provide that data to your business stakeholders, who can consume it via APIs, business intelligence, machine learning, and so forth.

This entire framework is packaged with security, automation, and logging and monitoring, so your business and your team are well versed in what's happening and where exactly the system stands.

This looks pretty simple, right? It's so simple the way I'm explaining it that you might be feeling, what's complicated in this? I mean, this is very simple to implement. But if that's the case, let me stop you there and take you to the next slide. It's not as simple as it looks.

So the first thing is, as I mentioned, you have different application code to ingest the data from these different sources. That's where you see the complex application code you need to write to get the data from those sources: maybe one for structured data, one for unstructured, maybe one for real-time data. And you also need to make it scale, not just write the code and deploy it. So it's quite complex.

That complexity also leads your pipeline to fail many times, which leads to inconsistent data or stale insights. And I haven't even talked about data management, which is not an easy job, right? You need so many eyes and hands on keyboards to make your pipeline work. With that, you also need skilled resources to handle it; you need to train them and make sure they are the best available in the market. The demand keeps increasing exponentially as your business scales, functionally and financially, and you need to make sure your pipeline keeps providing the insights to your business stakeholders so they can keep making informed decisions, right?

This is what it feels like: the expression and the sentiment of your data engineering team once you start realizing how complex it is. I mean, many people would already be making this expression in some other form, but this is how it actually looks when you really get why it is so complicated.

Can we simplify this? Can someone give me something simpler? That's where we need something which is very simple, a simplified solution which can help you solve that.

Let me pass it on and invite Anthony, and let's see how exactly AWS can help with different services and offerings to simplify, maybe, the life of a data engineer, right? Anthony.

Thank you, Sanjit. Although my face would have been a lot worse than what you saw in that pic over there, because it is frustrating to monitor and keep hands and eyes on all those ETL jobs, right? Let me help you understand how at AWS we are actually simplifying some of these data pipelines.

The first and foremost thing is, as Sanjit mentioned, there are multiple data sources. We at AWS have services that are able to actually connect to all those data sources, so you don't have to build a separate ETL layer to extract data from those sources. You can connect to those sources directly from the data analytics services at AWS.

Now let's say you have further value added transformations, you can use something like AWS Glue that will actually help you run those advanced transformations. In fact, we have a visual interface which allows you to not actually code a lot to get that done as well.

Now the key thing to understand is when you talk about AWS Glue in this aspect, right? It is important to understand that you are not only connecting to those data sources, but you are also able to catalog the data, right. So what do you mean by that? The metadata of those tables is already stored and you can leverage those in multiple other services as well.

Now, AWS has always built a portfolio of data services that provides hundreds of data connectors, right? And it's really easy to set up from your AWS console. For example, if you have a streaming analytics use case, you can use something like Amazon Kinesis Data Firehose to stream the data directly from your source into your target.
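
As a minimal sketch (the region and delivery stream name are assumptions, and the stream must already exist with a destination such as S3 or Redshift configured), pushing a record into Firehose from code looks roughly like this:

```python
import json

import boto3

# Assumed region and delivery stream name; the stream must already exist
# with an S3 or Redshift destination configured.
firehose = boto3.client("firehose", region_name="us-east-1")

event = {"match_id": 1234, "event": "goal", "minute": 67}

# Firehose buffers the record and flushes it to the configured target,
# so there is no separate extract-and-load job to operate.
firehose.put_record(
    DeliveryStreamName="soccer-events-stream",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)
```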

For example, if you're a data scientist and you want to look at data in your S3 bucket and run ad hoc analysis to understand whether the data is actually valid or not, you can use Amazon Athena to run that ad hoc query analysis.
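
As a rough illustration (the database, table, and result bucket names here are made up), an ad hoc Athena query from code looks like this:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# The database/table are assumed to be registered in the Glue Data Catalog,
# and the output location must be an S3 bucket you own.
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS rows_loaded FROM soccer_db.player_stats",
    QueryExecutionContext={"Database": "soccer_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/adhoc/"},
)
print("Started query:", response["QueryExecutionId"])
```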

You can use Amazon SageMaker Data Wrangler to prepare your data, right? You can also use something like AWS Data Exchange, which helps you share third-party data; you can create data products and make those available through AWS Data Exchange as well.

Let's talk about Amazon AppFlow. What is Amazon AppFlow? It's a fully managed integration service that helps you securely transfer data from those data sources to AWS services, because we have direct connectors to services like Amazon S3. Let's say you want to store the data in your data lake on S3, or if you want to store data in your data warehouse, you can use Amazon Redshift; Amazon AppFlow simplifies this whole experience. It automates data cataloging, it also prepares data, and it helps with partitioning as well.

This is really a basic architecture to help you understand how you can get started. Now, AWS Glue, which I mentioned before, actually provides a seamless integration with multiple AWS data services such as Amazon Aurora, EMR, DynamoDB, and Redshift. Based on your use case, whether it's NoSQL or relational data, you can easily and automatically catalog all your data and make it available to run other analytical queries. Maybe you use Amazon Athena: your table metadata is going to be in your Glue catalog, but you can use Amazon Athena to run queries using the metadata that is stored in that catalog. And you can also do that with EMR as well.
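
To make the catalog idea concrete, here is a small sketch that lists the tables and columns Glue has registered for a database (the database name is hypothetical); Athena and EMR query against these same catalog entries.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# "soccer_db" is a placeholder for a database populated by a Glue crawler
# or a Glue job; this metadata is what Athena and EMR read.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="soccer_db"):
    for table in page["TableList"]:
        columns = [c["Name"] for c in table.get("StorageDescriptor", {}).get("Columns", [])]
        print(table["Name"], columns)
```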

Last but not the least, I also want to stress the fact that when you use Glue, you can use something like Glue Studio that can help you design your ETL code itself, and you have DataBrew, which allows you to prepare data visually; it's as simple as drag and drop. So if you are a business analyst who has little or no coding experience, you can visually get started with your ETL pipeline.

With that being said, I'll hand it over to Sudhir to walk you through how you can get started with these features.

Thank you. At AWS, we have been innovating around zero-ETL features for quite some time. Basically, when I say zero-ETL, I mean the cases where you are running ETL that is not adding value, or maybe you are just using ETL to consolidate the data and move it into a data warehouse or some other destination. We want to eliminate such ETL processes by integrating the AWS data services, and for that, we have added many capabilities at AWS.

For example, we did the native data integrations with multiple AWS data services.

"We are providing the federated queries, we are providing the built in ML across the multiple data services. And even very recently, we generally available the Amazon Zero, I mean Amazon Aurora Zero ETL integrations for Amazon Redshift.

And in my next few slides, I will be diving deep into all of these capabilities along with the use cases and the architectures.

So let's start with a very common architecture pattern where you need to design a system which can process transactional as well as analytics workloads. At AWS, customers like to use Amazon Aurora for operational and transactional data, and when it comes to analytics, they consolidate all the data from multiple Aurora databases and run Glue ETL jobs to load the data into Amazon Redshift.

And when you run those ETLs, as Sanjit explained, it is cumbersome: you have to manually manage and maintain them, and you have to do the monitoring to ensure there are no errors. To simplify this experience, we introduced a new capability, the zero-ETL integration between Aurora and Amazon Redshift.

What it does, basically, is replicate the data automatically from Amazon Aurora to Amazon Redshift, and this whole infrastructure is serverless. So let's say, as a client, you submit a transaction to your Aurora database. Once you have the zero-ETL integration in place, it detects the change and automatically replicates it into Amazon Redshift managed storage, so that it is available immediately for your reporting.

So you can see, without any ETL, you are able to run a system where the transactional workload runs on Amazon Aurora and you get high-performance analytics on Amazon Redshift, all as one system; you don't have to run any ETL.
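
Under the hood you only create the integration once. A hedged sketch with boto3 (all ARNs and names are placeholders, and the Redshift resource policy must already authorize this source, as shown later in the demo) could look like this:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Placeholder ARNs: the Aurora MySQL cluster is the source and the
# Redshift Serverless namespace is the target. The target's resource
# policy must already list this account and source as authorized.
integration = rds.create_integration(
    IntegrationName="orders-zero-etl",
    SourceArn="arn:aws:rds:us-east-1:123456789012:cluster:orders-aurora-cluster",
    TargetArn="arn:aws:redshift-serverless:us-east-1:123456789012:namespace/example-namespace-id",
)

# Replication starts automatically once the integration becomes active.
print(integration.get("Status"))
```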

Now let's take a look at the second use case, where you want to design a system for streaming data analytics for data sets coming from IoT devices, sensors, or maybe streaming data generated by a website. Typically, what do you do? At AWS especially, customers use Amazon MSK or Amazon Kinesis to capture those data streams, store them in Amazon S3, run ETL jobs, and finally make the data available for reporting, maybe through the Amazon Redshift data warehouse.

To simplify this experience, we added a new capability called Amazon Redshift streaming ingestion. What it does, basically, is connect to your KDS or MSK streams and receive the data directly from the sources, and with the help of automatically refreshed materialized views created in Amazon Redshift, you are able to ingest all your data into Redshift without any ETL.

And there is one more advantage of using this feature: basically, you are able to reduce the latency. Just think, if you don't have direct data ingestion, there will be an ETL layer and there will be some staging. Here you don't have to stage the data anywhere else, so the data is available for real-time reporting immediately, maybe with a few seconds of latency.
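
As a minimal sketch (the workgroup, role, and stream names are assumptions), the setup is roughly two SQL statements, which you could submit through the Redshift Data API:

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

def run(sql: str) -> None:
    # Assumed Redshift Serverless workgroup and database names.
    rsd.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)

# Map the Kinesis stream into the warehouse; the attached IAM role
# (placeholder ARN) needs read access to the stream.
run("""
    CREATE EXTERNAL SCHEMA kinesis_schema
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';
""")

# An auto-refreshing materialized view over the stream: no staging
# layer, no ETL job, data queryable within seconds.
run("""
    CREATE MATERIALIZED VIEW live_events AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."clickstream-events";
""")
```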

The next architecture pattern is a data lake architecture. I am pretty sure several of you would have implemented a platform that uses a data lake. At AWS, customers use Amazon S3 to create a data lake, and then for high-performance analytics, they create an Amazon Redshift data warehouse, either a cluster or serverless.

And in order to load the data from the data lake into Amazon Redshift, they run COPY commands, Redshift COPY commands, through ETL, and then they are able to ingest the data from S3 into Amazon Redshift via scheduled ETL jobs. So again, as you heard, you need to schedule jobs, which means you are loading the data only at certain points in time.

To simplify this, we introduced another feature called auto-copy from Amazon S3. Basically, once you configure your data warehouse for auto-copy, it watches the configured Amazon S3 path and continuously looks for new files. As soon as it gets a new file, it triggers the data load into the Amazon Redshift table; it does not wait for any schedule.
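
A rough sketch of what such a copy job could look like (the JOB CREATE ... AUTO ON clause follows the auto-copy preview syntax, so treat it as an assumption; table, bucket, and role names are placeholders):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# One-time setup: after this, every new file under the prefix is loaded
# automatically, with no scheduler and no separate ETL job.
rsd.execute_statement(
    WorkgroupName="analytics-wg",   # assumed workgroup name
    Database="dev",
    Sql="""
        COPY public.click_events
        FROM 's3://example-data-lake/clickstream/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-role'
        FORMAT AS CSV
        JOB CREATE click_events_autocopy AUTO ON;
    """,
)
```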

So definitely the time to insight is going to be reduced, plus there is no need to run a separate ETL just to load the data from S3 into Amazon Redshift. Moving on to another pattern.

It is very common that customers do not ingest all of their data sets into the data warehouse; otherwise you would need to create much larger clusters to run the analytics. For that reason, customers keep some of the data just in Aurora or MySQL databases, and a large chunk of the data, especially historical data, they keep in the Amazon S3 data lake.

Only the data which is most frequently used for reporting or analytics do they try to keep in Redshift. But what if you want to do overall unified analytics when the data is spread across all these systems? To address that, we have a few capabilities.

The first one is Amazon Redshift Spectrum. Basically, with Redshift Spectrum, you are able to query open-file-format data which is stored in Amazon S3. And when I say open file formats, I mean things like JSON, CSV, and TSV data, and not only that, even transactional data lake formats.

So let's say you have an Iceberg table, a Delta Lake table, or even a Parquet table: you are able to access it, query it, and join it right from your Amazon Redshift data warehouse. You don't need to move the data; it still stays in the data lake, but when it comes to reporting, you are able to do that.
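
For instance, here is a hedged sketch of exposing a Glue Data Catalog database as an external schema and joining it with a local table (all names are placeholders):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

def run(sql: str) -> None:
    rsd.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)

# Register the lake's Glue database as an external schema once.
run("""
    CREATE EXTERNAL SCHEMA lake
    FROM DATA CATALOG
    DATABASE 'sales_lake_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role';
""")

# Join a local Redshift table with Parquet/Iceberg data that never
# leaves Amazon S3.
run("""
    SELECT c.customer_name, SUM(o.amount) AS lifetime_value
    FROM public.customers c
    JOIN lake.historical_orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_name;
""")
```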

And the second thing is, when you want to unify this data with Amazon Aurora MySQL, Aurora PostgreSQL, or RDS MySQL and PostgreSQL databases, we have Federated Query. With Federated Query, you can actually unify this data, and again, as I said, the data remains at the source, but you are able to run the insights. The benefit of this is that there is no need to create an unnecessarily big cluster.

Whenever you want to do the analytics, you can do that. And one more thing to add here: as the data is stored only at the source, you completely save the data ingestion time, because you are not loading the data into your data warehouse. So you save that time as well.
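
A minimal sketch of setting up a federated schema against an Aurora PostgreSQL database (the endpoint, secret, and role ARNs are placeholders) and querying it in place:

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

def run(sql: str) -> None:
    rsd.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)

# Map the live operational schema; credentials come from Secrets Manager.
run("""
    CREATE EXTERNAL SCHEMA ops_pg
    FROM POSTGRES
    DATABASE 'ops' SCHEMA 'public'
    URI 'example-aurora-cluster.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-federated-role'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-db-creds';
""")

# The rows stay in Aurora; only the query result flows into Redshift.
run("SELECT status, COUNT(*) FROM ops_pg.orders GROUP BY status;")
```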

The next architecture pattern supports data sharing and data mesh patterns. It is very common, especially for large enterprise customers, that a single data warehouse is not going to provide the scale they need, and it cannot meet business requirements like data isolation, federated governance, tight security controls, and sometimes price performance, which may vary from one business unit to another.

For these customers, it makes sense to have multiple data warehouses. Now, because you have multiple data warehouses, there will be some reports that you want to run across those multiple data warehouses: either you load all the data into all of these databases, or you simply use Amazon Redshift data sharing.

What Amazon Redshift data sharing does, basically, is this: the producer owns the data, defines the set of permissions, and grants those permissions to the consumer, and then the consumer gets access to a live copy right from the producer. The data still stays with and is owned by the producer, but it is available and can be consumed by multiple consumers.

And when I say the consumer, it can be a different Redshift cluster or it can be Redshift Serverless. These clusters or serverless data warehouses can be in the same AWS account, in a different AWS account, or maybe in a different AWS Region.
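
In SQL terms, the producer and consumer sides are only a handful of statements. A hedged sketch with placeholder share, table, workgroup, and namespace names:

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Producer side: create the share, add objects, grant it to a consumer
# namespace (the GUIDs are placeholders).
producer_sql = [
    "CREATE DATASHARE sales_share;",
    "ALTER DATASHARE sales_share ADD SCHEMA public;",
    "ALTER DATASHARE sales_share ADD TABLE public.fifa_stats;",
    "GRANT USAGE ON DATASHARE sales_share TO NAMESPACE 'consumer-namespace-guid';",
]
for sql in producer_sql:
    rsd.execute_statement(WorkgroupName="producer-wg", Database="dev", Sql=sql)

# Consumer side: mount the live share as a database; no copy, no ETL.
rsd.execute_statement(
    WorkgroupName="consumer-wg",
    Database="dev",
    Sql="CREATE DATABASE shared_sales FROM DATASHARE sales_share "
        "OF NAMESPACE 'producer-namespace-guid';",
)
```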

And as I mentioned, there might be a possibility that you need to implement either centralized or federated governance and compliance. How are you going to achieve that? You may have multiple data warehouses and multiple data shares, and you want a nice way to manage all of this.

So for that purpose, we have integrations with AWS Lake Formation and Amazon DataZone, and even with AWS Data Exchange, so that all this data can be shared with federated governance and compliance across multiple consumers.

And we are not stopping here. There is another use case. As I said earlier, you might have a data lake alongside the data warehouses, so a lot of data is stored only in the data lake, and you still want to share this data lake data with different business units or maybe different customers in a much more secure manner.

For that, again, we provide APIs and CLIs so that you are able to publish your S3 data either into AWS Lake Formation or Amazon DataZone. These services also provide a searchable catalog, and consumers can come search for the right product and then subscribe to the data.

So the producer is able to share the data with the consumer in a much more secure manner. And you will see, right, there is no ETL; it is just a few configurations that you do, and you are able to share your data with different business units or different clusters.

Now, the next one is the predictive analytics use case. Machine learning plays a very important role in innovating and improving customer businesses and experiences. Machine learning is important, but at the same time, it comes with some challenges.

For that, you need to build a separate data pipeline for the machine learning environment. You also need to hire ML experts who understand complex machine learning algorithms, languages, and everything else. And not only this, you need to set up the pipeline itself.

Basically, you will be extracting the data from multiple sources, for example Aurora, S3, Neptune, or maybe Redshift; there are many more, I just named a few. So you will have several data sources, and you will be extracting the data and putting it into the ML environment, on the ML side.

Then you will be using SageMaker or other ML services to create, fine-tune, and optimize your machine learning model, and you will repeat this whole process until you get a well-optimized machine learning model which meets the business requirements and provides the right accuracy for your forecasting or predictions.

And once you find the best model, you need to deploy it, and you need to monitor the machine learning model, because in the future the same model may no longer give the right recommendations for your use case. So you need to establish those workflows as well.

And to make all of this work, there are different tools and different services that you need to stitch together, which is a really cumbersome and even expensive process. To simplify this whole experience, even across different personas, AWS is providing built-in ML integrations across multiple data services.

It means personas like the business analyst, data analyst, BI engineer, and database developer can take advantage of ML without managing and maintaining a complex ML pipeline and without learning ML.

For example, take the BI engineers or business analysts who often use Amazon QuickSight to build a dashboard: they can enrich the dashboards with ML capabilities such as anomaly detection or maybe forecasting without performing any heavy lifting. You can use the predefined ML functions and take advantage of these use cases right inside your dashboards.

And then another persona: let's say, in the data warehousing environment, the BI engineers. They are not really experts in ML, but there are so many things they want to do, and they can do them right from the Redshift data warehouse.

Let's say you want to deal with product recommendations and those kinds of customer use cases; for that, you can use Redshift ML. With Redshift ML, you are now able to create a machine learning model with simple SQL statements: CREATE MODEL, define the dataset or source table, and that's it.

Behind the scenes, Redshift will go and identify the best ML algorithm, parameters, and hyperparameters based on your data set. Then it will create and fine-tune that machine learning model and deploy it inside your data warehouse as a user-defined function.

When it comes to prediction, you use that user-defined function in your SELECT statement and get the output of the ML model. So this is how you can make your whole machine learning process very easy with built-in ML capabilities across the portfolio.
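
A hedged sketch of that flow (the table, column, bucket, and role names are all made up for illustration):

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

def run(sql: str) -> None:
    rsd.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=sql)

# Train: Redshift ML hands the labeled data to SageMaker behind the
# scenes and registers the resulting model as a SQL function.
run("""
    CREATE MODEL customer_churn
    FROM (SELECT age, tenure_months, monthly_spend, churned
          FROM public.customer_history)
    TARGET churned
    FUNCTION predict_churn
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ml-role'
    SETTINGS (S3_BUCKET 'example-redshift-ml-artifacts');
""")

# Predict: the trained model is just another function in a SELECT.
run("""
    SELECT customer_id, predict_churn(age, tenure_months, monthly_spend)
    FROM public.active_customers;
""")
```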

And you also notice, right, because now the ML has come to the source, it means you don't have to move the data from Aurora to a staging ML environment anymore, let's say if you want to do some ML right from the Aurora database. Moving on to the next one.

In the last few slides, I talked about multiple architecture patterns where the data stores or processing services were AWS services, maybe Aurora, Redshift, and so on. What if you have a portion of the data available in a different cloud or maybe in some other system?

And you do not want to build a complex ETL pipeline to move the data from the other cloud, stage it somewhere, stage it again maybe somewhere in AWS, and then finally make it available in your data platform. You don't want to do that.

For that purpose, we have Amazon Athena federated query support. With Amazon Athena federated query, Athena provides multiple connectors so that you can access a variety of data sources, like relational databases, non-relational databases, and even your data lakes, all right from one place through these connectors. Then you are able to run federated queries from Athena, so that you can join all this data from a variety of data sources and have unified analytics.
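
As a sketch, once a connector (for example the MySQL connector, deployed as a Lambda function) has been registered as a data source named, say, mysql_ops, a single Athena query can join it with a data lake table; all of the names here are assumptions:

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# "mysql_ops" stands for a federated data source backed by a connector
# Lambda; "AwsDataCatalog" is the default Glue catalog over the S3 lake.
athena.start_query_execution(
    QueryString="""
        SELECT o.order_id, c.segment, o.total_amount
        FROM "mysql_ops"."sales"."orders" o
        JOIN "AwsDataCatalog"."lake_db"."customer_segments" c
          ON o.customer_id = c.customer_id
        WHERE o.order_date >= DATE '2023-01-01'
    """,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/federated/"},
)
```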

And just think about it: here again you are not moving the data, it just makes connections to the different sources, even outside of AWS, and you are able to run the analytics. But there might be a requirement where you really want to ingest the data from multiple places and store it somewhere, maybe for caching purposes, or maybe for a use case where you are running multiple SQL queries and don't want to process the same things again and again.

In that case, you can think of using the same capability to query the data, join it, and store it in the data warehouse itself. So there is no need to maintain a complex pipeline: just one flow to extract the data, ingest it maybe into the data lake, and use a different set of AWS services to query it.

Now let's take the use case of a retail company which operates in 12 countries, ok? They provide a mobile app as well as a website so that their customers can do their online shopping. At the same time, the company wants to innovate and improve the customer experience: for example, based on the clicks, how customers access the website, and their previous shopping history, they want to give product recommendations, and they want a system which can give forecasting capabilities and so on. That is what they want to design.

So for that, you would be designing an architecture maybe something like this. This is a very standard, typical architecture for a retail customer, where you have data like order data, customer data, supplier data, and product data, all of which you are storing in Amazon Aurora databases. And then the clicks, right? As I said, you have the website, you have the app, and customers are making several clicks, so based on those clicks you want to capture that data set as well. That data you stream through a Kinesis data stream and finally store in the Amazon S3 data lake. And the application is not generating just two types of data; there may be many different data files, and maybe some processing layer is generating JSON data or maybe CSV data, and you want to ingest it all in one place. So let's say you are capturing those things in Amazon S3.

Now here is the fun part. As I said, the company wants to have a 360-degree customer view of all the data, so that they know who all their customers are, what the product inventory is, all these things they want to run. Plus, as I said, they want to innovate, so they will also use SageMaker at the end to generate those kinds of recommendations. So this would be a typical architecture to cover the specific use cases I just talked about.

Now, here you see a series of AWS Glue ETL jobs in multiple places. These are nothing but the ETL processing you are running so that you can consolidate all these data sets into Amazon Redshift. Now I will explain how you can simplify this experience and this architecture.

First, as you see, the block, I mean the three boxes, AWS Glue, S3, and Glue, under the yellow box: this piece you are running so that you can consolidate multiple Aurora databases into Amazon Redshift. And this you can actually replace with the feature I explained earlier, the Amazon Aurora zero-ETL integration with Amazon Redshift. Now there are no ETLs, no more staging, and this data is available in near real time in your data warehouse.

The next component: as I said, you have the Kinesis stream data, which you are loading into Amazon Redshift via a Glue ETL job. You can eliminate that process as well by using streaming ingestion into Redshift materialized views. Again, in real time, you are able to load the data without any ETL. Next.

So you have the data lake, you have the data warehouse, and you want to load this data into Redshift as well, with some massaging. Where you are using Glue, you can make this process simple by using auto-copy, because once you configure auto-copy, it will load this data automatically into Redshift.

The next part: as I said, you are not storing entire data sets in your data warehouse; some data still stays in Aurora or maybe in the data lake, and you want the ability to analyze it. For that purpose, you have Federated Query and Redshift Spectrum so that you can unify this data.

The next one is SageMaker. In this architecture, you are sending the Amazon Redshift data into the SageMaker environment, then you set up the whole big ML pipeline to get those recommendations or ML use cases, and finally you send the results to the application. You can simplify this whole piece by just calling Redshift ML: you already have the data in the data warehouse, so just use Redshift ML, bingo, and send the results to your clients.

Now let's assume the company that was earlier operating in 12 countries is now operating in 50 countries. Maybe now a single data warehouse is not enough to scale the whole system, so what they will do is create multiple data warehouses. Assume they have three data warehouses, and again they have some common reports and some data-sharing requirements. You cannot keep loading this data to all the places; otherwise it is duplicate effort, a waste of money, a waste of time, and unnecessary infrastructure to manage. You can avoid this with simple data sharing. And then finally you send this data to your SQL clients or QuickSight, or maybe different dashboards, even different products like Tableau. Yes, you can send it there and analyze it.

So now you can see how I have simplified the earlier architecture where you saw multiple Glue jobs. Now there are no more Glue jobs, and you are still able to have the 360-degree view of all your data. Plus, you have the ability to scale your environment with advanced data sharing or predictive analytics capabilities.

Now I would like to call my colleague, Sanjit Jain. He will explain some customer success stories where he has implemented some of these features and capabilities for his customers, and how they were able to simplify their solutions. Sanjit, over to you.

Thanks. I'm pretty sure that's been a lot of information so far, right? How exactly things flow, the different services, the different offerings, and all. I see many people are very excited about the architectures, and they are really looking forward to the actual customer implementations, and I'm lucky enough to talk about that. As part of my role, I work with different customers across different industries: insurance, healthcare, retail, manufacturing. I'll talk about two of those customer success stories and how exactly the different AWS services and offerings we just learned about helped simplify their architectures.

The first implementation is from a talent acquisition company, a B2B company that serves different Fortune 100 and Fortune 500 customers. Their major objective was: how can we get more data from different varieties of sources, analyze that data, and democratize it to the different teams so they can do analytics at scale? That's where we helped them leverage different AWS services and the different connectors AWS provides, and achieve analytics at scale with proper security. From a business standpoint, it helped them go to market faster with less human intervention. So this was the background on this customer.

I know the exciting part is the architecture piece, so I'll walk you through the architecture step by step: how exactly we built it and how it works in production for the customer. On the left, you will see different sources, social media feeds, transactional databases, which, with the help of the different connectors we just learned about a few minutes back, are ingested and stored into Amazon S3. Then, with the help of Glue, which provides value-added transformation, we did that slicing and dicing of the data and stored it into Amazon Redshift, which becomes the centralized warehouse system. Not only that, we also democratized this data to the customer's different teams, whether the analytics and marketing teams or maybe the customer onboarding team, so they can take advantage of this data and produce the insights and understanding they want. We packaged this entire architecture with all sorts of best practices and DevOps capabilities.

I'm pretty sure you are excited about this architecture, but let me take one step backward. Let's think about what this architecture would look like if we didn't have these offerings from AWS, all those zero-ETL and serverless capabilities, Glue, and those kinds of things, right?

So if you see on the left, where I talked about how we leveraged different connectors, like AppFlow, Kinesis Data Firehose, and Glue, to ingest the data: if we did not have those capabilities offered by AWS, you would typically write code in some programming language, Python, Java, .NET, whichever you're comfortable with, and not only develop it but also host that application and scale it as and when you require. And when your business really scales, scaling that infrastructure would again be a pain, which could bring down all of your downstream applications. So you can see how these connectors simplify that out of the box.

Let me show you one more example, which you see on the Redshift side: the democratization, the data sharing feature I mentioned. If this data sharing feature were not available, the other option would be to do some sort of continuous synchronization, in other words replication, from one cluster to another. If I had done that, it would still solve the problem, but think of your security team: now they have to manage all of these Redshift clusters from a security and a DevOps standpoint, and from a data integration perspective you need to keep them synchronized. So it would be quite a lot of overhead to manage all of this, and I'm pretty sure the security folks in the room would not allow it, because now you need to manage security in five different places. But with the help of the zero-ETL architecture, you can simplify that in one place: the producer is responsible for managing the security and the data, which you can see on the data warehouse, and on the extreme right you see different endpoints or different teams, right? They are the consumers, and they can freely consume this data without running into all sorts of blockers: hey, I need this table, can you give me access to it? Hey, I need that table, can you give me access to it? That's what zero-ETL simplifies and scales, so the customer can not only focus on the business but also make well-informed decisions.

Let's move to another example. This one is from an insurance customer. We all know that when we talk about insurance customers, security of data is the prime thing, because they work with different PII data. Not only do they want to secure their data, they also want to secure the devices where they are processing and operating this data. So this is a customer that wants to secure their devices: all sorts of logs, real-time logs, device logs, ingested in real time, with different kinds of security checks on top, plus the compliance needs, making sure the devices are not outdated on any software and other such things. Doing all of this in real time is really challenging. That's where we leveraged different AWS serverless and zero-ETL offerings, which helped them take this to the next level, providing the scale, providing anomaly detection in real time, and letting them focus more on processing the data rather than really worrying about the security and other functional aspects of it.

Let's take the architecture standpoint here. On the left, you will see different kinds of log sources, and you also have some flat files from device logs, right? Think of your laptop, where you have different logs being generated. All of these logs, with the help of Kinesis Data Streams, are sourced into Amazon Redshift, which becomes the centralized warehouse or storage layer here. With the help of the out-of-the-box ML capability of Redshift, we run inference on this data on the fly and detect different anomalies in that data, things that are abnormal. For example, maybe someone is trying to get into your system and keeps forgetting the password; maybe they made three attempts and you want to block that user, even if it is a legitimate user, because you already gave them three attempts. Think of your ATM or credit cards: if you make three or four attempts, they will block you even if you are a legitimate user. If you want to take that kind of action, it's very important that you do it at scale and in as close to real time as possible. And not only that, you also package this with DevOps capabilities, so you can repeat this and scale this as your user base increases or your business scales.

Let's take one step backward and replay what this architecture would have looked like if we did not have Redshift ML, which is the wow feature of this architecture. I'm pretty sure most of the people in this room are aware of something called Amazon SageMaker, and we can see it in the architecture. Absolutely, I'm not saying you can't solve this: if you did not have Redshift ML, you could have solved this with Amazon SageMaker too.

But think of it, as I said, from the use case: we need to process this data in near real time, because someone is trying to get into your system. So now, when you have this data flowing through the pipeline, you need to process it, and you also need to hand it to Amazon SageMaker for real-time inference, plus you need to get the output back, consolidate it, store it, and act on the detection. It would be too much chaos and complexity, right?

What am I doing here? Am I storing my data? Am I focusing more on the machine learning? Am I focusing more on identifying the security vulnerabilities? It's going to be all over the place, right? And that's where Redshift ML simplifies this by bringing that capability out of the box. You can do the same things you would do on SageMaker, including bringing your own custom machine learning models, but simplified and at scale, without worrying about the security and governance part of it.

So that's the interesting thing about the many such use cases we have solved for our customers across industries. These are two examples which will help you understand how exactly zero-ETL architectures, and the zero-ETL services and features, are really important when you get on the ground, and how they can help you simplify and focus more on the business.

I'm pretty sure that's been too much theory so far, right? So much information, so many architectures and other stuff, and you might not be happy if we don't end this with a proper demo, right? Something that shows it live in action, where you can get a feel for how exactly zero-ETL comes into it.

Let me pass it on to Anthony, who will show us a demo of how exactly this zero-ETL comes in, live in action.

Well, thank you so much, Sanjit. I think we've covered a lot of reference architectures here; we have one more, because we have to talk about that and how we're setting up the demo, right? I think the important thing for us to understand is that in this use case, in today's world of soccer, data is very important, right?

For example, if a national team coach has only five days to prepare the entire squad for a major tournament, they don't have the data in front of them. They don't have data like how many injuries this player has had, or how much play time this player has had for their club. That information and those details are not provided to the national team coaches ahead of time, which is why, if you noticed, over the last two weekends we had a ton of injuries, and these coaches were forced to play the youth players on their teams.

The second most important aspect is during half time. If a player is actually running out of fuel or cannot last the 90 minutes, the coach needs to know that: hey, maybe he is running out of steam, let's put on another player. These are decisions that can actually influence the game, and a lot of successful teams have implemented this, which is why they've been able to make smart transfers and stay ahead of the game.

So in this demo and this architecture, we've simplified a ton of these items based on what we've explained so far. We're actually pulling data from a third-party source, sofifa.com, which stores this massive data set, and then we're going to ingest that into Aurora database instances and load it into Redshift.

We also have another workflow. Let's say you have real-time data coming in: you can stream data from Kinesis Data Streams and then ingest that into Redshift as your sink as well. So our master table is going to be FIFA stats. You don't have to remember everything, because we're going to walk you through the demo, but just to give you a reference. And last but not the least, we're going to visualize this in QuickSight: we're going to publish dashboards that users can look at and, with a single click of a button, actually slice and dice the data.

Alright. So at step one, we're going to configure our RDS instance. In the AWS console, you click on the RDS link, and on the left-hand side you're going to see something known as parameter groups. It's important to understand that we are going to modify certain parameters in the zero-ETL custom parameter group that we have created.

There are around six parameters that we're going to modify for this specific integration. You don't have to take a pic of this, because we are going to share the blog as a resource after this. Basically, the whole idea is you modify these values and get them set for your zero-ETL integration.

Once you save the file, you're able to go on and create your database. We have created two Aurora instances here: your soccer stats demo and your zero-ETL source. We're going to look at the configuration details of what we've actually set up in our instance, so please pay attention; there are a couple of things that you need to note.

Now, that's your Aurora MySQL version, right? You're going to make sure that you copy your Amazon Resource Name, or ARN, because this will be used as an input for your source when you create your zero-ETL integration. So make sure you copy that over there.

Then we're going to flip to the Redshift console, where we have created a Redshift Serverless namespace. On the left-hand side you'll find the serverless dashboard, and there you're going to see that we've created a Redshift Serverless namespace that is associated with a workgroup. And there are some properties that you need to gather from your serverless namespace as well.

So let's look into those. Again, the important thing is you're going to copy your namespace ARN, because Redshift Serverless is going to be your destination for loading your data from Amazon Aurora. And below that, you're going to notice a resource policy tab where you actually have to grant certain permissions.

You, as a user, have to grant yourself permission as an authorized principal, so your AWS account ID is going to be there. Secondly, you're going to add your Amazon Aurora instance as your authorized integration source, which is why you copied your ARN, right? So you're going to provide that as an input there.

And once you set this up, that's all it takes. Now you go to your RDS dashboard, and we're going to get started with the zero-ETL integration. You'll find on the left-hand side the tab Zero-ETL Integrations. We have created a Zero ETL FIFA 23 demo, which is a zero-ETL integration that is active. As you can see, your source is Aurora MySQL and the destination is Redshift Serverless, which is why we configured your ARN there as well. This validates your successful configuration of the zero-ETL integration between Amazon Aurora and Redshift Serverless. It's that simple; we didn't have to run any ETL jobs.

Now we're going to look at the data in the query editor to see what is being stored in your Amazon Redshift Serverless. If you remember, in the reference architecture diagram we talked about the table FIFA stats, which is your master data set, and we also created materialized views for specific national teams. Like I said, let's say the coach of Argentina needs to understand the status of the current players in the Argentina national squad. This is the master data set.

So you can see we have pulled data from sofifa.com, and this has the complete details of all the players: current stats, dribbling stats, number of passes per game, number of touches on the left-hand side of the field. And we will visualize that in QuickSight as well.

Now, let's look at the Argentina national team materialized view that we've created here, right? This is only going to pull in data for the players on the roster who belong to Argentina. So this way we simplify that process, and this information is readily available for you already. Every time there's a new record added, it gets appended to this materialized view, right? That's how easy it is.
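
The view behind that screen could be as simple as the following hypothetical definition; the workgroup, table, and column names loosely follow the FIFA stats table described above but are purely illustrative:

```python
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

# Illustrative only: filter the replicated master table down to one
# national squad; AUTO REFRESH keeps it current as new rows arrive.
rsd.execute_statement(
    WorkgroupName="soccer-wg",   # assumed workgroup name from the demo
    Database="dev",
    Sql="""
        CREATE MATERIALIZED VIEW argentina_squad AUTO REFRESH YES AS
        SELECT player_name, club, overall, potential, work_rate
        FROM public.fifa_stats
        WHERE nationality = 'Argentina';
    """,
)
```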

Now, let's visualize this whole thing in QuickSight. If you look at the club level, we've visualized data from over 52 leagues across the globe, in different time zones. So all of this is multiple data sources, and we were able to pull it all together that easily, right? We've created graphs here to understand the work rate of players. Why is that important? We just talked about injuries, so determining the work rate of certain players is important to understand how often they get injured and how much time they need to recover.

For example, a younger player might take less time to recover. So all these attributes are needed to arrive at the analysis, right? And then we look at the attacking attributes: for example, we talked about the number of shots taken from outside the box, the number of headers, the number of finesse shots. All these details are actually captured in your data, and you can visualize them.

Now, let's say you want to zero in on the English Premier League, right? One of the most popular leagues in the world, you want to look at the attributes only from the English Premier League. So you're able to identify every player, every nationality and their stats with respect to the players in the English Premier League.

Now let's slice and dice it even further. Let's go one level deeper and look at the player attributes, right? Here we split the players according to their nationalities within the league, and then we're also looking at the work rate distribution: medium/medium, medium/high, and obviously medium/medium seems to be the most predominant one.

And then we also have a score computed based on the analysis of the age and the work rate, and here we are analyzing the potential of players. For example, a 16-year-old player has much more potential, because we still don't know what this player could turn out to be. We also looked at the number of players who prefer the right foot versus the left foot, for example, so we've created a simple graph for that, and then you have the individual player statistics, right?

So it gives you the overall potential, the overall score based on the age and the computation; we have all those attributes picked out there. Now, let's say I want to go one level further: I want to add a filter and see the number of players who have an overall score greater than 85, for example. So we create another graph below for that, and if you notice, the patterns are kind of similar, so right-footed players predominate over left-footed ones; that's what we can conclude.

And I also want to touch upon one piece of information here for those who are familiar with the players: we have Eden Hazard, and I think the data has been very generous in rating his work rate as high/medium, which probably caused him to retire early. But anyway, with that being said, that was how easy it was to actually run this integration, and we have the resources for you.

So just to summarize everything that we've talked about so far: why do we need zero-ETL? It eliminates the need for you to create complex ETL pipelines. You don't have to have eyes on every single workflow, and you don't need to hunt for where it's failing; we are eliminating that by simplifying the whole workflow. If you look at all the architecture diagrams, we've done the same thing.

The second most important aspect is, as Sanjit talked about, we want to analyze data close to real time or in near real time, and we're able to do that with this architectural pattern. We're able to use Amazon Redshift and process petabytes of data coming from transactional databases like Amazon Aurora as well.

And last but not the least as Sudhir mentioned, we have a ton of data sources and the important value for you and your customer is to unify all these data sources, aggregate them, cleanse them and then derive insights instead of only having a one dimensional view.

Ok. So this is the most important part of the session for all of you. I know three of you are already taking photos; I'd like you to keep your phones out, and the rest of you, take out your phones and please take a photo or scan the QR code, because these are going to give you links to everything we've just talked about.

Some of them are blogs that actually show you the zero-ETL integration, right? And if you have an interest in learning more about the other analytics services that we haven't talked about, you can refer to the third one. We also have very specific case studies, like the ones Sanjit talked about, and you can stay abreast of all the latest information at AWS with the What's New posts.

I'll give you a few seconds while everyone's taking them down. Looks like we have that. Alright, there is another important step. We offer several trainings; I know we've covered a lot of information, and you or the teams in your company may not have the expertise yet, but we can get you started with trainings. These are led by AWS experts in the analytics services, right? We provide them for both our partners and customers, and we run them throughout the year.

You can look at the schedule based on your availability and pick and choose. We also provide over 500 free digital courses for you to get started building on your own and enhance your AWS skills.

Now again, we're all on LinkedIn, so you can connect with us if you have any more questions, and we're going to be outside if you have questions to take right after this. But last but not the least, please open the AWS Events mobile app, look at your session details, and complete the survey. I humbly request you to do this because it is going to provide us feedback so we can come back to you next year with more interesting content and information.

Thank you all very much.
