What’s new with AWS data integration

Hi, everyone. Welcome. It's day four at re:Invent. I hope you are having a great time. You're learning a lot, making new connections and having fun. It's 3:30 PM here. This is the very last session between you and the happy hour and we promise we'll make it worthwhile.

There is a lot of exciting stuff in this session. Welcome to ANT220: What's New with AWS Data Integration. We have three speakers today. My name is Santosh Chandra. I'm the General Manager for AWS Data Integration, consisting of AWS Glue, Amazon AppFlow, and MWAA, which is Managed Workflows for Apache Airflow.

With me, I have Sean Myron, who's the General Manager for MWAA and Orchestration. I also have Nishit Desai, who is a Managing Director at Goldman Sachs. Together we will be presenting this session. We have a lot to cover. Let's get started.

I'm gonna ask my co-presenters to take a seat for now. We'll be taking turns with this presentation. After me, Nishit will present, followed by Sean.

I'm gonna go over the agenda. We're gonna talk about why data integration is important and why it's a critical step to get it right to form comprehensive insights for your business. Then we're going to talk about how AWS thinks about data integration - what are its key pillars? We're going to talk about our key investments along those pillars. We're going to make lots of new exciting announcements today.

Then I'm going to pass it over to Nishit, who's gonna connect the dots for you by sharing Goldman Sachs' journey onboarding to AWS Glue and simplifying their data integration process.

Finally, I'm going to conclude and share some additional pointers for you to double-click on after this session. We have a lot to cover. Let's get started.

As usual, we're going to open up with our customers. We have hundreds of thousands of customers using Glue on a monthly basis. They come in all sizes, all form factors, they have different use cases. They are spread across multiple industries.

For example, we have customers like BMW, who built a self-service data integration platform using Glue, enabling over 5,000 business users. We have customers like AA Bank, a leading bank in Brazil, who built their payments platform using Glue and a data mesh architecture.

We have customers like Merck, who used Amazon Kinesis and AWS Glue to build their ingestion and transformation layer for near-real-time insights. Some of these customers also co-presented with us at this re:Invent.

For example, customers like Chime Financial shared their Glue streaming use case in our chalk talk, and customers like BMO (Bank of Montreal) shared their ETL modernization and migration journey to AWS Glue.

Finally, we have customers like JPMC, who shared their data quality journey using AWS Glue Data Quality. We are thrilled to see so many customers sharing their success stories.

One such example is BMS, Bristol Myers Squibb. They are a leading pharmaceutical player. They onboarded to Glue many years back and have been using it for the last five years. With Glue, they have built a highly scalable serverless ETL platform for their business users, which has accelerated drug discovery inside BMS.

Another example is GoDaddy, a global leader in web hosting and domain registration. They onboarded thousands of pipelines to MWAA, and they realized operational gains, stability, and the scale they had dreamed of.

Here are some key facts we have. As I stated earlier, we have hundreds of thousands of customers using AWS Glue on a monthly basis. They are running hundreds of millions of data integration jobs on a monthly basis. We have hundreds of transformations and we support 100+ data sources for data integration.

Now with this overview, I'm going to change gears and talk about data integration. So why data integration? Why do we care? What's the motivation?

Data integration is the very first step of connecting, ingesting, cleaning, transforming, combining, and cataloging your data before the data can be used for analysis. This is the very first step, and you need to get it right; otherwise you will not have an accurate and comprehensive view of your business.

If you rewind 10 years, data integration was manual and complex; you needed to hire specialized skill sets such as ETL developers. The data was siloed, the processes were manual, and jobs ran in batch mode, typically on a nightly basis. Near-real-time insights were not a possibility.

Now as businesses have accelerated their pace of innovation, they are demanding more from data integration as well. There are three things that have changed in the last few years:

  1. Businesses expect up-to-the-minute insights to make informed business decisions and get operational insights. That means data integration needs to run in minutes, sometimes even in near real time.

  2. They want data integration to augment their critical business processes and operational systems. Data integration is no longer an offline process.

  3. They want more and more business users to participate in the data integration process and be able to do data integration and analysis themselves. They want multiple tools that enable all of their data workers to take part in data integration.

So what does the data integration process look like? It begins with a motivation, such as wanting to build an interactive dashboard. Then you start finalizing the requirements with your business stakeholders: you establish your data contracts and agree on the output with your data stakeholders.

And then you start identifying the sources that you can connect to. Once you have identified the sources, the connectivity, the authentication model, you start bringing the data in. Once you get the data in, you start transforming the data, cleaning the data, removing the inaccuracies.

And finally, the data is organized in a way such that downstream analytics engines and processes can be highly performant.

Finally, you test, debug, and iterate; test, debug, and iterate. Many of you must be familiar with this; it's a highly iterative process. And once you're happy with the inputs and outputs, you operationalize them into a pipeline: you build explicit SLAs, you monitor your pipelines, you reduce your data downtime, and you ensure data quality meets a high bar.

It all sounds simple, but there are a variety of challenges. If you talk to business teams, they constantly complain: they are not moving as fast as they would like, they are missing SLAs all the time, and they are relying on central teams and don't understand why it takes so long.

If you talk to data engineers, they say they have to juggle writing business transformations, managing infrastructure, and capacity management. Legacy tools do not scale as data sizes increase, which results in them spending sizable time analyzing, fine-tuning, and debugging their jobs.

If you talk to the IT department, they say their time is consumed by vendor management. There are too many vendors to manage and too many tools to take care of, and many of them are proprietary. Vendors keep raising prices year after year without adding explicit value to their tool chains.

Many of you must be familiar with these challenges. Some of you can associate with all of these challenges.

Now, let's talk about how AWS simplifies the data integration process. We can conceptualize the data integration process in four core pillars:

  1. Connect - We want you to connect to every single data source, removing data silos.

  2. Transform - We want you to be able to transform using the preferred tool of your choice, enabling lots of data workers in your organization.

  3. Operationalize - We want you to be able to operationalize data pipelines at scale without needing to manage infrastructure or worry about operational stability.

  4. Manage - We want you to be able to manage your data and improve data quality at scale.

What we're gonna do now is we're gonna go through these buckets one by one. We're gonna discuss the challenges, we're going to talk about what are the key innovations or approaches we are taking to simplify data integration. And then we're gonna do some cool announcements.

I'm also gonna share, for every pillar, an animation that connects all these concepts together. So let's get started with connectivity.

Our vision for connectivity is to allow you to connect to any data source in a reliable, secure and performant manner without worrying about cost. There are a variety of challenges.

The first challenge is that new data sources keep coming up. Every data source has its own unique way of connecting. The authorization models are different, the authentication models vary, schema changes are constant, and the data models differ. That means you need to hire an application engineer or an IT professional to keep up with the new data sources that keep cropping up.

Even if you hire the right skill set, there are hundreds of these sources, so soon the days become weeks, the weeks become months, and your business slows down. This is not scalable.

The second challenge you face is that finding new data sources is an afterthought, and data discovery takes a long time.

Against the backdrop of these challenges, this is how we are thinking about connectivity. We want you to have access to a broad set of connectors. These connectors range across data warehouses, databases, third-party SaaS sources, AWS native services, open table formats for data lakes, and streaming sources.

If a source is not supported natively in the Glue runtime, we offer two options. First, we offer a custom connector SDK against which you can build your connector; once built, you can onboard it to the Glue runtime and use it across your data integration jobs.

The second option is a marketplace of connectors populated by our third-party partners. You can download these marketplace connectors at your own pace.

When we were thinking about connectivity, we followed these tenets: for every single connector, we want to make it reliable, secure, and integrated. We want to hold it to the same quality bar in terms of security, access control, and cost control that we offer for any AWS feature.

We want to make them extremely easy to use. That means that with just a few clicks, even less technical, no-code users should be able to access data.

Finally, there is no charge for using connectors. They are free; you pay only for running your jobs.

Now, one of the common asks from our customers was whether AWS Glue could natively support third-party data warehouse and database connectors. So I'm glad to announce that Glue now supports 10+ data warehouse and database connectors.

They cover a wide variety of sources. For example, we have all the popular data warehouse connectors, such as Snowflake, Teradata, and Vertica. We have multi-cloud connectors such as BigQuery and Azure Cosmos DB. We have NoSQL database connectors like MongoDB, and in-memory database connectors like SAP HANA.

So we wanted to solve the third-party connectivity problem for you. These connectors are embedded in our runtime.

They get the same reliability, security, and performance characteristics that we have for other built-in connectors. Now, while we are supporting third-party connectors, we also wanted to enhance the characteristics of our first-party connectors to our native services.

One such example is Redshift. Redshift is our fast, fully managed, cloud-native data warehouse. With the new connector enhancements we are announcing, Redshift and Glue are better together. With this connector, it's easy to manage Redshift tables across Redshift clusters and Redshift Serverless. You can accelerate your end-to-end ETL and ELT pipelines by accessing Redshift metadata, Redshift tables, and Redshift stats directly, and build these pipelines in a reliable way inside the Glue runtime.

This particular connector is also very easy to onboard to; you can simply onboard your ELT and ETL workloads using Glue Studio. Finally, this connector is highly performant. Not only do you not pay for the connector, we also use Redshift's query optimization techniques to push certain operations down to the Redshift cluster, getting you acceleration and better price performance.
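To make this concrete, here is a minimal, hedged sketch of what reading from and writing to Redshift can look like in a Glue PySpark script. The connection name, S3 temporary directory, and table names are placeholders, and the option keys should be treated as assumptions to verify against your own setup.

```python
# Hedged sketch: Amazon Redshift read/write from a Glue PySpark job.
# Connection names, buckets, and tables below are illustrative placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a Redshift table through a pre-created Glue connection.
orders = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-demo-connection",   # assumed connection name
        "dbtable": "public.orders",
        "redshiftTmpDir": "s3://demo-bucket/redshift-tmp/",
    },
)

# ... apply built-in or Spark transformations here ...

# Write the result back to a curated Redshift table.
glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-demo-connection",
        "dbtable": "public.orders_curated",
        "redshiftTmpDir": "s3://demo-bucket/redshift-tmp/",
    },
)
job.commit()
```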

Another example is Amazon OpenSearch Service. OpenSearch makes it easy for you to do real-time log analysis, real-time application monitoring, and website search. We are also announcing a highly performant OpenSearch connector. With this connector, you will be able to access OpenSearch indexes, perform queries on them, and write data transformation tasks against OpenSearch data.

Now, as you are all aware, LLMs have taken over the world; however, LLMs can suffer from accuracy issues in their responses. One method to improve accuracy is to use retrieval-augmented generation (RAG) techniques, such as using vector databases to store vector embeddings that can be combined with your prompt inputs. OpenSearch supports vector search natively. Using Glue, you can enrich and transform data and create high-quality vector embeddings that will improve the quality of your LLM responses.

What I'm gonna do now is play a small animation. This animation shows how you can connect to MongoDB, pull that data in, join it with your Iceberg table, which is your transactional data lake in Amazon S3, run a bunch of transformations, and write the result into Teradata, one of the third-party data warehouses we talked about.

So as you can see here, we've selected MongoDB and we're creating a connection: enter your parameters. At this point, you can start creating a new job. You select the source, which is MongoDB, you select another source, which is Amazon S3, and you start adding transformations. We support join and filter, which are built-in transformations. Then you add the target, which is Teradata here, and you're done. Now you can start joining these datasets. You can even preview the output at each step along the way; for example, you can see the data in Amazon S3 and the data in MongoDB, and you can get your end-to-end data flow or ETL job authored in just a few minutes.
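The same flow can be expressed in script mode. Below is a hedged sketch, not the exact job Glue Studio generates: the connection names, databases, tables, and option keys are illustrative assumptions, and the Iceberg read assumes the job is configured for Iceberg with a catalog named glue_catalog.

```python
# Hedged sketch: join MongoDB documents with an Iceberg table in Amazon S3 and
# load the result into Teradata. All names and option keys are placeholders.
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Source 1: MongoDB, through a Glue connection created in the console.
customers = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb",
    connection_options={
        "connectionName": "mongodb-demo-connection",  # assumed connection name
        "database": "crm",
        "collection": "customers",
    },
).toDF()

# Source 2: an Iceberg table in the Glue Data Catalog (job configured with
# --datalake-formats iceberg), holding transactional data in S3.
transactions = spark.table("glue_catalog.sales_db.transactions")

# Built-in-style transforms expressed as Spark operations: filter, then join.
curated = (
    transactions.filter("amount > 0")
    .join(customers, on="customer_id", how="inner")
)

# Target: Teradata, via the third-party connector.
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(curated, glueContext, "curated"),
    connection_type="teradata",
    connection_options={
        "connectionName": "teradata-demo-connection",  # assumed connection name
        "dbtable": "analytics.curated_transactions",
    },
)
```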

So as I demonstrated, you can author against third-party data warehouses, connect to AWS native sources, connect to your multi-cloud data sources, and form a coherent data integration story.

Now we're going to talk about the next bucket, which is transforming data with your preferred tools. There are a series of challenges here. First, there is no single tool. If you are a data scientist, you want to use a notebook; if you are a data engineer, you want to use an IDE; if you are a low-code or no-code developer, you want to use visual drag and drop; and if you're a business user, you want a wrangler interface, such as an Excel-like interface. There is no single tool that can make everyone productive.

The second challenge, as I described earlier, is that more and more users want to participate in the data integration journey. That means they need to be able to share their artifacts across different teams as they successively build complex pipelines and jobs, so modularity is of critical importance.

Third, as these jobs start scaling against large datasets, they do not perform well, or sometimes they don't even run. That means you are trading cost against performance or stability. Against the backdrop of these challenges, this is how we think about simplifying transformation; I'm going to double-click on this.

In the next slide, at the top layer, you can see that with Glue you can author jobs using a variety of tools. You can use Glue Studio, a visual drag-and-drop interface, for less technical users. You can use the built-in notebook for data scientists. You can use a code editor for your data engineers. And you can use APIs and SDKs to operationalize your workloads at scale.

The next layer is transformations. We want to offer you as many built-in transformations as possible so that you don't need to write a single line of code. We also want you to be able to bring your own transformations, because we cannot support every business transformation associated with your business datasets. For this, we give you a way to author a transformation and register it with us, and those transformations are then discoverable and consumable across a variety of users. You can even author a DataBrew recipe using our DataBrew tool; once authored, you can bring that recipe in as a Studio transformation, productionize that job, put it on an MWAA pipeline, and run your pipelines at scale.
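As a rough illustration, the Python half of a bring-your-own (custom visual) transform can look like the sketch below. The transform name and logic are made up, a companion JSON config file (not shown) would describe the transform so it appears in Glue Studio, and the registration pattern is an assumption to verify against the documentation.

```python
# Hedged sketch: the Python half of a bring-your-own transform. The transform
# name and logic are illustrative; the accompanying JSON descriptor is omitted.
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F


def normalize_country_code(self, column_name):
    """Upper-case and trim a country-code column on the incoming DynamicFrame."""
    df = self.toDF()
    df = df.withColumn(column_name, F.upper(F.trim(F.col(column_name))))
    return DynamicFrame.fromDF(df, self.glue_ctx, self.name)


# Registering the function makes it callable like any other DynamicFrame method,
# e.g. my_frame.normalize_country_code("country_code").
DynamicFrame.normalize_country_code = normalize_country_code
```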

Finally, all of this is supported by our serverless engines. We have highly scalable serverless engines that can give you hundreds of nodes in just a few seconds, which means you can do really high-performance big data processing. They are also based on open source, so there is no proprietary lock-in; you can take this code outside of Glue and use it somewhere else.

We also support a variety of modes for your data integration jobs: batch, micro-batching, stream processing, and interactive processing. All these modes of processing get the same advantages you have from the other parts of the Glue system.

Now let's talk about built-in transforms. Glue now supports hundreds of built-in transforms, and this year we announced another 20. Some of these are highly complex transformations, such as pivot. Once you use built-in transformations in your Glue jobs, those jobs are highly portable: we maintain the code behind the scenes, we apply performance best practices as they emerge, and we help you migrate your jobs from one Glue version to another seamlessly. So we highly encourage you to use built-in transformations in Glue.
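As a small illustration of what a pivot reshaping does, here is plain PySpark (not the Glue Studio transform itself), with made-up column names and data:

```python
# Pivot illustration: turn one row per (region, month) into one row per region
# with a column per month. Data and column names are made up for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sales_df = spark.createDataFrame(
    [("EMEA", "Jan", 100.0), ("EMEA", "Feb", 120.0), ("APAC", "Jan", 90.0)],
    ["region", "month", "revenue"],
)

monthly = sales_df.groupBy("region").pivot("month", ["Jan", "Feb"]).sum("revenue")
monthly.show()  # one row per region, one revenue column per month
```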

Now, one type of transform that's becoming popular is SQL-based transformation; that's a new pattern we've seen emerge lately. Glue supports SQL transformations built into Glue Studio. There is also a popular tool called dbt (data build tool) that allows you to do SQL-based modeling, apply your business transformations, express them declaratively in a dbt project, and apply them across your pipelines.

Today, I'm happy to announce that Glue now has a dbt trusted adapter. What that means is you can include the Glue adapter in your dbt projects. Behind the scenes, it will start an interactive session, materialize your artifacts, create the tables, apply the transformations, and simply recycle the infrastructure; there is no infrastructure management on your side. It also supports the same SQL and transformation modeling that you are used to on data warehouses, now on data lakes. Again, this is open source, there are no servers to manage, and dbt is a highly popular open source project.

I just talked about our horizontal scale: we can give you hundreds of nodes in a few seconds. But some customers have highly complex business logic for which vertical scale works better. For this, we announced support for larger worker types this year. If you recollect, Glue announced G.1X and G.2X workers many years back; a G.1X worker gives you 4 vCPUs and 16 GB of memory. For some advanced workloads, more vertical scaling is required, so we announced G.4X and G.8X workers. With G.8X workers, you get 32 vCPUs and 128 GB of memory per worker. That's a lot of horsepower, coupled with the fact that you can get hundreds of those workers; there is no dataset that you cannot process with this kind of horsepower. Again, there are no servers to manage; it uses the same serverless infrastructure.
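As a hedged sketch, requesting one of the larger worker types is a single parameter on the job definition; the job name, IAM role, and script location below are placeholders.

```python
# Hedged sketch: creating a Glue job with the G.8X vertical worker type via
# boto3. All names, ARNs, and paths are illustrative placeholders.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="heavy-join-job",                                   # assumed job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",       # assumed role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://demo-bucket/scripts/heavy_join.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.8X",      # larger vertical worker (more vCPUs and memory)
    NumberOfWorkers=20,     # horizontal scale on top of the vertical scale
)
```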

Next, we're gonna talk about our streaming system. As I said earlier, businesses want near-real-time data insights, and for that you sometimes need to use streaming systems. We talked about Merck and Chime using streaming to enable their near-real-time use cases. Glue supports streaming out of the box, and this year we added three new innovations to Glue streaming.

First, we allowed Glue streaming jobs to be authored as visual drag-and-drop jobs. Second, while customers love the streaming capabilities we have, a stream is not always equally active: as user activity picks up, new events occur, and the stream's activity ebbs and flows. For that, we invested in an auto scaling feature. With auto scaling, you can simplify your infrastructure and job DPU management: all you need to set is your max DPU, and behind the scenes Glue streaming will automatically scale up and down based on the activity in your stream.

Third, we announced a 1/4 DPU option. Sometimes streams do not require heavy processing, but you still need an always-on job that processes data in near real time. With this, you can have an extremely cost-efficient Glue streaming capability for your data integration workloads.
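A hedged configuration sketch for such a streaming job follows; the names and ARNs are placeholders, and the auto scaling switch is assumed to be set through the job's default arguments.

```python
# Hedged sketch: a Glue streaming job using the quarter-DPU worker type with
# auto scaling enabled. All names, ARNs, and paths are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="clickstream-near-real-time",                       # assumed job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",       # assumed role
    Command={
        "Name": "gluestreaming",
        "ScriptLocation": "s3://demo-bucket/scripts/clickstream.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.025X",    # quarter-DPU workers for light, always-on streams
    NumberOfWorkers=4,      # acts as the ceiling when auto scaling is enabled
    DefaultArguments={"--enable-auto-scaling": "true"},
)
```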

Now, we've done all these innovations, but we also wanted to enable you to author jobs in half the time. So this is what we did: we analyzed the job workflow and the number of clicks you need to go through to build your data integration workloads, and we made a series of improvements. For example, we simplified your onboarding setup. We now offer a getting-started job so you don't need to start from scratch; the boilerplate code is already available for you. We added a guided experience for your business users.

We also now allow you to bring your DataBrew recipes into Glue Studio, embed them, and run them like any other Glue job. We simplified the notebook getting-started experience. And finally, we give you responsive data previews as you're authoring your workloads and transforming your data. Together, with all these innovations, we've seen job authoring time reduced by 50%.

Now let's take a look at this in action. This is Glue Studio. We click on one of the nodes; this node is doing a transformation that prepares dates in a particular format. The dates currently use slashes, as you can see in the column there.

Now we want to change it to dashes: you simply change it there and start the data preview. As you can see, you get an interactive way of analyzing your work and iterating over it, and it really gives you the acceleration you're looking for.

As you've seen, we offer a variety of interfaces for your data workers to be productive authoring all these complex transformations. But there is one interface that's missing, and that's English. We want you to be able to author your data integration jobs and pipelines in simple English. Imagine you can say: move data from S3 to Redshift, ensure data quality, make sure there are no data quality misses, and if there are, alert me.

So I'm extremely delighted to announce a generative AI capability in Glue: with Amazon Q data integration, you can author your entire data pipeline in natural language, in simple English. Earlier at re:Invent, we announced Amazon Q, our generative AI based assistant that is tailored to your own business data and operational data with a high security and privacy bar. You get the exact same characteristics with the Glue data integration assistant. It also lets you ask troubleshooting questions; the answers are grounded in Glue's corpus of data, so they are very relevant and specific to data integration. It also acts as an instant SME: we've trained it on our support cases and created a subject-matter-expert agent capability with Amazon Q data integration. I really encourage you to give it a try and give feedback. It's pre-announced right now and will be available very soon.

With this, I'm gonna go to the third bucket, which is how to manage and ensure high quality in your datasets. You must have heard that data lakes are becoming data swamps. There is some reality to it, because your data sources give you incomplete, inaccurate datasets that, if you do not curate them or maintain a high bar, eventually proliferate into your data lakes. Now compound that by the hundreds of data sources you support, and the problem has a multiplied effect. And if you do not solve it up front, the problems continue to get worse and worse.

The second challenge is that, with all the privacy regulation, you want to be sure whether sensitive or PII data is leaking into your datasets. Once it's in your datasets and you have not held a high bar, finding it is like finding a needle in a haystack, and it's quite expensive to find after the fact. Another challenge is that the tools you get are very use-case specific, so you end up managing multiple tools, dealing with infrastructure management and expense management, and facing scalability and performance challenges. So you really don't have the right tool chain to manage data quality and redact sensitive data at scale.

This is how AWS simplifies the problem. We allow you to manage and ensure high data quality by embedding data quality checks in your pipelines, for data at rest as well as in transit. We allow you to identify and protect sensitive information, and we monitor and take action whenever the detectors find sensitive data or data quality deteriorates. It all runs on our scalable, cost-effective serverless infrastructure, so you don't need to do infrastructure management or trade cost for performance; you get the right price performance.

Now let's talk about sensitive data detection. Glue offers built-in sensitive data detection, with 200+ entities across 50 countries. Once these entities are detected, you can redact them, replace them, generate alarms, or mask them. Today, I would like to announce fine-grained actions for sensitive data detection. What that means is that for a particular entity you detect, for example a Social Security number, your action can be to keep just the last four digits; for an email address, it can be to keep just the first and last characters of the login. So you can take very specialized, tailor-made actions with these capabilities.
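As a generic illustration of what a fine-grained action amounts to, here is a plain PySpark sketch that keeps only the last four digits of an SSN column. This is not the Glue sensitive data detection API itself, just the masking idea, and the data is made up.

```python
# Generic masking illustration (plain PySpark, not the Glue detection feature):
# keep only the last four digits of a Social Security number column.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("alice", "123-45-6789"), ("bob", "987-65-4321")], ["user", "ssn"]
)

masked = df.withColumn(
    "ssn",
    F.concat(F.lit("***-**-"), F.substring(F.col("ssn"), -4, 4)),
)
masked.show()  # ***-**-6789, ***-**-4321
```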

Next, I'm going to talk about data quality. We offer data quality built in, with built-in rule recommendations: you can start the recommendation system and it gives you the right data quality insights. We also have a data quality definition language (DQDL) where you can specify your data quality rules in a declarative way. We offer data quality at rest and in transit, and we support multiple personas.
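As a hedged sketch, a declarative ruleset and an evaluation run against a Data Catalog table might look like the following; the database, table, role, and rules are placeholders.

```python
# Hedged sketch: define a DQDL ruleset for a catalog table with boto3 and start
# an evaluation run. All names, ARNs, and rules are illustrative placeholders.
import boto3

glue = boto3.client("glue")

ruleset = """
Rules = [
    RowCount > 1000,
    IsComplete "order_id",
    ColumnValues "status" in ["OPEN", "SHIPPED", "CLOSED"]
]
"""

glue.create_data_quality_ruleset(
    Name="orders-quality-checks",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)

glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # assumed role
    RulesetNames=["orders-quality-checks"],
)
```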

Now, as you can see, Milcom is one of the customers who started using AWS Glue Data Quality. They saw the robustness of the data quality feature and applied it across nine business departments, and their data downtime was reduced by 50%.

However, rule-based data quality detection has challenges. Sometimes your business reality changes and your thresholds need to be updated constantly. You have challenging patterns in your data that are not detectable by the human eye, so you need to develop complex algorithms to solve the problem, and your time to insight gets inflated.

For this, today we are announcing a new ML capability in AWS Glue Data Quality. With this, you can detect anomalies in your datasets, generate insights about patterns, and create dynamic rules. We're gonna see this in action in this video by Alison Quinn, our senior analytics solutions architect. I'm gonna start the video here.

I have an existing Glue visual ETL job, and I want to measure and monitor the quality of the data that it's processing. So I simply add the Data Quality transform right after my data source. I don't have time to add data quality rules and want to get started really quickly, so I configure the new row count analyzer. This analyzer collects row counts over time, and the ML capability learns from past patterns and then detects anomalies in the data. There are other statistics you can configure analyzers for too, such as distinct value counts, column lengths, and unique values for all columns; for this demo, we'll just look at row count. I've run this job a few times now, and for each new dataset Glue has learned the patterns of my loads. Now the disaster day comes: our data file has low volume with the latest load. I don't have to spend hours debugging the issue. I simply click on the Data Quality tab, then the Observations tab, and notice that Glue Data Quality has already generated an observation about the abnormal drop in row count. You can also see this visually from the graphs. Glue Data Quality can not only detect abnormal drops but can also learn seasonal patterns and detect deviations from them. It has also recommended a rule to be added to my ETL pipeline so that I can stop jobs when such anomalies occur. I'm going to apply the recommended rule to my data pipeline by clicking Apply Rules. Now I have the rule configured in the Rules section, and it will continuously monitor the row counts for my data loads. I can even make these rules dynamic by ensuring, for instance, that the row counts are greater than the average of the last 10 runs. While the analyzer monitors my data for anomalies, the rules will ensure that any anomalies are detected and will continue to give me insights about my row counts. With Glue Data Quality, you can now combine the power of machine learning and data quality rules to deliver high-quality data for your business users, enabling them to make confident business decisions. Thank you.

I'm gonna ask Sean to come up here and present the next part, which is scheduling, orchestrating, and monitoring pipelines at scale.

Thank you, Santosh. And folks, thanks for joining us today. Santosh has talked about a lot of amazing new things to help you address data integration challenges. Now it is time to talk about our fourth and last pillar, which is really about scheduling, orchestrating, and monitoring your data pipelines.

There are some fundamental issues with reliably running data pipelines at scale. In some cases, it's limited visibility into what went wrong, why it went wrong, or why performance is not what you're expecting. Pipelines can be hard to build when you need complex dependencies between jobs and want to leverage common parameters across jobs. Often they lack flexibility: you want flexibility around timetables, calendars, and time zones, and around holidays, which differ by country and in some cases differ within a specific geography inside a country; pipelines often lack the flexibility to account for that and support your business. And finally, there's reusability, which is really difficult at scale: how can you build data pipelines in the most efficient and reliable way, where you can reuse common components and are not reinventing the wheel every time? Apache Airflow has emerged as a fantastic tool to schedule, orchestrate, and monitor data pipelines.

Amazon Managed Workflows for Apache Airflow, or MWAA as we call it, is a managed service for Apache Airflow that makes it easy to deploy, monitor, connect, and secure Apache Airflow at scale without the operational burden of managing the underlying infrastructure. With MWAA, you define your data pipelines using the hundreds of Apache Airflow operators and sensors available from the open source community, which include integrations with 16 AWS services and hundreds of third-party tools like Apache Spark and Hadoop. Apache Airflow has a really dynamic and vibrant open source community, and as of last year it is Apache's biggest community by number of contributors, so it's constantly evolving with new features and capabilities.

MWAA is built on this foundation of open source. It's also built on a foundation of security, leveraging integration with IAM, VPCs, encryption with customer-managed keys or AWS-managed keys, and CloudTrail, again integrated with the 300+ integrations available from the open source community. And it's easy to deploy: you can deploy it from the console, the CDK, the CLI, CloudFormation, or Terraform, in multiple different ways that are easy to deploy and use. So Apache Airflow, and leveraging MWAA, can really help you operationalize your data pipelines at scale.

Airflow has a really rich user interface that allows you to visualize and monitor your workflows and workflow executions. The Python-based DAGs give you a lot of flexibility in how to support dependencies and relationships, either within a single workflow or across workflows in your data pipelines. You can create your own timetables, schedules, and calendars based on the needs of your business, and you can do that at a workflow level or across all of your workflows and tailor it to exactly what you need. And finally, you've got over 300 existing integrations within Airflow today; the ability to take those integrations and customize them, to easily create your own custom integrations, and to use parameters and create dynamic workflows really maximizes the reusability of your data pipelines across your organization.
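As a minimal sketch of what such a DAG can look like on MWAA, here is a small example that chains two Glue jobs on a cron schedule; the DAG id, job names, and schedule are placeholders.

```python
# Hedged sketch: a minimal Airflow DAG (as you might deploy on MWAA) that runs
# two Glue jobs in sequence. Job names and the schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="nightly_sales_integration",
    start_date=datetime(2023, 11, 1),
    schedule="0 2 * * *",       # 2 AM daily; swap in a custom timetable as needed
    catchup=False,
    tags=["data-integration"],
) as dag:
    load_orders = GlueJobOperator(
        task_id="load_orders",
        job_name="orders-etl",          # assumed Glue job name
        wait_for_completion=True,
    )

    publish_metrics = GlueJobOperator(
        task_id="publish_metrics",
        job_name="orders-metrics",      # assumed Glue job name
        wait_for_completion=True,
    )

    load_orders >> publish_metrics      # simple dependency between jobs
```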

In the last few weeks, the MWAA team has been busy bringing new capabilities, and we're really proud to have launched support for Apache Airflow 2.7.2, which includes deferrable operator support on MWAA. Upgrading to new versions of software is time consuming and hard; with MWAA we make this easy with in-place version upgrades, and we launch new versions of Airflow on a regular basis to ensure you have the latest features and capabilities.

Earlier this month, we launched Apache Airflow 2.7.2, which included over 40 new features. These include support for deferrable operators, a new way to monitor for the completion of long-running work in a much more efficient and cost-effective way, and the introduction of setup and teardown tasks: if you're using ephemeral resources, this is a way to make sure an ephemeral resource is there when you need it and goes away as soon as you don't, so you stop paying for it. There is also secrets caching to improve performance and lower costs when you're using a backend secrets manager like AWS Secrets Manager, user interface improvements with more filtering and better search, a UI integrated with environment health, and built-in support for OpenLineage.
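Here is a hedged sketch combining those two ideas, assuming a provider version where GlueJobOperator accepts deferrable=True; the task logic, names, and setup/teardown wiring are placeholders.

```python
# Hedged sketch: a deferrable Glue task wrapped by setup and teardown tasks.
# Assumes an Amazon provider version that supports deferrable=True; names and
# callables are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="ephemeral_resource_pipeline",
    start_date=datetime(2023, 11, 1),
    schedule=None,
    catchup=False,
) as dag:
    create_scratch = PythonOperator(
        task_id="create_scratch",
        python_callable=lambda: print("provision temporary staging resources"),
    )
    drop_scratch = PythonOperator(
        task_id="drop_scratch",
        python_callable=lambda: print("remove temporary staging resources"),
    )
    transform = GlueJobOperator(
        task_id="transform",
        job_name="orders-etl",   # assumed Glue job name
        deferrable=True,         # frees the worker slot while the Glue job runs
    )

    # Teardown runs even if the transform fails, so the scratch resources go away.
    create_scratch >> transform >> drop_scratch.as_teardown(setups=create_scratch)
```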

OpenLineage is an open source lineage tool that allows you to track the flow of data over time, providing a clear understanding of where data started, what happened to it during its journey, and where it ended up.

We also launched support for shared VPCs. Previously, MWAA was only available in a single-tenant service VPC, which added cost and required permissions to create both the VPC and the VPC endpoints when you created the environment. With shared VPCs, you can now create your environments in a shared, centrally managed VPC, which allows teams with different AWS accounts to create resources within that centrally managed VPC. This reduces the number of VPCs you need, reduces your cost, gives you greater support and integration with AWS Organizations, and gives you the ability to centralize the creation of the VPC endpoints used within your MWAA environment.

On the Glue side, we've expanded our support for Git integration. Before Git integration, you needed to set up your own integrations with your code versioning systems, and you had to build tooling in order to move your jobs from development to production. We've now expanded support for Git integration with the addition of GitLab and Bitbucket, alongside the existing support for GitHub and CodeCommit.

Git integration enables you to track changes and manage versions for both code-based and visual ETL jobs, as well as automate deployment, making it easy to integrate with your existing DevOps practices. You can also simplify creating and updating jobs while using parameters for sources and targets when you're using this with Glue.

For any data-driven company, having reliable data pipelines is crucial. When running properly, they give you the right information at the right time with the right data quality. However, varying data volumes and characteristics and changing application behavior can cause data pipelines to become inefficient: they'll slow down, they'll become unreliable, and failures will happen. And you've told us that you were looking for more robust and enhanced troubleshooting capabilities.

So we've introduced new capabilities around enhanced metrics for AWS Glue, focused on reliability, performance, throughput, and resource utilization. We've introduced an entirely new class of CloudWatch metrics for pipelines built on Glue for Apache Spark jobs. These new metrics provide both aggregate and fine-grained insight into the health of your jobs, and they also give you the ability to look at classifications of the errors that are actually happening within your jobs. This really gives you the ability to do detailed root cause analysis, look for performance bottlenecks, and diagnose errors. With this analysis, you can find new ways to improve your jobs, bring in best practices, and benefit from higher availability, better performance, and lower cost for your Glue for Apache Spark jobs.

While Glue has great logging and telemetry about job execution, interpreting it and really understanding what's happening at a job level can take time.

And it can be hard to understand where resiliency and optimization opportunities lie at the job level. So we're really excited to have launched the Glue serverless Spark UI, a new capability in Glue that gives you the ability to debug and optimize your Glue Spark jobs by leveraging a serverless Spark user interface.

It'll give you a lot of detailed understanding as to what's happening within the scheduler, stages, tasks, and executors. It makes it even easier to debug jobs and look for areas for optimization in how you're using Glue. And the most amazing thing about this is that there's no setup and teardown time required; it starts in just seconds and just works.

So rather than talk about it anymore, I'm going to show you. Here's the Glue user interface. We select a job run, and you'll notice we're opening the Spark UI and expanding it. Here is the Spark UI right inside of Glue: the ability to go in and look at your jobs, the stages, the storage being used, the environment, and the executors. All of this is available in seconds, and it comes at absolutely no additional cost to you.

So thanks again for your time today. I'm now going to turn it over to Nishit Desai, Managing Director at Goldman Sachs, to talk about their experience with data integration.

Thank you, Sean. Data-driven organizations: these words are simultaneously the most cliched and the most relevant words of our time, and for good reason; they mean different things to different people. Today, I'm going to talk about a case study, a project that, in a nutshell, will tell you what it means for my team at Goldman Sachs.

The story features Glue, no surprises there, but instead of just focusing on the how, the implementation details, I would also like to talk to you about the why, take you behind the scenes, and complete the arc, so to speak. So let's zoom in.

As most of you are aware, Goldman Sachs is a leading global financial institution that delivers a broad range of services to corporations, financial institutions, governments, and individuals. This organization has been around since 1869, and that's a great testament to the trust and confidence that it has from its clients.

I work for a part of Goldman Sachs called investment banking, which on Wall Street has a reputation of being a storied franchise. When I say storied franchise, most of you would think of the Dallas Cowboys, the New York Yankees, your favorite EPL or IPL team: teams with a winning culture, and this team is no less. Investment banking has globally published rankings called league tables that stack up the various banks in terms of their participation in deals.

If you take key products like M&A, that's mergers and acquisitions, or equity and equity-linked products, a subset of which you would know as IPOs, Goldman Sachs is number one, number one globally, and not just this year. If you take a product like M&A, Goldman Sachs has been number one for 17 of the last 20 years.

You don't build this kind of a track record without having the trust and confidence of thousands of clients across the globe. You do not build that trust and confidence without our bankers who are the best of their kind, having thousands and thousands of touch points with those clients. My team and I build products and platforms that enable and accelerate these engagements.

When I joined Goldman Sachs many years ago, it was still very much a data-driven organization, but that meant something else at the time. At the time, our team was very focused on curating and acquiring vast amounts of data and presenting it to bankers as a competitive advantage. Simultaneously, we were focused on democratizing this data and giving access to more and more bankers. And both of these things continued cyclically, up to a point.

We started getting some interesting feedback. People started asking questions like: is there such a thing as too much data? It forced us to take a step back and take a fresh look at things. It's like you're working on a puzzle and you're 80% there, and there's no way you are getting to 100% without taking out a few pieces and refitting them differently. And that's exactly what we did.

We sat down with the heads of our business lines and figured out which KPIs and metrics they care about for measuring these engagements with clients, and we standardized those. Now these KPIs are literally standardized for every banker; the same KPIs are used as you roll up to their managers, for them to view their teams, all the way to the heads of the business lines.

The next thing we did was look at our data assets and distill signal from noise. What I mean by that is filtering out extraneous data and modulating the right datasets: not just figuring out the right insights to present to the bankers, but also figuring out the timing and the placement of those insights.

There's a great book on this topic called Nudge by Richard Thaler. If you've not read it, I highly recommend it. It talks about concepts like incentive schemes and choice architecture, all concepts that we used in what we implemented here.

The third thing we did was hyper-personalize the insights. If you work in the enterprise, you will have noticed that enterprise software often lags consumer software in terms of experience. If you're a banker and you're using an app like TikTok, you use it for a week and the recommendations get so good that you feel like this thing knows you. Now, on the same phone, if you flip over to an enterprise app, you may not get the same experience, and that was not okay.

So we invested a lot in building a personalization layer and a recommendation engine that really improved the experience. The last thing we wanted to do was deliver this on a single platform: instead of fragmenting all this good stuff across different platforms, we wanted to bring it into a single place. And obviously, we wanted a simplified experience.

Now, when we talk about a simplified experience, we hear things like "great design is 99% invisible." But even though great design may be 99% invisible to end users, that doesn't mean that behind the scenes the data ecosystem or the service layer is anything but complex.

In our case, the data ecosystem is super varied and super complex. As you can see, we were sourcing market data and news, transaction data, reference data, and unstructured data. This by itself sounded very forbidding. Thankfully for us, right around the same time, our firm came out with this awesome product called Legend. If you've not tried it, I highly recommend it. It gave us a way to model and connect all these datasets and also implement the right data contracts.

So as you can see, we source all these different datasets, we stage and connect them, and we contextualize them for the bankers before they become available to the distribution channels. If you look at it, this is typically the kind of life cycle that a data point goes through before it results in an aha moment for the bankers.

Now, as we were looking at this design, we knew the biggest challenge we had was our ETL strategy; that is the single unit of work that could actually trip us up. The reasons why ETL is challenging vary by organization, and I won't go through all of them, but I'll call out the second one over there, which is complex processing and retry logic.

Banking datasets, if you are not aware, are very broad datasets. They are not necessarily deep, but they are very broad, and there are components of the dataset that are built separately and differently and published separately. This could have been a big challenge for us, and it was an even bigger challenge because we were stuck on a legacy ETL stack that we could not trust.

So at this point, if you're wondering how we addressed this problem, I'm gonna play a quick video. It features Michel Bandare, our lead architect, who is gonna describe how we used Glue to solve this problem.

Michel: Thanks, Nish, for providing the background and covering the key challenges. Let me start by providing a breakdown of our ETL modernization strategy using Glue. We followed a four-step approach: we started by identifying the different data sources, key transformations, and sinks. Then we prioritized our different workloads for the migration based on criticality and complexity. Next, we created helper utilities to streamline standard transformations and established Glue connectors for our different data sources and sinks using firm-recommended authentication methods. Then we ran our Glue pipeline in parallel with our on-prem setup to identify any discrepancies. Once we addressed any issues or discrepancies we encountered during testing, we would productionize the pipeline.

I will now cover our architecture and approach. We start by identifying the different data sources and sinks needed for our use case. We leverage Glue connectors to get data from source to sink to generate user insights. We use EventBridge rules to trigger Step Functions, which orchestrate the execution of Glue jobs to get data to the target data stores. We leverage DynamoDB to capture job execution details, such as the number of records processed, records that failed data validation, records in error, et cetera, to help us with reconciliation, and CloudWatch metrics to help us monitor the status of all our ETL pipelines.

We publish errors to Kinesis, which notifies our support function, using Lambda event sourcing to remediate data quality issues and execution errors. Next, I would like to cover the tenets we built our architecture on: business insights in seconds, identifying high-value data sources and then blending them creatively to uncover insights; modular design, with the main goals being flexibility, robustness, and maintainability; automate everything; scalability, parallelizing data processing using Step Functions and using micro-batching to process chunks frequently instead of huge batches; monitoring; and, last, data quality assurance, with policies for data validation, cleansing, and enrichment.

Last, I would like to cover the architecture for our near-real-time pipelines that support time-sensitive business functions. The key components to outline here are Amazon MSK, to stream data from on-prem and cloud data sources; Glue ETL, to make the data available to cloud consumers via cloud-native data stores and warehouses, powered by Glue connectors; DynamoDB, to capture job summaries; and Legend, which enables data producers and consumers to easily, quickly, and safely produce and consume data across the firm.

Using these two architecture patterns, we are able to build cloud-native ETL pipelines that are, one, robust and resilient; two, able to scale on demand; three, optimized for performance; and four, cost efficient, using the serverless pay-as-you-go model with AWS Glue. I would also like to call out that we are looking to leverage two newly launched features to enhance our ETL pipelines.

One, the Spark UI for improved monitoring: it provides a graphical UI to monitor Spark jobs and enables better troubleshooting. Two, job observability metrics: they provide reliability, performance, throughput, and resource utilization metrics to generate insights into what is happening inside AWS Glue and improve triaging and analysis of issues. In summary, these new capabilities will allow us to better monitor Spark workloads, quickly diagnose issues, and gain visibility into resource utilization. This will help us continuously tune and optimize our ETL pipelines.

I would like to hand it back to Nish to cover how the solution helped us achieve our business goals.

Nish: Thank you. And now for the money slide, the key wins: a 35% reduction in time to complete workflows; 10x data growth seamlessly supported; delivered in 3 months (by the way, I expected it to be a six-month project, and we were able to wrap it up fairly quickly thanks to Glue); the number of days it takes to onboard a new dataset drastically down; and, of course, 99.96% availability.

I'd also like to thank the AWS solutions architecture team. They were super helpful to us throughout the process, a constant companion. And at this point, I'll turn it back to Santosh to wrap it up for us.

Santosh: Thanks, Nishit, for sharing your journey with AWS Glue and connecting the dots for the audience. I'm going to talk about how we are accelerating our innovation. I said earlier that the demands on data integration are ever increasing; with that in mind, we are accelerating our innovation as well. In the last 11 months, we launched 70+ new features. As you can see here, our feature velocity is accelerating: the number of features we are launching every six months is greater than what we launched in all of 2022.

With this, I'm gonna share some additional sessions. We are on Thursday, so some of these sessions have already occurred; however, you can go back to the videos and take a look.

Finally, I'm gonna share some additional resources for you to double-click on. We have the AWS data integration web page that you can get more insights from. We also have the AWS Databases and Analytics LinkedIn account, where you can get additional news about our product innovations.

Finally, I would like to give you a big thank you. Thanks for listening to our presentation. I would like to thank my co-presenters, Sean and Nishit, for coming here and presenting with me. Thank you.
