Reimagine data integration with generative AI and machine learning

Matt Su, Senior Product Manager for AWS Glue:

Hey everyone. Welcome to ANT216, Reimagine Data Integration with Generative AI & Machine Learning. I am Matt Su, Senior Product Manager for AWS Glue. Presenting with me today are Gaurav Sharma and Shiv, also members of the Glue team.

We're very excited to share some of the recent innovations around generative AI and machine learning with you. Welcome everybody.

We have four sections for our presentation today:

First, we'll talk about challenges and trends in data integration and give a quick introduction of our service, the data integration service here at AWS - AWS Glue.

Second, we'll talk about how AWS thinks about data integration and give you a quick overview.

Third, we'll bring Shiv and Gaurav back on to go over innovations on data integration and connectivity, data management and authoring.

And then finally, we'll summarize with some key takeaways.

And with that, let's get started.

We'll start with an introduction to AWS Glue, our serverless data integration service.

Today, hundreds of thousands of customers use AWS Glue to integrate data every month. Our customers come in all sizes from all industries, each solving their own use cases.

For example, Itaú bank in Brazil chose AWS Glue for its data mesh architecture and uses Glue to power its payment systems.

Merck uses AWS Glue and Amazon Kinesis to ingest, transform and contextualize near real-time data and to run analytics on that data.

And BMW built self-service data integration based on Glue into its data platform that is used by more than 5000 internal users every month.

And many of these customers have presented at this re:Invent. Earlier this week, BMW shared their story on the data governance customer panel, BMO shared their ETL migration story in a talk session, and JPMorgan Chase talked about Glue Data Quality.

Here's some quick facts about AWS Glue:

  • AWS Glue runs more than 1 billion jobs every year. That translates into hundreds of millions of data integration jobs every month.

  • Glue has hundreds of built-in transformations for data integration job authoring.

  • Customers use Glue to ingest data from hundreds of data sources, both AWS and other external sources and SaaS sources as well.

  • Tens of thousands of developers are authoring Glue jobs on AWS Glue Studio every month.

So why do all these customers need data integration and use AWS Glue? What's the motivation?

We all know that businesses want to make their data work for them. And data integration is that first step of connecting, cleaning, transforming, combining and cataloging data before it can be used for analytics and machine learning. Without getting this right, you can't have comprehensive or reliable insights.

Now, 10 years ago, data integration was primarily manual and cumbersome, and it often required specialized IT personnel. Data was siloed, and the integration tools available back then were less flexible. Nightly batch processing was common, making real-time data integration difficult.

However, today businesses need data, and they need it fast. Data integration is more challenging and more important to businesses than ever. Why?

Three points:

  1. Data integration delivers up-to-the-minute data for making business decisions and running operations. These integrations often run every few minutes, and some even run in real time.

  2. Data integration enables business critical operational systems and applications.

  3. Businesses also want data workers with a wide variety of skills to be able to integrate their own data. For this to happen, data integration has to be user friendly and support a variety of interfaces, including coding, visual, and drag and drop.

Some more challenges around data integration include:

  • Business teams often feel that they're not moving fast enough and that they're reliant on a central data platform team for access to the data. They feel that they're missing their SLAs but they don't have control over their data engineering teams.

  • Data engineers also struggle with legacy tools. They spend their time on infrastructure management instead of working with their data. They have to learn multiple tools for different tasks which involves new learnings and context switching. Legacy tools work well with smaller data sets but once data volume grows, which is the case for every business, these tools don't scale as well.

  • And then finally, there's leadership. Leadership often feels that legacy tools are expensive and grow more expensive every year. These tools don't support multiple personas, as they're focused on a particular persona - coding, visual editor, or no-code - and we realize that enterprise users often need all three to serve their internal customers. Legacy tools also lock business transformations into proprietary technology, inhibiting customers from migrating to innovations such as open source frameworks, serverless infrastructure, and the recent advances in generative AI.

I'm sure many of you can relate to a few if not all of these challenges.

So let's get a quick overview of how AWS simplifies data integration with AWS Glue.

For Glue, we organize data integration across four key pillars starting from the top left and going clockwise:

  1. Connect - allowing customers to connect to any data reliably and securely as moving data is required to help organizations eliminate their data silos.

  2. Transform - allowing you to transform data with your preferred tool and allowing multiple business users to participate in data integration and analysis regardless of their skill set.

  3. Operationalize - build reliable and scalable data pipelines and manage them at scale without downtime.

  4. Manage - manage data for high quality and accuracy at scale.

Now, recent innovations in generative AI and machine learning give us more tools than ever to make data integration faster and easier.

Let's discuss how these innovations and Glue can help simplify and accelerate data integration, unlocking value from your data and helping you reimagine new possibilities.

And with that, I'll hand over to Gaurav to talk about innovations in connectivity.

Gaurav Sharma, Senior Big Data Architect with AWS:

Thank you, Matt. Thanks for that intro. Hi everybody. I'm Gaurav Sharma, Senior Big Data Architect with AWS. Really excited to be here with you today. I hope you share the enthusiasm - it's Friday, so thank you for sticking around until the end. Learn and Be Curious is one of our leadership principles, so thanks for joining the session today.

Today, I want to take you on a journey into the future of data integration with AWS Glue. And that starts with the challenges we face around data integration and connectivity.

The journey of data integration is fraught with complexity. Businesses often have to spend weeks, if not months, onboarding new data sources, and that creates a significant bottleneck when you want to extract insights from your data in a timely manner.

Now, think about how many systems a traditional enterprise has, and all of those systems have their own data models and data properties. Unifying that data for analytics or machine learning is not an easy task.

Data integration isn't just about moving your data from one place to another. It requires deep technical configuration and networking knowledge, together with knowledge of third-party systems.

The importance of overcoming those challenges cannot be overstated, especially when we talk about analytics and machine learning, because efficient data integration is actually the backbone that advances those technologies.

So moving on, I want to talk a little bit - Matt presented the four pillars of data integration. I want to double click on connectivity and how AWS Glue can help you connect to virtually any data source.

Now, whether you have data warehouses, databases, streaming data sources, transactional data lakes or even SaaS applications, customers use AWS Glue today to connect to hundreds of different data sources through our connectors.

This broad connectivity that we present with AWS Glue is pivotal for our customers.

For those of you who know AWS Glue, you know that we actually provide this connectivity in the form of fully managed native connectors with AWS Glue directly available in the product, as well as custom connectors through our AWS Marketplace or third party sources.

But what we hear from our customers is that external connectors can be hard - they require additional technical knowledge and sometimes introduce delays into their projects.

So customers asked us to simplify this process and make available connectors in the product so they can leverage them faster and more efficiently.

And one of the best things at Amazon is that we listen to our customers. So we listened, and we worked backwards from those requirements to enable you to move faster.

Which brings me to the exciting part - this year, AWS Glue has worked incredibly hard to simplify how you interact with your data assets. And it starts with a recap of the announcement that, as of today, AWS Glue provides out-of-the-box support for 10 additional database connectors.

We started early in the year with the release of our first third-party connector, Snowflake. Then we followed up with Google BigQuery. And most recently, we announced a large wave of connectivity options including Teradata, Azure SQL, Vertica, and many more, including Amazon OpenSearch.

By the way, what you see here on screen is our new Connector Gallery. This is part of a broader effort in AWS Glue to simplify how customers interact with the AWS console.

And the reason for that is simple - for those of you who know AWS Glue, you know that we provide different ways for different personas to work with the service.

For example, if you are a developer or a data engineer, you will most likely leverage our scripting or notebook experience. But for those of you who don't want to dive deep, you can also use our visual ETL builder.

So AWS Glue Studio is an easy to use graphical interface that provides you with the speed and agility for you to create, run, monitor your jobs without writing any code.

This is a significant advancement that lowers the technical barrier and allows any user, without Spark expertise or coding knowledge, to build, run, and monitor ETL jobs.

Now, to bring it all to life, I want to show you a quick demonstration where I've used AWS Glue Studio to create a visual ETL job that extracts data from several different sources - some of them using the new connections we introduced in Glue Studio - and writes it to an external data target.

For this example, we had a customer who wanted to create a new marketing segment for this year's Black Friday event. As it happened, they operated multiple websites and blogs, and they stored their customer journeys and clickstreams in MongoDB Atlas. They also had transactional data on Amazon S3, including customer marketing consent data and customer data stored in Apache Iceberg format.

So by leveraging AWS Glue Studio, a marketing executive was able to create a visual ETL pipeline that takes data from MongoDB Atlas and Amazon S3, joins it, filters it, and prepares it for a marketing campaign.

As it happens, they used Teradata as the target, since the customer was already using Teradata to run marketing processes across multiple channels. So let me show you the demo. This one is without audio, so I will walk you through it.

So as you see here, we have the visual UI, and the first step was to create the connections. What I want you to see is how easy that is with the new interface: for each connector, you only have to provide a few configuration options, including the endpoint you want to connect to. In this case, we created the Teradata connection first, where we're going to write the data, and now we create the MongoDB Atlas connection, where the data will be extracted from.

Now, with AWS Glue we provide options to run jobs within your own VPC. There was a session about VPCs earlier that some of you may have attended, so I'm not going to go into too many details, but these are the different connectivity options you have for configuring how your job and your script communicate with your data source.

Moving on: once you've created the first constructs, which are the connections, we now need to create the job. Here you can see the visual graphical interface, which we call AWS Glue Studio, and you can see that it is a drag-and-drop, point-and-click interface that allows a no-code persona to build the job. What happens under the hood is that AWS Glue Studio generates the code for you. That means that if, for whatever reason, you encounter a problem, an SME or someone with knowledge of PySpark and Spark will be able to look into the job's code and find out what the problem is.

In this case, we have to join the data, as I've already explained, from the Amazon S3 data lake - as you can see, it's in Apache Iceberg format. An important thing to notice with Glue Studio is that at each step of your journey, you have the option to preview your data. That allows you to see what data is available in your data source and how that data evolves as you start applying transforms.

Each of those data sources, of course, has its own specifics. With MongoDB, you have to provide the collection where your data is stored. And if you are a no-code persona, that information will come from your administrators.

Once you have access to this data, you can leverage the hundreds of transformations already built into the product. In this case, I'm going to join the data on a specific ID. And for the marketing campaign, there were two filters that mattered to the customer. They had to keep only customers who have consented to receive marketing messages - especially important in Europe with GDPR. And because this campaign was targeted at people who like biking, they also had to filter for customers who are interested in biking, hiking, and so on.

So you can see that with zero lines of code, I was able to filter the data, and in a minute you will see the final step, where we write the data to the target.

Now, when we write data, we typically don't show a preview, because we already have that on the preview screen. That's it - once everything is done, just name your job as you wish and you can run it. Before I run the job, I want to show you the full picture. It's the same screenshot we already had in the presentation, but there you go: all green boxes, so everything is configured.
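
To give you an idea of what AWS Glue Studio generates under the hood for a pipeline like this, here is a rough PySpark sketch. It is only a sketch: the connection names, database, collection, table, and column names are assumptions, not the actual values from the demo, and the exact connector option keys may differ.

    import sys
    from awsglue.transforms import Join, Filter
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Clickstream data from MongoDB Atlas (connection/database/collection names are hypothetical)
    clicks = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options={
            "connectionName": "mongodb-atlas-connection",
            "database": "marketing",
            "collection": "clickstream",
        },
    )

    # Customer and consent data stored as Apache Iceberg on S3, registered in the Data Catalog
    customers = glueContext.create_dynamic_frame.from_catalog(
        database="marketing_db", table_name="customers_iceberg"
    )

    # Join on a shared customer id, then keep only consented customers interested in biking/hiking
    joined = Join.apply(clicks, customers, "customer_id", "customer_id")
    segment = Filter.apply(
        frame=joined,
        f=lambda row: row["marketing_consent"] and row["interest"] in ("biking", "hiking"),
    )

    # Write the resulting segment to Teradata through the connection created earlier
    glueContext.write_dynamic_frame.from_options(
        frame=segment,
        connection_type="teradata",
        connection_options={
            "connectionName": "teradata-connection",
            "dbtable": "black_friday_segment",
        },
    )

    job.commit()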

Now, you can see it was a matter of minutes - this video is 4.5 minutes long, and it was deliberately slowed down in some places. But I hope you can see how efficient AWS Glue Studio was for creating and executing an ETL job.

A couple of minutes later, this job will complete, write the data, and make it available in your target data store. Okay?

Well, thank you for your attention. Now that you've seen how efficient AWS Glue Studio is, I want to bring your attention to another vital aspect of data integration.

When we talk about data quality, the quality of your data is key to deriving insights for analytics, but it's also foundational when you do machine learning and artificial intelligence. So with that, I'm going to hand it over to Shiv, who will tell us about an exciting new feature in AWS Glue. Over to you, Shiv.

Shiv, Product Manager for AWS Glue data management:

Thank you. Are you able to hear me? All right, hope you're having fun. How many of you need high quality data? If you're not raising your hand, that's a problem. You're all here on Friday - you're supposed to be halfway home, you're supposed to be with your loved ones, but you're here, so you must all have some really messed-up data. All you need to do is use Glue Data Quality, and that's what I'm going to be talking about.

That's the key message. The whole next part of the presentation is about what we've been doing over the last year and the new capabilities we've launched. My name is Shiv, and I'm one of the product managers for Glue data management capabilities. I'm super excited and thrilled to be with you, even on a Friday.

Last year at this time, a bunch of colleagues and I had the opportunity to announce the preview of Glue Data Quality. One year has passed, we're now in general availability, and we've seen a lot of customers quickly adopting Glue Data Quality. There are a few key reasons why customers really like it. Before Glue Data Quality, customers usually took one of two paths. One was to buy an old-guard commercial data quality solution, which is very expensive, and then find out that it doesn't scale and needs a lot of installation and upgrades. Cost-conscious customers would go the other way: download an open source package and try to install it on EC2 - and at the end of the day you have to figure out scaling, security, and support, the costs all add up, and then you're left wondering what's going on.

Since launching Glue Data Quality, we've seen customers really start to love the solution because of a couple of key things. The first one is the fact that it's serverless. You can simply log into your account and turn on data quality - and from what I've seen, that alone is a relief for customers. Just log in, done, and you have data quality right on your console, in the Data Catalog, and in your ETL pipeline.

The second thing is the scalability. How can we say it's scalable? A couple of reasons. One is that we are built on Deequ - Amazon open sourced Deequ, an open source library that is really powerful because it takes a minimal number of passes over the data to gather data statistics, which is the most time-consuming part. On top of that, we've taken that framework and put it on Glue, which gives you capabilities such as auto scaling, the ability to run on Flex, and incremental processing. It really makes it very easy for customers to consume.

The third thing is the ease of getting started. You begin with a set of recommendations, look at them, tweak your rules, and add more from our out-of-the-box rules - and we added a ton of out-of-the-box rules over the last year. Some of my favorites: referential integrity checks, so that in a data lake, where you can have two related tables just like in a database, you can now check referential integrity between them; data set match capabilities, which means that when you're ingesting data from point A to point B, you can make sure the two stay in sync; and custom SQL rules, a favorite especially for business users, because they can imagine anything and code it as a rule. We kept adding to this. One other key thing customers asked us for was to tell them which records passed or failed, and we added that capability last year. So this is really becoming a very robust data quality solution.
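
To give a flavor of what these rules look like, here is a small, hypothetical ruleset written in the Data Quality Definition Language (DQDL) and embedded in a Python string the way a Glue job would carry it. The rule types are real, but the column names, the reference alias, and the thresholds are assumptions - check the DQDL reference for the exact syntax, especially for the referential integrity and data set match rules.

    # Hypothetical DQDL ruleset illustrating a few of the built-in rule types mentioned above.
    # Column names, the "reference" alias, and thresholds are assumptions for illustration.
    ruleset = """
    Rules = [
        IsComplete "order_id",
        ColumnValues "order_total" > 0,
        ReferentialIntegrity "customer_id" "reference.customer_id" = 1.0,
        DatasetMatch "reference" "order_id" >= 0.95,
        CustomSql "select count(*) from primary where order_total < 0" = 0
    ]
    """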

The fourth thing is the fact that you can have it right in your workflow. Imagine you're a data engineer and you have to go to a completely different environment to configure data quality rules, and you have to read your data twice - once to ingest it and once to run your data quality scans. That's crazy, because you then spend twice as much; reading and writing is one of the more expensive operations. The fact that you can just put it in your workflow, so that while you're reading the data and it's in memory you can run data quality checks at an incremental cost, is just amazing for our customers.

And that is the reason why customers like Millicom migrated their data quality solution to Glue Data Quality. What they found is that because we give them the specific error records, they were able to reduce their time to detection by 50%. Garden is a startup; they were on the path of downloading an open source library and doing all of that themselves, which is not what a startup's business is meant to do - they're innovating in a different area. They now use Glue Data Quality on their pipelines to make sure they can deliver really high quality data to their customers.

So that's what we did over the last year, and we started to ask ourselves: have we done enough? It turns out we have not. So what's going on?

Most of the solutions on the market are rule based, and there are really three problems with a rule-based data quality approach. The first one is the thresholds we set up. These thresholds are usually static, and that's the problem. Imagine you're a data engineer in a retail organization and your transaction volume is about a million records a day. You set it up one fine day and it's all running fine, but your business grows, and those thresholds become obsolete. The moment the thresholds become obsolete, the rules become obsolete.

So let's say your transaction volume grew to 2 million, and then one fine day it dropped from 2 million to 1.5 million. There's a serious problem there, but your old, obsolete data quality rules won't find it. Rule-based data quality alone is just not enough, because the thresholds can become obsolete very quickly.

So that's one problem. What's the next problem? With rule-based approaches, it's really hard to identify unanticipated, hidden data quality issues.

That's a mouthful, but what do we mean by that? Imagine you're again a data engineer in retail. Your data has its ebbs and flows: on weekdays you have really high data volumes, and on weekends you have really low data volumes. That's a pattern you keep cycling through. But one day your weekend data load just shoots up. There's a problem - either you're ingesting duplicate data, or the business is doing really well, or there's a different pattern - but it's something you need to be alerted about. Otherwise, your business is going to come asking you about it a couple of weeks later.

So rules don't capture seasonality. That's a problem. The third problem is one all of us have gone through. If you're writing data quality checks, you can write a few technical rules yourself, which is great, but then you have to go twist the arms of your business users. They'll give you maybe 100 rules, then you say you want more and they'll give you another 200. It's not exactly a pleasant experience, but you can work your way through it. Business users, though, have a completely different agenda: of course they all want high quality data, but they don't want to spend time writing these complex rules, and they'd be more motivated if it were more insight based.

So those are some of the challenges. What did we do about them? Our team is very excited to announce the preview of dynamic rules. We launched this literally a week ago, and a lot of our customers are very excited about it. Dynamic rules improve your developer productivity simply because, once configured, you don't have to keep modifying them.

Here's an example. In the table below, rule number one is RowCount > 100, a static rule. But now you can write a dynamic rule that uses the average of the last 10 runs - and, although we don't show it in this example, you could multiply that by, say, 80%. So every day, as your business properties change, your rules automatically adapt. You can also see an additional example, which is my favorite: adding a standard deviation component to the threshold. You can get very creative in your rule-writing process; it's really simple, yet very powerful.

So your rules become intelligent, and your data quality becomes more adaptable as your business properties change. Writing these is really simple. Dynamic rules have two components. The first is the statistics extraction: there are two functions, last and index, that pull from the statistics we've already been gathering - you don't have to do anything extra, just keep writing rules. Say last(10), and we can compute the running averages and so on. You extract as much history as you want, then apply an aggregation function and use that as a threshold.
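
In a Glue ETL job, a dynamic rule of this kind might look roughly like the sketch below, using the EvaluateDataQuality transform. The DynamicFrame name, the evaluation context, and the exact multipliers are assumptions for illustration.

    from awsgluedq.transforms import EvaluateDataQuality

    # Dynamic rules: thresholds are derived from the statistics of previous runs
    # via last(), avg(), and std(), instead of being hard-coded.
    dynamic_ruleset = """
    Rules = [
        RowCount > avg(last(10)) * 0.8,
        RowCount between avg(last(10)) - 3 * std(last(10)) and avg(last(10)) + 3 * std(last(10))
    ]
    """

    dq_results = EvaluateDataQuality.apply(
        frame=orders_dyf,  # a DynamicFrame produced earlier in the job (hypothetical name)
        ruleset=dynamic_ruleset,
        publishing_options={
            "dataQualityEvaluationContext": "orders_dq_check",
            "enableDataQualityResultsPublishing": True,
        },
    )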

But that again is not enough - what about all the seasonality challenges we just talked about? How do we solve those? We are really excited to announce the anomaly detection capability in Glue Data Quality. It analyzes your data statistics over time and gives you insights into hidden, unanticipated data quality issues without you having to configure data quality rules. It generates observations that you can easily view and understand. And in addition to generating observations, it generates data quality rules, so you can progressively build up your data quality rules rather than having to write them all up front.

So let's dive a little deeper into how this works. If you're an existing Glue Data Quality user - how many of you use Glue Data Quality today? Great, thank you. See, that's what you need to do so you don't have to show up this late on a Friday. If you're an existing user, you're in for a treat, because we've already been gathering your data statistics for quite some time.

Data quality rules generate data statistics - that's what I talked about. Deequ is really good at gathering all these statistics in a minimal number of passes. You use a rules engine, and we generate a data quality score for you, so your business users can understand the quality. We also tell you which specific rules failed and which didn't, with all the detailed information.

Now, what if we took those data statistics and applied a machine learning algorithm that looks at the statistics over time and tells you your insights? That's what Glue Data Quality can do today. These observations are really useful and insightful - you'll see in the demo that you can visualize them over a period of time and see where the anomaly resides.

I told you it's very hard to write rules. We do have a recommendation service, and it's really great - it mass-creates rules for you, and a lot of customers use it. But what if you don't want to create any rules at all? We're introducing a new concept called analyzers. We'll go a little deeper into analyzers in a minute, but in short, analyzers gather data statistics. You configure them in the Data Quality Definition Language. If you're a brand new user of Glue Data Quality and don't want to write rules, simply configure analyzers. An analyzer just instructs Glue Data Quality on which statistics to gather and which columns to look at.

For example, you can say: look at the distinct counts of all my columns. We will start understanding your data, and within a few runs we will start generating observations, which allow you to progressively build data quality rules. Here's an example - I just want to show you the difference between rules and analyzers.

In the top portion, you have rules. RowCount > 1000 is a static rule - though you don't have to write static rules anymore; you can write dynamic rules. The bottom portion is an analyzer. So what's the difference? Both gather data statistics, both can be used to detect anomalies, you can write one or the other, and both generate observations. The only difference is that only rules can impact your data quality score. You have to enforce a rule to say: I need this to fail, and that failure should reduce my score.

If you configure an analyzer, we're just gathering data statistics for you and surfacing insights, and you can progressively turn them into rules. But what if you have both a row count rule and a row count analyzer? By the way, in that case we will not gather duplicate statistics - we've optimized it so we only gather the necessary stats. So you can have overlapping analyzers and rules; don't worry about it. We still gather the stats and generate insightful information for you to build data quality rules and manage the quality of your data.
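
Here's a rough DQDL sketch of what that combination could look like, with one enforced rule plus analyzers that only gather statistics. The column names are assumptions for illustration.

    # A rough DQDL sketch: one enforced rule (affects the score) plus analyzers
    # that only gather statistics and feed anomaly detection. Column names are assumptions.
    ruleset_with_analyzers = """
    Rules = [
        RowCount > 1000
    ]
    Analyzers = [
        RowCount,
        Completeness "fare_amount",
        DistinctValuesCount "pickup_location_id"
    ]
    """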

This whole process makes it really simple and easy for you to manage quality going forward, without the struggle of creating rules. You can start very easily and, over time, build rules and identify these hidden insights. It's all good to talk about it, so let's watch a demo to see how it works.

Hello. In this video, we are going to learn about the new anomaly detection and dynamic rule capabilities of AWS Glue Data Quality. Here I have a Glue visual ETL job that I'm going to build using Glue Studio. The source data contains taxi rides in New York City, including passenger pickup and drop-off locations as well as the transaction amounts. I want to make sure my business teams get accurate data, so let me show you how easy it is to detect hidden data quality issues.

First, I'm going to add a data quality transform at the start of my pipeline, right after the source. Now let's click on the anomaly detection tab. Then let's add some analyzers, which will gather statistics over time. Let's add a row count analyzer, which will analyze the row counts. I'm going to add another analyzer that will gather statistics for the completeness of all the columns. Finally, I want to monitor the distinct count values for the column containing the locations where the taxis are operating.

Notice that I don't have to think about rules or thresholds, which makes it easy to get started. I have completed building the rest of the pipeline for this demo, so let's run the job. On the first run, notice that Glue Data Quality has generated the required statistics. It takes a few runs to learn your data patterns and generate observations. After a few runs, it has learned the patterns and has started to generate observations.

Let's go over them. First, you can see the reduced row counts. This certainly is an anomaly that I will look into, and I'm interested in monitoring this pattern going forward, so I can easily add the recommended rules to the list to review later. Next, let's look at the completeness of the data. I see that the completeness of the fare amount column has dropped significantly, pointing to additional data quality issues. I'm interested in monitoring this, so let me add this to the list as well.

Finally, notice that the number of unique locations where passengers are being picked up has dropped. This indicates a potential problem with the upstream data feeds. Now simply select all the rules and copy them.

Now let's go to the data quality transform and paste the rules. I'm going to use the new dynamic rules to make my rules more intelligent and adaptive. For example, my distinct count rule will now fail when the value drops below 80% of the average of the last 10 runs. With these new features, my pipeline is robust at finding issues that I would not have been able to predict or account for manually. That completes the demo. Thank you.

All right. Hopefully that quick demo made things a little clearer, but I encourage you to try this out yourselves. Now for a little more detail, with the cool animation.

So, what are the important things you should know about this anomaly detection capability?

First of all, as we said, it has to learn past patterns, so it needs at least three data points before it can begin giving you insights. And the longer we have been gathering your statistics, the better we can understand your data and the more meaningful the insights we can provide.

For example, the seasonality aspects we talked about require additional data points. So one thing I'd suggest when you give it a shot is to enable it for a good number of days so it can learn seasonality and then start giving you insights. You probably also know that Glue Data Quality can be invoked from two different places: on your Data Catalog and in ETL. This particular preview is, at the moment, only available in ETL.

So these are some of the things you should know before you go and start playing around with it.

This is all great, but what if you say: you told me analyzers won't impact my data quality scores, but I really trust the anomaly detection and I want the job to stop when it finds something - what should I do?

We thought about that, and we decided to give you a data quality rule type for it, DetectAnomalies. It also generates observations and insights, but when it detects an anomaly, it will impact the data quality score. Which means that if you've set up your data pipelines to stop the job when there's a failure, this particular rule will trigger that and stop the job.

So for instance, if you want to capture the seasonality of your data and you want the job to stop when an anomaly is detected, adding this rule type to your rules section will stop the job, or alert you, and it will reduce the data quality score. So if you're taking a rule-based approach, you have that option as well.
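
In DQDL terms, the rule-based variant might look roughly like this; the statistic being monitored here (row count) is just an assumption for illustration.

    # A rough DQDL sketch using the DetectAnomalies rule type, so an anomaly
    # affects the data quality score and can stop the job if the pipeline is set up that way.
    anomaly_ruleset = """
    Rules = [
        DetectAnomalies "RowCount"
    ]
    """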

All right. So what's the cost? That's the important question.

Well, there's no separate cost for detecting anomalies or any of that. We follow a very simple, predictable pricing model: Glue Data Quality follows the pricing model of Glue, which is 44 cents per DPU-hour.

Let me put this into perspective. Say you ran a data quality job for an hour; typically that will require at least two DPUs, which works out to 88 cents for that hour. A data quality job running for a full hour would be unusual - I don't think you should have such jobs - but even then, it would only cost you 88 cents.

And that's just the model - you don't actually run them for an hour. If it's 10 minutes, we only charge you for those 10 minutes; if it's five minutes, we only charge you for those five minutes. It gets even better.

If you have non-SLA-sensitive data quality scans, you can run them on the Flex tier, which is much cheaper: 29 cents per DPU-hour, roughly a third cheaper than the regular DPU rate. So you can try that option.
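
To make the arithmetic concrete, here's a tiny sketch of the cost calculation. It assumes simple pro-rata billing and ignores any minimum billing duration.

    # Approximate Glue Data Quality cost: DPUs x hours x rate per DPU-hour.
    STANDARD_RATE = 0.44  # USD per DPU-hour
    FLEX_RATE = 0.29      # USD per DPU-hour on the Flex tier

    def dq_scan_cost(dpus: int, minutes: float, rate: float) -> float:
        """Cost of a data quality scan, pro-rated by runtime."""
        return dpus * (minutes / 60.0) * rate

    print(dq_scan_cost(2, 60, STANDARD_RATE))  # 1-hour scan on 2 DPUs -> $0.88
    print(dq_scan_cost(2, 10, STANDARD_RATE))  # 10-minute scan on 2 DPUs -> ~$0.15
    print(dq_scan_cost(2, 10, FLEX_RATE))      # the same scan on Flex -> ~$0.10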

It gets even better, because you can also use our incremental processing capabilities, which means you send a smaller amount of data for data quality scans. And it gets even better with auto scaling: when you're reading all your files, maybe you don't need 100 workers - maybe 10 is enough - so it gets cheaper still.

So Glue Data Quality is a solution that ties cost to value, and there are no complex licensing terms. A lot of our customers turn it on for simple workloads and progressively ease into it - they first see the value of data quality and then move more and more workloads over and consume more. That makes it easy for business users.

And this is the reason why one of the largest banks in Brazil, Itaú, trusts us with data quality. I really like the way they put it. They're very excited - they already use Glue Data Quality, and we have been partnering with them on this anomaly detection capability. We're very fortunate to have them.

One of the cool things they said was that now they can use data quality rules for familiar data sets - the ones they know and know what to expect from - and use the ML-based approach for unfamiliar data sets, combining the power of rules and the machine learning based approach to deliver really high quality data.

And I hope you will do that too, right after you leave this room. Thank you so much for taking the time to come all the way here. We always save the best for last, so we're going to have Matt talk about our recent innovations around authoring. Thank you so much.

Matt Su:

Thanks Shiv, thanks Gaurav, for walking us through connectivity and data management. Now I want to talk about some recent innovations in Glue that use generative AI to help you build data integration jobs faster.

Let me go back to the challenges I covered earlier in the intro. Data engineers have to learn multiple tools to complete different data integration tasks, which means learning new tools, context switching, and just gaining that knowledge.

Often, users have to be aware of the details of the components they use in their data integration jobs - the jargon and the terms - while authoring those jobs. And once they've onboarded a job, executing and troubleshooting it can also be challenging.

Customers often need to be aware of platform-specific settings and parameters. And when there are issues, you have to be savvy or knowledgeable enough to troubleshoot on your own, or find a subject matter expert to help you get past them. These challenges take time to overcome, and they slow down development.

Now, generative AI can be great at solving a lot of these challenges with its natural language and coding abilities. A couple of things it can help with:

  1. Faster onboarding with a tuned model. You'll be able to get relevant, personalized answers to your data integration questions quickly.

  2. You can ask a model to help build a job or write an ETL script for you, which means less heavy lifting - using the model to automate routine and boilerplate work.

All of this helps you become more efficient, making it faster to take your applications and data integration pipelines to market, contributing to more efficient workflows and reduced complexity.

So one feature that's available today in AWS Glue that can help developers be more productive using generative AI is Amazon CodeWhisperer. We've launched AWS Glue Studio notebooks with Amazon CodeWhisperer integration, and it's available today.

CodeWhisperer will generate real-time code suggestions for you as you're working on your ETL jobs in your Glue Studio notebooks. It can generate code suggestions ranging from snippets to full functions, in real time, based on your prompt and your existing code as context.

CodeWhisperer is also optimized for popular AWS data sources for integration, such as S3, RDS, and Redshift. It can generate a lot of code for developers, help you accelerate your application development, and free up developers to focus more on business-critical problems.

Here, I'll show you a quick demo of it in action. With CodeWhisperer and Studio notebooks, you can quickly develop code to accelerate data integration for analytics and machine learning. Just like you see here: simply fire up your notebook, start writing or coding, and let CodeWhisperer guide you to the goal.

CodeWhisperer is optimized for the most-used AWS APIs, making it a great coding companion for those building applications on AWS. While you write your code in Python, or simply type out comments in English within your Glue Studio notebook, CodeWhisperer will provide you with recommendations in real time.

Users can quickly accept a suggestion as is, tab through to get more suggestions, or continue writing the code and give it more context to get better and better results.
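
For example, in a Glue Studio notebook you might type a comment like the first line below and let the assistant propose the rest. This is only an illustration of the kind of suggestion you might see, not captured output; the database and table names are made up, and it assumes glueContext was created in an earlier notebook cell.

    # Read the "sales" table from the "analytics" Glue Data Catalog database into a DynamicFrame
    sales_dyf = glueContext.create_dynamic_frame.from_catalog(
        database="analytics",
        table_name="sales",
    )
    sales_dyf.printSchema()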

Studio notebooks are great and they're out today, but I have one last generative AI feature to talk about in Glue, and that's Amazon Q. Earlier this week in Adam's keynote, he announced the preview of Amazon Q, a new type of generative AI powered assistant designed for work that can be tailored to your business's needs, code, data, and operations.

We love Q, and we're taking it a step further with Amazon Q data integration in AWS Glue. What we did was take a foundation model on Bedrock and further fine-tune it with all of our extensive knowledge of AWS Glue and data integration, and we've integrated this into Q. This capability is coming soon - it's not available yet, but it will be very soon.

The AI assistant in Glue Studio will help you build data integration pipelines using natural language. It can generate ETL code, it can answer your data integration questions, and it will work alongside you as your assistant as you build and troubleshoot your data integration pipelines in Glue.

Amazon Q data integration is powered by foundation models on Bedrock, further fine-tuned with domain knowledge of Glue and data integration. With this feature, instead of waiting for SMEs for help, you can simply type your questions and troubleshoot issues through the chat interface, any time, through Q, and get answers quickly.

It will also be available in the Glue console, in Glue Studio, and through APIs, if you want to put it into your own UI. Amazon Q data integration in Glue is integrated with the broader Amazon Q experience, and there's nothing additional for you to enable. It will generate ETL scripts like you see here: give it information about your data sources and transforms, and it will generate a script to get you started.

For example, you can give it a prompt - say, write a script that takes my data in S3, applies a filter or drop-null transformation, and loads the results into Redshift - and it will quickly generate code to get you started.
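
The generated script for a prompt like that could look something like the sketch below. This is illustrative only - Amazon Q's actual output will differ - and the database, table, connection, and column names are assumptions.

    import sys
    from awsglue.transforms import Filter, DropNullFields
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source data from S3 via a Data Catalog table (names are hypothetical)
    source = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_s3"
    )

    # Filter the rows we care about, then drop fields that contain only nulls
    filtered = Filter.apply(
        frame=source,
        f=lambda row: row["order_total"] is not None and row["order_total"] > 0,
    )
    cleaned = DropNullFields.apply(frame=filtered)

    # Load the results into Redshift through an existing Glue connection (name hypothetical)
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=cleaned,
        catalog_connection="redshift-connection",
        connection_options={"dbtable": "public.orders_clean", "database": "analytics"},
        redshift_tmp_dir=args["TempDir"],
    )

    job.commit()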

Additionally, while you're working, it can reason about and answer any of your questions on Glue features - when to enable them and why. It will answer questions about connectivity to different databases, and it will provide you with common data integration recipes and best practices to help you build your jobs.

This makes it easy to onboard quickly to Glue, to troubleshoot and fix your jobs, and to author your jobs using natural language.

So I'll stop the presentation here. Thank you all very much for coming. Here are a couple of QR codes, one for our official LinkedIn channel and one for the AWS data integration web page, where we share the latest announcements, blogs, and content. So check it out. Thanks everyone.
