Accelerate data preparation with Amazon SageMaker Data Wrangler

Hi. Hello everyone. Welcome to "Accelerate Data Preparation for Machine Learning with Amazon SageMaker Data Wrangler." I saw some friendly faces down there, so hello.

And for the rest of you, thank you for coming to the session. I really appreciate you being here. My name is Huong Nguyen. I'm a senior product manager at AWS, and my co-presenter today is Arun Shankar. He's a senior AI/ML solutions architect at AWS.

Quick question, since I know we just had lunch and I need your help a little bit: how many of you saw Swami's keynote yesterday? Awesome. How many of you caught the Data Wrangler announcement in Swami's keynote? Awesome, some of you caught it. Great. We have some exciting announcements in our session today as well, so I'm looking forward to sharing those with you.

We have a very action-packed agenda for you today, starting with why data preparation matters and the common key challenges with data preparation in machine learning. We'll discuss how Amazon SageMaker Data Wrangler is purpose-built to address those key challenges, and we'll dive into the key capabilities as well as the latest launches that will enable you to accelerate data preparation in your organization.

Then Arun will demonstrate the capabilities in action with a real-world use case, and we'll open up for Q&A. We're happy to hang back afterwards as well to answer any questions you might still have. So let's dive in: why data preparation is important for machine learning.

This is one of the most critical parts of the machine learning workflow, because machine learning models are only as accurate as the data you use to train them. Any machine learning project, after you define the business problem, starts with data preparation. Normally you first need to collect and aggregate data to prepare for machine learning, which usually involves connecting to multiple data sources.

You then need to quickly understand your data using analysis and visualization, especially to identify any data quality issues. You need to check for bias and then clean your data before you can create features for machine learning. Once the data is ready, you export the features to a location that is accessible for training, and ideally register them in a feature store so that you can reuse them for future machine learning models without having to re-prepare the data.

However, the data preparation process doesn't stop there. It is repeated throughout the machine learning lifecycle. Once you train the model and check its accuracy, you might find problems and bias in your model that trace back to the data you used for training. You then have to go back, iterate, clean the data and retrain the model. That's very expensive.

Once you've verified that the model meets all the business and performance metrics, you can register the model and deploy it into production behind an endpoint. At this point, you also need to apply the same data preparation flow to the production data so that you can generate features the model can consume to produce business insights.

After weeks or even months, the data will change and your model will get out of date. Predictions will not be as accurate as they were at the beginning, so you have to repeat the whole process: collect new data, re-prepare the data and retrain the model.

Much research indicates that over two thirds of data scientists' time is spent on data preparation for machine learning. This is a problem, because you want most of a data scientist's time to be spent building and training models.

To help with this problem, we've talked to many machine learning practitioners, data scientists and customers to understand why data preparation is so time consuming. Four themes emerged.

First of all, there are multiple tools you need to use to prepare data, each with its own learning curve and context switching that add more time to your data preparation process. More importantly, most of these tools are general-purpose data preparation tools, not built for machine learning. So you have to write a lot of boilerplate code, which is not difficult but is tedious and time consuming, taking time away from building the model.

After multiple iterations, as we've seen, it's also hard to keep track of which core operations you've performed on the data and what you need to retain as a recipe so that you can automate it in the future.

Next, it's difficult to scale from experimentation to production. The same code you used to experiment on a small subset of data cannot scale to run on millions and billions of rows. You normally have to switch to a more powerful tool like Spark, which many data scientists are not familiar with.

Similarly, it's also difficult to bring the data preparation process into production to automate it. You normally have to refactor the code and make sure it's performant so that it can run in production efficiently. Both of these last steps typically require skills that many data scientists don't have, so you have to bring in machine learning engineers or developers to help, which lengthens the time to bring machine learning from experimentation to production.

Amazon SageMaker Data Wrangler is the fastest and easiest way to prepare data for machine learning, because it is purpose-built to address all these key challenges.

First of all, it is a single solution that helps you prepare data end to end, from data import, analysis, data cleansing and feature engineering through productionization. It is a no-code/low-code tool that removes most of the need to write boilerplate code to visualize and transform data, with 300+ built-in transformations. Most of the time you don't have to write code at all.

It is also very scalable to large datasets. The same data preparation process you've done interactively can then be scaled to run on serverless processing job infrastructure backed by Spark.

And finally, it's easy to operationalize. You have all the tools you need to automate, like scheduled jobs and integration with SageMaker Feature Store and SageMaker Pipelines, so data preparation can run as part of the end-to-end machine learning flow.

So let's take a look at how SageMaker Data Wrangler works at a high level. It divides into five key functional areas that help you complete the end-to-end data preparation process.

It starts with connecting to a wide variety of data sources, where you can browse and select data using a SQL interface or a visual interface. You can then quickly understand your data using built-in visualization and analysis templates, and Amazon SageMaker Clarify to help detect bias.

Once you've identified the data quality issues, you can quickly cleanse and wrangle the data and create features with the 300+ built-in transformations.

Once you believe the data is ready, one of the key differentiators SageMaker provides is advanced analysis that can predict model performance if you were to train a model with the current data. This can save you hours or days of training models on bad data.

Finally, you can operationalize, with the ability to export data and automate with scheduled jobs and SageMaker Pipelines.

The key takeaway from this picture is that data preparation is a multi-step process that SageMaker enables you to complete and iterate on, in whole or in part, as you prepare high quality data to train high quality machine learning models.

Next, I'm going to jump into each of the key functional areas, starting with collecting data. Traditionally, in order to collect data for machine learning, you need to connect to multiple data sources using multiple tools and read data in various formats. This can be very time consuming and introduces a very high barrier to starting any machine learning project.

SageMaker makes this easy by enabling you to quickly connect to multiple data sources from one single interface, with native data connectors to the most popular data sources like Amazon S3, Amazon Athena, Amazon Redshift, Snowflake and Databricks.

In September, we also partnered with Salesforce to bring in the Salesforce Genie connector, so you can use customer data from Salesforce for machine learning. And today, I'm very excited to announce that we now support Amazon EMR as a big data engine to bring in vast amounts of data for machine learning. You can now connect to EMR Presto endpoints, and we are previewing EMR Hive endpoints, which will soon go into production as well.

Once you connect to a data source, SageMaker will read the data in various formats on your behalf; you can see the full list here. More importantly, you can preview and explore the data, view all the metadata, and select the data using a visual interface or the SQL explorer for vast amounts of data.

SageMaker will automatically sample the data for you for the best interactive experience. By default it does top-K sampling, and you can configure it to use random or stratified sampling instead. You can also configure the size of the sample to make sure it is representative of your data.
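
If you're curious what those sampling strategies amount to, here's a rough pandas sketch of the same three ideas; the file path, sample size, and the loan_status column are placeholders for illustration, not something Data Wrangler exposes as code:

```python
import pandas as pd

# Hypothetical dataset; Data Wrangler does this sampling for you at import time.
df = pd.read_csv("loans.csv")

# Top-K sampling (the default): just the first 50,000 rows.
top_k = df.head(50_000)

# Random sampling: an unbiased 50,000-row subset.
random_sample = df.sample(n=50_000, random_state=42)

# Stratified sampling: preserve the class mix of a target column.
frac = 50_000 / len(df)
stratified = (
    df.groupby("loan_status", group_keys=False)
      .apply(lambda g: g.sample(frac=frac, random_state=42))
)
```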

We continue to hear from customers that they need to bring in more and more data, especially from third-party applications, for machine learning. This is currently a very complex and expensive process, because they have to use third-party ETL tools or build their own solution.

So with great pleasure, I get to expand on the announcement Swami called out yesterday in his keynote: SageMaker now supports over 40 third-party applications as data sources for machine learning, using Amazon AppFlow.

For those of you who are not familiar with Amazon AppFlow, it is a data transfer service that enables you to take data from third-party applications like Salesforce, Google Ads, LinkedIn and many more and map it into Amazon S3 or a Redshift database. You can then register the data in the AWS Glue Data Catalog.

As soon as that happens, SageMaker will automatically show all these new data sources to your machine learning practitioners. They can then seamlessly explore and import data using the same interface they use with other data sources. So this is an easy way to import more data for machine learning.

Once you have the data in a single place, the next thing you generally have to do is understand the data and identify quality issues. This usually requires writing boilerplate code, and SageMaker makes it easy by providing popular visualization and analysis templates by default, which you can configure in a few clicks to show you the insights you need.

More importantly, it also provides advanced machine-learning-powered insights such as Quick Model. Quick Model actually trains an XGBoost model on top of the data you have and generates metrics like accuracy, F1 score, a confusion matrix and feature importance, to help you understand whether your data is going to produce high quality machine learning models. This can save you hours or days of training machine learning models on that data.
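
To give a sense of what Quick Model automates, here is a minimal sketch of the same kind of check done by hand with XGBoost and scikit-learn; the file name and target column are assumptions, and it presumes the features and target are already numeric or encoded:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix
from xgboost import XGBClassifier

df = pd.read_csv("prepared_data.csv")     # hypothetical prepared dataset
X = df.drop(columns=["target"])           # assumes numeric/encoded features
y = df["target"]                          # assumes a numeric (0/1/...) target

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(n_estimators=100, max_depth=6)
model.fit(X_train, y_train)
preds = model.predict(X_val)

print("F1:", f1_score(y_val, preds, average="weighted"))
print(confusion_matrix(y_val, preds))

# A ranked feature-importance view, similar to the report's feature summary.
importance = pd.Series(model.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```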

Earlier this year, we also launched the Data Quality Insights report, which quickly became one of the most used analyses in SageMaker Data Wrangler. It automatically generates analyses and visualizations on your behalf and uses machine-learning-powered insights to identify key quality issues you have to address in your data, such as outliers, missing data and target-column-specific issues like a skewed target and very rare classes that will affect your machine learning model.

It also includes the Quick Model insight. What our customers really like about this report is that it's very actionable: it recommends the transforms you can use in Data Wrangler to fix each issue. You can also use it as a snapshot before and after the transformation process to identify any remaining issues or validate that the data is going to be high quality.

And today I'm thrilled to announce that we now also generate automatic visualizations as soon as you import data into Data Wrangler. As you can see on top of the data table here, it automatically generates bar charts or histograms based on the data type it detects, and it also identifies missing and invalid values for you using the small charts at the top of each column.

So at a glance, you can quickly identify the columns with potential issues that you need to dive into. Once you understand your data and identify data quality issues, the next step is to clean the data and create features to train machine learning models. Traditionally, this again requires writing hundreds or thousands of lines of code just to do basic operations.

SageMaker enables you to do that without writing any code, with 300+ built-in data transformations. You can do the most common operations, such as managing columns and handling outliers and missing data, through to the more advanced machine-learning-powered analyses and transforms.

For example, one-hot encoding, balancing data and featurizing text. This year we added more transforms to cover more ML use cases, such as time series transforms that can featurize your data and extract up to 800 features. You can use train-test split to split your data into training and validation/test sets, and many more such as dimensionality reduction, multicollinearity and smoothing.

We also use customer feedback to continuously improve the usability of the data transforms. For example, earlier this year we added multi-column support and the ability to search, so you can quickly find the transform you need among these 300 transformations.

We know there are still going to be use cases where you need custom transformations, so we support Pandas, Spark SQL and PySpark out of the box. And this year, based on popular customer request, we added a common code snippet library so you can quickly grab a snippet for the most common transforms and build your own custom transform or visualization on top of it.

As you iterate through the process of analyzing, cleaning and transforming data, SageMaker automatically captures all the steps on your behalf in a data flow graph. This lets you easily navigate, view the operations and quickly modify or edit any step as needed as you go through the transformation process.

It also allows you to join and concatenate data from multiple sources directly through the UI, reducing the need to write complicated SQL queries. This same data flow is the foundation that helps you automate the data preparation process in the later parts of the machine learning lifecycle.

Now that we've come to the end of the data preparation process, you've transformed your data and validated that it's high quality using Quick Model or the Data Quality Insights report. You can now quickly export the data to S3 or SageMaker Feature Store in a few clicks so that training can access it.

You can export the data to S3, or, as you can see here, you can add features directly into Feature Store as a new or existing feature group. This enables you to reuse the same features in other machine learning projects without having to redo the data preparation process.

Now that you have a complete recipe, including the destination, the next step is normally to scale to very large datasets. Traditionally, this requires rewriting a lot of the code.

SageMaker enables you to do this quickly and easily, using the UI or API, by creating a serverless processing job. You can configure this job to run on very large instances like an m5.24xlarge, or configure it to run on multiple nodes in parallel to speed up data preparation.

When you scale, fitted transforms such as one-hot encoding need to be re-fit to the particular dataset. SageMaker enables you to re-fit these transforms as you scale so that they work correctly on the larger dataset.

And today, I'm excited to announce that you can now also configure advanced settings for the serverless job through the UI, in addition to the API. You can configure partitioning settings for the job as it writes to a destination, as well as Spark memory configuration to further optimize your Spark job based on the workload and your dataset.

Next, we heard from many customers that their data scientists need to automate the data preparation process without the help of machine learning engineers, whether for reporting, training or experimentation.

SageMaker makes this seamless for data scientists by letting you schedule jobs as part of the same serverless job UI you saw earlier. You first create a custom parameter for the S3 source path so that SageMaker can dynamically pick up new data as it comes in.

Then you create a scheduled job to run on an hourly, daily, weekly or custom schedule of your choice. As soon as you create the job with a schedule, SageMaker automatically creates event-based rules on your behalf so that the job runs automatically on the schedule you selected.
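
Behind the scenes this is standard Amazon EventBridge scheduling. A minimal boto3 sketch of the kind of rule involved (the rule name and schedule are hypothetical, and the target that actually launches the processing job is wired up for you, so it's omitted here):

```python
import boto3

events = boto3.client("events")

# A scheduled rule similar to what Data Wrangler creates on your behalf.
events.put_rule(
    Name="data-wrangler-loan-flow-hourly",   # hypothetical rule name
    ScheduleExpression="rate(1 hour)",
    State="ENABLED",
)
# The rule's target (the job launch) is configured automatically by SageMaker.
```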

While the job is running, you can check its status in the same jobs interface that SageMaker Studio provides, or in the SageMaker console.

Next, many customers also want to complete the end-to-end machine learning process using a low-code tool, so they can enable more machine learning practitioners who don't have the expertise, or who simply want a head start, through a low-code interface.

So this year we integrated with the automated solution SageMaker provides, called SageMaker Autopilot, to help you do this end to end. As soon as you export data to S3, you now have the option to spin up an Autopilot experiment.

An Autopilot experiment takes the dataset you just created and asks for your input on the problem you're trying to solve, as well as the business criteria you want to use to evaluate the best model. You walk through a wizard that lets you configure all these settings, and Autopilot also lets you configure deployment settings.

So if you want to deploy the best model to an endpoint, Autopilot will do that on your behalf. As it does so, it also packages the same data flow you generated for the training dataset and deploys it with the endpoint, so that in production, incoming data is transformed into features the model can consume to generate business insights.
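
If you prefer the SDK over the wizard, here is a rough sketch of launching an equivalent Autopilot job with the SageMaker Python SDK; the S3 path, target column and objective metric are assumptions, not values from this session:

```python
import sagemaker
from sagemaker.automl.automl import AutoML

role = sagemaker.get_execution_role()   # assumes this runs inside SageMaker

automl = AutoML(
    role=role,
    target_attribute_name="loan_status",   # hypothetical target column
    problem_type="BinaryClassification",
    job_objective={"MetricName": "F1"},
    max_candidates=10,
)

# Point Autopilot at the S3 prefix where Data Wrangler exported the prepared data.
automl.fit(inputs="s3://my-bucket/wrangler-output/loans/", wait=False)
```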

Autopilot has improved its performance by 10x this year through integration with AutoGluon and other optimizations, so this is a great way to speed up the time from data to business insight. Together, SageMaker Data Wrangler and Autopilot can really help you accelerate data preparation and the end-to-end machine learning process.

We also know that many of our more mature organizations want to automate the data preparation process as part of a machine learning pipeline, whether for feature engineering, training or inference. So very early on, we integrated with SageMaker Pipelines to enable you to do this easily through the same low-code interface. This step is normally done by a machine learning engineer.

But now data scientists have the option to publish the same data flow they just created directly into a SageMaker pipeline. This generates a Jupyter notebook with the code they need to create a SageMaker Pipeline object containing a data processing step that runs the same data flow they just created.
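
The generated notebook essentially builds something like the following with the SageMaker Python SDK; this is only a sketch, and the container image URI, S3 paths, instance settings and names are placeholders:

```python
import sagemaker
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

role = sagemaker.get_execution_role()

# Processor that runs the Data Wrangler container against the exported .flow file.
dw_processor = Processor(
    role=role,
    image_uri="<data-wrangler-container-image-uri>",   # placeholder
    instance_count=2,
    instance_type="ml.m5.4xlarge",
)

dw_step = ProcessingStep(
    name="DataWranglerProcessing",
    processor=dw_processor,
    inputs=[ProcessingInput(
        source="s3://my-bucket/flows/loan-default.flow",
        destination="/opt/ml/processing/flow",
    )],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-bucket/wrangler-output/",
    )],
)

pipeline = Pipeline(name="loan-feature-pipeline", steps=[dw_step])
pipeline.upsert(role_arn=role)
pipeline.start()
```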

They can then easily add a training step or a publishing step for a feature engineering pipeline. And today, with great pleasure, I also get to announce that we now make deploying a data flow to an inference endpoint much, much easier using the low-code interface in Data Wrangler. When you export the data flow, you now also have the option to export it into a SageMaker inference pipeline. Similarly, it generates a Jupyter notebook with the code you need to deploy the data flow directly into an inference endpoint, and you can bring your own model in this process, or use an Autopilot inference endpoint or a built-in SageMaker algorithm endpoint. Arun will show you all of this in the next few minutes.

So in summary, we've just walked through the many capabilities Data Wrangler provides to help you accelerate data preparation for machine learning, starting with importing data from over 40 data sources and applications. You can then quickly clean and prepare your data using built-in transformations and analyses without writing any code. After that, you can scale to process millions and billions of rows without having to rewrite any code. Finally, you can automate, with the ability to schedule jobs and with integration with SageMaker Pipelines for training and inference pipelines.

With that, I'd like to introduce Arun Shankar, senior solutions architect at AWS, to demo Data Wrangler in action. Arun used to be a data scientist, so he's going to tell you the story from a data scientist's perspective. Thanks.

Thanks for the great overview of Data Wrangler. Huong pretty much covered all the existing functionality and the functionality we recently launched. So next, we're going to see a live demo of Data Wrangler in action, where we'll see all those capabilities at work.

This is the overview of today's demo. The dataset we're going to look at is a loan default prediction dataset. It's a very rich, nice dataset. Let's assume you are a lending company or a bank trying to give out loans to applicants. You have two failure scenarios: either you lend to applicants who never end up paying back the loan, which means they default and you lose money, or you mis-predict in the other direction and reject an applicant with a good history of paying loans back on time. So you want to look into the historical data of these applicants and build a prediction model. That's what we're going to do.

We're going to take the dataset, which exists in two pieces in S3 and Salesforce, and combine them, so we'll see how to do joins in Data Wrangler. We'll also see how to do automated analysis: with a few clicks, you can kick off automated analysis that runs behind the scenes and generates a lot of insights for you, and from there Data Wrangler pretty much guides you toward the right set of transformations. We'll see all of that in action.

At the end, we'll see how to take everything we did during the experimentation phase and operationalize it into processing jobs that can scale to hundreds of millions of rows. And finally, we'll take the output of Data Wrangler and see how to launch an Autopilot job to automatically train a model with it. Perfect, let's get started.

Awesome. What you see here is SageMaker Studio. That's our IDE, an integrated development environment for machine learning; it ties all the SageMaker components together. If you're familiar with it, you'll notice we just redesigned the interface, so it looks a lot fresher now.

Within SageMaker Studio, there are several ways to get to Data Wrangler. One of the easiest is to go to the quick actions on the Home page and choose "Import and prepare data visually." When you hit that, it creates a data flow file for you, like the demo.flow and loan-default.flow files you see here. As the data flow is being created, you'll see it launching the compute needed to run this interactive session. I've done that already so we don't waste any time: you can see the Data Wrangler instance is already up and running, and the SageMaker Data Wrangler app, if I expand this, is up and running too. Let's go back to the data flow, which is demo.flow.

What I've also done is take part one of the dataset, which is sitting in S3, and import it into the data flow. But let's see how to import part two of the dataset, which lives in Salesforce. It exists as a table in Salesforce, so let's create a connection to pull that data into the data flow.

When you hit "Create connection" here, you see all the native connectors we support out of the box: S3, Athena, Redshift, Snowflake, EMR and Databricks. You'll also see the 40-plus additional data sources you have access to. If you've established access to these data sources through AppFlow, you'll see them available right at the top. In this case we've already done that for Salesforce, so I'm going to click on it. The first thing I need to do is give my connection a name. Part two is all the demographic data about the loan applicants, so let's call it "loan demographics" and hit connect.

As soon as you do that, it automatically pulls in all the databases and tables you have access to. In this case it's a single table we created under the Data Wrangler database; you can see there's a table coming through AppFlow. When I click on it, I get a preview. You can see four feature columns here, all related to the demographic information of the applicant applying for the loan.

You can also write queries here; this query engine is powered by Athena. So not only can you import data, you can also slice and dice it with queries if you want to. In this case I'm just going to import everything, but theoretically you could run complex Athena queries here.

Let's run the query. Once I run it, it pulls the data in and you get a preview of the query results. Now I can select sampling, and this is very important: our dataset is hundreds of millions of rows, and we don't want to pull all of that into this interactive session. What we want instead is a sample. In this case I'm going to take a first-K sample, the first 50,000 rows. You could also do a random sample, or stratified sampling if you have a lot of classes and want to make sure everything is balanced. Because I'm going to be doing a join, I'll keep it simple and pull the first 50,000 rows. I select that, import my query, and give it a name: loans-part-two.csv.

Once you do that, it pulls the data from the Salesforce table into this interactive session. You get a preview of the table, and you also see options to add transforms. We'll get to the transforms later.

Let's go back to the data flow. We can see we now have two imported data sources: the S3 data, which we imported directly through an S3 browser, and the Salesforce data, which we imported by making a connection and pulling it in. What we want to do next is combine them, so I'll show how to do joins with these datasets.

Let's click on this plus icon and hit "Join." That opens a join view where you can select the other block, and as soon as you do, you get a join preview: this diamond-shaped node that gets created. You're seeing that every time you take an action in Data Wrangler, it translates that action into a directed acyclic graph, a DAG.

You can hit "Configure" here, which opens another view where you see the tables side by side. Part one is all the loan data, the loan characteristics: things like loan amount, the status of the loan and so on. Part two is the demographics. Now let's select the type of join.

In this case, I'm going to do an inner join on the ID column, because both tables have the ID column, a unique identifier for each individual row. Then I give it a name, let's call it "joined dataset," and hit preview. You get a preview of the joined dataset, and then you can add it to the data flow.

So you can see what happened: we joined the dataset. The data is now in a good state and ready for exploratory data analysis, but first we want some guidance on it. That's why we have this new feature called "Get data insights."

When you hit "Get data insights," it asks you for the target column you're trying to predict. In this case my target column is loan status, which is this column here. I have three classes, and I choose classification because I want to classify as part of my prediction model. Then I hit create.

When you do that, it takes anywhere between two and three minutes, because it runs a lot of analysis for you behind the scenes. I've done that already, because I've played with this dataset many, many times. Here you can see what the output looks like once it's created, except you'll also have a download option to save it as a PNG. I've already downloaded it, so I'm opening the completed report for you.

It comes back with a lot of things, and we'll go deep into each of the interesting insights it returns. It starts with dataset statistics, which tell you things like what percentage of the data has missing values, a breakdown of all the feature types and their counts, and whether there are any duplicate rows in your dataset. It also comes back with warnings.

We call these red flags. Every time you start with a dataset, there are going to be some obvious, critical issues you need to take care of. The insights report surfaces those issues and buckets them into high, medium and low severity. In this case we don't have anything high, but you'll see at the bottom that we have a few medium warnings.

No duplicate rows, we know that, so that's not an issue. You can see it comes back with anomalous samples: these are anomalies in the dataset, ranked by anomaly score. This works based on Isolation Forest, so you don't have to do any of that implementation yourself. It's already running Isolation Forest behind the scenes, looking at those 50,000 sample rows, and coming back with these anomalies.
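
You could reproduce a similar ranking yourself with scikit-learn's IsolationForest; a minimal sketch, with a hypothetical file name and using only the numeric columns:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("loans_sample.csv")            # hypothetical 50,000-row sample
numeric = df.select_dtypes(include="number").fillna(0)

iso = IsolationForest(random_state=42)
iso.fit(numeric)

# Lower scores are more anomalous, similar to the report's ranked anomaly list.
df["anomaly_score"] = iso.score_samples(numeric)
print(df.sort_values("anomaly_score").head(10))
```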

This is something you can drop if you want to, or you can look at it in depth. The next thing it looks at is the target column. It's very clear from the distribution here that our dataset is highly imbalanced, so it comes back with a medium warning saying there are rare target labels, or classes, in the dataset, and we need to do something about it.

It comes back with prescriptions. If you read through them, it says we need to add more observations, i.e. more data, or consolidate all the minority classes, the ones with a really low percentage, in this case less than probably 15%, into one class and change the problem type from multi-class to binary classification. You get all those insights here, and it even points to the exact transforms you can apply when we get to the transform side of things.

It also runs a Quick Model: it takes the existing state of the dataset as it is, splits it, trains a model on it and comes back with metrics so you can see how good the current state of the data is. We know the current state is pretty raw, we didn't do any transforms, and you can see the result is pretty bad. The balanced accuracy, which is what we should be looking at here, is close to 35%. If you look at other metrics like F1 score, precision and recall, you can see they're good for the majority class but really, really bad for the minority classes, because the classes are imbalanced. The confusion matrix tells the same story.

We also get an insight here explaining that this is basically worse than a dummy classifier that always predicts the most frequent class, so we need to do something about it. Keep going down and you get feature summaries: how important each of your features is toward the prediction. This is a ranked, scored list of individual features and how much they contribute to the prediction.

You can also see other things, like the distribution of individual features and how each correlates with the target labels, and you can even look at outliers if you keep scrolling down in this plot. But what we really want is to act on all the insight it comes back with, because it's also guiding you toward the transforms you need to apply to fix these issues.

So let's go to the full, complete data flow. It's a pretty long data flow: we started with importing data and joining, you can see we ran the data insights report, and there are also other transforms we've done.

Let's look at what we did after completing the insights report. After the join, I've added eight steps, based on what we saw in the report. I'm dropping some of the redundant columns created by the join, and I'm also dropping all the missing values: we have a transform called "drop missing," so you just choose it, select all the columns (we have multi-column select), and say you want to drop them.

"But you also have other strategies. So, you know, for this demo, I'm just dropping it. But you can also go on impute if you wanted to and we have strategies for that or you can fill them with missing v like, you know, just values if you think, you know, you have a, you have a way to do it. And if you wanted to kind of fragment them and do it differently, fault layers, you can also do the same thing.

Here I've handled all my numeric outliers. I pick all the columns that are numeric and apply a standard deviation rule as the way to fix them: any numeric value that is more than three standard deviations out, I drop. We can also fix some of the rare categories. We have a lot of categorical columns in our dataset, and we want to make sure the representation of these categories is fairly even; we don't want categories that are so infrequent they don't make sense.

We have transforms to consolidate those. In this case, for home ownership, we're saying that if we see any category below a fraction threshold, in this case less than 88%, we rename it to "other." That way I'm creating a new category called "other" that consolidates all the rare categories into one big category. We do something similar for the target classes too, because that's very important: we saw we have those two minority classes, so we want to fix that.
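
For intuition, before we get to the target classes: the two steps just described, dropping three-standard-deviation outliers and consolidating rare categories, boil down to something like this in pandas. The file name, column name and the 5% threshold are placeholders, not the demo's exact settings:

```python
import pandas as pd

df = pd.read_csv("loans_joined.csv")                    # hypothetical joined sample

# Drop rows whose numeric values fall more than 3 standard deviations from the mean.
numeric_cols = df.select_dtypes(include="number").columns
z = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()
df = df[(z.abs() <= 3).all(axis=1)]

# Consolidate rare categories of a column into a single "other" bucket.
threshold = 0.05                                        # hypothetical fraction threshold
freq = df["home_ownership"].value_counts(normalize=True)
rare = freq[freq < threshold].index
df["home_ownership"] = df["home_ownership"].where(
    ~df["home_ownership"].isin(rare), "other"
)
```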

There is again a transform for that. All we have to do is go to this column and rename "Charged Off," which was one of the minority classes, and the other one, "Current," to "Default," because these minority classes will most probably end up defaulting again; this is based on some assumptions. We do that for both minority classes, and in the end we have two classes, but they're still not balanced.

We want to see if we can do some sort of sampling here to balance things out so we have an even class representation. In that case, we apply SMOTE. This is an option where you can choose strategies to balance the dataset: we have random oversampling, undersampling and SMOTE.
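
For reference, the SMOTE option corresponds to what the imbalanced-learn library does; a minimal sketch, assuming the features are already numeric or encoded, with placeholder file and column names:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE

df = pd.read_csv("loans_binary.csv")     # hypothetical dataset after class consolidation
X = df.drop(columns=["loan_status"])     # assumes numeric/encoded features
y = df["loan_status"]

# SMOTE synthesizes new minority-class rows so the two classes are evenly represented.
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)
print(y_balanced.value_counts())
```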

Right now this is supported for binary classification. And once that's done, what we've done so far is take a series of actions in the form of transforms. If you go back to the original data flow, you can see that everything I showed in that list, all those eight steps, just shows up as nodes here.

Now you still want to validate that the actions you've taken are actually the right actions. So you can just rerun the Data Quality Insights report, a second pass, to make sure all those issues you saw initially, the warnings, the distribution of the target classes, even the distribution of the numeric features, all look better and different now.

I've actually done that: I reran it, created a second report and downloaded it, which is what I'm showing you here. You can see there are now no missing values, because we applied a transform for that. If you come down, I haven't handled the anomalous samples, so they still exist, but we could have dropped them if we wanted to. More importantly, we see that the classes are balanced now.

We converted it into a binary classification problem. The Quick Model, which is an XGBoost model, took the current state of the dataset, and when we ran it, it came back with better scores. We have only two classes now, and it's doing really well for both of them, almost even performance. The confusion matrix also looks better, with very few false positives and false negatives.

More importantly, the feature attribution table looks a lot better: almost every feature is contributing to the prediction, so the plot looks really good now. Similarly, you can go and look at individual features and do all of that. But now let's get to the last phase.

Where we are right now, if we go back to the full data flow, is that we've fixed issues and created reports, but we still haven't done any transforms for machine learning, meaning the classic feature engineering transforms you would do. So that's the last batch we're doing here.

We quickly add a few more transforms; let me show them as a list here. I'm adding, for example, one-hot encoding. We have a lot of categorical features here, close to 76 features, so we apply one-hot encoding and say that we want all these categorical features converted to a vector.

You can also create new columns, but in this case we're just creating a sparse vector, which is very handy. You can even choose other strategies; if you're familiar with similarity encoding, we support that for categorical columns too. For numeric columns, we're going to scale them. This is a classic feature engineering step you should always do: I have all these numeric columns on different scales.

Now I want to bring them onto one scale. In this case we're applying a standard scaler, but you could also apply any of the other scalers we have here. Similarly, for the two date columns we had, we're bringing in custom transforms, meaning I want to take the date and convert it into a number.

For example, I had a feature called "issued on," when the loan was issued. It's a date, and it could be seven years in the past. I want to take that date and convert it into a number: how many days since the loan was issued. We do that with a simple Pandas transform.
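
Here's roughly what that Pandas snippet looks like inside a Data Wrangler custom transform, where the current dataframe is available as df; the issue_d column name is a placeholder for the demo's issue-date column:

```python
import pandas as pd

# In a custom (Pandas) transform, Data Wrangler exposes the current dataframe as `df`.
df["issue_d"] = pd.to_datetime(df["issue_d"])
df["days_since_issued"] = (pd.Timestamp.today() - df["issue_d"]).dt.days
df = df.drop(columns=["issue_d"])
```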

We take that code snippet and put it here as a custom transform, in this case Pandas. You can also use PySpark, user-defined functions (UDFs) in Python, and SQL statements written in Spark SQL. All of that is allowed. We also have a cool feature called templates.

If you're creating a custom transform this way, you can search for "custom" at the top; we have the custom transform there, and there's an option called "Search example snippets." You can look at a lot of different possible transforms you could write in PySpark.

For example, let's say I want to sort by a column or a subset of columns: when you click on that template, you get the code, at least as a starting point, for creating your own custom transforms.

Let's go back to our list. We also applied a few other things. I had a text column in there, and we have a way to vectorize it using a count vectorizer, or TF-IDF if you're familiar with that; it's classic NLP, and we support that too.

At the end of it, we've taken all these steps. We took data that was pretty dirty and did a lot of cleansing: we ran DQI, the Data Quality Insights report, cleaned the data, applied transforms, validated the result, and added feature engineering steps that are very specific to machine learning.

Now, at this point, let's say this is the end state. We want to take the recipe we've created so far and scale it, because our dataset is hundreds of millions of rows, but what we worked on here was only 50,000 rows.

To do that, we have an option called a destination node. You click on the plus icon and add a destination. There are two options we support right now; these are destination sinks, meaning what we've created so far is actually a feature engineering pipeline, a data prep pipeline.

Now we want the output to go to a sink, in this case S3 or Feature Store. Let's add S3 as a destination sink. Once you do that, you just have to give it a name, so I'm going to call it something like "loan output," and then choose a location where you want to drop that output.

I'll pick an S3 bucket, let's go with that, and a folder to write to. Then you can specify partitions: you may want to partition the data, because this is going to create a distributed processing job powered by Spark.

It's a Spark job that will run, so if you want the data that comes out to be partitioned, you can do that. I'm just going to say 10 partitions. You can also partition by column if you want to; it's all optional.

So add the destination first; you're saying you want the output to go here. Then you have the option to create a job. What you're doing when you create a job is taking the same data flow recipe and saying you're going to apply it to a dataset that has the same schema but is much larger.

When you create a job, one of the first things you have to decide is whether you want to refit or not. Things like scalers: we used the standard scaler for normalizing the numeric columns, so do you want to refit that scaler? The count vectorizer we used to handle the employment title, let's say we want to refit that too. Then configure the job.

Here is where you specify the number of nodes you want for the Spark job. I'm going to specify 10 nodes, so I get this big cluster, take this recipe and scale it on the larger dataset I have. You can also specify the Spark memory configuration.

We won't cover that in detail; you can even override the defaults if you want to. There are also ways to parameterize this entire data flow you've created so you can make it more flexible, if you think your data might change a little or you want to be able to plug it in different ways.

Right at the bottom, you also see the option to create a new schedule. When you click that, it shows you an option where you can specify, OK, I want to repeat a run of this Spark job every hour, because my data is getting refreshed almost every hour, if that's what you want.

So let's create it. When you create it, it basically creates a job. Behind the scenes, if you look in the SageMaker console, you'll see there's a processing job running and crunching the larger dataset.

Another thing you can do from this point is say: I don't actually want to scale, maybe my dataset is small, but I want to train a model. This is what Huong showed in the form of a screenshot: take the output of Data Wrangler and kick off an Autopilot experiment.

For that, when you click on train, it takes you to a screen where you again specify a location; this is where Data Wrangler will write the data to. When you hit export and train, what happens behind the scenes is that it exports the data, bridges Data Wrangler with Autopilot, and opens up this new "Create an Autopilot experiment" interface.

Here you can see that all the fields are pretty much populated; it already knows where to pull the data from, because this is the transformed data. You can go on to choose other things: you can do feature selection here, choose your target column, go all the way to the end, and then just kick off the Autopilot experiment.

That's one way to end the flow: you go with the sample output and kick off an Autopilot experiment. You can also do other things here. A lot of the time, developers and data scientists want to look at code.

So far, everything we did was through the visual interface. But what if I want to do the same thing through code, because I want to make it part of a CI/CD process? I want to take the recipe and run the same processing job, but do it in code.

So we also give you the option to export it as a Jupyter notebook. When you click that, it creates a notebook with all the code that will create the processing job and write the output to an S3 bucket. This is the code, so you can modify it and just execute it.
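
The generated notebook boils down to a call like this against the SageMaker API; a stripped-down boto3 sketch where the job name, role ARN, image URI and S3 paths are all placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_processing_job(
    ProcessingJobName="loan-flow-job",                                   # placeholder
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",     # placeholder
    AppSpecification={"ImageUri": "<data-wrangler-container-image-uri>"},
    ProcessingResources={"ClusterConfig": {
        "InstanceCount": 10,
        "InstanceType": "ml.m5.4xlarge",
        "VolumeSizeInGB": 30,
    }},
    ProcessingInputs=[{
        "InputName": "flow",
        "S3Input": {
            "S3Uri": "s3://my-bucket/flows/loan-default.flow",
            "LocalPath": "/opt/ml/processing/flow",
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
        },
    }],
    ProcessingOutputConfig={"Outputs": [{
        "OutputName": "output",
        "S3Output": {
            "S3Uri": "s3://my-bucket/wrangler-output/",
            "LocalPath": "/opt/ml/processing/output",
            "S3UploadMode": "EndOfJob",
        },
    }]},
)

# Block until the job finishes.
waiter = sm.get_waiter("processing_job_completed_or_stopped")
waiter.wait(ProcessingJobName="loan-flow-job")
```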

You have that option, and you can also export it as pure Python code if you want, or export it to Feature Store. And we saw that you can also take all of this and bundle it into a pipeline.

SageMaker Pipelines is an orchestration tool: a tool where you create and orchestrate pipelines and workflows. When you choose that, you again get a notebook that takes this data flow and bundles it into a SageMaker pipeline. But what I want to cover a little deeper is the inference pipeline, because that's a new feature we launched.

Think about it: you start with Data Wrangler and do your feature engineering, and then let's say you use Autopilot, or your own custom modeling strategy, to train a model. Now you want to deploy that model and create an endpoint for it so you can make inferences.

At that point, when you're making inferences, you have to make sure your incoming payload goes through the same set of transforms that your training data went through. That's an important step, and until now we didn't have an out-of-the-box solution for it.

You would have had to create a container for that yourself: take the data flow and figure out how to apply it to your incoming payload. With this option to export to a SageMaker inference pipeline, we let you do that automatically.

Let's open the notebook; I already have it here. You can see that this generated notebook, if you go all the way down, asks for only a few things: you just have to point to the location where you have your trained model.

This can be an Autopilot-trained model, your own custom model, or a model built with any of the built-in algorithms we support. You specify the location of the model and then create the inference pipeline; it's just two basic steps.

When you create an inference pipeline, it creates something called a pipeline model, which is basically two containers. There's a preprocessing container, which takes the data flow and applies all its steps (it has all the scalers and everything in there) to the incoming payload, and then passes the transformed payload to the prediction container, the second container, which is the one that serves your model.
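
In SDK terms, the generated notebook ends up building a PipelineModel with those two containers; a sketch where the image URIs, model artifact paths and names are placeholders:

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

role = sagemaker.get_execution_role()

# Container 1: the Data Wrangler preprocessing model built from the exported flow.
preprocess_model = Model(
    image_uri="<data-wrangler-inference-container-uri>",        # placeholder
    model_data="s3://my-bucket/flows/loan-flow-artifact.tar.gz",
    role=role,
)

# Container 2: your trained model (Autopilot, built-in algorithm, or bring your own).
trained_model = Model(
    image_uri="<model-container-uri>",                           # placeholder
    model_data="s3://my-bucket/models/model.tar.gz",
    role=role,
)

pipeline_model = PipelineModel(
    name="loan-default-inference-pipeline",
    role=role,
    models=[preprocess_model, trained_model],   # payload flows through them in order
)

predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="loan-default-endpoint",
)
```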

Finally, there are other options where you can export the dataset directly. We didn't show that, but let's say you don't want to do any of this and just want to export the data straight to S3; we have an option for that too.

I think that pretty much sums up everything we wanted to cover today. We started with a dataset that existed in two pieces and combined it, we went through some cool ways to do automated analysis, we saw how to fix the data, clean it and do feature engineering with it, and finally we saw all the different ways you can operationalize it: scaling to larger datasets, creating models through Autopilot, and building inference pipelines, which takes you all the way to the end, to endpoints.

Now we'd like to open up the floor. If you have any questions, feel free to get up and ask. They can be related to the demo or to anything we covered, and we'll make sure we get them addressed. Thanks, everyone.
