Boost ML development productivity with managed Jupyter notebooks

Hello, everyone. Thanks for coming in. Welcome to AWS re:Invent. I'm Sumit Thakur, Principal Product Manager at AWS, and today I have with me Sean Morgan, who is a Senior Solutions Architect at AWS, and Ritesh Shah, who is the Chief ML Architect at Vanguard.

Now, before we dive into this topic of how you can use SageMaker managed notebooks to improve your productivity for developing machine learning models, I want you to take a look at these three images. Don't worry, they're not science fiction. These three images represent my three favorite pivotal moments from three different scientific fields.

The one on the extreme left is the first-ever picture of a black hole, captured in 2019. The one in the middle is the first-ever direct observation of gravitational waves, made in the LIGO lab at Caltech back in 2015. The one on the extreme right is more recent. This is from 2020, at the height of the pandemic, when tons of data science communities came together to create real-time dashboards for tracking the spread of COVID-19 infections, consuming data from the Johns Hopkins Coronavirus Resource Center.

Now, why am I showing these images? There's one thing that ties them together, and that's the use of the open source ecosystem of data visualization and data analysis packages in Jupyter notebooks. Jupyter notebooks were at the center of each of these three pivotal moments. This speaks to how versatile Jupyter notebooks are and how big an impact they have had over the years on different scientific fields and on data science.

So it comes as no surprise to me that Jupyter notebooks have seen rapid adoption in the last decade. Today, we have more than 10 million Jupyter notebooks on GitHub, created by tens of millions of Jupyter notebook users, a 50x increase in just the last seven years. Also, in a recent Kaggle survey, which all of us follow, more than 70% of data scientists, data engineers, and machine learning developers acknowledged that they use Jupyter and JupyterLab in their day-to-day work.

Given this huge popularity of Jupyter notebooks, SageMaker, which is AWS's flagship service for data science and machine learning, has given first-class support to notebooks since day one. Back in 2017, we launched SageMaker notebook instances, fully managed Jupyter notebook instances in the cloud, prepackaged and preconfigured with popular open source data science and machine learning libraries to get you started quickly with machine learning in the cloud.

Then in 2019, we took a major step forward and launched SageMaker Studio, the first IDE purpose-built for machine learning, built on top of JupyterLab. In SageMaker Studio, you can build, train, tune, deploy, and monitor your models, all in one place.

And then last year, we extended the benefits of SageMaker and SageMaker Studio to the millions of Jupyter users out there by launching SageMaker Studio Lab, our free notebook service, which brings notebooks, along with free GPU compute and storage, right to your browser.

Having worked with notebook developers for many years, we have developed some unique perspectives on the challenges they face and the opportunities for innovation. But before we go there, let's look at a typical notebook builder workflow.

Of course, it all starts with data. You ingest a small sample of data in your notebook and then you explore that data to understand data distributions and their statistical characteristics. Then you go on to spend quite a bit of time in cleaning and preparing data and extracting meaningful features from the raw data. And once you have those features, you go about experimenting with a combination of features, algorithms and parameters to create different model candidates.

Once you reach a stage in your experimentation where you have identified a model with reasonable accuracy, you want to scale your experiments. Scaling an experiment might mean training the model on large data sets. It might also mean retraining models over time whenever new data comes in from upstream.

Now, each of these three steps of preparing data, experimenting with your models, and then productionizing a notebook has challenges. Let's look at them, starting with preparing data. For data exploration, you still have to write tons of boilerplate code using many different data visualization packages. Very often, these packages are not purpose-built for machine learning, which means you have to write even more undifferentiated code to do things like detecting data bias that might impact your model accuracy.

Now, when you go on to prepare data, very often, your model needs large data sets that may not fit on a single notebook instance. So you need to provision a cluster of machines to run large scale data prep jobs. Now provisioning a cluster and managing a cluster lifecycle is a huge operational overhead and you really want to focus on data science, not on operations.

Now, after you are done with your data prep and you go on to experiment, you want to collaborate closely with your peers, because that's the time when you want to benchmark, compare, and track experiments together. And I'm sure any of us who have collaborated over notebooks have used different communication channels to share them: we post them on Slack, send them as email attachments, or drop them in a shared folder. These methods are easy, but you end up creating multiple copies of notebooks floating all around, which are difficult to track and hard to reproduce. Plus, it takes a lot of back and forth to communicate with your peers this way, which means longer feedback loops and delayed time to insights.

And then finally, after doing all the hard work, you have a good notebook and now you really want to scale it up, but the story doesn't end here. Now you have to copy those small snippets of code from each notebook cell, put them into a script, figure out the dependencies, package them all into a container, set up the infrastructure to run this container as a job, and then manage this entire lifecycle. Doing each of these steps requires many DevOps tools, and you have to learn DevOps best practices if you really want to scale, which again means undifferentiated work for data scientists.

So our customers were telling us that these challenges in the specific jobs to be done by notebook developers were impeding their productivity and increasing their time to insights. But not anymore. I'm very pleased to introduce the next generation of SageMaker Studio notebooks, which improve data scientists' productivity through a suite of new features. It includes a simplified built-in data prep capability, serverless kernels for Spark and Ray, real-time collaboration, and one-click automated conversion of notebook code into jobs.

We announced these capabilities yesterday in the AI keynotes from Swami and Bratin. Today, we're going to do a bit of a deep dive and show you a demo as well. Let's focus on the first pillar, data prep, where we spoke about the challenges of exploring data, writing a lot of undifferentiated code, and then having to manage clusters to prepare data.

We are pleased to introduce a simplified data prep capability inside Studio notebooks. It's a built-in capability that automatically generates data distributions and data insights right inline in notebooks. All you need to do is go into your Studio notebook and query a DataFrame using pandas. The moment you display the DataFrame, the built-in capability automatically generates histograms and bar charts for every column in your DataFrame. You can quickly scan these distributions to understand the statistical properties of your dataset and spot issues like skewed distributions and outliers.

The notebook goes further and gives you data quality insights that might impact the accuracy of the model you are trying to build. For example, if you're building a classification model and you have identified a column as the target column for prediction, the notebook will tell you if there are imbalanced classes in that column or too few values for certain classes. You can review those recommendations inline in your notebook cell and apply them in just a few clicks. The notebook also makes sure that for every click you take to prepare data, it automatically generates the code and puts it back in a notebook cell, so you can take this notebook and reproduce your data pipeline anytime, anywhere.

Now, once you have explored the data, cleaned it up, and removed some of the data biases, you want to prepare the data and extract features, and you really don't want to set up large clusters. We are pleased to introduce serverless kernels for Apache Spark and Ray in Studio notebooks, all managed by AWS Glue. To access the serverless environment, all you need to do is open your SageMaker Studio notebook, go to the notebook toolbar where you select the kernel, and select the Glue Python kernel. Then you write just one line of code to initialize a serverless data prep session managed by Glue, and you can initialize the session with either Spark or Ray with a simple parameter.

Once you run this one line of code, it automatically spins up a serverless environment in seconds, and then you can continue doing all your work: browsing the Glue Data Catalog, ingesting data using SQL, or writing your Spark and Ray jobs and executing them on the fly. The environment automatically scales up and down depending on the compute needs of your data prep job; you no longer have to manage clusters. Also, once you are done preparing your data, the environment automatically scales down, so you're only paying for the time you use, no more paying for large, underutilized clusters.

We're going to show you a demo of how cool this is, with the visual data prep built inside the notebook and the serverless kernels. Now, moving over to the experimentation phase: how do we solve the slow pace of collaboration between teammates, which impedes their productivity?

We are pleased to introduce the real-time collaboration capability in SageMaker Studio notebooks, with which you and your peers can view, run, and review the same notebook in real time. Let's see how it works.

The first thing you do is go into SageMaker and create a collaborative space. A collaborative space is like a workspace for you and your team members to collaborate on an ML problem. It gives you access to a shared SageMaker Studio with a shared EFS file system, which means all of you are accessing the same Studio and see the same set of notebooks in real time.

You can edit the notebook with real-time cursors displayed for each user who is logged into it. You can run the notebook and review the outputs together in real time. And at any point, if you want to take a snapshot of your work, you can commit your code to the attached code repositories. Spaces support AWS CodeCommit, GitHub, and Bitbucket repositories.

And not just that: any work you do within the space, for example creating new datasets, models, endpoints, or experiments, is automatically saved and scoped within the perimeter of the space. So your data science work is automatically organized, and you can always revisit and reproduce it easily, in one place.

Now, moving on from building models with real-time collaboration: finally, you have a good notebook and you really want to put it in production. How do you get around the challenges of taking the notebook code and running a production-ready job from it?

We are pleased to introduce a new capability in Studio notebooks with which you can schedule notebook jobs in just a few clicks. Again, you just go into SageMaker Studio notebooks and select the notebook file, and a visual interface pops up in SageMaker Studio where you can create a schedule for running the job. Once the schedule is triggered, SageMaker takes the notebook file, determines its dependencies, creates a container, stands up the infrastructure, and runs the job. It pushes all of the job's output, including the output notebook, back into Studio and then tears down the infrastructure, so you only pay for what you use.

You can log in to Studio to review the output notebook. Let's say you have created a notebook that compares multiple model training runs to create parameter optimization dashboards; those dashboards will be there in the output notebook once the job is complete, and you can visualize them right inline in the Studio notebook.

The community was telling us that this is a big problem for many, many Jupyter notebook users out there, and very often notebook users will switch to other IDEs because of it. So we have decided to make this capability available to millions of Jupyter notebook users by offering it in SageMaker Studio Lab at no additional charge.

So today you can go into SageMaker Studio Lab, which is the free notebook service, and start scheduling your jobs on SageMaker. And now, to give you a demo of all these capabilities, I am pleased to invite Sean on stage for a live demo.

Thank you, Sumit. Testing my mic. Hello, everybody. Today I am pleased to present and demonstrate these new next-generation SageMaker Studio notebook capabilities. For today's demonstration, we're going to start with a hypothetical scenario in which Sumit and I work within a company's sustainability research division.

To start, we're going to open our SageMaker domain for our particular line of business, where you'll see that we each have an associated user profile attached to this domain. If you've used SageMaker Studio before, you may recognize the launch button here for launching a personal Studio IDE app. But as of yesterday, we've added the ability to run collaborative spaces. As Sumit mentioned, collaborative spaces are designated machine learning endeavors where you can scope your work, such as experiments and file directories, to the specific problem you're trying to solve.

In our sustainability research division, we have an electricity consumption project as well as an air quality study. For this use case, I'm going to launch the Sean user persona and select a collaborative space; I'm going to use the air quality study, which I can simply click right there.

Once you land in a collaborative space's Studio IDE, there are a few differences if you're a previous Studio user. The first thing you may notice is that the new SageMaker Studio UI, which just launched yesterday, is loaded here, with a much cleaner interface for all of the different steps you need to take through ML development. In the top right corner, you may notice that it says the Sean user is associated with the space air quality study.

This particular collaborative space has a designated file directory attached, so that we can share files with each other in real time. In this example, we have four notebooks that we can use for collaborative editing, and we'll do that throughout this demonstration. We also have a couple of directories for our own personal files so that we can keep some separation. To start this experiment, we're going to use the serverless data prep capabilities mentioned in Sumit's slides.

So expanding on the problem of how you can go about using scalable computation, we've added the ability to seamlessly connect to interactive computational clusters such as AWS Glue. As of yesterday, there is now support for three different data integration engines...

The first one being AWS Glue for Ray, the second one being AWS Glue for Python and the last one being AWS Glue for Apache Spark. Once you load up a Glue interactive session, you can seamlessly connect to various data sources like Amazon S3, Redshift or other databases.

In this example, we're going to determine whether we can build a machine learning model to help predict the amount of nitrogen dioxide in the area based on the weather conditions. Nitrogen dioxide and other compounds have an association with acid rain as well as other greenhouse gases. So, as part of our sustainability division, we'd like to be able to predict, based on weather conditions, what we can expect within our area.

To do this, we're going to use a rather large data set: 42 gigabytes of physical air quality data that has been aggregated from public data sources provided by government and research organizations. It's available in an open S3 bucket, which we'll use for the data. We're going to marry that data with NOAA weather patterns and summary data for various days throughout the past 10 years or so.

So let's get started.

How can we go about seamlessly connecting this SageMaker Studio notebook to scalable compute? The first thing we're going to want to do is view what configuration options we can set from this interface. There's a plethora of them, but I'll call out a few key ones.

The first one is an idle timeout, so you can set the number of minutes after which you want your AWS Glue session to end if you are no longer using it. This prevents leftover idle clusters from running, but we'll also talk about other ways to stop the session that feel very native to a Jupyter user.

You can seamlessly set the number of workers: just by saying that I need more workers, I can have more computational power within seconds to handle the data set I need to work with. You can specify the worker type, as well as additional Python modules, Python files, or extra JARs if you're working with Apache Spark.

For our example here, I'm just setting a Glue version of 3.0, a session prefix, and the number of workers; I'm going to use 25 workers with 180 minutes of idle timeout.
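As a rough sketch, the configuration cell might look like the following; the Glue version, worker count, and idle timeout are the values just mentioned, while the session prefix and worker type shown here are illustrative additions:

```python
# Configuration magics only -- running this cell records settings but does not
# start a session yet. The first non-magic cell (for example, printing the
# Spark version) is what actually provisions the serverless Glue session.
%session_id_prefix air-quality-demo
%glue_version 3.0
%idle_timeout 180
%number_of_workers 25
%worker_type G.1X
```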

Now, typically as an end user you wouldn't need to go to the AWS Glue console, but I'm going to jump back and forth here so you can see it launch in real time. If I go to the Glue console, you'll see that I currently have no sessions running. And after running the configuration cell, all that has done is set the settings; at this point, I still have yet to begin initiating an AWS Glue serverless compute session.

As soon as I run a cell that is no longer a setting, like printing the Spark version here, you'll notice that the session begins to spawn immediately. If I go back to the AWS console and refresh, you'll see that it is now provisioning an interactive session; you could have specified this to use Apache Spark or Ray, as recently announced this week. It can take anywhere from 15 to 30 seconds, but we should see it here shortly, and there we go.

We can see that we are running the Amazon build of Spark 3.1. And if we go back to the AWS Glue console and refresh, we can see that the session is now in the ready state. So within a matter of seconds, I now have 25 worker nodes of scalable compute attached to my notebook. I didn't need to spawn an EMR cluster or another long-running process. As a data scientist, I can simply come in here, grab the scalable compute I need, do my processing, and then kill it when I'm done.

For our data set, we're going to begin by loading the open air quality data, and we're going to take a quick peek at what it looks like. It's a rather large data set with global air quality measurements, so we're going to need to subset it down to our specific city and problem.

We can see, for example, that we have cities like Dubai and other places available in this data set. So let's go ahead and scope down the air quality measurements as well as the location. To do this, I'm writing standard open source Spark APIs here to filter this DataFrame for Seattle as well as the nitrogen dioxide parameter.

You can see that we have now filtered this for Seattle, and now I'm going to go ahead and perform an aggregate computation. I want to average the nitrogen dioxide concentration for each day of our data set. Typically, this would be quite difficult to do on a single computational node, especially for a data set of this size, but by using the scalable compute with AWS Glue, we can do this in a matter of seconds.

And now you can see the average nitrogen dioxide concentration for each day. I'm going to go ahead and write this aggregated data set down to S3, as we'll use it later in our model building process.
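A hedged sketch of the Spark steps just described; the S3 paths and column names (city, parameter, value, date) follow an OpenAQ-style layout and are assumptions rather than the exact code shown on stage, and the `spark` session object is provided by the Glue kernel:

```python
from pyspark.sql import functions as F

# Hypothetical location of the open air quality data.
openaq_df = spark.read.parquet("s3://example-bucket/openaq/")

# Scope down to Seattle and the nitrogen dioxide parameter.
seattle_no2 = (
    openaq_df
    .filter(F.col("city") == "Seattle")
    .filter(F.col("parameter") == "no2")
)

# Average the NO2 concentration for each day across all Seattle measurements.
daily_no2 = (
    seattle_no2
    .withColumn("day", F.to_date(F.col("date.utc")))
    .groupBy("day")
    .agg(F.avg("value").alias("avg_no2"))
)

# Persist the aggregated result for the model-building notebook.
daily_no2.write.mode("overwrite").parquet("s3://example-bucket/prepared/daily_no2/")
```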

Next, I'm going to pull the year boundaries, the minimum year and the maximum year for our data set so that we can go ahead and pull another data set from open source NOAA weather data. Again, we're not doing too much here. We're simply pulling the weather data and scoping it down to our particular longitude and latitude of Seattle.

And lastly, we will go ahead and write that data set to S3.

Okay, so as a data scientist, I've completed the data preparation I wanted to do. How can I go about killing that Glue interactive session? I could simply use the stop session magic command, which works very well in a programmatic way. But instead, I can keep a very native feel to my notebook and actually just go and restart the kernel.

So if I kill this notebook kernel, it automatically kills the AWS Glue session, as you can see here. By using a native development workflow in Jupyter, I am able to handle the scalable compute and save costs by using it only for the time that I need.
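For reference, the explicit, programmatic route mentioned above is the stop-session magic; restarting or shutting down the kernel has the same effect:

```python
# Ends the serverless Glue interactive session attached to this notebook.
%stop_session
```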

The next thing we're going to show in this process is how to build models collaboratively. We'll start by importing some common machine learning and data science libraries, pandas and numpy, and we're going to import the SageMaker Data Wrangler package. This enables the built-in data preparation capability powered by Data Wrangler.

I'm going to open my DataFrames of nitrogen dioxide and weather data, and now I really want to inspect that data set so I can understand what's going on. Simply by displaying the DataFrame, I'm able to see a plethora of graphical information presented to me. If I select view the pandas table, I can see what this would have looked like without the built-in data prep capability: it's a tabular data set, but it's quite hard for me to understand that data without writing a bunch of boilerplate code to make those plots.
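A minimal sketch of those cells, assuming local CSV copies of the prepared data (file names are placeholders); importing the sagemaker_datawrangler package is what turns on the interactive DataFrame display:

```python
import pandas as pd
import numpy as np
import sagemaker_datawrangler  # registers the built-in data prep display for DataFrames

# Hypothetical local copies of the data sets prepared earlier.
weather_df = pd.read_csv("weather.csv")
no2_df = pd.read_csv("daily_no2.csv")

weather_df  # displaying the DataFrame as the last expression renders the interactive view
```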

By using the built-in data preparation capability, I can see the various distributions, and I can see callouts for things like outliers directly in this interface. It also adds a number of insights and recommendations. If I simply click this warning sign on the visibility column, I can see that I have missing values in this column.

So, in order to analyze that and make a correction, I will go over to the data quality tab, where I can see a number of suggested transforms for this use case. As a subject matter expert in weather patterns, I know that visibility can be a good indicator here, so I don't want to drop this column; I simply want to replace the missing values with the median value. So I will click replace with median, and you can see it applies in real time.

One distinctive capability of these built-in data insights is that they can analyze the data for machine learning problems. If I select a column as the target column, I can set the problem type as classification or regression, and it can tell me things like the cardinality or the distribution of the values I'd be trying to predict.

Once I've made a transform, it is automatically populated in the cell below. You can see that the action I took in the graphical user interface has now been generated as code below. This is great for running this notebook programmatically.
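The generated cell looks roughly like the following pandas snippet; this is a sketch of the kind of code the built-in capability emits for a replace-with-median transform, not its exact output, and the DataFrame and column names follow the demo:

```python
# Approximate shape of the generated transform: fill missing values in the
# `visibility` column with that column's median instead of dropping the column.
output_df = weather_df.copy(deep=True)
output_df["visibility"] = output_df["visibility"].fillna(output_df["visibility"].median())
```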

After using the graphical user interface on the rest of the data set, I'm going to make some other quick modifications and then merge these DataFrames together, both the weather data and the average nitrogen dioxide data.

In our hypothetical situation, I am acting as a subject matter expert in environmental science, and Sumit will be our machine learning expert. We can see that there's clearly a correlation between wind speed and nitrogen dioxide concentration, so we have a pretty good feel that this data set will have some amount of predictive power.

I'm just writing some quick functions for scoring our models. Then, as a first pass, I'm going to try to create my own machine learning model for this problem. To do that, I'm going to import a linear regression model and train it on the data set I have prepared.
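A hedged sketch of that first pass, assuming a scikit-learn linear regression and hypothetical file and column names standing in for the merged weather and NO2 data set:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The file name, feature columns, and target name are assumptions for this sketch.
merged_df = pd.read_csv("merged_weather_no2.csv")
feature_cols = ["wind_speed", "visibility", "temperature"]  # hypothetical feature names
X = merged_df[feature_cols]
y = merged_df["avg_no2"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# First-pass baseline: a plain linear regression.
lin_reg = LinearRegression().fit(X_train, y_train)
print(lin_reg.score(X_test, y_test))  # R^2 on held-out data
```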

We can see that the score is 0.018, so it's not a great model right off the bat. At this point, I may want to call in a machine learning expert to help improve the model.

To do this, we can use our real-time collaboration capabilities. If I select this panel of online collaborators here on the left, I can see that the Sumit user is also in the shared space and is actively viewing the model building notebook. If I jump into a split screen here, I can see that, as the other user, I'm able to come in and place my cursor directly in the notebook, so I can see where that user is writing and what they're doing in real time.

For this example, Sumit is going to come in here and use a random forest regressor to add some nonlinearity to our machine learning model. We can see that our score has drastically increased, and so I can thank Sumit for his help in improving our model.
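And a sketch of that follow-up, swapping in a random forest regressor on the same assumed split as the linear baseline above:

```python
from sklearn.ensemble import RandomForestRegressor

# Non-linear model on the same assumed X_train/X_test split as before.
rf = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
print(rf.score(X_test, y_test))  # noticeably higher than the linear baseline in the demo
```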

We run some more metrics now that we have completed our model building, and these metrics look good to us for deployment. So we're going to quickly plot how this would have done over the past year's worth of data, and we can see that our predicted versus actual forecast is pretty good; good enough for this demonstration.

So now that we've done some model building and experimenting, what happens if we want to record our results over time, capturing those metrics and associating them with our air quality study space?

Well, if I go to the SageMaker Experiments tab, I'll notice that the only experiments visible to me are the ones created from this air quality study. I don't see other users' experiments or other shared spaces; it's scoped down to the experiments I need to use. The same applies to the model registry and other resources that can be created from here, so you'll only see what's pertinent to your machine learning endeavor.

Now that we've built the model, I want to use a notebook interface to schedule some inference. One of the common use cases we're seeing for these production notebooks is to run a model on some cadence or analyze some data, like your SageMaker usage for the previous day.

To do this, I'm going to start with a parameters cell. This contains a parameterized variable that we can change when we set up our production job. We'll start by looking two days back, and we're going to try to predict the nitrogen dioxide concentration from two days ago.

To make this a parameters cell, I simply add a cell tag named parameters, which instructs the notebook job that the variables defined in this cell can be overridden by the actual job that runs it.
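As a small sketch, the tagged cell might look like this; the variable name days_back is an assumption based on how the parameter is referred to in the demo:

```python
# This cell carries the Jupyter cell tag "parameters", so the notebook job can
# inject a different value at run time (the variable name is illustrative).
days_back = 2  # default for interactive runs; the scheduled job overrides it to 1
```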

I'm going to quickly run through the notebook as I would run it interactively. You can see that it renders Markdown saying we're pulling weather from two days ago, because that's the value I set in my interactive notebook.

I'm then going to load the previously trained model as well as the data from two days ago. So 11/29 is the date I pulled the weather data for, and that is the date we will try to predict.

I load the model into memory here, and then I predict the nitrogen dioxide concentration from two days ago.
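A minimal sketch of those inference cells, assuming the model was persisted with joblib and the pulled weather features were saved to a CSV; both the persistence format and the file names are assumptions, since the demo doesn't show them on stage:

```python
import joblib
import pandas as pd

# Hypothetical artifact and file names.
model = joblib.load("no2_model.joblib")
features = pd.read_csv("weather_2022-11-29.csv")  # weather features for the target day

predicted_no2 = model.predict(features)
```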

To demonstrate how these scheduled notebooks can render graphics or other reporting metrics, I've recreated that plot here, so that in the scheduled job we can see how you can build reports that access a variety of data sources and a variety of visualization options.

Once I've determined that this is the notebook I want to use, I don't need to package it up as some other Python module or figure out its dependencies, since I've already found a working environment to run it in.

I can click this button up here that says create a notebook job to run this either as a scheduled job or immediately. I can name the job anything I'd like, so I'll say my new scheduled job, and the input file it points to will be prepopulated.

You can select the compute type you want to run it on; you can use GPUs or other instances provided in the SageMaker environment. I'm going to add that parameter here, because for my scheduled job I don't want to look two days back, I want to look one day back. So I'm going to enter the days back parameter and put in the value of one.

There are a lot of other options you can set in your configuration. You can see that the job is automatically set to use the kernel I was using in the notebook; it prepopulates the kernel I've selected, though you can change that if you'd like.

I can also specify environment variables or startup scripts to adjust the environment that will run this notebook, so it's very flexible in terms of what you need to provision. And then, lastly, I can click create now, or I can run this on a schedule.

We have common schedule types, such as minutes, hours, days, and weekdays, as well as the ability to set your own custom schedule using cron expressions.
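For reference, a custom schedule matching the "every four hours" cadence mentioned below could be expressed with an EventBridge-style cron expression; this is a comment-only sketch, not the exact string used in the demo:

```python
# EventBridge-style cron expression for "every four hours"
# (fields: minutes, hours, day-of-month, month, day-of-week, year):
#
#   cron(0 0/4 * * ? *)
```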

Once you've created a notebook job or schedule, you can monitor it through the SageMaker Studio interface. If I go to the home button up here and select Notebook Jobs, I'll be able to see the notebook job definitions I have. If I refresh this, I can see that I have an active schedule already running for the scheduled nitrogen dioxide prediction. It runs every four hours and is currently active. As an end user, I could pause that schedule or start this other schedule, all through this convenient interface.

Once a notebook job has run and completed, you can see the output file available. If I click this download button, I'm able to look at both the logs of the container that ran the notebook and the rendered final notebook. For example, if I open a previously run notebook, you don't need to select a kernel because we don't have to run it; it's a read-only exercise here. We can see that the cell with the days back parameter is followed by the injected days back parameter that we specified as part of our job. So you can modularize your notebooks and modify them with the parameters you need. You can see that it now renders that we're pulling the weather data from one day ago, and it also stores your plotted analysis.

Lastly, if you have a failed job, you can view the notebook directly from here too. It will show you where it failed, just as you would expect, for an easy debugging experience. I want to thank you all for watching this demonstration. I'm very happy to share these new next-generation SageMaker Studio notebook features, and I hope you enjoyed the demonstration.

Next, I'm going to hand this off to Ritesh Shah to talk about how these features play into Vanguard's vision for the future of machine learning development. Thank you all, and thank you, Sean, for showcasing all those capabilities.

Okay, am I on the right slide? Before we go into Vanguard's journey toward decreasing time to insight, I have a few questions. How many of you know Vanguard or have interacted with Vanguard? Okay, I see maybe 20% of the room. How about this: how many of you have used SageMaker not for building models but for processing data or other activities? Okay, looks like a similar number. Cool.

Let's start with Vanguard. The core purpose of Vanguard is to take a stand for all investors, treat them fairly, and give them the best chance for investment success. That's our core purpose, our core mission. I, Ritesh Shah, work in the Chief Data and Analytics Office as a Chief Architect for AI/ML platforms.

We were founded in 1975 and have $7.1 trillion in assets under management. Let's dive a little deeper into the CDO and what our mission is within Vanguard. We work in a highly regulated industry, so we always have to care about having a strong data defense strategy, but we also need a strong data offense strategy to make sure we can be competitive in this very dynamic environment. This two-sided strategy helps us build, transform, and use data to drive what we call Vanguard's flywheel: helping you all by providing the best-performing mutual funds, the best experience on our websites and mobile applications, and so on. All of this is driven by crew members who are highly engaged and working with data all the time. That's our strategic vision: to achieve this and give you the benefits from all the investments you make.

Taking one step deeper: what are our strategic imperatives? How are we going to get there as Vanguard? Our strategic imperatives are around data culture, where we want to build a strong data culture so that every employee at Vanguard (we call them crew members) uses data for making all decisions, all the time. We want to provide these crew members a mature data and analytics ecosystem that is simple to use and lets them derive insights fast. We want to move to a world of data as a product, building data products that are consumed internally and externally by many of our clients. And finally, insights drive outcomes: the focus is not just on building insights but on building insights that help the flywheel and drive outcomes forward.

And since I work in the platforms area, the most important key result for us, what we call a wildly important goal, is: how do we decrease time to insight from weeks to days? Many of you may have experienced that it takes months, and you may be trying to get to weeks, but our vision is to get to days instead of weeks or months.

Now that you have seen our vision, the CDO's strategic imperatives and key results, let's look at the current state. What does the current state look like at Vanguard? It's not a blank slate; we have been doing this in the cloud since 2016. We have different types of data workers: analysts, engineers, machine learning scientists, and so on, and I'm going to walk you through different scenarios.

So let's assume a data worker wants to build some analytics or do some ad hoc querying. How do they do that today? They go to a browser, launch Hue (Hadoop User Experience), which comes with EMR, and start running ad hoc queries against Hive and Presto, with all of this authorized using IAM roles giving them access to S3. This is a complicated setup that has to be performed by IT technicians so that these data workers can go and run ad hoc queries.

Now, if you are a data engineer and you want to write some code, we provision a similar environment using JupyterHub with EMR, and you can then start writing your Spark code, PySpark code, or plain old Python code against data in S3. Again, IAM roles and so on have to be created, so you need somebody in IT to help with all this infrastructure setup.

Similarly, if you want to do ad hoc query analysis using Athena, you have to set up the whole environment and access controls, and then run Athena queries through the console, which is the best experience we can give them today.

Now, if you want to write Glue jobs, you run VS Code on your desktop, connect to Glue interactive sessions, and run your Glue queries, which is also a complicated setup for a data worker to do by themselves.

Similar complications exist with DataBrew, where they have to go through the whole setup to create a DataBrew workflow.

Finally, the most advanced people, the machine learning scientists, have a similar problem. They want to focus on writing machine learning code and building insights, but most of their time is spent with engineers trying to set up their SageMaker Studio environment and connect it to the right data, and only then can they start doing the work.

So this is life at Vanguard for a data worker. Not bad, but can we improve it? Before we improve, let's talk about some metrics. What does the Vanguard environment look like today? We have EMR clusters running today; on a daily basis we run around 27,000 nodes across all the EMR clusters. We have Redshift infrastructure with 90-some nodes spread across many Redshift clusters. We have what I would call light use of Athena, because predominantly we use EMR for everything, but we still run around 5,000 Athena queries a day. Finally, we have around 1,100 Glue ETL workloads running on a daily basis.

Switching to the size of our data lake, we have around 20 petabytes of data ingested into S3, and while I haven't published the numbers for Redshift, the Redshift clusters we have are of a pretty significant size.

Finally, coming to SageMaker: we have around 350 unique SageMaker users, and on a daily basis we have an average load of 150 users. These are mostly the expert data scientists who typically use SageMaker today, given the current state I presented.

Let's talk about opportunities. You all saw the different setups that had to be done and the different interfaces. All of this adds a lot of cognitive load on the end users, the data workers, who have to learn how each of these user interfaces works. This increases time to insight, and it also increases the start-up time: instead of being able to write code in seconds, it takes weeks for them to write the first line of code.

The second problem we have observed is context switching. If I have to write a query, I do things one way; if I have to run a Presto query, I do things differently; and if I want to write a Glue job, the whole experience changes. This context switching really hurts these data workers on a daily basis, increasing time to insight.

Now, let's talk about all the setup I showcased. That's operational overhead for many data engineers and IT engineers, who have to go in on a daily basis to set up these environments or modify them periodically based on what insight needs to be derived. And lastly, as Sumit pointed out when explaining why the new capabilities were built, we have an inefficient collaboration environment. You cannot easily collaborate between a Jupyter notebook running on EMR and SageMaker; plus, if you write code in one environment and take it to the other, it doesn't work. So all of these are opportunities in the current state that we should try to mitigate.

But before we went down the path of mitigating these issues, we set ourselves some guiding tenets. The first one was: can we get to a world where we have a unified user interface, a unified user experience, so that all data workers can work in a single interface? Whether they run an ad hoc query, run a Glue job, or write machine learning code, can they work in one environment?

The second was: how do we minimize handoffs between the individual data workers working on a project to solve a business problem? And finally, how do we reduce operational overhead by automating the setup of their environments? And not just that: how do we automate setting up access to data? How do we make it secure? How do we set up their compute? How do we give them an environment where they can run things on their own without having to ask or hire an IT engineer to help them every day? How do we minimize the overhead these data workers carry?

So, with those tenets in place, we started working on a unified user experience with AWS, and I'm going to show you an upcoming architecture that we are planning to work towards in 2023.

In this architecture, data workers still come to a browser-based interface and log on to SageMaker Studio as their unified user experience. This experience provides not just the notebook environment but also the shared spaces environment that Sean was showcasing a few minutes ago.

Now, once you are in SageMaker Studio, you may want to perform different kinds of actions. You may want to run a data engineering workload using a notebook interface, or you may want to write a machine learning algorithm, and in this upcoming state you can run those data engineering or machine learning workloads through different compute options.

We are going to automate the setup of all these components for the data worker so that they don't have to set them up themselves. And we are going to work towards a unified access model using Lake Formation as the standard mechanism to authorize data access, irrespective of what compute the user is using at that point.

Now, maybe you don't want to write a data engineering or machine learning workload; you just want to run an ad hoc query, or you want to take what you have written and schedule it to run every day. We plan to roll out the jobs capability that Sean and I talked about, which will allow these users to take their notebooks, schedule them as jobs across all these compute options, and do their work more efficiently without having to call an IT person to say "set me up an EventBridge rule," et cetera. They don't have to worry about all those complications; they can just schedule their notebook and run it.

Now, going back to the ad hoc workloads I mentioned: we are going to launch Data Wrangler for these end users to work with all these compute options, where they can go into Data Wrangler, write a simple SQL query, and run it against all these different types of compute, which are shown with the dotted lines.

Once you have written your Data Wrangler workload for ad hoc queries, you may want to build a flow using Data Wrangler as a low-code/no-code data engineering option using the same SageMaker Studio user interface. You would now be able to run Data Wrangler workloads against all the compute options shown here on the screen.

All of this is pre-set up for the end user: access is set up so that they can use either a notebook or Data Wrangler, with any of these compute options, et cetera.

Once we roll this out, what would be your key takeaways? I would say: use SageMaker Studio if you can, if it's allowed in your company; it provides this universal interface where you can do all these workloads and improve the experience of all of your clients. I would also say you should use SageMaker notebooks and their automation to deploy pipelines, the way Sean was articulating, and use shared workspaces. Collaboration is very important in this space; you have heard in the keynotes multiple times that building insights is not a single-person job, it's a team-level activity. Use these capabilities; they will help you drive your insights faster.

Last but not least, I would say keep your eye on cost. Implement FinOps capabilities where you continuously monitor cost, and use built-in capabilities, such as terminating your notebook session to terminate your Glue session, or build your own termination capabilities on top of what SageMaker Studio already provides. Managing cost is also important while you give all of these data workers all these options.

Now, a peek into the future: what are we planning to work on in partnership with AWS? First, how do we take the current SageMaker Studio experience and make it a best-in-class experience for data workers? Today it provides all these capabilities, but from a user experience perspective it's built more towards building a model rather than building a data pipeline or doing data engineering. So how do we give them one unified experience that is conducive to all different kinds of workloads?

Second, enhance the collaboration capabilities provided by SageMaker Studio by bringing in integrations with Amazon DataZone and other capabilities launched yesterday, so that the end user does not have to worry about how to set all this up, and we as AWS clients don't have to build custom automation to set up all the compute options our data workers need to use.

Third, introduce DataOps into the platform. Right now, you can run a job but you cannot look at the logs; you have to go to the console. There are many DataOps scenarios that can be brought into a universal interface, mitigating the context-switching problems users would still face with the current implementation.

And the last thing is: how do we make this whole platform highly resilient and multi-region enabled, so that insights from data can be derived at any time with minimal or almost zero downtime? That's where we are going to focus our future state.

Lastly, thank you for coming, and I think, Sumit, we are going to hand it over to you for questions.
