Introduction to MLOps engineering on AWS

Ok, I don't see any more folks coming in, so I suppose we can go ahead and get started here, and I want to take this opportunity to welcome you all to MLOps Engineering on AWS.

Thanks for coming out for this presentation today. My name is John. I'm going to be the presenter for this particular session. I live in the Houston, Texas area. I've been there for many, many years with my wife and my son and my two dogs.

We've lately been up on the north side of the city. Are you familiar with The Woodlands area? That's where we're at. And I want to welcome you here to this talk. This is one that's near and dear to my heart. I get the chance to teach this as a three-day class on a pretty regular basis, and to talk with folks that are in this ML operations space and looking at ways to operationalize what we do in machine learning.

So our goals here in this particular session are to spend some time talking about machine learning operations: give it a bit of a definition, and compare and contrast, for those of you that might come from a DevOps world, how things might look different in an MLOps world, and some of the things we have to do in addition to the tasks you perform in DevOps today. We also want to talk about the people that are involved here.

So we want to look at some of the different personas, because people are one of the most important parts of any kind of transition as we go into ML operations. We also want to describe the workflow of machine learning and make sure we have a good understanding of what that workflow can look like. Then we want to get into the phases of an ML maturity model, and also talk a little bit about governance and security in this ML space, because it becomes very important these days that we have those conversations as we talk about automating more and more things.

So this is a great session; I hope you get a lot out of it. Please ask questions. There should be microphones toward the middle of each of the aisles, so please feel free to ask questions as we go. I can hang out after the session is over and take questions then as well.

So, as we look at machine learning, you've probably heard a lot about it this week. You've probably heard a lot about generative AI, about all of these different models out there, and about all the innovation they're going to bring into organizations.

The interesting thing about machine learning today is that there's a lot of experimentation that goes on. There's plenty of experimentation, plenty of model building happening out there, but not a lot of operationalization of those models to get them into use in actual production. And in order to become a really mature organization, we have to start to adopt processes and really think about how we're going to operationalize models, how we're going to get the code out of where we're building it today, get that code into production, and get automated pipelines flowing. And this establishes the whole need for ML operations.

Some of the tasks that we have to perform, as you can see on the slide, the tasks we have to do on a regular basis, are things like building, training, and deploying models. But we also need to be able to monitor those models when they're in production, to capture how each model is performing, to understand if we have drift in things like data, and to understand if maybe reality has started to change on us a little bit, so that the model predictions that were good six months ago aren't so great today. We need to be able to understand those kinds of things.

We also need to be able to manage this as we scale, because as we become a more mature operation and start to think about ML for other types of problems, scalability becomes a very big factor in how much of what we're imagining we're actually able to operationalize in our organizations.

So I love this slide. It's one of the few animated ones that I have, and if you don't know anything about me, I love animated slides. Ok? Some of us maybe have come from organizations where we've done DevOps in the past, right? And we're trying to understand now, from a DevOps perspective, how we can apply some of those principles to machine learning, and we're trying to view this through our DevOps lenses. But here it comes: we need to put on our MLOps glasses now and start to view these practices and understand what's involved.

Wasn't that great? Ok. So we need to understand all of the things that are involved, because MLOps is going to encompass more than traditional DevOps has in the past. We're going to introduce some areas that DevOps has not had to really focus on, but that become critically important in an MLOps-based operation.

So we need to put on our operations-perspective glasses here and understand, from a culture standpoint, what changes might be needed in our organizations today to really move into a mature ML organization: one that has truly cross-functional teams that are communicating with each other, and maybe not only communicating with each other, but speaking languages that each other can understand. Ok? So that they're able to understand the problems each group faces, and take that and come up with the right set of processes and the right set of tools to solve those problems.

We need to understand where the priorities are. We need to understand the personas that are involved. It's not just dev, security, and operations that we need to focus on now; we have some new people that we're going to add to this team. We're going to be talking to data scientists all of a sudden. Ok. We're going to be talking to data engineers as well, and maybe even, in many cases, to machine learning governance people.

So it's understanding who the different personas are that we're going to need to encompass when we're figuring out who the people involved are. What does our organizational structure look like? Is it going to allow us to create these kinds of teams, similar to what we saw in the transition in the DevOps world?

You know, if you've gone through a DevOps transition, what did you have to do? Often it involved realigning teams, not around things like technology stacks, right? That was how we did it back in the olden days when I started. All the database people were in one group, all the middleware people were in another (they all did Java too, by the way), and then you had the front-end people, and I don't know what they did. So anyway, you had all of these different organizational pieces, but they were aligned by technology, not necessarily aligned by what the business needed and what the business priority was.

Ok. And we have to be careful that we don't fall into that same trap here when we're looking at how our ML operations are going to be structured. We need to look at skill sets. Skill sets are going to be really important, especially when we get into choosing tools. What skills can I leverage that are already in place in my organization today? What can we use that is already well understood? It may not be the latest, coolest, greatest thing, but it might mean leveraging some tools that, hey, we already have a high degree of confidence in, and a lot of people really understand how those tool sets work.

And then, of course, and most importantly I think, is understanding the unique life cycle of ML and what's involved there. So if you've ever done any DevOps, this is nothing new: people, processes, tools. People over processes over tools, in terms of what our priorities are in our organizations today. If we're going to truly create an organization that can scale ML and really operationalize some of the things we're imagining today, it's going to start there.

Everybody wants to talk about tools. We are no different. You've been bombarded with tools this week and we've probably shown you some cool things. I'm gonna talk about some too. Ok? But if we don't get the first two right, the other one doesn't really matter all that much. Ok? Because we end up kind of mired in the same place that we are today.

So it's really important to understand who those personas are, who those people are, and what those processes are: how they break down and how they're different from a DevOps world. How you approach machine learning is really where MLOps is going to come into play.

This is something that's interesting when you start to break down what's involved in the ML process. We tend to focus a lot on models. How many of you have heard something about a model this week? We were talking about gen AI and all the different model types that are out there and all the new models coming out, and we tend to put a lot of focus on that. And rightly so, because that's where a lot of the exciting things happen. But overall, it's a very small piece of ML operations as a whole. It is actually an extremely small piece.

There are so many other pieces surrounding it that are hugely and critically important, and that are places where we can stumble significantly. So while all the talk and really all of the interest seems to revolve around the model, there's a lot of data operations that have to happen ahead of time before those models can be of any use to us, and there are also operational challenges that happen once the model is deployed into production that can be stumbling blocks for us as well. Again, it's not to say the model isn't important, but it becomes a really small piece when we look at the whole.

So here's our agenda. Ok? I said people over process over tools, but I'm actually going to talk about processes a little bit first, so we can understand some of the differences: if you come from a DevOps world, what some of the new things might be in the MLOps space. Then we'll talk about people, about those personas who might be involved, and give you a general idea of that. Then we'll get into technology.

I love talking about technology; it's what I do for a living, every single day of my life. I love talking about tools. I love talking about SageMaker. I spend a lot of time talking about it, and I'm going to talk to you about it a little bit too, but again, it's so important to get the first two things right. And then we'll look at security and governance, and finally at ML maturity models as well. Ok.

Let's kick it off here with the first topic, talking a little bit about processes.

Now, in most ML operations, there are two main types of processes that typically have to occur:

  • You have a batch type of set of processes.
  • And then, along with that, you may have some kind of real-time or inference type of use case where you're actually using the model.

The batch processes typically happen in model training, where you're in the building process, maybe in the fine-tuning stage. Those are typically batch operations.

They involve:

  • A significant amount of data to be able to train these models in a lot of cases.
  • And they involve a significant amount of work to get that data ready to be trained.

And that's an important part of formulating our initial business problem: understanding what kinds of things we want to solve, then starting some experimentation and really understanding how I can solve that problem. Here's the model I need to build in order to solve this particular business problem, and here's the data that I have, and things like that.

Once we get through that phase, there's often a real-time component of actually using the model, where we're going to serve it up through some sort of inference capability, either a real-time endpoint or, in some cases, a batch endpoint. And we're going to feed that model new data and get some predictions or some output out of it at that point.

So there's two very distinct sets of tasks that are occurring that are very, very dependent on each other in order for us to really get good quality predictions and good quality models.

Because over time, what's going to happen is that as we're getting those predictions from the model, as we're getting that new data coming in, we're also saving a lot of it. We're saving it somewhere, because that's going to end up becoming the data that might be used to retrain the model at some point in the future.

So there are two processes. They are different, but there are some interdependencies between them.

So you can see here it starts with the business problem. We take the business problem and frame it into an ML problem. Once we've framed it into an ML problem, we go and look and see what kind of data we have, and we go and collect the data. Where is the data? What type of data is it? For example, does this problem require labeled data? Do we have labeled data? We need to go and see what's out there and what's available to us.

And then from that data would typically come a phase where we would look to see what features we're going to need to actually train the model itself. And maybe we need to engineer features out of that existing data in order to take that data and make it ready to get the kind of predictions or get the kind of output from the model that we're looking for.

Then there's the whole training process: the training, the tuning. We're doing some experiments, maybe experimenting with different types of algorithms, or maybe experimenting with different types of models that we're using for our AI solutions.

And finally, there's an evaluation phase, to see: ok, we know how the model performed when it was training, but how is it going to perform on data that it hasn't seen yet? That evaluation is going to give us some idea of whether or not we're meeting our business goals, and whether we take the next step of deploying the model and starting to roll it out into production.

But if you'll notice, the very far left-hand side of this particular slide is all dealing with data, and that might involve a totally different set of personas than we've dealt with in the past, data engineers for example. How are all of those processes happening? How are we getting data in and getting it prepared so that we can begin the whole training process?

Now we're bringing in people that maybe we haven't worked with in the past, for example, in our DevOps cultures.

So you can see here the workflow laid out in a little bit more linear fashion on the slide. Again, we're thinking about the different tasks that we're going to perform here and where we're probably going to see, for example, data flow through this workflow. Where are we going to see a lot of data activities? Well, obviously in the data preparation phase, right? We would definitely say we'd see data tasks there.

But interestingly enough, in the model build and model evaluation phases, we're also going to see a number of data tasks involved.

There's going to be, for example, in some cases, splitting of the data: taking the data set that we have and splitting it up into training data, validation data, perhaps test data, and in some cases a complete holdout set, so we can use that for evaluation later on.
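
To make that concrete, here's a minimal sketch of that kind of split using scikit-learn. The dataset, column names, and split ratios are illustrative assumptions, not something from this session.

```python
# Hypothetical example: split a labeled dataset into train / validation /
# holdout portions. File and column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")                 # hypothetical dataset
X, y = df.drop(columns=["is_fraud"]), df["is_fraud"]

# First carve out a 20% holdout set kept aside for final evaluation.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# Then split the remainder into training and validation data (60% / 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)
```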

There are also going to be data tasks in the model selection phase, where we need to be able to determine, when I ran this experiment, what the result of it was. Right? And that's often captured in metadata about that particular run, and that metadata needs to be stored somewhere; it needs to be stored with the model.

That way we're able to go back and look and see what hyperparameters were used here and what the outcome of this particular experimental run was. That's also a data task that we need to think about in that phase.
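
As a rough illustration of what "storing the metadata with the model" could look like, here's a small sketch that writes a run's hyperparameters and metrics out as a JSON record next to the model artifact. The fields, paths, and values are all hypothetical; in practice an experiment-tracking tool or the SageMaker model registry would record this kind of thing for you.

```python
# Hypothetical sketch: persist run metadata alongside a model artifact so
# hyperparameters and scores can be looked up later. Not a specific AWS API.
import json
import pathlib
import time

run_metadata = {
    "run_id": f"experiment-{int(time.time())}",
    "algorithm": "xgboost",                                     # assumed
    "hyperparameters": {"max_depth": 6, "eta": 0.2, "num_round": 200},
    "metrics": {"validation_auc": 0.91},                        # assumed score
    "training_data": "s3://example-bucket/train/2024-05-01/",   # placeholder URI
}

out_dir = pathlib.Path("model_artifacts")
out_dir.mkdir(exist_ok=True)
(out_dir / "metadata.json").write_text(json.dumps(run_metadata, indent=2))
```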

And then on the monitoring side, there are often going to be tasks of capturing production data, storing that production data, and perhaps using it to evaluate how the model is doing. So we're taking that production data and maybe storing it for evaluation, but also perhaps storing it so that we can use it at some point if we need to retrain the model.

So there are also going to be data tasks occurring in the actual day-to-day running of the model itself that we're going to have to think about when we're thinking about our processes: what we're going to use and how we're going to perform these different tasks.

What about managing code? Where are we likely to interact with a lot of code in this process?

Because interestingly, we could see code in several different locations. We could see code at the data preparation stage. Maybe a lot of those data preparation tasks we're looking at are things that could be automated: we could write, say, Python scripts that use the scikit-learn library to go through and perform a lot of the data preprocessing tasks that have to be done.

And that's going to involve us managing code in some form or fashion.
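
For example, a data preparation step pulled out of a notebook into a standalone script might look something like this sketch. It uses scikit-learn, as mentioned above, but the column names and file paths are purely illustrative.

```python
# Hedged sketch of a standalone preprocessing script, the kind of code that
# might move out of a notebook into an automated pipeline step.
# All column and file names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["amount", "account_age_days"]
categorical_cols = ["merchant_category", "country"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

if __name__ == "__main__":
    raw = pd.read_csv("raw_transactions.csv")        # placeholder input
    features = preprocess.fit_transform(raw)
    if hasattr(features, "toarray"):                 # handle a sparse result
        features = features.toarray()
    pd.DataFrame(features).to_csv("prepared_features.csv", index=False)
```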

There's also going to be a lot of code in the model build phase. But where is that code typically going to be as we're running through these experiments, as we're running through the builds, as we're trying out this algorithm or running a tuning job? It's in Jupyter notebooks. In a lot of cases, the data scientists and the machine learning engineers are already running a lot of that inside of Jupyter notebooks, and that's where the actual code base is being stored.

And that might present a problem for us as ML operations engineers when we come along later and say, well, you know, it'd be nice if we were able to automate some of these tasks to some degree, if we were able to take these tasks and put them into some kind of automated pipeline.

We'll also see code in deployment, because our software developers who are working with the applications are going to need to be able to call the model and get predictions from it. So there has to be a serving component that allows the software developer to interact with that particular model, and there's going to be code that we encounter there as well.

And interestingly enough, when it comes to the ML workflow, changing one thing, like maybe dropping a particular feature in the data preparation stage, will tend to cause changes in everything. So "change one thing, change everything" can often be the case when we're working with code in this environment.

So it's important to understand how those processes are going to work, and then manage the model through the workflow: understanding what candidate models we have out there, what the model metrics are, and how we're going to measure them. How are we going to measure how the model is performing? How are we going to measure if we start to get data drift, data that looks different from the data we trained it on? Ok.

So how do we detect that? How do we detect cases where maybe this doesn't look exactly the way it did when we trained it, and we're starting to see new features show up, for example, or we're starting to see a feature skew in a certain direction that wasn't the way the data looked when we initially trained the model? Ok.
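
As a purely illustrative example of the idea (not SageMaker Model Monitor itself), a simple drift check might compare a feature's recent distribution in production against the training baseline with a statistical test. The file names, columns, and threshold below are assumptions.

```python
# Illustrative drift check: compare production feature distributions against
# the training baseline with a two-sample Kolmogorov-Smirnov test.
# Data files, columns, and the significance threshold are hypothetical.
import pandas as pd
from scipy.stats import ks_2samp

baseline = pd.read_csv("training_features.csv")     # features used at training time
recent = pd.read_csv("last_24h_requests.csv")       # captured production inputs

for col in ["amount", "account_age_days"]:
    stat, p_value = ks_2samp(baseline[col], recent[col])
    if p_value < 0.01:
        print(f"Possible drift in '{col}': KS statistic={stat:.3f}, p={p_value:.4f}")
```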

So we need some kind of good-quality monitoring: monitoring the predictions, and whether the predictions themselves align to some sort of ground truth. Let's take a fraud use case, for example.

I get an alert on my phone that says, hey, we suspect this is fraud, your credit card just got used in Las Vegas, John, you don't live there. And I say, no, that's not fraud, I'm traveling, right?

Ok. So now there's some labeled data we can store and potentially use to retrain that model in the future, to maybe update it and update the way it's actually scoring things.

Ok. So we're dealing with that, understanding the type of predictions it's making and comparing those predictions to ground truth. It gets a little complex fast, because now we're dealing with teams on the data preparation and data management side, we're dealing with machine learning engineers in the actual development of the code, then trying to operationalize that code with the ML engineers, and then trying to monitor that model in real time as it's actually deployed for inference: understanding what it's doing and getting a good idea of whether or not we're seeing any drift, whether or not reality is changing on us, or anything like that.

That starts to become the picture that we're looking at here in ML operations today.

So look at the comparison here between DevOps and MLOps. We both need to focus on code versioning, absolutely. We need to focus on the compute environment; that's really important, right? Models train on certain types of hardware, for example. We need to make sure we're focusing on continuous integration and continuous delivery pipelines. Ok?

But in some cases, those are more complex or different pipelines than maybe we've dealt with in the past. And then monitoring: we definitely need to focus on monitoring, but typically in DevOps we were focused on monitoring the application itself.

Here in MLOps, we may need to focus not only on what the model itself is doing, but also, in production, on the data we're getting and whether or not that data is changing on us. So that can be an important part as well.

Then there's data provenance, which is important because it answers: ok, where did this data come from? What changes have been made over time? Being able to track the lineage of all of this different data that we're working with, having some way to keep track of that, and keeping track of the models themselves.

What is this model? When was it trained? What hyperparameters was it trained with? What was the result? Where is it stored? Where can I find it? It's on S3 somewhere, I don't know, it's in an output folder. No. We need to be able to store those centrally so we can find them again when we start talking about scale here.
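
One way to do that on AWS is a model package group in the SageMaker Model Registry. The sketch below, with a placeholder group name, shows the general shape of creating a group once and then listing the versions registered into it; exact response fields may vary.

```python
# Hedged sketch: keep model versions in a central SageMaker Model Registry
# group instead of loose S3 output folders. The group name is a placeholder.
import boto3

sm = boto3.client("sagemaker")

# Create a package group once per model family. (This call fails if the group
# already exists, so in practice you'd catch or check for that.)
sm.create_model_package_group(
    ModelPackageGroupName="fraud-detector",
    ModelPackageGroupDescription="Candidate fraud models and their metadata",
)

# Later: list registered versions to see what exists and its approval status.
response = sm.list_model_packages(ModelPackageGroupName="fraud-detector")
for pkg in response["ModelPackageSummaryList"]:
    print(pkg["ModelPackageArn"], pkg.get("ModelApprovalStatus"))
```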

Model-building workflows are something else we have to focus on in MLOps, as well as model deployment workflows. So there's a lot more going on here: a lot more tasks, a lot more considerations that we're going to have to think about when it comes to defining our processes.

Ok. So how are we doing so far?

The people. It's one of the most important parts. The DevOps engineers, hey, we're comfortable with them, we know them, we've worked with them for years, right? But now we have these data engineers, we have these data scientists, we have these ML people, and we don't even speak the same language. We're asking, where's your code? Show me where your code is so I can put it into a CI/CD pipeline. And what do they give me? They give me a Jupyter notebook. What is this? I can't run this. What are you talking about?

And so you end up with two groups that are unable to communicate with one another.

One, you know, is looking for the ability to operationalize that code, and one doesn't even understand what you're saying. Of course we build it in Jupyter, what else would we build it in? I mean, that's ridiculous. Where's the documentation? It's in the notebook, can't you read? It's right there. Ok.

So we're trying to get a lot of that code that's sitting in these various notebooks out there into some sort of operationalized system. And then we bring on the governance officers, and we don't understand anything they're saying. They're talking about things like bias detection and compliance and regulatory requirements and explainability, and where is that coming from? I don't know. Both teams are looking at them, scratching their heads, trying to figure it out. Well, according to the data people, it's a black box, so I don't know.

So where is that information going to come from, so that we can bring along the governance people as well? And then we have somebody who's going to determine whether or not this model goes, whether or not we have an approval process in place. And you can see where we might find these different personas in play: we might have the data scientist and data engineer involved in the data prep phase. The data scientist might be more involved in the model build and model selection phases as we go forward. And then once the model selection is approved, that maybe kicks off a DevOps pipeline that would take that model and start to deploy it into production and get the actual inference component in play.

But then there's monitoring involved here too. And monitoring is interesting, because yes, the DevOps engineer is going to be involved in monitoring, but very likely so are the governance people, right? They're going to want to know things like how that model is making predictions, whether those predictions are changing over time, and which features are weighing on the predictions being made, because they may be really focused on things like explainability of that model, should it ever come into question. And so that becomes a joint responsibility, because we could implement monitoring, then not understand the results, and then have to go back to the ML engineers and get them to clarify certain things for us.

So this is my second animated slide. Ok. You can see here that without everybody involved, the ball doesn't get over the net. They're playing volleyball here, for those of you that don't know. And, right, the person in the back, what do they do? That's the set, is that right? Or is that the dig? Ok, then this is the set, and then you have the person that spikes it over. Right. But if they're not all playing together, the ball just kind of dribbles off to nowhere, right? And nobody's listening to the coach here on the sidelines, nobody's paying any attention to the coach. Communication is really key here: establishing communication across these teams, establishing common languages that everybody understands so they can all speak the same thing, becomes a very important part of the ML process so that we get the ball over the net.

And as you can see, each different persona working together is what it's going to take to actually implement ML operations. Ok. So think about where you are on your ML team today. What's your role? And who are you going to have to talk to in order to really implement ML operations at scale?

Now, technology. From a technology perspective, this is where we get into the part where we start talking about tools, and there are important considerations here. Think about the skill sets that you already have when it comes to tool selection. Whatever you do, always think about what skills your organization already employs today when making tool set selections. Everybody wants to go for the tool that's the coolest thing, and I understand that, because it is cool. But sometimes cool can be a little bit overrated if we're not ready for it, especially if we have mature processes already in place that are doing some of the heavy lifting and people are really comfortable with them. The more we can lean on that, the more we can leverage that, the smoother the whole process is going to go. And while you may not be using "oh, this is the greatest thing ever," you can get a lot of mileage out of existing tool sets in this space, and especially a lot out of existing processes, adapting those for ML engineering.

So think about that, and be flexible in what you consider for your tool set. Is it reproducible? Can you scale with it as you grow into more and more ML tasks, as we start to reimagine more and more parts of our business and incorporate more ML? Is it auditable? Is it explainable? Think about your training environment and your serving environment: how you're going to train models, what you're going to use, how you're going to optimize your models, how you're going to host your models, where you're going to store your models, what kind of registries and repositories you have in place today, and what kind of monitoring you're going to do as you consider moving into production. Then start to look at what's out there for things like creating and managing workflows. There are a lot of great tools out there; leverage the ones that have a familiar feel to your organization today, because you can get quite a bit of mileage out of them. Implement those CI/CD practices and leverage what you may already have at your organization.

So look at tool selection. Managed services like SageMaker offer some really good benefits, because a lot of the tooling we just looked at can often be built inside of one platform, so we don't have to leave the platform itself to get all the different features we need to create end-to-end workflows, and it may be something we're already building models in today. Managed environments work really well for that. Custom solutions, however, might be the answer when you have cases where things are already working, pipelines are already built, and tool sets are already out there. Sometimes it's good to leverage those for ML operations.

So think about the end-to-end part of all of this. Think about all of the different tasks you have to perform when you're leveraging tools. Think about the fact that we're most likely going to have separate pipelines: a data pipeline for data tasks, and a build-and-test pipeline for building the models. Once the model is stored in a repository or a registry somewhere, we may have another pipeline that triggers off of that to get it deployed into production. So there's not just one pipeline here; it's most likely multiple pipelines that we're going to be building, and we need to look at tools that are going to serve us well in this space and also give us the ability to monitor and audit things. Ok.

So SageMaker, again, what does it allow you to do? Automate workflows at scale. There's a tool in there called SageMaker Pipelines that has the capability to build CI/CD pipelines at scale, and in fact, just today they announced some improvements to make them a little bit easier to use. You have a catalog in there to catalog your different models and to put metadata cards alongside them, you can track lineage, and you can track the experiments you're performing and keep all of those experiments and all the metadata around them in one location. Then you can maintain accuracy as things are deployed, leveraging tools like SageMaker Clarify for monitoring things like model explainability, model bias, and data drift, things that we know we're going to have to pay attention to in an MLOps space.
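
To give a feel for what a SageMaker Pipelines definition looks like, here's a heavily trimmed sketch that chains a processing step into a training step. The role ARN, bucket names, script name, and instance types are placeholders, and exact arguments can vary by SDK version; treat it as an outline rather than the definitive way to build your pipeline.

```python
# Hedged sketch of a two-step SageMaker Pipeline (preprocess -> train).
# Role, S3 paths, script, and instance types are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"   # placeholder
session = sagemaker.Session()

processor = SKLearnProcessor(framework_version="1.2-1", role=role,
                             instance_type="ml.m5.xlarge", instance_count=1)
prep_step = ProcessingStep(
    name="PrepareData",
    processor=processor,
    code="preprocess.py",                                         # your script
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
)

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, "1.7-1"),
    role=role, instance_count=1, instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/models/",                    # placeholder
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput(
        prep_step.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri)},
)

pipeline = Pipeline(name="mlops-demo-pipeline", steps=[prep_step, train_step])
# pipeline.upsert(role_arn=role); pipeline.start()
```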

So there are some really good ML operations features that are part of SageMaker, built into it already, that really allow us to start to adopt CI/CD practices here. There are a lot of tools on the data preparation side, like Data Wrangler. If you were here on Monday and sat through the SageMaker data processing class, you'll have seen Data Wrangler: a nice visual tool for preparing and getting data ready without having to write a lot of code, but also one that lets you take the tasks you've done and turn them into a processing job so that we can automate a lot of that. So there are some really good tools there, and a lot of capabilities for experiment tracking, for training models, for tuning models, and for keeping track of all of the different tasks we're performing while we're doing that. For registering the model, we have a model registry built into SageMaker, and then we can deploy those models right from SageMaker itself on endpoints that Amazon will host for you.

Now, you basically have to tell us a couple of things: what's the model, and what kind of instance do you want to deploy it on? And we get that deployed out there for you. We even make it easier to deploy new versions through things like A/B testing and shadow testing of new models. So a lot of really nice features, and again, it's a managed service; we're bundling all these tools into SageMaker Studio to make it easy for you to implement these practices today.
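
In the SageMaker Python SDK, that "tell us the model and the instance type" step can look roughly like the sketch below. The image, artifact location, role, and endpoint name are placeholders, and this uses the generic Model class rather than any specific framework container.

```python
# Hedged sketch: deploy a model artifact to a real-time SageMaker endpoint
# by supplying the model data and an instance type. All names are placeholders.
import sagemaker
from sagemaker.model import Model

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"     # placeholder
image_uri = sagemaker.image_uris.retrieve("xgboost", "us-east-1", "1.7-1")

model = Model(
    image_uri=image_uri,
    model_data="s3://example-bucket/models/model.tar.gz",          # placeholder artifact
    role=role,
)
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="fraud-detector-demo",                           # hypothetical name
)
# predictor.predict(payload) would then return inferences for new data.
```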

Ok. So security and governance, let's talk about that a little bit. Let's talk about some of the security and governance tasks, because they're going to become really important. You may have heard in the DevOps world that we're incorporating security best practices into our CI/CD builds. It really shouldn't be much different as we go into the MLOps world. Again, we're now incorporating not just the security of the application, but we may also need to think about the security of the data as well. Ok.

So we may now have some additional concerns that maybe we didn't have in the DevOps space. But incorporate security, incorporate those security practices, and automate that into the flow as we go, not as something we do after the fact. That means considering things like the security of the training environment itself. How are we hosting the training environment? Are we doing it through network-secured VPCs, for example, that are not exposed to the internet directly? Maybe hosting it with a VPC endpoint that allows us to access the SageMaker service, and if we do need to go out to the internet, maybe to get some libraries or something like that, doing that through a gateway rather than exposing things directly.

What about the container security of the model itself? Models in SageMaker are served up through containers, so there's the security of that container as we're running it. There's model security: the predictions it makes, who has access to those predictions, who can look at the prediction that's actually made and the result. So, first off, we're securing the system itself and making sure we're protecting any kind of data privacy, but we also want to avoid things like undocumented clients of the model that can cause scalability problems later on. So we're making sure that the people and the processes that are allowed to access the model are doing so through the right permissions.
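
On the training-environment side, the SageMaker SDK exposes VPC and isolation settings on the estimator. The sketch below shows the general idea with placeholder subnet, security group, and bucket values; it's one possible configuration, not a complete security architecture.

```python
# Hedged sketch: run a training job inside your own private VPC subnets with
# network isolation, so the training container has no direct internet access.
# All IDs, ARNs, and URIs are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"      # placeholder
image_uri = sagemaker.image_uris.retrieve("xgboost", "us-east-1", "1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/models/",                      # placeholder
    subnets=["subnet-0abc1234"],                                    # private subnets
    security_group_ids=["sg-0def5678"],
    enable_network_isolation=True,          # block outbound calls from the container
    encrypt_inter_container_traffic=True,   # encrypt traffic between training hosts
)
# estimator.fit({"train": "s3://example-bucket/train/"})
```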

And finally, and most importantly, because these models serve up a lot of data, interact with a lot of data, and are trained on massive amounts of data: what about the data security? What about the personally identifiable information that may make up some of that data? Has that data been masked when we trained on it? Have we removed personally identifiable information, or have we secured the system enough so that the data is at least protected as it's stored, encrypted at rest, for example? How about as it's moving across the network? Is it encrypted in transit too? We need to be concerned about that. We need to be concerned about data, and the protection and security of that data, through all phases of this process. That needs to be auditable too: who has access, and what is their access to it? Ok.

So again, talking about technology here, SageMaker has capabilities like isolating the infrastructure through VPCs, plus authentication and authorization, where we can look very specifically at who has permissions and what they're allowed to do.

"We have data protection mechanisms built in for things like encrypting data at rest and encrypting data in transit, which happens automatically through SSL encrypted endpoints. We have monitoring capabilities built in through several different tools like CloudWatch, CloudTrail but also tools like Clarify that allow us to kind of do bias detection, data drift and things like that as well as SageMaker Model Monitor where we can capture things like predictions that are coming into the endpoint and save those out there and then check to see whether or not we're getting data drift on those.

SageMaker is also part of a number of different compliance and certification programs already. You can go out to AWS Artifact and download and see which ones we are a part of.

So the last section here before we start to wrap up deals with ML governance. In ML governance, the types of activities and best practices we want to concern ourselves with are things like creating documentation of the processes we're following. Again, I go back to my example earlier of talking with the data science team and asking where the documentation is: "Oh, it's in the notebook." We often need to get it out of there. In a lot of cases we need to understand, from a user perspective, all these different personas and what their permissions need to be, and make sure they have the right permissions but not too many permissions.

For the models, we need to understand their behavior and detect things like performance issues. How many invocations per second are we getting? Are we seeing latency increase? What starts to happen as invocations increase, is latency going up as a result, and do we need to scale endpoints because of it? And then we need auditability and compliance with regulations as well. From a governance perspective, that's really important.

So with SageMaker, again, roles can help us here. Roles can determine exactly what I can do: when I assume the role, I become that persona and I get the policies that are associated with it. That's true of a user and it's true of a service; whether we're talking about users or services, the role determines what you're allowed to do. And by controlling which roles you're allowed to assume, you can have the right personas in SageMaker with the types of permissions that they need.

Model cards can become your documentation for the models that you have built. Model cards can actually tell you a lot, and in the SageMaker Model Registry, one of the nice things about model cards, especially if you're tracking things like experiments, is that it'll start to fill out some of those cards for you automatically: here's what this model was trained on and when, here are the hyperparameters it was trained with, here were the results of the tuning job, here's the score it got, for example. So a lot of that can be built up, and model cards can start to become the documentation that tells you exactly what this model is and what it's for. That becomes critically important as we start to scale.

And then there's a model dashboard as well, to help us keep track of all the models that are currently deployed, where they're deployed, and what types of hardware they're using. We can even set up notifications on things like different types of performance metrics or any kinds of errors that we're getting; we can set that up from the model dashboard so we can troubleshoot performance issues as they arise. Ok. So governance is really an important part of this whole ML process, and it's important to think about implementing explainability and understanding which features are most important in that model's predictions.

Ok, finally, let's talk about a maturity model here and look at the four stages of an MLOps maturity model, and maybe try to figure out, as you're looking through this, where you might fall on these four stages. You have initial, which is typically where we're experimenting. We move into repeatable, where we're starting to implement some pipelines and some automation. Then reliable: maybe now, as we're automating things, we're also automating a lot of our testing processes, bringing on more testing, more automatic types of testing, and more monitoring capabilities as well. And finally scalable: being an organization that can handle multiple different types of ML solutions at one time, in the tens, moving from the tens perhaps into the hundreds, even into the thousands of use cases for larger organizations that are mature enough to scale into that.

And so initially, you can see, there's very little collaboration here. There's maybe, again, some experimentation going on, and limited cross-training across the different teams as to what's happening. There are a lot of one-time data collection tasks at this phase: let's go get this data, great, I've got it, and I'm going to use it for my particular problem. But that data isn't anywhere shareable where it could be reused by another team, for example. There are manual training processes, maybe no retraining or automatic retraining at all, and manual deployments as well: the deployment process happens manually, and there's a lot of effort to get the model from the data science team, get it into production, and get it operationalized.

But as we start to move into repeatable, what we're starting to see at this stage is more collaboration between the ML engineers, the MLOps engineers, and the data science teams, and defined data strategies. So instead of, for example, all these one-off data collection tasks where data is stored all over the place and everybody's got it in different buckets, maybe we start to implement centralized feature stores, so I can see some of the work you've already done, and maybe that's something I could use in my experiments going forward. So maybe we start to centralize some of that, and we start to have a defined strategy for how we're going to handle data at scale.

We have defined paths for training, and automatic training pipelines start to emerge here. Our tools start to automate a lot of these tasks that used to be done manually in notebooks, moving that into an automated tool, and then automated deployments come out of that. And in the reliable phase, more cross-training, more automated validation, and more automated testing are happening; things are validated. So as we're pushing the model into production, we can have high confidence that we can move that model and we can test it, and if we're seeing any issues, we can roll things back without affecting any production system that's running right now.

So maybe we implement blue/green deployments, or maybe we implement more automated deployments here. We're starting to see those really good CI/CD DevOps practices emerge in the ML space now. We're becoming reliable in our automated processes and able to trust what we're doing in a lot of our deployments, and again, implementing some really good monitoring and automating the monitoring process.

And scalable is being able to take what we're doing and do it at a very large scale, to tackle more ML problems simultaneously with the resources we have, because we have high confidence in the processes we've implemented. So it's scalable: again, cross-functional teams, cross-training mechanisms to share best practices, good communication emerging, good processes, and a comfort and a high understanding of how to use our tools to make them work for our environments.

Ok. So a couple of key takeaways. I'm right on time, so hopefully I can get you out of here right on time. Collaboration is key to this: cross-functional teams speaking the same languages, being able to understand what each other is saying. Experimentation is really important; I don't want to downplay that. Experimenting with some of these great new tools is fantastic, right? Hopefully you're all coming away from this event really excited about some of the stuff you heard today in the keynote and you can't wait to try it out. But we can't just stop at experimentation. We have to have processes in place to go from there to a path into production, and hopefully that's what some of the talk today will help spark.

So use purpose-built tools and built-in integrations. Again, cool is great. I hope you all use SageMaker Studio, I really do; I'd love to come to your site and train you on it. But use the tools that make sense for your org. Look at the CI/CD pipelines that you have, automate your workflows, and monitor the quality of those models: implement some kind of quality monitoring, and look at the tools out there that will help you do that and automate it. Find ways to track the lineage of data automatically, so that you know where this training data came from and what changes it has undergone, because at some point somebody will ask that: where did all this originate, and what happened to it along the path? That becomes very important in a mature organization, to have some mechanism for tracking that. And then continue learning.

This is a one-hour version of a three-day class. Ok? I hope you'll all consider taking other types of training on this and developing your skill sets around ML and ML operations. We have, like I say, a three-day classroom course, but there's also digital learning you can take that is very, very good. You've probably heard of Skill Builder; that's a very good resource to go to, with some good training to get you started on some of these different projects. And then think about some of the different MLOps tools like I mentioned: check and see what's available in SageMaker, how it would fit in your organization today, and how you could maybe leverage some of the tools there and integrate them with some of the tools you're already using.

So for example, maybe you're using Jenkins as your orchestration server and you're really comfortable with that. Well, there are mechanisms to integrate Jenkins with SageMaker Pipelines to kick off a new build. So there are integrations that may exist that you're unaware of. Pay attention to what the tools and the technology can offer you, and keep in mind that it is always changing. Again, there's the link for Skill Builder; definitely check it out, because it's going to be really important to continue your learning. This space is going to change fast. There are going to be new things coming all the time. As you've seen this week, there are all kinds of new services and new technologies for us to go out and learn. But again, I hope that you'll start with people and processes, and then look at all the cool tools, because there are a lot of really great ones out there.
