Accelerate foundation model evaluation with Amazon SageMaker Clarify

Mike: All right, let's start with something interactive. Large language models don't always get it right. Here's a particularly infamous example. "What new discoveries from the James Webb Space Telescope can I tell my nine-year-old about?" This was a question posed to one of the leading large language models, and I'm going to ask folks in the audience what's wrong with the answer. The answer given was: the James Webb Space Telescope was used to take the very first pictures of a planet outside the Earth's solar system, or exoplanets. Anyone, just shout out if you know what's wrong with that answer.

Audience: It wasn't the first.

Mike: Yes, exactly. It wasn't the first. The first image of an exoplanet was captured nearly two decades earlier by the European Southern Observatory's Very Large Telescope.

This is an example of what we call hallucination. Large language models are designed to give plausible answers; they don't always give accurate or truthful answers. They'll give plausible answers based on the underlying relationships between the words they were trained on. This particular example happened at a very unfortunate time: the provider of this large language model was in the middle of a launch presentation, and the company's market value dropped by billions of dollars because of it.

Here's another example of large language models not always getting it right. This is also an infamous example: a Berkeley scientist asked one of the leading large language models to write a Python function deciding whether a candidate would be a good scientist based on two input parameters, race and gender. The large language model responded with the function you see here: if the candidate's race is white and the gender is male, return true; if either of those conditions is false, return false. This is an example of large language models stereotyping, reproducing and perpetuating the biases encoded in the data on which they were trained.

My name is Mike Diamond. I'm a principal product manager with SageMaker Clarify, and joining me today are Emily and Tarn, who will introduce themselves in a few minutes when they come up after me. We're going to be talking about foundation model evaluations with SageMaker Clarify, which was introduced this morning in the keynote by Swami, and how it can help mitigate the risks of large language models during model selection and throughout the model customization workflow.

So generative AI, by its very nature, introduces new risks for your organization. We already talked about a couple of them: hallucinations and stereotyping, which is a form of bias. But there are other risks as well. Models can reveal private information in their responses, such as personally identifiable information or copyrighted material. They've been known to produce toxic, hateful, or profane responses. And there are many other risks as bad actors get their hands on this capability.

But making sure your generative AI application gives responsible responses is not the only concern. As we saw in the first example I showed, there's also the risk that responses are inaccurate or low quality. And why do these risks matter? Unlike consumer use, where you're interacting with a chat interface and a wrong answer is silly and you can laugh it off, when these models are used for business cases it's customer trust and your brand reputation that are ultimately at risk. On top of those concerns, as we saw early last month with Biden's executive order on AI, and with the EU AI Act drafted earlier in the year, regulations are coming that will follow guidances such as ISO 42001, which call for evaluations of large language models, not just by the providers of those models but by the consumers of those models, especially in industries with high impact on individuals.

But evaluating large language models for risk is complicated and time-consuming. There are hundreds of possible LLMs to choose from, and more come online every day. And you need to perform evaluations not just up front, when you're trying to select from those hundreds of models, but, as the workflow here shows, continually as you customize the model, as part of your FMOps processes and workflows.

There are academic benchmark sites out there you can go to, but they cover only a subset of the models you're interested in, and they require highly specialized knowledge to interpret the metric scores they report: is a given MMLU score good or bad? You can try to build your own evaluation capability by finding tools and deploying or importing them into your development environment, but that is itself very time-consuming and takes a lot of effort. We tried to run HELM, one of the large benchmarking frameworks, locally within our environment; it took over 100 large instances and ran for over 24 hours.

But the concern I hear most often when I talk to SageMaker customers is that, ultimately, the scores provided by these tools and capabilities don't relate to their specific use cases. So earlier today I was very excited when Swami announced, and we released in preview, Foundation Model Evaluations with Amazon SageMaker Clarify. This capability lets you start evaluating any LLM, anywhere, for quality and responsibility, in minutes.

There are two parts of that north-star vision I want to underscore. The first is quality and responsibility being twinned together: we intentionally built an evaluation capability where responsible AI is not a secondary step or an afterthought, but something you do at each stage of the ML workflow when working with large language models. The second is evaluating any LLM, anywhere. As we talked to SageMaker customers, the feedback we got was that they didn't want their evaluation controls spread out over the multiple environments where they're using large language models; they wanted to concentrate and centralize their evaluation logic in one place and be able to use it across the different models. So that's what we're bringing to market.

And the benefits are shown here on the right. We've curated open-source datasets to get you started evaluating in minutes. But importantly, we've built this to work with your prompts and your data, so you're getting results that are meaningful to your use case. Not every evaluation can be done with algorithms: there are some evaluations, as Swami talked about this morning, like brand voice or helpfulness, that are subjective and for which you'll want human evaluations. And as far as we know, we are the only evaluation solution built from the ground up as a single solution that covers both algorithmic, metric-based evaluations and human evaluations.

We've put a lot of effort into making rigorous science accessible and actionable, and the evaluation capabilities are integrated with the SageMaker set of services so that you can have enterprise-level FMOps capabilities. All of these benefits will help you comply with the guidances I spoke about earlier.

So, a little bit on the use cases. I touched on them, but to go a little deeper: the first use case is "help me select my model." When you're initially trying to choose the right model for your use case, this is a capability to help you make those decisions. Second, once you've selected the model, there are many options to customize it to make it perform better, such as prompt engineering, RAG, RLHF, and supervised fine-tuning; Emily will go into these in more detail. You can run evaluations after each of these optimization exercises to see which optimizations are working best and, ultimately, which ones you want to take to production. The primary users we designed this capability for are the ones highlighted at the top left: the core SageMaker users, data scientists and engineers. And also, as I mentioned, the human annotators who will be reviewing responses and providing evaluation feedback.

These two groups work in a complementary way with the tool set. You can use the algorithmic scores produced by the automatic evaluations to optimize the use of your human workforce, that is, only have them review the riskiest responses, getting the most value out of their time. And the feedback the human reviewers provide will be consumed as ground truth by the data scientists in order to optimize and improve. Secondary users we thought of when building this capability are the business and product owners, who will use our summary reports to make decisions about which version of a model to take to production, and your second-line oversight functions: we have a much more detailed report, 40-plus pages, with lots of information that can be used as documentation for compliance purposes.

All right, let me go a little deeper now into the actual features and flows. We have three user interface options available for you, listed here in order from simplest to use to more sophisticated but with greater control. The first option is a UI inside SageMaker Studio, where within a few clicks you can create an evaluation in minutes. The second option is to run processing jobs and get the benefit of a managed service: with a few lines of Python code you can get started with an evaluation. And the option that gives you the most control, and I'm proud to say this, is the FM evaluation library that drives all of this, which we've released on GitHub as open source. You can use that library, the Python code, and the examples we provide anywhere, within SageMaker or outside of it. We come with tasks; these are the four tasks we've launched with: general text generation, summarization, Q&A, and classification, and you can pick any of these four. The flow I'm going through is pretty much the same for each; I'll point out where there are differences between the interface options.

Then you select a human or automated evaluation, depending on which type of evaluation you want to perform. The human evaluation can only be set up in the UI, but the automated evaluation can be set up from all three options. If you're using the automated evaluation, you will likely get started with the built-in datasets we provide, which are configured across the evaluation dimensions here on the right that I'll talk about in a second. Or you can bring your own dataset: we give you documentation, it's an easy JSON Lines format, and you can follow the built-in datasets as examples when bringing your own. The algorithms we provide are organized across these dimensions: the quality dimensions of accuracy, robustness, and factual knowledge, and the responsible AI dimensions of stereotyping and toxicity, and we'll be extending these to additional dimensions shortly. Then we provide the results to you in S3. We have both summary and detailed reports, but you also get all of the underlying data and can configure the reports as you want. Emily, do you want to come up and talk about LLM evaluation use cases and SageMaker?

Emily: I do. Thank you, Mike. All right, good evening. My name is Emily Webber. I lead our generative AI foundations technical field community, and in particular I'm here to share with you some thoughts about LLM evaluation. So let's dive in. First, it's helpful to define what an evaluation is. For those of you who are familiar with this, this is no surprise, but if it's new to you: we like to bring math, numbers, and statistics into the generative AI process. Of course, it's very convincing to think that what comes out of an LLM is always good every time you see it, but as Mike demonstrated, clearly that's not the case. So we need evaluation to produce scores, essentially. We'll take a prompt, such as asking our model to create a short story about a doctor in Manhattan, send the prompt to our model, and get some type of output. In this case, the model goes straight to the male pronoun, leaving female doctors out, and we know that is a biased response; that's a particular type of bias. What we want is some type of evaluation algorithm to calculate this, to define it in numerical terms, so that we can bring those numbers, those data points, into our governance workflows and into our operational workflows, so we can define this bias, monitor it, and then mitigate it.

But of course, it's not just bias that's the problem; it's also the lack of quality and the lack of accurate responses. So generally, evaluation is about bringing numbers back into the AI process, if you will, for large language models. Now, we know there are many ways to customize a large language model. I suggest that they vary in terms of complexity and cost on the x-axis down there, and of course in accuracy. Prompt engineering is the way most of us get started when we're evaluating and customizing a large language model.

"Uh this is because it's, it's easy to get started and it's not too expensive. Um but the techniques grow over time, right, most customers will move into some type of retrieval augmented process that's using a vector database to improve the performance of that LLM. This then gradually morphs into a fine tuning process where again, you're using your data to customize a model. And then pre training is sort of the holy grail where you access large data sets and and create your own models.

The only problem with this curve is there's no point on it that says where you should start. It's not immediately obvious where we should focus our development and deployment teams in this life cycle. So I'm excited to say that you can use large language model evaluation to make the right choice across your entire development and deployment life cycle.

Starting at the top left, you can use LLM evaluation to find the right foundation model, to make the right choice about a foundation model. This choice can include the cost, obviously, the performance of the model, its accuracy, its ability to handle the types of questions you care about, and the bias of the model, its responsibility.

So you'll start with selecting the model, and then there's that entire workflow we looked at, moving from prompting into retrieval-augmented generation into fine-tuning. LLM evaluation helps you know where you are in this process. It helps you establish a firm plateau, a firm foundation, to help you make the right decision about spending more time and resources as you move across this life cycle.

LLM evaluation is also helpful in reducing costs, because if you realize that you're spending maybe too much money on a particular model, you can use LLM evaluation to see that a different model might hit similar accuracy levels at a lower cost. So there's a cost-reduction opportunity, and migration opportunities as well: as you're developing applications and considering where the rest of those applications should sit, LLM evaluation will help you make that case. And then, as Mike mentioned, reinforcement learning with human feedback, and then governance and MLOps.

These are additional capabilities you can use to improve the performance of your LLMs, and all of this hinges on how good the model is and how much time we spend getting there. That's why LLM evaluation is so important.

Now, how does this work? The first step in using this capability we're launching is, of course, selecting a model. In the managed web application, the flow available in SageMaker Studio, you'll start by pointing at any model in SageMaker JumpStart or any model you have hosted in SageMaker. If you're using the Python SDK, you can point to any model hosted anywhere: you can point to Bedrock, and you can also point to an OpenAI model using our custom examples.

So you'll point to your model, and then you'll configure your evaluation. This can be a human-driven process, and you'll use a human-driven process when you're trying to evaluate something that isn't already prebuilt, like the brand of your company, your voice, how friendly you want the model to be, how creative you want it to be. If there's already a well-defined NLP metric and a dataset, then you can just use the automated evaluation. But if there isn't, or if the performance isn't great, if it's maybe 80% of the way there and you want to close the rest of the gap, then human labeling is going to be a good choice.

After that, you'll select the task for the foundation model. This can be open-ended generation, question answering, summarization, or text classification, one of those four tasks. Each task comes with a number of algorithms that implement, basically, the tests for how well the model performs. And as Mike mentioned, we also have prebuilt datasets, standard NLP datasets, that make this entire experience very easy and very fast.

It's friendly enough that a Python developer without a lot of data science experience can get up and running in minutes. So you'll set up your evaluation, and then you'll configure the processing job; it runs a SageMaker Processing job on the back end. You'll point to an IAM execution role and set the number of processing instances. We manage the distribution for you, so you can set however many instances you'd like to use, and you'll pick the size of the instance as well: as you use larger datasets, you'll use larger instances.

So you'll configure the job. The human feedback process renders directly in SageMaker Studio: it has a prebuilt labeling user interface that your reviewers interact with, and you just add their email addresses, so it's very easy to set up the human flow. There can be a number of models in the human flow, and reviewers can add metrics about their preferences, and you can gather aggregated human win rates and preference rankings. In any case, at the end you'll store all of this and capture the results, and you'll receive a 40-page PDF report that doesn't just give you the metrics but also explains what they mean. It shows you really nice graphs and visuals that help you understand what these metrics mean for your business and your customers, so that you can make the right decision.

So that, at a high level, is how to use foundation model evaluation. I'm going to move into the demo in a minute and we'll take a look at this, but I have two pipelines to share with you as well. In addition to the open-source FMEval library, we also have an example repository for running the Python library in SageMaker Pipelines. So if you're looking for a way to standardize governance and operations for foundation models, we have examples for you to easily validate and confirm performance, for both single models and multiple models.

One of the pipelines we're going to look at actually evaluates a Llama 7-billion-parameter model alongside a Falcon 7-billion-parameter model and a fine-tuned Llama, runs all of these at the same time, and then does one evaluation step to help you determine which model you'd like to use. So with that, let's look at the demo.

Alright, so here we are. I'm in us-west-2 right now. You'll notice I'm using the new version of SageMaker Studio; we just pushed out a new version of SageMaker Studio as of this morning. If you're using SageMaker Studio Classic, this UI is not going to render, so make sure you're using the new version of SageMaker Studio.

Again, this is a fully managed web application with a number of options, including of course our Studio and Studio Classic, the older interface. On the left-hand side, I'm going to navigate down to the Jobs tab, where we'll see Model Evaluation; you'll also see JumpStart. If I'm looking for a new model to evaluate, I'll pick JumpStart, and within JumpStart I'll just type Falcon. For Falcon, you have a number of options in the JumpStart model repository; I'll pick Falcon 7B. On the model card page you have three options: you can fine-tune the model, you can deploy the model, and you can evaluate the model. The evaluation job can actually take a pre-hosted endpoint, so if you already have a SageMaker endpoint created in your account and you just want to point to that and run a job, you're welcome to do that.

If you want to create a new endpoint temporarily while the processing job runs, that's fine as well, so you have many options there. We'll say evaluate, and then I'm going to zoom in here just a little bit to get a deeper view. I'll name it Emily-test.

Alright. So, as promised, there's both the human-driven evaluation and what we're calling automatic evaluation. Automatic evaluation really means those pre-existing NLP metrics: things like F1 score, accuracy, and toxicity, every evaluation metric that basically uses math and runs an algorithm or uses a prebuilt dataset. We put all of that in this automatic workflow. The alternative is human-based evaluation, which creates a labeling job inside this new SageMaker Studio experience for your teams to click through and record their preferences, to ultimately perform your model evaluation.

So we'll say automatic evaluation. We see the Falcon 7B model is already available here; if you want to pick a different one, I'll just say remove and then add the model here. This is a second little interface for picking the model. Let's say we want Falcon, so I'll type Falcon, and here's the option to pick an endpoint I've already deployed. I'll go find my Falcon endpoint, which is right here, and add it. So now I'm pointing to my Falcon model.

Now I'm going to select the task type. Again, we have four prebuilt task types: text summarization, question answering, classification, and open-ended generation. I'll select open-ended generation, and then we have different dimensions inside this task. Prompt stereotyping uses a dataset with two types of prompts: one more stereotypical and one less stereotypical, and this dimension tests the model's likelihood of being more or less stereotypical. That's the CrowS-Pairs dataset. Again, I can use the built-in CrowS-Pairs or I can use my own dataset, and when I'm using my own dataset I just need to format the data to hit this model.

All of the data formats are JSON Lines, and that's the case for the human-based evaluation, the Python SDK, and the UI; everything is JSON Lines. Some of the keys in the JSON Lines are slightly different, but generally we built the back ends to be very flexible across this entire stack. So I'll use a built-in dataset and select CrowS-Pairs, and then toxicity.

Toxicity brings in a pretrained model, the UnitaryAI toxicity model, and basically runs a type of classification to see how toxic this model will be. We have two built-in datasets for it: Real Toxicity Prompts and the BOLD dataset. And then there's factual knowledge.

In the factual knowledge dimension, the dataset is called T-REx. There are no dinosaurs in this T-REx dataset, I did check. It's essentially questions and answers, and the answers are typically a single word: questions like "Who was the president of the United States in 2018?" or "In which state is the city of Las Vegas located?" Very short, one- or two-word-answer questions.

And we bring that in the JSON Lines format. For all of these, we take the prompts, and again the prompts here are either built in or the prompts you're loading, so you have both of those options. We send those prompts to the model on your behalf, we get the responses, and then we compute the metrics and give you all of the evaluation results so you can understand how well this model is performing.
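As a rough illustration of the bring-your-own-dataset option, here is a minimal Python sketch that writes a question-and-answer dataset in that JSON Lines format. The key names and the "<OR>" delimiter for multiple acceptable answers are assumptions modeled on the built-in factual knowledge example; check the documentation for the exact keys each task expects.

```python
import json

# A handful of illustrative factual-knowledge records. Each line is one JSON
# object; "answers" holds acceptable answers joined by an assumed "<OR>" delimiter.
records = [
    {"question": "In which state is the city of Las Vegas located?",
     "answers": "Nevada"},
    {"question": "Which telescope took the first image of an exoplanet?",
     "answers": "Very Large Telescope<OR>VLT"},
]

# Write one JSON object per line -- the JSON Lines format described above.
with open("my_factual_knowledge_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```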

So we'll select factual knowledge and use the built-in dataset. Great.

And then I'm going to go find my S3 bucket; let's see if I can find it, just a second here. All right, us-west-2, is it this one? Great. So I'll pass in my S3 location.

Great. So that's where my evaluation results will go on completion of this processing job: I'll see a PDF report, a markdown file, and all of the configuration data, in addition to the input data when I'm bringing my own. And then I'll set up my processing job.

This is a smaller dataset, so I'll use just one instance, in particular the ml.m5.2xlarge. If I were working with a larger dataset, I would simply increase this; you can bump it up to the ml.m5.4xlarge, maybe the ml.m5.12xlarge, if you're working with something larger. All right, great. We'll take questions at the end. My execution role is here, so I'll select next.

I get a nice job summary with my job details and the evaluation setup, and I'll say create resource. And of course I used the wrong S3 bucket. Oh, did I? Oh, yikes. Oh, thank you. Classic. Here we go. All right, great. So my job is coming online.

I can also see a number of jobs that I've been running to test and confirm this, so I'll show you some of the other jobs I've run previously. I ran a job to test the Falcon 40-billion-parameter model. In this case, the service creates an endpoint for me, again using SageMaker JumpStart, runs the evaluations, then tears down the endpoint on completion and gives me the evaluation results.

The view we're seeing here is a summary, the condensed view that tells me the number I'm getting on prompt stereotyping, and then all of the reports on toxicity, again using the UnitaryAI pretrained model, factual knowledge, and so on. All of the metrics are scaled between zero and one. Most of the time a higher number is better, except for toxicity, where of course a lower number is much better.

From the model summary page I can then view the evaluation results, which takes me out to S3, where I'll see again the markdown, the PDF, and then the JSON config and outputs. I'm going to jump straight to the report. This is again that 40-page detailed report that summarizes, at a high level, what my job is about: factual knowledge with the aggregated performance, prompt stereotyping, toxicity, the evaluation job config, and then a detailed report.

And again, I love this report because it provides insight not just into what the result was for each metric, but what the metric means, how you can interpret it, how to consider its implications, and how it's commonly used, with a couple of nice visuals for you. The stack also lets you bring what are called knowledge categories.

Especially for factual knowledge, customers will usually bring datasets that already have many different categories. One of the DIY jobs I ran was on the SageMaker FAQs, and I had many different categories, including things like hosting, training, and endpoints. You can bring all of those detailed knowledge categories into the stack, and it will give you a further level of analysis.

So in any case, that is my report; let's move on. The report is just the beginning: you also have an open-source Python library, available on GitHub. FMEval is the name of this framework, and it's spectacularly easy to use. We designed it to be really, really accessible, so that no matter who is coming in to help you stand up your LLM development and deployment life cycles, they should be able to evaluate a foundation model.

You'll do a general pip install. The pip install is extensive, since it has a lot of the low-level frameworks and packages built in, but it's one pip install. And then we have many examples, including examples for Bedrock, JumpStart, ChatGPT, custom models, and Hugging Face. And then there's another mode we call bring your own output.

So if you want to interact with the model wherever you like, then download the model results and bring them into this environment to do the evaluation, you can do that. I have a version of this running, so we're going to step through it very quickly. You can see I'm in SageMaker Studio, and the library runs locally: it runs locally in a notebook, and then you can put it in a job if you want to. But when you're stepping through this just to explore it, you'll want a slightly larger instance; the ml.m5.2xlarge is fine. I'm probably running on a little bit more, just to make sure this went nicely.

So in any case: pip install fmeval, import sagemaker, and I'm going to point to Llama 7B here; we'll use JumpStart to deploy it. And then the config should be really easy. From fmeval you'll import a data config and a model runner, which is just a pointer to a model, and then the configuration for the algorithm itself. This one is factual knowledge, so you'll configure that.

We're going to point to a local dataset that's just sitting in my notebook; you can also pass an S3 path, which it will retrieve if it has access. The factual knowledge dataset here is again question and answer, and it does have this "<OR>" delimiter string. Then we point to our model, again a JumpStart model, set up the factual knowledge evaluation, and call its evaluate method.
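Those few lines look roughly like the sketch below, modeled on the public fmeval examples; the exact module paths, endpoint name, model ID, payload template, and prompt placeholder are assumptions and may vary by library version, so treat this as a sketch rather than the definitive API.

```python
# pip install fmeval
from fmeval.data_loaders.data_config import DataConfig
from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner
from fmeval.eval_algorithms.factual_knowledge import (
    FactualKnowledge,
    FactualKnowledgeConfig,
)

# Point at the local JSON Lines dataset (an S3 URI also works if the role has access).
data_config = DataConfig(
    dataset_name="my_factual_knowledge_dataset",
    dataset_uri="my_factual_knowledge_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",   # key holding the prompt
    target_output_location="answers",  # key holding the reference answer(s)
)

# A model runner is just a pointer to a deployed model; here, a JumpStart endpoint.
# The endpoint name, model id, output path, and payload template below are placeholders
# for whatever you actually deployed.
model_runner = JumpStartModelRunner(
    endpoint_name="my-llama-7b-endpoint",
    model_id="meta-textgeneration-llama-2-7b-f",
    output="[0].generation",
    content_template='{"inputs": $prompt, "parameters": {"max_new_tokens": 64}}',
)

# Configure the factual-knowledge algorithm; "<OR>" separates alternative answers.
eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))

# Run the evaluation locally; aggregate and per-record results are returned and saved.
eval_output = eval_algo.evaluate(
    model=model_runner,
    dataset_config=data_config,
    prompt_template="Answer briefly: $model_input",
    save=True,
)
print(eval_output)
```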

And again, this is running locally on the notebook. Once it's finished, you'll get the evaluation output, and here I can see my Llama 7B actually did quite well: 0.59, which was surprising for a 7-billion-parameter model. That was a nice find. And then you can review the results out here.

You can load them into a data frame, and you can store them in S3.
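For example, if the per-record results were saved as a JSON Lines file (the exact output path and file name depend on how the run is configured; the name below is illustrative), loading them into a data frame is a one-liner:

```python
import pandas as pd

# Load the per-record evaluation output (one JSON object per line) into a DataFrame.
# The file name below is a placeholder for wherever your run wrote its results.
results = pd.read_json("factual_knowledge_llama_7b.jsonl", lines=True)

# Inspect what came back: prompts, model outputs, and per-record scores.
print(results.columns.tolist())
print(results.head())
```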

You can also put the evaluation in a pipeline; that's what I'm going to show you next. We have another open-source repository that takes this new framework and puts it inside SageMaker Pipelines. I've been playing with this for the last couple of weeks, actually; it's very nice and easy to use, works very well, and it's easy to customize. There's a large YAML file to configure the entire pipeline, and you can customize that YAML file to point to different models and do slightly different things in your pipeline.

It's really valuable to build a pipeline for evaluation because it lets you evaluate new developments very quickly. Whether it's a new step in your retrieval process, a new agent you're testing, a new model, or a new dataset, when your evaluation stack is built as a pipeline it's really easy to just run it and see very quickly

whether you need to evaluate a new model even further. So we're going to look at this in SageMaker Studio. I'm back in SageMaker Studio, in the Classic view this time, running this framework, and I have two pipelines.

The first evaluates a single model. I'm going to open the execution details here, and here is my pipeline: I deployed a Llama 2 7-billion-parameter model, I ran a preprocessing step, and then I have this evaluation step that uses the library, takes the output of that model, runs an evaluation, and stores the results in S3. There's actually an HTML file generated in S3 showing me how my model performed. So that's the single model; I can also evaluate many models.

And here is my beautiful three-model evaluation pipeline. I'm deploying Llama 7B over here, running the same preprocessing across all three, then deploying Falcon 7B, then fine-tuning Llama, deploying it, and evaluating it. It's the same evaluation stack for all three of these models; I bring them all together, make the choice about which model to use, and then clean everything up.
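As a rough sketch of the idea behind that repository: an evaluation step in a SageMaker pipeline is essentially a processing step that runs a script wrapping the fmeval calls shown earlier. The container image, script name, and instance settings below are placeholders, not the repository's actual configuration.

```python
import sagemaker
from sagemaker.processing import ProcessingOutput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# A processor that runs the evaluation script; the image URI is a placeholder for
# any container with fmeval and its dependencies installed.
evaluator = ScriptProcessor(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/fmeval:latest",
    command=["python3"],
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
)

# The evaluation step calls a hypothetical script that builds a model runner, runs
# the fmeval algorithms, and writes reports to the processing output path.
evaluation_step = ProcessingStep(
    name="EvaluateLlama7B",
    processor=evaluator,
    code="run_fmeval.py",  # hypothetical wrapper around the fmeval calls shown earlier
    outputs=[
        ProcessingOutput(
            output_name="evaluation_results",
            source="/opt/ml/processing/output",
        )
    ],
)

# A one-step pipeline for illustration; the example repository chains deployment,
# preprocessing, evaluation, and cleanup steps around this same idea.
pipeline = Pipeline(
    name="fm-evaluation-pipeline",
    steps=[evaluation_step],
    sagemaker_session=session,
)
# pipeline.upsert(role_arn=role)
# pipeline.start()
```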

And so I hope you enjoyed the demo. I'm going to hand it over to Tarn, and she's going to share with you how they think about responsible AI and model evaluation at Indeed. You're on the backup slides. Yeah. OK, we'll go ahead and skip that demo.

Tarn: Awesome. Hi, everybody. I'm Tarn Heilman. I'm a senior data scientist on the responsible AI team at Indeed. First, I'll tell you a little bit about how we think about AI at Indeed. Our mission statement is "we help people get jobs," and so we work really hard in our two-sided marketplace to open access to opportunity for all job seekers. Our development of AI applications to aid this is guided by that mission above all else. I'm on the responsible AI team, which sits in the ESG, or Environmental, Social, and Governance, organization.

In that organization, we believe that talent is universal, but opportunity is not. We're delivering opportunities at scale on our platform, and if we don't take proper care, the biases and barriers that job seekers have faced since paid work became a thing are going to be reinforced and amplified at scale, which we don't want to do.

So within ESG, we spin the mission a little bit to say that we want to help all people get jobs, and so we want to ensure that all job seekers are benefiting from AI.

On this slide here, you can see we have these five principles that we think about when we are interacting with AI:

  • We always put the job seeker first.
  • We center fairness and equity in all of our development work.
  • We are continually listening and taking feedback from our customers and our users.
  • Hiring a human is a really important one. AI already aids so many parts of the job seeking and matching process, but job seekers and employers both really share this belief that a human should be in the loop and that a human should be making the final decision.
  • And then lastly, we innovate responsibly, which is why I have a job.

A little bit more about the responsible AI team in particular:

We're a very small team; the CEO always describes us as small but mighty. We have scientists, we have a software engineer, we have user researchers. And we try to help all people get jobs, break down some of those biases and barriers that people face, and make sure that our AI products are inclusive for everyone. And we do this in a couple of different ways:

We operate with two personas. The first one is more of a blue team persona, where we work collaboratively with the development teams: we ensure that their data and labels are collected responsibly, we help them ensure they have balanced data sets, and so on. We use inclusive, human-centered design principles at every stage of the process, and then we really focus on transparency and explainability.

And then we also have a red team persona where we take care of some of the more regulatory functions. Hiring is a very heavily regulated field, not unlike healthcare and finance. So some of our algorithms are subject to audits under law. And so we look at those algorithms independently and perform what we call fairness audits, and we look at the traditional fairness metrics for those.

This is a complex process evaluating models. There are all sorts of different models. Historically in AI ethics, binary classification has received all of the research and attention. But on our platform, we do a lot of recommendation and ranking. There is also scoring - you could have a simple regression model. And so all of those different types of models need to be evaluated differently.

Then you also have to consider the impact that a product has on a user. For example, if you've ever gotten an email inviting you to apply to a job - that's generally considered a positive intervention. So we'd want to make sure that the selection rate for that, for example, looks the same across different demographic groups.

As a negative example, facial recognition algorithms are often used in policing contexts. Usually if your face is getting flagged by a facial recognition algorithm, it's not great. So you want to make sure that your false positive rates are the same across groups.

But now let's talk about LLMs, because that's what everybody's here to talk about - LLMs and generative AI. This has added many new dimensions of complexity to how we think about evaluating models for fairness and for responsibility.

Discrimination and toxicity are a big one. If you're generating an AI recruiter email, you don't want any inappropriate language in there - no violence or hate speech, any of the really obvious stuff - but there are also more subtle things like exclusion bias and representation harm. You also wouldn't want that recruiter email to express a preference for young, energetic candidates - that's not great, there's some age bias there.

What if you're automatically generating a job description? We would not want a job description for a nurse to automatically put a bunch of female pronouns in there - that's a representation harm.

Other things we might care about are factual errors and misinformation. If we're generating a summary of a resume or why we think a candidate is a good match, are the things the text is generating actually true? We know they're not always true. And they may be true but not useful.

And then of course privacy violations. We will be training models on sensitive HR data and we would not want that leaked in any responses from these language models.

A bit more about how we're thinking about using language models:

One use case is a standard classification context - matching. We have job seekers, we have jobs, and we want to match job seekers with jobs they're qualified for and likely to be contacted about by an employer. We've been using language processing for this for a decade, but LLMs do a better job than traditional NLP approaches like n-grams because, with transformers, they take more context into account.

That's a classification approach, but we're also using LLMs for generative purposes, like drafting job descriptions to help employers. We might use them to explain job seeker/job matches - you might see an example where we're telling a job seeker this is why we think you should apply to this job: it matches your schedule, this employer pays well, etc. These are things the language model thinks are important to explain the match.

We're also thinking about chatbot use cases to help job seekers or employers, suggest jobs, etc. Of course this isn't a complete list - I'm sure everybody has 50 different generative AI projects going on.

So with all these complex use cases, we can use this product to help evaluate those models. One of the first things we care about is: is the output text factual? In the previous example we saw an explanation of why a job is a good fit - skills, relevant experience. We want to make sure that's actually true. If the summary says I have data engineering experience but I don't, we'd want to catch that. Similarly, maybe it points to something old and irrelevant rather than recent experience. We'd use the factual knowledge and summarization evaluation modules to see if we're producing factual, useful summaries. We may also need human annotators.

One thing we care about is biased language. I mentioned not wanting a nursing job description to use female pronouns. We wouldn't want a truck driver or doctor job description to use male pronouns. We could use the stereotype evaluation to look at how stereotypical the outputs are and hopefully reduce that.

Toxicity and inappropriate outputs are obvious - we don't want any of that. We can use the toxicity detection models to test for toxicity along different dimensions and get a classification of whether it's severe toxicity, hate speech, and so on.
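For a sense of what that pretrained detector does, here is a standalone sketch using the open-source detoxify package from Unitary; this assumes it is essentially the model referred to earlier, though the managed evaluation's exact model version and thresholds may differ.

```python
# pip install detoxify
from detoxify import Detoxify

draft_email = (
    "Hi, we reviewed your application and would love to schedule a quick call "
    "to discuss the role."
)

# Returns a score per dimension (toxicity, severe_toxicity, insult, threat, ...),
# each between 0 and 1; lower is better.
scores = Detoxify("unbiased").predict(draft_email)
for dimension, score in scores.items():
    print(f"{dimension}: {score:.4f}")
```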

And then performance disparities between groups - a big thing we do in responsible AI is measure model performance across groups. We can use the classification evaluation and join it with demographic data to measure performance across groups; we want the model to perform as well for men, women, and people of all races and ages.
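As a minimal sketch of that last idea, assuming you have per-record evaluation results and appropriately governed demographic attributes to join on (all column names here are hypothetical), the disparity check can be a simple group-by:

```python
import pandas as pd

# Hypothetical per-record evaluation results keyed by a record id,
# e.g. exported from a classification accuracy evaluation.
eval_results = pd.DataFrame(
    {"record_id": [1, 2, 3, 4], "correct": [1, 1, 0, 1]}
)

# Hypothetical demographic attributes for the same records.
demographics = pd.DataFrame(
    {"record_id": [1, 2, 3, 4], "group": ["A", "A", "B", "B"]}
)

# Join and compare accuracy across groups; large gaps flag a performance disparity.
per_group = (
    eval_results.merge(demographics, on="record_id")
    .groupby("group")["correct"]
    .mean()
)
print(per_group)
print("max disparity:", per_group.max() - per_group.min())
```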

I think we're a bit over time, so I'll hand it back to Emily.
