Accelerate experiment design with Amazon Bedrock (sponsored by LaunchDarkly)

Robert Neil: Hello, thank you all for coming to this talk. I'm Robert Neil. I lead a group within product delivery at LaunchDarkly called Decision Science. We work on decision science products including experimentation.

Today I'll be talking about an underrepresented subject - you may not have heard about it before. It's called generative AI. Let's do a show of hands - has anybody heard of generative AI? Oh well, color me green and call me a pickle, I guess I'm wrong!

I'll be talking specifically about generative AI in the context of experimentation. My background is in decision science, including experimentation. I'll discuss some of the problems we have in experimentation and cool ways that LLMs help solve these in really novel ways. I'll talk about it within the context of a project we recently launched.

First, let me tell you about our company because it helps set the stage for the problems we're solving. LaunchDarkly has over 5000 customers across almost every industry. This poses unique challenges for developing decision science products.

As an example, I've worked on experimentation at several companies. Previously I was at Twitter, and I've built experimentation tools for Credit Karma and Udemy. Those are internal tools with internal customers, which makes building experimentation products easier.

With so many external customers in different industries, experimentation becomes more challenging. The decisions, metrics, and experimental designs needed vary widely. This requires some domain knowledge that LLMs can help unlock.

Let's start with the baseline - a good experimentation system needs a parameterization system. LaunchDarkly started as a feature flagging company to decouple deploys from releases.

Usually when you deploy code, it goes live in production. If something goes wrong, you have to roll back. With feature flags, you can release things without behavior changes. Teams can release features independently and roll them back without rolling back everything. This is critical for experimentation.
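To make that concrete, here is a minimal sketch of gating a release behind a flag with the LaunchDarkly server-side Python SDK. The SDK key, flag key, and context attributes are placeholders, and exact import paths and calls may differ slightly by SDK version.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

# Initialize the SDK once at startup (placeholder SDK key).
ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()

# Evaluation context for the current user (attributes are illustrative).
context = Context.builder("user-123").set("country", "US").build()

# The new checkout flow ships dark behind a flag; flipping the flag
# releases it (or rolls it back) without another deploy.
use_new_checkout = client.variation("new-checkout-flow", context, False)
checkout_version = "new" if use_new_checkout else "existing"
print(checkout_version)
```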

Why experimentation? Studies show if you just go by gut instinct, you'll be wrong most of the time - 70-90%. Just thinking really hard about an idea doesn't mean you'll ship the right feature. Experimentation measures effectiveness.

Personalization is another big reason for experimentation. LaunchDarkly is a rules engine to target users - like different versions for US vs Europe, age groups, or plan entitlements. You can experiment on those segments to build an optimized personalization engine.

I want to contrast this with how some teams deliver features. It's not "bad" but just a "feature factory" mentality of shipping and patting yourself on the back without measuring value. Experimentation is the gold standard for measuring value.

Some use pre/post measurement - ship something then compare metrics before and after. But other factors could cause differences, not your feature. Experimentation randomizes users to isolate the effect.
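As a rough illustration of that point, here is a toy simulation with made-up numbers: users are randomly assigned to control and treatment, so both groups see the same seasonality and external factors, and a standard two-proportion z-test measures the difference that's left, which is the feature's effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000
assignment = rng.integers(0, 2, size=n)  # 0 = control, 1 = treatment

# Simulated purchase outcomes: treatment has a small true lift.
p_control, p_treatment = 0.10, 0.11
converted = rng.random(n) < np.where(assignment == 1, p_treatment, p_control)

x_c, n_c = converted[assignment == 0].sum(), (assignment == 0).sum()
x_t, n_t = converted[assignment == 1].sum(), (assignment == 1).sum()

# Two-proportion z-test on the difference in conversion rates.
p_pool = (x_c + x_t) / (n_c + n_t)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = (x_t / n_t - x_c / n_c) / se
p_value = 2 * stats.norm.sf(abs(z))
print(f"lift = {x_t / n_t - x_c / n_c:.4f}, p = {p_value:.4f}")
```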

Let me explain how this is different to set the stage for how LLMs are so powerful. At Credit Karma 10 years ago, we had a PHP monolith. Everyone merged to main, then once a week we deployed. We watched key metrics and if something went wrong, we had to painfully roll back everything.

With feature flags, we still deployed weekly but people could release features independently. Introducing "run an experiment" isn't enough - you'd have to redeploy to roll out experiments. You need remote parameterization to modify experiments without new deploys.

Think of your code as a model. Experimentation is policy optimization to maximize value. Your software is the policy. Parameterize it to explore the space and find optimal values without changing code.

And then someone like marketing could modify those parameter values, create a bunch of them, and serve them without having to involve an engineer. The nice thing about that is that when you deploy your software, you no longer have to hope you included the variation that's going to be the winner. I'll talk about this more in a bit, but when you think about exploring the parameter space, and how different that is from how most people do product development, it's pretty striking.
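As a concrete illustration, here is a minimal sketch of what remote parameterization can look like with a JSON-valued flag variation. The flag key, parameter names, and values are illustrative placeholders, not a specific LaunchDarkly setup.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()
context = Context.builder("user-123").build()

# Instead of hardcoding one "final" version of the landing page, the code
# reads its parameters from a JSON flag variation. Non-engineers can add or
# tweak variations (headline, CTA color, layout) without a redeploy.
default_params = {
    "headline": "Shop our toggles",
    "cta_color": "#0044cc",
    "layout": "grid",
}
params = client.variation("landing-page-params", context, default_params)
headline = params["headline"]
cta_color = params["cta_color"]
print(headline, cta_color)
```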

So imagine you come up with a product idea, and it's a great idea. We talked about 90% of ideas being bad, but that's people who aren't in this room; I'm sure all of us have really good ideas. So we come up with this really good idea, and the way it usually works is we go talk to design and product, and they say we should tweak it this way and that way, and we finally all agree on the best possible instantiation of that idea. But then you run an experiment and find out it didn't do well. What are you supposed to take from that outcome?

One thing you could take away is that your idea is bad, but we know that for us in this room, that's not true. Another thing you might conclude is that this particular instantiation of the idea is bad. A lot of people don't actually get to explore much further, because the feature failed and they've already moved on to the next thing. What you really want to do is take that idea and parameterize a lot of it. Instead of just working with design and product and throwing out a bunch of candidate ideas, parameterize your feature so that you can experiment on all of those ideas, or iterate on them very quickly once you've launched and learned that one particular instance is bad.

The idea here is that you have some parameter space, and it's very unlikely that you pick out a single point in that space that's better than the thing you already have, your control. Now I'm going to talk a little bit about what I've been working on at LaunchDarkly for the past couple of years and how my previous experience at different companies influenced it.

I mentioned that I worked at other companies building experimentation products where the customers were internal. One thing all of those places had in common is that there was a group of data scientists who were experts in experimentation. When someone wanted to run an experiment, they would go talk to those data scientists and say, "Here's an idea for an experiment I have." The data scientists would respond, "You should use these metrics," or "You need this particular type of experiment design." The person would run the experiment, then go back to the data scientists and ask, "Here are the results I got; what should I make of this?"

But when you're building a product for external customers, like we are at LaunchDarkly, you can't depend on them having those kinds of resources. So when we built this product, one of the things we wanted to do was codify a lot of those questions into the experiment designer in the product itself.

One example, which I'll show on the next screen: when you first create an experiment, in addition to asking for the hypothesis and the name of the experiment, we ask what type of experiment you're going to run, because the experiment type changes the parameters that go into your experiment design in really important ways. If you get this wrong, it can invalidate the conclusion of your experiment, or make it very difficult to interpret without a data scientist.

Another key point here is the hypothesis. If any of you have used A/B testing platforms before, a lot of them don't have this idea of experiment types. In fact, this was a problem we had at Twitter. We built an experimentation platform and everybody in the company was expected to use it. People would want to run different experiment types; they'd say, "We want to run this type of experiment," let's say a multi-armed bandit, "and to run it we need your experimentation platform to support these features."

Before I joined Twitter, the team would add features every time someone asked. By the time I got there, the experimentation platform was a bit of a Frankenstein: everybody's feature was supported, but it had become very difficult to use because most people didn't need most of those features. One of the things I worked on at Twitter, which is incorporated here as well, is the idea of capturing the intent people have when they're interacting with the product, so the product isn't overwhelming.

So one of the things we do is ask what type of experiment you're running. We also ask for your hypothesis, which is critical. A little bit about my background: I started studying psychology as an undergrad, then moved to philosophy, and studied philosophy in both undergrad and graduate school, ending up in philosophy of cognitive science. So I have a background in the psychological and cognitive sciences, and I studied experimentation a lot as a practitioner of experiments in academia. One of the things that has influenced a lot of my thinking about experimentation is something called the replication crisis.

This has become fairly well known, so some of you may be aware of it. If you aren't: the replication crisis was particularly bad in psychology. The problem was that a lot of studies we thought helped us understand human behavior could not be replicated. Someone would find a statistically significant result, someone else would try to replicate the study, and it would turn out not to be replicable. As a result, people started to hold all kinds of ridiculous beliefs about human behavior.

One prime example: if any of you are aware of power poses, you're familiar with this. It was a TEDx talk; I think it was the second most downloaded TED talk for a while. The idea was that if you did a power pose, you would have higher testosterone levels and be more successful, and if you did some other pose that was considered negative, you would be less successful.

It turns out that couldn't be replicated. The problem was that a lot of people in psychology, nutritional science, and similar fields would run these experiments, track dozens of metrics, look for the ones that happened to be statistically significant, and then make up a narrative to fit them.

The reaction to this, the attempt to fix it, has been something called pre-registration. The idea is that you state what you expect to happen in the experiment before you run it. If you then find some other result that looks interesting, you don't get credit for having made a new scientific discovery.

Instead, you need to run another experiment with that new hypothesis. That's what this part of the product is doing: capturing the hypothesis, and, as you'll see in a second, setting what your primary metric is.

This is so that people don't just go fishing for metrics that happen to move. I'll share one more funny anecdote. When I was in college, I worked for an online marketing firm, and hopefully none of our clients are in the room, because they might be mad, although it's been about 20 years, so hopefully they've forgotten by now. One of the things I did was run experiments. I worked as a user experience researcher; we would design experiences for our clients and run experiments, and I was right about 90% of the time. The reason I was right 90% of the time is that Google had an experimentation product back then, and you could just keep running the experiment until it became statistically significant in your favor. Then you'd report it to your clients, and the clients would write big checks. It was great.

In hindsight, it wasn't really great, because I was just abusing statistics and doing the same sort of things that happened in psychology that made those results unreplicable. We want to avoid those things. We want people to make good decisions; we're in the decision science business, and we want to help customers make good decisions even if they don't have data scientists.
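To see how badly "run it until it's significant" distorts results, here is a small simulation with made-up parameters: two identical variants, with a peek at the p-value after every batch of traffic and a stop at the first p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_batches, batch_size, alpha = 2_000, 20, 500, 0.05
false_positives = 0

for _ in range(n_sims):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_batches):
        # Both variants convert at 10%: there is no true effect.
        a = np.concatenate([a, rng.random(batch_size) < 0.10])
        b = np.concatenate([b, rng.random(batch_size) < 0.10])
        # Two-proportion z-test at each peek (equal group sizes).
        p_pool = (a.sum() + b.sum()) / (len(a) + len(b))
        se = np.sqrt(p_pool * (1 - p_pool) * (2 / len(a)))
        z = (b.mean() - a.mean()) / se
        if 2 * stats.norm.sf(abs(z)) < alpha:
            false_positives += 1
            break

print(f"False-positive rate with peeking: {false_positives / n_sims:.2%}")
# Well above the nominal 5%, even though A and B are identical.
```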

You can see how this is setting the stage for where LLMs might be really interesting. We capture the hypothesis, we give you good defaults, and we have you select metrics. Notice that we have a primary metric. This is the idea that you really need to state which metric you think will move. Don't just add 100 metrics and pat yourself on the back if one of them moves; specify that you expect one particular metric to move. You can still add a bunch of other metrics.

But if those other metrics move, that's considered an exploratory analysis, and to confirm it you should run a follow-up experiment. Again, we're codifying all of this in the UI; it's the sort of thing a data scientist would walk you through.
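A quick back-of-the-envelope calculation shows why a single primary metric matters: with many independent metrics and no true effect, the chance that at least one of them looks significant purely by noise grows quickly.

```python
# With k independent metrics and no true effect, the chance that at least
# one crosses p < 0.05 just by noise is 1 - 0.95**k.
alpha = 0.05
for k in (1, 5, 20, 100):
    print(f"{k:>3} metrics -> P(at least one false positive) = {1 - (1 - alpha) ** k:.2f}")
# 1 -> 0.05, 5 -> 0.23, 20 -> 0.64, 100 -> 0.99
```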

Next, you choose your flag variations. One thing that's interesting if you look across the experiments people run, and another thing that will be really valuable when it comes to LLMs, is that it's often A/B testing where there's literally just an A and a B. Maybe we should blame the name "A/B testing," but people are doing what I mentioned earlier: they come up with an idea, iterate on it with their designers and product people, and then tell their engineers, "Go build this." They don't talk about parameterization or anything like that; they say, "Go build this feature that we all finally agreed on and got signed off."

Then they test that one variation against the control. Here we show three variations, but you often see people doing just an A/B test. So let me talk a little bit about that parameter space idea and how it affects experimentation.

I talked a bit about exploring the parameter space and finding an optimal policy. Now, let's say we're doing that thing where we talk to our designers and product managers and come up with one idea. If you think of this circle with all the points in it as your parameter space, and you're picking out just one way of instantiating your idea, then you're in that yellow circle.

Maybe the blue circle is your control and the yellow circle is where your idea is. If you then think about all the other things you could try, you should ask: what is the probability that I picked out the optimal point in all of that parameter space? These parameter spaces are often effectively infinite, so you have effectively infinite options and one instantiation of the idea.

What is the probability that you picked out the best one? It's probably pretty low. So you could use generative AI to help with this. One way you could use generative AI is you could go to some prompt interface and be like, "Hey, I've got this product idea. What are some good variations of that product idea that I might want to test?"

But of course, it's inconvenient to go back and forth between that prompt and your product development workflow, and you're probably also losing information. So one thing we'll talk about today is how integrating generative AI into the product lets you use more information and get better variation ideas.

Let's say you're still using that prompt. That's good. It improves your exploration a little bit. Then you combine that with feature management. So you can reduce the time it takes to experiment on all these things because you're able to parameterize your idea, incorporate these ideas that generative AI is helping you come up with. But you still have some challenges.

In fact, one problem is you might be exploring the parameter space too inefficiently. Here's one way you could explore more of the parameter space - you could just come up with random numbers or random strings depending on what you're testing. So let's say you want to test some marketing headlines, you could just have some random English text generator generate all possible headlines.

But that's actually pretty inefficient because it's very unlikely those headlines are gonna work for the use case you need. So you actually want to both explore a lot of the parameter space but in a very efficient way.

One thing we work on is thinking about how we can use the signals we have within the product already to help you explore that parameter space in a really intelligent way. We know we can learn from previous experiments, information about the flag, information about your metrics, information about your hypothesis. We can use all that information to feed into a large language model to focus just on the area of the parameter space that's most efficient to explore.

Now that's what we're doing. We've got this parameterized experiment, using feature management, that's good. We're using generative AI, but integrating it with the product to maximize our search. Importantly, generative AI is very valuable here, but only if you're using it the right way. If you use it inefficiently, it's only moderately better. But if you really integrate it into your product and leverage the data you have, you'll get way more value from that generative AI interaction.

Let me tell you a bit about what we're doing specifically and how we're integrating with the large language model. When you fill out the experiment, you're putting in a name, hypothesis, but you also have historical experiments to draw from. We can feed those into the large language model to figure out what's worked before. Instead of just going to Amazon and asking for suggestions without context, if you take historical info and info about your product feature already in LaunchDarkly, you can get an even better search space for the parameters you're trying to experiment on.
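As a hedged sketch (not LaunchDarkly's actual implementation), here is roughly what folding that product context into a single Amazon Bedrock call might look like. The model ID, prompt wording, and context fields are illustrative placeholders, and the `converse` API requires a reasonably recent boto3.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Context assembled from the experiment setup and past results
# (all values below are made up for illustration).
context_block = {
    "hypothesis": "Rewriting the landing-page header will increase toggle purchases",
    "existing_variations": ["Shop our toggles", "Toggles built to last"],
    "past_learnings": ["Shorter headlines outperformed long ones in earlier tests"],
    "target_audience": "customers aged 20-25",
}

prompt = (
    "You are helping design an A/B/n experiment.\n"
    f"Context: {json.dumps(context_block, indent=2)}\n"
    "Propose 3 new header variations that explore a different part of the "
    "idea space than the existing variations."
)

response = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": prompt}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.8},
)
print(response["output"]["message"]["content"][0]["text"])
```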

I talked a lot about how that's helping. I think there's one other thing I wanted to mention - personalization. Think about personalizing your product - you might want to change how you display products for certain audiences. We're able to take all that info and put it into the prompts for generative AI to come up with really interesting results.

I'll show a short demo of how this works in the product. This demo is by Arnold, the primary engineer who built this out. Shout out to Johnny Rolka, our lead designer for decision science, who also worked really hard on this.

Arnold: I'm an engineer at LaunchDarkly here to walk through using AI to help build an experiment on an ecommerce website. Let's say I have a store called Toggle Outfitters selling toggles. I've noticed an uptick in customers aged 20-25 but a downtick in toggles sold to them. Someone suggested the text on our landing page isn't relevant and we should try new options to sell more toggles.

So I'll use LaunchDarkly experimentation to test some text snippets. In the experiment builder, I'll name the experiment and hypothesis - improving the text will increase purchases.

LaunchDarkly's AI assistant will read the hypothesis and suggest how to design the experiment. I need to select a metric - the AI found the one I need, whether the user purchased a toggle. It also suggested if I want to create a new metric, I could use this name.

Next I'll choose the feature flag to control the experiment. The AI suggested names but none look right, so I'll create a new one using its suggestion - let's call it "header text" instead of "landing page text."

Now I need some text variations to try out. Let's make the current text the control. I'll use AI to help generate new variations targeting Gen Z. The AI will use this prompt and existing variations to create new ones.

Some of these look great, but I'll remove one I don't like. I still want more variations, so I'll ask for more with emojis in each. Sweet, we've got variations with emojis. This one's too long so I'll delete it.

Now that I have my feature flag with generated variations, I can start the experiment. One cool thing about integrating the LLM is the interface looks simple but we're doing a lot under the hood to make it really valuable.

Robert Neil: As Arnold showed in the demo, you can prompt the AI with specifics if you want to, but the product is also adding a lot of additional information into that prompt under the hood. That gives you a really seamless experience and a lot of value from the LLMs without the end user needing to know what's going on at all.

This is really cool because it allows you to generate more ideas without adding complexity. You can easily generate more variations without making the workflow more cumbersome. You don't have to go to a separate prompt, as in some other idea-generation products; it can all happen under the hood, leveraging the data we already have in LaunchDarkly for you.

This is an important point - I spend a lot of time thinking about experimentation. I go to conferences and hang out with experimentation people. One thing we've been talking about is this problem of people not exploring enough of the parameter space, which I've been thinking about for a long time.

We've tried to help with this in our product before, but LLMs make it really easy to encourage people to explore more of that parameter space in ways we struggled with previously. Especially in our product area with external customers across industries, think about a pre-generative AI world - how do you encourage people to come up with more variations for their experiments? This is where you'd typically have a data scientist consulting if you were working on an internal system.

I want to highlight that we designed our experiment builder 2 years ago to codify what data scientists would ask when designing an experiment, so you wouldn't need one next to you. But we knew there were gaps, and generative AI has made it easy to fill those gaps in ways we couldn't have imagined before.

Next, I'll talk about how we use Amazon Bedrock specifically - it's been a cool journey. We're big AWS users and work with Amazon on innovation projects since they see us as very innovative. Their AI Innovation Center came to us and asked if we'd like to partner on a Bedrock project.

We had some generative AI ideas, including this one, and tried some other models but weren't happy with the results. Being in decision science, you want people to make good decisions, and sometimes LLMs give funny answers. Some ideas we were iffy on had bad results.

But then Amazon gave us Bedrock access. Within a day, we validated an idea just using the UI prompts. We figured out how to incorporate it into our product - AWS already has the security we want, especially for customer data privacy.

Within weeks we had production code, and within months we built this entire product feature. It's been impressive to go from nothing to something great so fast. Working with the tooling has enabled us to establish a foundation for future generative AI work.

For example, when you type your hypothesis, we look at your flags to find a good match. To do that, we encode flag names for semantic search and leverage that vectorized customer data to give good, tailored results. With the Bedrock integration in place, it's easy to keep adding features like this.
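As an illustrative sketch of that kind of semantic matching, here is one way to embed flag names and a hypothesis with Amazon Titan text embeddings on Bedrock and rank them by cosine similarity. The model ID and flag names are placeholders; this is not necessarily the exact pipeline described here.

```python
import json
import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> np.ndarray:
    """Return the Titan embedding vector for a piece of text."""
    body = json.dumps({"inputText": text})
    resp = bedrock.invoke_model(modelId="amazon.titan-embed-text-v1", body=body)
    return np.array(json.loads(resp["body"].read())["embedding"])

flag_names = ["landing-page-header-text", "checkout-button-color", "new-search-ranking"]
hypothesis = "Improving the landing page text will increase purchases"

h_vec = embed(hypothesis)
scores = {
    name: float(np.dot(h_vec, v) / (np.linalg.norm(h_vec) * np.linalg.norm(v)))
    for name, v in ((n, embed(n)) for n in flag_names)
}
best_match = max(scores, key=scores.get)
print(best_match, scores[best_match])
```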

There are 3 key ways we think about building generative AI products:

  1. Transformational - What would the product look like if built new with generative AI?

  2. Every new feature - What's the generative AI way to build this?

  3. Existing features - How can generative AI enhance what's already there?

Tools like Bedrock allow quick prototyping and idea validation. I'm excited about generative AI - some of what we can do is really cool. We just published a guide on using LaunchDarkly when building generative AI products, like validating LLMs. You can parameterize your pipeline and experiment on it to optimize your generative AI approach.
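For example, here is a hedged sketch, with placeholder flag keys, parameters, and model IDs, of parameterizing a generative AI pipeline behind a LaunchDarkly flag so that the model, prompt template, and temperature can themselves be experimented on.

```python
import ldclient
from ldclient import Context
from ldclient.config import Config

ldclient.set_config(Config("sdk-key-placeholder"))
client = ldclient.get()
context = Context.builder("user-123").build()

# Each flag variation serves a different model/prompt/temperature, and the
# experiment measures which configuration moves the primary metric.
default_config = {
    "model_id": "anthropic.claude-3-haiku-20240307-v1:0",
    "prompt_template": "Summarize the following for a new customer:\n{input}",
    "temperature": 0.3,
}
ai_config = client.variation("genai-pipeline-config", context, default_config)

prompt = ai_config["prompt_template"].format(input="...customer document...")
print(ai_config["model_id"], ai_config["temperature"], prompt)
```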

That's it from me! Please visit our booth to see a demo; it's more impactful than just hearing about LaunchDarkly. And feel free to ask any questions.
