Explore image generation and search with FMs on Amazon Bedrock

Hi, my name is Rohit Mittal. I'm a Principal Product Manager on the Amazon Bedrock team, and today we are going to talk about a very interesting topic: exploring image generation and image search using foundation models on Amazon Bedrock.

I'll be joined by two other speakers: Ashwin Swaminathan, a Senior Manager of Applied Science on the Amazon AI team, and Andres Veles, a Principal Data Scientist at OfferUp.

Images are a very interesting form of expression. They grab our attention immediately, and our brain can process an image in less than 100 milliseconds. In the world of AI, one of the most captivating recent developments has been in visual art.

Using just a few text prompts, you can easily generate compelling images and let your creativity run wild. We have already sent astronauts on horses into the galaxy, created visually striking unicorns, and even put a sailboat inside a light bulb, like here.

The amount of visual content available right now, and being generated at this moment, is skyrocketing. With smartphones especially, people take more than 5 billion images per day, which is more than 1.8 trillion images per year. That's an enormous amount, right?

So think about searching through these images for any business application, searching through a product catalog on a retail website, or even going through your past memories to look for a photo you took with your family a few years ago.

It is getting more challenging every day. And with image generation tools, things have exploded in the last year alone: people have generated more than 15 billion images using these tools, and the trend is growing exponentially.

So we have a big problem: content is being generated at massive scale, and traditional machine learning and keyword-based search tools don't let you easily search through this content on your devices or for your business needs.

Generative AI has come to the rescue here as well, both for searching through this content and for generating compelling images by improving the efficiency of content creation.

In today's session, we will see how Amazon Bedrock solves both of these problems: helping you create compelling images with our image generation model, and helping you search through existing sets of images with our embeddings model.

On the agenda today, we'll go over the two new Titan models that were launched yesterday in Swami's keynote.

The first of them is Titan Multimodal Embeddings. The second one is the Titan Image Generator, which also handles image editing tasks.

Then we'll look at some example use cases and architectures, all of them using the Amazon Bedrock service, which is our foundation-models-as-a-service offering.

And then Andres will walk us through how search is relevant for OfferUp, helping local buyers and sellers come together on their platform, and how these Titan models will help achieve their mission.

A quick primer on Amazon Bedrock: Amazon Bedrock is a fully managed service that provides access to, and a choice of, some of the best-of-breed foundation models from leading companies such as AI21, Anthropic, Cohere, Meta, and Stability AI, as well as our Amazon Titan family of models, which are built from scratch at Amazon.

Until two days ago, the Amazon Titan models were primarily text focused. You could generate text using the two text models, Titan Text Lite and Titan Text Express, and use Titan Text Embeddings, which is similar to the multimodal embeddings but for text search and text similarity use cases.

Bedrock not only provides a choice of these models, but also a lot of capabilities that help you build entire generative AI applications for your use cases, and do it in a very private and secure manner.

So let's get to what we launched yesterday. These two new models, Titan Multimodal Embeddings and the Titan Image Generator, are again built entirely from scratch.

The first one, Titan Multimodal Embeddings, takes text, an image, or a combination of both and represents them as a set of vectors, a numerical representation we call embeddings. The main idea is that you can capture the semantic meaning of the input, whether it's text, an image, or a combination, in the form of vectors, which makes it very easy to search over them very quickly.

Compare that to the traditional form of searching using keywords, where you tag images with a set of keywords, manually or automatically, and then try to match the input query keywords against those tags. That doesn't give you the right accuracy, and it's not resource efficient either.

This model helps with use cases such as search, recommendations, and personalization, especially if all you have are images. For example, stock photography companies have hundreds of millions of images; their primary business is search, where end users try to discover unique content to buy.

And especially for the content creators themselves, discoverability is a major problem. This model helps you with all of that.

It is a state-of-the-art model that provides best-in-class accuracy in the embeddings it generates, so you can build search- and recommendation-type applications on top of it.

The second model is the Titan Image Generator. What we have focused on here is reducing the time it takes to create content, especially for industries such as advertising and marketing, where companies have products and want to generate compelling lifestyle images for them so their customers can see the benefit and click through, which results in a larger audience, more eyeballs, and more revenue.

Both of these models have high accuracy out of the box, so you can use the base models and get very accurate results. What we also differentiate on is the customization feature through the Bedrock API, a simple API for customizing these models on your proprietary data.

As with any AWS service, all your data is highly secure and private. Whatever inference calls you make are not stored by Bedrock and not shared with any third party. In the case of Titan, we also do not use any of the data you bring to customize these models to train or improve our own models. Everything is secure and private, including the customized model itself: we create a secure copy of the customized model just for you.

So not only do we provide a way to easily customize based on your data; think again of the advertising industry, where every brand has a particular style, particular brand aesthetics, in which it creates its ad imagery.

You can train the model on only a hundred or a few thousand pairs of images, bring them to the Bedrock API, create a customized copy of the model, and the model will then reflect your brand aesthetics in the images it generates.

We also provide a comprehensive set of features with our image generation model: not only text-to-image but also inpainting and outpainting. We differentiate on outpainting, again focusing on advertising and marketing use cases, or even e-commerce.

If you are a seller, you can bring in your product image, such as a couch. It's a very different thing to show a couch against a plain background versus in a homely setting, in a living room with a fireplace, which creates more warmth.

The idea is that you can create lifestyle images very easily using a simple API, with very low distortion and high accuracy. There are other features as well, which we will dive into in a few minutes.

One of the other things that is very important to us at Amazon, and that we have done with every Titan model, is the responsible use of AI and continuing to support best practices in responsible AI.

All the Titan FMs are built to detect and remove harmful content from the training data, reject inappropriate prompts from the user at inference time, and remove any harmful content the model might generate.

So you do not have to worry about implementing additional responsible AI guardrails on top of the model output. That is especially important with image generation models, as you can imagine, given how powerful this medium is.

We do not want to generate content that could be sensitive to certain people, toxic in any way, or biased. We see a lot of biases in our world, and what we are trying to do with these models is help mitigate those biases around gender, skin tone, and other stereotypes.

Turning to the first model: the Amazon Titan Multimodal Embeddings model, which is now generally available.

As I mentioned before, traditionally a lot of content owners would create tags for their images and then try to match the input query keywords against those tags.

Similarly, in the retail industry, you may have a lot of product images you want to sell or search over, but in many cases the description is missing, inadequate, or inaccurate.

I'll show you an example in a minute, but all of that leads to a discoverability problem: at the end of the day, end users are not able to discover the content the seller wants to sell.

So it's not a good experience for the seller, the buyer, or the platform itself.

What embeddings do is capture the semantic meaning from the images. Let me show you this example.

On the left-hand side of this image, you see the results for the query "blue sneakers without laces" using just text-based search over the product descriptions and keywords. Negations like "without" are particularly hard for keyword search.

The right-hand side, which is highlighted, shows the results from our multimodal embeddings model, because the images themselves capture the color and the fact that there are no laces.

The information represented in the embeddings captures the meaning of the images, and when the user searches with these keywords, the query is mapped into the same semantic space.

So we are able to correlate the input query with the set of images of the right shoes the customer was looking for.

It's a great experience for the end user, a great experience for the sellers because the right items are being surfaced, and great for the retail platform itself, because happy customers come back.
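To make the mechanics concrete, here is a minimal, illustrative sketch (not part of the talk) of what "mapping into the same semantic space" buys you: once query text and catalog images are embedded by the same model, ranking reduces to vector similarity. The `rank_products` helper and variable names are hypothetical.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Similarity of two embedding vectors, independent of their magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_products(query_embedding, image_embeddings, top_k=5):
    # query_embedding: vector for e.g. "blue sneakers without laces"
    # image_embeddings: dict of product_id -> vector from the same embeddings model
    scored = [(pid, cosine_similarity(np.asarray(query_embedding), np.asarray(vec)))
              for pid, vec in image_embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```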

There are three main benefits of this model specifically.

The first one is accuracy.

The second one is ease of use.

The third one is responsible AI.

Let's dive deep into each one of them.

Starting with accuracy: as I said before, this model is state of the art in terms of how accurate a search or recommendation experience you can build from the base model itself.

We also provide mechanisms for customers to further improve that accuracy, especially if you have data that is unique to your domain. Autonomous vehicle companies, for example, capture billions of images of a particular type.

Those could be wide-angle, wide-lens images. You can further improve the accuracy of the embeddings by providing a few hundred or a few thousand images of your type, and the accuracy will be better.

Again, it's all securely and privately customized for you, both the data and the model itself. We also provide another lever to customers.

The default embedding size we generate is 1024, which is based on conversations with hundreds of customers and on analyzing accuracy and latency; we believe that is the right embedding size for the majority of customers.

But as with everything in Bedrock, we want to give customers a choice, so we also provide the option to generate smaller embeddings of size 384 and 256.

Based on your needs, if you want to optimize further for latency, you can go down in embedding size. We provide that option to the customers.

Let's move on.

The second benefit is ease of use. The whole Bedrock premise is built around ease of use, and that is reflected in Titan Multimodal Embeddings as well.

We have a single API to generate embeddings. We have a batch API, so you can input, say, thousands of images to generate embeddings. We also have OpenSearch connectivity: if your content is already in Amazon OpenSearch Service, or if you are looking for a more integrated and seamless experience for generating embeddings and indexing your content, Amazon OpenSearch Service has a connector to Bedrock. You can generate embeddings with this model directly from OpenSearch and store them in the vector database that OpenSearch provides, so you don't have to go anywhere else.
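As a rough sketch of that single-API path (not shown in the talk), the call below uses boto3 to invoke the Titan Multimodal Embeddings model. The region, file path, and `embed` helper name are placeholders, and the request fields follow the published Titan request format, so verify them against the current Bedrock documentation.

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text=None, image_path=None, dimension=1024):
    # Build a request with text, an image, or both; dimension can be 1024, 384, or 256.
    body = {"embeddingConfig": {"outputEmbeddingLength": dimension}}
    if text:
        body["inputText"] = text
    if image_path:
        with open(image_path, "rb") as f:
            body["inputImage"] = base64.b64encode(f.read()).decode("utf-8")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",
        body=json.dumps(body),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```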

The third benefit is responsible AI. We filter harmful content from the training data for this model, and we help mitigate demographic biases. If you have a lot of images containing people, you don't want search results that only represent a specific skin tone, a gender, or other stereotypes and biases that we see in the real world. We have patent-pending technology that helps you minimize or mitigate all of that. That's another benefit, and a differentiator, that no other embeddings model provides.

Now let's look at the second model, the Titan Image Generator, which is now in preview. It provides a lot of features, and text-to-image is the main one. I'm sure you have already seen a lot of models in this space, but we have a lot of differentiators, and I want to go through them in terms of both features and benefits.

For text-to-image, we generate high-quality images from text prompts. The way to judge how good a model is lies not just in the output but also in how easily you can generate those images; you should not have to craft very lengthy, very detailed prompts. Our idea here is simplicity: you can use simple prompts to generate compelling images, and the images have low distortion, or hallucination if you want to call it that.

We also generate text within images. If you want a greeting card that says, for example, "Happy Thanksgiving", we'll show an example of how easily you can do that.

It understands complex prompts as well, which is different from the simplicity I mentioned. With simplicity, if you have something simple in mind, you should be able to express it easily. But in certain cases you want to say, "I want this object placed there, doing something": essentially a two- or three-step instruction. We can understand those complex prompts and generate accurate images. I'll show more examples of that.

Finally, in terms of customization: as I said before, customization is a key thing the majority of our customers are looking for, because every company has its own proprietary or unique data and its own brand aesthetics it wants the model to convey. The base model will do a decent job, but it can't know what is proprietary to you. With only a few hundred or a few thousand images, you can easily customize it.

For image editing, we provide inpainting, where you can replace an existing object with another one or remove an object: traditional inpainting. We also have automatic editing, which is very powerful and driven by our own model; you'll hear more about that from Ashwin. You input an image and just enter a text prompt describing the modification you want. The model understands your intent, as well as which specific object in the image you're talking about, and makes the modification for you.

You can also do the same thing using an image mask if you want; a specific mask will definitely give you more accuracy. So you can do either mask-based editing or mask-free editing, which we call automatic editing.

We have generative resizing, which you may be familiar with from DALL·E and other models, where you extend the image boundaries, and it can be guided with a text prompt as well. You can say, "extend the image on the left side." By default it will try to keep the elements that are in the original image, but you can also guide it to add something on that left-hand side.

Our outpainting is specifically aimed at advertising and marketing use cases, where you bring in a product image, such as a perfume bottle, and want to showcase it in a lifestyle setting, like I described before. You can do that very easily with this feature. With some traditional models you would have to go through several steps to generate that, and they add hallucinations or distortions to the input product image. Here we focus on very low distortion plus ease of use in generating what you're looking for. We also have image variations, where you provide an input image and we generate similar-looking images based on it.
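A minimal sketch of that outpainting flow, assuming the published Titan Image Generator request format; the file name, prompt text, and mask prompt are illustrative placeholders.

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("perfume_bottle.png", "rb") as f:
    product_image = base64.b64encode(f.read()).decode("utf-8")

body = {
    "taskType": "OUTPAINTING",
    "outPaintingParams": {
        "image": product_image,
        "maskPrompt": "perfume bottle",  # the product region to keep untouched
        "text": "perfume bottle on a marble table, soft evening light, cozy living room",
        "outPaintingMode": "DEFAULT",
    },
    "imageGenerationConfig": {"numberOfImages": 1, "height": 1024, "width": 1024},
}
response = bedrock.invoke_model(
    modelId="amazon.titan-image-generator-v1", body=json.dumps(body)
)
lifestyle_image_b64 = json.loads(response["body"].read())["images"][0]
```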

Let's look at some examples. This one is interesting because, as you know, a wine glass and an elephant are not similar in size, so it's quite complex to scale them correctly and reflect that in a composition of different objects. We have tried all of these examples with competitor models, and you can try them yourself and see the difference.

This next one is very interesting. If you read the prompt, it has three parts. First, it's a dog wearing a suit and a tie, so you generate that. Then you have to make sure it is printed on a pillow with the right aspect ratio and size. And then the pillow is on display with other furniture. It's a very complex prompt, and as you can see from the quality of the image, the model gets it right.

This one is again one of my favorites. It shows not only the complexity of the prompt, a person wearing a hat and dark glasses running in a forest, plus the properties of the forest itself, lush green with red and yellow flowers, but also diversity: as I said before, responsible AI is very important to us, and we ensure that gender and skin tone are reflected as they are in the real world.

Another example: I tried this with some of the competitors, and either both the boy and the girl were wearing the hat, or neither of them was, or the kimono was not right, or two boys or two girls were generated. Again, it's a very complex prompt where you're specifying that the boy is wearing the hat and standing next to a girl wearing a kimono.

This is an example of how it can also generate artistic images, and an example where I wanted to show how you can also generate text within the images. The prompt is an image of fireworks over New York City. We have some customers who want to let their end users generate these kinds of greeting cards to share on holidays or events, and the model can accurately generate such images.

As you know by now, if you heard Swami's keynote yesterday, we love iguanas here, so we have a photo of an iguana on a colorful background. These were all text-to-image examples, showing that not only is the quality good, but we also understand the complexity of the prompts and are still able to generate the right images for you.

Taking the iguana further: suppose you are a children's book company and you want to automatically generate images of a particular style, say a cartoon or a sketch. The model may already understand that, but if it does not, you can bring some of your own examples to the model, train it, and the model will then generate customized images for your use cases.

Here's an example of automatic editing. You start with an input image, which was actually generated by our model, and define a prompt: "change the flowers to orange." We understand what you're referring to and where the flowers are, and we make that update. For inpainting, again we start with the left-hand image, also generated by the model: a barn in a lush green forest with mountains behind it and a lake in front of it. You can add a car in front of it.

In the next step, you can take one of the generated images, now the left image, and change the mountains; you can adjust that, and you can add some ducks on the lake as well.

Here is an example of outpainting, like I was showing before: you bring in a product and can then generate any lifestyle background you want. And image variations: this is an example Swami also showed yesterday, where from a reference image the model generates similar-looking images.

We also want to show you our new image playground on the Bedrock console, with rich features. You can generate or edit images, provide negative prompts, select a style, select a size, choose the number of images to generate (between one and five), set the seed, and get the results directly.

With that, I'll hand it over to Ashwin to walk you through the benefits of this model as well as the use cases and architecture. Thank you.

So as Rohit was mentioning, as we built the image generator models, we started from three basic principles, and we wanted to make sure we do each of them really well.

One: we want to provide images with very high accuracy and high quality, so it's very easy for you as customers to use our models.

Second: we want to make the models very easy to use, building the right APIs and the right tools so it's easy for you to integrate the models into your own applications.

And finally, we put responsible AI front and center, so that you feel comfortable using our models and the models provide the right kind of diverse, non-toxic content that you can integrate with your applications directly.

Let's talk a little more about each of these categories. As you have seen in some of the images we just showed, we try to ensure that the quality of the images you get from our models is very good. As part of that, we did extensive evaluation studies over a wide range of prompts, looking into all the different categories of things customers would use the model for.

Categories such as composition, where you're placing objects relative to each other in a scene, or complex prompts: you saw some examples of complex prompts in the ones Rohit showed, such as a person running through a forest wearing a hat and dark glasses, where the forest had its own properties. We want to make sure our models are very good for these kinds of use cases.

We also want to ensure that the model supports text generation in images. If you want to use it to generate birthday cards, greeting cards, or road signs, you can use our models to put the text in the right places and integrate it directly with your applications. In terms of building the applications themselves, just like everything in Bedrock, we try to ensure we have the right tools and the right APIs, so that you as customers can call the APIs to build the application itself.

We provide a simple API for image generation. It takes text as input and generates an image, and, as you saw in the playground, you have the option to request different resolutions.

"We support around 8 to 10 different kind of resolutions all the way from a 512x512 resolution to 1024x1024 resolution. You are also able to play with different seeds. You can try different kinds of images, generate multiple number of images at the same time. So we provide all that as part of the API definition itself. So you can integrate it directly with your application and expose and use it as part of your workflows.

One of the important things we did was look at each of the features we launched and ask how we can make it easier for you as customers to use. Take inpainting as an example. We saw an image of a lush green forest, and you saw a car in that forest. Let's say I want to replace the car with a truck. A traditional workflow would be to go into the image, take your mouse pointer, mask out the region of the car, and use a text prompt to say, "replace this part with a truck." We do support those use cases for people who want to work with mask inputs.

In addition to that, we also provide an automatic editing feature where you don't have to go in and draw the mask. You can use a text prompt and say, "replace the car with a truck." We built our own in-house segmentation model that has world knowledge, understands the objects in the image, identifies the pixels that correspond to the car, and creates a mask automatically for you. It then calls the inpainting model to edit that region and generate a truck in its place.
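A minimal sketch of that "replace the car with a truck" edit: with automatic editing you pass a `maskPrompt` instead of a hand-drawn mask, and a `maskImage` can be supplied instead if you prefer mask-based editing. Field names are assumed from the published Titan schema; the file name and prompts are placeholders.

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("forest_scene.png", "rb") as f:
    source_image = base64.b64encode(f.read()).decode("utf-8")

body = {
    "taskType": "INPAINTING",
    "inPaintingParams": {
        "image": source_image,
        "maskPrompt": "car",            # the model segments the car for you
        # "maskImage": "<base64 mask>", # alternative: supply your own mask instead
        "text": "a red pickup truck",
    },
    "imageGenerationConfig": {"numberOfImages": 1},
}
response = bedrock.invoke_model(
    modelId="amazon.titan-image-generator-v1", body=json.dumps(body)
)
edited_image_b64 = json.loads(response["body"].read())["images"][0]
```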

So we looked at how to simplify our workflows so you can generate large quantities of content easily with our model. We also have an API for generating image variations. Say you've generated an image of a truck in the mountains and you want different variations of the truck, or different styles of the same photograph: how do we make that easy for you as customers? That's something we also looked at, and we have a separate API for generating image variations.
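A minimal sketch of the image-variation API just mentioned, again assuming the published Titan Image Generator request format; the reference image and style prompt are placeholders.

```python
import base64
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("truck_in_mountains.png", "rb") as f:
    reference = base64.b64encode(f.read()).decode("utf-8")

body = {
    "taskType": "IMAGE_VARIATION",
    "imageVariationParams": {
        "images": [reference],                          # reference image(s)
        "text": "same truck, watercolor painting style",
    },
    "imageGenerationConfig": {"numberOfImages": 3},
}
response = bedrock.invoke_model(
    modelId="amazon.titan-image-generator-v1", body=json.dumps(body)
)
variations_b64 = json.loads(response["body"].read())["images"]
```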

And finally, we also wanted to see what more we could do beyond the base model itself, so we provide fine-tuning capability. As customers, you can bring in your own data sets; we will fine-tune the model specifically on your data set and provide a model that only you have access to. Your data is not used for any of the training we do for our own model creation. This fine-tuned model is hosted separately for you; you have full access to it and can call it from any application you want to run.

In the fine-tuning workflow, the traditional approach is to bring in image-text pairs: you pass them as input to the API, for example in an S3 bucket, point the API at that S3 location, and we use the image-text pairs to fine-tune the model. One of the things we also looked at is how to simplify this workflow: what if the customer doesn't have to bring in the text and only provides the images? As something we will be launching soon, we have our own captioning model that will caption the images for you and use those captions to generate the image-text pairs for fine-tuning.
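A minimal sketch of kicking off that fine-tuning job through the Bedrock control-plane API. The bucket, role ARN, job and model names, and hyperparameter values are placeholders; check the exact training-file format and hyperparameter keys against the Bedrock customization documentation.

```python
import boto3

# Control-plane client (not "bedrock-runtime") for customization jobs.
bedrock = boto3.client("bedrock", region_name="us-east-1")

job = bedrock.create_model_customization_job(
    jobName="titan-image-brand-style-ft",
    customModelName="titan-image-brand-style",
    roleArn="arn:aws:iam::123456789012:role/BedrockCustomizationRole",  # placeholder
    baseModelIdentifier="amazon.titan-image-generator-v1",
    trainingDataConfig={"s3Uri": "s3://my-bucket/brand-style/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/brand-style/output/"},
    # Hyperparameter names/values are illustrative; consult the docs for valid keys.
    hyperParameters={"stepCount": "1000", "batchSize": "8", "learningRate": "0.00001"},
)
print(job["jobArn"])  # poll this job until it completes, then invoke the custom model
```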

At every step of these use cases, we try to look at how we can make it easier for you as customers to use the model and incorporate it into your workflows, so the overall process becomes as simple as possible. And thirdly, as I said, we also made sure our models are built responsibly; we apply a wide range of mitigations throughout our model development process.

We take responsible AI into account from day one to make sure the models, and the images they generate, are safe, so that you as customers are happy to use them without worrying about generating not-safe-for-work or toxic content. We did extensive filtering on the data side to improve our model creation process, so the data that goes into the model is good and clean. We also have guardrails and filters as part of our system to ensure that the text prompts you pass in are checked and verified so they won't generate toxic content.

At the same time, the image content the model generates also goes through filters to make sure it does not produce not-safe-for-work or violent content. As part of the image generation process, as Swami mentioned in the keynote yesterday, we also ensure the images are watermarked so you have an authenticity trace: any image generated by the Titan Image Generator model carries an invisible watermark.

Very soon we'll have an API where you can submit an image and we will verify for you whether it was created with the Titan Image Generator. This way we ensure that the images generated are safe and that there is proof of authenticity.

The last part is demographic bias, which you may have seen in a lot of the text-to-image models already out there. Let's take some examples. If you ask for an image of a couple posing together, most traditional models reflect the data set they were trained on, and there is generally some bias in that data set: it could be biased toward a particular category of people, a particular skin tone, ethnicity, or certain geographic locations.

As part of the training process, we ensure that our models see diverse data across different skin tones and genders, so that the models can represent a wide range of people. Another common bias appears with queries like "a lawyer", where there is an inherent bias toward a particular skin color or gender. We make sure the models we build cover a wide range of possibilities, so that you as customers can use them to generate these variations of different kinds of people and feel comfortable using them in your own applications.

As part of this, we also built patent-pending technology that takes the image information and the embedding information into an unbiased space, so we can use that to provide the diverse kinds of outputs you see in these images. All of that is integrated into the Titan Image Generator.

Now let's walk through some example use cases of how you can use our models. Say you want to build a multimodal search experience over a set of image and text pairs. The first step is indexing: you call our multimodal embeddings model to generate embeddings. As Rohit mentioned, we give customers the option of different embedding lengths. For customers who need very low latency, we provide a smaller 256-length embedding, which makes searches faster. For customers who want very high accuracy, you can index your image database with 1024-length embeddings.

We also made database integration easy. We have integrated with OpenSearch and its vector store, so it's very easy for you as customers to call the OpenSearch APIs, and you can call the Titan embeddings model from OpenSearch, so the whole system is integrated if you are part of the Amazon ecosystem. We also have an integration with Pinecone. So you have a wide range of options to choose from, depending on your use case and your needs, so you can use the power of the embeddings model while supporting use cases that require lower latency or higher accuracy.
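A minimal sketch of the indexing step against an OpenSearch k-NN index using `opensearch-py`; the domain endpoint, credentials, index layout, and the `embed` helper (from the earlier embeddings sketch) are all assumptions, not part of the talk.

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # placeholder
    http_auth=("user", "password"),  # placeholder credentials
    use_ssl=True,
)

# Create a k-NN index whose vector dimension matches the embedding length you chose.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {"properties": {
        "embedding": {"type": "knn_vector", "dimension": 1024},
        "product_id": {"type": "keyword"},
    }},
}
client.indices.create(index="products", body=index_body)

# For each catalog item, embed the image (and optionally its text) and index it.
vector = embed(text="leather couch", image_path="couch.jpg", dimension=1024)
client.index(index="products", body={"product_id": "sku-123", "embedding": vector})
```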

As part of the embeddings model, we also make sure the embeddings are defined in a responsible manner. If you have a database of images of people of different genders and skin tones, we ensure that the embeddings we generate cover all of them fairly, so that when a person enters a query, you can retrieve the corresponding images regardless of gender and skin tone.

Now let's talk through the querying workflow. In a querying workflow you go through the same process: you can search the database using text prompts, you can search using images to find similar items, or you can search using a combination of text and images.

For example, for retail use cases, say I want to retrieve an image of a green shirt but in a different checker pattern, or in a different color. You can think of complex searches that combine images and text, and we support those: I want to find an item with a particular style but a different color, and I can add a reference image to the search query.

You call the multimodal embeddings model, we generate the embeddings, and then you can use them for retrieval applications.
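A minimal sketch of the query side, reusing the hypothetical `embed` helper and OpenSearch `client` from the earlier sketches: embed the combined text-and-image query with the same model, then run a k-NN search against the index.

```python
# Combined text + image query: "the same style of shirt, but in green".
query_vector = embed(text="same style shirt but in green",
                     image_path="reference_shirt.jpg")

results = client.search(index="products", body={
    "size": 10,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["product_id"], hit["_score"])
```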

Let's go through one more example: text to image. With the text-to-image model, you start with the input text prompt, call the Bedrock API, and it generates an image; it's as simple as that, which makes it very easy for you as customers to use. For image editing, you can use a combination of inputs: an input image along with a mask, for customers who want to use masks, and we can do inpainting or outpainting with those masks, or we can do it with just a text prompt, as described earlier.

In that case, we generate the mask ourselves and run inpainting or outpainting with it. Now let's take an example of an ad creation process. You have a database of perfume bottles, and you want to create ad copy from this database. As a first step, you want to retrieve a good image of a perfume bottle to build the ad around, so you start by retrieving the top five or top ten search results from your database.

As a next step, you want to generate a very nice-looking image of the perfume bottle that you can integrate into your ad copy workflow; there you can use our image generation models. Then you can use image variations to create different variants: a perfume bottle in a different style, or a background in a different style.

And finally, you can use text models; we have a wide range of text models as part of the Bedrock suite, and you can use one of them to generate the full text description to post as part of the ad. With a combination of all the different models in the Bedrock family, you can go through this whole workflow.

So you can use the Titan Multimodal Embeddings model to search through a database and find the set of perfume images you want to work with.

You can use the Titan Image Generator to generate variations, add backgrounds, and add different kinds of scenery. And finally, you can also use the Titan text models and the other Bedrock text models to do the text generation for the copy.

Let's walk through a quick example. The first step would be to create a database. So you have a database of all perfume bottles, all products that you have as part of your whole workflow.

You would use the multimodal embeddings model to index the database and create the reverse index. Then, given a search query like "perfume bottle", you search through the database and find the most relevant items, so that you have a list of all the perfume bottles you can look through.

As the next step, you would call the image editing API. With the image editing API you can use automatic masks, do inpainting, and outpaint with different backgrounds. You can stylize the result and explore how the perfume bottle looks in a beach environment, a forest, or a cityscape; you can create different kinds of backgrounds, play with them, and make sure you end up with a good-looking image for your product use case.

You can then generate different variations of it, looking at different styles and different kinds of backgrounds, so that it becomes a very interesting, catchy image for the customer.

And finally, use the text generation model. Here you would call the Bedrock LLM text API with inputs such as the search keywords, product descriptions, or image backgrounds, and use all this information together. The LLM will create a good text description that you can attach to the image to produce the complete ad copy.
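A minimal sketch of that last step, asking a Bedrock text model to draft the copy; the model choice, prompt, and generation parameters are illustrative, not prescriptive.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

prompt = ("Write a two-sentence ad caption for a perfume bottle shown on a beach at "
          "sunset. Keywords: fresh, citrus, summer evening.")
response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {"maxTokenCount": 200, "temperature": 0.7},
    }),
)
ad_copy = json.loads(response["body"].read())["results"][0]["outputText"]
print(ad_copy)
```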

So with the Bedrock family of models, the Titan Image Generator, the Titan Multimodal Embeddings, and the whole LLM suite hosted in Bedrock, it becomes very easy for the whole use-case workflow to come together, so you have a one-stop place for running these kinds of experiments and creating these kinds of workflows.

So with that, I'll hand it over to Andres and he will talk about how they are using it for the OfferUp applications.

Good morning to everyone. My name is Andres Veles. I'm a Principal Data Scientist at OfferUp. I'm here to talk pretty much about how OfferUp builds transformative experiences using search with Amazon Bedrock.

For those who don't know about OfferUp: it was founded in 2011 and is now one of the largest mobile marketplaces for local buyers and sellers in the US. One interesting fact: one in five people in the US use OfferUp today, and we have transformed how local communities buy and sell products by providing an application that is unique, simple, and trusted for completing these transactions.

Our mission is to be the platform of choice for local commerce by connecting buyers and sellers through an interface that makes selling an item as easy as snapping a picture from a mobile device.

We are here to talk about search, right? So search in general is a really important piece in any application. And this is because by tweaking search, investing in search, you are able to leverage multiple application metrics.

Let's imagine a user goes to the application, searches for a product. If the product results are personalized, then it is going to save time and effort looking for what they need. If this process is optimized, the user finds what they need and will try to sign up.

Otherwise you get insights to improve the platform, go back and iterate to keep improving. After you apply these changes, they will materialize in the user experience, which in the end impacts user retention metrics which will increase.

There are multiple types of search engines and mobile search has its own challenges. The first one is broad and short search query types. Users tend to use approximately 1-5 keyword search terms. This is really complex because you give short context to actually look for what you want compared to other search engines where you generate more description, more context for your search.

Another aspect which is challenging is the variation in quality of user provided content. Some users provide really good quality images and information. However some provide low quality images and scarce titles and descriptions, which don't provide much context about the product.

Another challenge is the limited real estate making good ranking crucial to succeed in the application. These are some complexities in mobile marketplace search and it's an evolving area.

At OfferUp we are working really hard to keep up with the latest changes. We began with keyword search used all around the application. This technology has pros and cons. We use the keywords the user provides and match it in the engine to retrieve relevant listings.

On top of this technology, we explored neural search. We started using vector databases and exploring open source tools. We also used Amazon Kendra, and we deployed ML models built in-house for semantic search.

We took titles, descriptions and generated embeddings. We stored these embeddings in vector databases. Before going to production, we tested this new technology stack on two main aspects - latency and quality.

For latency, we performed experiments, backfilling millions of listings into OpenSearch. After backfilling, we applied 100 req/sec load on the database. We saw p99 latency under 16ms which is great. We are really happy to collaborate with the OpenSearch team at Amazon because of the great community support.

For quality assessment, we retrieved items using the top 25 most used keyword searches in the app. With this, we retrieved listings and computed relevant recall versus our keyword search baseline.

There was approximately a 23% increase in relevant recall using semantic search for low density locations and 27% increase for high density locations (high density = more products per area).

So great numbers! But there was a missing piece - the images. As you can see in this example, the title and description don't provide much meaningful information about the product. But the image gives a lot of details like color, shape, number of seats, size, etc. So the image provides a lot more useful context.

That's why it's really important to take images into consideration. For this, we are using the Amazon Titan multimodal model. We integrated this new component - we take the image alongside the title and description, pass it through Titan multimodal embeddings, and index into our vector database.

This enables us to search with these new multimodal components. Without this approach, certain items would not be discoverable. From a technical perspective, the partnership with AWS and Amazon has allowed us to test new technologies very quickly - this would have taken us months to build ourselves.

We tested the quality using the same methodology as before. With the Titan multimodal embeddings model, we saw a 9% improvement in relevant recall for low density locations. It also reduced variance for both low and high density locations which makes the system more stable in general.

To summarize our journey - we started with keyword search, then explored neural search and semantic search which increased relevant recall by 23-27%. Then we explored multimodal embeddings which further increased low density recall by 9% while maintaining performance and reducing variance.

What's next? We are researching how we can further leverage AI tools to support search, recommendations, image generation and more within OfferUp.

Thank you.
