Building an AI comic video generator with Amazon Bedrock

Before I start, I want to see how many of you are actually data scientists - can you raise your hand? OK, very few, right? And how many of you are software engineers or know how to write code? A majority of you. How many of you know how to use GPT or have used it? OK, see, nearly everyone.

So just before I start, I want to highlight that AI is getting easier by the day with large language models - a lot easier. You do not need to be a technical person to use it. This is how AI can benefit society: by enabling everyone with AI capability.

Let me start by telling you a little bit about myself, so you understand what I do outside of work and why I do certain things. I really love technology so much. Yeah, so apart from using technology obviously at work for the benefit of the company, I also like to use technology to solve problems at home, especially to entertain my family.

I use my wife and kids as guinea pigs for the AI I build. It started a long time ago, six years ago, when my son Dexie was still in his mommy's tummy. We thought he was going to be a girl because of the signs - the shape of the tummy and the way my wife was behaving; she started eating veggies instead of meat. We thought it was a girl.

So we spent months trying to find a baby girl's name. But after a visit to the sonographer, she told us it was a boy. So we had to go back to square one.

So one early Saturday morning, I tried to see if I could build an AI model that would take Japanese and Russian names downloaded from the internet, combine them, and come up with an original name. This was the name that came up. I definitely did not want to name my baby this - I do not want him to get bullied when he grows up. Plus, when my wife found out what I was doing, she gave me this look - what are you doing? OK, a failed experiment. That's OK, I never give up.

So when Dexie was born two years later, this is where the fun begins. I like to play video games and he really loves watching me play video games - PS5, Xbox. But he does not know how to control these games because he's still very young. The controls are very complicated for him.

I really wanted him to feel what it's like to play video games, and I wanted to introduce him to tech as early as possible. So what I've done here is called Project Ring. I built an AI with a camera that tracks his body posture in real time and translates it into a flying bird in a 3D world. I hacked a game engine and integrated it with my homemade motion capture. He enjoyed it so much he could play the whole day.

The next project is called Project F. This time, my wife Yi was my experimental subject, because every morning she's looking at her wardrobe and asking me, what should I wear today? This is a very hard question for me. I have 10 t-shirts like this one, because I like it - it looks OK and it's comfortable - so I bought 10 of them.

Answering this question is very hard. So I wanted to build something that she can self-serve. So what I've done is build a system with a camera pointing to the front door. Every time she exits the house, it will do facial recognition, know it's her, capture her photo, call the weather bureau to check the weather, record it in a database.

I also built a mobile app where she can browse what she looked like on those days - it's like a fashion calendar. It's integrated with a machine learning system that analyzes what clothes to wear based on the weather, so it can recommend what to wear that day.

Yeah, this was very fun. I really loved it.

The last project I built is called Project Ellie. I'm pretty sure you are all familiar with COVID lockdowns. In Australia, especially in my state, we had 264 days of lockdown - very depressing, very stressful.

So December 2020, we had nowhere to go, cannot go anywhere. We were locked at home. I had two weeks spare, so I thought maybe I could do something fun.

Dexie at that time was 4 years old. He has a favorite teddy bear named Ella. Luckily we had two identical ones. So while he was playing with one Ella, within a week and a half, I turned the other Ella into a robotic bear.

Inside the teddy bear are so many moving parts. I replaced one of the eyes with a camera, put a nano device in the tummy with a computer vision model to track where the person is, do facial recognition, and a microphone and speaker to talk and listen. Inside her head, I built four servo motors so she can look around.

The brain is powered by a large language model - this was before ChatGPT was popular, about two and a half years ago - I was using GPT-3 so she could converse naturally like a human.

I have a very short video to show you exactly when Dexie discovered this for the first time and was having a conversation with her. Her name is Ellie because she's an electronic version of Ella.

[video plays]

So that's a very short video. If you want to know about all these projects, you can scan the QR code or find me on Medium. I share all the code on my GitHub publicly for free.

You might think that's the end of my talk and you can all go home - but no, now we get to the main topic: my latest project, a personalized comic video generator.

I'd like to start with why I built this. It wasn't a COVID lockdown this time - I could go out. So why did I build it? Maybe, like many of you, one of your jobs as a parent is accompanying your kids to sleep. Same with me - it's my favorite time of the day.

I always tell Dexie a bedtime story. The problem is I want to inject some moral value into the story. So if he's naughty or not eating dinner, I create a story to teach him consequences. Finding this type of book is really hard - there's no way I can say I want a storybook that teaches kids this lesson.

Instead, I've been crafting stories on the spot for him for six months. He really loves it, he prefers this over reading books. But after six months, I'm running out of ideas. I need a more scalable solution.

By the way, here are some of the stories I created on the spot for him - like a star lantern seeking a companion, how to make a friend. And a slow snail who uses his weakness to catch bank robbers by making them fall asleep.

Anyway, like I said, I need a sustainable solution. That's why I built an AI comic video generator to do this job for me.

Given a story title and photos of Dexie's toys, Owly is able to generate a comic video - a photo slideshow with music and audio narration that tells the story automatically, with no involvement from me.

I'll spend the next 20 minutes showing you how I built it and the challenges I faced. But first, let me show you what Owly produces. I got a portable projector from Amazon, and I play the videos in Dexie's bedroom.

[Show example video]

So that's Owly. Now let's start - how does Owly work? Here is the initial design architecture:

Given an instruction prompt like "Write a story about Bob the penguin's trip to Europe", I pass this to the story script generator. This generator will produce a full story script.

I then split every paragraph and send each to the comic image generator to generate relevant images. At the same time, each paragraph is sent to Amazon Polly to generate audio narration.

The story script generator also extracts keywords for the background music.

So in summary, that is how Ellie works to automatically generate a comic video from just a short text prompt. I'll now go into more details on how I built each component.
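To make that flow concrete, here is a minimal sketch of the pipeline in Python. The helper functions (generate_story_script, generate_comic_image, synthesize_narration, stitch_video) are hypothetical placeholders for the components described in the rest of this talk, not real library calls.

```python
# High-level sketch of the Owly pipeline (hypothetical helper names).
def build_comic_video(title: str) -> str:
    # 1. One LLM call produces the full story script, split into sections,
    #    plus a music-style keyword (details later in the talk).
    script, music_style = generate_story_script(title)

    image_files, audio_files = [], []
    for i, section in enumerate(script.sections):
        # 2. Each section gets a comic image from the image generator...
        image_files.append(generate_comic_image(section.scene_description, f"scene_{i}.png"))
        # 3. ...and an audio narration track from Amazon Polly.
        audio_files.append(synthesize_narration(section.text, f"narration_{i}.mp3"))

    # 4. Stitch images, narration and the chosen background music into an mp4.
    return stitch_video(image_files, audio_files, music_style, output="story.mp4")
```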

"Apart from creating a story script, it also responsible to choose a music style that matching the story. So from the chosen style, I simply picked the mp3 file that already provided that matching the style. Yeah. And finally I stitch the old generation music and this photo together to come up with a final video slide show. Very easy. Yeah. No. Get high, not complicated at all.

In the next section, I'm going to take you through, first, how I built the story script generator; second, how I built the comic image generator; and finally, the challenges I encountered and how I solved them - because most likely you're going to run into something similar if you build something like this. OK, let's start with the story script generator.

As you are probably already aware, this is going to be the job of a large language model. Given an instruction like this - "Write a story about Bob the penguin's trip to Europe" - the large language model is able to generate the entire script like this. Very easy: just one function call, pass in the parameters, and you get all of this.

One of the popular ways of using large language models, as you're probably aware from the recent announcements, is through the multiple foundation models within Amazon Bedrock. Amazon Bedrock hosts all of these popular models, including the ones built by the Amazon team themselves, the Titan models. You can pick and choose which one is suitable for you.

So let me show you how you actually start using Amazon Bedrock to generate text. You simply go to the text playground section there. First you need to choose which model you want to use from the drop-down at the top - you have a few options there. Once you've selected the model provider, you select the model you want to use. As you can see here, the model name and the context size are provided, so you need to choose the right model for your use case.

Some models are more powerful than others, but of course more powerful also means more expensive. On top of the size of the model, you also need to look at the context size. Context size means how many words - or tokens - of text you can pass into the model; you can see here that Titan is 4,000 tokens. The model I use, if you look at it here, is actually built by Anthropic, and specifically it is Claude version 1.3. This is good enough - we do not need to go to version 2.0 or 2.1 that was announced yesterday. This is good enough for the use case.

Once you've selected your model - because in this case I need to generate up to roughly 800 words for my story, I set the maximum number of tokens to 1,000, as you can see in the video there - you can start giving it the instruction. Very easy: you just type in here, "Write me a 300-word story about Bob the penguin", starting with the Human prefix, and ask it to break the story down into a maximum of five sections, so that I can split the sections and generate an image per section. Then you press the Run button - no coding needed - and that's it, the story is generated. See: Bob the penguin lives in Antarctica, dreams of traveling to Europe, blah blah blah. So it's super easy, right?

However, if you remember, I said earlier that the job of the story script generator is not just generating the story script - it also needs to choose a song style matching the story. So I went back and added an extra instruction: right at the start of the story, please suggest a song style from the following list that matches the story, and put it within angle brackets. The reason is so that I can extract it programmatically. I give it a list to choose from: action, calm, dramatic, epic, and so on.

As you can see, each of these categories maps to one of the mp3 files I have already provided. Now, when you press the Run button, you will see a new story, this time with the style in angle brackets - <calm>, for example. My code can extract that out and pick the right mp3 file, and there you go: a new story created from scratch.
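As a small illustration, extracting that angle-bracketed style and mapping it to a background track only takes a few lines of Python. The file names here are just placeholders for the mp3s I mentioned:

```python
import re

# Map each allowed style keyword to a pre-made background track (placeholder file names).
MUSIC_FILES = {"action": "action.mp3", "calm": "calm.mp3",
               "dramatic": "dramatic.mp3", "epic": "epic.mp3"}

def pick_music(story_text: str) -> str:
    # The prompt asks the model to emit the style inside angle brackets, e.g. <calm>.
    match = re.search(r"<([a-z ]+)>", story_text.lower())
    style = match.group(1).strip() if match else "calm"   # fall back to a default style
    return MUSIC_FILES.get(style, MUSIC_FILES["calm"])
```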

Obviously, you will want to write Python code to integrate this into your system. By clicking that button just now, you can see the snippet of JSON payload that you need to pass into the API call, containing all the necessary parameters. Very easy - just one click of a button, copy that, and put it in your code like this. You can see that this is how you call the Bedrock API behind the scenes.
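For reference, a minimal version of that call with boto3 looks roughly like this - I'm assuming the Claude v1 model ID and the request/response format that the playground's JSON snippet shows, so treat the exact values as a sketch:

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # region/credentials come from your AWS config

prompt = (
    "\n\nHuman: Write me a 300-word story about Bob the penguin's trip to Europe. "
    "Break it down into a maximum of five sections. At the start, suggest a song style "
    "from this list that matches the story and put it within angle brackets: "
    "action, calm, dramatic, epic."
    "\n\nAssistant:"
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v1",              # Claude 1.3 on Bedrock
    body=json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": 1000,           # enough room for an ~800-word story
        "temperature": 0.8,
    }),
)
story_text = json.loads(response["body"].read())["completion"]
print(story_text)
```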

OK, so now we are done with the story script generator. But how do we build the image generator? The job of this generator is simple: given a text such as "a penguin standing on an iceberg in Antarctica", it needs to generate an image like this. That's the requirement here.

One of the most popular image generation models you can use is called Stable Diffusion. The latest version is XL, but I built this early this year, so I'm using the 2.1 model.

How did the Stable Diffusion team build this model? It was trained with 2.3 billion pairs of images and captions, so that it understands what real-world objects look like. Once it has been trained with that many images, given an instruction like "a turtle is swimming in the sea", it is able to generate the relevant image.

One thing I want to emphasize here: Stable Diffusion does not just memorize that training set of 2.3 billion images. It's not that you give it an instruction and it simply pulls out one of those 2.3 billion images. No - it generates an original image, like you can see here. Given a very weird caption - something with banana topping, a Shaolin monk riding a motorcycle in the countryside - you will never find this image on the internet, right? It is able to generate this image from scratch.

Training a Stable Diffusion model is actually very expensive. It requires the billions of images I mentioned earlier, plus you need a lot of GPUs to train it - not many people can afford that. However, you do not need to panic: Amazon SageMaker JumpStart has a collection of open-source models, and one of them is Stable Diffusion 2.1, already pre-trained. So you do not need to train it from scratch - it is ready for you to generate images. And using Stable Diffusion from JumpStart, you do not need to understand a lot of data science either, because it is just a few lines of code: pass in your text and you get the image, as you will see later.

You start by selecting this Stable Diffusion model from the JumpStart collection, and then you press Open Notebook. What happens is that Amazon SageMaker Studio launches, and it gives you all the code you need to generate the images, like what you see here. There are many variants of Stable Diffusion 2.1: when you run this code for the first time, it shows a drop-down for you to choose which particular Stable Diffusion model you want - and I need one I can fine-tune. The one you can actually fine-tune further is the last one there, the one with the tick next to it.

Once you have selected this model, you execute this code, still within the notebook - you do not need to write any code yourself to deploy the model to an endpoint - and it starts generating images for you. Within Amazon SageMaker Studio, you can start passing in an instruction such as "Create me an image of a turtle swimming underwater", and that's the image you can see right away within the studio. It's very, very convenient.
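Under the hood, the notebook boils down to only a few lines. A simplified sketch using the newer JumpStart SDK classes might look like the following - the exact model ID, instance type and response format depend on the JumpStart version, so treat these values as assumptions rather than the exact code from the notebook:

```python
import base64
from sagemaker.jumpstart.model import JumpStartModel

# Deploy the pre-trained Stable Diffusion 2.1 model from JumpStart to a real-time endpoint.
model = JumpStartModel(model_id="model-txt2img-stabilityai-stable-diffusion-v2-1-base")
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# Ask for an image; recent JumpStart text-to-image models accept a JSON payload like this.
response = predictor.predict({"prompt": "a turtle swimming underwater"})

# The endpoint returns the generated image (commonly base64-encoded) in the response body.
with open("turtle.jpg", "wb") as f:
    f.write(base64.b64decode(response["generated_image"]))
```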

It looks like all the problems are solved and all the modules we need are built. But there is no tech story without challenges, right? And here is the challenge: when I connected everything together and started generating the images, I got this. The first one is "Bob the penguin had a trip to Europe". The second paragraph is "one day, Bob decided to take the plunge and begin planning his trip" - and the image is someone's legs. The third one is a guy in a black suit. The last one is a guy in a blue shirt. Very disappointing. How could this happen? I call this problem character inconsistency. If you look carefully at each of these paragraphs: the first one refers to Bob the penguin, but it does not mention what type of penguin he is, so the image generator just picks one. The second paragraph refers to Bob as just "Bob" - there is no instruction that he is supposed to be a penguin. The generation happens locally, per paragraph; it is not aware of the previous generations, hence there is no consistency between them. The third and fourth ones have the same issue: "he" could be anyone, not necessarily a penguin. So this is a problem you are going to face if you build something similar.

The second problem I call "out of focus". Look at the first paragraph there: it contains two things. The first is that Bob the penguin always dreamed of traveling to Europe; the second is that he is tired of his routine life in Antarctica. The image generator gets confused about which one it should focus on. If you have a very long paragraph with many focal points, it confuses your image generator. This is what the picture looks like - a real penguin in Antarctica - which is totally different from what we wanted. Same thing with these two. This is what you would expect: when Bob decided to take the plunge and plan his trip, it should be him reading a book or looking at some material to plan the trip. And here I want to emphasize that the image generator even got confused by the phrase "take the plunge", rendering it as somebody doing a bungee jump. So we know that a long paragraph with multiple focal points is bad - something for you to stay away from.

So how do we fix this? It's actually quite simple.

I tackled this with prompt engineering. I modified my prompt to include an extra instruction for each section:

"Please describe the scene in details in one short sentence."

Like the one in yellow there at the bottom. That way, my story script generator now creates one short sentence that describes the scene for every paragraph, and it is very focused.
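To give you an idea of what the splitting looks like in code, here is a hedged sketch. It assumes each section in the model's reply ends with a line starting with "Scene:" carrying that one-sentence description - adjust the parsing to whatever format your own prompt produces:

```python
import re

def split_sections(script: str):
    """Split the LLM's story script into (paragraph, scene_description) pairs.

    Assumes the prompt asked the model to end each section with a line like
    'Scene: <one short sentence describing the scene>'.
    """
    sections = []
    for block in re.split(r"\n\s*\n", script.strip()):       # sections separated by blank lines
        match = re.search(r"Scene:\s*(.+)", block)
        scene = match.group(1).strip() if match else block    # fall back to the paragraph itself
        paragraph = re.sub(r"Scene:.*", "", block).strip()
        sections.append((paragraph, scene))
    return sections
```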

Instead of passing the actual paragraph to the image generator, I pass in this scene description, and now you can see it's way better. All of them are a penguin now. But there is still a catch: they are all different penguins, right? And you probably wonder - seriously, you are building this for a five-year-old, would he even care? Wrong.

Dexie's standard is very high. I have been building all these toys for him, and his requirements are a lot harsher than some of the product managers at our company - very, very strict. So this is not going to fly with him. I need to fix this.

So how do I fix it? Initially, I tried to fix this through prompt engineering again. What if I give an instruction to describe Bob in detail every time the story script refers to him - like "a photo of a small black emperor penguin from Antarctica" - and keep using this again and again everywhere? For a penguin, that may not be too bad.

There are not that many species of penguin - at least from my point of view; I am not an expert there. But if it's a bird, this becomes a serious issue: there are thousands of bird species with different sizes, different shapes, different colors. Describing a specific bird is very, very complex.

Take this as an example: "a photo of a small green bird with..." - whatever it is, I can't even pronounce its name. A butterfly is even more complex; to describe a butterfly you need something like this. Imagine if every time you refer to this butterfly you have to put in this chunk of text again and again - this is going to be an issue, right?

It's text overload, and it eventually leads to inaccurate image generation. As you can see on the left, even though I provided this instruction, the image generator still incorrectly generated those bird features in the wrong color - not yellow.

So what is the solution? Interestingly, the solution lies in that very famous sentence: a picture is worth a thousand words. What if, when we give the instruction to Stable Diffusion to generate the image using text like "a photo of Bob the penguin in the jungle", we also provide a photo of what Bob is supposed to look like? An image carries far more information than text.

That photo describes Bob exactly as we want Bob to look. So this is exactly what I do, and this technique is called fine-tuning. That way, the Stable Diffusion model is now able to generate the exact Bob the penguin that we want, again and again across multiple paragraphs.

How do you fine-tune a Stable Diffusion model? A Stable Diffusion model, as I said, already comes pre-trained, so it already has the knowledge to generate images - which is exactly how 90-95% of people are going to use it. With fine-tuning, you basically provide it with 5 to 10 images of a new object so that it can understand what this new object looks like and create totally new scenes with that object. That is fine-tuning at a high level.

What is happening inside Stable Diffusion? If you remember, I mentioned that the pre-trained Stable Diffusion model has a mental picture of what a mango looks like, what a penguin looks like, what a car looks like. When you fine-tune it, you provide a few photos of Bob the penguin and you also say "this is Bob the penguin". It now creates a new object within that mental picture - Bob the penguin - and this is what he looks like.

That's how it can now generate Bob the penguin whenever you mention Bob the penguin in a sentence. Again, no rocket science - easy. And interestingly, these are some of the images generated along the way. The first one is Dexie's magnetic blocks - he tried to make a kangaroo using magnetic blocks - and you can see the image generated underneath, at the bottom, looks like a hybrid between a kangaroo, the magnetic blocks and a giraffe.

Apart from solving the character inconsistency problem, this ended up becoming a new feature, because now I can use Dexie's toys as the characters of the story, which enhances the experience. He is totally mind-blown by this.

I also want to show you a few failure examples. Stable Diffusion is not black magic. When you ask it to create an image that makes no sense and has never been seen before - like a hedgehog milking a cow; there is nothing in the training set even close to showing how a hedgehog would milk a cow - it creates some weird image like this.

Same with the hedgehog arranging a flower bouquet. But I believe this is going to get better over time - the newer versions are already a bit better than this. All right, so how do you do it?

It sounds complicated, but in practice, when you actually implement this, it's not hard. For fine-tuning Stable Diffusion, all you need to do is provide the training set - images like the ones here. If you look at the ones in the red box, you provide 5 to 10 images of the object that you want Stable Diffusion to learn, and you provide a JSON file that contains the instance prompt: "a photo of xyz penguin".

It does not have to be "xyz penguin" - it can be anything, as long as it is a unique word. Additionally, you can also provide more images that show what a generic penguin looks like, because I am teaching my model to learn about this particular penguin.

The closest object represented in the original training set is a normal penguin, so you can supply hundreds of images of penguins to help it learn even better. If you do not have them, that's OK - it is not the end of the world; Stable Diffusion will automatically generate these images for you.

All you need to do is specify the class prompt, "a photo of a penguin", and that will be used by Stable Diffusion to generate those images for you. So all of these images plus the JSON file, you just put in an S3 bucket - that's all you need to do. Next, I just need to add one extra line to the code I showed you earlier to fine-tune the model further, passing in the S3 location where those files are.
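If you want to see what that looks like in code, here is a minimal sketch. I'm assuming the JumpStart estimator interface and the dataset_info.json keys used by the Stable Diffusion fine-tuning container; the bucket name, model ID, instance type and hyperparameter values are placeholders to adapt to your own setup:

```python
import json
import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator

bucket = "my-owly-bucket"                           # placeholder bucket name
prefix = "training/bob-the-penguin"

# dataset_info.json sits next to the 5-10 training images in the same S3 prefix.
dataset_info = {
    "instance_prompt": "a photo of xyz penguin",    # unique token for the new object
    "class_prompt": "a photo of a penguin",         # generic class used for extra class images
}
boto3.client("s3").put_object(
    Bucket=bucket, Key=f"{prefix}/dataset_info.json", Body=json.dumps(dataset_info)
)

# Fine-tune the pre-trained Stable Diffusion 2.1 model on those images.
estimator = JumpStartEstimator(
    model_id="model-txt2img-stabilityai-stable-diffusion-v2-1-base",
    instance_type="ml.g5.2xlarge",
    hyperparameters={"max_steps": "400"},           # assumed hyperparameter name
)
estimator.fit({"training": f"s3://{bucket}/{prefix}/"})
# After training, deploy exactly like before (estimator.deploy()) and use the
# unique token ("xyz penguin") in your prompts.
```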

And of course, once the training is completed - which takes roughly half an hour - you deploy the model using the same code you have seen previously, and that's it. Now you can start using this model to generate images that feature the object you fine-tuned the model on. Here you can see Nick the kangaroo sitting next to a lake, and a turtle swimming with a school of fish.

So now you understand: the initial architecture is modified slightly by adding the extra fine-tuning step so that Owly is able to generate these new concepts. These are all the bits and pieces you have seen earlier, and we want to glue them together so it becomes an automated system - press the button and everything just works.

I use AWS Batch to orchestrate the whole thing. First, it pulls the images from the S3 bucket, plus the JSON file containing the story title, and starts calling the Bedrock LLM to generate the story script. It then pushes the images for fine-tuning the model into SageMaker.

Once the training is completed, a batch transform job is used to load the model and generate a few images per section of the story. Amazon Polly is also used: each section of the story is passed in to generate the audio narration. Finally, everything is stitched together using the MoviePy Python framework, plus the mp3 music file, and the final mp4 file is dropped into an S3 bucket.

So that's how everything is orchestrated.
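The narration and stitching parts are the most straightforward. Here is a minimal sketch of that last mile with boto3 and MoviePy - the file names and the Polly voice are placeholders, not the exact ones I used:

```python
import boto3
from moviepy.editor import (AudioFileClip, CompositeAudioClip, ImageClip,
                            concatenate_videoclips)

polly = boto3.client("polly")

def narrate(text: str, out_path: str) -> str:
    # One Polly call per story section; "Joanna" is just an example voice.
    resp = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId="Joanna")
    with open(out_path, "wb") as f:
        f.write(resp["AudioStream"].read())
    return out_path

def stitch(image_files, narration_files, music_file, out_path="story.mp4"):
    clips = []
    for img, mp3 in zip(image_files, narration_files):
        audio = AudioFileClip(mp3)
        # Show each comic image for as long as its narration lasts.
        clips.append(ImageClip(img).set_duration(audio.duration).set_audio(audio))
    video = concatenate_videoclips(clips, method="compose")

    # Mix the chosen background track underneath the narration, at low volume.
    music = AudioFileClip(music_file).volumex(0.2).set_duration(video.duration)
    video = video.set_audio(CompositeAudioClip([video.audio, music]))
    video.write_videofile(out_path, fps=24)
    return out_path
```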

Here are a few of the stories Owly generated. If you go to my YouTube account, there are a few videos I keep uploading to show the new stories it has generated - that's the QR code there.

I also want to show you something interesting I discovered while building this: mixing concepts together. If you remember the diagram I showed you earlier, this is how Stable Diffusion learns a new concept - it has a mental picture of objects, and you fine-tune it to learn the new concept. But you can also use Stable Diffusion fine-tuning to overwrite an existing concept.

Let's take a look at this example. Stable Diffusion has a mental picture of what a cat looks like. What if I fine-tune it by passing in this image of a baby chick, but I call it a cat?

Can somebody guess what's going to happen? Yes - this is what it now thinks a cat looks like. This is also a lesson for us: if you train your model with the wrong data, it will do the wrong thing - it just copies the training set, just like this. So now, with this, I can create a dancing cat photo like this.

It's very fun. So I use this as an extra feature for Owly: I can cross two objects together to generate a story like this. This one here is a frog and crab hybrid, and the one on the right is a tiger and an elephant. It's really fun for Dexie.

So the takeaway: generative AI redefines what software can do. All of this was impossible previously, but now you can do it, just like magic. And building this tech is very easy - it even gets easier by the day. So you should go and try it yourself, because it's really, really fun.

Before I finish, I'm going to have a Q&A session after this. If you want to ask me a question, stay here - there's a microphone there you can walk towards, or you can just shout and I will repeat your question from here. Up to you.

Thank you for listening. If you want to reach out to me, find me on LinkedIn or my Medium blog, where all the projects I've built are described in a lot more detail. And don't forget to fill in the session survey in your mobile app. Awesome. Thank you.
