How to build generative AI–powered American Sign Language avatars

Good afternoon, everyone. Earlier this summer, Suresh and I were at the Midwest Community Day delivering a Generative AI on AWS keynote. Right after the session, we walked over to a Starbucks, and on the way we were discussing the endless possibilities that generative AI can open up.

So at the Starbucks coffee shop, we ran into someone at the table next to us. He had an extra chair, so I walked over to ask if I could borrow it for our table. He started communicating back with me in sign language. That is when I realized that he was deaf or hard of hearing. I didn't understand what he was trying to communicate, and I did not know how to sign back either. It was a moment of realization: a communication breakdown.

Assistive technologies exist that can translate English to other spoken languages, but this was a unique problem that does not have a solution today. That's when we thought: what if we could utilize the AWS cloud and generative AI capabilities to build an application to enable individuals who rely on visual communication? That is why we are here now, very excited to show you all how we built that solution.

My name is Alara Da, I'm a senior solutions architect here at AWS.

Hello everybody, I'm Suresh Ban, I'm a senior solutions architect with AWS.

Hey everyone, I'm Rob Koch, I'm a principal data engineer at Slalom. I'm deaf, obviously, so I'm signing, and I have a sign language interpreter in the front row voicing for me.

In this session, we're going to be talking about how to build generative AI-powered American Sign Language (ASL) avatars.

On the agenda today, I'm going to walk through what accessibility and assistive technology are, talk about American Sign Language, and introduce the solution we built. I will then hand it over to Suresh, who will demo what we have built and dive deep into the architecture and code.

This is a level 300 advanced session focused on a builder audience. With that, let's get started. Rob is going to kick us off - over to you, Rob.

[Rob:] You see this picture here, and as we say, it's worth a thousand words, right? There are people here, families - a dad with a stroller, and maybe a mother with a kid on a scooter - and they're all walking across the street. You'll notice the curb here is cut, with a dip in it. That's for accessibility: the little girl with the scooter can get up the ramp and onto the sidewalk without stopping to pick the scooter up. The stroller doesn't have to be finagled up the curb either, and the family doesn't have to pick up their kid and put them onto the sidewalk. It also helps an elderly person who has difficulty with steps.

So all of these people are using the curb cut, and that is called the curb cut effect. The curb cut was designed to provide accommodations for wheelchair users, but many others benefit from it - that's where the phrase "curb cut effect" comes from. Another example of that effect is subtitles on TV. They were intended for the deaf and hard of hearing, but a lot of people use them: if you're in a noisy environment, if you have an auditory processing disorder, or maybe English is your second language and you want to see what is being said a little more clearly. Subtitles can aid in all of that.

So that's the curb cut effect illustrated here. I'm going to turn it back to Alara, and she will discuss a little bit more about ASL.

[Alara:] Thank you, Rob.

So with that, let's talk about what accessibility is. Accessibility means anyone who is using a product or a service must receive the same benefit regardless of the condition or disability that they may have. So it's really about identifying the barriers in the environment and working to remove them. Some barriers can be removed by introducing special equipment that people with disabilities can use, while other barriers can be removed by changing the environment itself, like the curb cut example that Rob talked about.

What is assistive technology? It's a set of tools that people with disabilities can use to accomplish tasks. Visual, auditory, physical, speech, cognitive, and neurological disabilities must all be considered when implementing accessibility measures.

Why do we care about accessibility? First and foremost, it's the most human thing to do, right - to care about others. If we exclude people, we're creating barriers, and the products we create are no longer accessible to everyone.

According to the World Health Organization, 15% of the world's population - more than 1 billion people - has a significant disability. If this population were a country of its own, it would be the third largest, after India and China.

Globally, more than 2.5 billion people rely on one or more assistive products. 466 million people worldwide have hearing loss, including 34 million children, the majority of whom have hearing parents.

It's also estimated that by 2050, more than 3.5 billion people will need one or more assistive products, and 900 million individuals will experience hearing loss or communication issues.

In order for us to know the type of audience that we're talking to, we would like to take a moment to hear from you all.

[Poll question 1]

Awesome, the majority of you know what American Sign Language is. It's great to know that. Let's go back to the slides.

Sign languages are visual, gesture-based languages which are regarded as the mainstream languages of the Deaf or hard of hearing communities. Hand gestures, body movements, facial expressions are all utilized in communicating in a sign language.

American Sign Language (ASL) is a fully developed and complex natural language that possesses all the properties that exist in spoken languages - with its own grammar, syntax, morphology, regional and social variations.

In ASL, the smallest units of the language, called parameters, are combined to form a sign. These parameters include handshape, palm orientation, location, movement, and facial expression. There is no human concept that cannot be accurately and completely expressed in ASL. For many Deaf people, ASL is the primary language, and English is a second, third, or fourth acquired language.

ASL is used in the U.S. and parts of Canada as well. Let's take another moment to do a poll and hear from you all:

[Poll question 2]

Yes, the majority of you have the right answer - ASL is not a written language. So let's go back to the slides.

ASL gloss is not ASL in a written form. ASL gloss is not a way of making ASL a written language.

So then what is ASL gloss? It is a linguistic representation of American Sign Language. What do we mean by that? For example, the English text "Hello, how are you?" is represented in ASL gloss notation as "HELLO HOW YOU DOING?".

You may notice all the capital letters and also some hyphens used in the gloss notation. Capital letters denote that these are glosses and make them readily distinguishable from the English text. Hyphens are used when a gloss is formed from more than one English word; those words are separated by hyphens to indicate that it is a single gloss.

This occurs in instances where ASL is more efficient than English, where ASL can say in one sign what it takes two or more English words to convey.

The other thing you may notice is that the word "re-invent" is fingerspelled. In ASL, when a word needs to be fingerspelled, it's indicated with the prefix "FS-" and every fingerspelled letter is separated by a hyphen.
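To make that notation concrete, here is a minimal sketch in Python (purely illustrative; the session shows no code for this) of the fingerspelling convention just described:

```python
def fingerspell_gloss(word: str) -> str:
    """Render a word in the fingerspelled gloss notation described above:
    an FS- prefix, with each fingerspelled letter separated by a hyphen."""
    return "FS-" + "-".join(word.upper())

print(fingerspell_gloss("reinvent"))  # -> FS-R-E-I-N-V-E-N-T
```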

Alright, let's talk about why ASL avatars. Deaf and hard of hearing individuals and hearing individuals rely on sign language interpreters to communicate with each other. Although some government agencies are required to provide American Sign Language interpretation, many private organizations, events, and individuals do not have access to sign language interpreters, who are expensive and in short supply.

With the limited availability of human interpreters, there is a gap in communication in meetings, events and even daily interactions like the coffee shop example I talked about earlier.

There are video-based platforms that allow sign language interpreters to join and attend calls remotely, but they are in really short supply, and there are not enough qualified sign language interpreters to meet the demand.

Fake signers continue to appear in the news. A couple of years ago, a woman was arrested for fake signing in a police interview. This underscores the need to create assistive technology with faster and more accurate translation.

Lastly, the rise of foundation models and the fascinating world of generative AI that we live in is incredibly exciting, and it really opens up doors to imagine and build what was not possible before.

What is generative AI? Generative AI is a type of AI that can create new content and ideas - conversations, stories, images, videos, music, and so on. Like any other AI, generative AI is powered by machine learning models that are pre-trained on vast corpora of data, commonly called foundation models.

With that, I want to introduce you all to the application that we built called GenASL. It utilizes AWS cloud and AWS generative AI capabilities to enable individuals who rely on visual communication by translating speech or text into ASL animations or videos.

So users can simply send audio as input, and the GenASL application will generate a human-like sign video and also an ASL avatar video.

Having said that, let's dive into the solution approach:

The audio file will be passed on to a speech-to-text model which will then generate an English text. This is passed on to a text-to-text large language model, which is going to generate the ASL gloss that we talked about.

The ASL gloss is then used by a machine learning model to generate a sign video. Rob is going to show us some examples that were generated by our application.

[Video examples:] "Hello, how are you?" "I'm fine, thanks. How are you?" "Welcome to re:Invent." Awesome.

Those examples you saw were generated by our application. Having looked at some examples and the high-level idea, let's take a deeper look at our solution approach.

So our solution includes three steps:

Step One: We take the input audio and convert it to an English text.

Step Two: We use the English text to generate the ASL gloss.

Step Three: Using the ASL gloss, we generate those videos that you saw on the previous slide.

I'm going to dive deep into each of the steps in the subsequent slides, so I'll get into the details there.

So in Step One, we convert the audio input to English text. For that, we use Amazon Transcribe. Amazon Transcribe is an automatic speech-to-text AI service that uses deep learning for speech recognition. It is a fully managed, continuously trained service, which lets you focus on your business while AWS does the heavy lifting of building and maintaining state-of-the-art speech recognition models for you.

The other thing to note is that it's an AI service, and no prior machine learning experience is required to use it.
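As a rough sketch of this step (not shown in the session; the job name, bucket, and key below are placeholders), starting a transcription job with the AWS SDK for Python looks roughly like this:

```python
import boto3

transcribe = boto3.client("transcribe")

# Start an asynchronous transcription job for an audio file already in S3.
# IdentifyLanguage lets Transcribe detect the spoken language automatically.
transcribe.start_transcription_job(
    TranscriptionJobName="genasl-demo-job",                         # placeholder name
    Media={"MediaFileUri": "s3://example-bucket/input/audio.mp3"},  # placeholder URI
    IdentifyLanguage=True,
)
# The job runs asynchronously; the state machine polls GetTranscriptionJob
# (shown later in the walkthrough) until the status is COMPLETED.
```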

So in Step Two, we generate the ASL gloss using Amazon Bedrock, specifically the Claude v2 LLM provided by Anthropic. Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models.

The biggest value proposition of Bedrock in the market is its ability to accelerate development. The other reason Bedrock shines is that it provides access to different foundation models, both from third-party providers and from Amazon. So Bedrock customers can choose from some of the most cutting-edge foundation models available today.

These include Amazon Titan, AI21 Labs' Jurassic-2, Anthropic's Claude v2, and models from Cohere, Meta, and Stability AI. In late September, Amazon and Anthropic announced a partnership to advance generative AI capabilities, and we are leveraging Claude v2 for the GenASL application as well.
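As a minimal sketch of how Step Two could call Claude v2 through Bedrock (the exact prompt, parameters, and few-shot examples GenASL uses are not shown in the session; the ones below are illustrative):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def english_to_gloss(text: str) -> str:
    """Ask Claude v2 on Amazon Bedrock to convert English text to ASL gloss."""
    prompt = (
        "\n\nHuman: Convert the following English sentence to ASL gloss notation. "
        "Write glosses in capital letters and fingerspell unknown words as "
        "FS- followed by hyphen-separated letters.\n"
        "Example: 'Hello, how are you?' -> HELLO HOW YOU DOING\n"   # illustrative few-shot example
        f"Sentence: {text}\n\nAssistant:"
    )
    body = json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": 200,
        "temperature": 0,
    })
    response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
    return json.loads(response["body"].read())["completion"].strip()
```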

With that, the third step is generating the ASL video from the ASL gloss. Our input video dataset is stored in an S3 bucket, and we leverage MMPose, an open-source pose estimation toolbox, and RTMPose, a real-time pose estimation model, to generate those ASL avatar videos.

Suresh is going to dive deeper when he walks through the architecture. But essentially, we use the ASL gloss as the key in a DynamoDB table that points to the corresponding videos in the S3 location.
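A minimal sketch of that gloss-keyed lookup, assuming a hypothetical table and attribute names (the talk only states that the gloss is the key and that the item points at the clips in S3):

```python
import boto3

# Hypothetical table with the gloss as the partition key and the S3 key of the
# per-sign clip (and avatar clip) stored as attributes.
table = boto3.resource("dynamodb").Table("asl-sign-videos")

def lookup_sign(gloss: str):
    """Return the stored item for a gloss, or None if it must be fingerspelled."""
    return table.get_item(Key={"gloss": gloss}).get("Item")

item = lookup_sign("CLOUD")
# e.g. {"gloss": "CLOUD", "sign_video_key": "signs/cloud.mp4", ...}
```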

The video dataset is the American Sign Language Lexicon Video Dataset from Boston University. It contains around 3,300 ASL signs produced by around six native ASL signers. The video files each include multiple signs, so the metadata is important here: for every ASL gloss, it includes the start and end frames.

Having looked at the input dataset, I'd like to touch briefly on MMPose and RTMPose, the real-time multi-person pose estimation model that powers the solution. You can see the 2D whole-body human pose estimation here - we are using 133 keypoints, and RTMPose is able to generate those avatars from them.
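A minimal sketch of extracting those whole-body keypoints, assuming MMPose 1.x is installed and that its "wholebody" alias resolves to an RTMPose whole-body model (the session does not show this code):

```python
# Extract 133 whole-body keypoints from a single video frame with MMPose.
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer("wholebody")   # whole-body (133-keypoint) model alias

result = next(inferencer("frame_0001.jpg"))  # the inferencer yields per-image results
keypoints = result["predictions"][0][0]["keypoints"]  # [[x, y], ...] for 133 points
print(len(keypoints))
```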

So lastly, with the input dataset and the MMPose pose estimation model, we are able to generate the ASL avatars, and we have all the data ready in DynamoDB and S3.

Having talked about the high-level solution approach, I'm going to hand it over to Suresh, who will show you a demo and take us through the architecture.

[Suresh:] Okay, we have built the GenASL application and hosted it on aslavatars.com. Let's wait for the demo to show... Okay, let me open the browser and type in the address, and it brings up the application.

We built this application to accept three types of input - you can upload an audio file, you can press the mic button and use the device microphone, or you can type text and then press Generate Video.

For the first use case, I'm uploading an audio file, and it has generated the avatar video and the sign video. Here I gave a pre-recorded audio file saying "What is re:Invent? AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community."

Whenever the application finds a sign in the sign database, it takes the corresponding sign video and brings it back. If the sign is not available, it falls back to fingerspelling.

Here you can see fingerspelling for some of the words like "re:Invent" - that gets fingerspelled - and it found signs for "cloud" and "community", which are coming up soon... Yes, that's "cloud", then it's fingerspelling again, and the last one will be "community".

Now let's move on to the next type of input. For this one, I'm going to press the mic button and speak into the microphone: "Generative AI is a type of AI that can create net new content including conversations, stories, music, videos and images." Then I'll stop the mic.

And then for this use case, I'm using the browser speech API to convert the speech into text and then the text is sent to the backend application.

The reason I chose this example is that a lot of the signs are available in the sign database - "stories", "images", "videos", and "music" are common words, so they are in the sign database.

You may ask why we generate the avatar at all - why can't we just stitch the videos together and produce a real human video? If you watch some of the signs, the signer's clothing changes and there is a visible jump between signs. That's why we generate the sign avatar on the left-hand side. You can see some of the clothing changes - that's on the "music" sign.

Let's look at the final use case, where I'm going to type text and then press Generate. I'm going to type "Let us build generative AI powered American Sign Language avatars."

Here, there are some signs available for "American Sign Language" and "avatars", but they were signed by a different signer, so I did not pull them into the sign database. The application is going to fingerspell those instead.

Okay, I think that concludes the demo. Let's look at the architecture behind this.

The architecture consists of three parts. Let's look at the first part - the backend batch process. The batch process downloads the video files from the Boston University website and stores them in an S3 bucket, and it also downloads the metadata file.

The metadata file has the start frame, end frame, ASL gloss, and signer. The process reads that file, segments each video into multiple video files - one file per sign - stores them in the S3 bucket, and stores the corresponding S3 key in DynamoDB along with the gloss.
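A minimal sketch of that segmentation step, with hypothetical bucket, table, and attribute names (the actual batch job code is not shown in the session):

```python
import boto3
import cv2

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("asl-sign-videos")  # hypothetical table

def extract_sign(src_video: str, gloss: str, start: int, end: int, out_path: str):
    """Cut one sign out of a session video using the start/end frames from the
    ASLLVD metadata, upload it to S3, and record its key under the gloss."""
    cap = cv2.VideoCapture(src_video)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    for _ in range(start, end + 1):
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(frame)
    cap.release()
    writer.release()

    key = f"signs/{gloss}.mp4"                             # hypothetical key layout
    s3.upload_file(out_path, "genasl-sign-videos", key)    # hypothetical bucket
    table.put_item(Item={"gloss": gloss, "sign_video_key": key})
```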

Once that first step is done, the second step is avatar generation. It takes each sign video and goes frame by frame - most signs are less than two seconds long, and at 30 frames per second that's at most around 60 frames.

It takes each image and sends it to the 2D pose algorithm - RTMPose is a 2D pose estimation algorithm run within the MMPose toolbox. It produces 133 keypoints - where the eyes, the nose, and the hands are, and so on - and those 133 keypoints are stored in the database.

Then a blank canvas image is created and all the points are plotted onto it to create an avatar frame, and that is repeated for each of the roughly 60 frames. The resulting avatar video is stored in the S3 bucket, and the corresponding S3 key is stored in DynamoDB along with the gloss.
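A minimal sketch of that rendering step; the canvas size, color, and codec below are illustrative assumptions:

```python
import cv2
import numpy as np

def render_avatar(frames_keypoints, out_path, size=(512, 512), fps=30):
    """Plot the stored 133 keypoints for each frame onto a blank canvas and
    write the frames out as the avatar video."""
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    for keypoints in frames_keypoints:              # one list of (x, y) per frame
        canvas = np.zeros((size[1], size[0], 3), dtype=np.uint8)
        for x, y in keypoints:
            cv2.circle(canvas, (int(x), int(y)), 2, (255, 255, 255), -1)
        writer.write(canvas)
    writer.release()
```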

Now let's look at the frontend application. The frontend is built using AWS Amplify - a framework that allows you to quickly build, develop, deploy, and host full-stack applications, including mobile and web apps.

I used it to add authentication - you can simply add authentication to a frontend app with the Amplify CLI command "amplify add auth". That generates the sign-up screen and the login screen, as well as the backend Cognito identity pool.

You may ask why we need authentication and authorization for this app. In the use case I mentioned, when we upload the audio file to S3, the frontend has to connect to S3 and upload the file. That requires a temporary identity, and that temporary identity is provided by the Cognito identity pool based on the user's credentials.

Now let's look at the middle part, the API layer. The API layer is fronted by Amazon API Gateway, which allows you to authenticate, monitor, and throttle API requests.

Whenever the API receives a request to generate a sign video, it invokes a Step Functions state machine and returns the execution identifier back to the frontend application.

Let's look at what is inside the Step Function. The state machine has three steps, as outlined in the high-level solution.

The first step is "Transcribe Audio" - it converts the input audio file into English text. That is done through the Amazon Transcribe service.

The second step converts the English text to ASL gloss. Here we use the Amazon Bedrock API to call Anthropic's Claude v2, providing a prompt that instructs it to convert the English text to ASL gloss, along with the input text.

We also provide some few-shot prompting examples to instruct Claude v2 to produce accurate glosses.

Once that is done, the third step creates the avatar and sign videos. The dataset consists of one video per sign, so this third Lambda function stitches the videos together, generates a temporary video, uploads it to the S3 bucket, creates presigned URLs, and sends the presigned URLs for both the sign video and the avatar video back to the frontend.
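A minimal sketch of that third step, assuming MoviePy 1.x for the stitching and placeholder bucket and key names (the actual Lambda implementation is not shown):

```python
import boto3
from moviepy.editor import VideoFileClip, concatenate_videoclips

s3 = boto3.client("s3")

def stitch_and_sign(local_clips, bucket="genasl-output", key="out/result.mp4"):
    """Concatenate the per-sign clips in gloss order, upload the result to S3,
    and return a presigned URL for the frontend to play."""
    clips = [VideoFileClip(path) for path in local_clips]
    concatenate_videoclips(clips).write_videofile("/tmp/result.mp4", audio=False)
    s3.upload_file("/tmp/result.mp4", bucket, key)
    return s3.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600
    )
```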

The frontend then plays those videos in a loop. Now let's dive deep into the detailed API design. API Gateway supports a maximum integration timeout of 29 seconds.

If you provide a longer audio file, processing may take more than 29 seconds, and we want to avoid that timeout. It's also a best practice not to build a long-running synchronous API.

So we built an asynchronous API. The asynchronous API works as a two-step process.

First, there is a sign API endpoint with a POST method that accepts the S3 key and S3 bucket name, or a text string.

The request goes to API Gateway, and API Gateway delegates to the Step Function.

The Step Function sends a response back with the execution ID.

The second step is a GET on the sign endpoint, which sends the execution ID back to the API.

The API then checks the status of the Step Function execution. If it has succeeded, it returns the presigned URLs. If it is still running, it returns a running status, and the frontend waits a couple of seconds and then calls the GET endpoint again.
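A minimal sketch of those two handlers, assuming Lambda proxy integration behind API Gateway; the state machine ARN, the response shapes, and the use of the execution ARN as the identifier are illustrative assumptions:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:GenASL"  # placeholder

def post_sign(event, context):
    """POST handler: start the state machine with the S3 bucket/key (or text)
    and hand the execution ARN back so the client can poll for the result."""
    execution = sfn.start_execution(stateMachineArn=STATE_MACHINE_ARN,
                                    input=event["body"])
    return {"statusCode": 202,
            "body": json.dumps({"executionArn": execution["executionArn"]})}

def get_sign(event, context):
    """GET handler: report the execution status, returning the presigned URLs
    once the state machine has succeeded."""
    arn = event["queryStringParameters"]["executionArn"]
    result = sfn.describe_execution(executionArn=arn)
    if result["status"] == "SUCCEEDED":
        return {"statusCode": 200, "body": result["output"]}
    return {"statusCode": 200, "body": json.dumps({"status": result["status"]})}
```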

OK, so this is the end-to-end architecture.

Here, I just want to highlight a couple of things.

One: when we started working on this project, we were using an Amazon sequence-to-sequence machine learning model, trained on a corpus of text, to generate the gloss. But the accuracy was very low.

Then, when Amazon Bedrock was introduced, we switched to using Amazon Bedrock with the Anthropic Claude v2 model, and we were getting accurate results with just a few-shot examples.

The second thing: just recently, on Monday, there was an announcement that Step Functions now has a direct integration with Bedrock.

So we don't need the Lambda in the middle anymore - the create-ASL-gloss Lambda is no longer needed. You can call the Bedrock API directly from the Step Function, which saves the Lambda compute cost.

Also, from a DevOps perspective, the frontend uses Amplify to build and deploy, and the backend uses AWS SAM, the Serverless Application Model, to build, package, and deploy the application.

Let's see what other things we can add from a DevOps perspective.

For any good application, we want to have monitoring and logging.

The logging is done to Amazon CloudWatch Logs, so that has been taken care of.

I also put together a dashboard to capture the metrics.

On the top left-hand side, it displays the number of API invocations, which shows how many ASL avatar videos were generated.

The other three widgets show average response times - how much time it took to create the video - which average around 10 seconds in total.

There are two major functions inside. One is text-to-gloss, which takes less than a second on average; it calls the Anthropic Claude v2 model to generate the gloss.

The other is gloss-to-pose, which involves downloading the files from S3 and video processing, so it takes close to eight seconds. That's how the average comes to around 10 seconds total.

The bottom part shows the error metrics. For a good user experience, we track failures and alert the DevOps team, so the dashboard captures all the errors produced by the ASL application.

Now let's see how this application looks in the AWS console once it's deployed.

I logged into the AWS console, went to State machines, and this is the Step Function deployed as part of this project.

And then let me open the Step Function.

I'll edit this Step Function, which brings it up in Workflow Studio.

Workflow Studio is an easy way to drag and drop states to create a workflow.

A workflow always starts with a Start state and ends with an End state.

Here, the first step is an input check, which checks whether the input is an audio file or text.

If it's an audio file, it goes to the StartTranscriptionJob state, which is integrated directly with Amazon Transcribe.

This is an example where we don't need a Lambda function to invoke Transcribe, because Step Functions has a direct integration with the transcription service.

Here we just need to pass the URL of the audio file, and to support other languages I set IdentifyLanguage to true.

Once this step is done, it returns the transcription job details and job name as output, and then the workflow moves to a Wait state.

The Wait state waits for two seconds and then calls GetTranscriptionJob to check the job status, passing the job name as input.

Once GetTranscriptionJob returns, the next step checks whether the job is completed or not - it's a Choice state.

If you look at the choice, there are three conditions: if the job is completed, it goes to the next step to process the transcription; if it failed, it goes to the error state and exits; and if it is still running, it goes back to the Wait state, waits two seconds, and calls GetTranscriptionJob again.

Once it has succeeded, it goes to the process-transcription Lambda function.

What this Lambda function does is take the transcription job name, call GetTranscriptionJob, extract the transcribed text from the job results, and put it in a variable that will be used in the next two steps.
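A minimal sketch of what that Lambda does; the transcript JSON layout is the standard Amazon Transcribe output format, and error handling is omitted:

```python
import json
import urllib.request
import boto3

transcribe = boto3.client("transcribe")

def get_transcript_text(job_name: str) -> str:
    """Fetch the English text produced by a completed Transcribe job."""
    job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    uri = job["TranscriptionJob"]["Transcript"]["TranscriptFileUri"]
    with urllib.request.urlopen(uri) as response:
        payload = json.load(response)
    return payload["results"]["transcripts"][0]["transcript"]
```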

The next step is text-to-gloss. This is where it calls the Amazon Bedrock API, invoking the Anthropic Claude v2 model with the English text, and gets the gloss back as output.

Once the gloss is received, the final Lambda function, gloss-to-pose, downloads the videos, stitches them together, uploads the resulting videos into an S3 bucket, creates presigned URLs, and returns the presigned URLs in the output.

Now, let's close this and look at one of the executions. We'll look at the first one because it went through all the states.

This is the execution created for the audio file upload use case. It went through all the states: it checks the input, the input is audio, and then it calls StartTranscriptionJob.

Here, you can see what comes in as input: the bucket name and the key where the audio file was uploaded.

Once that is done, it starts the transcription job, goes to the Wait state, and once the wait is over it calls GetTranscriptionJob and returns the response.

You can see it's in progress because this particular job took more than two seconds, so the first call shows in progress and the second call comes back complete.

Once it's complete, it goes to the process-transcription step. Here is its output.

If you look at the output, it is the text of whatever I uploaded in the audio file - "What is re:Invent?" and so on.

Once that comes back, it calls the Bedrock API, which takes that English text, "What is re:Invent?", as input.

The output from that is the gloss - you can see glosses like IX-3p in the generated gloss for "What is re:Invent?".

The next one is gloss-to-pose, which takes the gloss as input; its output is the presigned URLs.

You can see the URLs are quite long because the identity is embedded in them - presigned URLs are long.

And that concludes our console walkthrough.

Let's move on. So what is next? As you can see, the ASL application currently generates a 2D avatar.

We are planning to convert that to a 3D avatar. There are 3D pose estimation algorithms already available and supported by MediaPipe.

We are going to use those to create 3D keypoints. With the 2D keypoints, as we saw, there are 133 keypoints, but when we use 3D it generates thousands of keypoints.

So we have to filter those thousands of keypoints, and that needs a lot of compute.

Next, we are going to use Stable Diffusion image generation capabilities on Amazon Bedrock to create human-like avatars, then combine the 3D keypoints with those human-like avatars to create a realistic avatar that can be used in real-world settings.

Finally, when you watch some of the videos generated by the ASL application, there is frame skipping. For example, one sign may end with the hand at the chest, while the next sign, "hello", starts at the forehead.

So when we stitch the videos together, there is a frame drop - you can see a sudden change in the motion.

To fix that, we have to use a technique called blending. There are partner solutions available that create the intermediate frames to do that blending.

So our next step will be incorporating partner solutions to create smoother videos.

Also, this is one-way right now - we are going from audio to ASL video. We also need a solution from ASL video back to English audio.

That can be done by navigating the pipeline in the reverse direction.

For example, audio is converted to text using Transcribe, text to gloss using Anthropic Claude v2, and gloss to avatar video through pose estimation algorithms.

If we can do the same thing in reverse, we can record a real-time signing video, take it frame by frame, send it through pose estimation algorithms, collect the keypoints, search for them against the keypoint database to get the gloss, then go from the gloss to text, and from the text to audio using Amazon Polly.

So that's what's next.

Having given the high-level overview and the demo of the ASL application, here are some important resources you can use to make your products more accessible.

The first link is the ASL application hosted on aslavatars.com. You can use it to create ASL videos.

Second, Amazon Bedrock allows you to build and deploy generative AI applications quickly by invoking generative AI models via an API. To learn more about Amazon Bedrock, you can follow the second link.

Third, Amazon's mission is to be Earth's most customer-centric company, and that means building products that are accessible to everyone. You can learn more about Amazon's initiatives around accessibility and inclusivity by following the third link.

OK. If you liked our session, please give it a five-star rating - that will allow us to continue working on this project.

Also, please provide feedback in the session survey; that will also allow us to continue our journey as speakers.

We are passionate about accessibility and building assistive technologies. Whenever you build a product for your customers and your company, make sure to include accessibility measures and think about accessibility features.

And together, let's make our world more inclusive and accessible for everyone.

Thank you all for coming to our session.
