Building Jarvis, a Generative Chatbot with an Attitude


Carsales.com, the company I work for, is holding a hackathon event. This is an annual event where everyone (tech or non-tech) comes together to form a team and build anything — anything at all. Well, preferably you would build something that has a business purpose, but it is really up to you. The idea for this chatbot actually came from Jason Blackman, our Chief Information Officer at carsales.com.


Carsales Hackathon

Given that our next hackathon would be an online event, thanks to COVID-19, wouldn't it be cool if we could host a Zoom webinar where any carsales.com employee could jump in to hang out and chat with an AI bot, which we could call Jarvis, who would always be available to chat with you?


Brainstorming

After tossing around ideas, I came up with a high-level scope. Jarvis would need to have a visual presence, just as a human webinar participant would. He would need to be able to listen to what you say and respond contextually with a voice.


I wanted him to be as creative as possible in his replies and to be able to generate a reply on the fly. Most chatbot systems are retrieval based, meaning that they have hundreds or thousands of prepared sentence pairs (source and target) that form their knowledge base. When the bot hears a sentence, it tries to find the most similar source sentence in its knowledge base and simply returns the paired target sentence. Retrieval-based bots such as Amazon Alexa and Google Home are a lot easier to build and work very well for specific tasks like booking a restaurant or turning the lights on and off, although their conversation scope is limited. However, for entertainment purposes like casual chatting, they lack creativity in their replies when compared to their generative counterparts.

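To make the retrieval-based idea concrete, here is a toy sketch: it picks the most similar source sentence by bag-of-words cosine similarity and returns the paired target. The knowledge base and function names are invented for illustration, not from any real system.

```python
from collections import Counter
import math

# A toy knowledge base of (source, target) sentence pairs. In a real
# retrieval bot this would hold hundreds or thousands of pairs.
KNOWLEDGE_BASE = [
    ("hello", "hi there, how can I help?"),
    ("what is your name", "my name is Jarvis"),
    ("turn on the lights", "turning the lights on now"),
]

def bag_of_words(sentence):
    """Represent a sentence as a word-count vector."""
    return Counter(sentence.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def reply(utterance):
    """Return the target sentence paired with the most similar source."""
    query = bag_of_words(utterance)
    best = max(KNOWLEDGE_BASE,
               key=lambda pair: cosine_similarity(query, bag_of_words(pair[0])))
    return best[1]
```

Whatever you say, the bot can only ever answer with one of its prepared target sentences — which is exactly the creativity limit described above.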

For that reason, I wanted a generative-based system for Jarvis. I was fully aware that I would likely not achieve a good result. However, I really wanted to know how far current generative chatbot technology had come and what it could do.


Architecture

Ok, so I knew what I wanted. Now it was time to really contemplate how on earth I was going to build this bot.


The first component needed was a mechanism to route audio and video. Our bot needed to be able to hear conversations on Zoom, so we needed a way to route the audio from Zoom into our bot. This audio would then need to be passed into a speech recognition module, which would give us the conversation as text. We would then need to pass this text into our generative AI model to get a reply, which would be turned into speech using text-to-speech technology. While the audio reply was being played, we would need an animated avatar which, apart from fidgeting, could also move his lips in sync with the audio playback. The avatar animation and audio playback would need to be sent back to Zoom for all meeting participants to hear and see. Wow! It was indeed a pretty complex system.


Jarvis’ architecture diagram

To summarise, we needed the following components:


  • Audio/video routing
  • Speech recognition
  • Generative AI model
  • Text to Speech
  • Animated avatar
  • Controller
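Glued together, the flow of a single utterance through these components can be sketched as a simple controller loop. Every function below is my own placeholder stub standing in for the real module (speech API, generative model, text-to-speech, avatar), not an actual API:

```python
# Minimal sketch of the controller gluing the components together.

def recognise_speech(audio_chunk):
    # Placeholder: in the real bot this streams audio from the virtual
    # microphone to a speech recognition service.
    return audio_chunk.decode()

def generate_reply(text):
    # Placeholder: in the real bot this queries the generative model.
    return f"Jarvis heard: {text}"

def synthesise_speech(text):
    # Placeholder: in the real bot this calls a text-to-speech service.
    return text.encode()

def play_to_virtual_mic(audio):
    # Placeholder: in the real bot this plays audio into the virtual
    # microphone that both Zoom and the avatar tool listen to.
    return audio

def handle_utterance(audio_chunk):
    """One pass through the pipeline: hear -> reply -> speak."""
    text = recognise_speech(audio_chunk)
    reply = generate_reply(text)
    audio = synthesise_speech(reply)
    return play_to_virtual_mic(audio)
```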

Audio/Video Routing

I love it when someone else has done the hard work for me. Loopback is an audio tool that allows you to redirect audio from any application into a virtual microphone. All I needed were two audio routings. The first one was to route the audio from the Zoom app into a virtual microphone, from which my bot would listen.


Audio routing 1 diagram

The second routing was to route the chatbot audio output into yet another virtual microphone, which both the Zoom app and our avatar tool would listen to. It is obvious that Zoom would need to listen to this audio. However, why would our avatar tool need it? For lip-syncing, so that our avatar could move his lips in time with the audio playback. You will see more details on this in later sections of this blog.


Audio routing 2 diagram

Speech Recognition

This module is responsible for processing incoming audio from Zoom via a virtual microphone and turning it into text. There were a couple of offline and online speech recognition frameworks to choose from. The one I ended up using was the Google Speech API. It is an online API with an excellent Python interface that delivers superb accuracy and, more importantly, allows you to stream and recognise audio in chunks, which minimises processing time significantly. I would like to emphasise that latency (how long it takes for the bot to respond to a query) is critical for a chatbot. A slow-responding bot can seem very robotic and unrealistic.


Most of the time, the Google Speech API returns a response in less than a second after a sentence is fully heard.

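As a rough illustration of the chunked streaming pattern (this assumes the google-cloud-speech Python package and valid API credentials; the helper names, queue plumbing, sample rate, and language code are my own choices for the sketch):

```python
import queue

def request_generator(audio_queue, chunk_type):
    """Yield audio chunks from a queue as streaming requests.

    `audio_queue` would be fed by the virtual-microphone capture
    thread; a None sentinel ends the stream.
    """
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            return
        yield chunk_type(audio_content=chunk)

def recognise_stream(audio_queue):
    # Requires the google-cloud-speech package and API credentials,
    # so the import is kept local to this sketch.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-AU",
    )
    streaming_config = speech.StreamingRecognitionConfig(
        config=config, interim_results=False
    )
    requests = request_generator(audio_queue, speech.StreamingRecognizeRequest)
    # Results stream back while audio is still being sent, which is
    # what keeps the recognition latency low.
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if result.is_final:
                yield result.alternatives[0].transcript
```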

Generative AI Model

This is the part I spent most of my time on. After spending a day or two catching up with recent developments in generative chatbot techniques, I found that Neural Machine Translation (NMT) models had become quite popular.


The concept was to feed an encoder-decoder LSTM model with word embeddings from an input sentence and have it generate a contextual output sentence. This technique is normally used for language translation. However, given that the job is simply mapping one sentence to another, it can (in theory) also be used to generate a reply to a sentence.

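To sketch the encoder-decoder idea (the framework choice and all dimensions here are mine, purely illustrative, and not necessarily what Jarvis uses): the encoder compresses the input sentence into a hidden state, and the decoder unrolls that state into the output sentence, one token distribution per step.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only.
VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 64, 128

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.decoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.out = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence; keep only the final (h, c) state.
        _, state = self.encoder(self.embed(src_ids))
        # Decode the target sentence seeded with the encoder state
        # (teacher forcing, as used during training).
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        # Project each decoder step to a distribution over the vocabulary.
        return self.out(dec_out)

model = Seq2Seq()
src = torch.randint(0, VOCAB_SIZE, (2, 7))   # batch of 2 input sentences, 7 tokens
tgt = torch.randint(0, VOCAB_SIZE, (2, 5))   # target sentences, 5 tokens
logits = model(src, tgt)                     # one vocab distribution per step
```

For translation the source and target are sentences in two languages; for a chatbot they are simply an utterance and its reply.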
