The views and opinions expressed in this blog are purely my own and do not reflect the official positions of my current or past employers.


Siri handles 25 billion requests a month, according to figures Apple shared recently. With over one billion daily users, that works out to fewer than one request per user per day on average. A back-of-the-envelope calculation suggests this amounts to a few seconds a day at most, which is insignificant next to an average of about 3 hours of daily smartphone use. This article argues that Requests per user per day, the metric driving Alexa, Siri, and Google Assistant, is a vanity metric that promotes the notion that Voice is an app in itself rather than an assistant to an app. The rest of this post explains that view.
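
The back-of-the-envelope calculation can be reproduced in a few lines. This is only a sketch: the request and user counts are the figures quoted above, while the 30-day month and the 5 seconds per request are my own assumptions, so treat the result as an order-of-magnitude estimate.

```python
# Back-of-the-envelope estimate of voice assistant usage per user per day.
REQUESTS_PER_MONTH = 25e9      # figure quoted in the article
DAILY_USERS = 1e9              # "over one billion daily users"
DAYS_PER_MONTH = 30            # assumption: average month length
SECONDS_PER_REQUEST = 5        # assumption: one short spoken command
PHONE_USE_SECONDS = 3 * 3600   # about 3 hours of daily smartphone use


def requests_per_user_per_day(requests_per_month, daily_users, days=DAYS_PER_MONTH):
    """Average number of requests each user makes per day."""
    return requests_per_month / days / daily_users


def voice_share_of_phone_time(requests_per_day, seconds_per_request, phone_seconds):
    """Fraction of daily smartphone time spent on voice requests."""
    return requests_per_day * seconds_per_request / phone_seconds


rpd = requests_per_user_per_day(REQUESTS_PER_MONTH, DAILY_USERS)
share = voice_share_of_phone_time(rpd, SECONDS_PER_REQUEST, PHONE_USE_SECONDS)

print(f"{rpd:.2f} requests per user per day")
print(f"{rpd * SECONDS_PER_REQUEST:.1f} seconds of voice use per day")
print(f"{share:.4%} of daily smartphone time")
```

Even if the assumed request length were several times longer, the share of daily smartphone time would stay well under one percent.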

How did we get here?


While discussing the future of Voice Assistants at the VentureBeat 2020 AI Transform Summit, the heads of both Amazon's and Google's Voice Assistants made one stark admission: the future of Voice Assistants is the Screen, the very interface that Voice Assistants had set out to replace in the first place. This is a subtle, implicit admission that the vision of human-like conversations prioritized technology over customer problems.

“For both Amazon and Google, this means building smart displays and emphasizing AI assistants that can both share visual content and respond with voice.” — Khari Johnson at the VentureBeat AI Summit while interviewing the heads of Amazon and Google AI assistants.

The VentureBeat article goes on to say that advances in multimodal models could benefit image recognition and language models, including more robust inference from models that receive input from more than one medium. In other words, the near future of Voice Assistance lies alongside the keyboard, mouse, touchscreen, camera, etc., and not in 20-minute conversations with bots.

“Speech interfaces would be much more commonly used today if the quest for human-like continuous speech had not drained away so much talent and resources. As an example, Short, commandlike speech can be quite effective for many common tasks” — Gary Marchionini in his book Information Seeking in Electronic Environments.

On a related note, Siri was barely mentioned at the keynote of Apple’s Worldwide Developers Conference (WWDC) 2020, which sheds light on Apple’s priorities. Apple’s relentless focus on improving Siri in small but meaningful ways is what gives me enormous hope about the future of Voice.

In this article, I put forth a few observations on what has and has not worked for Voice Assistants, and how we can build a future with better Voice Assistants, through lessons based on history.


What works in Voice and what does not?

Alexa, Google Assistant, and Siri have good household penetration and are easily accessible to both adults and children. This has a lot to do with their strong distribution networks, an advantage any other hardware company would envy. However, we should not ignore the contributions of other companies that have successfully deployed Voice to solve real problems. My doctor at Kaiser Permanente, for example, uses Nuance’s Dragon Medical Speech Recognition Solution for dictation to speed up note-taking. Accessibility is another such problem, as described by Bo Campbell. Understanding the key customer problems where Voice Assistants can contribute will enable companies to make vital improvements, which I believe is key to the future of Voice Assistance.

[Figure: “Is Alexa working?” What works, what partially works, and what does not for Alexa.]

Voice still has a number of hard, unaddressed challenges, such as monetization, mobile in-app search, discoverability, privacy, and responsiveness. I discuss a few of these below.

Voice second, Assistant first

Assistant to what? There is a broad spectrum of possible assistance: an assistant for shopping (Amazon), for Search (Google), for cleaning (iRobot), for mobile apps (Google/Apple), etc. The perception that Voice is an assistant in itself distracts from the fact that Voice is just one channel humans use to communicate.

Humans can receive and transmit information using a multitude of channels. We communicate intention via voice, gesture, facial expressions, and a host of muscle movements that control devices such as pencils, keyboards, musical instruments, and pointing devices. We digest information using sight, hearing, touch, smell, and taste. People, especially kids, have trouble understanding speech when the speaker's lips are covered by a face mask, a very relevant point in the midst of a pandemic. The obsession with a Voice-only interface is a fundamental denial of how humans interact with machines as well as with other humans, and it leads to the incorrect conclusion that Voice Assistants will displace traditional modes of interaction such as keyboards, pointing devices, and touchscreens. Siri’s new pop-up interface no longer takes over the whole screen, which in my opinion indicates that Siri is heading in a promising direction: integrating with the use of apps rather than being an app in itself.

[Figure: The voice technology stack, consisting of mic parameters, noise cancellation, ML elements, and voice-assisted apps.]

There are three challenge areas that I believe will define the next 5 years of Voice Assistants:


(1) Mobile Search: If there is one area where Voice Assistants can help the users of your app, this is it. In-app search is easier with Voice Assistants, and deep-linking alone isn’t sufficient. Every app developer should use tools such as Cohort Analysis, along with usage data, to identify sequences of unnecessary and wasteful taps and scrolls, and replace them with equivalent voice actions that save the user time and steps. You can use local voice processing on either iOS or Android for both user privacy and responsiveness. In the app we launched at my startup, we used automatic speech recognition, intent extraction, named-entity recognition, and more, all through the native on-device iOS/Android Natural Language Processing/Natural Language Understanding libraries, for both deep-linking and in-app search/commands. We tried Dialogflow’s cloud services but ended up using the iOS/Android libraries, simply because the responsiveness and reliability of local services are a pretty good bet for the future. The one area where Alexa can improve is in building iOS/Android NLP/NLU libraries that app developers can use to enable voice interactions without internet connectivity, something that Siri is getting better at.

(2) Discoverability: How can a user become aware of the commands or questions that are supported? Unfortunately, this is not an easy task. The strategy should involve steps similar to any Purchase funnel (attention, interest, desire, and action) to convince a user to change a habit and adopt a voice command. My recommendation for an app developer is to support simple commands, limited to at most 3 words, such as “Find Lebanese food” or “Filter open now,” instead of long compound sentences. Siri Shortcuts is a good idea, but it puts the burden of finding the inefficiency on the user. If you look at what Voice Assistants actually help with today, it is simple commands such as “Play Music,” “Set timer 30 minutes,” and “Turn lights on/off.” Simplicity, responsiveness, and reliability are the three key criteria for making Voice commands in your app “top of mind.”

(3) Machine Learning to improve Human-to-Human conversations: As the growth of video calling strains both bandwidth and compute, new techniques will emerge that use ML to deliver better QoS from lower-quality data. This is best illustrated by Google’s work on improving audio quality in Duo video calls with WaveNetEQ. Audio super-resolution and bone-conduction speech pickup are newer techniques that hold a lot of promise for improving human-to-human conversations.
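
The short-command recommendation in (2) can be sketched as a small dispatcher. This is an illustration, not the code from our app: it assumes an on-device speech recognizer (not shown) has already produced a text transcript, and the handler names and returned action dictionaries are hypothetical.

```python
import re

MAX_WORDS = 3  # the article's suggested upper bound for an in-app voice command


# Hypothetical in-app actions; a real app would call its own search/filter code.
def find(args):
    return {"action": "search", "query": " ".join(args)}


def apply_filter(args):
    return {"action": "filter", "criteria": " ".join(args)}


# The first word of the command selects the handler.
HANDLERS = {"find": find, "filter": apply_filter}


def parse_command(transcript):
    """Map a short transcript to an in-app action, or None if unsupported."""
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    if len(words) < 2 or len(words) > MAX_WORDS:
        return None  # empty, missing arguments, or too long to be a command
    handler = HANDLERS.get(words[0])
    if handler is None:
        return None  # unknown verb
    return handler(words[1:])


print(parse_command("Find Lebanese food"))
# {'action': 'search', 'query': 'lebanese food'}
print(parse_command("Filter open now"))
# {'action': 'filter', 'criteria': 'open now'}
print(parse_command("Please find me some Lebanese food nearby"))
# None
```

Returning None for anything outside the small supported vocabulary lets the app fall back to its touch interface instead of guessing, which keeps the voice path predictable and reliable.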

The vanity metric of Voice: requests/user/day

There is some commonality to the successful use of Voice Commands such as turning the lights on/off, setting timers, asking questions, or playing music: voice simply makes it more efficient to complete a task that is entirely possible without voice.

However, the metrics used for Voice Interfaces are designed as if Voice were an end in itself. Two metrics often used to evaluate voice assistant services are Requests per user per day, for commands and queries, and Time-in-use per user per day, for duplex conversations. These are clearly vanity metrics, considering that a person makes roughly one request a day on average, compared with the 3 to 4 hours of average daily smartphone use.

If the goal of a Voice interface is to improve the efficiency of completing a task, such as placing an order, finding the Settings in an app, or searching for a word in a web browser or a podcast, then the incremental efficiency/effectiveness of completing that task with voice assistance versus without it is a much better metric.
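
One possible way to compute such a metric is the fraction of task time saved when voice is available. The sketch below is my own formulation under stated assumptions: the function name is hypothetical, the timings are made-up illustrative numbers rather than measurements, and medians are used so that a few unusually slow sessions do not dominate.

```python
from statistics import median


def incremental_efficiency(baseline_seconds, voice_seconds):
    """Fraction of median task time saved when voice assistance is used.

    baseline_seconds: completion times for the task without voice
    voice_seconds:    completion times for the same task with voice
    """
    base = median(baseline_seconds)
    voice = median(voice_seconds)
    if base <= 0:
        raise ValueError("baseline task time must be positive")
    return (base - voice) / base


# Hypothetical timings (seconds) for one task, e.g. filtering search results.
touch_only = [14.0, 12.5, 16.0, 13.0, 15.5]
with_voice = [6.0, 7.5, 5.5, 8.0, 6.5]

saving = incremental_efficiency(touch_only, with_voice)
print(f"Voice saves {saving:.0%} of the median task time")
```

A negative value would mean the voice path is slower than the existing one, which is exactly the signal that requests/user/day cannot give.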


[Figure: Replacing the old metrics, #requests/user/day and Time-in-use/user/day, with incremental improvement in task efficiency.]

The old set of Metrics is designed on the premise that Voice is an app in itself, and the destination as well. Building Voice as a destination in itself is like saying the Touchscreen is a destination in itself. The proposed Metric is designed on the premise that Voice is only a springboard and a different app is the destination. This new Metric appeals more to developers, because the goal of Voice Assistants becomes enhancing the experience of their apps, not commoditizing those apps by relegating them to a “feature” of a centralized Voice app.

As Peter Thiel once observed, 140 characters (Twitter) are changing today’s world more than flying cars.

In summary, this article does not claim that human-like machine conversation is an unworthy goal; it is a good research problem to tackle. The claim is that incorporating ML/Voice to improve existing, mundane daily interactions is a monumentally worthy task in itself. Every task calls for its own balance of user interactions, and to be a good assistant, Voice has to coexist with other interfaces, not replace them.

