How Deep Learning Can Keep You Safe with Real-Time Crime Alerts

Citizen scans thousands of public first responder radio frequencies 24 hours a day in major cities across the US. The collected information is used to provide real-time safety alerts about incidents like fires, robberies, and missing persons to more than 5M users. Having humans listen to 1000+ hours of audio daily made it very challenging for the company to launch new cities. To continue scaling, we built ML models that could discover critical safety incidents from audio.

Our custom software-defined radios (SDRs) capture large swathes of the radio frequency (RF) spectrum and create optimized audio clips, which are sent to an ML model that flags relevant clips. The flagged clips are sent to operations analysts, who create incidents in the app; finally, users near the incidents are notified.

Figure 1. Safety alerts workflow (Image by Author)

Adapting a Public Speech-to-Text Engine to Our Problem Domain

Figure 2. Clip classifier using public speech-to-text engine (Image by Author)

We started with a top-performing speech-to-text engine, chosen based on word error rate (WER). Police use a lot of special codes that are not part of the normal vernacular; for example, an NYPD officer requests backup units by transmitting a “Signal 13”. We customized the vocabulary to our domain using speech contexts.

We also boosted some words to fit our domain; for example, “assault” isn’t used colloquially but is very common in our use case. We had to bias our models towards detecting “assault” over “a salt”.

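To make this concrete, here is a minimal sketch of vocabulary biasing, assuming a Google Cloud-style speech-to-text API (this post doesn't name the engine we used, and the phrases, boost value, and file name are illustrative):

```python
# A minimal sketch, assuming the Google Cloud Speech-to-Text v1p1beta1 API;
# the phrases, boost value, and audio file are illustrative placeholders.
from google.cloud import speech_v1p1beta1 as speech

client = speech.SpeechClient()

# Bias recognition toward domain terms, e.g. "assault" over "a salt"
speech_context = speech.SpeechContext(
    phrases=["Signal 13", "assault", "robbery", "shots fired"],
    boost=15.0,
)

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="en-US",
    speech_contexts=[speech_context],
)

with open("radio_clip.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```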

After tuning the parameters, we were able to get reasonable accuracy for transcriptions in some cities. The next step was to use the transcribed data of the audio clips and figure out which ones were relevant to Citizen.

Binary Classifier Based on Transcriptions and Audio Features

We modeled a binary classification problem with the transcriptions as input and a confidence level as output. XGBoost gave us the best performance on our dataset.

We had insight from someone who previously worked in law enforcement that radio transmissions about major incidents in some cities are preceded by special alert tones to get the attention of police on the ground. This extra feature helped make our model more reliable, especially in cases of bad transcriptions. Some other useful features we found were the police channel and transmission IDs.

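Below is a minimal sketch of such a classifier, combining TF-IDF text features from the transcriptions with an alert-tone flag; the toy data, feature set, and hyperparameters are illustrative assumptions, not our production pipeline:

```python
# A minimal sketch: TF-IDF transcription features plus an alert-tone flag
# fed into XGBoost. The toy data and hyperparameters are assumptions.
import numpy as np
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer

transcripts = [
    "signal 13 shots fired respond forthwith",
    "routine traffic stop license check",
    "report of an assault in progress",
    "requesting a meal break",
]
has_alert_tone = np.array([[1], [0], [1], [0]])  # preceded by special alert tones?
labels = np.array([1, 0, 1, 0])                  # 1 = relevant to Citizen

vectorizer = TfidfVectorizer(max_features=5000)
text_features = vectorizer.fit_transform(transcripts).toarray()

# Concatenate text features with audio-derived features like the tone flag
X = np.hstack([text_features, has_alert_tone])

model = xgb.XGBClassifier(n_estimators=200, max_depth=6, eval_metric="logloss")
model.fit(X, labels)

# Output is a confidence level that the clip deserves an analyst's attention
confidence = model.predict_proba(X)[:, 1]
```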

We A/B tested the ML model in the operations workflow. After a few days of running the test, we noticed no degradation in the incidents created by analysts who were using only the model-flagged clips.

We launched the model in a few cities. Now a single analyst could handle multiple cities at once, which wasn’t previously possible! With the new spare capacity on operations, we were able to launch multiple new cities.

Figure 3. Model rollout leading to a significant reduction in audio for analysts (Image by Author)

Beyond a Public Speech-to-Text Engine

The model didn’t turn out to be a panacea for all our problems. We could only use it in the few cities that had good quality audio. Public speech-to-text engines are trained on phone audio, which has a different acoustic profile than radio; as a result, the transcription quality was sometimes unreliable. Transcriptions were completely unusable for the older analog systems, which were very noisy.

We tried multiple models from multiple providers, but none of them were trained on an acoustic profile similar to our dataset, and none could handle noisy audio.

We explored replacing the speech-to-text engine with one trained on our own data while keeping the rest of the pipeline the same. However, we needed several hundred hours of transcription data for our audio, which was very slow and expensive to generate. We had the option of optimizing the process by transcribing only the “important” words defined in our vocabulary and adding blanks for the irrelevant words, but that was still only an incremental reduction in effort.

Eventually, we decided to build a custom speech processing pipeline for our problem domain.

Convolutional Neural Network for Keyword Spotting

Since we only cared about the presence of keywords, we didn’t need to recover the exact order of words and could reduce our problem to keyword spotting. That was a much easier problem to solve! We decided to do so using a convolutional neural network (CNN) trained on our dataset.

Using CNNs over recurrent neural networks (RNNs) or long short-term memory (LSTM) models meant that we could train much faster and iterate more quickly. We also evaluated the Transformer model, which is massively parallel but requires a lot of hardware to run. Since we were only looking for short-term dependencies between audio segments to detect words, a computationally simple CNN seemed a better choice than a Transformer, and it freed up hardware for us to be more aggressive with hyperparameter tuning.

Figure 4. Clip flagging model with a CNN for keyword spotting (Image by Author)

We split the audio clips into fixed-duration subclips. We gave a positive label to a subclip if a vocabulary word was present, and then marked an audio clip as useful if any such subclip was found in it. During the training process, we tested how varying the duration of subclips affected convergence. Long clips made it much harder for the model to figure out which portion of the clip was useful, and also harder to debug. Short clips meant that words partially appeared across multiple clips, which made it harder for the model to identify them. We were able to tune this hyperparameter and find a reasonable duration.

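A minimal sketch of this subclip labeling scheme follows; the subclip duration and sample rate are assumptions, since the tuned values aren't given here:

```python
# A minimal sketch of subclip splitting and labeling; SUBCLIP_SECONDS and
# SAMPLE_RATE are assumptions, not the tuned production values.
import numpy as np

SUBCLIP_SECONDS = 2.0
SAMPLE_RATE = 16000

def split_into_subclips(audio: np.ndarray) -> list:
    step = int(SUBCLIP_SECONDS * SAMPLE_RATE)
    return [audio[i:i + step] for i in range(0, len(audio), step)]

def label_subclips(annotations, n_subclips):
    """annotations: (start_sec, end_sec, word) spans marked by annotators."""
    labels = np.zeros(n_subclips)
    for start, end, _word in annotations:
        first = int(start // SUBCLIP_SECONDS)
        last = int(end // SUBCLIP_SECONDS)
        labels[first:last + 1] = 1  # positive if a vocabulary word overlaps
    return labels

# Example: a 6-second clip with "assault" spoken from 2.4s to 3.1s
subclip_labels = label_subclips([(2.4, 3.1, "assault")], n_subclips=3)
clip_is_useful = bool(subclip_labels.any())  # useful if any subclip is positive
```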

For each subclip, we convert the audio into MFCCs (mel-frequency cepstral coefficients) and also add the first- and second-order derivatives. The features are generated with a frame size of 25ms and a stride of 10ms. The features are then fed into a neural network based on the Keras Sequential model using a TensorFlow backend. The first layer is a Gaussian noise layer, which makes the model more robust to noise differences between different radio channels. We tried an alternative approach of artificially overlaying real noise onto clips, but that slowed down training significantly with no meaningful performance gains.

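Here is a sketch of this feature extraction, assuming librosa (the post doesn't specify the library we used), with the 25ms frame and 10ms stride described above:

```python
# A minimal sketch of per-subclip feature extraction, assuming librosa;
# the sample rate and n_mfcc are assumptions.
import librosa
import numpy as np

SAMPLE_RATE = 16000
FRAME = int(0.025 * SAMPLE_RATE)   # 25ms frame size
STRIDE = int(0.010 * SAMPLE_RATE)  # 10ms stride

def subclip_features(audio: np.ndarray) -> np.ndarray:
    """Return a (time_steps, features) matrix of MFCCs plus derivatives."""
    mfcc = librosa.feature.mfcc(
        y=audio, sr=SAMPLE_RATE, n_mfcc=13,
        n_fft=FRAME, hop_length=STRIDE,
    )
    delta1 = librosa.feature.delta(mfcc, order=1)  # first-order derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivative
    # Stack into (time_steps, 39) for a Conv1D model
    return np.vstack([mfcc, delta1, delta2]).T
```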

We then added subsequent layers of Conv1D, BatchNormalization, and MaxPooling1D. Batch normalization helped with model convergence, and max pooling helped make the model more robust to minor variations in speech as well as to channel noise. We also tried adding dropout layers, but those didn’t improve the model meaningfully. Finally, we added a densely-connected layer which fed into a single-unit output layer with sigmoid activation.

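A minimal sketch of this architecture using the Keras Sequential API; the layer counts, filter sizes, and noise standard deviation are assumptions, since only the layer types are described above:

```python
# A minimal sketch of the described architecture; layer sizes and the
# GaussianNoise stddev are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_keyword_model(time_steps: int, n_features: int) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(time_steps, n_features)),
        # Gaussian noise for robustness to channel noise differences
        layers.GaussianNoise(0.1),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=5, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        # Single sigmoid output: probability a vocabulary word is present
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall()])
    return model
```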

Generating Labeled Data

Figure 5. Labeling process for audio clips (Image by Author)

To label the training data, we gave annotators the list of keywords for our domain and asked them to mark the start and end positions within a clip along with the word label if any of the vocabulary words were present.

To ensure the annotations were reliable, we overlapped 10% of the clips across annotators and measured how consistently they labeled the overlapping clips. Once we had ~50 hours of labeled data, we started the training process. We kept collecting more data while iterating on the training process.

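The post doesn't specify the agreement metric; one common choice would be Cohen's kappa on the overlapping clips, sketched below with toy per-clip labels:

```python
# A minimal sketch of annotator agreement on the 10% overlap, assuming
# scikit-learn; Cohen's kappa is an assumed metric choice, and the
# per-clip labels are toy data.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1]
annotator_b = [1, 0, 0, 1, 0, 1]
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa on overlapping clips: {kappa:.2f}")
```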

Since some words in our vocabulary were much more common than others, our model performed reasonably on common words but struggled with rarer words that had fewer examples. We tried creating artificial examples of those words by overlaying the word utterance onto other clips. However, the performance gains were not commensurate with actually getting labeled data for those words. Eventually, as our model improved on common words, we ran it on unlabeled audio clips and excluded the ones where the model found those words. That helped us reduce the redundant words in our future labeling.

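A sketch of that labeling filter, treating the model as a binary detector (a simplification; the confidence threshold and helper name are hypothetical):

```python
# A minimal sketch: skip clips where the current model already detects
# vocabulary words with high confidence, so labeling effort goes to rare
# words. The 0.9 threshold and `needs_labeling` helper are hypothetical.
import numpy as np

def needs_labeling(model, features: np.ndarray, threshold: float = 0.9) -> bool:
    score = float(model.predict(features[None, ...])[0, 0])
    return score < threshold  # only send low-confidence clips to annotators
```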

Model Launch

After several iterations of data collection and hyperparameter tuning, we were able to train a model with high recall on our vocabulary words and reasonable precision. High recall was very important to capture critical safety alerts. The flagged clips are always listened to before an alert is sent, so false positives were not a huge concern.

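Since flagged clips are human-reviewed, the operating point can favor recall. Here is a minimal sketch of picking a decision threshold for a recall target, assuming scikit-learn (the 0.95 target is an assumption):

```python
# A minimal sketch of choosing a decision threshold for high recall,
# assuming scikit-learn; the 0.95 recall target is an assumption.
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, y_scores, target_recall=0.95):
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # recall[i] corresponds to thresholds[i]; pick the highest threshold
    # that still meets the recall target (maximizing precision)
    ok = np.where(recall[:-1] >= target_recall)[0]
    return thresholds[ok[-1]] if len(ok) else thresholds[0]
```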

We A/B tested the model in some boroughs of New York City. The model was able to cut down audio volume by 50–75% (depending on the channel). It also clearly outperformed our model built on the public speech-to-text engine, since NYC has very noisy audio due to its analog systems.

Somewhat surprisingly, we then found that the model transferred well to audio from Chicago, even though it was trained on NYC data. After collecting a few hours of Chicago clips, we were able to transfer-learn from the NYC model to get reasonable performance in Chicago.

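A minimal sketch of that transfer-learning step, assuming a saved Keras model; the file path, freezing strategy, and learning rate are assumptions:

```python
# A minimal sketch of NYC-to-Chicago transfer learning; the model path,
# freezing strategy, and learning rate are assumptions.
import tensorflow as tf

nyc_model = tf.keras.models.load_model("nyc_keyword_model.h5")

# Freeze the convolutional feature extractor; retrain the dense head
for layer in nyc_model.layers:
    if isinstance(layer, tf.keras.layers.Conv1D):
        layer.trainable = False

nyc_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall()])

# chicago_x, chicago_y: features/labels from a few hours of Chicago clips
# nyc_model.fit(chicago_x, chicago_y, epochs=5, validation_split=0.1)
```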

Conclusion

Our speech processing pipeline with the custom deep neural network was broadly applicable to police audio from major US cities. It discovered critical safety incidents from the audio, allowing Citizen to expand rapidly into cities across the country and serve the mission of keeping communities safe.

Picking a computationally simple CNN architecture over RNN, LSTM, or Transformer alternatives and simplifying our labeling process were the major breakthroughs that allowed us to outperform public speech-to-text models in a very short time and with limited resources.

Source: https://towardsdatascience.com/how-deep-learning-can-keep-you-safe-with-real-time-crime-alerts-95778aca5e8a
