Smart Meeting Minutes with Speech Recognition + Speaker Recognition

alpha-soso

Last modified 2024-02-06 18:25:51

First published 2024-02-02 18:04:13

Article link: https://blog.csdn.net/mr_lio/article/details/135998733

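This article describes how to combine pyannote-audio speaker recognition with the whisper speech recognition model to analyze recordings in meeting-minutes scenarios. It covers the environment setup, the Annotation class, and how to merge the two sets of recognition results.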

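The previous article used pyannote-audio to implement speaker recognition. This one combines it with speech recognition to analyze audio files in more depth, which is particularly useful in scenarios such as meeting minutes.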

Environment Setup

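For speaker recognition we stay with pyannote-audio. For speech recognition there are many good models to choose from, such as whisper and FunASR; this article uses whisper.
Note that pyannote-audio and whisper have slightly different environment requirements. To keep things manageable, we adjust the existing environment so that both can run in it. The recommended configuration is:
pyannote-audio: 2.1.1 (downgraded for compatibility with the other models)
Python: 3.9
PyTorch: 1.11.0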
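
As a quick sanity check, the installed versions can be compared against the configuration above. The sketch below is only an illustration: the distribution names `pyannote.audio` and `torch` are assumed to be the names the packages were installed under, and `installed_version` and `check_environment` are hypothetical helpers, not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions recommended above. The keys are assumed distribution names;
# adjust them if your packages were installed under different names.
EXPECTED = {
    "pyannote.audio": "2.1.1",
    "torch": "1.11.0",
}

def installed_version(dist_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

def check_environment(expected=EXPECTED):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        # PyTorch builds often carry a local suffix such as "1.11.0+cu113",
        # so only the part before "+" is compared.
        matches = found is not None and found.split("+")[0] == wanted
        report[name] = (wanted, found, matches)
    return report

if __name__ == "__main__":
    for name, (wanted, found, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {wanted}, found {found} -> {status}")
```

A mismatch does not necessarily mean the setup will fail; it only signals that the environment differs from the one used in this article.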

Annotation

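Annotation is a class in pyannote.core that serves as a container for speaker diarization results. Each entry carries a start, an end, and a label, which are the segment's start time, end time, and speaker identifier respectively.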

Some of the class's built-in methods:

from pyannote.core import Segment, Annotation

# `annotations` is an existing Annotation instance, e.g. a diarization result

# Get the speaker labels
labels = annotations.labels()
# Output: ['SPEAKER_00', 'SPEAKER_01', 'SPEAKER_02', 'SPEAKER_03', 'SPEAKER_04', 'SPEAKER_05']

# Get the total speaking time per speaker, longest first
charts = annotations.chart()
# Output: [('SPEAKER_02', 95.14125000000011), ('SPEAKER_00', 42.37312499999992), ('SPEAKER_01', 37.49625000000003), ('SPEAKER_04', 18.34312500000003), ('SPEAKER_05', 12.892499999999949), ('SPEAKER_03', 10.209375000000003)]

# Get all segments together with their track names
segments = list(annotations.itertracks())
# Output: [(<Segment(1.31347, 1.51597)>, 'A'), (<Segment(2.03909, 3.94597)>, 'B'), (<Segment(4.84034, 6.76409)>, 'C')...]

# Get the timeline of a given speaker
duration = annotations.label_timeline('SPEAKER_00')
# Output: [[ 00:00:20.787 -->  00:00:22.474][ 00:00:22.947 -->  00:00:24.803][ 00:00:25.090 -->  00:00:27.672]...]

# Get the annotation restricted to a given time range
ann_crop = annotations.crop(Segment(10, 30))
# [ 00:00:10.000 -->  00:00:10.324] E SPEAKER_03
# [ 00:00:10.324 -->  00:00:11.455] F SPEAKER_02
# [ 00:00:12.130 -->  00:00:14.577] G SPEAKER_03
# [ 00:00:20.787 -->  00:00:22.474] H SPEAKER_00
# [ 00:00:22.947 -->  00:00:24.803] I SPEAKER_00
# [ 00:00:25.090 -->  00:00:27.672] J SPEAKER_00

# Get the label with the longest speaking time
label_max = annotations.argmax()
# Output: 'SPEAKER_02'

# Replace the original labels; label_mapping has the form {'old_label': 'new_label'}
# The renamed annotation is returned, so keep the result
result = annotations.rename_labels(mapping=label_mapping)
# [ 00:00:22.947 -->  00:00:24.803] I aside
# [ 00:00:25.090 -->  00:00:27.672] J aside
# [ 00:00:32.684 -->  00:00:34.709] K peppa mom
# [ 00:00:34.962 -->  00:00:35.974] L aside
# [ 00:00:35.974 -->  00:00:40.176] M peppa dad
# [ 00:00:40.345 -->  00:00:43.315] N peppa dad
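
To make the behavior of crop, chart and argmax more concrete, here is a dependency-free sketch that mimics them on plain (start, end, speaker) tuples. It is not the pyannote.core implementation: the three functions are hypothetical stand-ins, and the sketch assumes that turns of the same speaker do not overlap each other.

```python
from collections import defaultdict

def crop(turns, start, end):
    """Keep only the part of each (start, end, speaker) turn inside [start, end]."""
    cropped = []
    for t_start, t_end, speaker in turns:
        lo, hi = max(t_start, start), min(t_end, end)
        if hi > lo:  # the turn intersects the window
            cropped.append((lo, hi, speaker))
    return cropped

def chart(turns):
    """Total speaking time per speaker, longest first."""
    totals = defaultdict(float)
    for t_start, t_end, speaker in turns:
        totals[speaker] += t_end - t_start
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def argmax(turns):
    """Speaker with the longest total duration, or None if there are no turns."""
    ranking = chart(turns)
    return ranking[0][0] if ranking else None

turns = [
    (10.0, 11.0, "SPEAKER_03"),
    (11.0, 15.0, "SPEAKER_02"),
    (20.0, 28.0, "SPEAKER_00"),
]

# Inside the window, SPEAKER_03 keeps 0.5 s and SPEAKER_02 keeps 2.0 s
window = crop(turns, 10.5, 13.0)
dominant = argmax(window)
```

Cropping to a window and then taking the argmax is the combination that picks the dominant speaker of a time range, which is what the integration step relies on.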

Integration

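I use whisper as the speech recognition model. Speaker recognition and speech recognition both return results in the same shape: a start time, an end time, and the corresponding result (the speaker or the spoken text).
The output format we want is: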

# start: start time / end: end time / speaker: who is speaking / text: what was said
[start end] speaker text
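
One possible way to render a line in this format is sketched below. The HH:MM:SS.mmm timestamp style is an assumption chosen to resemble the way pyannote prints segments, and `format_timestamp` and `format_line` are hypothetical helpers.

```python
def format_timestamp(seconds):
    """Render a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"

def format_line(start, end, speaker, text):
    """Build one '[start end] speaker text' line."""
    return f"[{format_timestamp(start)} {format_timestamp(end)}] {speaker} {text.strip()}"

line = format_line(3661.5, 3663.0, "SPEAKER_00", " hello everyone")
```

Stripping the text is a small convenience, since recognized segments may start with a space.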

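Since Chinese speech recognition is currently the more mature of the two technologies, we take the speech recognition result as the baseline and attach the matching speaker information to it. We assume here that each sentence has exactly one speaker. The steps are: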

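1. Run whisper and pyannote on the audio to obtain the speech recognition result speech and the speaker diarization result diarization.
2. Compare embeddings against the voiceprint library to match the labels in diarization, producing a label mapping.
3. Replace the labels in diarization.
4. Iterate over speech, using the Annotation methods crop and argmax to extract the speaker of each sentence.
5. Combine each speech result with its speaker.
The whisper and embedding steps are already covered in my other articles and are not repeated here. The implementation is as follows: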

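from pyannote.core import Segment, Annotation

# Inputs: speech recognition result, diarization result, voiceprint library matching result
def integration(speech, diarization, label_mapping):
  # Replace the original labels; rename_labels returns the renamed annotation, so keep the result
  diarization = diarization.rename_labels(mapping=label_mapping)
  results = []
  # Iterate over the speech recognition segments
  for item in speech['segments']:
    start = item['start']
    end = item['end']
    text = item['text']
    # Speaker with the longest speaking time inside this segment (None if no speaker was detected)
    speaker = diarization.crop(Segment(start, end)).argmax()
    line = f"{start:.1f}s - {end:.1f}s {speaker} {text}"
    results.append(line)
  return results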
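
The label_mapping argument comes from step 2, which this article does not repeat. As a rough illustration only, the sketch below builds such a mapping by cosine similarity between one embedding per diarization label and the enrolled voiceprints. It assumes the embeddings have already been extracted; `build_label_mapping`, the 0.5 threshold, and the toy vectors are all hypothetical and not taken from the author's other articles.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_label_mapping(label_embeddings, voiceprints, threshold=0.5):
    """Map each diarization label to the most similar enrolled name.

    Labels whose best similarity stays below `threshold` are left out of
    the mapping. The threshold value is an arbitrary placeholder.
    """
    mapping = {}
    for label, emb in label_embeddings.items():
        best_name, best_score = None, threshold
        for name, ref in voiceprints.items():
            score = cosine_similarity(emb, ref)
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is not None:
            mapping[label] = best_name
    return mapping

# Toy 3-dimensional vectors; real speaker embeddings are much longer.
voiceprints = {"peppa mom": [1.0, 0.0, 0.0], "peppa dad": [0.0, 1.0, 0.0]}
label_embeddings = {
    "SPEAKER_00": [0.9, 0.1, 0.0],
    "SPEAKER_01": [0.1, 0.8, 0.1],
    "SPEAKER_02": [0.0, 0.0, 1.0],  # close to nobody in the library
}
label_mapping = build_label_mapping(label_embeddings, voiceprints)
```

A label that is missing from the mapping, such as SPEAKER_02 here, would be expected to keep its original name after renaming, so unmatched speakers remain distinguishable in the final output.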