Speaker diarization and speaker identification with pyannote.audio

This post shows how to use the pyannote.audio library to extract voiceprint embeddings from audio containing multiple speakers, identify each speaker by comparing cosine distances against pre-registered voice embeddings, and, through an example, walk through voiceprint recognition and per-speaker time segmentation.


https://github.com/pyannote/pyannote-audio

pip install pyannote.audio

Scenario:

  • An audio recording contains multiple speakers; separate out what each person said.
  • Given known voiceprint embeddings for some speakers, compute the cosine distance between each separated segment and every known embedding; the speaker with the smallest distance is taken as the one speaking (a minimal sketch of this check follows the list).
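Before the full script, here is a minimal sketch of that distance check, assuming each enrollment file contains a single speaker and using the pyannote/embedding model with window="whole" so that each file yields exactly one embedding (the file names and the known_files dict are placeholders, not part of the original post):

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

token = "hf_***"  # your Hugging Face token
embed_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
inference = Inference(embed_model, window="whole")  # one embedding per whole file

# hypothetical enrollment clips, one known speaker per file
known_files = {"mick": "mick.wav", "moon": "moon.wav"}
known_embeddings = {name: inference(path) for name, path in known_files.items()}

# hypothetical clip with an unknown speaker
unknown = inference("unknown.wav")
closest = min(known_embeddings, key=lambda name: cosine(unknown, known_embeddings[name]))
print("closest speaker:", closest)

The full script below does the same comparison, but first runs the diarization pipeline so that multi-speaker audio is split into per-speaker segments.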
# _*_ coding: utf-8 _*_
# @Time : 2024/3/16 10:47
# @Author : Michael
# @File : speaker_rec.py
# @desc : speaker diarization + voiceprint matching with pyannote.audio
import torch
from pyannote.audio import Model, Pipeline, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine


def extract_speaker_embedding(pipeline, audio_file, speaker_label):
    # Run diarization on audio_file and return the embedding of the first
    # segment attributed to speaker_label (uses the global `inference` object).
    diarization = pipeline(audio_file)
    speaker_embedding = None
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label == speaker_label:
            segment = Segment(turn.start, turn.end)
            speaker_embedding = inference.crop(audio_file, segment)
            break
    return speaker_embedding

# For a given audio file, extract a voiceprint embedding per segment and compare it with the known-speaker library
def recognize_speaker(pipeline, audio_file):
    diarization = pipeline(audio_file)
    speaker_turns = []
    for turn, _, speaker_label in diarization.itertracks(yield_label=True):
        # Extract the voiceprint embedding of this segment
        embedding = inference.crop(audio_file, turn)
        distances = {}
        for speaker, embeddings in speaker_embeddings.items():
            # Cosine distance to each known speaker's voiceprint embeddings
            distances[speaker] = min([cosine(embedding, e) for e in embeddings])
        # Pick the known speaker with the smallest distance
        recognized_speaker = min(distances, key=distances.get)
        # Record the time range and the predicted speaker
        speaker_turns.append((turn, recognized_speaker))
    return speaker_turns

if __name__ == "__main__":
    token = "hf_***"  # 请替换为您的Hugging Face Token

    # Load the speaker diarization pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,  # accept the user agreement on the model page and obtain a Hugging Face token
        # cache_dir="/home/huggingface/hub/models--pyannote--speaker-diarization-3.1/"
    )

    # Load the voiceprint embedding model
    embed_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
    inference = Inference(embed_model, window="whole")

    # pipeline.to(torch.device("cuda"))  # uncomment to run the pipeline on GPU

    # Assume you already have a set of audio files of known speakers, keyed by speaker name
    audio_files = {
        "mick": "mick.wav",  # mick的音频
        "moon": "moon.wav",  # moon的音频
    }
    speaker_embeddings = {}
    for speaker, audio_file in audio_files.items():
        diarization = pipeline(audio_file)
        for turn, _, speaker_label in diarization.itertracks(yield_label=True):
            embedding = extract_speaker_embedding(pipeline, audio_file, speaker_label)
            # collect the known speaker's original voiceprint embedding
            speaker_embeddings.setdefault(speaker, []).append(embedding)

    # A new audio file whose speakers are to be identified
    given_audio_file = "2_voice.wav"  # first half spoken by mick, second half by moon

    # Identify the speakers in the given audio
    recognized_speakers = recognize_speaker(pipeline, given_audio_file)
    print("Recognized speakers in the given audio:")
    for turn, speaker in recognized_speakers:
        print(f"Speaker {speaker} spoke between {turn.start:.2f}s and {turn.end:.2f}s")

Output:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.2.1+cpu. Bad things might happen unless you revert torch to 1.x.

Recognized speakers in the given audio:
Speaker mick spoke between 0.57s and 1.67s
Speaker moon spoke between 2.47s and 2.81s
Speaker moon spoke between 3.08s and 4.47s

The warnings indicate that the local pyannote.audio and torch versions differ from the ones the embedding model was trained with; keep this in mind, as behaviour may differ slightly.
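These warnings come from pyannote.audio comparing the versions recorded in the model checkpoint with the ones installed locally. A quick way to check what you are actually running (assuming both packages expose __version__, which recent releases do):

import torch
import pyannote.audio

print("torch:", torch.__version__)                     # e.g. 2.2.1+cpu, as in the output above
print("pyannote.audio:", pyannote.audio.__version__)   # e.g. 3.1.1

If results look off, aligning these versions more closely with the ones named in the warnings is the usual first step.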
