Deploying Whisper-large-v3-turbo on Linux

Latest recommended article published on 2025-03-13 03:29:36

花晓木

Categories: whisper, speech recognition. Tags: linux, whisper, xcode

Article link: https://blog.csdn.net/yhl18931306541/article/details/145857580

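As of this writing (February 25, 2025), the newest Whisper speech model is large-v3-turbo. An overview of Whisper large-v3-turbo follows.
I. Release background
OpenAI announced the Whisper large-v3-turbo speech transcription model at its DevDay event on October 1, 2024.
II. Model characteristics
Efficient inference:
large-v3-turbo runs about 8 times faster than large-v3 with almost no loss in quality.
large-v3-turbo has 809 million parameters; its decoder is cut down to 4 layers, whereas large-v3 has 32 decoder layers.
Low resource usage:
large-v3-turbo needs about 6 GB of VRAM, whereas large-v3 needs about 10 GB.
Moderate model size:
The large-v3-turbo weights take about 1.6 GB, and its parameter count falls between medium (769 million parameters) and large (1.55 billion parameters).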
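The 1.6 GB figure is consistent with storing the 809 million parameters in 16-bit floats. A quick arithmetic sketch, assuming 2 bytes per parameter and no file overhead (both are simplifying assumptions, not figures published by OpenAI):

```python
# Rough size estimate: parameter count times bytes per parameter.
# Assumption: weights are stored as float16 (2 bytes each), no file overhead.
PARAMS = {
    "medium": 769_000_000,
    "large-v3-turbo": 809_000_000,
    "large-v3": 1_550_000_000,
}
BYTES_PER_PARAM_FP16 = 2


def fp16_size_gb(num_params: int) -> float:
    """Return the approximate fp16 weight size in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9


if __name__ == "__main__":
    for name, count in PARAMS.items():
        print(f"{name}: ~{fp16_size_gb(count):.2f} GB")
```

The same estimate puts large-v3 at roughly twice the size of large-v3-turbo, which matches the direction of the VRAM figures quoted above.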
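III. Performance
According to OpenAI, large-v3-turbo keeps recognition quality high while significantly raising inference speed and lowering resource usage, which makes the model practical in more scenarios.
IV. Usage recommendations
If you use Whisper for speech recognition and want faster inference with lower resource usage, large-v3-turbo is worth trying.
You can download the large-v3-turbo model from platforms such as Hugging Face, then load and use it as described in the Whisper documentation.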
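One possible way to fetch the model files is the `snapshot_download` function from the `huggingface_hub` package. The sketch below is not the author's procedure; `cache_folder_name` and `download_model` are hypothetical helper names introduced here for illustration, and the download itself needs network access to Hugging Face.

```python
from pathlib import Path
from typing import Optional

REPO_ID = "openai/whisper-large-v3-turbo"


def cache_folder_name(repo_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repository."""
    return "models--" + repo_id.replace("/", "--")


def download_model(repo_id: str = REPO_ID, cache_dir: Optional[str] = None) -> Path:
    """Download all files of the repository and return the local snapshot path."""
    # Imported lazily so cache_folder_name works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, cache_dir=cache_dir))


# Example call (requires network access):
# snapshot_path = download_model()
```

The returned snapshot directory sits under a folder named by `cache_folder_name`, and it can be copied to another machine and passed to `from_pretrained` as a local path.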
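V. Summary
Whisper large-v3-turbo is the newest member of the Whisper family of speech models. With fast inference, low resource usage, and a moderate model size, it offers a better speech recognition experience. If you have speech recognition work to do, this latest version is worth a try.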

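Whisper large-v3-turbo compared with faster-whisper large-v3: performance and accuracy

I. Performance and accuracy
Balance between performance and accuracy: Whisper large-v3-turbo delivers much faster inference while keeping accuracy high. Although it has fewer parameters and fewer decoder layers, the model was fine-tuned after its decoder was pruned (rather than quantized), which lets it strike a good balance between speed and accuracy.
Behavior in specific scenarios: Whisper large-v3-turbo was fine-tuned for multilingual transcription and is not suited to translation. It does best on pure speech transcription, and it gives better results on high-quality recordings.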
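With the Hugging Face pipeline, the language and task are usually passed through the `generate_kwargs` argument of the pipeline call. The sketch below encodes the restriction described above; `transcription_kwargs` is a hypothetical helper name, and refusing the translate task is a design choice made for this example, not a check that the library performs.

```python
from typing import Dict, Optional


def transcription_kwargs(language: Optional[str] = None, task: str = "transcribe") -> Dict[str, str]:
    """Build generate_kwargs for a transcription-only model such as large-v3-turbo."""
    if task != "transcribe":
        # large-v3-turbo was fine-tuned for transcription, so reject other tasks early.
        raise ValueError(f"task {task!r} is not supported by this helper; use 'transcribe'")
    kwargs = {"task": task}
    if language is not None:
        # Fixing the language skips automatic language detection.
        kwargs["language"] = language
    return kwargs


if __name__ == "__main__":
    print(transcription_kwargs("chinese"))
```

A call would then look like `pipe(audio_path, generate_kwargs=transcription_kwargs("chinese"))`, where `pipe` is an automatic-speech-recognition pipeline and `audio_path` is your own file.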
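II. Use cases
faster-whisper large-v3: suited to scenarios that demand high recognition accuracy but are not very strict about inference speed, for example applications that process speech in many languages without tight real-time requirements.
Whisper large-v3-turbo: suited to scenarios with strict inference-speed requirements and tight resource limits, for example applications that process large volumes of audio with near-real-time needs, such as meeting transcription, online education, and video subtitle generation.
III. Conclusion
Compared with faster-whisper large-v3, Whisper large-v3-turbo has a clear advantage in inference speed and resource consumption while keeping accuracy high. It is therefore the better choice for applications that process large volumes of audio and need near-real-time results. Where recognition accuracy matters more and real-time behavior matters less, faster-whisper large-v3 remains a good option.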

Hugging Face website: https://huggingface.co/

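Installation:

1. Create a virtual environment

To avoid dependency conflicts, deploy inside a virtual environment.
If you do not know how to install conda, see the author's earlier article; the conda installation tutorial is at the very end:

https://blog.csdn.net/yhl18931306541/article/details/129141060?spm=1001.2014.3001.5501

Open the link above, scroll to the end, and follow the steps one by one. Then create and activate a new virtual environment: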

conda create --name whisper python=3.11.7
conda activate whisper
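To confirm that the new environment is the one actually in use, a small check like the following can help. It is a sketch that relies on the `CONDA_DEFAULT_ENV` environment variable, which conda sets on activation; `environment_summary` is a hypothetical helper name.

```python
import os
import sys


def environment_summary() -> dict:
    """Collect the interpreter version, the active conda environment, and the interpreter path."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),  # None outside conda
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```

After `conda activate whisper`, the `conda_env` entry should read `whisper` and the Python version should match the one requested at creation time.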

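2. Inside the virtual environment, run the commands below

pip install --upgrade transformers "datasets[audio]" accelerate
# The PyPI package named ffmpeg does not provide the ffmpeg executable, so install the binary with conda
conda install ffmpeg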
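Before running the transcription script, it can be useful to check that the Python packages are importable. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and the package list simply mirrors what the script in this article imports.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level packages the transcription script relies on.
REQUIRED = ("torch", "transformers", "datasets", "accelerate")


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```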

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

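# Select the compute device
# The author's machine used "cuda:4"; set the index to a GPU that exists on your system
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32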

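# Load the model
# model_id = "openai/whisper-large-v3-turbo"
# The author ran with the model ID above behind a proxy to download the model locally, then uploaded it to the Linux server
# Note: the local snapshot path below has whisper-large-v3 in its directory name; point it at your own whisper-large-v3-turbo snapshot to run the turbo model
model_id = "/data/.cache/huggingface/hub/models--openai--whisper-large-v3/snapshots/06f233fe06e710322aca913c1bc4249a0d71fce1"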

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Run speech recognition through a pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Load the dataset
# dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
# The dataset also has to be downloaded; running the command above downloads it automatically
dataset = load_dataset("/data/.cache/huggingface/hub/datasets--distil-whisper--librispeech_long/snapshots/164d3b41852b1eebe89f1dc0e6e0042f16835ea0", "clean", split="validation")
sample = dataset[0]["audio"]  # sample clip from the dataset; not used below

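# Transcribe a single audio file and return timestamps
result = pipe("/opt/src/REC1119.wav", return_timestamps=True)
# print("Pipeline result text:")
# print(result["text"])
# First uncomment the print below to inspect the structure of the timestamp information
# print(result)

# Then adapt the loop below to that structure
# Print the text together with its timestamps
for segment in result["chunks"]:
    start_time, end_time = segment["timestamp"]
    text = segment["text"]

    # The end timestamp can be None when the final segment is cut off
    end_label = f"{end_time:.2f}s" if end_time is not None else "end"
    print(f"[{start_time:.2f}s -> {end_label}] {text}")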
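Since video subtitle generation is one of the use cases mentioned earlier, the timestamped chunks can also be turned into SRT subtitle text. This is a minimal sketch, not part of the author's script: `srt_timestamp` and `chunks_to_srt` are hypothetical helper names, the input is assumed to have the same `chunks` structure as the pipeline result, and a missing end time is replaced by an assumed 2-second fallback.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used by SRT files."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(chunks: Iterable[Dict], fallback_duration: float = 2.0) -> str:
    """Convert pipeline chunks into SRT text, one numbered block per chunk."""
    blocks = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: give an open-ended final chunk a fixed duration.
            end = start + fallback_duration
        text = chunk["text"].strip()
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [{"timestamp": (0.0, 1.4), "text": "你好"}, {"timestamp": (1.4, None), "text": "你好"}]
    print(chunks_to_srt(demo))
```

With a real result, `chunks_to_srt(result["chunks"])` produces text that can be written to a `.srt` file in UTF-8.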

If the ffmpeg executable is missing, the script fails with:

raise ValueError("ffmpeg was not found but is required to load audio files from filename") from error
# Fix: install the ffmpeg binary
conda install ffmpeg
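The error above appears when no `ffmpeg` executable is on the `PATH`. A quick way to check this from Python, using only the standard library (a sketch; `find_ffmpeg` is a hypothetical helper name):

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg not found on PATH; install it, for example with: conda install ffmpeg")
    else:
        print("ffmpeg found at:", path)
```

Run it inside the activated environment, since `conda install ffmpeg` places the binary in that environment's own directory.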

Output

{'text': '你好你好你好你好', 'chunks': [{'timestamp': (0.0, 1.4), 'text': '你好'}, {'timestamp': (1.4, 2.94), 'text': '你好'}, {'timestamp': (2.94, 4.0), 'text': '你好'}, {'timestamp': (4.0, 5.44), 'text': '你好'}, ]}
[0.00s -> 1.40s] 你好
[1.40s -> 2.94s] 你好
[2.94s -> 4.00s] 你好
[4.00s -> 5.44s] 你好
[5.44s -> 5.96s] 你好
[5.96s -> 6.50s] 你好