Speech Recognition (Whisper Deployment)

Article link: https://blog.csdn.net/weixin_45515807/article/details/144540210

Whisper deployment

Repository: https://github.com/openai/whisper?tab=readme-ov-file

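According to the README, the models were trained and tested with Python 3.9.9 and PyTorch 1.10.1, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably OpenAI's tiktoken for its fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command: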

pip install -U openai-whisper

Alternatively, the following command pulls and installs the latest commit from the repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of the repository, run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
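After installing, it can be useful to confirm that the interpreter is in the stated 3.8-3.11 range and that the package is actually visible to it. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `python_in_supported_range` are hypothetical and not part of Whisper.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a pip package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def python_in_supported_range(version_info=sys.version_info):
    """Check the interpreter against the 3.8-3.11 range quoted from the README."""
    return (3, 8) <= tuple(version_info[:2]) <= (3, 11)


if __name__ == "__main__":
    print("Python in expected range:", python_in_supported_range())
    print("openai-whisper version:", installed_version("openai-whisper"))
```

A result of `None` for the version usually means the package was installed into a different environment than the one running the script.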

Whisper also requires the command-line tool ffmpeg to be installed on your system. It is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
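Whichever package manager is used, what matters is that the `ffmpeg` executable ends up on the PATH. A small sketch of that check from Python, using the standard library's `shutil.which`; the function name `find_ffmpeg` is a hypothetical helper, not part of Whisper.

```python
import shutil


def find_ffmpeg(executable="ffmpeg"):
    """Return the full path of the executable if it is on PATH, otherwise None."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_ffmpeg()
    if path is None:
        print("ffmpeg was not found on PATH; install it with one of the commands above.")
    else:
        print("ffmpeg found at", path)
```

Running this check before loading a model surfaces a missing ffmpeg early, rather than as a failure at the moment audio is first loaded.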

Available models and languages

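There are six model sizes, four of which also have English-only versions, offering tradeoffs between speed and accuracy. The table below lists the available models with their approximate memory requirements and inference speed relative to the large model. The relative speeds were measured by transcribing English speech on an A100; actual speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.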

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~10x
base	74 M	`base.en`	`base`	~1 GB	~7x
small	244 M	`small.en`	`small`	~2 GB	~4x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x
turbo	809 M	N/A	`turbo`	~6 GB	~8x
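The VRAM column can be turned into a quick lookup for deciding which models are worth trying on a given GPU. The sketch below copies the approximate figures from the table; the names `MODEL_VRAM_GB` and `models_that_fit` are hypothetical, and because the figures are approximate the result is a starting point rather than a guarantee.

```python
from typing import List

# Approximate VRAM requirements in GB for the multilingual models, in table order.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
    "turbo": 6,
}


def models_that_fit(vram_gb: float) -> List[str]:
    """Return the model names whose approximate VRAM requirement is within the budget."""
    return [name for name, need in MODEL_VRAM_GB.items() if need <= vram_gb]


if __name__ == "__main__":
    # Example: a GPU with roughly 6 GB of memory.
    print(models_that_fit(6))
```

On a CUDA machine, the total memory in bytes can be read with `torch.cuda.get_device_properties(0).total_memory` and converted to GB before calling the helper.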

For Chinese audio, the turbo or large model is recommended.

The following command transcribes the speech in the given audio files using the turbo model:

whisper audio.flac audio.mp3 audio.wav --model turbo

Transcription can also be run from Python:

import whisper
import torch
import logging

# Configure logging to record a timestamp with each message
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')
logging.info('============================')

# Use the GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
logging.info(f"Using device: {device}")

# Load the model onto the selected device
model = whisper.load_model("turbo", device=device)

# Load the audio, convert it to a tensor, and move it to the selected device
audio = whisper.load_audio("output_combined.wav")
audio_tensor = torch.tensor(audio).to(device)  # NumPy array -> PyTorch tensor on the device

# Transcribe the audio
result = model.transcribe(audio_tensor)

# Log the transcription result
logging.info("Transcription result: %s", result["text"])
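Besides `result["text"]`, the dictionary returned by `transcribe` also carries a `segments` list whose entries include `start`, `end` (in seconds), and `text`. As a minimal sketch, those segments can be turned into SRT subtitle text; the helpers `format_timestamp` and `segments_to_srt` below are hypothetical names written for illustration, and the sample segment is made-up data rather than real model output.

```python
def format_timestamp(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Build SRT text from an iterable of dicts with 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Illustrative data only; in practice pass result["segments"].
    sample = [{"start": 0.0, "end": 2.5, "text": " hello"}]
    print(segments_to_srt(sample))
```

In the script above this would be called as `segments_to_srt(result["segments"])`. The `whisper` command-line tool can also write subtitle files directly through its `--output_format` option.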