Chinese speech-to-text inference examples in PyTorch using pretrained models from SpeechBrain and Hugging Face

import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import warnings
warnings.filterwarnings("ignore")
# !pip install speechbrain

audio_file = "B31_385.wav"
# load the audio file, resampled to the 16 kHz rate the models expect
audio, sampling_rate = librosa.load(audio_file, sr=16_000)

# # audio
# display.Audio(audio_file, autoplay=True)

#load pre-trained model and tokenizer
tokenizer = Wav2Vec2Tokenizer.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")
model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn")

input_values = tokenizer(audio, return_tensors='pt').input_values
input_values

# store logits (non-normalized predictions)
logits = model(input_values).logits
logits

# take the highest-scoring token id for each frame (greedy decoding);
# argmax over the raw logits gives the same ids as argmax over softmax probabilities
predicted_ids = torch.argmax(logits, dim=-1)

# pass the predicted ids to the tokenizer's decode method to get the transcription
transcriptions = tokenizer.decode(predicted_ids[0])

transcriptions
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.





'地<unk>是内部圈层的最外层由封化的土层和坚映的岩石组成所以地<unk>也可称为岩石圈'
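
The argmax-then-decode step above is greedy CTC decoding: consecutive repeated ids are merged and the blank token is dropped. The sketch below illustrates that collapse rule in plain Python. The toy vocabulary, the blank id of 0, and the function name `ctc_greedy_decode` are assumptions made for illustration; the real checkpoint ships its own vocabulary and the tokenizer handles this internally.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive repeated ids, then drop the blank id."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary (not the model's real vocab.json).
toy_vocab = {0: "<pad>", 1: "地", 2: "壳"}

# One predicted id per audio frame, as torch.argmax would produce.
frame_ids = [0, 1, 1, 0, 0, 2, 2, 2, 0]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # prints 地壳
```

A blank between two identical ids keeps them as two separate characters, which is how CTC represents doubled characters.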

from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(source="speechbrain/asr-transformer-aishell",
                                           savedir="pretrained_models/asr-transformer-aishell")
asr_model.transcribe_file(audio_file)

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.





'地价 是 内部 圈层 的 最 外 层 由 分化 的 吐槽 和 签应 的 延迟 组成 所以 地区 而 也 可 称 为 严实 圈'
from speechbrain.pretrained.interfaces import foreign_class

# run inference on the GPU
asr_model = foreign_class(source="speechbrain/asr-wav2vec2-ctc-aishell", pymodule_file="custom_interface.py",
                          classname="CustomEncoderDecoderASR", run_opts={"device": "cuda"})
asr_model.transcribe_file(audio_file)
Some weights of the model checkpoint at TencentGameMate/chinese-wav2vec2-large were not used when initializing Wav2Vec2Model: ['project_q.bias', 'project_hid.bias', 'quantizer.codevectors', 'quantizer.weight_proj.weight', 'project_q.weight', 'project_hid.weight', 'quantizer.weight_proj.bias']
- This IS expected if you are initializing Wav2Vec2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





['地',
 '俏',
 '是',
 '内',
 '部',
 '圈',
 '层',
 '的',
 '最',
 '外',
 '层',
 '由',
 '封',
 '化',
 '的',
 '吐',
 '层',
 '和',
 '接',
 '应',
 '的',
 '沿',
 '石',
 '组',
 '成',
 '所',
 '以',
 '地',
 '俏',
 '也',
 '可',
 '称',
 '为',
 '颜',
 '石',
 '圈']
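
The models above return their transcripts in different shapes: a plain string, a space-separated string, and a list of single characters. To read them side by side it may help to bring them into one form. The helper below is a minimal sketch; the name `normalize_transcript` is hypothetical and is not part of SpeechBrain or Transformers.

```python
def normalize_transcript(result):
    """Return a transcript as one string with all whitespace removed."""
    if isinstance(result, (list, tuple)):
        # Output such as ['地', '俏', '是', ...] is joined into one string.
        result = "".join(result)
    # Output such as '地价 是 内部 圈层' loses its word-separating spaces.
    return "".join(result.split())


print(normalize_transcript("地价 是 内部 圈层"))  # prints 地价是内部圈层
print(normalize_transcript(["地", "俏", "是"]))   # prints 地俏是
```
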
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# test_dataset = load_dataset("common_voice", "zh-CN", split="test")

tokenizer = Wav2Vec2Processor.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")
model = Wav2Vec2ForCTC.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")

input_values = tokenizer(audio, return_tensors='pt').input_values
input_values

# store logits (non-normalized predictions)
logits = model(input_values).logits
logits

# take the highest-scoring token id for each frame (greedy decoding);
# argmax over the raw logits gives the same ids as argmax over softmax probabilities
predicted_ids = torch.argmax(logits, dim=-1)

# pass the predicted ids to decode to get the transcription
# (the variable named `tokenizer` here holds a Wav2Vec2Processor)
transcriptions = tokenizer.decode(predicted_ids[0])

transcriptions

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
It is strongly recommended to pass the ``sampling_rate`` argument to this function. Failing to do so can result in silent errors that might be hard to debug.





'地壳是内部圈层的最外层由丰化的土层和坚硬的岩始组成所以地壳也可称为岩石圈'
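
The post gives no reference transcript, so a true character error rate cannot be computed from it. As a rough substitute, the sketch below measures how far two of the model outputs above disagree, using character-level Levenshtein distance. The function name `edit_distance` and the variable names are illustrative assumptions; the two strings are copied from the outputs shown earlier.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Output of the last Hugging Face model in this post.
hf_output = "地壳是内部圈层的最外层由丰化的土层和坚硬的岩始组成所以地壳也可称为岩石圈"
# Output of the SpeechBrain wav2vec2 CTC model, joined from its character list.
sb_output = "地俏是内部圈层的最外层由封化的吐层和接应的沿石组成所以地俏也可称为颜石圈"

distance = edit_distance(hf_output, sb_output)
print(f"{distance} character edits over {len(hf_output)} characters")
```

This number only says how much the two models disagree with each other. With a trusted reference transcript in place of one of the strings, the same function divided by the reference length would give the character error rate.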