Speech recognition with Kaldi and the CVTE open-source model
Download the model
CVTE has open-sourced a Chinese model for Kaldi.
Model download URL: http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz
Extract the archive and place the result under kaldi/egs/.
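The download and extraction step could be scripted roughly as follows. This is a minimal sketch: fetch_and_unpack is a hypothetical helper name, KALDI_ROOT is an assumed variable pointing at your Kaldi checkout, and curl is assumed to be installed.

```shell
# Sketch: download the CVTE model archive and unpack it under kaldi/egs/.
# fetch_and_unpack is a hypothetical helper; KALDI_ROOT is an assumed variable.
MODEL_URL="http://kaldi-asr.org/models/0002_cvte_chain_model.tar.gz"

fetch_and_unpack() {
  url="$1"; dest="$2"
  archive="$(mktemp)" || return 1
  # -f: fail on HTTP errors, -L: follow redirects, -o: write to the temp file
  curl -fL -o "$archive" "$url" || return 1
  mkdir -p "$dest" || return 1
  # -x: extract, -z: gunzip, -C: unpack inside the destination directory
  tar -xzf "$archive" -C "$dest" || return 1
  rm -f "$archive"
}

# Usage (needs network access and a Kaldi checkout at $KALDI_ROOT):
#   fetch_and_unpack "$MODEL_URL" "$KALDI_ROOT/egs"
```

After unpacking, the recipe should sit at egs/cvte/s5, which is the path the rest of this post refers to.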
Usage
Copy steps and utils from egs/wsj/s5 into egs/cvte/s5, and copy egs/hkust/s5/local/score.sh into egs/cvte/s5/local:
cp -r egs/wsj/s5/steps egs/cvte/s5/steps
cp -r egs/wsj/s5/utils egs/cvte/s5/utils
cp egs/hkust/s5/local/score.sh egs/cvte/s5/local
Comment out the exit 1 inside the if statement in utils/lang/check_phones_compatible.sh:
# check if the files exist or not
if [ ! -f $table_first ]; then
  if [ ! -f $table_second ]; then
    echo "$0: Error! Both of the two phones-symbol tables are absent."
    echo "Please check your command"
    #exit 1;
  else
    # The phones-symbol-table1 is absent. The model directory maybe created by old script.
    # For back compatibility, this script exits silently with status 0.
    exit 0;
  fi
Then run ./run.sh and the bundled test data will be decoded.
Testing your own dataset
Preparing the files
0. Audio files
The speech files must be in WAV format: 16-bit depth, 16000 Hz sample rate, mono.
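If your recordings are in another format, one way to convert them is with sox. The sketch below assumes sox is installed; to_16k_mono_wav is a hypothetical helper name, and the raw and audio directory names in the usage comment are assumptions too.

```shell
# Sketch: convert an audio file to 16 kHz, 16-bit, mono WAV with sox.
# to_16k_mono_wav is a hypothetical helper name; sox must be installed.
to_16k_mono_wav() {
  in="$1"; out="$2"
  # Options placed before the output file apply to the output:
  # -r sample rate, -c channel count, -b bits per sample
  sox "$in" -r 16000 -c 1 -b 16 "$out"
}

# Usage: convert every file in ./raw into ./audio
#   mkdir -p audio
#   for f in raw/*.wav; do to_16k_mono_wav "$f" "audio/$(basename "$f")"; done
```

Running soxi on a converted file is a quick way to confirm the sample rate, channel count, and bit depth.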
1. wav.scp
wav.scp format:
<utterance-id> <path-to-audio-file>
For example:
AUDIO_20211129_170900_0000 ./audio/2021_11_29_17.09.00_0000.wav
AUDIO_20211129_170901_0000 ./audio/2021_11_29_17.09.01_0000.wav
AUDIO_20211129_170902_0000 ./audio/2021_11_29_17.09.02_0000.wav
AUDIO_20211129_170903_0000 ./audio/2021_11_29_17.09.03_0000.wav
AUDIO_20211129_170904_0000 ./audio/2021_11_29_17.09.04_0000.wav
AUDIO_20211129_170905_0000 ./audio/2021_11_29_17.09.05_0000.wav
AUDIO_20211129_170906_0000 ./audio/2021_11_29_17.09.06_0000.wav
AUDIO_20211129_170907_0000 ./audio/2021_11_29_17.09.07_0000.wav
AUDIO_20211129_170908_0000 ./audio/2021_11_29_17.09.08_0000.wav
AUDIO_20211129_170909_0000 ./audio/2021_11_29_17.09.09_0000.wav
AUDIO_20211129_170910_0000 ./audio/2021_11_29_17.09.10_0000.wav
AUDIO_20211129_170911_0000 ./audio/2021_11_29_17.09.11_0000.wav
AUDIO_20211129_170912_0000 ./audio/2021_11_29_17.09.12_0000.wav
AUDIO_20211129_170913_0000 ./audio/2021_11_29_17.09.13_0000.wav
AUDIO_20211129_170914_0000 ./audio/2021_11_29_17.09.14_0000.wav
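Writing wav.scp by hand gets tedious for larger sets. The following is a minimal sketch that generates it from a directory, assuming file names follow the pattern in the listing above (for example 2021_11_29_17.09.00_0000.wav); make_wav_scp is a hypothetical helper name.

```shell
# Sketch: build wav.scp from a directory of WAV files.
# Assumes file names like 2021_11_29_17.09.00_0000.wav;
# make_wav_scp is a hypothetical helper name.
make_wav_scp() {
  audio_dir="$1"
  for f in "$audio_dir"/*.wav; do
    [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
    base="$(basename "$f" .wav)"
    # 2021_11_29_17.09.00_0000 -> AUDIO_20211129_170900_0000
    # (names that do not match the pattern are used unchanged as the ID)
    id="$(printf '%s\n' "$base" | sed -E 's/^([0-9]{4})_([0-9]{2})_([0-9]{2})_([0-9]{2})\.([0-9]{2})\.([0-9]{2})_([0-9]+)$/AUDIO_\1\2\3_\4\5\6_\7/')"
    printf '%s %s\n' "$id" "$f"
  done | LC_ALL=C sort               # Kaldi expects data files sorted in C locale order
}

# Usage: make_wav_scp ./audio > wav.scp
```

The explicit C-locale sort matters because Kaldi's data directory checks reject files that are not sorted that way.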
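2. utt2spk
Each line maps an utterance ID to a speaker ID. Ideally the utterance ID contains the speaker ID, preferably as a prefix.
This example has no speaker information, so the utterance ID stands in for the speaker ID; in other words, every recording is treated as its own speaker.
utt2spk format:
<utterance-id-1> <speaker-id-1>
<utterance-id-2> <speaker-id-2>
For example: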
AUDIO_20211129_170900_0000 AUDIO_20211129_170900_0000
AUDIO_20211129_170901_0000 AUDIO_20211129_170901_0000
AUDIO_20211129_170902_0000 AUDIO_20211129_170902_0000
AUDIO_20211129_170903_0000 AUDIO_20211129_170903_0000
AUDIO_20211129_170904_0000 AUDIO_20211129_170904_0000
AUDIO_20211129_170905_0000 AUDIO_20211129_170905_0000
AUDIO_20211129_170906_0000 AUDIO_20211129_170906_0000
AUDIO_20211129_170907_0000 AUDIO_20211129_170907_0000
AUDIO_20211129_170908_0000 AUDIO_20211129_170908_0000
AUDIO_20211129_170909_0000 AUDIO_20211129_170909_0000
AUDIO_20211129_170910_0000 AUDIO_20211129_170910_0000
AUDIO_20211129_170911_0000 AUDIO_20211129_170911_0000
AUDIO_20211129_170912_0000 AUDIO_20211129_170912_0000
AUDIO_20211129_170913_0000 AUDIO_20211129_170913_0000
AUDIO_20211129_170914_0000 AUDIO_20211129_170914_0000
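When every utterance is its own speaker, utt2spk can be derived directly from wav.scp. A minimal sketch, with make_utt2spk as a hypothetical helper name:

```shell
# Sketch: derive utt2spk from wav.scp when there is no speaker information,
# so each utterance ID doubles as its own speaker ID.
# make_utt2spk is a hypothetical helper name.
make_utt2spk() {
  # Print column 1 twice: "<utterance-id> <speaker-id>"
  awk '{print $1, $1}' "$1" | LC_ALL=C sort
}

# Usage: make_utt2spk wav.scp > utt2spk
```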
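3. spk2utt
spk2utt format:
<speaker-id-1> <utterance-id> <utterance-id> <utterance-id>
<speaker-id-2> <utterance-id> <utterance-id> <utterance-id>
There is one line per speaker, with fields separated by spaces.
For example: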
AUDIO_20211129_170900_0000 AUDIO_20211129_170900_0000
AUDIO_20211129_170901_0000 AUDIO_20211129_170901_0000
AUDIO_20211129_170902_0000 AUDIO_20211129_170902_0000
AUDIO_20211129_170903_0000 AUDIO_20211129_170903_0000
AUDIO_20211129_170904_0000 AUDIO_20211129_170904_0000
AUDIO_20211129_170905_0000 AUDIO_20211129_170905_0000
AUDIO_20211129_170906_0000 AUDIO_20211129_170906_0000
AUDIO_20211129_170907_0000 AUDIO_20211129_170907_0000
AUDIO_20211129_170908_0000 AUDIO_20211129_170908_0000
AUDIO_20211129_170909_0000 AUDIO_20211129_170909_0000
AUDIO_20211129_170910_0000 AUDIO_20211129_170910_0000
AUDIO_20211129_170911_0000 AUDIO_20211129_170911_0000
AUDIO_20211129_170912_0000 AUDIO_20211129_170912_0000
AUDIO_20211129_170913_0000 AUDIO_20211129_170913_0000
AUDIO_20211129_170914_0000 AUDIO_20211129_170914_0000
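Kaldi ships utils/utt2spk_to_spk2utt.pl for this conversion, which is the usual choice inside a recipe directory. The awk sketch below does the same grouping without depending on the Kaldi scripts; make_spk2utt is a hypothetical helper name.

```shell
# Sketch: invert utt2spk into spk2utt by grouping utterance IDs per speaker.
# make_spk2utt is a hypothetical helper name.
make_spk2utt() {
  awk '{ utts[$2] = utts[$2] " " $1 }                  # append utterance to its speaker
       END { for (spk in utts) print spk utts[spk] }' "$1" |
    LC_ALL=C sort                                      # one sorted line per speaker
}

# Usage: make_spk2utt utt2spk > spk2utt
```

With one speaker per utterance, as in this post, the output is identical to utt2spk.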
Testing:
Replace the files of the same name under data/fbank/test/ with the ones prepared above, then run ./run.sh again.
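The replacement step can be sketched as a small helper that refuses to copy if any of the three files is missing. install_test_files and the my_data directory are hypothetical names; the commands in the usage comment are meant to be run from egs/cvte/s5, since relative paths in wav.scp are read relative to the directory run.sh is started from.

```shell
# Sketch: copy the three prepared files over the ones in the test data directory.
# install_test_files is a hypothetical helper; src and dest are passed as arguments.
install_test_files() {
  src="$1"; dest="$2"
  # Check all three files first so a partial copy never happens
  for name in wav.scp utt2spk spk2utt; do
    [ -f "$src/$name" ] || { echo "missing $src/$name" >&2; return 1; }
  done
  mkdir -p "$dest" || return 1
  cp "$src/wav.scp" "$src/utt2spk" "$src/spk2utt" "$dest/"
}

# Usage, from egs/cvte/s5 (my_data is an assumed directory name):
#   install_test_files my_data data/fbank/test
#   utils/validate_data_dir.sh --no-feats --no-text data/fbank/test
#   ./run.sh
```

The optional validate_data_dir.sh call catches unsorted or inconsistent files before decoding starts.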
In this test, the recognition accuracy turned out to be fairly high.