How do you use an already-trained model in Kaldi for speech recognition (ASR)?

How do we actually put a trained model to work recognizing speech? That, after all, is the whole point of the exercise, isn't it?

Look carefully at the Kaldi source and you will notice a set of online* modules under the src directory; those are the subject of this post.

Kaldi actually contains two generations of this code, online and online2. The first-generation online code is no longer maintained and development has moved to online2, but for getting started I still recommend online: begin with the simpler version and work up from there.


By default Kaldi does not build the online modules, so you have to ask for the extra targets explicitly:

[houwenbin@localhost ~]$ cd ~/kaldi-master/src

[houwenbin@localhost src]$ make ext -j 6

The onlinebin tools build without trouble.
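One note before going further: the onlinebin tools use PortAudio for microphone capture, so if make ext fails with PortAudio-related errors, install it first with the helper script shipped in Kaldi's tools directory and rebuild. A minimal sketch, assuming the same ~/kaldi-master checkout as above:

[houwenbin@localhost src]$ cd ~/kaldi-master/tools
[houwenbin@localhost tools]$ ./install_portaudio.sh
[houwenbin@localhost tools]$ cd ../src && make ext -j 6
[houwenbin@localhost src]$ ls onlinebin/    # the decoders used below should now be here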


Run online-gmm-decode-faster with no arguments and it prints its usage message and the full option list:
[houwenbin@localhost online_demo]$ ~/kaldi-master/src/onlinebin/online-gmm-decode-faster 
/home/houwenbin/kaldi-master/src/onlinebin/online-gmm-decode-faster 

Decode speech, using microphone input(PortAudio)

Utterance segmentation is done on-the-fly.
Feature splicing/LDA transform is used, if the optional(last) argument is given.
Otherwise delta/delta-delta(2-nd order) features are produced.

Usage: online-gmm-decode-faster [options] <model-in> <fst-in> <word-symbol-table> <silence-phones> [<lda-matrix-in>]

Example: online-gmm-decode-faster --rt-min=0.3 --rt-max=0.5 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 model HCLG.fst words.txt '1:2:3:4:5' lda-matrix
Options:
  --acoustic-scale            : Scaling factor for acoustic likelihoods (float, default = 0.1)
  --batch-size                : Number of feature vectors processed w/o interruption (int, default = 27)
  --beam                      : Decoding beam.  Larger->slower, more accurate. (float, default = 16)
  --beam-delta                : Increment used in decoder [obscure setting] (float, default = 0.5)
  --beam-update               : Beam update rate (float, default = 0.01)
  --cmn-window                : Number of feat. vectors used in the running average CMN calculation (int, default = 600)
  --delta-order               : Order of delta computation (int, default = 2)
  --delta-window              : Parameter controlling window for delta computation (actual window size for each delta order is 1 + 2*delta-window-size) (int, default = 2)
  --hash-ratio                : Setting used in decoder to control hash behavior (float, default = 2)
  --inter-utt-sil             : Maximum # of silence frames to trigger new utterance (int, default = 50)
  --left-context              : Number of frames of left context (int, default = 4)
  --max-active                : Decoder max active states.  Larger->slower; more accurate (int, default = 2147483647)
  --max-beam-update           : Max beam update rate (float, default = 0.05)
  --max-utt-length            : If the utterance becomes longer than this number of frames, shorter silence is acceptable as an utterance separator (int, default = 1500)
  --min-active                : Decoder min active states (don't prune if #active less than this). (int, default = 20)
  --min-cmn-window            : Minumum CMN window used at start of decoding (adds latency only at start) (int, default = 100)
  --num-tries                 : Number of successive repetitions of timeout before we terminate stream (int, default = 5)
  --right-context             : Number of frames of right context (int, default = 4)
  --rt-max                    : Approximate maximum decoding run time factor (float, default = 0.75)
  --rt-min                    : Approximate minimum decoding run time factor (float, default = 0.7)
  --update-interval           : Beam update interval in frames (int, default = 3)

Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)
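A word about the <silence-phones> positional argument: it is a colon-separated list of integer phone IDs that the decoder treats as silence; for the VoxForge demo models below it is '1:2:3:4:5'. If you bring your own model and have its phones.txt symbol table at hand (the demo tarball does not necessarily include one, so take this only as an illustration that assumes the usual Kaldi phone naming), you can collect the IDs along these lines:

# Hypothetical helper: grab the IDs of the silence-type phones (SIL/SPN/NSN
# and their word-position variants) and join them with colons.
grep -E '^(SIL|SPN|NSN)' phones.txt | awk '{print $2}' | paste -sd: -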

For an end-to-end recognition example, go to egs/voxforge/online_demo and look at the run.sh below:

#!/bin/bash

# Copyright 2012 Vassil Panayotov
# Apache 2.0

# Note: you have to do 'make ext' in ../../../src/ before running this.

# Set the paths to the binaries and scripts needed
KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/../s5/utils/:$KALDI_ROOT/src/onlinebin:$KALDI_ROOT/src/bin:$PATH

data_file="online-data"
data_url="http://sourceforge.net/projects/kaldi/files/online-data.tar.bz2"

# Change this to "tri2a" if you like to test using a ML-trained model
ac_model_type=tri2b_mmi

# Alignments and decoding results are saved in this directory(simulated decoding only)
decode_dir="./work"

# Change this to "live" either here or using command line switch like:
# --test-mode live
test_mode="simulated"

. parse_options.sh

ac_model=${data_file}/models/$ac_model_type
trans_matrix=""
audio=${data_file}/audio

if [ ! -s ${data_file}.tar.bz2 ]; then
    echo "Downloading test models and data ..."
    wget -T 10 -t 3 $data_url;

    if [ ! -s ${data_file}.tar.bz2 ]; then
        echo "Download of $data_file has failed!"
        exit 1
    fi
fi

if [ ! -d $ac_model ]; then
    echo "Extracting the models and data ..."
    tar xf ${data_file}.tar.bz2
fi

if [ -s $ac_model/matrix ]; then
    trans_matrix=$ac_model/matrix
fi

case $test_mode in
    live)
        echo
        echo -e "  LIVE DEMO MODE - you can use a microphone and say something\n"
        echo "  The (bigram) language model used to build the decoding graph was"
        echo "  estimated on an audio book's text. The text in question is"
        echo "  \"King Solomon's Mines\" (http://www.gutenberg.org/ebooks/2166)."
        echo "  You may want to read some sentences from this book first ..."
        echo
        online-gmm-decode-faster --rt-min=0.5 --rt-max=0.7 --max-active=4000 \
           --beam=12.0 --acoustic-scale=0.0769 $ac_model/model $ac_model/HCLG.fst \
           $ac_model/words.txt '1:2:3:4:5' $trans_matrix;;

    simulated)
        echo
        echo -e "  SIMULATED ONLINE DECODING - pre-recorded audio is used\n"
        echo "  The (bigram) language model used to build the decoding graph was"
        echo "  estimated on an audio book's text. The text in question is"
        echo "  \"King Solomon's Mines\" (http://www.gutenberg.org/ebooks/2166)."
        echo "  The audio chunks to be decoded were taken from the audio book read"
        echo "  by John Nicholson(http://librivox.org/king-solomons-mines-by-haggard/)"
        echo
        echo "  NOTE: Using utterances from the book, on which the LM was estimated"
        echo "        is considered to be \"cheating\" and we are doing this only for"
        echo "        the purposes of the demo."
        echo
        echo "  You can type \"./run.sh --test-mode live\" to try it using your"
        echo "  own voice!"
        echo
        mkdir -p $decode_dir
        # make an input .scp file
        > $decode_dir/input.scp
        for f in $audio/*.wav; do
            bf=`basename $f`
            bf=${bf%.wav}
            echo $bf $f >> $decode_dir/input.scp
        done
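        # Positional arguments of online-wav-gmm-decode-faster: wav rspecifier,
        # acoustic model, decoding graph (HCLG), word symbol table,
        # colon-separated silence phone IDs, transcript wspecifier, alignment
        # wspecifier and, optionally, the LDA/feature-transform matrix.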
        online-wav-gmm-decode-faster --verbose=1 --rt-min=0.8 --rt-max=0.85\
            --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 \
            scp:$decode_dir/input.scp $ac_model/model $ac_model/HCLG.fst \
            $ac_model/words.txt '1:2:3:4:5' ark,t:$decode_dir/trans.txt \
            ark,t:$decode_dir/ali.txt $trans_matrix;;

    *)
        echo "Invalid test mode! Should be either \"live\" or \"simulated\"!";
        exit 1;;
esac

# Estimate the error rate for the simulated decoding
if [ $test_mode == "simulated" ]; then
    # Convert the reference transcripts from symbols to word IDs
    sym2int.pl -f 2- $ac_model/words.txt < $audio/trans.txt > $decode_dir/ref.txt

    # Compact the hypotheses belonging to the same test utterance
    cat $decode_dir/trans.txt |\
        sed -e 's/^\(test[0-9]\+\)\([^ ]\+\)\(.*\)/\1 \3/' |\
        gawk '{key=$1; $1=""; arr[key]=arr[key] " " $0; } END { for (k in arr) { print k " " arr[k]} }' > $decode_dir/hyp.txt

    # Finally compute WER
    compute-wer --mode=present ark,t:$decode_dir/ref.txt ark,t:$decode_dir/hyp.txt
fi
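The simulated branch is easy to adapt to recordings of your own. Below is a minimal sketch; the directory names my_audio and my_work are mine, not part of the demo, the demo setup assumes 16 kHz, 16-bit, mono WAV input (so convert your files first, e.g. with sox), and the final matrix argument should be dropped if you use the tri2a model, which has no LDA transform:

ac_model=online-data/models/tri2b_mmi
mkdir -p my_audio my_work

# Convert a recording to the expected format (adjust paths as needed).
sox recording.wav -r 16000 -b 16 -c 1 my_audio/utt1.wav

# Build the "utterance-id wav-path" scp file read via scp: below.
> my_work/input.scp
for f in my_audio/*.wav; do
    echo "$(basename $f .wav) $f" >> my_work/input.scp
done

online-wav-gmm-decode-faster --verbose=1 --rt-min=0.8 --rt-max=0.85 \
    --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 \
    scp:my_work/input.scp $ac_model/model $ac_model/HCLG.fst \
    $ac_model/words.txt '1:2:3:4:5' ark,t:my_work/trans.txt \
    ark,t:my_work/ali.txt $ac_model/matrix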

The script automatically downloads the pretrained models and test data from http://sourceforge.net/projects/kaldi/files/online-data.tar.bz2.

Then simply run ./run.sh --test-mode simulated and it decodes the bundled wav files directly:

[houwenbin@localhost ~]$ cd ~/kaldi-master/egs/voxforge/online_demo
[houwenbin@localhost online_demo]$ 
[houwenbin@localhost online_demo]$ ./run.sh --test-mode simulated

  SIMULATED ONLINE DECODING - pre-recorded audio is used

  The (bigram) language model used to build the decoding graph was
  estimated on an audio book's text. The text in question is
  "King Solomon's Mines" (http://www.gutenberg.org/ebooks/2166).
  The audio chunks to be decoded were taken from the audio book read
  by John Nicholson(http://librivox.org/king-solomons-mines-by-haggard/)

  NOTE: Using utterances from the book, on which the LM was estimated
        is considered to be "cheating" and we are doing this only for
        the purposes of the demo.

  You can type "./run.sh --test-mode live" to try it using your
  own voice!

online-wav-gmm-decode-faster --verbose=1 --rt-min=0.8 --rt-max=0.85 --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 scp:./work/input.scp online-data/models/tri2b_mmi/model online-data/models/tri2b_mmi/HCLG.fst online-data/models/tri2b_mmi/words.txt 1:2:3:4:5 ark,t:./work/trans.txt ark,t:./work/ali.txt online-data/models/tri2b_mmi/matrix 
File: test1
YOUR WARRIORS MUST GROW WEARY OF RESTING ON THEIR SPEARS INFADOOS 

MY LORD THERE WAS ONE WAR JUST AFTER WE DESTROYED THE PEOPLE IT CAME DOWN UPON US BUT IT WAS A CIVIL WAR DOG ATE DOG 

HOW WAS THAT 

MY LORD THE KING MY HALF BROTHER HOW BROTHER BORN AT THE SAME BIRTH AND OF THE SAME WOMAN 

IT IS NOT OUR CUSTOM MY LORD TO SUFFER TWINS TO LIVE THE WEAKER ALWAYS BEEN MUST DIE 

BUT THE MOTHER OF THE KING HID AWAY THE FEEBLER CHILD WHICH WAS BORN THE LAST 

FOR HER HEART YEARNED OVER IT 

AND THAT CHILD IS TWALA THE KING 

File: test2
I AM HIS YOUNGER BROTHER 

BORN ANOTHER WIFE 

WELL 

MY LORD KAFA OUR FATHER DIED WHEN WE CAME TO MANHOOD 

IN MY BROTHER IMOTU WAS MADE KING AND HIS PLACE 

AND FOR A SPACE REIGNED AND HAD A SIGN BY HIS FAVOURITE WIFE 

WHEN THE BABE WAS THREE YEARS OLD JUST AFTER THE GREAT WAR DURING WHICH NO MAN COULD SOW OR REAP A FAMINE CAME UPON THE LAND 

AND THE PEOPLE MURMURED BECAUSE OF THE FAMINE AND LOOKED ROUND LIKE A STARVED LION FOR SOMETHING TO REND 

THAT IT WAS DEAD GO GOLD THE WISE AND TERRIBLE WOMAN WHO DOES NOT DIE 

MADE A PROCLAMATION TO THE PEOPLE SAYING THE KING IMOTU IS NO GAME 

AND AT THE TIME IMOTU WAS SICK WAS A WIND AND LAY IN HIS KRAAL NOT ABLE TO MOVE 

THEN GAGOOL WENT INTO A HUT AND LED OUT TWALA MY HALF BROTHER 

File: test3
EVEN IF THIS CHILD IGNOSI HAD LIVED HE WOULD BE THE TRUE KING OF THE KUKUANA PEOPLE 

I SAW MY LORD THE SACRED SNAKE IS ROUND HIS LITTLE 

IGNOSI IS KING 

BUT ALAS HE IS LONG DEAD 

SEE MY LORDS AND INFADOOS POINTED TO A VAST COLLECTION OF HUTS SURROUNDED BY A FENCE WHICH WAS ITS TURN ENCIRCLED BY A GREAT DITCH THAT LIE ON THE PLAIN BENEATH US 

THAT IS THE KRAAL WHERE THE WHITE HEARD HIM OR TWO WAS LAST SCENE WITH THE CHILD IGNOSI 

IT IS THERE THAT WE SHALL SLEEP TO NIGHT IF INDEED HE ADDED DOUBTFULLY 

MY LORDS SLEEP UPON THIS EARTH 

compute-wer --mode=present ark,t:./work/ref.txt ark,t:./work/hyp.txt 
%WER 10.11 [ 37 / 366, 5 ins, 10 del, 22 sub ]
%SER 100.00 [ 3 / 3 ]
Scored 3 sentences, 0 not present in hyp.
[houwenbin@localhost online_demo]$ 
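A quick reading of the scoring output: WER is (insertions + deletions + substitutions) / reference words, so 5 + 10 + 22 = 37 errors over 366 reference words gives 10.11%. SER counts a whole utterance as wrong if it contains any error, hence 3/3 = 100% here. The decoder's raw hypotheses and frame-level alignments are left in ./work/trans.txt and ./work/ali.txt if you want to look at them in detail.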

