Facebook open-sources the wav2letter speech recognition toolkit (with a setup tutorial)

Latest recommended article published on 2024-10-10 07:14:55

机器之心V

Article link: https://blog.csdn.net/Uwr44UOuQcNsUQb60zk2/article/details/78948519

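Facebook AI Research (FAIR) recently open-sourced wav2letter, an end-to-end speech recognition system. The code implements the architecture described in the accompanying papers, and readers can use it to transcribe speech.

GitHub: https://github.com/facebookresearch/wav2letter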

wav2letter

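wav2letter is a simple and efficient end-to-end automatic speech recognition (ASR) system open-sourced by Facebook AI Research. The original authors of this implementation are Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve, Neil Zeghidour, and Vitaliy Liptchinsky.

wav2letter implements the architecture proposed in the papers "Wav2Letter: an End-to-End ConvNet-based Speech Recognition System" and "Letter-Based Speech Recognition with Gated ConvNets". If you use this model or the pre-trained models, please cite one of these two papers.

If you want to start transcribing speech right away, pre-trained models trained on the LibriSpeech dataset are provided.

Pre-trained models: https://github.com/facebookresearch/wav2letter#pre-trained-models

LibriSpeech dataset: http://www.openslr.org/12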

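Requirements

macOS or Linux
Torch (installation steps are given below)
For training on CPU: Intel MKL
For training on GPU: NVIDIA CUDA Toolkit (cuDNN v5.1 for CUDA 8.0)
For reading audio files: Libsndfile (should be available in any standard distribution)
For standard speech features: FFTW (should be available in any standard distribution)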
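
Before starting, it can help to see which build tools are already present. The following is a minimal sketch, not part of wav2letter: the function name `check_tools` is hypothetical, and it only checks whether a command is on `$PATH`, not whether libraries such as Libsndfile or FFTW are installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wav2letter): print each command that is
# missing from $PATH, one per line, and return non-zero if any is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example with the tools used by the build steps in this tutorial:
# check_tools git cmake make wget tar || echo "install the tools listed above first"
```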

Installation

MKL

If you plan to train on CPU, we strongly recommend installing Intel MKL.

Update your .bashrc file with the following:

    # We assume Torch will be installed in $HOME/usr.
    # Change according to your needs.
    export PATH=$HOME/usr/bin:$PATH

    # This is to detect MKL during compilation
    # but also to make sure it is found at runtime.
    INTEL_DIR=/opt/intel/lib/intel64
    MKL_DIR=/opt/intel/mkl/lib/intel64
    MKL_INC_DIR=/opt/intel/mkl/include

    if [ ! -d "$INTEL_DIR" ]; then
        echo "$ warning: INTEL_DIR out of date"
    fi
    if [ ! -d "$MKL_DIR" ]; then
        echo "$ warning: MKL_DIR out of date"
    fi
    if [ ! -d "$MKL_INC_DIR" ]; then
        echo "$ warning: MKL_INC_DIR out of date"
    fi

    # Make sure MKL can be found by Torch.
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$INTEL_DIR:$MKL_DIR
    export CMAKE_LIBRARY_PATH=$LD_LIBRARY_PATH
    export CMAKE_INCLUDE_PATH=$CMAKE_INCLUDE_PATH:$MKL_INC_DIR
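
After opening a new shell, you may want to confirm that the MKL directories really ended up on the search path. The sketch below assumes nothing beyond POSIX shell; `path_contains` is a hypothetical helper name, not something provided by Torch or MKL.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the colon-separated list $1 contains the
# directory $2 as an exact entry.
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example, using the MKL_DIR value from the .bashrc snippet above:
# path_contains "$LD_LIBRARY_PATH" /opt/intel/mkl/lib/intel64 \
#     || echo "MKL_DIR is not on LD_LIBRARY_PATH"
```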

LuaJIT + LuaRocks

The following installs LuaJIT and LuaRocks locally in $HOME/usr. For a system-wide installation, remove the -DCMAKE_INSTALL_PREFIX=$HOME/usr option.

    git clone https://github.com/torch/luajit-rocks.git
    cd luajit-rocks
    mkdir build; cd build
    cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/usr -DWITH_LUAJIT21=OFF
    make -j 4
    make install
    cd ../..

The next sections assume that LuaJIT and LuaRocks are on your $PATH. If they are not, and you installed them locally in $HOME/usr, run ~/usr/bin/luarocks and ~/usr/bin/luajit instead.
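
One way to avoid editing every later command is to resolve the tool path once. This is only a sketch: `resolve_tool` is a hypothetical helper, and the default prefix `$HOME/usr` is simply the one used in this tutorial.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the path of a tool, preferring $PATH and falling
# back to <prefix>/bin (the prefix defaults to $HOME/usr, as used above).
resolve_tool() {
    local name=$1 prefix=${2:-$HOME/usr}
    if command -v "$name" >/dev/null 2>&1; then
        command -v "$name"
    elif [ -x "$prefix/bin/$name" ]; then
        echo "$prefix/bin/$name"
    else
        echo "error: $name not found on \$PATH or in $prefix/bin" >&2
        return 1
    fi
}

# Example:
# LUAROCKS=$(resolve_tool luarocks) && "$LUAROCKS" install torch
```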

KenLM language model toolkit

The wav2letter decoder requires the KenLM toolkit, which in turn requires the Boost library.

    # make sure boost is installed (with system/thread/test modules)
    # actual command might vary depending on your system
    sudo apt-get install libboost-dev libboost-system-dev libboost-thread-dev libboost-test-dev

Once Boost is installed, you can install KenLM:

    wget https://kheafield.com/code/kenlm.tar.gz
    tar xfvz kenlm.tar.gz
    cd kenlm
    mkdir build && cd build
    cmake .. -DCMAKE_INSTALL_PREFIX=$HOME/usr -DCMAKE_POSITION_INDEPENDENT_CODE=ON
    make -j 4
    make install
    cp -a lib/* ~/usr/lib # libs are not installed by default :(
    cd ../..

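OpenMPI and TorchMPI

If you want to train on multiple CPUs or GPUs (or on multiple machines), you need to install OpenMPI and TorchMPI.

Disclaimer: we strongly recommend recompiling OpenMPI yourself. The OpenMPI binaries shipped with standard distributions are built with widely varying compilation flags, and certain flags are crucial for compiling and running TorchMPI successfully.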

First, install OpenMPI:

    wget https://www.open-mpi.org/software/ompi/v2.1/downloads/openmpi-2.1.2.tar.bz2
    tar xfj openmpi-2.1.2.tar.bz2
    cd openmpi-2.1.2; mkdir build; cd build
    ./configure --prefix=$HOME/usr --enable-mpi-cxx --enable-shared --with-slurm --enable-mpi-thread-multiple --enable-mpi-ext=affinity,cuda --with-cuda=/public/apps/cuda/9.0
    make -j 20 all
    make install

Note: openmpi-3.0.0.tar.bz2 can also be used, but in that case remove the --enable-mpi-thread-multiple option.
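
The version-dependent flag can be captured in a small helper so the configure line is not edited by hand. This is a sketch under stated assumptions: `openmpi_configure_flags` is a hypothetical name, it covers only the two versions mentioned in this tutorial, and the CUDA directory is passed in because it is site-specific.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ./configure flags from this tutorial for the
# given OpenMPI version. Only 2.1.2 and 3.0.0 are covered; for 3.0.0 the
# --enable-mpi-thread-multiple flag is left out, as the note above says.
openmpi_configure_flags() {
    local version=$1 cuda_dir=$2 thread_flag
    case "$version" in
        2.1.2) thread_flag="--enable-mpi-thread-multiple " ;;
        3.0.0) thread_flag="" ;;
        *)
            echo "error: version $version is not covered by this tutorial" >&2
            return 1
            ;;
    esac
    echo "--prefix=$HOME/usr --enable-mpi-cxx --enable-shared --with-slurm ${thread_flag}--enable-mpi-ext=affinity,cuda --with-cuda=$cuda_dir"
}

# Example (left unquoted on purpose so the flags are split into words):
# ./configure $(openmpi_configure_flags 2.1.2 /public/apps/cuda/9.0)
```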

Now TorchMPI can be installed:

    MPI_CXX_COMPILER=$HOME/usr/bin/mpicxx ~/usr/bin/luarocks install torchmpi

Installing Torch and other Torch packages

    luarocks install torch
    luarocks install cudnn # for GPU support
    luarocks install cunn # for GPU support

Installing the wav2letter packages

    git clone https://github.com/facebookresearch/wav2letter.git
    cd wav2letter
    cd gtn && luarocks make rocks/gtn-scm-1.rockspec && cd ..
    cd speech && luarocks make rocks/speech-scm-1.rockspec && cd ..
    cd torchnet-optim && luarocks make rocks/torchnet-optim-scm-1.rockspec && cd ..
    cd wav2letter && luarocks make rocks/wav2letter-scm-1.rockspec && cd ..
    # Assuming here you got KenLM in $HOME/kenlm
    # And only if you plan to use the decoder:
    cd beamer && KENLM_INC=$HOME/kenlm luarocks make rocks/beamer-scm-1.rockspec && cd ..

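Training wav2letter models

Data pre-processing

The data folder contains several scripts for pre-processing various datasets. For now, only LibriSpeech and TIMIT are provided.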

Here is an example of pre-processing the LibriSpeech ASR corpus:

    wget http://www.openslr.org/resources/12/dev-clean.tar.gz
    tar xfvz dev-clean.tar.gz
    # repeat for train-clean-100, train-clean-360, train-other-500, dev-other, test-clean, test-other
    luajit ~/wav2letter/data/librispeech/create.lua ~/LibriSpeech ~/librispeech-proc
    luajit ~/wav2letter/data/utils/create-sz.lua librispeech-proc/train-clean-100 librispeech-proc/train-clean-360 \
        librispeech-proc/train-other-500 librispeech-proc/dev-clean librispeech-proc/dev-other \
        librispeech-proc/test-clean librispeech-proc/test-other
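
The "repeat for ..." step can be written as a loop. The sketch below assumes that every subset follows the same URL pattern as dev-clean; the function names are hypothetical, and the full corpus is large, so `fetch_librispeech` is defined but not called.

```shell
#!/usr/bin/env bash
# Subsets named in the tutorial above.
LIBRISPEECH_SUBSETS="train-clean-100 train-clean-360 train-other-500 dev-clean dev-other test-clean test-other"

# Print the download URL for one subset.
# Assumption: all subsets use the same URL pattern as dev-clean.
librispeech_url() {
    echo "http://www.openslr.org/resources/12/$1.tar.gz"
}

# Download and unpack every subset into the current directory.
# wget -c resumes a partially downloaded archive.
fetch_librispeech() {
    local subset
    for subset in $LIBRISPEECH_SUBSETS; do
        wget -c "$(librispeech_url "$subset")" && tar xfz "$subset.tar.gz"
    done
}
```

Call `fetch_librispeech` from the directory where the LibriSpeech folder should be created, then run the two luajit pre-processing commands as before.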

Training

    mkdir experiments
    luajit ~/wav2letter/train.lua --train -rundir ~/experiments -runname hello_librispeech \
        -arch ~/wav2letter/arch/librispeech-glu-highdropout -lr 0.1 -lrcrit 0.0005 -gpu 1 \
        -linseg 1 -linlr 0 -linlrcrit 0.005 -onorm target -nthread 6 \
        -dictdir ~/librispeech-proc -datadir ~/librispeech-proc \
        -train train-clean-100+train-clean-360+train-other-500 \
        -valid dev-clean+dev-other -test test-clean+test-other \
        -gpu 1 -sqnorm -mfsc -melfloor 1 -surround "|" -replabel 2 -progress \
        -wnorm -normclamp 0.2 -momentum 0.9 -weightdecay 1e-05
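
In the command above, several subsets are passed to -train, -valid and -test as one argument joined by "+". If you keep the subset names in a list, a small helper can build that argument; `join_sets` is a hypothetical name used only for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with "+", the separator used by the
# -train, -valid and -test options in the command above.
join_sets() {
    local IFS='+'
    echo "$*"
}

# Example:
# luajit ~/wav2letter/train.lua --train ... -valid "$(join_sets dev-clean dev-other)" ...
```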

Training on multiple GPUs

Use OpenMPI to train on multiple GPUs:

    mpirun -n 2 --bind-to none ~/TorchMPI/scripts/wrap.sh luajit ~/wav2letter/train.lua --train -mpi -gpu 1 ...

Here we assume that mpirun is on your $PATH.

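Running the decoder (inference)

A few pre-processing steps are needed before running the decoder.

First, create a letter dictionary that includes the special repetition letters used by wav2letter:

    cat ~/librispeech-proc/letters.lst >> ~/librispeech-proc/letters-rep.lst && echo "1" >> ~/librispeech-proc/letters-rep.lst && echo "2" >> ~/librispeech-proc/letters-rep.lst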
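
Because the command above appends with `>>`, running it twice leaves duplicate entries in letters-rep.lst. The following sketch overwrites the file instead. `make_letters_rep` is a hypothetical helper, and it assumes that the repetition labels are the integers 1..N (N = 2 matches the command above) and that letters.lst ends with a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the letters file followed by the repetition
# labels 1..N to the output file, replacing any previous content.
# Usage: make_letters_rep <letters.lst> <letters-rep.lst> [N, default 2]
make_letters_rep() {
    local letters=$1 out=$2 nrep=${3:-2}
    { cat "$letters" && seq 1 "$nrep"; } > "$out"
}

# Example:
# make_letters_rep ~/librispeech-proc/letters.lst ~/librispeech-proc/letters-rep.lst 2
```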

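Next, obtain a language model and pre-process it. Here we use the pre-trained language model for LibriSpeech, but you can also train your own with KenLM. The pre-processing step converts the words to lowercase and generates their letter transcriptions (with repetition letters) in a dictionary file, dict.lst. The script may warn about words that cannot be transcribed properly because they need more repetition letters than are available (here 2, set with -r 2). This is not a problem in our case, because such words are rare.

    wget http://www.openslr.org/resources/11/3-gram.pruned.3e-7.arpa.gz
    luajit ~/wav2letter/data/utils/convert-arpa.lua ~/3-gram.pruned.3e-7.arpa.gz ~/3-gram.pruned.3e-7.arpa ~/dict.lst \
        -preprocess ~/wav2letter/data/librispeech/preprocess.lua -r 2 -letters letters-rep.lst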

Note: the pre-trained 4-gram language model 4-gram.arpa.gz can be used instead, but pre-processing may take longer.

Optional: convert the model to binary format with KenLM, which speeds up loading the language model later (we assume here that KenLM is on your $PATH).

    build_binary 3-gram.pruned.3e-7.arpa 3-gram.pruned.3e-7.bin

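We can now generate emissions for a given trained model by running test.lua on a dataset. The script also reports the letter error rate (LER) and word error rate (WER); the latter is computed from the acoustic model output without any post-processing.

    luajit ~/wav2letter/test.lua ~/experiments/hello_librispeech/001_model_dev-clean.bin -progress -show -test dev-clean -save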
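
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words; LER is the same idea applied to letters. The sketch below illustrates that standard definition with awk. It is not the code test.lua uses, which may differ in detail, and `word_errors` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Print "<errors> <reference word count>" for a reference and a hypothesis
# transcript, where errors is the word-level Levenshtein distance.
# WER is then errors / reference word count.
word_errors() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " ")
        m = split(hyp, h, " ")
        for (j = 0; j <= m; j++) prev[j] = j
        for (i = 1; i <= n; i++) {
            cur[0] = i
            for (j = 1; j <= m; j++) {
                # Force a string comparison so "1" and "1.0" count as different words.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = prev[j - 1] + cost                         # match or substitution
                if (prev[j] + 1 < best) best = prev[j] + 1        # deletion
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # insertion
                cur[j] = best
            }
            for (j = 0; j <= m; j++) prev[j] = cur[j]
        }
        print prev[m], n
    }'
}

# Example: one substitution out of three reference words.
# word_errors "the cat sat" "the hat sat"
```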

Once the emissions are stored, run the decoder to compute the WER obtained by constraining decoding with a given language model:

    luajit ~/wav2letter/decode.lua ~/experiments/hello_librispeech dev-clean -show \
        -letters ~/librispeech-proc/letters-rep.lst -words ~/dict.lst -lm ~/3-gram.pruned.3e-7.arpa \
        -lmweight 3.1639 -beamsize 25000 -beamscore 40 -nthread 10 -smearing max -show

Pre-trained models

A complete pre-trained model for LibriSpeech is provided:

    wget https://s3.amazonaws.com/wav2letter/models/librispeech-glu-highdropout.bin

To use this model for transcription, follow the requirements, installation, and decoding sections of the README in the GitHub project.

Note that this model was pre-trained on Facebook infrastructure, so you need to run test.lua with slightly different parameters to use it:

    luajit ~/wav2letter/test.lua ~/librispeech-glu-highdropout.bin -progress -show -test dev-clean -save -datadir ~/librispeech-proc/ -dictdir ~/librispeech-proc/ -gfsai