End-to-End Speech Recognition: The ESPnet ASR Recipe Flow (asr.sh)

marki707

Original link: https://kan-bayashi.github.io/asj-espnet2-tutorial/

This article walks through the flow of the ASR recipe. The script in the ASR recipe template (asr.sh) consists of 14 stages.

Below is a brief overview of each stage.

Stage 1: Creates the data directories for the training, validation, and evaluation sets. This work is done by local/data.sh.

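Stage 2 (optional): Augments the data by speed perturbation. It runs only when the --speed_perturb_factors option is specified. The wav.scp in the training set directory created in Stage 1 is expanded using the sox command.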
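To make the Stage 2 expansion concrete, here is a minimal sketch of how a wav.scp could be expanded for several speed factors. It is not the code asr.sh runs. It assumes each wav.scp line has the form "<utt_id> <wav_path>", and the `expand_wav_scp` helper and the `sp<factor>-` ID prefix are hypothetical names chosen for illustration.

```shell
# Sketch: expand a wav.scp for speed perturbation.
# Assumptions (not taken from asr.sh): input lines are "<utt_id> <wav_path>",
# and perturbed copies get a hypothetical "sp<factor>-" prefix on the ID.
expand_wav_scp() {
    local wav_scp="$1"
    shift
    local factor
    for factor in "$@"; do
        if [ "${factor}" = "1.0" ]; then
            # Factor 1.0 keeps the original entries unchanged.
            cat "${wav_scp}"
        else
            # Other factors become a sox pipe that changes the speed on the fly.
            awk -v f="${factor}" \
                '{ printf "sp%s-%s sox %s -t wav - speed %s |\n", f, $1, $2, f }' \
                "${wav_scp}"
        fi
    done
}
```

Calling `expand_wav_scp data/train/wav.scp 0.9 1.0 1.1` (with a hypothetical path) would print three entries per utterance, one for each factor.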

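Stage 3: Feature extraction. The processing depends on the --feats_type option. With the default, feats_type=raw, wav.scp is only reformatted and no features are extracted. For any setting other than feats_type=raw, features are extracted with Kaldi, in which case Kaldi must be compiled.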

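Stage 4: Filters utterances. Utterances shorter than the minimum threshold or longer than the maximum threshold are removed from the training and validation sets. The two thresholds are set with the --min_wav_duration and --max_wav_duration options, respectively.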
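The Stage 4 filtering rule can be sketched in a few lines of awk. This is an illustration only: it assumes a hypothetical input file with "<utt_id> <duration_in_seconds>" lines, the `filter_by_duration` helper is not part of asr.sh, and keeping utterances that sit exactly on a threshold is an assumption.

```shell
# Sketch: keep only utterances whose duration lies within [min, max] seconds.
# Assumption: the input file has "<utt_id> <duration_in_seconds>" per line.
filter_by_duration() {
    local utt2dur="$1" min_dur="$2" max_dur="$3"
    # Print the IDs of the utterances that survive the filter.
    awk -v min="${min_dur}" -v max="${max_dur}" \
        '$2 >= min && $2 <= max { print $1 }' "${utt2dur}"
}
```

With the defaults from the help text, the call would be `filter_by_duration <file> 0.1 20`.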

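Stage 5: Creates the token list (dictionary). The type of token depends on the --token_type option. For ASR, token_type=char and token_type=bpe are available. With token_type=bpe, SentencePiece splits the text into subwords.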
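For token_type=char, the idea behind Stage 5 can be sketched as collecting every distinct character in the transcripts. The sketch below is not the asr.sh implementation. It assumes a text file with "<utt_id> <transcript>" lines, the `make_char_token_list` helper is hypothetical, and placing `<blank>` and `<unk>` first, `<sos/eos>` last, and writing spaces as `<space>` are assumptions based on the default symbols shown in the help text.

```shell
# Sketch: build a character-level token list from a transcript file.
# Assumptions: lines are "<utt_id> <transcript>"; the special-symbol order
# and the "<space>" token are illustrative, not confirmed asr.sh behavior.
make_char_token_list() {
    local text="$1"
    echo "<blank>"
    echo "<unk>"
    # Drop the utterance ID, emit one character per line,
    # rename the space character, then deduplicate.
    cut -d ' ' -f 2- "${text}" | grep -o . | sed 's/ /<space>/' | LC_ALL=C sort -u
    echo "<sos/eos>"
}
```

For token_type=bpe the middle step would instead be a SentencePiece model trained on the same text.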

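Stage 6 (optional): Collects statistics for language model training. It obtains the shape information of each sample (sequence length and number of dimensions) so that the batch size can be changed dynamically. If no language model is used, Stages 6 to 8 can be skipped by setting the --use_lm option to use_lm=false.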

Stage 7 (optional): Trains the language model according to the --lm_config and --lm_args options.

Stage 8 (optional): Evaluates the trained language model by computing its perplexity (PPL).
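As a reminder of what the Stage 8 number means, perplexity is the exponential of the average negative log-probability per token. The following sketch computes it from a hypothetical file holding one natural-log token probability per line. The input format and the `perplexity` helper are assumptions made for illustration and do not reflect how ESPnet stores its results.

```shell
# Sketch: perplexity = exp(mean negative log-probability per token).
# Assumption: the input file has one natural-log token probability per line.
perplexity() {
    awk '{ n++; nll -= $1 }
         END { if (n == 0) exit 1
               printf "%.2f\n", exp(nll / n) }' "$1"
}
```

A model that assigns every token a probability of 0.5 would get a perplexity of about 2, and lower values indicate a better model.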

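Stage 9: Collects statistics for ASR model training. It computes the shape information of the data (sequence length and number of dimensions), which is used to change the batch size dynamically, and the statistics of the whole training set (mean and variance), which are used to normalize the features.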
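The mean and variance collected in Stage 9 can be illustrated with a one-dimensional toy case. The sketch below assumes a hypothetical file with one feature value per line and accumulates the sum and the sum of squares in a single pass. Real features are multi-dimensional and the statistics are computed per dimension, so this only shows the arithmetic. The `global_mean_var` helper is not part of ESPnet.

```shell
# Sketch: global mean and variance from a running sum and sum of squares.
# Assumption: the input file has one scalar feature value per line.
global_mean_var() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END { if (n == 0) exit 1
               mean = s / n
               # Var[x] = E[x^2] - (E[x])^2
               printf "%.4f %.4f\n", mean, ss / n - mean * mean }' "$1"
}
```

Normalizing a feature then means subtracting this mean and dividing by the square root of the variance.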

Stage 10: Trains the ASR model according to the --asr_config and --asr_args options.

Stage 11: Decodes with the trained models. Inference uses the trained language model and ASR model, according to the --inference_config and --inference_args options.

Stage 12: Evaluates the decoding results by computing the character error rate (CER) and the word error rate (WER).
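Both metrics in Stage 12 are edit distances divided by the reference length. As a minimal sketch (not the scoring tool the recipe uses), the hypothetical `wer` helper below computes the word-level Levenshtein distance between one reference and one hypothesis sentence and reports it as a percentage.

```shell
# Sketch: word error rate (%) for a single reference/hypothesis pair.
# wer is a hypothetical helper; the recipe relies on its own scoring tools.
wer() {
    awk -v ref="$1" -v hyp="$2" 'BEGIN {
        n = split(ref, r, " ")
        m = split(hyp, h, " ")
        if (n == 0) { print "reference is empty" > "/dev/stderr"; exit 1 }
        for (i = 0; i <= n; i++) d[i, 0] = i   # i deletions
        for (j = 0; j <= m; j++) d[0, j] = j   # j insertions
        for (i = 1; i <= n; i++) {
            for (j = 1; j <= m; j++) {
                # Force a string comparison so "1" and "1.0" count as different words.
                cost = ((r[i] "") == (h[j] "")) ? 0 : 1
                best = d[i - 1, j - 1] + cost                       # match or substitution
                if (d[i - 1, j] + 1 < best) best = d[i - 1, j] + 1  # deletion
                if (d[i, j - 1] + 1 < best) best = d[i, j - 1] + 1  # insertion
                d[i, j] = best
            }
        }
        printf "%.1f\n", 100 * d[n, m] / n
    }'
}
```

For example, `wer "the cat sat" "the cat sit"` prints 33.3, since one of three reference words is substituted. CER follows the same recipe with characters in place of words.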

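Stages 13-14 (optional): Pack the trained model and upload it to Zenodo. To use these stages, you need to register as a Zenodo user and create an access token. For more information, see ESPnet Model Zoo.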

All available options are listed by asr.sh --help.


$ cd espnet/egs2/TEMPLATE/asr1
$ ./asr.sh --help
2020-09-14T15:38:49(asr.sh:208:main) ./asr.sh --help
Usage: ./asr.sh --train-set <train_set_name> --valid-set <valid_set_name> --test_sets <test_set_names> --srctexts <srctexts>

Options:
    # General configuration
    --stage          # Processes starts from the specified stage(default="1").
    --stop_stage     # Processes is stopped at the specified stage(default="10000").
    --skip_data_prep # Skip data preparation stages(default="false").
    --skip_train     # Skip training stages(default="false").
    --skip_eval      # Skip decoding and evaluation stages(default="false").
    --skip_upload    # Skip packing and uploading stages(default="true").
    --ngpu           # The number of gpus("0" uses cpu, otherwise use gpu, default="1").
    --num_nodes      # The number of nodes(default="1").
    --nj             # The number of parallel jobs(default="32").
    --inference_nj   # The number of parallel jobs in decoding(default="32").
    --gpu_inference  # Whether to perform gpu decoding(default="false").
    --dumpdir        # Directory to dump features(default="dump").
    --expdir         # Directory to save experiments(default="exp").
    --python         # Specify python to execute espnet commands(default="python3").

    # Data preparation related
    --local_data_opts # The options given to local/data.sh(default="").

    # Speed perturbation related
    --speed_perturb_factors # speed perturbation factors, e.g. "0.9 1.0 1.1"(separated by space, default="").

    # Feature extraction related
    --feats_type       # Feature type(raw, fbank_pitch or extracted, default="raw").
    --audio_format     # Audio format(only in feats_type=raw, default="flac").
    --fs               # Sampling rate(default="16k").
    --min_wav_duration # Minimum duration in second(default="0.1").
    --max_wav_duration # Maximum duration in second(default="20").

    # Tokenization related
    --token_type              # Tokenization type(char or bpe, default="bpe").
    --nbpe                    # The number of BPE vocabulary(default="30").
    --bpemode                 # Mode of BPE(unigram or bpe, default="unigram").
    --oov                     # Out of vocabulary symbol(default="<unk>").
    --blank                   # CTC blank symbol(default="<blank>").
    --sos_eos                 # sos and eos symbole(default="<sos/eos>").
    --bpe_input_sentence_size # Size of input sentence for BPE(default="100000000").
    --bpe_nlsyms              # Non-linguistic symbol list for sentencepiece, separated by a comma.(default="").
    --bpe_char_cover          # Character coverage when modeling BPE(default="1.0").

    # Language model related
    --lm_tag          # Suffix to the result dir for language model training(default="").
    --lm_exp          # Specify the direcotry path for LM experiment.
                      # If this option is specified, lm_tag is ignored(default="").
    --lm_config       # Config for language model training(default="").
    --lm_args         # Arguments for language model training(default="").
                      # e.g., --lm_args "--max_epoch 10"
                      # Note that it will overwrite args in lm config.
    --use_word_lm     # Whether to use word language model(default="false").
    --word_vocab_size # Size of word vocabulary(default="10000").
    --num_splits_lm   # Number of splitting for lm corpus(default="1").

    # ASR model related
    --asr_tag          # Suffix to the result dir for asr model training(default="").
    --asr_exp          # Specify the direcotry path for ASR experiment.
                       # If this option is specified, asr_tag is ignored(default="").
    --asr_config       # Config for asr model training(default="").
    --asr_args         # Arguments for asr model training(default="").
                       # e.g., --asr_args "--max_epoch 10"
                       # Note that it will overwrite args in asr config.
    --feats_normalize  # Normalizaton layer type(default="global_mvn").
    --num_splits_asr   # Number of splitting for lm corpus (default="1").

    # Decoding related
    --inference_tag       # Suffix to the result dir for decoding(default="").
    --inference_config    # Config for decoding(default="").
    --inference_args      # Arguments for decoding(default="").
                          # e.g., --inference_args "--lm_weight 0.1"
                          # Note that it will overwrite args in inference config.
    --inference_lm        # Language modle path for decoding(default="valid.loss.ave.pth").
    --inference_asr_model # ASR model path for decoding(default="valid.acc.ave.pth").
    --download_model      # Download a model from Model Zoo and use it for decoding(default="").

    # [Task dependent] Set the datadir name created by local/data.sh
    --train_set     # Name of training set(required).
    --valid_set     # Name of validation set used for monitoring/tuning network training(required).
    --test_sets     # Names of test sets.
                    # Multiple items(e.g., both dev and eval sets) can be specified(required).
    --srctexts      # Used for the training of BPE and LM and the creation of a vocabulary list(required).
    --lm_dev_text   # Text file path of language model development set(default="").
    --lm_test_text  # Text file path of language model evaluation set(default="").
    --nlsyms_txt    # Non-linguistic symbol list if existing(default="none").
    --cleaner       # Text cleaner(default="none").
    --g2p           # g2p method(default="none").
    --lang          # The language type of corpus(default=noinfo).
    --asr_speech_fold_length # fold_length for speech data during ASR training(default="800").
    --asr_text_fold_length   # fold_length for text data during ASR training(default="150").
    --lm_fold_length         # fold_length for LM training(default="150").
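The --stage and --stop_stage options in the help text select a contiguous range of stages. Recipes in this style usually guard each stage with a pair of integer comparisons. The sketch below mimics that gating with a hypothetical `run_stages` helper that only prints which stages would run. It illustrates the pattern and is not an excerpt from asr.sh.

```shell
# Sketch of stage gating: a stage runs when stage <= n <= stop_stage.
# run_stages is a hypothetical helper that only reports the selected stages.
run_stages() {
    local stage="$1" stop_stage="$2"
    local n
    for n in 1 2 3 4 5; do
        if [ "${stage}" -le "${n}" ] && [ "${stop_stage}" -ge "${n}" ]; then
            echo "Stage ${n}: running"
        fi
    done
}

# Prints three lines, one each for stages 2, 3, and 4.
run_stages 2 4
```

This also explains the defaults: with stage=1 and stop_stage=10000, every stage satisfies both conditions, so the whole recipe runs unless a --skip_* option excludes a group of stages.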
