kaldi 强制对齐相关代码介绍

最新推荐文章于 2022-09-18 18:04:38 发布

南森阳杉

最新推荐文章于 2022-09-18 18:04:38 发布

阅读量1.2k

点赞数

分类专栏： kaldi 文章标签：自然语言处理

本文链接：https://blog.csdn.net/dlx59140096/article/details/105379886

版权

kaldi 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Corpus Phonetics Tutorial

Eleanor Chodroff

Intro Penn Forced Aligner AutoVOT Kaldi Other Resources
Prerequisites Familiarization Training Acoustic Models Conceptually Training Acoustic Models Forced Alignment
Kaldi
Forced Alignment

Once acoustic models have been created, Kaldi can also perform forced alignment on audio accompanied by a word-level transcript. Note that the Montreal Forced Aligner is a forced alignment system based on Kaldi-trained acoustic models for American English (other languages in development). If you only need an alignment of American English speech, it may be in your interest to check out their system (see also the Penn Forced Aligner and FAVE).

Otherwise, if the audio to be aligned is the same as the audio used in the acoustic models, then the alignments can be extracted directly from the alignment files. If you have new audio and transcripts, then the transcript files will need to be updated before alignment.

The full procedure will convert output from the model alignment into Praat TextGrids containing the phone-level transcript.

If the data to be aligned is the same as the training data, skip to step 4. Otherwise, you’ll need to update the transcript files and audio file specifications.

Duplicate files in data/train with new transcript and audio

Create a directory in mycorpus/data to house the new versions of text, segments, wav.scp, utt2spk, and spk2utt. See the page on data/train for details on how to create these.
cd mycorpus/data
mkdir alignme

create text, segments, wav.scp, utt2spk, and spk2utt

Extract MFCC features

Revisit the page on MFCC feature extraction for reference. You’ll need to replace data/train with the the new directory, data/alignme.
cd mycorpus

mfccdir=mfcc
for x in data/alignme; do
steps/make_mfcc.sh --cmd “$train_cmd” --nj 16 $x exp/make_mfcc/$ x $mfccdir
utils/fix_data_dir.sh data/alignme
steps/compute_cmvn_stats.sh $x exp/make_mfcc/$ x $mfccdir
utils/fix_data_dir.sh data/alignme
done

Align data

Revisit the page on triphone training and alignment for reference. Select the acoustic model and corresponding alignment process you’d like to use. You’ll need to replace data/train with the the new directory, data/alignme. As an example:
cd mycorpus
steps/align_si.sh --cmd “$train_cmd” data/alignme data/lang exp/tri4a exp/tri4a_alignme || exit 1;

Obtain CTM output from alignment files

CTM stands for time-marked conversation file and contains a time-aligned phoneme transcription of the utterances. Its format is:
utt_id channel_num start_time end_time phone_id

To obtain these, you will need to decide which acoustic models to use. The following code will extract the CTM output from the alignment files in the directory tri4a_alignme, using the acoustic models in tri4a:
cd mycorpus

for i in exp/tri4a_alignme/ali.*.gz;
do src/bin/ali-to-phones --ctm-output exp/tri4a/final.mdl ark:“gunzip -c $i|” -> ${i%.gz}.ctm;
done;

Concatenate CTM files
cd mycorpus/exp/tri4a_alignme
cat *.ctm > merged_alignment.txt
Convert time marks and phone IDs

The CTM output reports start and end times relative to the utterance, as opposed to the file. You will need the segments file located in either data/train or data/alignme to convert the utterance times into file times.

The output also reports the phone ID, as opposed to the phone itself. You will need the phones.txt file located in data/lang to convert the phone IDs into phone symbols.

id2phone.R

After obtaining the segments and phones.txt files, run id2phone.R to convert phone IDs to phones characters and map utterance times to file times. You will need to modify the file locations and possibly the regular expression to obtain the filename from the utterance name. Recall that the CTM output lists the utterance ID whereas the segments file lists the file ID. (If you named things logically, the file ID should be a subset of the utterance ID.)

id2phone.R returns a modified version of merged_alignment.txt called final_ali.txt

Split final_ali.txt by file

splitAlignments.py

final_ali.txt contains the phone transcript for all files together. This can be split into unique files by running splitAlignments.py. You will need to modify the location of final_ali.txt in this script.
python splitAlignments.py

Create word alignments from phone endings

First we’ll need to use the [B I E S] suffixes on the phones in order to group phones together into word-level units.
Run phons2pron.py to complete this step. Note that I have utf-8 character encoding on this script. If necessary, this can be updated to reflect the character encoding that best matches your files.

Second, we’ll need to match the phone pronunciation to the corresponding lexical entry using lexicon.txt.
Run pron2words.py to complete this step.

Append header to each of the text files for Praat

Praat requires that a text file have a header. Once we append the header, then we can convert these text files into TextGrids. The following code requires a text file containing the header:
file_utt file id ali startinutt dur phone start_utt end_utt start end

It also requires a tmp directory for processing. I put this on my Desktop.
cd ~/Desktop
mkdir tmp

header="/Users/Eleanor/Desktop/header.txt"

direct the terminal to the directory with the newly split session files

ensure that the RegEx below will capture only the session files

otherwise change this or move the other .txt files to a different folder

cd mycorpus/forcedalignment
for i in *.txt;
do
cat “ $h e a d e r " "$ i” > /Users/Eleanor/Desktop/tmp/xx. $m v / U s e r s / E l e a n o r / D e s k t o p / t m p / x x .$ “$i”
done

Make Praat TextGrids of phone alignments from .txt files

createtextgrid.praat will read in the new phone transcripts and corresponding audio files to create a TextGrid for that file. You will need to modify the locations of the phone transcripts and audio files.