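Usage and purpose of the fix_data_dir.sh script:
The script helps ensure that the files in a data directory are correctly sorted and filtered, for example by removing utterances that have no corresponding features (when feats.scp is present).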
echo "Usage: utils/data/fix_data_dir.sh <data-dir>"
echo "e.g.: utils/data/fix_data_dir.sh data/train"
echo "This script helps ensure that the various files in a data directory"
echo "are correctly sorted and filtered, for example removing utterances"
echo "that have no features (if feats.scp is present)"
exit 1
The functions are called in the following order; the code of each one is explained in detail afterwards:
filter_recordings
filter_speakers
filter_utts
filter_speakers
filter_recordings
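Check which of the following files exist in the data directory (for example data/train). Each file that exists is copied to .backup and passed to check_sorted. For the format and details of each file, see the following blog post: (link)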
for x in utt2spk spk2utt feats.scp text segments wav.scp cmvn.scp vad.scp \
reco2file_and_channel spk2gender utt2lang utt2uniq utt2dur reco2dur utt2num_frames; do
if [ -f $data/$x ]; then
cp $data/$x $data/.backup/$x
check_sorted $data/$x
fi
done
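check_sorted: checks whether a file is sorted on its first field and contains no duplicate keys; if it is not, the file is sorted and de-duplicated in place.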
function check_sorted {
file=$1
sort -k1,1 -u <$file >$file.tmp
if ! cmp -s $file $file.tmp; then
echo "$0: file $1 is not in sorted order or not unique, sorting it"
mv $file.tmp $file
else
rm $file.tmp
fi
}
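To make the sort-and-compare idea concrete, here is a minimal sketch that runs the same steps on a throwaway file. The scratch directory and the sample utt2spk contents are made up for illustration, and LC_ALL=C is set here so that the ordering does not depend on the locale. Note that with -k1,1 -u only one line per first-field key survives.

```shell
# Sketch of the steps inside check_sorted, applied to a hypothetical sample file.
export LC_ALL=C                 # byte-order sorting, independent of the locale
demo_dir=$(mktemp -d)           # hypothetical scratch directory
file="$demo_dir/utt2spk"

# Unsorted input with a repeated key.
printf '%s\n' 'utt2 spkA' 'utt1 spkA' 'utt2 spkA' > "$file"

# Sort on the first field only and keep one line per key.
sort -k1,1 -u < "$file" > "$file.tmp"
if ! cmp -s "$file" "$file.tmp"; then
  echo "file $file is not in sorted order or not unique, sorting it"
  mv "$file.tmp" "$file"
else
  rm "$file.tmp"
fi

cat "$file"
```

After the run the file contains utt1 and utt2 once each, in sorted order, and the temporary file is gone.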
filter_recordings: called once before the stage that filters on utterance-id, and once more at the very end.
An example of the "segments" file:
s5# head -3 data/train/segments
sw02001-A_000098-001156 sw02001-A 0.98 11.56
sw02001-A_001980-002131 sw02001-A 19.8 21.31
sw02001-A_002736-002893 sw02001-A 27.36 28.93
function filter_recordings {
# We call this once before the stage when we filter on utterance-id, and once
# after.
if [ -f $data/segments ]; then
# We have a segments file -> we need to filter this and the file wav.scp, and
# reco2file_and_utt, if it exists, to make sure they have the same list of
# recording-ids.
# First check that wav.scp exists (reco2file_and_channel is optional and handled later).
if [ ! -f $data/wav.scp ]; then
echo "$0: $data/segments exists but not $data/wav.scp"
exit 1;
fi
# Take the second column of segments (the recording-ids), sort and de-duplicate it, and write it to recordings.
awk '{print $2}' < $data/segments | sort | uniq > $tmpdir/recordings
# Number of lines in recordings.
n1=$(cat $tmpdir/recordings | wc -l)
[ ! -s $tmpdir/recordings ] && \
echo "Empty list of recordings (bad file $data/segments)?" && exit 1;
# utils/filter_scp.pl a b: prints the lines of b whose first field appears
# in the first column of a.
# Two options: -f N matches on the N-th field of b instead (default 1);
# --exclude prints the lines whose id does NOT appear in a.
utils/filter_scp.pl $data/wav.scp $tmpdir/recordings > $tmpdir/recordings.tmp
mv $tmpdir/recordings.tmp $tmpdir/recordings
# recordings now holds the recording-ids present in both wav.scp and segments.
# Swap columns 1 and 2 so that the recording-id comes first.
cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments
# Using recordings, drop the segments lines whose recording-id is not in wav.scp.
filter_file $tmpdir/recordings $data/segments
# Swap columns 1 and 2 back.
cp $data/segments{,.tmp}; awk '{print $2, $1, $3, $4}' <$data/segments.tmp >$data/segments
rm $data/segments.tmp
# Using recordings, drop the wav.scp lines whose recording-id is not in segments.
filter_file $tmpdir/recordings $data/wav.scp
[ -f $data/reco2file_and_channel ] && filter_file $tmpdir/recordings $data/reco2file_and_channel
[ -f $data/reco2dur ] && filter_file $tmpdir/recordings $data/reco2dur
true
fi
}
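The column swap in filter_recordings exists because filter_file always matches on the first field, while the recording-id sits in the second column of segments. The sketch below reproduces the swap, filter, swap-back route on made-up data and compares it with filtering on the second field directly. keep_field is a hypothetical awk stand-in for utils/filter_scp.pl -f N, not the Kaldi script itself, and the file contents are invented.

```shell
export LC_ALL=C
d=$(mktemp -d)    # hypothetical scratch directory

printf '%s\n' 'recA_0001 recA 0.00 1.50' 'recB_0001 recB 0.00 2.00' > "$d/segments"
printf '%s\n' 'recA /audio/recA.wav' 'recC /audio/recC.wav' > "$d/wav.scp"

# keep_field LIST FILE N: print the lines of FILE whose N-th field is a
# first-column id of LIST (hypothetical stand-in for filter_scp.pl -f N).
keep_field() {
  awk -v list="$1" -v n="$3" '
    BEGIN {
      while ((getline line < list) > 0)
        if (split(line, a, " ") > 0) seen[a[1]] = 1
    }
    ($n in seen)' "$2"
}

# Recording-ids that occur both in segments (column 2) and wav.scp (column 1).
awk '{print $2}' "$d/segments" | sort -u > "$d/recordings.all"
keep_field "$d/wav.scp" "$d/recordings.all" 1 > "$d/recordings"

# Route taken by the script: swap columns, filter on column 1, swap back.
awk '{print $2, $1, $3, $4}' "$d/segments" > "$d/swapped"
keep_field "$d/recordings" "$d/swapped" 1 |
  awk '{print $2, $1, $3, $4}' > "$d/segments.roundtrip"

# Direct route: filter on column 2.
keep_field "$d/recordings" "$d/segments" 2 > "$d/segments.direct"

# wav.scp is filtered on its first column as usual.
keep_field "$d/recordings" "$d/wav.scp" 1 > "$d/wav.scp.filtered"

cat "$d/segments.roundtrip" "$d/wav.scp.filtered"
```

Both routes keep only the recA segment, and wav.scp loses recC because no segment refers to it.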
filter_file:
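It takes two input files, filter and file_to_filter. It removes from file_to_filter every line whose id (first field) does not appear in filter, and prints the line counts before and after when they differ.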
function filter_file {
filter=$1
file_to_filter=$2
cp $file_to_filter ${file_to_filter}.tmp
utils/filter_scp.pl $filter ${file_to_filter}.tmp > $file_to_filter
if ! cmp ${file_to_filter}.tmp $file_to_filter >&/dev/null; then
length1=$(cat ${file_to_filter}.tmp | wc -l)
length2=$(cat ${file_to_filter} | wc -l)
if [ $length1 -ne $length2 ]; then
echo "$0: filtered $file_to_filter from $length1 to $length2 lines based on filter $filter."
fi
fi
rm $file_to_filter.tmp
}
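For readers without a Kaldi checkout, the filtering semantics can be tried with a small stand-in. The sketch below assumes that matching is done on the first whitespace-separated field and mimics the --exclude behaviour described earlier; filter_ids, the scratch directory and the sample text file are all hypothetical.

```shell
export LC_ALL=C
d=$(mktemp -d)    # hypothetical scratch directory

# filter_ids [--exclude] LIST FILE (hypothetical stand-in for filter_scp.pl).
# Prints the lines of FILE whose first field is a first-column id of LIST;
# with --exclude it prints the remaining lines instead.
filter_ids() {
  exclude=0
  if [ "$1" = "--exclude" ]; then exclude=1; shift; fi
  awk -v list="$1" -v exclude="$exclude" '
    BEGIN {
      while ((getline line < list) > 0)
        if (split(line, a, " ") > 0) seen[a[1]] = 1
    }
    {
      hit = ($1 in seen)
      if ((exclude + 0) ? !hit : hit) print
    }' "$2"
}

printf '%s\n' 'utt1 hello world' 'utt2 good morning' 'utt3 thank you' > "$d/text"
printf '%s\n' 'utt1' 'utt3' > "$d/utts"

filter_ids "$d/utts" "$d/text" > "$d/text.kept"
filter_ids --exclude "$d/utts" "$d/text" > "$d/text.dropped"

# Report the change in line count, as filter_file does.
before=$(( $(wc -l < "$d/text") ))
after=$(( $(wc -l < "$d/text.kept") ))
if [ "$before" -ne "$after" ]; then
  echo "filtered text from $before to $after lines"
fi
```

Here text.kept holds the utt1 and utt3 lines and text.dropped holds the utt2 line.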
filter_speakers:
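Throughout the script, utt2spk is treated as primary and spk2utt as derived, so spk2utt is generated from utt2spk with utt2spk_to_spk2utt.pl.
The function makes the speaker sets of utt2spk, spk2utt, cmvn.scp and spk2gender consistent, so that every remaining speaker appears in all of them.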
function filter_speakers {
# throughout this program, we regard utt2spk as primary and spk2utt as derived, so...
utils/utt2spk_to_spk2utt.pl $data/utt2spk > $data/spk2utt
# Drop from the speaker list every speaker that is missing from cmvn.scp or spk2gender.
cat $data/spk2utt | awk '{print $1}' > $tmpdir/speakers
for s in cmvn.scp spk2gender; do
f=$data/$s
if [ -f $f ]; then
filter_file $f $tmpdir/speakers
fi
done
filter_file $tmpdir/speakers $data/spk2utt
utils/spk2utt_to_utt2spk.pl $data/spk2utt > $data/utt2spk
for s in cmvn.scp spk2gender $spk_extra_files; do
f=$data/$s
if [ -f $f ]; then
filter_file $tmpdir/speakers $f
fi
done
}
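The flow of filter_speakers can be followed on a tiny example. This is only a sketch under stated assumptions: the two awk one-liners stand in for utt2spk_to_spk2utt.pl and spk2utt_to_utt2spk.pl, join stands in for filter_file (it needs both inputs sorted on the key, which holds here under LC_ALL=C), and the data is invented.

```shell
export LC_ALL=C
d=$(mktemp -d)    # hypothetical scratch directory

printf '%s\n' 'spkA-utt1 spkA' 'spkA-utt2 spkA' 'spkB-utt1 spkB' > "$d/utt2spk"
printf '%s\n' 'spkA f' > "$d/spk2gender"      # spkB has no gender entry

# utt2spk -> spk2utt: one line per speaker, followed by that speaker's utterances.
awk '{
       if ($2 in utts) utts[$2] = utts[$2] " " $1
       else utts[$2] = $1
     }
     END { for (s in utts) print s, utts[s] }' "$d/utt2spk" |
  sort -k1,1 > "$d/spk2utt"

# Speakers that also have an entry in spk2gender.
awk '{print $1}' "$d/spk2utt" | join - "$d/spk2gender" |
  awk '{print $1}' > "$d/speakers"

# Keep only those speakers in spk2utt, then regenerate utt2spk from it.
join "$d/speakers" "$d/spk2utt" > "$d/spk2utt.tmp"
mv "$d/spk2utt.tmp" "$d/spk2utt"
awk '{ for (i = 2; i <= NF; i++) print $i, $1 }' "$d/spk2utt" |
  sort -k1,1 > "$d/utt2spk"

cat "$d/utt2spk"
```

Because spkB is missing from spk2gender, it drops out of the speaker list, and the regenerated utt2spk contains only the two spkA utterances.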
filter_utts:
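It collects the utterance-ids that every file has, then deletes from each file the lines whose utterance-id is not in that common set.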
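The intersection step can be sketched as follows. This is an illustration, not the Kaldi code: comm -12 stands in for utils/filter_scp.pl (it requires sorted inputs, which holds here under LC_ALL=C), only three per-utterance files are used, and all file contents are invented. The awk filter on utt2dur mirrors the rule in the function that only entries with exactly two fields and a positive value count.

```shell
export LC_ALL=C
d=$(mktemp -d)    # hypothetical scratch directory

printf '%s\n' 'spkA-utt1 spkA' 'spkA-utt2 spkA' 'spkA-utt3 spkA' > "$d/utt2spk"
printf '%s\n' 'spkA-utt1 feats.ark:10' 'spkA-utt2 feats.ark:250' > "$d/feats.scp"
printf '%s\n' 'spkA-utt1 hello' 'spkA-utt2 world' 'spkA-utt3 again' > "$d/text"
printf '%s\n' 'spkA-utt1 1.25' 'spkA-utt2 0' 'spkA-utt3 2.50' > "$d/utt2dur"

# Only well-formed entries with a positive duration are allowed to vouch
# for an utterance.
awk 'NF == 2 && $2 > 0' "$d/utt2dur" > "$d/utt2dur.ok"

# Start from the ids in utt2spk and intersect with each file that exists.
awk '{print $1}' "$d/utt2spk" > "$d/utts"
for x in feats.scp text utt2dur.ok; do
  if [ -f "$d/$x" ]; then
    awk '{print $1}' "$d/$x" | comm -12 "$d/utts" - > "$d/utts.tmp"
    mv "$d/utts.tmp" "$d/utts"
  fi
done
rm -f "$d/utt2dur.ok"

cat "$d/utts"
```

Only spkA-utt1 survives: spkA-utt3 has no features and spkA-utt2 has a zero duration.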
function filter_utts {
cat $data/utt2spk | awk '{print $1}' > $tmpdir/utts
# Check that utt2spk is sorted.
! cat $data/utt2spk | sort | cmp - $data/utt2spk && \
echo "utt2spk is not in sorted order (fix this yourself)" && exit 1;
# Check that utt2spk keeps the same order when sorted on its second column (speaker-id).
! cat $data/utt2spk | sort -k2 | cmp - $data/utt2spk && \
echo "utt2spk is not in sorted order when sorted first on speaker-id " && \
echo "(fix this by making speaker-ids prefixes of utt-ids)" && exit 1;
# Check that spk2utt is sorted.
! cat $data/spk2utt | sort | cmp - $data/spk2utt && \
echo "spk2utt is not in sorted order (fix this yourself)" && exit 1;
if [ -f $data/utt2uniq ]; then
! cat $data/utt2uniq | sort | cmp - $data/utt2uniq && \
echo "utt2uniq is not in sorted order (fix this yourself)" && exit 1;
fi
maybe_wav=
maybe_reco2dur=
[ ! -f $data/segments ] && maybe_wav=wav.scp # wav indexed by utts only if segments does not exist.
[ -s $data/reco2dur ] && [ ! -f $data/segments ] && maybe_reco2dur=reco2dur # reco2dur indexed by utts
maybe_utt2dur=
if [ -f $data/utt2dur ]; then
cat $data/utt2dur | \
awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2dur.ok || exit 1
maybe_utt2dur=utt2dur.ok
fi
maybe_utt2num_frames=
if [ -f $data/utt2num_frames ]; then
cat $data/utt2num_frames | \
awk '{ if (NF == 2 && $2 > 0) { print }}' > $data/utt2num_frames.ok || exit 1
maybe_utt2num_frames=utt2num_frames.ok
fi
# Keep only the utterance-ids shared by feats.scp, text, segments, utt2lang, $maybe_wav, $maybe_utt2dur and $maybe_utt2num_frames.
for x in feats.scp text segments utt2lang $maybe_wav $maybe_utt2dur $maybe_utt2num_frames; do
if [ -f $data/$x ]; then
utils/filter_scp.pl $data/$x $tmpdir/utts > $tmpdir/utts.tmp
mv $tmpdir/utts.tmp $tmpdir/utts
fi
done
rm $data/utt2dur.ok 2>/dev/null || true
rm $data/utt2num_frames.ok 2>/dev/null || true
[ ! -s $tmpdir/utts ] && echo "fix_data_dir.sh: no utterances remained: not proceeding further." && \
rm $tmpdir/utts && exit 1;
if [ -f $data/utt2spk ]; then
new_nutts=$(cat $tmpdir/utts | wc -l)
old_nutts=$(cat $data/utt2spk | wc -l)
if [ $new_nutts -ne $old_nutts ]; then
echo "fix_data_dir.sh: kept $new_nutts utterances out of $old_nutts"
else
echo "fix_data_dir.sh: kept all $old_nutts utterances."
fi
fi
# Delete from each file the lines whose utterance-id is not in the common set.
for x in utt2spk utt2uniq feats.scp vad.scp text segments utt2lang utt2dur utt2num_frames $maybe_wav $maybe_reco2dur $utt_extra_files; do
if [ -f $data/$x ]; then
cp $data/$x $data/.backup/$x
if ! cmp -s $data/$x <( utils/filter_scp.pl $tmpdir/utts $data/$x ) ; then
utils/filter_scp.pl $tmpdir/utts $data/.backup/$x > $data/$x
fi
fi
done
}
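The second ordering check in filter_utts (sort -k2) is the least obvious one, so here is a small sketch of why it asks for speaker-ids to be prefixes of utterance-ids. check_utt2spk_order is a hypothetical helper that repeats the two comparisons from the function, and both sample files are invented.

```shell
export LC_ALL=C
d=$(mktemp -d)    # hypothetical scratch directory

# Repeats the two utt2spk ordering checks on the file given as $1.
check_utt2spk_order() {
  sort "$1" | cmp -s - "$1" ||
    { echo "not sorted on utterance-id"; return 1; }
  sort -k2 "$1" | cmp -s - "$1" ||
    { echo "not sorted when sorted on speaker-id"; return 1; }
  echo "ok"
}

# Utterance-ids without a speaker prefix: sorted by utterance, but the
# speaker column is out of order.
printf '%s\n' 'utt1 spkB' 'utt2 spkA' > "$d/utt2spk.bad"

# Speaker-id as prefix of the utterance-id: both orderings agree.
printf '%s\n' 'spkA-utt2 spkA' 'spkB-utt1 spkB' > "$d/utt2spk.good"

bad_msg=$(check_utt2spk_order "$d/utt2spk.bad") || true
good_msg=$(check_utt2spk_order "$d/utt2spk.good")
echo "bad:  $bad_msg"
echo "good: $good_msg"
```

When the speaker-id is a prefix of the utterance-id, sorting by utterance automatically groups and orders the lines by speaker, which is why the error message suggests that naming scheme.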