1. First, obtain the transcript for each audio file, in the following format:
2. Split each sentence into individual characters and count them:
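def count_manifest(counter, manifest_path):
    # Count every character in the manifest, one line at a time
    with open(manifest_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Drop the newline so it is not counted as a character
            for char in line.replace('\n', ''):
                counter.update(char)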
3. Sort the characters by frequency, collect them in a list, and write the list to a JSON file:
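import codecs
import json
from collections import Counter

# args: parsed argparse options (manifest_path, vocab_path, count_threshold)
counter = Counter()
count_manifest(counter, args.manifest_path)
count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
with codecs.open(args.vocab_path, 'w', 'utf-8') as fout:
    labels = ['?']  # index 0 is the blank placeholder
    for char, count in count_sorted:
        # Sorted by count, so all remaining characters are rarer too
        if count < args.count_threshold:
            break
        labels.append(char)
    json.dump(labels, fout)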
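Steps 2 and 3 can be tried end to end without a manifest file or command-line arguments. The following is a minimal sketch, assuming the transcripts are already available as a list of strings. `build_vocab`, its parameters, and the sample lines are hypothetical names used only for illustration; they are not part of the original script.

```python
from collections import Counter


def build_vocab(lines, count_threshold=1, blank='?'):
    """Return a label list: blank placeholder first, then characters
    ordered by descending frequency (hypothetical helper)."""
    counter = Counter()
    for line in lines:
        # Passing a string to update() counts each of its characters
        counter.update(line.replace('\n', ''))
    labels = [blank]
    for char, count in counter.most_common():
        # most_common() is sorted, so we can stop at the first rare character
        if count < count_threshold:
            break
        labels.append(char)
    return labels


# Made-up sample transcripts: 'a' occurs 3 times, 'b' twice, 'c' once
sample_lines = ['abca\n', 'ab\n']
vocab = build_vocab(sample_lines, count_threshold=2)
print(vocab)
```

With a threshold of 2, the character that occurs only once is dropped and the placeholder stays at index 0.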
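4. Note: the first entry in the label list should be the blank symbol (the '?' placeholder at index 0 above), so that ctcdecode can recognize it.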
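To see why the blank needs a fixed position in the label list, it may help to look at how a CTC decoder treats it. The sketch below is a simplified greedy decoder, not the beam search that ctcdecode performs. It assumes the blank sits at index 0, and `greedy_ctc_decode`, `frame_ids`, and the three-entry label list are hypothetical names chosen for this illustration.

```python
def greedy_ctc_decode(frame_ids, labels, blank_id=0):
    """Collapse consecutive repeats, then drop the blank index."""
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the index changes and is not the blank
        if idx != prev and idx != blank_id:
            chars.append(labels[idx])
        prev = idx
    return ''.join(chars)


labels = ['?', 'a', 'b']
# Made-up per-frame best indices (e.g. the argmax of the model output per time step)
frame_ids = [1, 1, 0, 1, 2, 2, 0]
decoded = greedy_ctc_decode(frame_ids, labels)
print(decoded)
```

The blank between the two runs of index 1 is what lets the decoder output two separate 'a' characters instead of merging them. If a real character occupied index 0, it would be silently removed from every decoded result.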