语音识别建立词汇表和索引表

最新推荐文章于 2024-07-05 18:31:54 发布

邹一一的大学求生

最新推荐文章于 2024-07-05 18:31:54 发布

阅读量461

点赞数 11

文章标签：语音识别人工智能 pytorch python 深度学习数据分析

本文链接：https://blog.csdn.net/qq_65061420/article/details/134946060

版权

语音识别建立词汇表和索引表

语音识别项目中，我看到的项目是需要建立这样两张表，一张是词汇表或者字符表，依你想要处理怎么样的数据而定，我这里介绍词汇表如何建立。

建立词汇表

在下载好的AIshell-1数据中，aishell_transcript_v0.8.txt文件，长这个样子。
分别是对应的标签和语音中的内容，这是已经分好词了的，如果没分好可以利用jieba进行分词处理，我们的目标就是把后面的词不重复的提取出来。
在这里插入图片描述

在使用CTCLOSS时，我们需要在0索引的位置设置为’_‘，在27索引的位置设置为’ '，是为了识别时去除重复的以及为空的位置。这是在AIshell-1数据集上的处理。

input_file_path = "./data/data_aishell/transcript/aishell_transcript_v0.8.txt"
output_file_path = "./data/data_aishell/transcript/transcript.txt"


# 用于存储提取的词汇
unique_words = set()

# 打开输入文件，逐行读取
with open(input_file_path, 'r', encoding='utf-8') as input_file:
    for line in input_file:
        # 分割每行，取所有的词
        words = line.strip().split()[1:]
        unique_words.update(words)

# unique_words.update('_')
# unique_words.update(' ')
unique_words = sorted(unique_words)
unique_words = list(unique_words)

unique_words.insert(0, '_')
unique_words.insert(27, ' ')

# 将唯一的词汇写入输出文件
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for word in unique_words:
        output_file.write(word + '\n')

print("提取并写入完成")

建立路径与语句的对应

这里只对dev进行了处理，你想要对train数据集进行处理的话，改下路径就好。

我们的目的是建立一个包含文件路径，和后面包含说话内容的.txt文本，为了我们之后方便读取数据，处理后大致上是这样的。
在这里插入图片描述

import os
import glob

# 指定--文件夹和--idxtxt文件路径
dev_folder_path = "./data/data_aishell/wav/dev/"
txt_file_path = "./data/data_aishell/transcript/aishell_transcript_v0.8.txt"
output_file_path = "./data/data_aishell/index/dev_index.txt"


dev_wav_files = glob.glob(os.path.join(dev_folder_path, '**', '*.wav'))
# dev_wav_files


# 读取txt文件中的内容并创建字典映射
txt_mapping = {}
with open(txt_file_path, 'r', encoding='utf-8') as txt_file:
    for line in txt_file:
        parts = line.strip().split(' ')
        if len(parts) > 1:
            file_name, chinese_text = parts[0], ' '.join(parts[1:])
            txt_mapping[file_name] = chinese_text
            # print(chinese_text)

# 创建inx.txt文件
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    for wav_file in dev_wav_files:
        file_name = os.path.basename(wav_file)

        if file_name.split('.')[0] in txt_mapping:
            # print(file_name)
            chinese_text = txt_mapping[file_name.split('.')[0]]
            print(chinese_text)
            output_file.write(f"{wav_file},{chinese_text}\n")
            # print(f"{wav_file},{chinese_text}\n")

print("idx.txt文件生成完成")