NVIDIA 7th SkyHackathon (3): Building the Speech Dataset

Alex_McAvoy

Last modified: 2022-11-30 13:57:45


First published: 2022-11-24 11:39:07

Article link: https://blog.csdn.net/u011815404/article/details/128015240


1. Data collection

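To build the dataset, 65 people were recorded individually. Each person recorded one mono WAV file at a 44100 Hz sampling rate containing the 15 sentences listed below. Every recording was then segmented into single sentences and cleaned, and the resulting clips were stored by class in 15 folders. All sentences have the form "please detect ..." followed by some combination of 果皮 (fruit peel), 瓶子 (bottle), and 纸箱 (cardboard box). They are kept in Chinese here because they serve as both the folder names and the transcript labels. The data breaks down as follows: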

No.	Sentence	Clips
1	请检测出果皮	62
2	请检测出瓶子	61
3	请检测出纸箱	61
4	请检测出果皮和瓶子	60
5	请检测出果皮和纸箱	61
6	请检测出瓶子和果皮	61
7	请检测出瓶子和纸箱	59
8	请检测出纸箱和果皮	61
9	请检测出纸箱和瓶子	60
10	请检测出果皮、瓶子和纸箱	60
11	请检测出果皮、纸箱和瓶子	61
12	请检测出瓶子、果皮和纸箱	60
13	请检测出瓶子、纸箱和果皮	60
14	请检测出纸箱、果皮和瓶子	61
15	请检测出纸箱、瓶子和果皮	60
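As a sanity check, the per-class counts in the table can be recomputed by counting the WAV files in each class folder. The sketch below assumes the layout used later in this post (one sub-folder per sentence under raw_data/); the function name count_clips is hypothetical.

```python
import os


def count_clips(root):
    """Return {class_folder_name: number_of_wav_files} for each sub-folder of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        folder = os.path.join(root, name)
        if not os.path.isdir(folder):
            continue  # skip stray files lying next to the class folders
        counts[name] = sum(
            1 for f in os.listdir(folder) if f.lower().endswith(".wav")
        )
    return counts


if __name__ == "__main__":
    root = "raw_data"  # assumed location of the cleaned clips
    if os.path.isdir(root):
        counts = count_clips(root)
        for sentence, n in counts.items():
            print(f"{sentence}\t{n}")
        print(f"total\t{sum(counts.values())}")
```

The counts in the table add up to 908 clips, fewer than the 65 × 15 = 975 recorded sentences, presumably because some clips were discarded during cleaning.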

2. Manifest requirements

Each entry in the data manifest must follow the format of this example:

{
    "audio_filepath": "/root/traindata/hi1.wav",
    "duration": 3.1463038548752835,
    "text": "你好请让我进入小区"
}
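To make the format concrete, the sketch below checks that one manifest line carries the three fields shown above with the expected types. The helper name validate_entry is hypothetical, and the extra rules (positive duration, non-empty text) are assumptions, not an official NVIDIA specification.

```python
import json

# Field name -> accepted Python type(s) after JSON decoding
REQUIRED_FIELDS = {"audio_filepath": str, "duration": (int, float), "text": str}


def validate_entry(line):
    """Parse one manifest line and return the entry, or raise ValueError."""
    entry = json.loads(line)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(entry, dict):
        raise ValueError("a manifest line must be a JSON object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in entry:
            raise ValueError(f"missing field: {key}")
        value = entry[key]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {key}")
    if entry["duration"] <= 0:
        raise ValueError("duration must be positive")
    if not entry["text"]:
        raise ValueError("text must not be empty")
    return entry


sample = json.dumps(
    {
        "audio_filepath": "/root/traindata/hi1.wav",
        "duration": 3.1463038548752835,
        "text": "你好请让我进入小区",
    }
)
print(validate_entry(sample)["duration"])
```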

NVIDIA officially recommends using the librosa audio toolkit to obtain the duration of each audio file:

import librosa

# Note: in librosa >= 0.10 the `filename` keyword is deprecated in favor of `path`
duration = librosa.get_duration(filename="raw_data/请检测出果皮/10.wav")
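For uncompressed PCM WAV files like the ones recorded here, the duration is simply the number of frames divided by the sampling rate, so it can also be computed with Python's standard wave module. This is an alternative sketch, not the method NVIDIA recommends; wav_duration is a hypothetical helper, and the demo file is synthetic silence rather than a clip from the dataset.

```python
import os
import tempfile
import wave


def wav_duration(path):
    """Return the duration in seconds of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Demo: write one second of 16-bit mono silence at 44100 Hz, then measure it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 2 bytes per sample = 16-bit
    wav.setframerate(44100)   # same sampling rate as the recordings
    wav.writeframes(b"\x00\x00" * 44100)

print(wav_duration(demo_path))  # 1.0
```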

3. Building the manifest

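The cleaned data is divided by a random hold-out split: each clip goes to the training set with probability 0.8 and to the validation set otherwise. Two manifests are then written, train.json and val.json, each containing one JSON object per line: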

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time: 2022/11/12 16:25
# @Author: FangXin

import librosa
import os
import json
import random

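raw_root = 'raw_data/'
save_root = 'data/'

# Make sure the output folders exist before writing into them
os.makedirs(save_root + 'train/', exist_ok=True)
os.makedirs(save_root + 'val/', exist_ok=True)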

sentences = os.listdir(raw_root)
print(sentences)

train_cnt = 0
train = []

val_cnt = 0
val = []

for s in sentences:
    path = raw_root + s + '/'
    files = os.listdir(path)
    for f in files:
        file_path = os.path.join(path, f)
        if not os.path.isfile(file_path):
            continue
        
        # Duration of this clip in seconds
        duration = librosa.get_duration(filename=file_path)

        # The folder name is the transcript of every clip inside it
        dic = {"audio_filepath": file_path, "duration": duration, "text": s}
        
        # Random hold-out split: about 80% train, 20% validation
        stem = f.split('.')[0]
        if random.random() < 0.8:
            with open(f"{save_root}train/{stem}_{s}.json", "w") as out_file:
                json.dump(dic, out_file)
            train_cnt += 1
            train.append(dic)
        else:
            with open(f"{save_root}val/{stem}_{s}.json", "w") as out_file:
                json.dump(dic, out_file)
            val_cnt += 1
            val.append(dic)

print(f"Number of samples in train: {train_cnt}")
print(f"Number of samples in val: {val_cnt}")


# Write train.json, one JSON object per line
with open(save_root+'train.json', 'w') as json_file:
    for each_dict in train:
        json_file.write(json.dumps(each_dict) + '\n')

# Write val.json, one JSON object per line
with open(save_root+'val.json', 'w') as json_file:
    for each_dict in val:
        json_file.write(json.dumps(each_dict) + '\n')
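
Because the script assigns every clip independently with probability 0.8, the split sizes change from run to run, and a class can end up under-represented in val.json. One possible alternative, sketched below, shuffles each class with a fixed seed and takes a fixed fraction for training. The function name stratified_split, the seed, and the demo file names are illustrative assumptions, not part of the original script.

```python
import random


def stratified_split(items_by_class, train_ratio=0.8, seed=0):
    """Split {class_name: [items]} into (train, val) lists, class by class."""
    rng = random.Random(seed)  # local generator so the split is reproducible
    train, val = [], []
    for name in sorted(items_by_class):
        items = list(items_by_class[name])
        rng.shuffle(items)
        cut = int(len(items) * train_ratio)  # number of training items in this class
        train.extend((name, item) for item in items[:cut])
        val.extend((name, item) for item in items[cut:])
    return train, val


# Demo with made-up file names: 10 clips in class A, 5 clips in class B
demo = {
    "A": [f"a{i}.wav" for i in range(10)],
    "B": [f"b{i}.wav" for i in range(5)],
}
train, val = stratified_split(demo)
print(len(train), len(val))  # 12 3
```

Each (class, file) pair returned here could then be turned into a manifest entry in the same way as in the script above.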