Introduction to Deep Learning with PyTorch (7): Speech Recognition with Torchaudio

Speech Recognition

Speech recognition is a technology that enables a machine to convert speech signals into the corresponding text or commands through a process of recognition and understanding. It draws on signal processing, pattern recognition, probability and information theory, the mechanics of speech production and hearing, artificial intelligence, and several other fields. Over the past two decades, speech recognition has made remarkable progress, moving from the laboratory to the market, and it is expected to spread into industry, home appliances, telecommunications, automotive electronics, healthcare, home services, consumer electronics, and many other areas over the next decade.

This tutorial shows how to correctly format an audio dataset and then train and test an audio classifier network on that dataset.

First, we import the common torch packages, including torchaudio, which can be installed by following the instructions on the website.

# Uncomment the line corresponding to your "runtime type" to run in Google Colab

# CPU:
# !pip install pydub torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# GPU:
# !pip install pydub torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio
import sys

import matplotlib.pyplot as plt
import IPython.display as ipd

from tqdm import tqdm

Let's check whether a CUDA GPU is available and select our device. Running the network on a GPU will greatly reduce the training/testing runtime.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

Output

cuda

Importing the Dataset

We use torchaudio to download and represent the dataset. Here we use SpeechCommands, a dataset of 35 commands spoken by different people. The SPEECHCOMMANDS dataset is a torch.utils.data.Dataset version of this data. In this dataset, all audio files are about 1 second long (and therefore roughly 16000 time frames long).

The actual loading and formatting steps happen when a data point is accessed, and torchaudio takes care of converting the audio file into a tensor. To load an audio file directly, you can use torchaudio.load(). It returns a tuple containing the newly created tensor along with the sampling frequency of the audio file (16 kHz for SpeechCommands).
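As a minimal sketch, loading one file directly looks like this (the path below is hypothetical; point it at any .wav file from the extracted dataset):

# Hypothetical path into the extracted SpeechCommands folder
waveform, sample_rate = torchaudio.load("./SpeechCommands/speech_commands_v0.02/yes/0a7c2a8d_nohash_0.wav")
print(waveform.shape)   # a 1-second mono clip gives roughly torch.Size([1, 16000])
print(sample_rate)      # 16000 for SpeechCommands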

Going back to the dataset, here we create a subclass that splits it into the standard training, validation, and testing subsets.

from torchaudio.datasets import SPEECHCOMMANDS
import os


class SubsetSC(SPEECHCOMMANDS):
    def __init__(self, subset: str = None):
        super().__init__("./", download=True)

        def load_list(filename):
            filepath = os.path.join(self._path, filename)
            with open(filepath) as fileobj:
                return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]

        if subset == "validation":
            self._walker = load_list("validation_list.txt")
        elif subset == "testing":
            self._walker = load_list("testing_list.txt")
        elif subset == "training":
            excludes = load_list("validation_list.txt") + load_list("testing_list.txt")
            excludes = set(excludes)
            self._walker = [w for w in self._walker if w not in excludes]


# Create training and testing split of the data. We do not use validation in this tutorial.
train_set = SubsetSC("training")
test_set = SubsetSC("testing")

waveform, sample_rate, label, speaker_id, utterance_number = train_set[0]

Output

  0%|          | 0.00/2.26G [00:00<?, ?B/s]
...
100%|##########| 2.26G/2.26G [00:13<00:00, 179MB/s]

A data point in the SPEECHCOMMANDS dataset is a tuple consisting of the waveform (the audio signal), the sample rate, the utterance (label), the ID of the speaker, and the number of the utterance.

print("Shape of waveform: {}".format(waveform.size()))
print("Sample rate of waveform: {}".
As a complementary end-to-end example, below is audio classification implemented with PyTorch, using librosa MFCC features instead of raw waveforms:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import librosa
import numpy as np


# Audio dataset: loads files lazily and extracts MFCC features
class AudioDataset(Dataset):
    def __init__(self, file_list, label_list):
        self.file_list = file_list
        self.label_list = label_list

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        # Load the audio file and extract its MFCC features
        audio_file, label = self.file_list[idx], self.label_list[idx]
        y, sr = librosa.load(audio_file)
        mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
        # Zero-pad along the time axis to a fixed width of 260 frames
        mfccs = np.pad(mfccs, ((0, 0), (0, 260 - mfccs.shape[1])), mode='constant')
        mfccs = torch.from_numpy(mfccs)
        return mfccs.float(), label


# Audio classification model: three conv blocks plus two fully connected layers
class AudioClassifier(nn.Module):
    def __init__(self):
        super(AudioClassifier, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=(3, 3), padding=(1, 1))
        self.pool1 = nn.MaxPool2d(kernel_size=(2, 2))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(3, 3), padding=(1, 1))
        self.pool2 = nn.MaxPool2d(kernel_size=(2, 2))
        self.conv3 = nn.Conv2d(64, 128, kernel_size=(3, 3), padding=(1, 1))
        self.pool3 = nn.MaxPool2d(kernel_size=(2, 2))
        self.fc1 = nn.Linear(128 * 10 * 16, 512)
        self.fc2 = nn.Linear(512, 10)

    def forward(self, x):
        x = x.unsqueeze(1)  # add a channel dimension: (N, 1, 40, 260)
        x = self.conv1(x)
        x = nn.functional.relu(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = nn.functional.relu(x)
        x = self.pool2(x)
        x = self.conv3(x)
        x = nn.functional.relu(x)
        x = self.pool3(x)
        x = x.view(-1, 128 * 10 * 16)  # flatten the pooled feature maps
        x = nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x


# Train for one epoch, returning mean loss and accuracy
def train(model, train_loader, criterion, optimizer, device):
    model.train()
    train_loss = 0.0
    train_acc = 0.0
    for i, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * data.size(0)
        pred = output.argmax(dim=1, keepdim=True)
        train_acc += pred.eq(target.view_as(pred)).sum().item()
    train_loss /= len(train_loader.dataset)
    train_acc /= len(train_loader.dataset)
    return train_loss, train_acc


# Evaluate on the test set, returning mean loss and accuracy
def test(model, test_loader, criterion, device):
    model.eval()
    test_loss = 0.0
    test_acc = 0.0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += criterion(output, target).item() * data.size(0)
            pred = output.argmax(dim=1, keepdim=True)
            test_acc += pred.eq(target.view_as(pred)).sum().item()
    test_loss /= len(test_loader.dataset)
    test_acc /= len(test_loader.dataset)
    return test_loss, test_acc


# Main entry point
if __name__ == '__main__':
    # Load the audio dataset
    train_files, train_labels = [], []
    test_files, test_labels = [], []
    # TODO: fill in the audio file paths and labels for the training and test sets
    train_dataset = AudioDataset(train_files, train_labels)
    test_dataset = AudioDataset(test_files, test_labels)
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

    # Select the device
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Model, loss function, and optimizer
    model = AudioClassifier().to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    for epoch in range(10):
        train_loss, train_acc = train(model, train_loader, criterion, optimizer, device)
        test_loss, test_acc = test(model, test_loader, criterion, device)
        print('Epoch: {} Train Loss: {:.6f} Train Acc: {:.6f} Test Loss: {:.6f} Test Acc: {:.6f}'.format(
            epoch + 1, train_loss, train_acc, test_loss, test_acc))

In the code above, we define an AudioDataset class that loads the audio dataset, using the librosa library to extract MFCC features from each audio file. We also define an AudioClassifier class implementing the classification model, which contains three convolutional layers and two fully connected layers. In the main function, we load the training and test sets with DataLoader and train the model with the Adam optimizer. At the end of each epoch, we print the loss and accuracy on both the training and test sets.
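As a minimal sketch of filling in that TODO (assuming the SpeechCommands archive downloaded earlier has been extracted under ./SpeechCommands/speech_commands_v0.02/, and picking ten hypothetical commands to match the model's 10 output classes), the file lists could be built like this:

import os
import random

# Hypothetical subset of ten commands; adjust to the classes you care about
commands = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]
data_root = "./SpeechCommands/speech_commands_v0.02"  # assumed extraction path

files, labels = [], []
for idx, command in enumerate(commands):
    folder = os.path.join(data_root, command)
    for name in sorted(os.listdir(folder)):
        if name.endswith(".wav"):
            files.append(os.path.join(folder, name))
            labels.append(idx)

# Shuffle before splitting so every class appears in both subsets
random.seed(0)
paired = list(zip(files, labels))
random.shuffle(paired)
split = int(0.9 * len(paired))
train_files, train_labels = zip(*paired[:split])
test_files, test_labels = zip(*paired[split:])

Note that the official validation_list.txt/testing_list.txt files used by the SubsetSC class earlier in this tutorial are the more principled way to split this particular dataset; the random split here is just a placeholder.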