Acoustic Scene Recognition Using Late Fusion in MATLAB, Part 1: Loading the Data Set

This post is a translation of, and commentary on, the introduction and data set loading sections of the MATLAB documentation example Acoustic Scene Recognition Using Late Fusion; refer to the original page alongside this one.

This example shows how to create a multi-model late fusion system for acoustic scene recognition. The example trains a convolutional neural network (CNN) using mel spectrograms and an ensemble classifier using wavelet scattering. The example uses the TUT dataset for training and evaluation [1].


Introduction

Acoustic scene classification (ASC) is the task of classifying environments from the sounds they produce. ASC is a generic classification problem that is foundational for context awareness in devices, robots, and many other applications [1]. Early attempts at ASC used mel-frequency cepstral coefficients (mfcc) and Gaussian mixture models (GMMs) to describe their statistical distribution. Other popular features used for ASC include zero crossing rate, spectral centroid (spectralCentroid), spectral rolloff (spectralRolloffPoint), spectral flux (spectralFlux), and linear prediction coefficients (lpc) [5]. Hidden Markov models (HMMs) were trained to describe the temporal evolution of the GMMs. More recently, the best performing systems have used deep learning, usually CNNs, and a fusion of multiple models. The most popular feature for top-ranked systems in the DCASE 2017 contest was the mel spectrogram (melSpectrogram). The top-ranked systems in the challenge used late fusion and data augmentation to help their systems generalize.

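To make the mel spectrogram feature concrete, here is a minimal sketch of extracting it for a single recording with Audio Toolbox; the file name is a placeholder, not a file from this data set:

[audioIn,fs] = audioread('example.wav'); % any mono or stereo recording (placeholder name)
S = melSpectrogram(audioIn,fs);          % mel spectrogram, one page per channel
melSpectrogram(audioIn,fs)               % with no output argument, plots the result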

To illustrate a straightforward approach that produces reasonable results, this example trains a CNN using mel spectrograms and an ensemble classifier using wavelet scattering. The CNN and the ensemble classifier achieve roughly equivalent overall accuracy, but each is better at distinguishing certain acoustic scenes. To increase overall accuracy, you can merge the CNN and ensemble classifier results using late fusion.
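Late fusion here means combining the two models' class posteriors after each model has made its prediction. A minimal sketch follows; probsCNN and probsEnsemble are assumed names for N-by-C probability matrices over the same N test files and C scene classes, and are not defined in this section:

classes = categories(adsTest.Labels);       % scene classes, in a fixed order
probsFused = (probsCNN + probsEnsemble)/2;  % average the posteriors (late fusion)
[~,idx] = max(probsFused,[],2);             % most likely class per file
predicted = categorical(classes(idx),classes);
accuracy = mean(predicted == adsTest.Labels)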

Load Acoustic Scene Recognition Data Set

To run the example, you must first download the data set (TUT Acoustic scenes 2017, Development dataset and TUT Acoustic scenes 2017, Evaluation dataset). The full data set is approximately 15.5 GB. Depending on your network connection, the download can take about 6.5 hours. (In my experience, downloading from within MATLAB is slow and prone to dropped connections; I recommend downloading the data yourself with a download manager such as Xunlei (Thunder) and then placing it in the folder the code expects.)

downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,'TUT-acoustic-scenes-2017');

if ~exist(datasetFolder,'dir')
    disp('Downloading TUT-acoustic-scenes-2017 (15.5 GB)...')
    HelperDownload_TUT_acoustic_scenes_2017(datasetFolder);
end
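If you prefer to fetch the archives without the helper function, you can download and extract them with websave and unzip. This is only a sketch: the Zenodo URL and archive name below are assumptions, so verify them on the DCASE download page first.

url = 'https://zenodo.org/record/400515/files/TUT-acoustic-scenes-2017-development.audio.1.zip'; % assumed URL, verify first
zipFile = fullfile(downloadFolder,'dev.audio.1.zip');
websave(zipFile,url);         % download one of the multipart archives
unzip(zipFile,datasetFolder); % extract into the dataset folder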

Read in the development set metadata as a table. Name the three table variables FileName, AcousticScene, and SpecificLocation.

metadata_train = readtable(fullfile(datasetFolder,'TUT-acoustic-scenes-2017-development','meta.txt'), ...
    'Delimiter',{'\t'}, ...
    'ReadVariableNames',false);
metadata_train.Properties.VariableNames = {'FileName','AcousticScene','SpecificLocation'};
head(metadata_train) % display the first few rows of the training metadata

Read in the test set metadata the same way.

metadata_test = readtable(fullfile(datasetFolder,'TUT-acoustic-scenes-2017-evaluation','meta.txt'), ...
    'Delimiter',{'\t'}, ...
    'ReadVariableNames',false);
metadata_test.Properties.VariableNames = {'FileName','AcousticScene','SpecificLocation'};
head(metadata_test) % display the first few rows of the test metadata

The following code checks that the training and test sets do not overlap: the specific recording locations in the test set do not intersect the specific recording locations in the development set. This makes it easier to verify that the trained models can generalize to real-world scenarios. (Because the sets are disjoint, the count printed below is 0.)

sharedRecordingLocations = intersect(metadata_test.SpecificLocation,metadata_train.SpecificLocation);
fprintf('Number of specific recording locations in both train and test sets = %d\n',numel(sharedRecordingLocations))

The first variable of each metadata table contains the file names. Concatenate the file names with the file paths to build the full path to each file.

train_filePaths = fullfile(datasetFolder,'TUT-acoustic-scenes-2017-development',metadata_train.FileName);

test_filePaths = fullfile(datasetFolder,'TUT-acoustic-scenes-2017-evaluation',metadata_test.FileName);

Create audio datastores for the training and test sets. Set the Labels property of each audioDatastore to the acoustic scene. Call countEachLabel to verify an even distribution of labels in the training and test sets.

adsTrain = audioDatastore(train_filePaths, ...
    'Labels',categorical(metadata_train.AcousticScene), ...
    'IncludeSubfolders',true);
display(countEachLabel(adsTrain)) % count and display the labels

adsTest = audioDatastore(test_filePaths, ...
    'Labels',categorical(metadata_test.AcousticScene), ...
    'IncludeSubfolders',true);
display(countEachLabel(adsTest)) % count and display the labels
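As an optional sanity check, read one file from the training datastore and confirm the sample rate and the expected 10-second segment length (a minimal sketch, not part of the original example):

[audioIn,fileInfo] = read(adsTrain); % read the first file and its info struct
fs = fileInfo.SampleRate;
fprintf('Sample rate = %d Hz, duration = %.1f s\n',fs,size(audioIn,1)/fs)
reset(adsTrain)                      % rewind so later processing starts at the first file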

You can reduce the data set used in this example to speed up the runtime at the cost of performance. In general, reducing the data set is a good practice for development and debugging. Set reduceDataset to true to reduce the data set.

reduceDataset = false;
if reduceDataset
    adsTrain = splitEachLabel(adsTrain,20);
    adsTest = splitEachLabel(adsTest,10);
end

References

[1] A. Mesaros, T. Heittola, and T. Virtanen. "Acoustic Scene Classification: An Overview of DCASE 2017 Challenge Entries." In Proc. International Workshop on Acoustic Signal Enhancement, 2018.

[2] Huszár, Ferenc. "Mixup: Data-Dependent Data Augmentation." inFERENCe, November 3, 2017. Accessed January 15, 2019. https://www.inference.vc/mixup-data-dependent-data-augmentation/.

[3] Han, Yoonchang, Jeongsoo Park, and Kyogu Lee. "Convolutional Neural Networks with Binaural Representations and Background Subtraction for Acoustic Scene Classification." Detection and Classification of Acoustic Scenes and Events (DCASE) (2017): 1-5.

[4] Lostanlen, Vincent, and Joakim Andén. "Binaural Scene Classification with Wavelet Scattering." Technical Report, DCASE2016 Challenge, 2016.

[5] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa, G. Lorho, and J. Huopaniemi. "Audio-Based Context Recognition." IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 321-329, Jan. 2006.
