【Python Deep Learning】Building a CNN Music-Genre Classifier with TensorFlow 2.0 (Part 1)

Preface

While browsing GitHub, I stumbled upon a project: deep-learning-based music recommendation by VikramShenoy97. The author builds a CNN music-genre classifier whose input is a 128×128×1 tensor, i.e. a mel spectrogram of 128 frames with 128 mel bands per frame; the output is a softmax over 8 genres. For the recommendation part, simple cosine similarity (a common technique in NLP) is used to rank candidates by smallest distance. The author reports a training accuracy of 0.7785 and a validation accuracy of 0.6611 after 10 epochs. I reproduced the model with Python 3.7 + TensorFlow 2.0 and reached a training accuracy of 0.9132 and a validation accuracy of 0.7525 after 30 epochs, though mild overfitting appears around epoch 25.
(Figure: training and validation accuracy curves)

Reproduction Code

Since the project contains many files, I will only show the key parts here; the full code is available on my GitHub.

MP3 to Mel Spectrogram

The key dependency here is the librosa audio-processing library, which already wraps many convenient methods, such as converting audio to a mel spectrogram and converting power to decibels. (One open question: librosa occasionally raises an exception when reading certain MP3 files, even though most load fine, so I simply catch the exception and skip the offending file. If you have a better explanation, feel free to leave a comment.)

import os
import pandas as pd
import re
import librosa
import librosa.display
import matplotlib.pyplot as plt

"""
Convert 30s mp3 files into mel-spectrograms.

A mel-spectrogram is a kind of time-frequency representation.
It is obtained from an audio signal by computing the Fourier transforms of short, overlapping windows.
Each of these Fourier transforms constitutes a frame.
These successive frames are then concatenated into a matrix to form the spectrogram.
"""
def create_spectrogram(verbose=0, mode=None):
    if mode == "Train":
        if os.path.exists('Train_Spectogram_Images'):
            return
        # Get Genres and Track IDs from the tracks.csv file
        filename_metadata = "Dataset/fma_metadata/tracks.csv"
        tracks = pd.read_csv(filename_metadata, header=2, low_memory=False)
        tracks_array = tracks.values
        tracks_id_array = tracks_array[:, 0]
        tracks_genre_array = tracks_array[:, 40]
        tracks_id_array = tracks_id_array.reshape(tracks_id_array.shape[0], 1)
        tracks_genre_array = tracks_genre_array.reshape(tracks_genre_array.shape[0], 1)

        folder_sample = "Dataset/fma_small"
        directories = [d for d in os.listdir(folder_sample)
                       if os.path.isdir(os.path.join(folder_sample, d))]
        counter = 0
        if verbose > 0:
            print("Converting mp3 audio files into mel Spectograms ...")
        if not os.path.exists('Train_Spectogram_Images'):
            os.makedirs('Train_Spectogram_Images')
        for d in directories:
            label_directory = os.path.join(folder_sample, d)
            file_names = [os.path.join(label_directory, f)
                          for f in os.listdir(label_directory)
                          if f.endswith(".mp3")]

            # Convert .mp3 files into mel-Spectograms
            for f in file_names:
                f = f.replace('\\', '/')
                track_id = int(re.search('fma_small/.*/(.+?).mp3', f).group(1))
                track_index = list(tracks_id_array).index(int(track_id))
                if str(tracks_genre_array[track_index, 0]) != '0':
                    print(f)
                    try:
                        y, sr = librosa.load(f, sr=44100)
                    except Exception:
                        # librosa occasionally fails on certain mp3 files; skip them
                        print('load error, skipping')
                        continue
                    melspectrogram_array = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)
                    mel = librosa.power_to_db(melspectrogram_array)
                    # Length and Width of Spectogram
                    fig_size = plt.rcParams["figure.figsize"]
                    fig_size[0] = float(mel.shape[1]) / float(100)
                    fig_size[1] = float(mel.shape[0]) / float(100)
                    plt.rcParams["figure.figsize"] = fig_size
                    plt.axis('off')
                    plt.axes([0., 0., 1., 1.0], frameon=False, xticks=[], yticks=[])
                    librosa.display.specshow(mel, cmap='gray_r')
                    plt.savefig("Train_Spectogram_Images/"+str(counter)+"_"+str(tracks_genre_array[track_index,0])+".jpg", bbox_inches=None, pad_inches=0)
                    plt.close()
                    counter = counter + 1
        return

    elif mode == "Test":
        if os.path.exists('Test_Spectogram_Images'):
            return

        folder_sample = "Dataset/DLMusicTest_30"
        counter = 0
        if verbose > 0:
            print("Converting mp3 audio files into mel Spectograms ...")
        if not os.path.exists('Test_Spectogram_Images'):
            os.makedirs('Test_Spectogram_Images')
        file_names = [os.path.join(folder_sample, f) for f in os.listdir(folder_sample)
                       if f.endswith(".mp3")]
        # Convert .mp3 files into mel-Spectograms
        for f in file_names:
            f = f.replace('\\', '/')
            test_id = int(re.search('Dataset/DLMusicTest_30/(.+?).mp3', f).group(1))

            y, sr = librosa.load(f)
            melspectrogram_array = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128,fmax=8000)
            mel = librosa.power_to_db(melspectrogram_array)
            # Length and Width of Spectogram
            fig_size = plt.rcParams["figure.figsize"]
            fig_size[0] = float(mel.shape[1]) / float(100)
            fig_size[1] = float(mel.shape[0]) / float(100)
            plt.rcParams["figure.figsize"] = fig_size
            plt.axis('off')
            plt.axes([0., 0., 1., 1.0], frameon=False, xticks=[], yticks=[])
            librosa.display.specshow(mel, cmap='gray_r')
            plt.savefig("Test_Spectogram_Images/"+str(test_id)+".jpg", bbox_inches=None, pad_inches=0)
            plt.close()
        return
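The docstring above describes a spectrogram as Fourier transforms of short, overlapping windows stacked into a matrix. That idea can be sketched with plain NumPy; note this computes only a magnitude STFT (no mel filterbank), so it is an illustration of the concept, not a substitute for `librosa.feature.melspectrogram`:

```python
import numpy as np

def stft_spectrogram(y, n_fft=512, hop=256):
    # Fourier transforms of short, overlapping Hann windows,
    # stacked column-wise into a (n_fft//2 + 1, n_frames) matrix
    window = np.hanning(n_fft)
    columns = []
    for start in range(0, len(y) - n_fft + 1, hop):
        frame = y[start:start + n_fft] * window
        columns.append(np.abs(np.fft.rfft(frame)))
    return np.stack(columns, axis=1)

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)      # one second of a 440 Hz sine
spec = stft_spectrogram(y)
print(spec.shape)                    # (257, 85): freq bins x frames
peak_hz = spec[:, 0].argmax() * sr / 512
print(peak_hz)                       # close to 440 Hz (bin resolution ~43 Hz)
```

Each column of the matrix is one frame, which is exactly the structure the CNN later treats as an image.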

CNN Model

  1. The author uses four convolutional layers, each followed by a BatchNormalization layer and an average-pooling layer. The first two convolutions use 7×7 kernels: a large kernel links each position to a wider surrounding region (which may strengthen frame-to-frame relationships). The last two use 3×3 kernels, which only connect to the immediate neighborhood (after the first two conv + pool stages the feature maps are already much smaller, so a large kernel would be a poor fit and might wash out local features).
  2. After the convolutional stack, the 2×2×512 feature map is flattened into a 1-D tensor, passed through alternating Dropout and ReLU Dense layers, and finally through a softmax output.
  3. The validation set is taken directly as 10% of the training set.
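The feature-map sizes quoted above (128×128 shrinking down to 2×2×512, i.e. 2048 after flattening) all follow from the 'valid'-padding output formula; a quick sanity check:

```python
# 'valid' padding: output = floor((input - kernel) / stride) + 1
def out_size(size, kernel, stride=1):
    return (size - kernel) // stride + 1

size = 128
# (kernel, stride) of each conv / pool stage, in order:
# conv7, pool2/2, conv7/2, pool2/2, conv3, pool2/2, conv3, pool2/2
for kernel, stride in [(7, 1), (2, 2), (7, 2), (2, 2),
                       (3, 1), (2, 2), (3, 1), (2, 2)]:
    size = out_size(size, kernel, stride)

print(size, size * size * 512)  # 2 2048
```

The intermediate values (122, 61, 28, 14, 12, 6, 4, 2) match the `# Dim = ...` comments in the model code.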
from tensorflow.keras.models import Sequential
from tensorflow.keras import initializers
from tensorflow.keras import optimizers
from tensorflow.keras.utils import plot_model
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from load_data import load_dataset
import pandas as pd

train_x, train_y, test_x, test_y, n_classes, genre = load_dataset(verbose=1, mode="Train", datasetSize=0.75)
# datasetSize = 0.75, this returns 3/4th of the dataset.

# Expand the dimensions of the image to have a channel dimension. (nx128x128) ==> (nx128x128x1)
train_x = train_x.reshape(train_x.shape[0], train_x.shape[1], train_x.shape[2], 1)
test_x = test_x.reshape(test_x.shape[0], test_x.shape[1], test_x.shape[2], 1)

# Normalize the matrices.
train_x = train_x / 255.
test_x = test_x / 255.


model = Sequential()
model.add(Conv2D(filters=64, kernel_size=[7, 7], kernel_initializer=initializers.he_normal(seed=1),
                 activation="relu", input_shape=(128, 128, 1)))
# Dim = (122x122x64)
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=[2, 2], strides=2))
# Dim = (61x61x64)
model.add(Conv2D(filters=128, kernel_size=[7, 7], strides=2, kernel_initializer=initializers.he_normal(seed=1), activation="relu"))
# Dim = (28x28x128)
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=[2, 2], strides=2))
# Dim = (14x14x128)
model.add(Conv2D(filters=256, kernel_size=[3, 3], kernel_initializer=initializers.he_normal(seed=1), activation="relu"))
# Dim = (12x12x256)
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=[2, 2], strides=2))
# Dim = (6x6x256)
model.add(Conv2D(filters=512, kernel_size=[3, 3], kernel_initializer=initializers.he_normal(seed=1), activation="relu"))
# Dim = (4x4x512)
model.add(BatchNormalization())
model.add(AveragePooling2D(pool_size=[2, 2], strides=2))
# Dim = (2x2x512)
model.add(BatchNormalization())
model.add(Flatten())
# Dim = (2048)
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(1024, activation="relu", kernel_initializer=initializers.he_normal(seed=1)))
# Dim = (1024)
model.add(Dropout(0.5))
model.add(Dense(256, activation="relu", kernel_initializer=initializers.he_normal(seed=1)))
# Dim = (256)
model.add(Dropout(0.25))
model.add(Dense(64, activation="relu", kernel_initializer=initializers.he_normal(seed=1)))
# Dim = (64)
model.add(Dense(32, activation="relu", kernel_initializer=initializers.he_normal(seed=1)))
# Dim = (32)
model.add(Dense(n_classes, activation="softmax", kernel_initializer=initializers.he_normal(seed=1)))
# Dim = (8)
print(model.summary())
plot_model(model, to_file="Saved_Model/Model_Architecture.jpg")
model.compile(loss="categorical_crossentropy", optimizer=optimizers.Adam(learning_rate=0.0001), metrics=['accuracy'])
pd.DataFrame(model.fit(train_x, train_y, epochs=10, validation_split=0.1).history).to_csv("Saved_Model/training_history.csv")
score = model.evaluate(test_x, test_y, verbose=1)
print(score)
model.save("Saved_Model/Model.h5")

Training Results

(Figures: training/validation accuracy and loss curves, and the confusion matrix)
Judging visually from the confusion matrix, the international genre has a relatively low error rate, while instrumental is misclassified somewhat more often.

Summary

In summary, using mel spectrograms of music as features, the CNN genre classifier reaches about 0.75 validation accuracy. In the final testing stage, cosine similarity is used as the distance measure to recommend the most similar tracks.
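The cosine-similarity distance mentioned here can be sketched as follows; the two 8-dimensional genre vectors below are made-up examples standing in for the classifier's softmax outputs, not real model predictions:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| * |b|); 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical softmax genre vectors for two tracks (8 genres)
track_a = np.array([0.70, 0.10, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02])
track_b = np.array([0.60, 0.20, 0.05, 0.05, 0.02, 0.04, 0.02, 0.02])

print(round(cosine_similarity(track_a, track_b), 3))  # close to 1: similar tracks
```

To build the recommendation list, one would compute this score between the query track's vector and every candidate's vector, then return the highest-scoring candidates.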
