CNN-Based Classification of Dog Barks and Cat Meows

I recently started interning in Beijing, and my first employer has turned out to be pretty good. As an aside, life in Beijing is not as scary or as exhausting as I had imagined; it actually has a lot of warmth to it.
The company's main business is voiceprint recognition, and the project team currently seems to be split into two tracks: traditional machine learning and deep learning. The first thing I was exposed to was a wake-word AI product, so to get up to speed on this area as quickly as possible, I went looking for a small practice project.
The classification itself is simple: the model just has to tell dog barks from cat meows. All it takes is extracting the audio features with an off-the-shelf Python library and then training a convolutional network on them. The real headache was getting the data: I couldn't find a suitable dataset, so I had to write a scraper and crawl the clips myself (pretty proud of that, haha)!

import urllib.request
from bs4 import BeautifulSoup

def download(url, save_path):
    # Python 3: urlretrieve lives under urllib.request
    urllib.request.urlretrieve(url, save_path)

dog_links = [
    "http://sc.chinaz.com/tag_yinxiao/GouJiao.html",
    "http://sc.chinaz.com/tag_yinxiao/GouJiao_2.html",
    "http://sc.chinaz.com/tag_yinxiao/GouJiao_3.html",
    "http://sc.chinaz.com/tag_yinxiao/GouJiao_4.html"
]

cat_links = [
    "http://sc.chinaz.com/tag_yinxiao/MaoJiao.html",
    "http://sc.chinaz.com/tag_yinxiao/MaoJiao_2.html",
    "http://sc.chinaz.com/tag_yinxiao/MaoJiao_3.html",
]

count = 10
for link in cat_links:  # the list holds plain URL strings

    response = urllib.request.urlopen(link)
    content = response.read().decode('utf-8')
    # print(content)
    soup = BeautifulSoup(content, "html.parser")
    divs = soup.findAll("div", class_="music_block")

    for div in divs:

        # the second <a> in each block links to the sound's detail page
        a = div.find_all("a")[1]["href"]

        content = urllib.request.urlopen(a).read().decode('utf-8')

        # the second "dian" div on the detail page holds the direct audio link
        audio = BeautifulSoup(content, "html.parser").findAll("div", class_="dian")[1].find("a")["href"]

        count += 1
        download(audio, "./data/cat/cat" + str(count) + ".wav")

That yielded a bit over 100 clips. Since that is not much material, the longer clips need to be cut into short segments, which multiplies the amount of data (this step matters a lot; the amount of data directly affects the final result).

Next comes cutting and splitting the dataset, extracting the features (MFCC), and finally training with a convolutional network. The convolution is not applied to the raw audio files but to the extracted MFCC features; audio apparently admits many more feature representations than images do, which is something I still need to study further, but fortunately the Python libraries have it all written for us. Straight to the code:
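Before the full script, here is a minimal sketch of what the feature extractor produces (the file path is a hypothetical example; assuming a mono wav, python_speech_features.mfcc with default settings returns 13 cepstral coefficients per roughly 10 ms frame):

import scipy.io.wavfile as wav
from python_speech_features import mfcc

# hypothetical example file; any mono wav from the scraped data works
fs, audio = wav.read("./data/cat/cat11.wav")

# defaults: 25 ms window, 10 ms step, 13 cepstral coefficients per frame
features = mfcc(audio, samplerate=fs)

# a 2-second clip yields roughly 200 frames, i.e. a shape around (200, 13)
print(features.shape)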

from pydub import AudioSegment
import os
import numpy as np
import scipy.io.wavfile as wav
from python_speech_features import mfcc
import pickle
import random

AUDIO_LEN = 400
MFCC_LEN = 13

def create_data():
    item_len = 2000  # segment length in milliseconds

    base_path = "./all/"

    for file in os.listdir(base_path):
        real_path = base_path + file
        audio = AudioSegment.from_file(real_path)
        audio_len = len(audio)  # pydub lengths are in ms

        # cut each clip into 2-second pieces; keep at least one piece per file
        steps = audio_len // item_len
        for step in range(max(1, steps)):
            item_audio = audio[step * item_len: (step + 1) * item_len]
            # pad the (possibly short) piece with silence up to 2 seconds
            save_audio = item_audio + AudioSegment.silent(item_len - len(item_audio))

            save_audio.export("./data/" + str(step) + file, format="wav")

# run once to cut the raw clips into 2-second wavs:
# create_data()

def split_train_test():
    base_dir = "./data/"
    dogs = []
    cats = []
    for file in os.listdir(base_dir):
        real_file = base_dir + file
        if "cat" in file:
            cats.append(real_file)
        else:
            dogs.append(real_file)
    # hold out 10 dog clips and 10 cat clips as the test set
    test = dogs[:10] + cats[:10]
    train = dogs[10:] + cats[10:]
    random.shuffle(train)
    random.shuffle(test)
    pickle.dump(train, open("./train", "wb"))
    pickle.dump(test, open("./test", "wb"))

# split_train_test() must have been run once so that these pickles exist
train_data = pickle.load(open("train", "rb"))
test_data = pickle.load(open("test", "rb"))

def get_train_or_test(type, batch_size=30):
    x = []
    y = []
    data = train_data if type == "train" else test_data
    if type == "test":
        batch_size = 10
    all_dogs = [item for item in data if "dog" in item]
    all_cats = [item for item in data if "cat" in item]

    # draw a balanced sample: half dogs, half cats
    sample_dogs = random.sample(all_dogs, batch_size // 2)
    sample_cats = random.sample(all_cats, batch_size - batch_size // 2)

    sample = sample_dogs + sample_cats
    random.shuffle(sample)

    for item in sample:
        try:
            fs, audio = wav.read(item)
            processed_audio = mfcc(audio, samplerate=fs)
            # zero-pad the MFCC matrix to a fixed (AUDIO_LEN, MFCC_LEN) shape
            x_hold = np.zeros(shape=(AUDIO_LEN, MFCC_LEN))
            x_hold[:len(processed_audio), :] = processed_audio
            x.append(x_hold)
            if type == "train":
                # simple augmentation: reverse the coefficient order inside
                # every frame (the original list(data).reverse() reversed a
                # throwaway copy, so it never actually changed anything)
                x.append(x_hold[:, ::-1])
                y.append([1, 0] if "cat" in item else [0, 1])
            y.append([1, 0] if "cat" in item else [0, 1])
        except Exception:
            print("error reading", item)
    # crude scaling to keep the feature values small
    x = np.array(x) / 100
    y = np.array(y)
    return x, y

This part covers the preprocessing that happens before training: cutting the audio, splitting it into training and test sets, and extracting the features.
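As a quick usage sketch, assuming the preprocessing script above is saved as update_audio.py (the name the training script below imports), the pipeline runs like this. Note that the module loads the pickles at import time, so split_train_test() must have been run once first:

import update_audio

# one-time steps on a fresh checkout (uncomment, run, then re-comment):
# update_audio.create_data()        # cut raw clips in ./all/ into 2 s wavs
# update_audio.split_train_test()   # write the ./train and ./test pickles

x_batch, y_batch = update_audio.get_train_or_test("train")
# train batches are doubled by the reversed-frame augmentation, so with
# batch_size=30 and no unreadable files this prints (60, 400, 13) (60, 2)
print(x_batch.shape, y_batch.shape)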

import tensorflow as tf
import update_audio

AUDIO_LEN = update_audio.AUDIO_LEN
MFCC_LEN = update_audio.MFCC_LEN

x = tf.placeholder(shape=(None, AUDIO_LEN, MFCC_LEN), dtype=tf.float32)
x_change = tf.expand_dims(x, -1)  # add a channel axis for conv2d
# labels must be float32 to match the logits in the cross-entropy op
y = tf.placeholder(shape=(None, 2), dtype=tf.float32)
# keep_prob defaults to 1.0, so dropout is disabled at evaluation time
keep_prob = tf.placeholder_with_default(1.0, shape=())

def create_model():
    print("input shape", x_change.shape)
    # single conv layer: 10x3 kernels over (time, mfcc) with 64 filters
    filter1 = tf.Variable(tf.random_normal([10, 3, 1, 64]))
    bias1 = tf.Variable(tf.random_normal([64]))
    conv_1 = tf.nn.conv2d(x_change, filter1, strides=[1, 1, 1, 1], padding="SAME") + bias1
    print("conv_1 shape", conv_1.shape)
    relu1 = tf.nn.relu(conv_1)
    # dropout only fires during training (keep_prob is fed as 0.5 below)
    dropout1 = tf.nn.dropout(relu1, keep_prob)
    max_pool1 = tf.nn.max_pool(dropout1, [1, 2, 2, 1], [1, 2, 2, 1], 'SAME')
    print("max pool shape", max_pool1.shape)

    flatten = tf.layers.flatten(max_pool1)

    print("flatten shape", flatten.shape)

    net_work = tf.layers.dense(flatten, units=128, activation=tf.nn.relu)

    # no softmax here: softmax_cross_entropy_with_logits expects raw logits,
    # so a tf.nn.softmax activation on this layer would squash them twice
    logit = tf.layers.dense(net_work, units=2)

    print("logit shape", logit.shape)

    return logit

logit = create_model()

def build_loss():
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logit))

    return loss

loss = build_loss()

def create_opt():
    opt = tf.train.GradientDescentOptimizer(0.0001).minimize(loss)
    return opt

opt = create_opt()

correct_prediction = tf.equal(tf.argmax(logit,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # resume from a checkpoint only if one exists (restoring unconditionally
    # crashes on the very first run)
    if tf.train.latest_checkpoint("./model"):
        saver.restore(sess, "./model/model.model")

    feed_test_x, feed_test_y = update_audio.get_train_or_test("test")
    feed_test = {x: feed_test_x, y: feed_test_y}

    for index in range(10000):
        feed_x, feed_y = update_audio.get_train_or_test("train")
        feed = {x: feed_x, y: feed_y, keep_prob: 0.5}

        get_loss, _ = sess.run([loss, opt], feed_dict=feed)

        if index % 100 == 0:
            get_acc = sess.run(accuracy, feed_dict=feed_test)
            print(get_loss, get_acc)

            saver.save(sess, "./model/model.model")

The model's main module simply convolves over the features. The features here are not especially complex and the dataset is not large, so a single convolutional layer is enough before the classification output.
The final result I got was a loss of about 0.31 and an accuracy close to 0.8 (3 to 4 errors out of 20 test clips). To improve further, one could extract richer features from the data or enlarge the training set; the results should still go up!
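To round things off, here is a minimal inference sketch under the same assumptions as the training script above (it reuses its x, logit, saver, AUDIO_LEN and MFCC_LEN; the wav path is a hypothetical example). It loads the checkpoint, featurizes one clip the same way as in training, and takes the argmax of the logits:

import numpy as np
import tensorflow as tf
import scipy.io.wavfile as wav
from python_speech_features import mfcc

def predict(sess, wav_path):
    fs, audio = wav.read(wav_path)
    # featurize exactly as in training: MFCC, fixed-size pad, /100 scaling;
    # truncate anything longer than the training window
    features = mfcc(audio, samplerate=fs)[:AUDIO_LEN]
    holder = np.zeros(shape=(AUDIO_LEN, MFCC_LEN))
    holder[:len(features), :] = features
    batch = np.expand_dims(holder / 100, 0)  # shape (1, 400, 13)
    # keep_prob defaults to 1.0 here, so dropout stays off
    scores = sess.run(logit, feed_dict={x: batch})
    # label order follows the training one-hots: [1, 0] = cat, [0, 1] = dog
    return "cat" if np.argmax(scores[0]) == 0 else "dog"

with tf.Session() as sess:
    saver.restore(sess, "./model/model.model")
    print(predict(sess, "./data/cat/0cat11.wav"))  # hypothetical path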
