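Speech Classification with Various Classification Algorithms (Age-Group Recognition)
Extract the corpus, then classify it with classification algorithms.
Corpus Extraction and Labeling
TIMIT/DOC/SPKRINFO.TXT contains the speaker information, which serves as the basis for the class labels.
Define the function def initspeakerinfo(speakerinfo), which builds a speaker:age dictionary: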
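def initspeakerinfo(speakerinfo):
    # Map "<sex><speaker id>" (e.g. "MABC0") to the age in years; 0 means unknown.
    agedict = {}
    with open(speakerinfo, 'r') as f:
        for line in f:
            linelist = line.split()
            # Skip comment lines (starting with ';') and incomplete records.
            if line.startswith(';') or len(linelist) < 6:
                continue
            recorddate = linelist[4].split('/')  # MM/DD/YY
            birthdate = linelist[5].split('/')   # MM/DD/YY
            if "??" in recorddate or "??" in birthdate:
                age = 0
            else:
                # Convert both dates to approximate day counts, then subtract.
                # (The birth terms must be parenthesized as a whole.)
                recorddays = int(recorddate[2])*365 + int(recorddate[0])*30 + int(recorddate[1])
                birthdays = int(birthdate[2])*365 + int(birthdate[0])*30 + int(birthdate[1])
                age = (recorddays - birthdays) / 365.0
            agedict[linelist[1] + linelist[0]] = age
    return agedict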
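The 365-day/30-day arithmetic above is an approximation. As a hedged alternative, the sketch below computes the age from real calendar dates with the standard datetime module. The helper names parse_mmddyy and age_in_years are hypothetical, the sample dates are made up, and the code assumes the MM/DD/YY layout used above with every year in the 1900s.

```python
from datetime import date

def parse_mmddyy(text):
    """Parse 'MM/DD/YY' into a date, assuming a 19YY year; return None if unknown."""
    month, day, year = text.split('/')
    if '??' in (month, day, year):
        return None
    return date(1900 + int(year), int(month), int(day))

def age_in_years(recdate, birthdate):
    """Return the age at recording time in years, or 0 when either date is unknown."""
    rec = parse_mmddyy(recdate)
    birth = parse_mmddyy(birthdate)
    if rec is None or birth is None:
        return 0
    return (rec - birth).days / 365.25

# Made-up dates purely for demonstration.
print(age_in_years("03/03/86", "06/17/60"))
```

Returning 0 for unknown dates keeps the result compatible with the speaker:age dictionary built above.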
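The ages are then mapped to class labels, for example for a three-class or two-class task. The three-class version:
def getclass(filename, agedict):
    # filename is the speaker key, e.g. "MABC0"
    age = agedict[filename]
    # Note: an unknown age (0) gets "0", the same label as the middle group.
    if age == 0:
        return "0"
    if age <= 25:
        return "-1"
    elif age <= 45:
        return "0"
    else:
        return "+1"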
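The function above covers the three-class case. For the two-class case, a minimal sketch might look like the following. The function name getclass2 and the 35-year boundary are assumptions for illustration, not values from the original experiment. Unlike the function above, it returns None for unknown ages so that those speakers can be skipped instead of being merged into a real class.

```python
def getclass2(speaker, ages, boundary=35):
    """Return "-1" for speakers at or below `boundary` years, "+1" above it.

    `ages` is a speaker:age dictionary in which 0 marks an unknown age;
    unknown or missing speakers yield None so the caller can drop them.
    """
    age = ages.get(speaker, 0)
    if age == 0:
        return None
    return "-1" if age <= boundary else "+1"

# Made-up ages purely for demonstration.
example_ages = {"MABC0": 25.7, "FXYZ0": 52.3, "MUNK0": 0}
labels = {spk: getclass2(spk, example_ages) for spk in example_ages}
print(labels)
```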
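Feature Representation
MFCC and i-vector features were extracted earlier. The MFCCs form a 38*n matrix, where 38 is the MFCC dimension and n is the number of frames in an utterance, while the i-vector is a 1*200 matrix. Before MFCCs can be used for classification they have to be reduced to a fixed-length vector. The simplest approach is to average the 38*n matrix over its n frames and then normalize the result.
Define the function def initavgmfcc(avgmfccname,mfccpath).
It reads the MFCC files under mfccpath, averages and normalizes each one, and writes all results to a single file.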
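import os

def initavgmfcc(avgmfccname, mfccpath):
    # Coefficients read per frame; the text above describes 38-dimensional
    # MFCCs, so adjust this to match the feature files.
    dimen = 13
    with open(avgmfccname, 'w') as f:
        for filename in os.listdir(mfccpath):
            avgmfcc = [0.0] * dimen
            length = 0  # number of frames read (starting at 1 would bias the mean)
            with open(os.path.join(mfccpath, filename), 'r') as fo:
                for line in fo:
                    linelist = line.split()
                    if len(linelist) < dimen:
                        continue  # skip blank or malformed lines
                    for i in range(dimen):
                        avgmfcc[i] += float(linelist[i])
                    length += 1
            if length == 0:
                continue  # empty file: nothing to average
            # Mean over all frames.
            avgmfcc = [v / length for v in avgmfcc]
            # Min-max normalization across the coefficients of this vector.
            listmin = min(avgmfcc)
            listmax = max(avgmfcc)
            span = (listmax - listmin) or 1.0  # avoid division by zero
            normalized = [str((v - listmin) / span) for v in avgmfcc]
            f.write(filename + " " + " ".join(normalized) + "\n")
            print(filename + " avg over")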
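Define the function def initiv(ivname,ivpath).
It reads the i-vector files under ivpath and writes them to a single file, together with a second file (ivname plus the suffix "avg") that holds the normalized vectors.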
import os

def initiv(ivname, ivpath):
    dimen = 200  # i-vector dimension
    # f receives the raw i-vectors; avgf receives the min-max normalized ones.
    with open(ivname, 'w') as f, open(ivname + "avg", 'w') as avgf:
        for filename in os.listdir(ivpath):
            with open(os.path.join(ivpath, filename), 'r') as fo:
                for line in fo:
                    linelist = line.split()
                    if len(linelist) != dimen:
                        continue
                    f.write(filename + " " + " ".join(linelist) + "\n")
                    # float() is safer than eval() for parsing numbers
                    values = [float(v) for v in linelist]
                    listmin = min(values)
                    listmax = max(values)
                    span = (listmax - listmin) or 1.0  # avoid division by zero
                    normiv = [str((v - listmin) / span) for v in values]
                    avgf.write(filename + " " + " ".join(normiv) + "\n")
PS: https://www.zhihu.com/question/20455227 explains normalization.
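The functions above rescale each vector by its own minimum and maximum. Another common choice, in the spirit of LIBSVM's svm-scale tool, is to scale every feature dimension separately using ranges measured on the training set. The sketch below illustrates that variant in plain Python. The function names fit_minmax and apply_minmax are hypothetical and the numbers are made up.

```python
def fit_minmax(rows):
    """Return per-column (min, max) pairs measured on the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_minmax(rows, ranges):
    """Scale each column to [0, 1] using the given ranges; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

# Made-up feature rows purely for demonstration.
train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
test = [[2.5, 40.0]]

ranges = fit_minmax(train)                 # measured on the training set only
train_scaled = apply_minmax(train, ranges)
test_scaled = apply_minmax(test, ranges)   # test values may fall outside [0, 1]
print(train_scaled)
print(test_scaled)
```

Reusing the training ranges on the test set keeps both sets on the same scale, which matters for distance-based classifiers such as SVM and KNN.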
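Classification with LIBSVM
Installation
Follow http://blog.csdn.net/lqhbupt/article/details/8599295 to install LIBSVM.
PS: the 64-bit build takes a little more work, but it can also be built with nmake.
LIBSVM Format
http://blog.csdn.net/kobesdu/article/details/8944851 describes the LIBSVM format and how to generate it.
In short, the format is: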
+1 1:0.533355514244 2:0.225956771932 3:0.551555751325 4:0.448831840291 5:0.732958158188 6:0.516967914119 ...
-1 1:0.723092649707 2:0.352547706883 3:0.524416372722 4:0.683881004712 5:0.464490812227 6:0.70279542324 ...
...
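In fact, a few lines of Python are enough to produce it.
Finally, define the function def initFormat(formatname,avgmfccname,dict,dimen), which generates the following LIBSVM-format files: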
- FormatData-iv-train
- FormatData-iv-test
- FormatData-mfcc-train
- FormatData-mfcc-test
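The body of initFormat is not shown here. As a rough idea of what those few lines of Python could look like, the sketch below turns one label and one feature vector into a LIBSVM-format line. The helper name to_libsvm_line is hypothetical and the values are made up. Feature indices start at 1, as in the sample lines above.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (1-based feature indices)."""
    pairs = ["%d:%s" % (index, value) for index, value in enumerate(features, start=1)]
    return " ".join([str(label)] + pairs)

# Made-up label and feature values purely for demonstration.
line = to_libsvm_line("+1", [0.5, 0.25, 0.75])
print(line)  # +1 1:0.5 2:0.25 3:0.75
```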
Parameter Tuning
Parameter search can be run with libsvm-3.21/tools/grid.py:
E:\libsvm-3.21\tools>grid.py
Usage: grid.py [grid_options] [svm_options] dataset
grid_options :
-log2c {begin,end,step | "null"} : set the range of c (default -5,15,2)
begin,end,step -- c_range = 2^{begin,...,begin+k*step,...,end}
"null" -- do not grid with c
-log2g {begin,end,step | "null"} : set the range of g (default 3,-15,-2)
begin,end,step -- g_range = 2^{begin,...,begin+k*step,...,end}
"null" -- do not grid with g
-v n : n-fold cross validation (default 5)
-svmtrain pathname : set svm executable path and name
-gnuplot {pathname | "null"} :
pathname -- set gnuplot executable path and name
"null" -- do not plot
-out {pathname | "null"} : (default dataset.out)
pathname -- set output file path and name
"null" -- do not output file
-png pathname : set graphic output file path and name (default dataset.png)
-resume [pathname] : resume the grid task using an existing output file (default pathname is dataset.out)
This is experimental. Try this option only if some parameters have been checked for the SAME data.
The options are listed above. The script is used to find the parameters C and gamma.
http://m.blog.csdn.net/article/details?id=46386201
Parameter search is based on cross validation (-v n): the data set is split into n parts, and each part in turn serves as the test set while the remaining n-1 parts form the training set.
C and gamma take their values from the ranges set by -log2c and -log2g in the usage above (defaults -5,15,2 and 3,-15,-2).
The search is a plain enumeration: if the C range contains numC values and the gamma range contains numG values, a total of numC*numG*n training and test runs are performed. The cross-validation accuracy is printed for each (C, gamma) pair, and the pair with the highest accuracy is taken as the result of the search.
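To make the enumeration concrete, the sketch below expands the default -log2c and -log2g ranges and counts the training runs for 5-fold cross validation. It only mirrors the arithmetic described above and is not taken from grid.py. The function name log2_range is hypothetical.

```python
def log2_range(begin, end, step):
    """Return the exponents begin, begin+step, ... up to and including end."""
    exponents = []
    k = begin
    while (step > 0 and k <= end) or (step < 0 and k >= end):
        exponents.append(k)
        k += step
    return exponents

c_exponents = log2_range(-5, 15, 2)    # default -log2c
g_exponents = log2_range(3, -15, -2)   # default -log2g
n_folds = 5                            # default -v

num_c = len(c_exponents)
num_g = len(g_exponents)
total_runs = num_c * num_g * n_folds
print(num_c, num_g, total_runs)

# Every (C, gamma) candidate that such a search would evaluate.
candidates = [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in g_exponents]
```

With the defaults this gives 11 values of C, 10 values of gamma, and 11*10*5 = 550 training runs, which is why narrowing the ranges speeds up the search considerably.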
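Training and Prediction with LIBSVM
from svmutil import *
# load the LIBSVM-format training and test sets
train_y, train_x = svm_read_problem('../FormatData-train')
test_y, test_x = svm_read_problem('../FormatData-test')
# train with the chosen C and gamma, e.g. the values found by the parameter search
model = svm_train(train_y, train_x, '-c 112.0 -g 0.000125')
# p_acc is (accuracy, mean squared error, squared correlation coefficient)
p_label, p_acc, p_val = svm_predict(test_y, test_x, model)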
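svm_predict reports the overall accuracy. For a three-class task it can also help to see which age groups get confused with each other. The sketch below counts (true label, predicted label) pairs in plain Python. The helper name confusion_counts and the label lists are made up for illustration; in practice the lists would be test_y and p_label.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Made-up labels standing in for test_y and p_label.
true_y = [-1, -1, 0, 0, 1, 1]
pred_y = [-1, 0, 0, 0, 1, -1]

counts = confusion_counts(true_y, pred_y)
correct = sum(n for (t, p), n in counts.items() if t == p)
accuracy = 100.0 * correct / len(true_y)

for (t, p), n in sorted(counts.items()):
    print("true %2d predicted %2d: %d" % (t, p, n))
print("accuracy: %.2f%%" % accuracy)
```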
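Classification with scikit-learn
scikit-learn is a third-party Python library.
It offers many classification methods and is simple to call, but it assumes some prior knowledge of the classification methods, Python, and numpy.
LDA/PLDA/PCA Processing
scikit-learn also provides LDA, so the earlier LIBSVM pipeline can be upgraded to: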
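from svmutil import *
from sklearn.datasets import load_svmlight_file
# scikit-learn 0.16 exposed this class as sklearn.lda.LDA; later releases moved it here
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# read the data (mfcc/i-vector) as dense arrays, which LDA requires
train_x, train_y = load_svmlight_file('../FormatData-mfcc-train')
test_x, test_y = load_svmlight_file('../FormatData-mfcc-test', n_features=train_x.shape[1])
train_x, test_x = train_x.toarray(), test_x.toarray()

# LDA keeps at most n_classes - 1 components, so n_components=100 cannot take effect here
clf = LDA(solver='eigen')
train_x2 = clf.fit(train_x, train_y).transform(train_x)
test_x2 = clf.transform(test_x)  # reuse the projection fitted on the training set

model = svm_train(train_y.tolist(), train_x2.tolist(), '-c 8192.0 -g 0.05')
p_label, p_acc, p_val = svm_predict(test_y.tolist(), test_x2.tolist(), model)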
scikit-learn Classifiers
- Algorithms such as GMM/KNN/GBDT are worth trying
- See scikit-learn.user_guide_0.16.1.pdf
- The code below is adapted from the example at http://www.cnblogs.com/nsnow/p/5026673.html:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import time
import pickle
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_svmlight_file
# Multinomial Naive Bayes Classifier
def naive_bayes_classifier(train_x, train_y):
    from sklearn.naive_bayes import MultinomialNB
    model = MultinomialNB(alpha=0.01)
    model.fit(train_x, train_y)
    return model

# KNN Classifier
def knn_classifier(train_x, train_y):
    from sklearn.neighbors import KNeighborsClassifier
    model = KNeighborsClassifier()
    model.fit(train_x, train_y)
    return model

# Logistic Regression Classifier
def logistic_regression_classifier(train_x, train_y):
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(penalty='l2')
    model.fit(train_x, train_y)
    return model

# Random Forest Classifier
def random_forest_classifier(train_x, train_y):
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(n_estimators=8)
    model.fit(train_x, train_y)
    return model

# Decision Tree Classifier
def decision_tree_classifier(train_x, train_y):
    from sklearn import tree
    model = tree.DecisionTreeClassifier()
    model.fit(train_x, train_y)
    return model

# GBDT(Gradient Boosting Decision Tree) Classifier
def gradient_boosting_classifier(train_x, train_y):
    from sklearn.ensemble import GradientBoostingClassifier
    model = GradientBoostingClassifier(n_estimators=200)
    model.fit(train_x, train_y)
    return model

# SVM Classifier
def svm_classifier(train_x, train_y):
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    model.fit(train_x, train_y)
    return model

# SVM Classifier using cross validation
def svm_cross_validation(train_x, train_y):
    # scikit-learn 0.16 had this in sklearn.grid_search; it moved to sklearn.model_selection in 0.18
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    model = SVC(kernel='rbf', probability=True)
    param_grid = {'C': [1e-3, 1e-2, 1e-1, 1, 10, 100, 1000], 'gamma': [0.001, 0.0001]}
    grid_search = GridSearchCV(model, param_grid, n_jobs=1, verbose=1)
    grid_search.fit(train_x, train_y)
    best_parameters = grid_search.best_estimator_.get_params()
    for para, val in best_parameters.items():
        print('%s %s' % (para, val))
    # retrain on the full training set with the best C and gamma
    model = SVC(kernel='rbf', C=best_parameters['C'], gamma=best_parameters['gamma'], probability=True)
    model.fit(train_x, train_y)
    return model

def read_data(data_file):
    # data_file is a prefix: the LIBSVM-format files are <data_file>-train and <data_file>-test.
    # eval() cannot parse "index:value" tokens, so load_svmlight_file is used instead.
    train_x, train_y = load_svmlight_file(data_file + "-train")
    # n_features keeps the test matrix the same width as the training matrix
    test_x, test_y = load_svmlight_file(data_file + "-test", n_features=train_x.shape[1])
    # return dense arrays for the classifiers
    return train_x.toarray(), train_y, test_x.toarray(), test_y

if __name__ == '__main__':
    data_file = "./data/FormatData-mfcc"
    thresh = 0.5
    model_save_file = None
    model_save = {}
    test_classifiers = ['KNN', 'LR', 'RF', 'DT', 'SVM', 'GBDT']
    classifiers = {#'NB':naive_bayes_classifier,
                   'KNN':knn_classifier,
                   'LR':logistic_regression_classifier,
                   'RF':random_forest_classifier,
                   'DT':decision_tree_classifier,
                   'SVM':svm_classifier,
                   'SVMCV':svm_cross_validation,
                   'GBDT':gradient_boosting_classifier
                   }
    print('reading training and testing data...')
    train_x, train_y, test_x, test_y = read_data(data_file)
    num_train, num_feat = train_x.shape
    num_test, num_feat = test_x.shape
    is_binary_class = (len(np.unique(train_y)) == 2)
    print(is_binary_class)
    print('******************** Data Info *********************')
    print('#training data: %d, #testing_data: %d, dimension: %d' % (num_train, num_test, num_feat))
    for classifier in test_classifiers:
        print('******************* %s ********************' % classifier)
        start_time = time.time()
        model = classifiers[classifier](train_x, train_y)
        print('training took %fs!' % (time.time() - start_time))
        predict = model.predict(test_x)
        if model_save_file is not None:
            model_save[classifier] = model
        # precision and recall are only reported for the two-class task
        if is_binary_class:
            precision = metrics.precision_score(test_y, predict)
            recall = metrics.recall_score(test_y, predict)
            print('precision: %.2f%%, recall: %.2f%%' % (100 * precision, 100 * recall))
        accuracy = metrics.accuracy_score(test_y, predict)
        print('accuracy: %.2f%%' % (100 * accuracy))
    # save all trained models once the loop has finished
    if model_save_file is not None:
        with open(model_save_file, 'wb') as fout:
            pickle.dump(model_save, fout)
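The script above prints precision and recall only in the two-class case. For the three-class labeling, per-class values can be computed as in the sketch below. It is written in plain Python so that the arithmetic is visible. The function name per_class_precision_recall is hypothetical and the label lists are made up.

```python
def per_class_precision_recall(true_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    result = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        predicted = sum(1 for _, p in pairs if p == label)
        actual = sum(1 for t, _ in pairs if t == label)
        precision = float(tp) / predicted if predicted else 0.0
        recall = float(tp) / actual if actual else 0.0
        result[label] = (precision, recall)
    return result

# Made-up labels standing in for test_y and the model's predictions.
true_y = [-1, -1, 0, 0, 0, 1]
pred_y = [-1, 0, 0, 0, 1, 1]

scores = per_class_precision_recall(true_y, pred_y)
for label, (precision, recall) in sorted(scores.items()):
    print("class %2d  precision: %.2f%%, recall: %.2f%%" % (label, 100 * precision, 100 * recall))
```

scikit-learn's metrics.classification_report produces a comparable per-class summary directly from the label arrays.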