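Torch Notes (5): Multi-class Classification with a DNN
The previous post covered the two ways of training a DNN in torch. Now it's time to put them to work: we'll use torch's components to tackle a common machine learning task, multi-class classification.
We pick a classic three-class dataset, available at http://mlr.cs.umass.edu/ml/datasets/Wine. It is a wine dataset with 178 samples, 13 attributes, and 3 classes; there are no missing values, and all 13 attributes are continuous. The first column is the wine class label (1, 2, or 3), and columns 2 through 14 hold the attribute values, such as alcohol, malic acid, ash, and so on. Here is a preview of a few samples: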
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
1,14.2,1.76,2.45,15.2,112,3.27,3.39,.34,1.97,6.75,1.05,2.85,1450
1,14.39,1.87,2.45,14.6,96,2.5,2.52,.3,1.98,5.25,1.02,3.58,1290
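1,14.06,2.15,2.61,17.6,121,2.6,2.51,.31,1.25,5.05,1.06,3.58,1295
OK, now let's train a DNN on this dataset so that, given only the 13 attribute values of a wine, it can predict the wine's class (1, 2, or 3).
First we build the network. The dataset is simple, so the network is simple too.
require 'nn'
-- A network with two hidden layers. How many layers to use, and how many neurons
-- per layer, usually has to be found by experiment; the best-performing structure
-- often takes many trials. I don't know a quick way to pick a good structure yet,
-- so if any expert does, please tell me, I'd be very grateful.
mlp = nn.Sequential()
mlp:add(nn.Linear(13, 13)) -- the input layer has 13 neurons
mlp:add(nn.ReLU())
mlp:add(nn.Linear(13, 13))
mlp:add(nn.ReLU())
mlp:add(nn.Linear(13, 3)) -- the output layer has 3 neurons (three classes)
mlp:add(nn.LogSoftMax()) -- ClassNLLCriterion expects log-probabilities, so use LogSoftMax rather than SoftMax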
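Define the criterion (loss function) designed for multi-class classification. It takes log-probabilities as input and a class index as target.
criterion = nn.ClassNLLCriterion()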
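To make the pairing of a log-softmax output with a negative log-likelihood criterion concrete, here is a minimal pure-Lua sketch of the arithmetic on plain tables. It does not use torch at all, and the function names `log_softmax` and `nll` are made up for this illustration; it only shows what such a loss computes for one sample.

```lua
-- Pure-Lua illustration; log_softmax and nll are hypothetical helper names.

-- Numerically stable log-softmax over a table of raw scores.
function log_softmax(scores)
  local max = scores[1]
  for i = 2, #scores do
    if scores[i] > max then max = scores[i] end
  end
  local sum = 0
  for i = 1, #scores do
    sum = sum + math.exp(scores[i] - max)
  end
  local log_sum = max + math.log(sum)
  local out = {}
  for i = 1, #scores do
    out[i] = scores[i] - log_sum
  end
  return out
end

-- Negative log-likelihood of the target class (1-based class index).
function nll(log_probs, target)
  return -log_probs[target]
end

local log_probs = log_softmax({2.0, 1.0, 0.1})
print(nll(log_probs, 1)) -- small loss: class 1 has the highest score
print(nll(log_probs, 3)) -- larger loss: class 3 has the lowest score
```

The loss is simply the negated log-probability of the true class, which is why the last layer of the network has to output log-probabilities rather than plain probabilities.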
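Using the training methods from the previous post, we now write out the training code.
The first method: train with a hand-written for loop.
-- mlp is the network, x is the model input, y is the target label,
-- criterion is the loss function, learningRate is the learning rate
function gradUpdate(mlp, x, y, criterion, learningRate)
  local pred = mlp:forward(x)                        -- forward pass
  local err = criterion:forward(pred, y)             -- loss for this sample
  mlp:zeroGradParameters()                           -- clear accumulated gradients
  local gradCriterion = criterion:backward(pred, y)  -- gradient of the loss w.r.t. the prediction
  mlp:backward(x, gradCriterion)                     -- backpropagate through the network
  mlp:updateParameters(learningRate)                 -- plain gradient descent step
  return err
end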
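The second method: train with the optim package.
require 'optim'
-- trainset.data is the training set, stored as a table;
-- x and y are the tensors built from trainset.data and trainset.label
w, dl_dw = mlp:getParameters()
config = {
  -- note: as far as I can tell, optim.rprop reads `stepsize` from its config,
  -- while `learningRate` is what optimizers such as optim.sgd use
  learningRate = 1e-2,
}
for i = 1,50 do
  for j = 1,#trainset.data do
    input = x[j]
    output = y[j]
    -- feval returns the loss and the gradient at the given parameters
    feval = function(w_new)
      if w ~= w_new then w:copy(w_new) end
      dl_dw:zero()
      pred = mlp:forward(input)
      loss = criterion:forward(pred, output)
      gradCriterion = criterion:backward(pred, output)
      gradInput = mlp:backward(input, gradCriterion)
      return loss, dl_dw
    end
    optim.rprop(feval,w,config)
  end
end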
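OK, the framework is in place; what's left is formatting and loading the data. Since torch is built on lua, the loading has to be written in lua. It is a little tedious at first, but once you get used to it, it only takes a few minutes. Here we focus on the data loading process.
Suppose the downloaded dataset is named "rawdata.txt". We need to parse it line by line. Unfortunately, lua has no built-in string split function like the one in Java and Python, which is a really handy function. No matter: if the library doesn't have it, we just build the wheel ourselves.
Here is the string split function. It works like split in Java and Python and returns a table holding the pieces; the default separator is ','. Empty pieces are skipped.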
function lua_string_split(str, split_char)
  split_char = split_char or ','
  local sub_str_tab = {}
  local i = 0 -- position of the previous separator
  local j = 0 -- position of the next separator
  while true do
    -- the 4th argument (true) makes find treat split_char as plain text, not as a pattern
    j = string.find(str, split_char, i+1, true)
    if j == nil then
      -- no more separators: keep whatever is left, unless it is empty
      local last_str = string.sub(str, i+1, #str)
      if last_str ~= "" then
        table.insert(sub_str_tab, last_str)
      end
      break
    end
    local subtmp = string.sub(str, i+1, j-1)
    if subtmp ~= "" then
      table.insert(sub_str_tab, subtmp)
    end
    i = j
  end
  return sub_str_tab
end
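For comparison, the same splitting can be sketched with `string.gmatch` and a negated character class. This is only an alternative under assumptions, not the function used in the rest of this post: the name `split_fields` is made up, and the sketch assumes the separator is a single character that has no special meaning inside a Lua character class (so ',' and '\n' are fine, while '^', ']', '%' or '-' are not). Like the function above, it skips empty pieces.

```lua
-- Hypothetical alternative to lua_string_split, built on string.gmatch.
-- Assumes sep is one character with no special meaning inside [...].
function split_fields(line, sep)
  sep = sep or ","
  local fields = {}
  -- "[^,]+" matches each maximal run of characters that are not the separator
  for field in string.gmatch(line, "[^" .. sep .. "]+") do
    fields[#fields + 1] = field
  end
  return fields
end

local fields = split_fields("1,14.23,1.71,2.43,15.6,127", ",")
for i = 1, #fields do
  print(i, fields[i])
end
```

The hand-written loop is more general (it accepts any separator string), while the gmatch version is shorter for the simple comma and newline cases needed here.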
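Once every line is parsed, the dataset is loaded. Next comes input normalization. For continuous values this step is pretty much required: the network needs to learn the distribution of the data, and if the attributes differ greatly in scale and are not normalized, the large-valued attributes will throw the model off.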
-- The tensor indexing operator used here is explained in detail in
-- [Torch笔记 (二)快速入门](http://blog.csdn.net/whgyxy/article/details/52204206)
-- Normalize the data in place, column by column
function inputNormlization(data)
  local mean = {}
  local stdv = {}
  for i = 1,data:size(2) do
    -- compute the mean and standard deviation of each column, then normalize it
    mean[i] = data[{{},{i}}]:mean()
    print(string.format("mean[%d]=%f",i,mean[i]))
    data[{{},{i}}]:add(-mean[i])
    stdv[i] = data[{{},{i}}]:std()
    print(string.format("stdv[%d]=%f",i,stdv[i]))
    data[{{},{i}}]:div(stdv[i])
  end
  return mean, stdv -- returned so the caller can inspect or reuse the statistics
end
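One design choice worth sketching: a common practice is to compute the mean and standard deviation on the training set only and then apply those same statistics to the test set, so that both are scaled identically. The pure-Lua sketch below shows the idea on plain nested tables. The helper names `column_stats` and `apply_zscore` are made up for this example and are not part of torch.

```lua
-- Hypothetical helpers working on plain Lua tables (rows of numbers).

-- Per-column mean and sample standard deviation (n-1 in the denominator).
function column_stats(rows)
  local n, d = #rows, #rows[1]
  local mean, std = {}, {}
  for j = 1, d do
    local sum = 0
    for i = 1, n do sum = sum + rows[i][j] end
    mean[j] = sum / n
    local sq = 0
    for i = 1, n do
      local diff = rows[i][j] - mean[j]
      sq = sq + diff * diff
    end
    std[j] = math.sqrt(sq / (n - 1))
  end
  return mean, std
end

-- Return a normalized copy of rows using the given statistics.
function apply_zscore(rows, mean, std)
  local out = {}
  for i = 1, #rows do
    out[i] = {}
    for j = 1, #rows[i] do
      out[i][j] = (rows[i][j] - mean[j]) / std[j]
    end
  end
  return out
end

local train = {{1, 10}, {2, 20}, {3, 30}}
local test = {{4, 40}}
local mean, std = column_stats(train)          -- statistics from the training set only
local test_z = apply_zscore(test, mean, std)   -- the same statistics applied to the test set
print(test_z[1][1], test_z[1][2])
```

With tensors, the same idea would mean keeping the values returned for the training data and reusing them on the test data instead of recomputing them there.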
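OK, with those two functions understood, let's go through the data loading from the start. First we parse every line of the dataset to get each sample's attribute values and class label. After reading the whole file we have two tables: one holds the attribute values of the dataset, the other holds its labels.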
function getData(filepath)
  local file = torch.DiskFile(filepath,"r")
  local strfile = file:readString("*a") -- read the whole file into one string
  file:close()
  local lines = lua_string_split(strfile,"\n")
  local label = {} -- labels of all samples
  local data = {} -- attribute values of all samples
  for i = 1,#lines do
    local record = {}
    local elements = lua_string_split(lines[i],",")
    -- columns 2..14 are the attributes, column 1 is the label
    for j = 2,#elements do
      table.insert(record,tonumber(elements[j]))
    end
    table.insert(label,tonumber(elements[1]))
    table.insert(data,record)
  end
  print(#data)
  return data,label
end
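To check the expected shapes without touching the disk, the parsing step can be tried on a couple of the sample rows shown earlier. The sketch below is pure Lua and does not use `torch.DiskFile`; the function name `parse_wine_records` is made up for this illustration, and it assumes every field in the text is numeric.

```lua
-- Hypothetical stand-in for getData that parses text already held in memory.
-- Assumes every comma-separated field is a number.
function parse_wine_records(text)
  local data, label = {}, {}
  for line in string.gmatch(text, "[^\r\n]+") do   -- one record per non-empty line
    local fields = {}
    for field in string.gmatch(line, "[^,]+") do
      fields[#fields + 1] = tonumber(field)
    end
    label[#label + 1] = fields[1]                  -- first column is the class label
    local record = {}
    for j = 2, #fields do                          -- remaining columns are the attributes
      record[#record + 1] = fields[j]
    end
    data[#data + 1] = record
  end
  return data, label
end

local sample = "1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065\n" ..
               "1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050\n"
local data, label = parse_wine_records(sample)
print(#data, #data[1], label[1]) -- number of records, attributes per record, first label
```

A quick check like this catches off-by-one mistakes between the label column and the attribute columns before any training starts.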
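What we have now is the whole dataset. To evaluate the model we need cross-validation, so the dataset is split into n groups: n-1 groups form the training set and the remaining group is the test set. If the groups are numbered 1, 2, ..., n, the first round uses group 1 as the test set and the others for training, the second round uses group 2 as the test set and the others for training, and so on, n times in total. While grouping, we also shuffle the data, that is, randomize the order of the samples so that the input order does not influence the model.
Let's split the dataset into groups first.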
-- data: the dataset obtained above
-- label: the labels of the samples obtained above
-- num: the number of groups n
-- Returns a table batch holding the data of every group,
-- and a table label_b holding the labels of every group
function splitDataSet(data,label,num)
  local samplenum = #data
  local batch = {}
  local label_b = {}
  for i = 1,num do
    batch[i] = {}
    label_b[i] = {}
  end
  local batchsize = math.ceil(samplenum / num) -- maximum number of samples per group
  local flag = {} -- used for shuffling: marks whether a sample has been assigned to a group
  for i = 1,samplenum do
    flag[i] = 0 -- nothing is assigned at the start
  end
  math.randomseed(os.time())
  local assigned = 0
  local isbreak = 0
  -- Fill the groups in turn, starting from the first; each holds at most batchsize samples
  while true do
    for batch_index = 1,num do
      local index = math.random(samplenum) -- draw a random sample index
      -- use it if the sample is still unassigned and the current group is not full
      if flag[index] == 0 and #(batch[batch_index]) < batchsize then
        table.insert(batch[batch_index], data[index])
        table.insert(label_b[batch_index], label[index])
        flag[index] = 1
        assigned = assigned + 1
        if assigned == samplenum then -- every sample has been assigned
          isbreak = 1
          break
        end
      end
    end
    if isbreak == 1 then break end
  end
  return batch,label_b
end
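The function above shuffles by repeatedly drawing random indices and rejecting the ones already used. A different technique, sketched here only for comparison, is to shuffle the index list once with a Fisher-Yates shuffle and then deal the indices out to the groups in turn. The names `shuffled_indices` and `split_into_folds` are made up for this example, and the sketch returns sample indices rather than the samples themselves.

```lua
-- Hypothetical alternative: Fisher-Yates shuffle, then round-robin assignment.

-- Return the indices 1..n in random order.
function shuffled_indices(n)
  local idx = {}
  for i = 1, n do idx[i] = i end
  for i = n, 2, -1 do
    local j = math.random(i)            -- uniform integer in [1, i]
    idx[i], idx[j] = idx[j], idx[i]     -- swap positions i and j
  end
  return idx
end

-- Split the indices 1..n into k groups whose sizes differ by at most one.
function split_into_folds(n, k)
  local idx = shuffled_indices(n)
  local folds = {}
  for f = 1, k do folds[f] = {} end
  for pos = 1, n do
    local f = (pos - 1) % k + 1         -- deal indices out to the groups in turn
    table.insert(folds[f], idx[pos])
  end
  return folds
end

math.randomseed(os.time())
local folds = split_into_folds(178, 6)
for f = 1, #folds do
  print(f, #folds[f])
end
```

This version always finishes after exactly n steps, while the rejection approach needs more and more draws as the unassigned samples run out.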
Then, from the data groups, the label groups, and a group index, we build the training set and the test set.
-- Use group number index as the test set and merge the remaining groups into the training set
function generateTrain_Test(batch,label_b,index)
  local trainset = {data = {},label = {}}
  local testset = {data = {},label = {}}
  testset.data = batch[index]
  testset.label = label_b[index]
  local tmpdata = {}
  local tmplabel = {}
  local tmpi = 1
  for i = 1,#batch do
    if i ~= index then
      for j = 1,#(batch[i]) do
        tmpdata[tmpi] = batch[i][j]
        tmplabel[tmpi] = label_b[i][j]
        tmpi = tmpi + 1
      end
    end
  end
  trainset.data = tmpdata
  trainset.label = tmplabel
  return trainset,testset
end
Now let's put the data loading steps together.
-- filepath: path of the dataset file
-- batchsize: despite its name, the number of groups to split the data into
-- testbatch_index: index of the group used as the test set
-- Returns the training set and the test set
function load_data(filepath,batchsize,testbatch_index)
  local rawdata,rawlabel = getData(filepath)
  local batch,label_b = splitDataSet(rawdata,rawlabel,batchsize)
  -- Generate the trainset and testset
  local trainset,testset = generateTrain_Test(batch,label_b,testbatch_index)
  return trainset,testset
end
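`load_data` picks a single group as the test set, whereas the cross-validation described earlier rotates through all n groups. A minimal pure-Lua sketch of that rotation is given below. It works on groups of sample indices, the name `train_test_indices` is made up for this example, and the training and evaluation themselves are left as a comment.

```lua
-- Hypothetical helper: given k groups of sample indices, return the indices
-- used for training and for testing when group test_fold is held out.
function train_test_indices(folds, test_fold)
  local train, test = {}, {}
  for f = 1, #folds do
    local target = (f == test_fold) and test or train
    for _, i in ipairs(folds[f]) do
      target[#target + 1] = i
    end
  end
  return train, test
end

local folds = {{1, 2}, {3, 4}, {5, 6}}
for k = 1, #folds do
  local train, test = train_test_indices(folds, k)
  -- A real run would train a fresh model on `train` and evaluate it on `test` here,
  -- then average the accuracies over all rounds.
  print(string.format("round %d: %d train, %d test", k, #train, #test))
end
```

Each sample ends up in the test set exactly once, so the averaged accuracy uses every sample for evaluation.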
Once training is done, we need to evaluate how well the model predicts. Since this is classification, we measure it by classification accuracy.
-- model: the trained model
-- data: attribute values of the test set; label: labels of the test set
function Test_MultiClass(model,data,label)
  local pred = model:forward(data)
  print("pred")
  print(pred)
  -- Sort each row of predictions in descending order;
  -- the top-scoring class is taken as the predicted class of that sample
  local tmp,index = torch.sort(pred,2,true)
  print(tmp)
  print(index)
  local correct = 0
  print("comp")
  -- Compare the true and predicted class of every sample to compute the accuracy
  for i = 1,pred:size(1) do -- label[i] is a Tensor object, but index[i][1] is a number
    -- if label[i]:eq(index[i][1]):all() then -- we could also write it like this
    if torch.eq(label[i],index[i][1])[1] == 1 then
      correct = correct + 1
    end
  end
  print("correct")
  print(correct)
  local correctRate = correct * 1.0 / pred:size(1)
  print(string.format("correct is %f",correctRate))
  -- save the model passed in (not a global) when it is accurate enough
  if correctRate > 0.98 then torch.save('model',model) end
  for i = 1,pred:size(1) do
    print(pred[i][1],pred[i][2],pred[i][3],label[i])
  end
end
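The accuracy computation boils down to an argmax per row followed by a comparison with the true label. The pure-Lua sketch below shows that logic on plain tables, using made-up scores and labels purely for illustration. The names `argmax` and `accuracy` are hypothetical helpers, not torch functions.

```lua
-- Hypothetical helpers illustrating the accuracy computation on plain tables.

-- Index of the largest value in a row (first one wins on ties).
function argmax(row)
  local best = 1
  for j = 2, #row do
    if row[j] > row[best] then best = j end
  end
  return best
end

-- Fraction of rows whose argmax equals the true label.
function accuracy(pred, labels)
  local correct = 0
  for i = 1, #pred do
    if argmax(pred[i]) == labels[i] then
      correct = correct + 1
    end
  end
  return correct / #pred
end

-- Made-up scores for three samples and three classes.
local pred = {
  {-0.1, -2.5, -3.0},
  {-1.9, -0.3, -2.2},
  {-1.2, -0.9, -1.3},
}
local labels = {1, 2, 3}
print(accuracy(pred, labels))
```

Since only the position of the largest score matters, the result is the same whether the rows hold probabilities or log-probabilities.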
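OK, that's everything; now we can start training and play with the DNN. The data processing turned out to be far more work than the model itself, don't you think? Still, once the data is ready, tuning the model also takes patience.
The complete code for this section is as follows.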
require 'torch'
require 'nn'
require 'optim'
-------------------------------------------------------------------------------------
-- Split a string
-- str: the string to be split
-- split_char: the separator, default is ','
-------------------------------------------------------------------------------------
function lua_string_split(str, split_char)
  split_char = split_char or ','
  local sub_str_tab = {}
  local i = 0
  local j = 0
  while true do
    -- the 4th argument (true) makes find treat split_char as plain text, not as a pattern
    j = string.find(str, split_char, i+1, true)
    if j == nil then
      local last_str = string.sub(str, i+1, #str)
      if last_str ~= "" then
        table.insert(sub_str_tab, last_str)
      end
      break
    end
    local subtmp = string.sub(str, i+1, j-1)
    if subtmp ~= "" then
      table.insert(sub_str_tab, subtmp)
    end
    i = j
  end
  return sub_str_tab
end
-------------------------------------------------------------------------------------
-- Normalize the input data in place
-- data: the input data
-- Each feature has its mean subtracted and is divided by its standard deviation
-------------------------------------------------------------------------------------
function inputNormlization(data)
  local mean = {}
  local stdv = {}
  for i = 1,data:size(2) do
    mean[i] = data[{{},{i}}]:mean()
    print(string.format("mean[%d]=%f",i,mean[i]))
    data[{{},{i}}]:add(-mean[i])
    stdv[i] = data[{{},{i}}]:std()
    print(string.format("stdv[%d]=%f",i,stdv[i]))
    data[{{},{i}}]:div(stdv[i])
  end
  return mean, stdv
end
function Test_MultiClass(model,data,label)
  local pred = model:forward(data)
  print("pred")
  print(pred)
  local tmp,index = torch.sort(pred,2,true)
  print(tmp)
  print(index)
  local correct = 0
  print("comp")
  for i = 1,pred:size(1) do -- label[i] is a Tensor object, but index[i][1] is a number
    -- if label[i]:eq(index[i][1]):all() then -- we could also write it like this
    if torch.eq(label[i],index[i][1])[1] == 1 then
      correct = correct + 1
    end
  end
  print("correct")
  print(correct)
  local correctRate = correct * 1.0 / pred:size(1)
  print(string.format("correct is %f",correctRate))
  if correctRate > 0.98 then torch.save('model',model) end
  for i = 1,pred:size(1) do
    print(pred[i][1],pred[i][2],pred[i][3],label[i])
  end
end
-------------------------------------------------------------------------------------
-- Get the raw data and the raw labels
-- filepath: the path of the file
-- Returns a table of raw data and a table of its labels
-- This function is meant for training data that has labels
-------------------------------------------------------------------------------------
function getData(filepath)
  local file = torch.DiskFile(filepath,"r")
  local strfile = file:readString("*a")
  file:close()
  local lines = lua_string_split(strfile,"\n")
  local label = {}
  local data = {}
  for i = 1,#lines do
    local record = {}
    local elements = lua_string_split(lines[i],",")
    for j = 2,#elements do
      table.insert(record,tonumber(elements[j]))
    end
    table.insert(label,tonumber(elements[1]))
    table.insert(data,record)
  end
  print(#data)
  return data,label
end
-------------------------------------------------------------------------------------
-- Split DataSet
-- data: raw data
-- label: the labels of the raw data
-- num: split the data into num batches
-- This function splits the raw data into num batches, used for selecting the
-- train data and the validation data
-------------------------------------------------------------------------------------
function splitDataSet(data,label,num)
  local samplenum = #data
  local batch = {}
  local label_b = {}
  for i = 1,num do
    batch[i] = {}
    label_b[i] = {}
  end
  local batchsize = math.ceil(samplenum / num)
  local flag = {}
  for i = 1,samplenum do
    flag[i] = 0
  end
  math.randomseed(os.time())
  local assigned = 0
  local isbreak = 0
  while true do
    for batch_index = 1,num do
      local index = math.random(samplenum)
      if flag[index] == 0 and #(batch[batch_index]) < batchsize then
        table.insert(batch[batch_index], data[index])
        table.insert(label_b[batch_index], label[index])
        flag[index] = 1
        assigned = assigned + 1
        if assigned == samplenum then
          isbreak = 1
          break
        end
      end
    end
    if isbreak == 1 then break end
  end
  return batch,label_b
end
-------------------------------------------------------------------------------------
-- Generate Train and Test data
-- batch: the batches of data
-- label_b: the batches of labels
-- index: the index of the batch that will be selected as test data
-------------------------------------------------------------------------------------
function generateTrain_Test(batch,label_b,index)
  local trainset = {data = {},label = {}}
  local testset = {data = {},label = {}}
  testset.data = batch[index]
  testset.label = label_b[index]
  local tmpdata = {}
  local tmplabel = {}
  local tmpi = 1
  for i = 1,#batch do
    if i ~= index then
      for j = 1,#(batch[i]) do
        tmpdata[tmpi] = batch[i][j]
        tmplabel[tmpi] = label_b[i][j]
        tmpi = tmpi + 1
      end
    end
  end
  trainset.data = tmpdata
  trainset.label = tmplabel
  return trainset,testset
end
-------------------------------------------------------------------------------------
-- Load the data
-- filepath: the path of the dataset
-- batchsize: the number of batches to split the data into
-- testbatch_index: the index of the batch that will be selected as test data
-- Returns trainset and testset
-------------------------------------------------------------------------------------
function load_data(filepath,batchsize,testbatch_index)
  local rawdata,rawlabel = getData(filepath)
  local batch,label_b = splitDataSet(rawdata,rawlabel,batchsize)
  -- Generate the trainset and testset
  local trainset,testset = generateTrain_Test(batch,label_b,testbatch_index)
  return trainset,testset
end
trainset,testset = load_data('./datasets/rawdata.txt',6,3)
x = torch.Tensor(trainset.data)
y = torch.Tensor(trainset.label)
x = x:reshape(#trainset.data,13)
y = y:reshape(#trainset.label,1)
inputNormlization(x)
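-- Build the network: 13 inputs, two hidden layers of 13 units, 3 outputs
function init_model()
  local net = nn.Sequential()
  net:add(nn.Linear(13, 13))
  net:add(nn.ReLU())
  net:add(nn.Linear(13, 13))
  net:add(nn.ReLU())
  net:add(nn.Linear(13, 3))
  net:add(nn.LogSoftMax()) -- ClassNLLCriterion expects log-probabilities
  return net
end
mlp = init_model()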
-- The criterion for multi-class classification
criterion = nn.ClassNLLCriterion()
-- Get the flattened parameters and the gradients of the parameters
w, dl_dw = mlp:getParameters()
config = {
  learningRate = 1e-2,
}
-- One way to train the model:
-- use the function optim.rprop, which can be fast.
-- Its first argument is a function that returns the loss and the gradient, the second
-- is the parameter vector to optimize, and the third is a config table
--for i = 1,50 do
-- for j = 1,#trainset.data do
-- input = x[j]
-- output = y[j]
-- feval = function(w_new)
-- if w ~= w_new then w:copy(w_new) end
-- dl_dw:zero()
-- pred = mlp:forward(input)
-- loss = criterion:forward(pred, output)
---- print("loss is ")
---- print(loss)
-- gradCriterion = criterion:backward(pred, output)
---- print("gradCriterion")
---- print(gradCriterion)
-- gradInput = mlp:backward(input, gradCriterion)
---- print("gradInput")
---- print(gradInput)
-- return loss, dl_dw
-- end
---- print("the w is :")
---- print(w)
---- print("the dl_dw is ")
---- print(dl_dw)
-- optim.rprop(feval,w,config)
-- end
--end
-- The other way to train the model is to update the parameters after each sample
function gradUpdate(mlp, x, y, criterion, learningRate)
  local pred = mlp:forward(x)
  print("pred")
  print(pred)
  local err = criterion:forward(pred, y)
  mlp:zeroGradParameters()
  local gradCriterion = criterion:backward(pred, y)
  print("grad of loss")
  print(gradCriterion)
  mlp:backward(x, gradCriterion)
  mlp:updateParameters(learningRate)
end
for i = 1,40 do
  for j = 1,#trainset.data do
    input = x[j]
    output = y[j]
    print("train data")
    print(x[j])
    print(y[j])
    gradUpdate(mlp, input, output, criterion, 0.01)
  end
end
x = torch.Tensor(testset.data)
y = torch.Tensor(testset.label)
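-- The dataset is a table and has to be converted into tensors that torch accepts
x = x:reshape(#testset.data,13)
y = y:reshape(#testset.label,1)
inputNormlization(x) -- normalize the test data (note: this uses the test set's own mean and std)
Test_MultiClass(mlp,x,y) -- evaluate the model and compute the classification accuracy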