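I. Problem:
Target: 3 classes of iris: setosa, versicolor, virginica
Features: 4, listed in the column order of the data: sepal length, sepal width, petal length, petal width
Classify the iris samples with a decision tree.
Samples: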
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica
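II. Code implementation:
Program outline: import the samples -> split each sample into feature data and label -> collect the distinct labels -> build the decision tree -> compute the accuracy -> plot the decision tree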
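The import and split steps of the outline can also be done in one pass with textscan. This is a minimal sketch, not the program's own method: it writes two sample rows to a temporary file so that it runs on its own, and the variable names fname and C are assumptions introduced here.

```matlab
% Write two rows in the iris.data format to a temporary file (illustration only).
fname = [tempname, '.data'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '5.1,3.5,1.4,0.2,Iris-setosa', '7.0,3.2,4.7,1.4,Iris-versicolor');
fclose(fid);

% Read four numeric columns and one string column, separated by commas.
fid = fopen(fname, 'r');
C = textscan(fid, '%f%f%f%f%s', 'Delimiter', ',');
fclose(fid);
delete(fname);

data = [C{1:4}];   % numeric feature matrix, one row per sample
label = C{5};      % cell array of label strings, one per sample
```

Note that label is a cell array here, whereas the program below keeps labels in a padded char matrix; char(label) would convert it.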
1. Main program:
%% Process the data
sample=importdata('iris.data');
label=[];
data=[];
% Split every line into its numeric part (data) and its label string (label)
for i=1:150
    ch=cell2mat(sample(i));
    % disp(ch);
    data=strvcat(data,ch(1:fen_label(ch)-1));
    label=strvcat(label,ch(fen_label(ch)+1:length(ch)));
end
data=str2num(data);
% Process the labels: collect the distinct labels and count them
alabel=[];
label_n=0;
mlabel=label;
while true
    if size(mlabel,1)==0
        break;
    end
    alabel=strvcat(alabel,mlabel(1,:));
    label_n=label_n+1;
    mid_index=find_label(mlabel,alabel(label_n,:));
    mlabel(mid_index,:)=[];
end
% data: feature matrix  label: label of each sample  alabel: distinct labels  label_n: number of distinct labels
%% Build the tree
attri=strvcat('sepal length(cm)','sepal width(cm)','petal length(cm)','petal width(cm)');
feat_index=1:size(data,2);
tree=creat_tree(data,label,feat_index,attri);
%% Compute the accuracy (on the same samples that were used to build the tree)
tcount=0;
for i=1:size(data,1)
    mtree=tree;
    while true
        % A node whose left and right fields are both 'null' is a leaf
        if strcmp(mtree.left,'null')==1 && strcmp(mtree.right,'null')==1
            if strcmp(mtree.value,label(i,:))==1
                tcount=tcount+1;
            end
            break;
        end
        % add_value = [feature index, threshold]
        if data(i,mtree.add_value(1))<mtree.add_value(2)
            mtree=mtree.left;
        elseif data(i,mtree.add_value(1))>=mtree.add_value(2)
            mtree=mtree.right;
        end
    end
end
accuracy=tcount/size(data,1);
%% Plot the tree
A=cell(1);
[A,i]=prev(tree,A,1,0);
print_tree(A,accuracy);
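The accuracy in the main program is measured on the same 150 samples that were used to build the tree. To estimate accuracy on unseen samples, one option is a stratified hold-out split. The sketch below only builds the index sets. The 70/30 ratio and the names train_frac, train_idx and test_idx are assumptions, not part of the original program.

```matlab
% Label list with 50 samples per class, matching the layout of iris.data.
labels = [repmat({'Iris-setosa'}, 50, 1); ...
          repmat({'Iris-versicolor'}, 50, 1); ...
          repmat({'Iris-virginica'}, 50, 1)];
[~, ~, ids] = unique(labels);   % integer class id for every sample
ids = ids(:);

train_frac = 0.7;               % assumed share of each class used for training
train_idx = [];
test_idx = [];
for c = 1:max(ids)
    members = find(ids == c);                     % rows that belong to class c
    members = members(randperm(numel(members)));  % shuffle within the class
    n_train = round(train_frac * numel(members));
    train_idx = [train_idx; members(1:n_train)];
    test_idx = [test_idx; members(n_train+1:end)];
end
```

With these indices, the tree would be built from data(train_idx,:) and label(train_idx,:), and the accuracy loop would run over test_idx only.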
2. Separating the data and the labels
function index = fen_label(label)
% Return the index of the last comma in the line, which separates the data from the label
for i=length(label):-1:1
    if label(i)==','
        index=i;
        break;
    end
end
end
function label_index = find_label(label_mat,label_obj)
% Return the row indices of label_mat whose label equals label_obj
label_index=[];
for i=1:size(label_mat,1)
    if strcmp(label_mat(i,:),label_obj)
        label_index=[label_index,i];
    end
end
end
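The two helpers above can also be expressed with built-in functions. The following is a minimal sketch under that assumption: find with the 'last' option locates the same comma as fen_label, and the third output of unique gives a class code per row, which covers what find_label is used for. The names rows, feats, names and ids are introduced here for illustration.

```matlab
% Three sample rows in the same format as iris.data.
rows = {'5.1,3.5,1.4,0.2,Iris-setosa'; ...
        '7.0,3.2,4.7,1.4,Iris-versicolor'; ...
        '4.9,3.0,1.4,0.2,Iris-setosa'};
labels = cell(numel(rows), 1);
feats = zeros(numel(rows), 4);
for k = 1:numel(rows)
    ch = rows{k};
    cut = find(ch == ',', 1, 'last');   % index of the last comma
    feats(k, :) = str2double(strsplit(ch(1:cut-1), ','));
    labels{k} = ch(cut+1:end);
end
[names, ~, ids] = unique(labels);       % distinct labels and a class code per row
ids = ids(:);
```

Here find(ids == 1) returns the rows labelled names{1}, the same information find_label produces for one label.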
3. Building the decision tree
function tree=creat_tree(data,label,feat_index,attri)
% Build the tree recursively
tree=struct('value','null','add_value','null','left','null','right','null');
if size(label,1)==length(find_label(label,label(1,:))) % all samples share one label
    % disp('1');
    tree.value=label(1,:);
    return;
end
if isempty(feat_index) % no attributes left
    % disp('1');
    tree.value=find_mlabel(label);
    return;
end
mid1=1;
for i=feat_index
    mid1=mid1*length(unique(data(:,i))); % number of distinct values in column i
end
% if length(unique(data(feat_index,:)))<=length(feat_index) % samples take the same value on every attribute
%     tree.value=find_mlabel(label);
%     return;
% end
if mid1==1 % samples take the same value on every remaining attribute
    % disp('1');
    tree.value=find_mlabel(label);
    return;
end
[bestfeat,T_fen] = choose_feat(data,feat_index,label);
if bestfeat==0 % no split has a positive gain: return a majority-label leaf
    tree.value=find_mlabel(label);
    return;
end
tree.value=strvcat(attri(bestfeat,:),num2str(T_fen));
tree.add_value=[bestfeat,T_fen];
feat_index(feat_index==bestfeat)=[];
index1=find(data(:,bestfeat)<T_fen);
index2=find(data(:,bestfeat)>=T_fen);
tree.left=creat_tree(data(index1,:),label(index1,:),feat_index,attri);
tree.right=creat_tree(data(index2,:),label(index2,:),feat_index,attri);
end
function [bestfeat,T_fen] = choose_feat(data,feat_index,label)
% Choose the best attribute to split on
% feat_index: attributes to compare
% data: sample data  label: labels of the samples
% T_fen: best binary split point of the chosen attribute
% bestfeat: index of the best attribute (0 if no split has a positive gain)
best_gain=0;
bestfeat=0;
T_fen=0; % defined even if no split improves on best_gain
% T_fen is the split point at which the attribute reaches its maximum gain
for i=feat_index
    mmat=unique(data(:,i)); % sorted distinct values of attribute i
    if length(mmat)-1==0
        T_mat=mmat(1);
    else
        T_mat=ones(1,length(mmat)-1); % candidate split points
        for j=1:length(mmat)-1
            T_mat(j)=(mmat(j)+mmat(j+1))/2; % midpoint of adjacent values
        end
    end
    for j=T_mat
        Gain=Ent_cal_data(data,label,i,j);
        if Gain>best_gain
            best_gain=Gain;
            bestfeat=i;
            T_fen=j;
        end
    end
end
end
function mo_label = find_mlabel(label)
% Find the most frequent label and return it
mlabel=label; % work on a copy so the input is not modified
max_con=0; % largest label count seen so far
while true
    if size(mlabel,1)==0
        break;
    end
    id1=find_label(mlabel,mlabel(1,:));
    if length(id1)>max_con
        max_con=length(id1);
        % id2=find_label(mlabel,mlabel(1,:));
        mo_label=mlabel(1,:);
    end
    mlabel(id1,:)=[];
end
end
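choose_feat calls Ent_cal_data(data,label,i,j), but that function is not listed in this post. Judging from how its result is used, it returns the information gain of splitting attribute i at threshold j. The following is a sketch of that computation, not the original function. The names info_gain_sketch and class_entropy are hypothetical, and the definition (parent entropy minus the weighted entropy of the two branches) is an assumption.

```matlab
function gain = info_gain_sketch(data, label, feat, T)
% Information gain of the binary split data(:,feat) < T versus >= T.
% label is a padded char matrix with one row per sample, as in the listing.
    [~, ~, ids] = unique(cellstr(label));   % map label rows to integer class ids
    ids = ids(:);
    left = data(:, feat) < T;               % logical mask of the left branch
    n = numel(ids);
    gain = class_entropy(ids) ...
         - sum(left) / n * class_entropy(ids(left)) ...
         - sum(~left) / n * class_entropy(ids(~left));
end

function H = class_entropy(ids)
% Shannon entropy (in bits) of a column vector of integer class ids.
    if isempty(ids)
        H = 0;
        return;
    end
    p = accumarray(ids, 1) / numel(ids);    % class frequencies
    p = p(p > 0);                           % skip classes that do not occur
    H = -sum(p .* log2(p));
end
```

For four samples with values 1, 2, 3, 4 and labels a, a, b, b, the split at 2.5 separates the classes completely and gives a gain of 1 bit, while a split at 0.5 leaves one branch empty and gives a gain of 0.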
4. Plotting the decision tree and displaying the accuracy
function [A,i]=prev(T,A,i,j)
% Traverse the tree in pre-order and build the node list that treeplot needs
% Call with i=1 and j=0
% A is both an input and an output because the recursive calls cannot otherwise share it
if isstruct(T)==1 && (strcmp(T.left,'null')==0 || strcmp(T.right,'null')==0)
    A{i,1}=T.value;
    A{i,2}=j;
    i=i+1;j=i-1;
    % i keeps growing across the recursion, while j (the parent index) stays fixed within each call
    [A,i]=prev(T.left,A,i,j);
    i=i+1;
    [A,i]=prev(T.right,A,i,j);
elseif isstruct(T)==1 && strcmp(T.left,'null')==1 && strcmp(T.right,'null')==1
    A{i,1}=T.value;
    A{i,2}=j;
else
    A{i,1}=T;
    A{i,2}=j;
end
end
function print_tree(A,P)
% Plot the tree
n=size(A,1); % number of nodes (one row of A per node)
nodes=zeros(1,n);
for i=1:n
    nodes(1,i)=A{i,2}; % parent index of node i (0 for the root)
end
treeplot(nodes)
[x,y]=treelayout(nodes);
x=x';
y=y';
%name1=cellstr(num2str((1:count)'));
name=cell(n,1);
for i=1:n
    name{i,1}=A{i,1};
end
for i=1:n
    text(x(i,1),y(i,1),name{i,1})
end
d=num2str(100*P);
s=['Iris decision tree, accuracy: ',d,'%'];
title({s},'FontSize',12);
end
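When no figure window is available, the same tree can be inspected as indented text. The sketch below assumes the node layout produced by creat_tree (fields value, add_value, left, right, with 'null' marking a missing child). The function name tree_to_lines is hypothetical.

```matlab
function lines = tree_to_lines(T, depth)
% Return the tree as a cell array of indented text lines, in pre-order.
% T uses the node layout of creat_tree: value, add_value, left, right.
    if nargin < 2
        depth = 0;
    end
    pad = repmat(' ', 1, 2 * depth);
    if ~isstruct(T.left) && ~isstruct(T.right)
        % Leaf: value holds the class label (strip the strvcat padding).
        lines = {[pad, 'class: ', strtrim(T.value)]};
        return;
    end
    % Internal node: add_value = [feature index, threshold].
    test = sprintf('feature %d < %g', T.add_value(1), T.add_value(2));
    lines = [{[pad, test]}; ...
             tree_to_lines(T.left, depth + 1); ...
             tree_to_lines(T.right, depth + 1)];
end
```

Calling fprintf('%s\n', lines{:}) on the result prints one node per line, with children indented below their parent.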
III. Results