Iris Classification with a Decision Tree

This article describes how to classify the iris dataset with a decision tree. The dataset contains samples of three iris species (setosa, versicolor, virginica), each described by four feature values (sepal length, sepal width, petal length, petal width). The implementation covers data preprocessing, decision-tree construction, and computing the classification accuracy.

I. Problem:

Target values: the 3 iris classes (setosa, versicolor, virginica)

Feature values: 4 per sample (sepal length, sepal width, petal length, petal width, in the column order of the samples below)

Task: classify the iris samples with a decision tree.

Samples (columns: sepal length, sepal width, petal length, petal width, class):

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3.0,5.1,1.8,Iris-virginica

II. Implementation:

Program outline: load the samples --- split each line into feature data and a label --- collect the distinct label names --- build the decision tree --- compute the classification accuracy --- plot the tree. The script assumes the sample listing above has been saved as iris.data in the MATLAB working directory.

1. Main script:

%% Load and split the data
sample=importdata('iris.data');
label=[];
data=[];

% Split every line at its last comma into a numeric part (data) and a
% class-name part (label); both are accumulated as char matrices
for i=1:length(sample)
    ch=cell2mat(sample(i));
    data=strvcat(data,ch(1:fen_label(ch)-1));
    label=strvcat(label,ch(fen_label(ch)+1:length(ch)));
end
data=str2num(data);  % char matrix -> numeric sample matrix (one row per sample)
% Collect the distinct label names and count the classes
alabel=[];
label_n=0;
mlabel=label;
while true
    if size(mlabel,1)==0
        break;
    end
    alabel=strvcat(alabel,mlabel(1,:));
    label_n=label_n+1;
    mid_index=find_label(mlabel,alabel(label_n,:));
    mlabel(mid_index,:)=[];
end
% data: feature matrix   label: per-sample labels
% alabel: distinct label names   label_n: number of classes
%% Build the tree

attri=strvcat('sepal length(cm)','sepal width(cm)','petal length(cm)','petal width(cm)');
feat_index=1:size(data,2);
tree=creat_tree(data,label,feat_index,attri);

%% Compute the accuracy (on the training data itself; there is no train/test split)
tcount=0;
for i=1:size(data,1)
    mtree=tree;
    while true
        % a node whose children are both 'null' is a leaf holding a class name
        if strcmp(mtree.left,'null')==1 && strcmp(mtree.right,'null')==1
            if strcmp(mtree.value,label(i,:))==1
                tcount=tcount+1;
            end
            break;
        end
        % add_value = [feature index, split threshold]
        if data(i,mtree.add_value(1))<mtree.add_value(2)
            mtree=mtree.left;
        else
            mtree=mtree.right;
        end
    end
end
accuracy=tcount/size(data,1);
%% Plot the tree
A=cell(1);
[A,i]=prev(tree,A,1,0);
print_tree(A,accuracy);
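
The accuracy loop above also shows how to classify a new, unseen sample: start at the root and go left whenever the sample's value on the node's split feature is below the node's threshold. As a minimal sketch (this helper is not part of the original post), the same walk packaged as a function:

function c = classify_one(tree,x)
% Hypothetical helper: classify one 1x4 sample x by walking the learned tree
mtree=tree;
while ~(strcmp(mtree.left,'null')==1 && strcmp(mtree.right,'null')==1)
    % add_value = [feature index, split threshold], as set by creat_tree
    if x(mtree.add_value(1))<mtree.add_value(2)
        mtree=mtree.left;
    else
        mtree=mtree.right;
    end
end
c=strtrim(mtree.value);  % the leaf holds the class name (space-padded by strvcat)
end

For the first sample in the listing, classify_one(tree,[5.1 3.5 1.4 0.2]) should return 'Iris-setosa'.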

2. Splitting the data from the labels:

function index = fen_label(label)
% Return the index of the last comma in the line, which separates the
% numeric data from the class label

for i=length(label):-1:1
    if label(i)==','
        index=i;
        break;
    end
end

end
function label_index = find_label(label_mat,label_obj)
% Return the row indices in label_mat whose label equals label_obj
label_index=[];
for i=1:size(label_mat,1)
    if strcmp(label_mat(i,:),label_obj)
        label_index=[label_index,i];
    end
end
end
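
As a quick cross-check of these two helpers (the main script does not use this), the same per-line split can be reproduced with MATLAB's built-in strsplit, available since R2013a:

% Cross-check only: split one raw line at its commas
parts=strsplit('5.1,3.5,1.4,0.2,Iris-setosa',',');
values=str2double(parts(1:4));  % [5.1 3.5 1.4 0.2]
name=parts{5};                  % 'Iris-setosa'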

3. Building the decision tree:

function tree=creat_tree(data,label,feat_index,attri)
% Build the decision tree recursively, splitting continuous attributes in two
tree=struct('value','null','add_value','null','left','null','right','null');

if size(label,1)==length(find_label(label,label(1,:)))  % only one class remains
    tree.value=label(1,:);
    return;
end
if isempty(feat_index)  % attribute set is empty: fall back to the majority class
    tree.value=find_mlabel(label);
    return;
end
% If every remaining attribute takes a single value on these samples,
% no further split is possible: fall back to the majority class
mid1=1;
for i=feat_index
   mid1=mid1*length(unique(data(:,i)));
end
if mid1==1
    tree.value=find_mlabel(label);
    return;
end

[bestfeat,T_fen] = choose_feat(data,feat_index,label);

tree.value=strvcat(attri(bestfeat,:),num2str(T_fen));  % node caption: attribute name over threshold
tree.add_value=[bestfeat,T_fen];                       % [feature index, threshold]

feat_index(feat_index==bestfeat)=[];
index1=find(data(:,bestfeat)<T_fen);
index2=find(data(:,bestfeat)>=T_fen);
tree.left=creat_tree(data(index1,:),label(index1,:),feat_index,attri);
tree.right=creat_tree(data(index2,:),label(index2,:),feat_index,attri);

end
function [bestfeat,T_fen] = choose_feat(data,feat_index,label)
% Pick the attribute and split point with the largest information gain
% feat_index: candidate attribute indices
% data: sample features   label: corresponding labels

% bestfeat: index of the chosen attribute
% T_fen: the chosen attribute's best binary split point

best_gain=0;
bestfeat=0;
for i=feat_index
    mmat=unique(data(:,i));  % sorted distinct values of attribute i
    if length(mmat)-1==0
        T_mat=mmat(1);
    else
        % candidate split points: midpoints of adjacent distinct values
        T_mat=ones(1,length(mmat)-1);
        for j=1:length(mmat)-1
            T_mat(j)=(mmat(j)+mmat(j+1))/2;
        end
    end

    for j=T_mat
        Gain=Ent_cal_data(data,label,i,j);
        if Gain>best_gain
            best_gain=Gain;
            bestfeat=i;
            T_fen=j;
        end
    end
end

end
function mo_label = find_mlabel(label)
% Return the most frequent label (the majority class)
mlabel=label;  % work on a copy so the input is left intact
max_con=0;     % highest count seen so far
while true
    if size(mlabel,1)==0
        break;
    end
    id1=find_label(mlabel,mlabel(1,:));  % rows matching the first remaining label
    if length(id1)>max_con
        max_con=length(id1);
        mo_label=mlabel(1,:);
    end
    mlabel(id1,:)=[];  % drop that label's rows and continue
end

end
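
choose_feat calls Ent_cal_data, which is not listed in this post. Below is a minimal sketch of one plausible implementation, assuming the standard information-gain criterion Gain(D,a,t) = Ent(D) - |D1|/|D|*Ent(D1) - |D2|/|D|*Ent(D2), where D1/D2 are the samples below/at-or-above threshold t and Ent(D) = -sum_k p_k*log2(p_k). The local helper ent and the use of unique(...,'rows') are assumptions of this sketch, not the author's code; save it as Ent_cal_data.m:

function Gain = Ent_cal_data(data,label,feat,T)
% Sketch: information gain of splitting attribute feat at threshold T
idx1=find(data(:,feat)<T);
idx2=find(data(:,feat)>=T);
n=size(data,1);
Gain=ent(label)-length(idx1)/n*ent(label(idx1,:)) ...
               -length(idx2)/n*ent(label(idx2,:));
end
function E = ent(label)
% Shannon entropy of a char-matrix label set: Ent(D) = -sum_k p_k*log2(p_k)
E=0;
if size(label,1)==0
    return;
end
[~,~,ic]=unique(label,'rows');  % class index of every row
for k=1:max(ic)
    p=sum(ic==k)/length(ic);
    E=E-p*log2(p);
end
end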

4. Plotting the tree and displaying the accuracy:

function [A,i]=prev(T,A,i,j)
% Preorder traversal that fills A with the node captions and the
% parent-index sequence treeplot needs to draw the tree
% Call with i=1 and j=0 for the root
% MATLAB passes the cell array by value, so A must be both an input and an output

if isstruct(T)==1 && (strcmp(T.left,'null')==0 || strcmp(T.right,'null')==0)
   % internal node: record it, then visit both children
   A{i,1}=T.value;
   A{i,2}=j;       % index of this node's parent
   i=i+1;j=i-1;    % i keeps growing across the traversal; j stays fixed within one call
   [A,i]=prev(T.left,A,i,j);
   i=i+1;
   [A,i]=prev(T.right,A,i,j);
elseif isstruct(T)==1 && strcmp(T.left,'null')==1 && strcmp(T.right,'null')==1
   % leaf node: record its class name
   A{i,1}=T.value;
   A{i,2}=j;
else
   A{i,1}=T;
   A{i,2}=j;
end

end
function print_tree(A,P)
% Draw the tree with treeplot and put each node's caption at its position
for i=1:length(A)
    nodes(1,i)=A{i,2};  % parent-index vector for treeplot
end
treeplot(nodes)
[x,y]=treelayout(nodes);
x=x';
y=y';
for i=1:length(A)
   name{i,1}=A{i,1};
end
for i=1:length(A)
    text(x(i,1),y(i,1),name{i,1})
end
% bracket concatenation preserves spaces (strcat would trim them)
s=['Iris decision tree, accuracy: ',num2str(100*P),'%'];
title({s},'FontSize',12);

end
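
prev encodes the tree in the parent-index format treeplot expects: A{i,2} holds the index of node i's parent, with 0 marking the root. A minimal standalone example of that format:

% A root (parent index 0) whose two children both point back to node 1
nodes=[0 1 1];
treeplot(nodes);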

III. Results

Running the main script draws the learned decision tree with treeplot, with the training accuracy shown in the figure title.
