There is a bug I have never managed to find; taking another look now that I have some spare time, and saving a copy here.
Input format
The first line contains two positive integers n and m: the number of features and the number of training samples (1 &lt; n &lt;= 100, 1 &lt; m &lt;= 1000).
Each of the following m lines describes one sample: its n integer feature values followed by its observed label y (an integer).
Output format
The preorder traversal of the decision tree: print inner(feature id) for an internal node and leaf(label) for a leaf node. When printing a node at depth n, first print n copies of ---, then the feature value of the branch leading to that node followed by -&gt;.
Sample input:
3 10
2 1 1 0
2 0 1 1
1 1 1 0
0 1 0 0
0 0 1 2
0 1 0 0
1 0 0 1
1 0 1 3
1 1 1 0
2 0 0 1
Sample output:
inner(1)
---0->inner(0)
------0->leaf(2)
------1->inner(2)
---------0->leaf(1)
---------1->leaf(3)
------2->leaf(1)
---1->leaf(0)
The code is below. The problem: printing out the decision tree at the end always goes wrong.
#include<iostream>
#include<vector>
#include<cmath>
#include<map>
#include<cstdio>
using namespace std;
struct TNode{
//node of the decision tree
int feature;
int type;
//feature != -1 marks an internal (split) node; likewise type != -1 marks a leaf; in a finished tree the two are never both -1
TNode *child[32];
//note: int *a[5] is an array of 5 pointers to int, while int (*a)[5] is a pointer to an array of 5 ints
TNode(){
feature=-1;
type=-1;
for(int i=0;i<32;i++){
child[i]=0;
}
}
};
TNode* construct_decision_tree(vector<vector<int> > training_sample,vector<int> feature);
double compute_single_entropy(double p){
//entropy contribution of one class: pass the fraction p of samples whose label is that value
if(p!=0)
return -p*(log(p)/log(2));
return 0;
}
double compute_double_entropy(vector<int> y){
//entropy of a collection of labels: pass the vector of y values
map<int,int> y_count;
//count how many times each y occurs
for(int i=0;i<y.size();i++){
y_count[y[i]]++;
}
double all_entropy=0.0;
//walk the whole map and sum the per-class contributions
map<int,int>::iterator beg=y_count.begin();
map<int,int>::iterator end=y_count.end();
while(beg!=end){
all_entropy+=compute_single_entropy(double(beg->second)/double(y.size()));
beg++;
}
return all_entropy;
}
int find_the_best_division(vector<vector<int> > training_sample,vector<int>feature){
//pick the feature whose split gives the lowest conditional entropy; takes the training matrix (last column is the label) and the ids of the remaining features
double best_division=0.0;
int best_feature=-1;
map<int,int> x_count;
map<int,vector<int> > y_value;
//for each value of the current feature, collect the labels that occur with it
for(int colnum=0;colnum<training_sample[0].size()-1;colnum++){//bug fix: the last column is the label, not a feature
double division=0.0;
for(int row=0;row<training_sample.size();row++){
x_count[training_sample[row][colnum]]++;
y_value[training_sample[row][colnum]].push_back(training_sample[row][training_sample[0].size()-1]);
}
map<int,int>::iterator beg=x_count.begin();
map<int,int>::iterator end=x_count.end();
while(beg!=end){
division+=double(beg->second)/double(training_sample.size())*compute_double_entropy(y_value[beg->first]);
beg++;
}
//bug fix: compare only after the conditional entropy has been fully accumulated, not inside the loop above
if(best_feature==-1||best_division>division){
best_division=division;
best_feature=colnum;
}
x_count.clear();
y_value.clear();
}
return feature[best_feature];
}
int check_is_leaf(vector<vector<int> > training_sample,vector<int> feature){
//check whether this node should be a leaf; if so, return the label to store in it (majority vote), otherwise -1
map<int,int> y_count;
for(int i=0;i<training_sample.size();i++){
y_count[training_sample[i][training_sample[0].size()-1]]++;
}
if(y_count.size()!=1&&feature.size()!=0) return -1;
map<int,int>::iterator beg=y_count.begin();
map<int,int>::iterator max=beg;
map<int,int>::iterator end=y_count.end();
while(beg!=end){
if(beg->second>max->second) max=beg;
beg++;
}
return max->first;
}
void create_branch_tree(vector<vector<int> >training_sample,TNode* root,int division_feature,vector<int> feature){
int division_location=0;
//locate the chosen feature among the remaining ones (column order matches the feature vector)
while(feature[division_location]!=division_feature) division_location++;
vector<int>::iterator it=feature.begin()+division_location;
feature.erase(it);
map<int,vector<vector<int> > >classification;
for(int i=0;i<training_sample.size();i++){
int key=training_sample[i][division_location];
vector<int> value;
for(int j=0;j<division_location;j++) value.push_back(training_sample[i][j]);
//bug fix: start at division_location+1 so the split column is actually removed
for(int j=division_location+1;j<training_sample[0].size();j++) value.push_back(training_sample[i][j]);
classification[key].push_back(value);
}
map<int,vector<vector<int> > >::iterator beg=classification.begin();
map<int,vector<vector<int> > >::iterator end=classification.end();
while(beg!=end){
//bug fix: index the child by the branch's feature value; the old code used a counter that was never incremented, so every subtree landed in child[0]
root->child[beg->first]=construct_decision_tree(beg->second,feature);
beg++;
}
}
TNode* construct_decision_tree(vector<vector<int> > training_sample,vector<int> feature){
TNode *root=new TNode();
int type=check_is_leaf(training_sample,feature);
if(type!=-1){
root->type=type;
}
//bug fix: the old special case for feature.size()==1 dropped the label column and stored every child at index 0; the general branch below already handles a single remaining feature, because check_is_leaf returns the majority label once feature is empty
else{
int division_feature=find_the_best_division(training_sample,feature);
//the call above returns the id of the best feature
root->feature=division_feature;
create_branch_tree(training_sample,root,division_feature,feature);
}
return root;
}
void preorder_decision_tree(TNode* root,int value,int level){
for(int i=0;i<level;i++){
cout<<"---";
}
if(level) cout<<value<<"->";
if(root->feature==-1) cout<<"leaf("<<root->type<<")"<<endl;
else cout<<"inner("<<root->feature<<")"<<endl;
for(int i=0;i<32;i++){//bug fix: <=32 read past the end of child[32]
if(root->child[i]){
preorder_decision_tree(root->child[i],i,level+1);
}
}
}
int main(){
//freopen("in.txt","r",stdin);//local testing convenience; the judge reads from stdin
int feature_num,training_num;
cin>>feature_num>>training_num;
vector<vector<int> >training_set;
vector<int> row;
for(int i=0;i<training_num;i++){
row.clear();
for(int j=0;j<=feature_num;j++){
int n;
cin>>n;
row.push_back(n);
}
training_set.push_back(row);
}
vector<int> feature;
for(int i=0;i<feature_num;i++) feature.push_back(i);
TNode* root=construct_decision_tree(training_set,feature);
preorder_decision_tree(root,0,0);
return 0;
}