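This article is the second part of ☞【干货分享】C4.5算法(上) (C4.5 Algorithm, Part 1). If you have not read the first part, please read it first.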
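The attribute set has four attributes: weather, temperature, humidity, and wind speed.
The class label set has two values: play and cancel.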
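Steps:
1. Compute the information entropy of the data set
2. Compute the information entropy after splitting on each attribute
3. Compute the information gain
4. Compute the split information
5. Compute the information gain ratio
6. Repeat steps 1-5 to obtain the information gain ratio of the split on every attribute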
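For reference, the quantities named in these steps can be written in the usual C4.5 notation. The symbols are assumptions made here for illustration: D is the data set, p_i is the fraction of samples in class i, and attribute A splits D into subsets D_1, ..., D_v.

```latex
\begin{aligned}
\mathrm{Info}(D) &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
\mathrm{Info}(A) &= \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \\
\mathrm{Gain}(A) &= \mathrm{Info}(D) - \mathrm{Info}(A) \\
H(A) &= -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \\
\mathrm{IGR}(A) &= \frac{\mathrm{Gain}(A)}{H(A)}
\end{aligned}
```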
1. Compute the information entropy
Info(D) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940
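The entropy calculation can be sketched in C++ as follows. This is a minimal illustration, not the article's own code: the function name `entropy` and the choice to pass raw class counts are assumptions made here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Shannon entropy (base 2) of a class distribution given as raw counts.
// Empty classes are skipped, following the convention 0 * log2(0) = 0.
double entropy(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    assert(total > 0);  // at least one sample is required

    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;
        const double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}
```

Calling `entropy({9, 5})` corresponds to the 9 "play" and 5 "cancel" samples and gives roughly 0.940.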
2. Compute the information entropy after splitting on each attribute
Info(weather) = 5/14 * [-2/5 * log2(2/5) - 3/5 * log2(3/5)] + 4/14 * [-4/4 * log2(4/4) - 0/4 * log2(0/4)] + 5/14 * [-3/5 * log2(3/5) - 2/5 * log2(2/5)] = 0.694
Info(temperature) = 4/14 * [-2/4 * log2(2/4) - 2/4 * log2(2/4)] + 6/14 * [-4/6 * log2(4/6) - 2/6 * log2(2/6)] + 4/14 * [-3/4 * log2(3/4) - 1/4 * log2(1/4)] = 0.911
Info(humidity) = 7/14 * [-3/7 * log2(3/7) - 4/7 * log2(4/7)] + 7/14 * [-6/7 * log2(6/7) - 1/7 * log2(1/7)] = 0.789
Info(wind speed) = 6/14 * [-3/6 * log2(3/6) - 3/6 * log2(3/6)] + 8/14 * [-6/8 * log2(6/8) - 2/8 * log2(2/8)] = 0.892
Here the term 0/4 * log2(0/4) is taken to be 0.
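The weighted entropy after a split can be sketched like this. The names `entropyOf` and `splitEntropy` are hypothetical, and each branch is assumed to be given as a pair of class counts read off the bracketed terms in the formulas.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Counts = std::vector<int>;  // class counts in one branch, e.g. {play, cancel}

// Entropy of one branch; a branch with no samples contributes 0.
static double entropyOf(const Counts& counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;  // 0 * log2(0) is taken as 0
        const double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// Info(A): branch entropies weighted by the share of samples in each branch.
double splitEntropy(const std::vector<Counts>& branches) {
    int total = 0;
    for (const Counts& b : branches)
        for (int c : b) total += c;
    assert(total > 0);

    double info = 0.0;
    for (const Counts& b : branches) {
        int n = 0;
        for (int c : b) n += c;
        info += static_cast<double>(n) / total * entropyOf(b);
    }
    return info;
}
```

For the weather attribute, the branches are `{2, 3}`, `{4, 0}` and `{3, 2}`, which yields about 0.694.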
3. Compute the information gain
Gain(weather) = Info(D) - Info(weather) = 0.940 - 0.694 = 0.246
Gain(temperature) = Info(D) - Info(temperature) = 0.940 - 0.911 = 0.029
Gain(humidity) = Info(D) - Info(humidity) = 0.940 - 0.789 = 0.151
Gain(wind speed) = Info(D) - Info(wind speed) = 0.940 - 0.892 = 0.048
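As a rough sketch, the gain can also be computed directly from the branch counts, because the class counts of the whole data set are just the sums over the branches. The function name `infoGain` is an assumption made here. Working at full precision gives values such as 0.2467 for weather and 0.1518 for humidity, which differ in the last digit from the figures above because those subtract already-rounded numbers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Counts = std::vector<int>;  // class counts, e.g. {play, cancel}

static int sumOf(const Counts& counts) {
    int total = 0;
    for (int c : counts) total += c;
    return total;
}

static double entropyOf(const Counts& counts) {
    const int total = sumOf(counts);
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;  // 0 * log2(0) is taken as 0
        const double p = static_cast<double>(c) / total;
        h -= p * std::log2(p);
    }
    return h;
}

// Gain(A) = Info(D) - Info(A), with D reconstructed from the branches of A.
double infoGain(const std::vector<Counts>& branches) {
    assert(!branches.empty());
    Counts parent(branches.front().size(), 0);
    for (const Counts& b : branches) {
        assert(b.size() == parent.size());
        for (std::size_t i = 0; i < b.size(); ++i) parent[i] += b[i];
    }
    const int total = sumOf(parent);
    assert(total > 0);

    double after = 0.0;  // Info(A): weighted entropy of the branches
    for (const Counts& b : branches)
        after += static_cast<double>(sumOf(b)) / total * entropyOf(b);
    return entropyOf(parent) - after;
}
```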
4. Compute the split information
H(weather) = -5/14 * log2(5/14) - 5/14 * log2(5/14) - 4/14 * log2(4/14) = 1.58
H(temperature) = -4/14 * log2(4/14) - 6/14 * log2(6/14) - 4/14 * log2(4/14) = 1.56
H(humidity) = -7/14 * log2(7/14) - 7/14 * log2(7/14) = 1.0
H(wind speed) = -6/14 * log2(6/14) - 8/14 * log2(8/14) = 0.99
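The split information depends only on how many samples fall into each branch, not on their class labels. A minimal sketch, with the hypothetical name `splitInfo` and branch sizes passed as plain integers:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// H(A): entropy of the partition induced by an attribute, computed from
// the number of samples in each branch. Empty branches are skipped.
double splitInfo(const std::vector<int>& branchSizes) {
    int total = 0;
    for (int n : branchSizes) total += n;
    assert(total > 0);

    double h = 0.0;
    for (int n : branchSizes) {
        if (n == 0) continue;
        const double w = static_cast<double>(n) / total;
        h -= w * std::log2(w);
    }
    return h;
}
```

For example, `splitInfo({5, 5, 4})` corresponds to the three weather branches and gives about 1.58.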
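5. Compute the information gain ratio
IGR(weather) = Gain(weather) / H(weather) = 0.1560
IGR(temperature) = Gain(temperature) / H(temperature) = 0.0186
IGR(humidity) = Gain(humidity) / H(humidity) = 0.1510
IGR(wind speed) = Gain(wind speed) / H(wind speed) = 0.0487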
Conclusion: weather has the highest information gain ratio, so it is chosen as the root node.
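The selection step can be sketched as picking the attribute with the largest gain ratio. The `Candidate` struct and both function names are hypothetical. Returning a ratio of 0 when the split information is 0 is also an assumption made here to avoid dividing by zero, not something the article specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record holding the per-attribute results of steps 3 and 4.
struct Candidate {
    std::string name;
    double gain;       // Gain(A)
    double splitInfo;  // H(A)
};

// IGR(A) = Gain(A) / H(A); an attribute with a single branch has H(A) = 0
// and is given ratio 0 here (an assumption of this sketch).
double gainRatio(const Candidate& c) {
    return c.splitInfo > 0.0 ? c.gain / c.splitInfo : 0.0;
}

// Returns the candidate with the largest gain ratio (first one wins ties).
const Candidate& bestByGainRatio(const std::vector<Candidate>& candidates) {
    assert(!candidates.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (gainRatio(candidates[i]) > gainRatio(candidates[best])) best = i;
    }
    return candidates[best];
}
```

Feeding in the four gains and split-information values computed above selects weather, ahead of humidity by a narrow margin.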
Code
Data structures used by the program
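#include <map>
#include <set>
#include <string>
#include <vector>

// Statistics for one value of an attribute.
struct attrItem
{
    std::vector<int> itemNum;  // counts for this value (presumably one per class label)
    std::set<int> itemLine;    // presumably the line numbers of the samples with this value
};
// Statistics for one attribute.
struct attributes
{
    std::string attriName;                       // attribute name
    std::vector<double> statResult;              // statistics computed for this attribute
    std::map<std::string, attrItem*> attriItem;  // attribute value -> its statistics
};
std::vector<attributes*> statTree;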
Node data structure
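struct TreeNode
{
    std::string m_sAttribute;            // attribute (or label) stored at this node
    int m_iDeciNum;                      // sample counters kept for the node
    int m_iUnDecinum;                    // (presumably one per class label)
    std::vector<TreeNode*> m_vChildren;  // child nodes, one per branch
};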
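To illustrate how such a node structure is typically walked, here is a small sketch of recursive helpers. Everything except the `TreeNode` layout is hypothetical: `addChild`, `countNodes`, `treeDepth` and `freeTree` are names invented here, and the code assumes every node is allocated with `new` and owned by its parent.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Same layout as the article's TreeNode, repeated so the sketch compiles alone.
struct TreeNode {
    std::string m_sAttribute;
    int m_iDeciNum;
    int m_iUnDecinum;
    std::vector<TreeNode*> m_vChildren;
};

// Attach a child to a parent node.
void addChild(TreeNode* parent, TreeNode* child) {
    assert(parent != nullptr && child != nullptr);
    parent->m_vChildren.push_back(child);
}

// Number of nodes in the subtree rooted at node (0 for an empty tree).
int countNodes(const TreeNode* node) {
    if (node == nullptr) return 0;
    int n = 1;
    for (const TreeNode* child : node->m_vChildren) n += countNodes(child);
    return n;
}

// Height of the subtree: 0 for an empty tree, 1 for a single node.
int treeDepth(const TreeNode* node) {
    if (node == nullptr) return 0;
    int deepest = 0;
    for (const TreeNode* child : node->m_vChildren)
        deepest = std::max(deepest, treeDepth(child));
    return 1 + deepest;
}

// Post-order delete: children are released before their parent.
void freeTree(TreeNode* node) {
    if (node == nullptr) return;
    for (TreeNode* child : node->m_vChildren) freeTree(child);
    delete node;
}
```

Raw owning pointers are kept only to match the struct as declared; a version using `std::unique_ptr` for the children would make `freeTree` unnecessary.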
Main flow
#include "DecisionTree.h"
int main(int argc, char* argv[]){
    string filename = "source.txt";
    DecisionTree dt ;
    int attr_node = 0;
    TreeNode* treeHead = nullptr;
s