Machine Learning: Classification Algorithms, Part 2
Decision Trees
Information Theory Basics
Information Gain: the reduction in the entropy of the target after splitting on a feature, g(D, A) = H(D) - H(D|A)
Information Entropy: H(X) = -Σ p(x) log p(x), the expected amount of information in a random variable
Note: all logarithms here are base 2
Computing Information Gain
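The two definitions above can be sketched in plain Python. This is a minimal illustration, not part of the original notes; the helper names `entropy` and `info_gain` are my own.

```python
import math

def entropy(labels):
    """Shannon entropy of a label sequence, log base 2 (matching the note above)."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(features, labels):
    """Information gain g(D, A) = H(D) - H(D|A) when splitting by feature values."""
    n = len(labels)
    groups = {}
    for f, c in zip(features, labels):
        groups.setdefault(f, []).append(c)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# A balanced binary label has 1 bit of entropy
print(entropy([0, 1, 0, 1]))                          # 1.0
# A feature that perfectly separates the classes recovers all of it
print(info_gain(['a', 'b', 'a', 'b'], [0, 1, 0, 1]))  # 1.0
```

A feature whose values are uninformative (the same for every sample) gives a gain of 0; ID3 greedily picks the feature with the largest gain at each node.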
Common decision tree algorithms: ID3 (information gain), C4.5 (information gain ratio), CART (Gini index)
The Gini index yields finer-grained splits
criterion: defaults to 'gini'; 'entropy' is also available
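A small sketch of switching the `criterion` parameter, using scikit-learn's bundled iris data as a stand-in dataset (the dataset choice and `random_state` values are my own, for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# Same data, two split criteria: Gini index (default) vs. information entropy
for criterion in ('gini', 'entropy'):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(x_train, y_train)
    print(criterion, 'accuracy:', clf.score(x_test, y_test))
```

On most datasets the two criteria give similar trees; entropy is slightly more expensive to compute because of the logarithm.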
Decision Tree Structure and Saving It Locally
Random Forest
Ensemble Learning: combine the predictions of multiple models to obtain a stronger model than any single one
Random Forest API
Note: each tree is trained on a random sample drawn with replacement (a bootstrap sample)
Advantages
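The random forest API described above can be sketched as follows, again using iris as a stand-in dataset (the hyperparameter values are illustrative, not from the original notes):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

# n_estimators: number of trees; each tree is fit on a bootstrap
# (with-replacement) sample of the rows, as noted above
rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rf.fit(x_train, y_train)
print('Random forest accuracy:', rf.score(x_test, y_test))
```

Averaging many trees trained on different bootstrap samples reduces variance, which is why the forest usually generalizes better than a single deep tree.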
Code Implementation
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
from sklearn.model_selection import train_test_split


def decision():
    # Titanic dataset (URL from the original notes; may no longer be reachable)
    titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titantic.txt")
    x = titan[['pclass', 'age', 'sex']].copy()
    y = titan['survived']
    # Fill missing ages with the column mean
    x['age'] = x['age'].fillna(x['age'].mean())
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    # One-hot encode the categorical features (fit on train, transform test)
    dv = DictVectorizer(sparse=False)
    x_train = dv.fit_transform(x_train.to_dict(orient='records'))
    x_test = dv.transform(x_test.to_dict(orient='records'))
    dec = DecisionTreeClassifier(max_depth=5)
    dec.fit(x_train, y_train)
    print('Test accuracy:', dec.score(x_test, y_test))
    # Export the fitted tree as a .dot file for visualization with Graphviz
    export_graphviz(dec, out_file='./tree.dot',
                    feature_names=['age', 'pclass=1st', 'pclass=2nd',
                                   'pclass=3rd', 'sex=female', 'sex=male'])
    return None


if __name__ == '__main__':
    decision()
Decision tree generation and pruning: https://blog.csdn.net/am290333566/article/details/81187562
Meaning of the parameters in sklearn: https://blog.csdn.net/qq_16000815/article/details/80954039