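【Experiment Name】 Experiment: Ensemble Learning
【Objectives】
1. Understand the theoretical basis of decision trees and random forests
2. Run the algorithms on the lab platform
3. Implement the decision tree and random forest algorithms in code
【Principles】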
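A decision tree is a supervised learning model that predicts a target by applying a sequence of tests to the input features: each internal node splits the samples on one feature, and each leaf stores a prediction (a class label for classification, a mean value for regression). Splits are chosen to make the child nodes as pure as possible, for example by reducing Gini impurity or entropy for classification, or variance (mean squared error) for regression. A random forest is an ensemble of decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split; the trees' outputs are combined by voting (classification) or averaging (regression).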
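In the machine learning sense used in this experiment, a regression tree picks each split by searching for the threshold that leaves the two child nodes with the lowest weighted variance. The following is a minimal sketch of that search for a single feature; the function name best_split and the toy arrays are hypothetical, introduced only for illustration, and this is not how sklearn implements it internally.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted child variance) of the best split on one feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # equal values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        # Weighted average of the child variances (lower is better)
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            best_score = score
    return best_threshold, best_score

# Toy data (assumed): two clearly separated groups
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])
threshold, score = best_split(x, y)
print("best threshold:", threshold)
```

A full tree repeats this search over every feature at every node and recurses on the two children until a stopping rule such as max_depth is reached.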
【Environment】
OS: Ubuntu 16.04
PyCharm: 2017.3
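【Procedure】
Before starting, install the libraries the experiment depends on (the PyPI package is named scikit-learn; sklearn is only the import name):
pip install scikit-learn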
pip install matplotlib
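Task 1: Analyze the factors that affect house prices
We now use a decision tree to find which attribute matters most for house prices. Earlier experiments introduced the Boston housing dataset, which consists of 13 attributes and one price. The attributes drive the price, but they are not equally important. Recalling how a decision tree works, we can use one to rank the importance of the features. Hint: use the feature_importances_ attribute of the decision tree regressor DecisionTreeRegressor.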
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn import datasets
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.utils import shuffle
import matplotlib.pyplot as plt
if __name__ == '__main__':
    # Load housing data
    # Note: load_boston was removed in scikit-learn 1.2; this script needs an older release
    housing_data = datasets.load_boston()
    # Shuffle the data
    X, y = shuffle(housing_data.data, housing_data.target, random_state=7)
    # Split the data 80/20 (80% for training, 20% for testing)
    num_training = int(0.8 * len(X))
    X_train, y_train = X[:num_training], y[:num_training]
    X_test, y_test = X[num_training:], y[num_training:]
    # Fit decision tree regression model
    dt_regressor = DecisionTreeRegressor(max_depth=4)
    dt_regressor.fit(X_train, y_train)
    # Evaluate performance of Decision Tree regressor
    y_pred_dt = dt_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred_dt)
    evs = explained_variance_score(y_test, y_pred_dt)
    print("\n#### Decision Tree performance ####")
    print("Mean squared error =", round(mse, 2))
    print("Explained variance score =", round(evs, 2))
    # Rescale importances so the most important feature is 100
    feature_importances = 100.0 * (dt_regressor.feature_importances_ / max(dt_regressor.feature_importances_))
    # Sort the indices from most to least important
    index_sorted = np.flipud(np.argsort(feature_importances))
    # Arrange the X ticks
    pos = np.arange(index_sorted.shape[0]) + 0.5
    # Plot the bar graph
    plt.figure()
    plt.bar(pos, feature_importances[index_sorted], align='center')
    plt.xticks(pos, housing_data.feature_names[index_sorted])
    plt.ylabel('Relative Importance')
    plt.title('Decision Tree regressor')
    plt.show()
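To see what feature_importances_ reports without depending on the Boston dataset, here is a small sketch on synthetic data. The data, the seed, and the assumption that the target depends only on the first column are all made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(7)
X = rng.uniform(size=(200, 3))
# Assumed toy target: driven by column 0 only, plus a little noise
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X, y)

importances = tree.feature_importances_
print("importances:", importances)
print("sum:", importances.sum())
print("most important column:", int(np.argmax(importances)))
```

The raw importances are normalized to sum to 1, which is why the listing above divides by the maximum and multiplies by 100 to get a relative scale for plotting.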
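Task 2: Random forest
Practice with the digits dataset bundled with sklearn. Split the data into training and test sets, train a decision tree classifier and a random forest classifier on the training set, and compare their accuracy on the test set. Then consider: can the random forest's performance be improved further, and if so, how?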
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
# Load the digits data and hold out 40% for testing
dig = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(dig.data, dig.target, test_size=0.4, random_state=0)
# Create classifiers
clf = DecisionTreeClassifier()
rfc = RandomForestClassifier()
rfc2 = RandomForestClassifier(n_estimators=200, max_features=8)
# Train on the training set, predict on the test set
clf_pre = clf.fit(X_train, y_train).predict(X_test)
rfc_pre = rfc.fit(X_train, y_train).predict(X_test)
rfc2_pre = rfc2.fit(X_train, y_train).predict(X_test)
# Compare test accuracy
print("Decision tree accuracy =", metrics.accuracy_score(y_test, clf_pre))
print("Random forest (default) accuracy =", metrics.accuracy_score(y_test, rfc_pre))
print("Random forest (200 trees, max_features=8) accuracy =", metrics.accuracy_score(y_test, rfc2_pre))
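One possible way to approach the question of improving the random forest is a cross-validated search over its hyperparameters. The sketch below uses GridSearchCV from sklearn; the grid values, the 3-fold setting, and random_state=0 are illustrative assumptions, not tuned recommendations.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

dig = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    dig.data, dig.target, test_size=0.4, random_state=0)

# Hypothetical search grid, kept small so the search finishes quickly
param_grid = {"n_estimators": [50, 100], "max_features": [4, 8]}

# 3-fold cross-validation on the training set only; the test set stays untouched
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# By default GridSearchCV refits the best setting on the whole training set
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 3))
print("Test accuracy:", round(test_accuracy, 3))
```

Selecting hyperparameters by cross-validation rather than by test accuracy keeps the test set as an unbiased check of the final model.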