Ensemble algorithms:
Bagging: build multiple mutually independent estimators, then average their predictions or take a majority vote to produce the ensemble's result (e.g. random forest)
Boosting: base estimators are built sequentially and are correlated, each new estimator trying to correct its predecessors
Stacking: a meta-estimator learns how to combine the base estimators' predictions
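As a minimal sketch (toy data and all estimator choices here are illustrative, not from the notes), the three families map onto sklearn.ensemble classes like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# toy data, for illustration only
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging: independent base estimators trained on bootstrap samples,
# combined by averaging / majority vote (a random forest is bagged trees)
bag = BaggingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Boosting: base estimators built one after another, each focusing on
# the samples the previous ones got wrong
boost = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# Stacking: a final meta-estimator learns to combine the base
# estimators' predictions
stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression()).fit(X, y)

print(bag.score(X, y), boost.score(X, y), stack.score(X, y))
```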
Ensemble algorithms in sklearn: the ensemble module
sklearn.ensemble
Classes:
1. Random forest classification
RandomForestClassifier
Important parameters:
criterion: the impurity measure used to evaluate splits
max_depth: the maximum depth of each tree; branches beyond this depth are pruned
n_estimators: the number of trees in the forest, i.e. the number of base estimators. A larger n_estimators usually improves the model, but the gain levels off once the forest is large enough (every model has an upper performance bound), while training cost keeps growing
random_state: controls the randomness of the branching; defaults to None
bootstrap: whether to draw bootstrap samples (random sampling with replacement); defaults to True
oob_score: out-of-bag evaluation; set oob_score=True, and after training read the oob_score_ attribute to see the score on the out-of-bag samples
Four commonly used interfaces of random forests: apply, fit, predict, score
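A short sketch of these parameters and the four interfaces on the wine dataset (the dataset choice and hyperparameter values are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

# bootstrap=True (the default) draws samples with replacement; with
# oob_score=True, the samples left out of each bootstrap act as a
# built-in validation set
rfc = RandomForestClassifier(n_estimators=25, random_state=0,
                             bootstrap=True, oob_score=True)
rfc.fit(X, y)                 # fit: train the forest
print(rfc.oob_score_)         # accuracy on the out-of-bag samples
print(rfc.score(X, y))        # score: mean accuracy on the given data
print(rfc.predict(X)[:5])     # predict: predicted class labels
print(rfc.apply(X).shape)     # apply: leaf index per sample per tree
```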
2. Random forest regression
RandomForestRegressor
Essentially the same as the classifier; the main difference is the criterion parameter
criterion measures the quality of a split in a regression tree; three options are supported:
1) "mse", mean squared error (MSE): the reduction in MSE between the parent node and its leaf nodes is used as the feature-selection criterion; this method minimizes the L2 loss using the mean of each leaf
2) "friedman_mse", Friedman MSE: Friedman's modification of the MSE, which improves how candidate splits are scored
3) "mae", mean absolute error (MAE): minimizes the L1 loss using the median of each leaf
Random forest regression can also be used to impute missing values!
# Imputing missing values
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # note: load_boston was removed in scikit-learn 1.2
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
dataset = load_boston()
# print(dataset)
x_full, y_full = dataset.data, dataset.target
# print(x_full.shape[0])
# print(y_full.shape[0])
n_samples = x_full.shape[0]
n_features = x_full.shape[1]
# Inject missing values into the complete dataset
rng = np.random.RandomState(0)
missing_rate = 0.5
n_missing_samples = int(np.floor(n_samples*n_features*missing_rate))
# np.floor rounds down, returning a float ending in .0
missing_features = rng.randint(0, n_features, n_missing_samples)
missing_samples = rng.randint(0, n_samples, n_missing_samples)
# randint(low, high, n) draws n integers from [low, high)
x_missing = x_full.copy()
y_missing = y_full.copy()
x_missing[missing_samples, missing_features] = np.nan
x_missing = pd.DataFrame(x_missing)
# print(x_missing)
# Impute with 0 or with the column mean
# impute with the mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
x_missing_mean = imp_mean.fit_transform(x_missing)
# print(x_missing_mean)
# impute with 0
imp_0 = SimpleImputer(missing_values=np.nan, strategy="constant", fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing)
# print(x_missing_0)
# Impute with random forest regression
x_missing_reg = x_missing.copy()
b = x_missing_reg.isnull().sum(axis=0)
# print(b)
a = np.sort(x_missing_reg.isnull().sum(axis=0))
# print(a)
sortindex = np.argsort(x_missing.isnull().sum()).values
# print(sortindex)
for i in sortindex:
    # Build the new feature matrix and new label for this column
    df = x_missing_reg
    fillc = df.iloc[:, i]
    df = pd.concat([df.iloc[:, df.columns != i], pd.DataFrame(y_full)], axis=1)
    # Fill the remaining missing values in the new feature matrix with 0
    df_0 = SimpleImputer(missing_values=np.nan,
                         strategy='constant', fill_value=0).fit_transform(df)
    # Training set: rows where this column is known;
    # test set: rows where it is missing
    ytrain = fillc[fillc.notnull()]
    ytest = fillc[fillc.isnull()]
    xtrain = df_0[ytrain.index, :]
    xtest = df_0[ytest.index, :]
    if ytest.shape[0] == 0:  # nothing missing in this column
        continue
    # Fill the missing values with a random forest regressor
    rfc = RandomForestRegressor(n_estimators=100)
    rfc = rfc.fit(xtrain, ytrain)
    ypredict = rfc.predict(xtest)
    # Write the predictions back into the original feature matrix
    x_missing_reg.loc[x_missing_reg.iloc[:, i].isnull(), i] = ypredict
# print(x_missing_reg)
# Fit a model on each imputed dataset and compare
X = [x_full, x_missing_mean, x_missing_0, x_missing_reg]
mse = []
for x in X:
    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
    scores = cross_val_score(estimator, x, y_full,
                             scoring='neg_mean_squared_error', cv=5).mean()
    mse.append(scores * -1)
print(mse)
# Plot the results as a horizontal bar chart
# (labels follow the order of X above: full, mean, zero, regressor)
x_labels = ['full data',
            'mean imputation',
            'zero imputation',
            'regressor imputation']
colors = ['r', 'g', 'b', 'orange']
plt.figure(figsize=(12, 6))
ax = plt.subplot(111)
for i in np.arange(len(mse)):
    ax.barh(i, mse[i], color=colors[i], alpha=0.6, align='center')
ax.set_title('Imputation Techniques with Boston data')
ax.set_xlim(left=np.min(mse)*0.9,
            right=np.max(mse)*1.1)
ax.set_yticks(np.arange(len(mse)))
ax.set_xlabel('MSE')
ax.set_yticklabels(x_labels)
plt.show()