If you aggregate the predictions of a group of predictors, you often get better predictions than with the best individual predictor. Such a group of predictors is called an ensemble.
In theory, adding more trees keeps improving the result, but in practice performance plateaus (just fluctuates up and down) once the number of trees passes a certain point.
Ensemble methods work best when the predictors are as independent of one another as possible.
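The intuition behind this can be checked with a quick simulation (a sketch not in the original notes, assuming NumPy is available): classifiers that are each barely better than chance become strong in aggregate when their errors are independent.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate 1000 independent classifiers that are each correct 51% of the time.
n_classifiers, n_trials = 1000, 10_000
votes = rng.random((n_trials, n_classifiers)) < 0.51  # True = correct vote

# A majority vote is correct whenever more than half the classifiers are right.
majority_correct = (votes.sum(axis=1) > n_classifiers / 2).mean()

print("single classifier accuracy: 0.51")
print("majority-vote accuracy:", round(majority_correct, 3))
```

The majority vote lands well above 51%; with correlated errors (as in practice) the gain is smaller, which is why independence matters.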
Bagging: train multiple classifiers and average their predictions.
The following code creates and trains a voting classifier composed of three different classifiers:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
voting_clf.fit(x_train, y_train)
If all classifiers can estimate class probabilities, you can average the probabilities over the individual classifiers and let Scikit-Learn predict the class with the highest average probability.
This is called soft voting.
It usually performs better than hard voting because it gives more weight to highly confident votes. To use it, replace voting="hard" with voting="soft"
and make sure every classifier can estimate class probabilities. By default the SVC class cannot, so set its probability hyperparameter to True; this makes SVC use cross-validation to estimate class probabilities, which slows down training.
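A minimal soft-voting sketch (using scikit-learn's moons toy dataset, which is not part of the original notes):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# probability=True lets SVC expose predict_proba (estimated via internal
# cross-validation), which soft voting requires.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
voting_clf.fit(X_train, y_train)
print(voting_clf.score(X_test, y_test))
```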
When sampling is performed with replacement, the method is called bagging; when sampling is performed without replacement, it is called pasting.
The following code trains an ensemble of 500 decision tree classifiers, each trained on 100 instances randomly sampled from the training set with replacement. To use pasting instead, just set bootstrap=False. The n_jobs parameter tells Scikit-Learn how many CPU cores to use for training and prediction (-1 means all available cores).
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1
)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
The following code uses all available CPU cores to train a random forest classifier with 500 trees (each tree limited to at most 16 leaf nodes):
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(x_train, y_train)
y_pred_rf = rnd_clf.predict(x_test)
Print the importance of each feature:
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
Boosting
Boosting refers to any ensemble method that can combine several weak learners into a strong learner.
The general idea of most boosting methods is to train predictors sequentially, each one correcting its predecessor.
AdaBoost
To build an AdaBoost classifier, first train a base classifier and use it to make predictions on the training set.
Then increase the relative weight of the misclassified training instances, and train a second classifier using these updated weights,
and so on.
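A hedged sketch of this with scikit-learn's AdaBoostClassifier (again on the moons toy dataset, not part of the original notes; by default the base learner is a depth-1 decision stump):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 decision stumps trained sequentially; each successive stump focuses on
# the instances its predecessors misclassified, via AdaBoost's instance weights.
ada_clf = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print(ada_clf.score(X_test, y_test))
```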
Gradient boosting:
Fit each new predictor to the residual errors made by the previous predictor.
Let's walk through a simple regression example, using decision trees as the base predictors.
First, fit a DecisionTreeRegressor on the training set:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

# Train a second regressor on the residual errors of the first
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

# Train a third regressor on the residual errors of the second
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
To predict a new instance, sum the predictions of all the trees:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
The following code creates the same ensemble with GradientBoostingRegressor:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)
Finding the optimal number of trees:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

# Measure the validation error at each stage of training
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1  # argmin is 0-indexed

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
The following code stops training as soon as the validation error fails to improve for 5 consecutive iterations (with warm_start=True, each fit keeps the existing trees and adds new ones):
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
Stacking
The predictors at the bottom each make their own prediction, and the final predictor takes these predictions as inputs to make the final prediction.
Representative algorithm: random forest
Random: both the data samples and the features are selected randomly
Forest: many decision trees trained in parallel
Measuring feature importance:
Perturb ("break") one feature column and compare the model's performance against the original
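This "break a column and compare" idea is exactly what scikit-learn's permutation_importance implements; a sketch on the iris dataset (dataset choice is mine, not from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42
)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(X_train, y_train)

# Shuffle one feature column at a time and measure how much the test
# accuracy drops; the bigger the drop, the more important the feature.
result = permutation_importance(rnd_clf, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(iris.feature_names, result.importances_mean):
    print(name, round(score, 3))
```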
Boosting: start from weak learners and strengthen them through weighted training
Representative algorithm: AdaBoost
AdaBoost adjusts the instance weights according to the previous round's classification results
In other words: if an instance is misclassified in this round, it gets a larger weight in the next round
Final result: each classifier's weight is determined by its own accuracy, and they are then combined
Stacking: aggregate multiple classification or regression models (can be done in stages)
All kinds of classifiers can be stacked
Stage 1 produces each model's predictions; Stage 2 trains on the outputs of the previous stage
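The two-stage idea can be sketched with scikit-learn's StackingClassifier (moons toy dataset and estimator choices are mine, not from the notes):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stage 1: base estimators produce out-of-fold predictions via internal CV;
# Stage 2: the final_estimator is trained on those predictions.
stack_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack_clf.fit(X_train, y_train)
print(stack_clf.score(X_test, y_test))
```

Using out-of-fold predictions in stage 1 keeps the blender from training on predictions the base models made for instances they had already seen.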
Random forest case study:
Predicting survival on the Titanic
import pandas
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.describe())
Filling in missing data:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())  # the Age column has missing values; fill them with the median
Mapping categorical data (map the sex values male/female to 0/1):
print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
Prediction with linear regression:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in modern scikit-learn

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]  # feature columns
alg = LinearRegression()
# No shuffling, so the concatenated fold predictions stay in the original row order
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    train_predictors = titanic[predictors].iloc[train, :]
    train_target = titanic["Survived"].iloc[train]
    alg.fit(train_predictors, train_target)
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

import numpy as np
predictions = np.concatenate(predictions, axis=0)
# Threshold the regression output to get class labels
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
Prediction with a random forest:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
After changing the parameters (more trees, looser leaf constraints to reduce overfitting):
from sklearn.model_selection import KFold, cross_val_score

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
Selecting the best features:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
import matplotlib.pyplot as plt

# FamilySize, Title and Namelength are engineered features assumed to have been added earlier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "Title", "Namelength"]
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Plot each feature's score (-log10 of the p-value: higher = more informative)
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation="vertical")
plt.show()

# Keep only the strongest features and retrain
predictors = ["Pclass", "Sex", "Fare", "Title"]
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=8, min_samples_leaf=4)
Averaging the predictions of an ensemble of algorithms:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
import numpy as np

# Each entry pairs an algorithm with the feature columns it uses
# (the feature lists were left as placeholders in the original notes)
algorithms = [
    [GradientBoostingClassifier(random_state=1, n_estimators=25, max_depth=3), ["", "", ""]],
    [LogisticRegression(random_state=1), ["", "", ""]],
]
kf = KFold(n_splits=3)
predictions = []
for train, test in kf.split(titanic):
    train_target = titanic["Survived"].iloc[train]
    full_test_predictions = []
    for alg, predictors in algorithms:
        alg.fit(titanic[predictors].iloc[train, :], train_target)
        test_predictions = alg.predict_proba(titanic[predictors].iloc[test, :].astype(float))[:, 1]
        full_test_predictions.append(test_predictions)
    # Average the two models' predicted probabilities, then threshold
    test_predictions = (full_test_predictions[0] + full_test_predictions[1]) / 2
    test_predictions[test_predictions <= .5] = 0
    test_predictions[test_predictions > .5] = 1
    predictions.append(test_predictions)

predictions = np.concatenate(predictions, axis=0)
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)
print(accuracy)