Ensemble Learning Algorithms

Latest recommended article published 2024-11-12 12:29:51

cz? 帅哥:null

Tags: ensemble learning, algorithms, machine learning

Article link: https://blog.csdn.net/cghhbvcgjb/article/details/141958276

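Ensemble learning tackles a single prediction problem by building several models. It generates multiple classifiers/models, each of which learns and predicts independently. Their predictions are then merged into one combined prediction, which is typically more accurate than the prediction of any single classifier.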
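To see why combining predictions can help, here is a small worked calculation. It assumes something the text does not state: that the classifiers make their errors independently and all have the same accuracy. Under that assumption, the chance that a majority vote is correct follows a binomial distribution. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p_correct):
    """Probability that more than half of n independent classifiers are right.

    Assumes every classifier is correct with the same probability and that
    their errors are independent (an idealisation, rarely true in practice).
    """
    needed = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k) * p_correct**k * (1 - p_correct) ** (n_classifiers - k)
        for k in range(needed, n_classifiers + 1)
    )


# Each individual classifier is right 70% of the time.
for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(n, 0.7), 3))
```

With three such classifiers the vote is right with probability 0.7^3 + 3 * 0.7^2 * 0.3 = 0.784, already above the 0.7 of a single classifier. Real models are correlated, so the gain in practice is usually smaller.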

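Addressing underfitting: combine weak learners so that together they become strong (boosting)

Addressing overfitting: combine learners that hold one another in check so that the result is more robust (Bagging)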

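Bagging and random forests

Building a Bagging ensemble

Bagging: first draw several different data sets by sampling, train one classifier on each so that it can roughly separate the classes, then let the classifiers vote with equal weight to obtain the final result.

Sample, learn, combine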
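The sample, learn, combine loop can be sketched in plain Python. This is a minimal illustration, not the article's implementation: the base learner is a hypothetical one-dimensional threshold "stump", the toy data is made up, and the helper names (`bootstrap_sample`, `train_stump`, `bagging_predict`) are invented for this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Sample: draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Learn: fit a 1-D threshold classifier that predicts 1 when x >= threshold.

    Every observed x is tried as a threshold; the first one with the fewest
    training errors on this sample is kept.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_predict(thresholds, x):
    """Combine: equal-weight majority vote over all stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): the label is 1 when x >= 5.
data = [(x, int(x >= 5)) for x in range(10)]
rng = random.Random(0)

# One stump per bootstrap sample; an odd count avoids tied votes.
thresholds = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print("prediction for x=1:", bagging_predict(thresholds, 1))
print("prediction for x=8:", bagging_predict(thresholds, 8))
```

Individual stumps can pick slightly different thresholds because each sees a different bootstrap sample; the vote smooths out those differences.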

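Building a random forest

Random forest: a random forest is a classifier made up of multiple decision trees; its output class is the mode of the classes output by the individual trees.

Random forest = Bagging + decision trees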

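Key steps:

Draw one sample at random, with replacement, and repeat N times to form the training set for a tree (the same sample may appear more than once).

Randomly select m features, with m << M (M is the total number of features), and build a weak decision tree on them.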
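The two key steps can be sketched as index sampling. This is a hedged illustration only: `draw_tree_inputs` is a made-up helper, and it draws the feature subset once per tree, as the steps above describe. scikit-learn's `RandomForestClassifier` instead draws a fresh feature subset at every split, controlled by `max_features`.

```python
import random


def draw_tree_inputs(n_samples, n_features, m, rng):
    """Pick the row and feature indices used to train one tree."""
    # Step 1: bootstrap. N draws with replacement, so row indices may repeat.
    rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Step 2: choose m distinct features out of M, without replacement.
    features = sorted(rng.sample(range(n_features), m))
    return rows, features


rng = random.Random(22)
rows, features = draw_tree_inputs(n_samples=8, n_features=16, m=4, rng=rng)
print("row indices:", rows)
print("feature indices:", features)
```

Repeating this for every tree gives each one a different view of the data, which is what decorrelates the trees before their votes are combined.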

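Out-of-bag estimation

Because each tree in a random forest is trained on a sample drawn with replacement, some of the original data is never selected for that tree.

For each round of sampling, the data that was not selected is called the out-of-bag (OOB) data.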

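When the data set is large enough, the probability that any given sample ends up out-of-bag, after N draws with replacement from N samples, is

$$\left(1 - \frac{1}{N}\right)^{N} \to \frac{1}{e} \approx 0.368 \quad (N \to \infty)$$

so roughly 36.8% of the data is out-of-bag for each tree and can serve as a validation set for that base classifier.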
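The 0.368 figure can be checked numerically. The sketch below assumes the usual bootstrap setting of N draws with replacement from N samples, in which a given sample is missed by every draw with probability (1 - 1/N)^N. The function names are made up for this illustration.

```python
import math
import random


def oob_probability(n):
    """Chance that one given sample is never picked in n draws from n samples."""
    return (1 - 1 / n) ** n


def simulated_oob_fraction(n, rng):
    """Run one bootstrap of size n and return the share of samples never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


for n in (10, 100, 10_000):
    print(n, round(oob_probability(n), 4))
print("limit 1/e:", round(math.exp(-1), 4))
print("simulated:", round(simulated_oob_fraction(10_000, random.Random(0)), 3))
```

The closed-form value approaches 1/e as N grows, and a single simulated bootstrap of 10,000 samples should land close to the same fraction.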

Random forest API
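The API listing for this section did not survive in the text, so the following is only a sketch of the scikit-learn constructor parameters most relevant to the ideas above, not a full reference. The synthetic data from `make_classification` is made up for illustration, and the parameter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, made up for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    criterion="gini",     # split quality measure ("gini" or "entropy")
    max_depth=None,       # None: grow until leaves are pure or too small to split
    max_features="sqrt",  # number of features considered at each split
    bootstrap=True,       # sample rows with replacement for each tree
    oob_score=True,       # also score the forest on out-of-bag rows
    random_state=22,      # fix the randomness for reproducibility
)
rf.fit(X, y)

print("number of trees:", len(rf.estimators_))
print("OOB accuracy:", rf.oob_score_)
```

Setting `oob_score=True` requires `bootstrap=True`, since out-of-bag rows only exist when rows are sampled with replacement.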

Random forest prediction example

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# 1. Load the data
url = "https://hbiostat.org/data/repo/titanic.txt"
titan = pd.read_csv(url)

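# 2. Basic data preparation
# 2.1 Select the feature columns and the target column
x = titan[["pclass", "age", "sex"]].copy()
y = titan["survived"]
# 2.2 Fill missing ages with the mean age (assign back instead of using inplace on a slice)
x["age"] = x["age"].fillna(titan["age"].mean())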
# 2.3 Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22, test_size=0.2)
# 3. Feature engineering (dictionary feature extraction)
x_train = x_train.to_dict(orient="records")
x_test = x_test.to_dict(orient="records")
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
# Reuse the mapping fitted on the training set; do not refit on the test set
x_test = transfer.transform(x_test)

# Instantiate a random forest
rf = RandomForestClassifier()
# Tune hyperparameters with a grid search
param = {"n_estimators": [100, 120, 300], "max_depth": [3, 7, 11]}
gc = GridSearchCV(rf, param_grid=param, cv=3)
gc.fit(x_train, y_train)
y_pre = gc.predict(x_test)
# Accuracy of the best model on the held-out test set
ret1 = gc.score(x_test, y_test)
print(ret1)
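After a grid search like the one above, it is often useful to see which parameter combination was selected. The sketch below shows the `best_params_` and `best_score_` attributes of `GridSearchCV`. To stay self-contained it uses synthetic data from `make_classification` and a smaller grid rather than the Titanic data; those choices, and the variable name `search`, are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, made up for illustration only.
X, y = make_classification(n_samples=300, n_features=6, random_state=22)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=22, test_size=0.2
)

search = GridSearchCV(
    RandomForestClassifier(random_state=22),
    param_grid={"n_estimators": [10, 30], "max_depth": [3, 7]},
    cv=3,
)
search.fit(X_train, y_train)

# The combination with the highest mean cross-validated accuracy.
print("best parameters:", search.best_params_)
print("mean cross-validated accuracy:", search.best_score_)
# score() uses the best estimator, refitted on the whole training set.
print("test accuracy:", search.score(X_test, y_test))
```

Comparing `best_score_` with the test accuracy gives a rough sense of whether the tuned model generalises beyond the folds it was selected on.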