Random Forest Study Notes 1

程序媛爱学习

Article link: https://blog.csdn.net/weixin_44995835/article/details/98200040

from sklearn.ensemble import RandomForestClassifier

1. RandomForestClassifier()

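Parameter	Meaning
n_estimators	Number of trees in the forest. The default is 100 since scikit-learn 0.22 (it was 10 in earlier versions).
criterion	Function used to measure the quality of a split. Default "gini" (Gini impurity).
max_depth	Maximum depth of each tree: an int or None.
min_samples_split	Minimum number of samples required to split an internal node, default 2. An int is used directly as the count; a float is treated as a fraction, and ceil(min_samples_split * n_samples) is the minimum count for each split.
min_samples_leaf	Minimum number of samples required at a leaf node, default 1. int and float are handled as above.
min_weight_fraction_leaf	float. Minimum fraction of the total sample weight (over all input samples) required at a leaf node.
max_features	int, float, string, or None. Number of features considered when searching for the best split.
max_leaf_nodes	int or None. Trees are grown best-first with at most this many leaf nodes; None means no limit.
min_impurity_decrease	float. A node is split only if the split decreases impurity by at least this value.
min_impurity_split	float. Early-stopping threshold for tree growth: a node is split only if its impurity is above this threshold, otherwise it becomes a leaf. Deprecated in favor of min_impurity_decrease and removed in scikit-learn 1.0.
bootstrap	bool. Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. Default True.
oob_score	bool. Whether to use out-of-bag samples to estimate generalization accuracy. Default False.
n_jobs	Number of jobs to run in parallel for fit and predict.
random_state	If an int, random_state is the seed used by the random number generator; if a RandomState instance, it is the random number generator itself; if None, the generator is the RandomState instance used by np.random.
verbose	Controls the verbosity when fitting and predicting.
warm_start	When True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise, fit a whole new forest.
class_weight	Weights associated with classes, in the form {class_label: weight}.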
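
To make a few of these parameters concrete, here is a minimal sketch. The synthetic dataset from make_classification and the specific parameter values are assumptions chosen only for illustration; they are not from the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,       # number of trees
    criterion="gini",      # split quality measure
    max_depth=None,        # no depth limit
    min_samples_split=2,   # int: used as an absolute count
    min_samples_leaf=1,
    bootstrap=True,        # oob_score needs bootstrap sampling
    oob_score=True,        # estimate accuracy on out-of-bag samples
    n_jobs=-1,             # use all available cores
    random_state=0,        # reproducible results
)
forest.fit(X, y)

print(len(forest.estimators_))  # number of fitted trees
print(forest.oob_score_)        # out-of-bag accuracy estimate
```

With oob_score=True the forest reports an accuracy estimate without a separate validation split, because each tree is evaluated on the samples left out of its bootstrap sample.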

2. pandas drop()
axis=0 drops rows.
axis=1 drops columns.
For concrete examples, see:
https://blog.csdn.net/legalhighhigh/article/details/80546422
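
A small sketch of both axis values, assuming a toy DataFrame whose column names are made up for illustration:

```python
import pandas as pd

# Toy DataFrame, used only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

rows_dropped = df.drop([0, 2], axis=0)  # drop the rows with index labels 0 and 2
cols_dropped = df.drop(["b"], axis=1)   # drop column "b"

print(rows_dropped)
print(cols_dropped)
```

drop() returns a new DataFrame by default, so df itself is unchanged here.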

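3. train_test_split(): splitting the data into training and test sets
Usage (stratify takes the full label array, so that both splits keep the class proportions):

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.4, random_state=0, stratify=train_target)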

https://blog.csdn.net/jiushinayang/article/details/81098186
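
The effect of stratify can be checked on a toy dataset. The arrays below are assumptions made up for illustration (100 samples, 20% of them in the positive class):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features, 20 positive and 80 negative labels
train_data = np.arange(200).reshape(100, 2)
train_target = np.array([1] * 20 + [0] * 80)

X_train, X_test, y_train, y_test = train_test_split(
    train_data, train_target, test_size=0.4, random_state=0, stratify=train_target
)

print(X_train.shape, X_test.shape)    # sizes of the 60/40 split
print(y_train.mean(), y_test.mean())  # share of the positive class in each split
```

Passing the full label array to stratify keeps the positive-class share the same in the training and test sets.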

4. Read the data, replace "/" with NaN, and count the missing values in each feature
Code:

    import numpy as np
    import pandas as pd

    df = pd.read_csv('SCD_data.csv', encoding='gb2312')
    df = df.replace("/", np.nan)
    print(df.isnull().sum())
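
Since SCD_data.csv is not available here, the same steps can be tried on a small in-memory DataFrame. The column names and values below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV contents; "/" marks an unknown value
df = pd.DataFrame({"age": ["23", "/", "31"], "score": ["/", "/", "88"]})

df = df.replace("/", np.nan)   # turn the placeholder into a real missing value
missing = df.isnull().sum()    # number of missing values per column

print(missing)
```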

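5. When I tried to replace the missing values in a column with its mode, the NaN values were still there even though the statement had run.
On checking, the cause turned out to be the return type: a column can have more than one mode, so mode() returns a Series rather than a single value as mean() does. fillna() aligns a Series argument on the index, so rows whose index labels are not in that Series stay NaN.
Fix:

df[col] = df[col].fillna(df[col].mode()[0])     # take the first mode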
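
The behavior can be reproduced on a toy Series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Two modes ("a" and "b") and two missing values at index labels 2 and 5
s = pd.Series(["a", "b", np.nan, "a", "b", np.nan])

modes = s.mode()                   # a Series of all modes, with index 0, 1
filled_wrong = s.fillna(modes)     # aligned on index: labels 2 and 5 have no match
filled_right = s.fillna(modes[0])  # scalar: every missing value is filled

print(filled_wrong.isnull().sum())
print(filled_right.isnull().sum())
```

Passing the scalar modes[0] avoids index alignment altogether, which is why the fix above works.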

6. iloc: selecting data by integer position

data.iloc[0] # the first row
data.iloc[:,[0]] # all rows of column 0
data.iloc[[0,1],[0,1]] # rows 0 and 1, columns 0 and 1
data.iloc[:,:] # all of the data
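
A runnable version of these selections, assuming a toy DataFrame named data with made-up columns, shows what type each one returns:

```python
import pandas as pd

# Toy DataFrame, used only for illustration
data = pd.DataFrame({"x": [10, 20, 30], "y": [1, 2, 3], "z": [7, 8, 9]})

first_row = data.iloc[0]           # a Series holding the first row
first_col = data.iloc[:, [0]]      # a one-column DataFrame (list index keeps 2-D shape)
block = data.iloc[[0, 1], [0, 1]]  # a 2x2 DataFrame
everything = data.iloc[:, :]       # the whole DataFrame

print(type(first_row).__name__, first_col.shape, block.shape, everything.shape)
```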

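The code for the final result is as follows:

    import csv

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # df is the DataFrame loaded and cleaned in steps 4 and 5.
    # Split into training and test sets (column 0 is the label, the rest are features)
    X, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Evaluate feature importance with a random forest
    feat_labels = df.columns[1:]
    forest = RandomForestClassifier(n_estimators=10000, n_jobs=-1, random_state=0)
    forest.fit(X_train, y_train)
    # Gini importance: how much Gini impurity decreases when the trees split on a feature
    importances = forest.feature_importances_
    # Feature indices sorted from most to least important
    indices = np.argsort(importances)[::-1]
    final_labels = []
    for f in range(X_train.shape[1]):
        # Importance is based on the mean impurity decrease over the 10000 decision trees
        final_labels.append(feat_labels[indices[f]])
        print("%2d) %-*s %f" % (f + 1, 30, final_labels[f], importances[indices[f]]))

    # Write the results to a CSV file: one row of labels, one row of importances
    rows = [final_labels, importances[indices]]
    with open('test1.csv', 'w', newline='') as csv_file:
        # Get a csv writer object for the file
        writer = csv.writer(csv_file)
        # Write multiple rows
        writer.writerows(rows)

    # Plot feature importance, based on mean impurity decrease
    plt.title('Feature Importance-RandomForest')
    plt.bar(range(X_train.shape[1]), importances[indices], color='lightblue', align='center')
    plt.xticks(range(X_train.shape[1]), final_labels, rotation=90)
    plt.xlim([-1, X_train.shape[1]])
    plt.tight_layout()
    plt.show()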
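
To see what "mean impurity decrease over the trees" means in practice, the following sketch recomputes the forest's importances from its individual trees. The synthetic dataset and the smaller forest (100 trees) are assumptions for illustration; the manual average matches scikit-learn here because every tree in this example contains at least one split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_

# Average the per-tree importances, then normalize so they sum to 1
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(importances.sum())                 # importances are normalized
print(np.allclose(importances, manual))  # forest value vs. manual average
```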
