[Machine Learning] Random Forest: Theory and a Simple Python Implementation

How Random Forest Works

Video link: https://www.bilibili.com/video/BV1Ra4y1E752/?spm_id_from=333.337.search-card.all.click&vd_source=a542d98d483fd367e498fc1f04b5dc10

Definition

A random forest is a collection of decision trees.

It is a supervised machine-learning method, so it needs labelled data.

In an image, every pixel has a pixel value, and we can map different ranges of pixel values to different textures.

For example: air < 10; 11 < pore < 60; pyrite > 170.

These threshold rules form a decision tree, which has a root node, internal nodes, and leaf nodes.

Once all possible branches in our decision tree end in leaf nodes, we're done: we've trained a decision tree.
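The threshold rules above can be sketched as a chain of if/else decisions, which is what following one path through a trained decision tree amounts to. The "other" label for the value ranges the text does not cover (10-11 and 60-170) is an assumption added here for completeness:

```python
def classify_pixel(value):
    """Map a pixel value to a texture class using the thresholds above.

    Ranges not named in the text (10-11 and 60-170) are labelled
    'other' here purely as an assumption.
    """
    if value < 10:
        return "air"
    elif 11 <= value <= 60:
        return "pore"
    elif value > 170:
        return "pyrite"
    else:
        return "other"

print([classify_pixel(v) for v in [5, 30, 200, 100]])
# ['air', 'pore', 'pyrite', 'other']
```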

Choosing split points

A split point is the one that gives the best split of the input data.

To pick the node that gives the best split, use Gini impurity: the split candidate with the lowest (weighted) impurity in its child nodes wins.

Weakness of a single decision tree: it overfits, performing well on the training set but poorly on the test set.
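The Gini impurity mentioned above is one minus the sum of squared class proportions in a node; a minimal sketch:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2).

    0 means the node is pure (one class); higher means more mixed.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 two-class node has impurity 0.5.
print(gini_impurity(["pore", "pore", "pore"]))        # 0.0
print(gini_impurity(["pore", "air", "pore", "air"]))  # 0.5
```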

How the forest works

Each tree sees only a random subset of the available features.

At each node, pick the feature that gives the best split of the data.

The final result is decided by a majority vote over all the decision trees.
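The majority vote over the individual trees' predictions can be sketched as:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Final forest prediction: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees vote; class 1 wins 2-to-1.
print(majority_vote([1, 0, 1]))  # 1
```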

Random forest code implementation

Video link: https://www.youtube.com/watch?v=YYjvkSJoui4

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

df = pd.read_csv("au_label.csv")
print(df.head())

sizes = df['label'].value_counts(sort=True)
print(sizes)
#counts the number of samples for each label
df.drop(["people"], axis=1, inplace=True)

#handle missing values
#df = df.dropna()

#convert non-numeric data to numeric
#e.g. good -> 1, bad -> 0
#df.loc[df.Productivity == "bad", "Productivity"] = 0
#df.loc[df.Productivity == "good", "Productivity"] = 1

#define dependent variable
Y = df["label"].values
Y = Y.astype("int")

#Define independent variables
#column_selection3 =  [' AU01_r', ' AU02_r', ' AU04_r', ' AU05_r', ' AU06_r', ' AU07_r', ' AU09_r', ' AU10_r', ' AU12_r', ' AU14_r', ' AU15_r', ' AU17_r', ' AU20_r', ' AU23_r', ' AU25_r', ' AU26_r',' AU45_r']
X = df.drop(labels = ["label"], axis = 1)

#now the data is ready

#split data into train and test datasets
#import the train/test split helper from sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 20)

#print(X_test)

#import the random forest model
from sklearn.ensemble import RandomForestClassifier
#define a model
model = RandomForestClassifier(n_estimators = 10, random_state = 30)
#train the model
model.fit(X_train, Y_train)
#predict on the test dataset
prediction_test = model.predict(X_test)
print(prediction_test)
#compare with Y_test, check if it is correct
from sklearn import metrics
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))
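Besides a single accuracy number, a confusion matrix shows which classes get mixed up. A minimal sketch with toy labels standing in for Y_test and prediction_test above (the toy values are an assumption, not real results):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for Y_test / prediction_test (assumption).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.6 (3 of 5 correct)
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```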

#figure out which feature is the most important
#find the features with the largest importance weights
#print(model.feature_importances_)
feature_list = list(X.columns)
feature_imp = pd.Series(model.feature_importances_, index = feature_list).sort_values(ascending=False)
print(feature_imp)
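The matplotlib import at the top is never used; a natural use for it is a bar chart of the sorted importances. A sketch with hypothetical importance values and AU feature names standing in for the real feature_imp computed above:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical importances standing in for model.feature_importances_
# (assumption; real values come from the trained model).
feature_imp = pd.Series([0.5, 0.3, 0.2], index=["AU01_r", "AU02_r", "AU04_r"])

feature_imp.plot(kind="bar")
plt.ylabel("Feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")  # or plt.show() in an interactive session
```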