[Machine Learning] Random Forest: Theory and a Simple Python Implementation

How Random Forest Works

Video link: https://www.bilibili.com/video/BV1Ra4y1E752/?spm_id_from=333.337.search-card.all.click&vd_source=a542d98d483fd367e498fc1f04b5dc10

Definition

A random forest is a collection of decision trees.

It is a supervised machine-learning method, so it needs labelled data.

In an image, every pixel has a pixel value, and we can map different ranges of pixel values to different textures.

For example: air < 10; 11 < pore < 60; pyrite > 170.

These threshold rules form a decision tree, which has a root node, internal nodes, and leaf nodes.

Once all possible branches in our decision tree end in leaf nodes, we're done: we've trained a decision tree.
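The threshold rules above can be sketched as a chain of if/else decisions, which is what following one path through a trained decision tree amounts to. The "other" label for the value ranges the text does not cover (10-11 and 60-170) is an assumption added here for completeness:

```python
def classify_pixel(value):
    """Map a pixel value to a texture class using the thresholds above.

    Ranges not named in the text (10-11 and 60-170) are labelled
    'other' here purely as an assumption.
    """
    if value < 10:
        return "air"
    elif 11 <= value <= 60:
        return "pore"
    elif value > 170:
        return "pyrite"
    else:
        return "other"

print([classify_pixel(v) for v in [5, 30, 200, 100]])
# ['air', 'pore', 'pyrite', 'other']
```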

Choosing split points

A split point is the one that gives the best split of the input data.

To pick the node that gives the best split, use Gini impurity: the split candidate with the lowest (weighted) impurity in its child nodes wins.

Weakness of a single decision tree: it overfits, performing well on the training set but poorly on the test set.
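The Gini impurity mentioned above is one minus the sum of squared class proportions in a node; a minimal sketch:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_i^2).

    0 means the node is pure (one class); higher means more mixed.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; a 50/50 two-class node has impurity 0.5.
print(gini_impurity(["pore", "pore", "pore"]))        # 0.0
print(gini_impurity(["pore", "air", "pore", "air"]))  # 0.5
```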

How the forest works

Each tree sees only a random subset of the available features.

At each node, pick the feature that gives the best split of the data.

The final result is decided by a majority vote over all the decision trees.
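The majority vote over the individual trees' predictions can be sketched as:

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Final forest prediction: the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees vote; class 1 wins 2-to-1.
print(majority_vote([1, 0, 1]))  # 1
```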

Random forest code implementation

Video link: https://www.youtube.com/watch?v=YYjvkSJoui4

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

df = pd.read_csv("au_label.csv")
print(df.head())

sizes = df['label'].value_counts(sort=True)
print(sizes)
#counts the number of samples for each label
df.drop(["people"], axis=1, inplace=True)

#handle missing values
#df = df.dropna()

#convert non-numeric data to numeric
#e.g. good -> 1, bad -> 0
#df.loc[df.Productivity == "bad", "Productivity"] = 0
#df.loc[df.Productivity == "good", "Productivity"] = 1

#define dependent variable
Y = df["label"].values
Y = Y.astype("int")

#Define independent variables
#column_selection3 =  [' AU01_r', ' AU02_r', ' AU04_r', ' AU05_r', ' AU06_r', ' AU07_r', ' AU09_r', ' AU10_r', ' AU12_r', ' AU14_r', ' AU15_r', ' AU17_r', ' AU20_r', ' AU23_r', ' AU25_r', ' AU26_r',' AU45_r']
X = df.drop(labels = ["label"], axis = 1)

#now the data is ready

#split data into train and test datasets
#import the train/test split helper from sklearn
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4, random_state = 20)

#print(X_test)

#import the random forest model
from sklearn.ensemble import RandomForestClassifier
#define a model
model = RandomForestClassifier(n_estimators = 10, random_state = 30)
#train the model
model.fit(X_train, Y_train)
#predict on the test dataset
prediction_test = model.predict(X_test)
print(prediction_test)
#compare with Y_test, check if it is correct
from sklearn import metrics
print("Accuracy = ", metrics.accuracy_score(Y_test, prediction_test))
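Besides a single accuracy number, a confusion matrix shows which classes get mixed up. A minimal sketch with toy labels standing in for Y_test and prediction_test above (the toy values are an assumption, not real results):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels standing in for Y_test / prediction_test (assumption).
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.6 (3 of 5 correct)
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
```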

#figure out which feature is the most important
#find the features with the largest importance weights
#print(model.feature_importances_)
feature_list = list(X.columns)
feature_imp = pd.Series(model.feature_importances_, index = feature_list).sort_values(ascending=False)
print(feature_imp)
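The matplotlib import at the top is never used; a natural use for it is a bar chart of the sorted importances. A sketch with hypothetical importance values and AU feature names standing in for the real feature_imp computed above:

```python
import pandas as pd
from matplotlib import pyplot as plt

# Hypothetical importances standing in for model.feature_importances_
# (assumption; real values come from the trained model).
feature_imp = pd.Series([0.5, 0.3, 0.2], index=["AU01_r", "AU02_r", "AU04_r"])

feature_imp.plot(kind="bar")
plt.ylabel("Feature importance")
plt.tight_layout()
plt.savefig("feature_importance.png")  # or plt.show() in an interactive session
```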