Data Analysis: Solving Classification Problems with Random Forests

wxsy024680

Column: Data Analysis. Tags: python, random forest, classification

Article link: https://blog.csdn.net/wxsy024680/article/details/116267153


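The task is to predict bearing faults from bearing vibration data. The dataset contains 792 samples, each holding the vibration amplitude at 6000 time points. The labels fall into 10 classes: 0 means no fault, and 1~9 denote different fault types.
The training data can be downloaded from Baidu Netdisk:
Link: https://pan.baidu.com/s/1oKPwn_rAgA5pMk5geCdZKg
Extraction code: bqjb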

Binary classification

Relabel classes 1~9 as 1, which turns the multiclass problem into a binary one.

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, auc, roc_auc_score

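x = []
y = []
# Number of segments each signal is split into (one feature per segment)
sub_item_num = 20
with open('train_data.csv', 'r') as rfile:
    items = csv.reader(rfile)
    for i, item in enumerate(items):
        if i >= 1:  # skip the first row (assumed to be the header)
            # Label: map fault classes 1~9 to 1, keep 0 as "no fault"
            label = 1 if int(item[-1]) > 0 else 0
            y.append(label)
            # Features: drop the first column and the trailing label column
            item = np.array(item)
            # np.float was removed in NumPy 1.24; the built-in float is equivalent
            item = item[1:-1].astype(float)
            item = item**2
            sub_items = np.array_split(item, sub_item_num)
            feature = []
            for sub_item in sub_items:
                # Maximum squared amplitude within the segment
                feature.append(np.max(sub_item))
            x.append(feature)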

# Split into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 10)

# Decision tree classifier
dtc = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
dtc = dtc.fit(xtrain, ytrain)
accuracy = dtc.score(xtest, ytest)
print('Decision tree accuracy:\n', accuracy)

# Random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Random forest accuracy:\n', accuracy)
print('First decision tree:\n', rfc.estimators_[0])
print('Class list:\n', rfc.classes_)
print('Number of classes:\n', rfc.n_classes_)
print('Predicted labels:\n', rfc.predict(xtest)[:10]) # first 10 rows only
print('Probability of each label:\n', rfc.predict_proba(xtest)[:10,:]) # first 10 rows only
print('Probability of label 1:\n', rfc.predict_proba(xtest)[:10,1]) # first 10 rows only
print('Feature importances:\n', rfc.feature_importances_)
# argsort sorts in ascending order, so the least important feature comes first
print('Feature importance ranking:\n', np.argsort(rfc.feature_importances_))

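# n_estimators: number of decision trees in the forest
# min_samples_split: minimum number of samples a node must hold before it can be split
# min_samples_leaf: minimum number of samples each leaf must hold; a split that would leave fewer is not made
# max_depth: maximum depth of each decision tree
# cv: number of cross-validation folds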

# Tune the random forest hyperparameter n_estimators
param_test1 = {'n_estimators': range(1, 20, 1)}
gs1 = GridSearchCV(estimator=RandomForestClassifier(max_depth=8,
                                                    min_samples_leaf=10,
                                                    min_samples_split=20,
                                                    random_state=10),
                   param_grid=param_test1,
                   scoring='roc_auc',
                   cv=5)  # cv is the number of cross-validation folds
gs1.fit(xtrain, ytrain)
print(gs1.best_params_, gs1.best_score_)

# Tune the random forest hyperparameter max_depth
param_test3 = {'max_depth': range(1, 10, 1)}
gs3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    min_samples_leaf=10,
                                                    min_samples_split=20,
                                                    random_state=10),
                   param_grid=param_test3,
                   scoring='roc_auc',
                   cv=5)  # cv is the number of cross-validation folds
gs3.fit(xtrain, ytrain)
print(gs3.best_params_, gs3.best_score_)

# Tune the random forest hyperparameters min_samples_leaf and min_samples_split
param_test2 = {'min_samples_leaf': range(1, 10, 1), 'min_samples_split': range(2, 20, 1)}
gs2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    max_depth=2,
                                                    random_state=10),
                   param_grid=param_test2,
                   scoring='roc_auc',
                   cv=5)
gs2.fit(xtrain, ytrain)
print(gs2.best_params_, gs2.best_score_)

# Tune the random forest hyperparameters class_weight and criterion
param_test4 = {'class_weight': [None, 'balanced'], 'criterion': ['gini', 'entropy']}
gs4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    max_depth=2,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    random_state=10),
                   param_grid=param_test4,
                   scoring='roc_auc',
                   cv=5)  # cv is the number of cross-validation folds
gs4.fit(xtrain, ytrain)
print(gs4.best_params_, gs4.best_score_)

# Evaluate on the test set with roc_auc_score
score = roc_auc_score(ytest, gs4.best_estimator_.predict_proba(xtest)[:,1])
print('roc_auc_score:\n', score)

# Random forest with the tuned hyperparameters
rfc = RandomForestClassifier(n_estimators=3, max_depth=2, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Tuned random forest accuracy:\n', accuracy)
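
The searches above tune one or two parameters at a time, with the others fixed by hand. As a rough sketch of an alternative, a single GridSearchCV can search several parameters jointly. The data below is synthetic (make_classification) and only stands in for the bearing features, and the grid values are illustrative assumptions, not the ones used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bearing features (assumption: 20 features, 2 classes)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10)

# One grid over several hyperparameters, so their combinations are searched jointly
param_grid = {
    'n_estimators': [3, 10, 30],
    'max_depth': [2, 4, 8],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(RandomForestClassifier(random_state=10),
                      param_grid=param_grid, scoring='roc_auc', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```

The cost grows with the product of the grid sizes (here 3 x 3 x 2 = 18 candidates, each fitted 5 times), which is the trade-off against tuning the parameters one after another.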

Decision tree accuracy:
0.9957983193277311
Random forest accuracy:
0.9957983193277311
First decision tree:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1165313289, splitter='best')
Class list:
[0 1]
Number of classes:
2
Predicted labels:
[1 1 0 1 1 1 1 0 1 0]
Probability of each label:
[[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]]
Probability of label 1:
[1. 1. 0. 1. 1. 1. 1. 0. 1. 0.]
Feature importances:
[0. 0.28922611 0.09799994 0. 0.00417205 0.
0.0980406 0. 0.10085798 0.00506141 0. 0.09672961
0.002046 0.09802512 0.10439175 0.0042681 0. 0.09706366
0.00211767 0. ]
Feature importance ranking:
[ 0 16 10 7 5 19 3 12 18 4 15 9 11 17 2 13 6 8 14 1]
{'n_estimators': 3} 1.0
{'max_depth': 2} 1.0
{'min_samples_leaf': 1, 'min_samples_split': 2} 1.0
{'class_weight': None, 'criterion': 'gini'} 1.0
roc_auc_score:
0.9999509419152276
Tuned random forest accuracy:
0.9957983193277311
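
The script imports roc_curve and auc but only calls roc_auc_score. As a minimal sketch, the same area can be computed from the curve itself. The labels and probabilities below are made up for illustration and are not taken from the bearing data.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6])

# roc_curve returns the false/true positive rates at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# auc integrates the curve with the trapezoidal rule
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, y_prob))
```

With real predictions, y_prob would be something like predict_proba(xtest)[:, 1], and fpr and tpr can be passed to plt.plot to draw the curve.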

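Multiclass classification

When the hyperparameters tuned for binary classification are applied to the multiclass problem, accuracy falls from 85% (the untuned random forest) to 39%, so the parameters tuned on the binary task do not suit the multiclass task.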
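
One likely reason is the depth limit: a tree with max_depth=2 has at most four leaves, so a single tree can output at most four distinct classes, while the task has ten. The sketch below checks this on a synthetic 10-class dataset (an assumption, not the bearing data).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 10-class stand-in for the bearing features
X, y = make_classification(n_samples=800, n_features=20, n_informative=10,
                           n_redundant=0, n_classes=10, n_clusters_per_class=1,
                           random_state=10)

# Same settings as the forest tuned on the binary task
shallow = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=10)
shallow.fit(X, y)

# Leaves per tree, and how many distinct classes each tree can predict on its own
leaves = [tree.get_n_leaves() for tree in shallow.estimators_]
classes_per_tree = [len(set(tree.predict(X))) for tree in shallow.estimators_]
print(leaves, classes_per_tree)
```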

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

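x = []
y = []
# Number of segments each signal is split into (one feature per segment)
sub_item_num = 20
with open('train_data.csv', 'r') as rfile:
    items = csv.reader(rfile)
    for i, item in enumerate(items):
        if i >= 1:  # skip the first row (assumed to be the header)
            # Label: keep the original class 0~9
            label = int(item[-1])
            y.append(label)
            # Features: drop the first column and the trailing label column
            item = np.array(item)
            # np.float was removed in NumPy 1.24; the built-in float is equivalent
            item = item[1:-1].astype(float)
            item = item**2
            sub_items = np.array_split(item, sub_item_num)
            feature = []
            for sub_item in sub_items:
                # Maximum squared amplitude within the segment
                feature.append(np.max(sub_item))
            x.append(feature)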

# Split into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 10)

# Decision tree classifier
dtc = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
dtc = dtc.fit(xtrain, ytrain)
accuracy = dtc.score(xtest, ytest)
print('Decision tree accuracy:\n', accuracy)

# Random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Random forest accuracy:\n', accuracy)
print('First decision tree:\n', rfc.estimators_[0])
print('Class list:\n', rfc.classes_)
print('Number of classes:\n', rfc.n_classes_)
print('Predicted labels:\n', rfc.predict(xtest)[:10]) # first 10 rows only
print('Probability of each label:\n', rfc.predict_proba(xtest)[:10,:]) # first 10 rows only
print('Probability of label 1:\n', rfc.predict_proba(xtest)[:10,1]) # first 10 rows only
print('Feature importances:\n', rfc.feature_importances_)
# argsort sorts in ascending order, so the least important feature comes first
print('Feature importance ranking:\n', np.argsort(rfc.feature_importances_))

# Random forest with the hyperparameters tuned on the binary task
rfc = RandomForestClassifier(n_estimators=3, max_depth=2, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Tuned random forest accuracy:\n', accuracy)
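
Both scripts build the features with the same per-row loop: square the signal, split it into 20 segments, and keep the maximum of each segment. A vectorized version might look like the sketch below. segment_max_energy is a hypothetical helper name, and the random array only stands in for the real amplitudes.

```python
import numpy as np

def segment_max_energy(signals, n_segments=20):
    """Square each sample, split every row into n_segments parts,
    and return the maximum of each part as one feature."""
    signals = np.asarray(signals, dtype=float)
    parts = np.array_split(signals ** 2, n_segments, axis=1)
    return np.stack([part.max(axis=1) for part in parts], axis=1)

# Stand-in data: 5 signals with 6000 time points each
rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 6000))

features = segment_max_energy(demo)
print(features.shape)  # (5, 20)
```

Squaring before taking the maximum makes large negative swings count as much as large positive ones, which is what the loop in the scripts does as well.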

Decision tree accuracy:
0.6680672268907563
Random forest accuracy:
0.8529411764705882
First decision tree:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1165313289, splitter='best')
Class list:
[0 1 2 3 4 5 6 7 8 9]
Number of classes:
10
Predicted labels:
[4 4 0 7 7 3 4 0 6 0]
Probability of each label:
[[0. 0. 0. 0. 0.5 0. 0. 0.4 0. 0.1]
[0. 0. 0. 0. 0.8 0. 0. 0.1 0. 0.1]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.1 0.1 0. 0.4 0. 0.4]
[0. 0. 0. 0. 0. 0. 0.1 0.7 0. 0.2]
[0. 0.2 0.1 0.7 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.7 0. 0. 0.3 0. 0. ]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.6 0.3 0. 0.1]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
Probability of label 1:
[0. 0. 0. 0. 0. 0.2 0. 0. 0. 0. ]
Feature importances:
[0.03396424 0.09647236 0.06529748 0.01845416 0.03670531 0.04392909
0.08551289 0.03128865 0.06090745 0.0263214 0.03080538 0.05375302
0.0860377 0.06233137 0.07279089 0.02700919 0.03563748 0.06931447
0.02881671 0.03465075]
Feature importance ranking:
[ 3 9 15 18 10 7 0 19 16 4 5 11 8 13 2 17 14 6 12 1]
Tuned random forest accuracy:
0.3949579831932773
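
Since the binary settings do not carry over, the multiclass model needs its own search. The sketch below shows one way this could look, using scoring='accuracy' because the plain 'roc_auc' scorer is defined for binary targets (scikit-learn also offers 'roc_auc_ovr' for the multiclass case). The data is synthetic and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 10-class stand-in for the bearing features
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_redundant=0, n_classes=10, n_clusters_per_class=1,
                           random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)

# Search directly on the 10-class labels instead of reusing the binary result
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, 8, None]}
search = GridSearchCV(RandomForestClassifier(random_state=10),
                      param_grid=param_grid, scoring='accuracy', cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)  # accuracy of the refitted best model
print(test_accuracy)
```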
