Data Analysis: Solving a Classification Problem with Random Forests

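The goal is to predict bearing faults from bearing vibration data. The dataset contains 792 records, and each record holds the vibration amplitude at 6000 time points. The labels fall into 10 classes: 0 means no fault, and 1~9 denote different fault types.
Download the training data from Baidu Netdisk:
Link: https://pan.baidu.com/s/1oKPwn_rAgA5pMk5geCdZKg
Extraction code: bqjb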

Binary classification

Relabel classes 1~9 as 1, which turns the multiclass problem into a binary one (fault vs. no fault).
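As a minimal sketch of this relabeling step, assuming the labels are already held in a NumPy array (the `labels` values here are made up for illustration), a single vectorized comparison does the same job as a per-row conditional:

```python
import numpy as np

# Hypothetical labels for illustration: 0 = no fault, 1~9 = fault types
labels = np.array([0, 3, 9, 1, 0, 7])

# Any positive label becomes 1; 0 stays 0
binary_labels = (labels > 0).astype(int)
print(binary_labels)  # [0 1 1 1 0 1]
```

The full listing that follows applies the same rule row by row while reading the CSV file.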

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import roc_curve, auc, roc_auc_score

x = []
y = []
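# Number of segments each record is split into
sub_item_num = 20
with open('train_data.csv', 'r', newline='') as rfile:
    items = csv.reader(rfile)
    next(items)  # skip the first row (presumably the header)
    for item in items:
        # y: merge fault labels 1~9 into a single class 1
        label = 1 if int(item[-1]) > 0 else 0
        y.append(label)
        # x: drop the first column and the label, then square the amplitudes
        # (np.float was removed in NumPy 1.24; the builtin float is equivalent)
        signal = np.array(item[1:-1], dtype=float) ** 2
        # One feature per segment: the maximum squared amplitude
        sub_items = np.array_split(signal, sub_item_num)
        x.append([np.max(sub_item) for sub_item in sub_items])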

# Split into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 10)

# Decision tree classifier
dtc = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
dtc = dtc.fit(xtrain, ytrain)
accuracy = dtc.score(xtest, ytest)
print('Decision tree accuracy:\n', accuracy)

# Random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Random forest accuracy:\n', accuracy)
print('First decision tree in the forest:\n', rfc.estimators_[0])
print('Class list:\n', rfc.classes_)
print('Number of classes:\n', rfc.n_classes_)
print('Predicted labels:\n', rfc.predict(xtest)[:10]) # first 10 samples only
print('Probability of each label:\n', rfc.predict_proba(xtest)[:10,:]) # first 10 samples only
print('Probability of label 1:\n', rfc.predict_proba(xtest)[:10,1]) # first 10 samples only
print('Feature importances:\n', rfc.feature_importances_)
print('Feature indices sorted by ascending importance:\n', np.argsort(rfc.feature_importances_))

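# n_estimators: number of decision trees in the forest
# min_samples_split: minimum number of samples a node must hold to be split further
# min_samples_leaf: minimum number of samples each leaf must hold; splits that would leave fewer are not made
# max_depth: maximum depth of each tree (None grows until the other limits stop it)
# cv: number of cross-validation folds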

# Tune the random forest hyperparameter n_estimators
param_test1 = {'n_estimators': range(1, 20, 1)}
gs1 = GridSearchCV(estimator=RandomForestClassifier(max_depth=8,
                                                    min_samples_leaf=10,
                                                    min_samples_split=20,
                                                    random_state=10),
                   param_grid=param_test1,
                   scoring='roc_auc',
                   cv=5)  # 5-fold cross-validation
gs1.fit(xtrain, ytrain)
print(gs1.best_params_, gs1.best_score_)

# Tune the random forest hyperparameter max_depth
param_test3 = {'max_depth': range(1, 10, 1)}
gs3 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    min_samples_leaf=10,
                                                    min_samples_split=20,
                                                    random_state=10),
                   param_grid=param_test3,
                   scoring='roc_auc',
                   cv=5)  # 5-fold cross-validation
gs3.fit(xtrain, ytrain)
print(gs3.best_params_, gs3.best_score_)

# Tune the random forest hyperparameters min_samples_leaf and min_samples_split
param_test2 = {'min_samples_leaf': range(1, 10, 1), 'min_samples_split': range(2, 20, 1)}
gs2 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    max_depth=2,
                                                    random_state=10),
                   param_grid=param_test2,
                   scoring='roc_auc',
                   cv=5)  # 5-fold cross-validation
gs2.fit(xtrain, ytrain)
print(gs2.best_params_, gs2.best_score_)

# Tune the random forest hyperparameters class_weight and criterion
param_test4 = {'class_weight': [None, 'balanced'], 'criterion': ['gini', 'entropy']}
gs4 = GridSearchCV(estimator=RandomForestClassifier(n_estimators=3,
                                                    max_depth=2,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    random_state=10),
                   param_grid=param_test4,
                   scoring='roc_auc',
                   cv=5)  # 5-fold cross-validation
gs4.fit(xtrain, ytrain)
print(gs4.best_params_, gs4.best_score_)

# Evaluate the best estimator on the test set with roc_auc_score
score = roc_auc_score(ytest, gs4.best_estimator_.predict_proba(xtest)[:,1])
print('roc_auc_score:\n', score)

# Random forest with the tuned hyperparameters
rfc = RandomForestClassifier(n_estimators=3, max_depth=2, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Tuned random forest accuracy:\n', accuracy)
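The feature extraction used above (square the amplitudes, split each record into 20 segments, keep the maximum of each segment) can also be written as a small reusable function. The sketch below is only an illustration: the function name `segment_max_energy` is hypothetical, and the signal is synthetic noise with one injected spike, not real bearing data.

```python
import numpy as np

def segment_max_energy(signal, n_segments=20):
    """Square the signal, split it into n_segments nearly equal parts,
    and return the maximum squared amplitude of each part."""
    energy = np.asarray(signal, dtype=float) ** 2
    return np.array([seg.max() for seg in np.array_split(energy, n_segments)])

# Synthetic stand-in for one 6000-point record (not real bearing data)
rng = np.random.default_rng(0)
signal = rng.normal(size=6000)
signal[2500] = 10.0  # injected spike; index 2500 falls in segment 2500 // 300 = 8

features = segment_max_energy(signal)
print(features.shape)           # (20,)
print(int(features.argmax()))   # 8
```

Each 6000-point record is reduced to 20 numbers this way, so a short burst of large amplitude anywhere in a segment dominates that segment's feature.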

Decision tree accuracy:
0.9957983193277311
Random forest accuracy:
0.9957983193277311
First decision tree in the forest:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1165313289, splitter='best')
Class list:
[0 1]
Number of classes:
2
Predicted labels:
[1 1 0 1 1 1 1 0 1 0]
Probability of each label:
[[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]]
Probability of label 1:
[1. 1. 0. 1. 1. 1. 1. 0. 1. 0.]
Feature importances:
[0. 0.28922611 0.09799994 0. 0.00417205 0.
0.0980406 0. 0.10085798 0.00506141 0. 0.09672961
0.002046 0.09802512 0.10439175 0.0042681 0. 0.09706366
0.00211767 0. ]
Feature indices sorted by ascending importance:
[ 0 16 10 7 5 19 3 12 18 4 15 9 11 17 2 13 6 8 14 1]
{'n_estimators': 3} 1.0
{'max_depth': 2} 1.0
{'min_samples_leaf': 1, 'min_samples_split': 2} 1.0
{'class_weight': None, 'criterion': 'gini'} 1.0
roc_auc_score:
0.9999509419152276
Tuned random forest accuracy:
0.9957983193277311
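The listing imports roc_curve and auc but only calls roc_auc_score. As a small sketch of how the three relate, the example below builds the ROC curve from made-up labels and scores (`y_true` and `y_score` are hypothetical, not taken from the bearing data) and checks that the area under that curve matches roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical ground truth and positive-class scores, for illustration only
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the false/true positive rates at each score threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the curve; roc_auc_score computes the same area in one call
area = auc(fpr, tpr)
print(area)                            # 0.75
print(roc_auc_score(y_true, y_score))  # 0.75
```

With the real model, the scores would be the second column of predict_proba on the test set, as in the roc_auc_score call above.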

Multiclass classification

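When the hyperparameters tuned for the binary problem are applied to the multiclass problem, the random forest's test accuracy drops from about 85% (with the initial settings) to about 39%. Parameters optimized for binary classification therefore do not carry over to the multiclass problem.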
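One plausible contributor to this drop, offered as an illustration rather than a diagnosis of this dataset, is the max_depth=2 setting: a tree of depth 2 has at most four leaves, so a single tree can output at most four of the ten classes. The sketch below uses synthetic one-dimensional data with ten well-separated clusters (an assumption made purely for the demonstration) to compare a depth-2 tree with an unrestricted one.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic 10-class data: one tight cluster per class (illustration only)
rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 30)
X = y[:, None] + rng.normal(scale=0.1, size=(300, 1))

shallow = DecisionTreeClassifier(max_depth=2, random_state=10).fit(X, y)
deep = DecisionTreeClassifier(max_depth=None, random_state=10).fit(X, y)

# A depth-2 tree cannot have more than 2**2 = 4 leaves
print(shallow.get_n_leaves())
# Training accuracy: the shallow tree is capped at 4 of 10 classes
print(shallow.score(X, y), deep.score(X, y))
```

A forest averages several such trees, so it can cover more than four classes, but with only three shallow trees the partition of the feature space stays coarse.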

import csv
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score

x = []
y = []
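# Number of segments each record is split into
sub_item_num = 20
with open('train_data.csv', 'r', newline='') as rfile:
    items = csv.reader(rfile)
    next(items)  # skip the first row (presumably the header)
    for item in items:
        # y: keep the original labels 0~9
        label = int(item[-1])
        y.append(label)
        # x: drop the first column and the label, then square the amplitudes
        # (np.float was removed in NumPy 1.24; the builtin float is equivalent)
        signal = np.array(item[1:-1], dtype=float) ** 2
        # One feature per segment: the maximum squared amplitude
        sub_items = np.array_split(signal, sub_item_num)
        x.append([np.max(sub_item) for sub_item in sub_items])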

# Split into training and test sets
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.3, random_state = 10)

# Decision tree classifier
dtc = DecisionTreeClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
dtc = dtc.fit(xtrain, ytrain)
accuracy = dtc.score(xtest, ytest)
print('Decision tree accuracy:\n', accuracy)

# Random forest classifier
rfc = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Random forest accuracy:\n', accuracy)
print('First decision tree in the forest:\n', rfc.estimators_[0])
print('Class list:\n', rfc.classes_)
print('Number of classes:\n', rfc.n_classes_)
print('Predicted labels:\n', rfc.predict(xtest)[:10]) # first 10 samples only
print('Probability of each label:\n', rfc.predict_proba(xtest)[:10,:]) # first 10 samples only
print('Probability of label 1:\n', rfc.predict_proba(xtest)[:10,1]) # first 10 samples only
print('Feature importances:\n', rfc.feature_importances_)
print('Feature indices sorted by ascending importance:\n', np.argsort(rfc.feature_importances_))

# Random forest with the hyperparameters tuned on the binary problem
rfc = RandomForestClassifier(n_estimators=3, max_depth=2, min_samples_leaf=1, min_samples_split=2, random_state=10)
rfc = rfc.fit(xtrain, ytrain)
accuracy = rfc.score(xtest, ytest)
print('Tuned random forest accuracy:\n', accuracy)

Decision tree accuracy:
0.6680672268907563
Random forest accuracy:
0.8529411764705882
First decision tree in the forest:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1165313289, splitter='best')
Class list:
[0 1 2 3 4 5 6 7 8 9]
Number of classes:
10
Predicted labels:
[4 4 0 7 7 3 4 0 6 0]
Probability of each label:
[[0. 0. 0. 0. 0.5 0. 0. 0.4 0. 0.1]
[0. 0. 0. 0. 0.8 0. 0. 0.1 0. 0.1]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.1 0.1 0. 0.4 0. 0.4]
[0. 0. 0. 0. 0. 0. 0.1 0.7 0. 0.2]
[0. 0.2 0.1 0.7 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0.7 0. 0. 0.3 0. 0. ]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.6 0.3 0. 0.1]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]]
Probability of label 1:
[0. 0. 0. 0. 0. 0.2 0. 0. 0. 0. ]
Feature importances:
[0.03396424 0.09647236 0.06529748 0.01845416 0.03670531 0.04392909
0.08551289 0.03128865 0.06090745 0.0263214 0.03080538 0.05375302
0.0860377 0.06233137 0.07279089 0.02700919 0.03563748 0.06931447
0.02881671 0.03465075]
Feature indices sorted by ascending importance:
[ 3 9 15 18 10 7 0 19 16 4 5 11 8 13 2 17 14 6 12 1]
Tuned random forest accuracy:
0.3949579831932773
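Since the binary settings do not transfer, the hyperparameters would need to be searched again on the 10-class labels. The sketch below shows one way this might look. It is not the author's method: it runs on synthetic data from make_classification (a stand-in with 20 features and 10 classes), searches two parameters jointly instead of one at a time, and scores by accuracy because the plain 'roc_auc' scorer targets binary problems. The grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 20-feature, 10-class data (not the bearing data)
X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_redundant=0, n_classes=10, n_clusters_per_class=1,
                           random_state=10)
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.3, random_state=10, stratify=y)

# Search tree count and depth together; the values are illustrative
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, None]}
gs = GridSearchCV(RandomForestClassifier(random_state=10),
                  param_grid=param_grid, scoring='accuracy', cv=3)
gs.fit(xtrain, ytrain)

print(gs.best_params_, gs.best_score_)
print(gs.score(xtest, ytest))  # accuracy of the refit best model on the test set
```

On the real data, the same pattern would apply with x and y taken from the multiclass listing above.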
