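Experiment 5: Modeling and Prediction Based Entirely on Data Mining
October 11, 2021
In the previous experiments, Dr.Li said that about 45 features should be deleted outright (feature selection based on human experience). As a rookie with none of the experience a domain expert has, I found this puzzling: how do you tell that a sensor is broken (that requires a site visit), and why must these features be deleted outright? This experiment checks the claim with feature engineering methods.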
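Experiment plan:
- Data loading and preprocessing;
- Little preprocessing is needed; box plots, scatter-plot matrix, heatmap, and feature importance evaluation;
- Modeling and prediction: Lasso regression and random forest regression;
- Model evaluation: idea 1 (distance measures such as Euclidean distance and Hamming distance), plus multi-label classification metrics;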
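Additional note: the original modeling task is multi-label classification with 38 label columns. The novel idea here is to treat each sample's row of label values as a binary code, which turns the problem into a regression problem. A regression model is trained, its output is mapped back to a binary string, and that string gives the sample's 0/1 label values. This experiment is a feasibility check of that idea.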
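The encoding idea described above can be sketched as follows. This is a minimal illustration, assuming the first label column is the most significant bit (the source does not state the bit order); the helper names `labels_to_int` and `int_to_labels` are hypothetical.

```python
N_LABELS = 38  # number of label columns stated in the text

def labels_to_int(labels):
    """Read a 0/1 label row as a binary number (first label = most significant bit; an assumption)."""
    value = 0
    for bit in labels:
        value = (value << 1) | int(bit)
    return value

def int_to_labels(value, n_labels=N_LABELS):
    """Map an integer back to a fixed-length 0/1 label row."""
    return [(value >> shift) & 1 for shift in range(n_labels - 1, -1, -1)]

row = [0] * N_LABELS
row[17] = 1  # a single active label, position chosen only for illustration
code = labels_to_int(row)
print(code)                         # 1048576, i.e. 2**20
print(int_to_labels(code) == row)   # True: the mapping is reversible
```

The value 1048576 also appears in the `Y_dec` column, but which label it corresponds to in the real data depends on the bit order the author used.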
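Results:
- The coefficient of determination $R^2$ of the random forest model on the test set reached 0.92;
- The coefficient of determination $R^2$ of the random forest model on the training set reached 0.97;
- With the data-driven method, the feature selection result roughly agrees with the features removed on the basis of human experience; that is, most of the features deleted outright in the earlier experiments are also identified here;
- However, after converting the regression output to binary codes and evaluating with multi-label classification metrics, performance falls far short of the figures above, so the feasibility of the approach needs further consideration;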
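One possible reason for this gap, offered as an illustration rather than a diagnosis of this experiment: $R^2$ rewards numeric closeness, while multi-label metrics compare individual bits, and two numbers that are numerically close can differ in many bits. The helper `hamming_distance` below is hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")

true_code = 1048576        # 2**20, a value that appears in the Y_dec column
pred_code = true_code - 1  # a prediction that is off by just 1

relative_error = abs(pred_code - true_code) / true_code
print(relative_error)                          # below one in a million
print(hamming_distance(true_code, pred_code))  # 21 bit positions differ
```

A regression error of 1 on this target is negligible numerically, yet it flips 21 of the 38 label bits.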
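Thoughts:
- Compared with Experiments 3 and 4, and taking the results of Experiment 5 into account, using all features interferes little with model performance; that is, adding the remaining features does not appear to introduce extra noise;
- Experiments 3, 4, and 5 show that RF Regression performs about the same with 43, 85, or 92 features. Can the redundant features then simply be deleted?
- In big-data settings such as computer science, when model performance is comparable one can choose the smaller feature set without hesitation. But in high-risk industries such as civil engineering, is a purely data-driven approach viable?
- In machine learning or data mining, how can human experience be integrated organically into the whole process of model building and prediction?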
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
data = pd.read_csv("./Data_all_V2.csv")
data.head()
| | TIME | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | ... | K4-6 | K4-7 | K4-8 | K5-1 | K5-2 | K5-3 | K5-4 | K5-5 | K5-6 | Y_dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-07-19 00:00:00 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
1 | 2020-07-19 01:00:00 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
2 | 2020-07-19 02:00:00 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1048576 |
3 | 2020-07-19 03:00:00 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1048576 |
4 | 2020-07-19 04:00:00 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
5 rows × 170 columns
data.describe() # The stress data is fairly uniform; the extremes do not differ much
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | K4-6 | K4-7 | K4-8 | K5-1 | K5-2 | K5-3 | K5-4 | K5-5 | K5-6 | Y_dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 601.000000 | 601.000000 | 601.000000 | 601.000000 | 601.000000 | ... | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 623.000000 | 6.230000e+02 |
mean | 8.821620 | 8.584109 | 8.705866 | 8.692483 | 8.725204 | 0.637256 | 0.118895 | 1.245230 | 1.111804 | 1.206263 | ... | 0.154093 | 0.044944 | 0.065811 | 0.086677 | 0.115570 | 0.104334 | 0.081862 | 0.109149 | 0.048154 | 2.928224e+10 |
std | 3.064538 | 3.213162 | 3.134184 | 3.175944 | 3.098137 | 0.663537 | 0.187299 | 0.339832 | 0.354685 | 0.349192 | ... | 0.361328 | 0.207347 | 0.248150 | 0.281588 | 0.319965 | 0.305939 | 0.274375 | 0.312077 | 0.214264 | 5.544480e+10 |
min | 5.608333 | 5.415000 | 5.568833 | 5.520167 | 5.588667 | 0.100000 | 0.100000 | 0.456000 | 0.100000 | 0.430500 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 |
25% | 6.083667 | 5.524750 | 5.810583 | 5.778250 | 5.833250 | 0.100000 | 0.100000 | 0.993000 | 0.880500 | 0.903000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.099200e+06 |
50% | 7.419000 | 7.354167 | 7.417333 | 7.409500 | 7.466167 | 0.379000 | 0.100000 | 1.367500 | 1.142000 | 1.276500 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.368709e+08 |
75% | 11.877583 | 11.671000 | 11.764083 | 11.918000 | 11.606667 | 0.961500 | 0.100000 | 1.458000 | 1.310500 | 1.523000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.439343e+10 |
max | 14.251500 | 14.176000 | 14.207167 | 14.253500 | 14.155833 | 2.817000 | 3.077500 | 2.036500 | 4.603500 | 2.730500 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.190783e+11 |
8 rows × 169 columns
# Missing-value statistics: some values are missing
import missingno as msno
import seaborn as sns
msno.bar(data)
# Count missing values per column
def missing_values_table(df):
    # Total number of missing values in each column
    mis_val = df.isnull().sum()
    # Percentage of missing values in each column
    mis_val_percent = mis_val / len(df) * 100
    # Result table
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing values', 1: '% missing'})
    # Keep only columns with missing values, sorted by percentage in descending order
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% missing', ascending=False).round(1)
    # Print a summary
    print(str(df.shape[1]) + " columns in total,\n" +
          str(mis_val_table_ren_columns.shape[0]) +
          " of them have missing values.")
    return mis_val_table_ren_columns
missing_values_table(data)
170 columns in total,
100 of them have missing values.
| | Missing values | % missing |
---|---|---|
RF-1 | 22 | 3.5 |
RF-81 | 22 | 3.5 |
RF-95 | 22 | 3.5 |
RF-94 | 22 | 3.5 |
RF-93 | 22 | 3.5 |
... | ... | ... |
RF-50 | 9 | 1.4 |
RF-51 | 9 | 1.4 |
RF-58 | 9 | 1.4 |
RF-59 | 9 | 1.4 |
RF-103 | 9 | 1.4 |
100 rows × 2 columns
# Duplicate check: there are no duplicate rows
data.duplicated().value_counts()
False 623
dtype: int64
# Outlier check
data[['XC-1']].boxplot()
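Feature selection 1: removing data from faulty sensors
Yao-ge said that the sensors here are broken and that these features should simply be deleted.
I have not yet figured out how a sensor fault, and its cause, is determined.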
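# The features from the sensors Yao-ge called damaged are not deleted here; all features are used, i.e. a fully data-driven approach
# Nothing is deleted for now; all features go into the importance evaluation later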
# del_list = [1,2,6,10,12,13,14,15,16,17,23,24,28,29,36,37,39,42,43,48,51,54,58,59,60,62,66,67,69,70,72,74,76,77,85,87,89,90,92,94,97,104,105,108,116]
# del_col = ["RF-"+str(i) for i in del_list]
# data_new = data.drop(del_col,axis=1)
# Drop rows with missing values. In Experiment 1 the missing values were filled instead, yet the same model then overfit outright. Why?
data_new = data.dropna()
data_new.shape
(601, 170)
X = data_new.iloc[:,1:131]
X.head()
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | RF-116 | RF-117 | RF-118 | RF-119 | RF-120 | RF-121 | RF-122 | RF-123 | RF-124 | RF-125 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | 1.5110 | ... | 0.6505 | 0.3530 | 0.6080 | 0.5690 | 2.5085 | 2.5195 | 1.9170 | 3.4460 | 2.9660 | 2.2025 |
1 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | 1.5110 | ... | 0.6545 | 0.3550 | 0.6110 | 0.5700 | 2.5035 | 2.5055 | 2.8310 | 3.4210 | 2.9560 | 2.1920 |
2 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | 1.5055 | ... | 0.6570 | 0.3595 | 0.6145 | 0.5725 | 2.4950 | 2.4910 | 2.6855 | 3.3850 | 2.9385 | 2.1760 |
3 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | 1.4990 | ... | 0.6590 | 0.3620 | 0.6195 | 0.5750 | 2.4885 | 2.4870 | 4.0795 | 3.3595 | 2.9250 | 2.1605 |
4 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | 1.4960 | ... | 0.6630 | 0.3630 | 0.6240 | 0.5785 | 2.4845 | 2.4905 | 3.4460 | 3.3470 | 2.9160 | 2.1595 |
5 rows × 130 columns
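# The label values span a very wide range; option 1 is min-max normalization
# Y = (data_new['Y_dec']-data_new['Y_dec'].min())/(data_new['Y_dec'].max()-data_new['Y_dec'].min())
# Option 2 is a log transform of the target column
# data_new["Y_dec"] = np.log(data_new["Y_dec"].values+1)
# Y = data_new["Y_dec"]
# Neither is applied here; Y is taken as it is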
Y = (data_new['Y_dec'])
Y
0 0
1 0
2 1048576
3 1048576
4 0
...
596 43017177090
597 43017177106
598 43017177106
599 43017177106
600 43017177106
Name: Y_dec, Length: 601, dtype: int64
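The commented-out log transform above can be sketched with the standard library as follows. `math.log1p` and `math.expm1` are used as a pair so that zero targets are handled and the transform can be inverted after prediction. Whether this transform helps the model is not tested in this experiment.

```python
import math

# A few target values taken from the Y_dec output above
targets = [0, 1048576, 43017177090]

# log(1 + y) compresses the range while keeping 0 at 0
log_targets = [math.log1p(y) for y in targets]

# After predicting in log space, expm1 maps values back to the original scale
restored = [round(math.expm1(v)) for v in log_targets]

print(log_targets)
print(restored == targets)
```

The transformed values stay within a few tens of units, compared with a raw range of 0 to roughly 2e11.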
Feature selection 2: heatmap
print(X.columns)
Index(['XC-1', 'XC-2', 'XC-3', 'XC-4', 'XC-5', 'RF-1', 'RF-2', 'RF-3', 'RF-4',
'RF-5',
...
'RF-116', 'RF-117', 'RF-118', 'RF-119', 'RF-120', 'RF-121', 'RF-122',
'RF-123', 'RF-124', 'RF-125'],
dtype='object', length=130)
# For convenience later, concatenate X and Y
data_new = pd.concat([X, Y], axis=1)
data_new.shape
(601, 131)
data_new.head()
| | XC-1 | XC-2 | XC-3 | XC-4 | XC-5 | RF-1 | RF-2 | RF-3 | RF-4 | RF-5 | ... | RF-117 | RF-118 | RF-119 | RF-120 | RF-121 | RF-122 | RF-123 | RF-124 | RF-125 | Y_dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.734000 | 5.426833 | 5.575333 | 5.534667 | 5.606333 | 0.1 | 0.1 | 1.503 | 1.2695 | 1.5110 | ... | 0.3530 | 0.6080 | 0.5690 | 2.5085 | 2.5195 | 1.9170 | 3.4460 | 2.9660 | 2.2025 | 0 |
1 | 5.730667 | 5.418667 | 5.571167 | 5.531667 | 5.603667 | 0.1 | 0.1 | 1.484 | 1.2685 | 1.5110 | ... | 0.3550 | 0.6110 | 0.5700 | 2.5035 | 2.5055 | 2.8310 | 3.4210 | 2.9560 | 2.1920 | 0 |
2 | 5.737833 | 5.424500 | 5.577833 | 5.537167 | 5.611667 | 0.1 | 0.1 | 1.443 | 1.2610 | 1.5055 | ... | 0.3595 | 0.6145 | 0.5725 | 2.4950 | 2.4910 | 2.6855 | 3.3850 | 2.9385 | 2.1760 | 1048576 |
3 | 5.730833 | 5.418333 | 5.571000 | 5.525500 | 5.609333 | 0.1 | 0.1 | 1.474 | 1.2570 | 1.4990 | ... | 0.3620 | 0.6195 | 0.5750 | 2.4885 | 2.4870 | 4.0795 | 3.3595 | 2.9250 | 2.1605 | 1048576 |
4 | 5.740000 | 5.429167 | 5.581000 | 5.549500 | 5.606333 | 0.1 | 0.1 | 1.489 | 1.2525 | 1.4960 | ... | 0.3630 | 0.6240 | 0.5785 | 2.4845 | 2.4905 | 3.4460 | 3.3470 | 2.9160 | 2.1595 | 0 |
5 rows × 131 columns
len(data_new.columns)
131
data_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 601 entries, 0 to 600
Columns: 131 entries, XC-1 to Y_dec
dtypes: float64(130), int64(1)
memory usage: 619.8 KB
import numpy as np
from mlxtend.plotting import heatmap
import matplotlib.pyplot as plt
cols = data_new.columns[:20]
cm = np.corrcoef(data_new[cols].values.T)
hm = heatmap(cm, row_names=cols, column_names=cols, column_name_rotation=45, figsize=(15, 15))
plt.savefig('./heatmap-1.png', dpi=300)
plt.show()
cm = np.corrcoef(data_new[data_new.columns].values.T)
print(cm)
[[1. 0.99801785 0.99838858 ... 0.8491032 0.49766594 0.28886902]
[0.99801785 1. 0.9986607 ... 0.84473304 0.48884636 0.30590216]
[0.99838858 0.9986607 1. ... 0.84974265 0.49435048 0.30096088]
...
[0.8491032 0.84473304 0.84974265 ... 1. 0.59760806 0.19152249]
[0.49766594 0.48884636 0.49435048 ... 0.59760806 1. 0.10425281]
[0.28886902 0.30590216 0.30096088 ... 0.19152249 0.10425281 1. ]]
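The matrix above contains coefficients close to 1 between the first columns (about 0.998 between the first two), which suggests near-duplicate features. A sketch of how such pairs could be listed is shown below on small synthetic data; the 0.95 cutoff and the helper name `highly_correlated_pairs` are assumptions, not part of the original notebook.

```python
import numpy as np

def highly_correlated_pairs(values, names, cutoff=0.95):
    """Return (name_a, name_b, r) for column pairs with |r| >= cutoff."""
    cm = np.corrcoef(values.T)  # same call as used for the heatmap
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(cm[i, j]) >= cutoff:
                pairs.append((names[i], names[j], float(cm[i, j])))
    return pairs

# Synthetic stand-in data: column "b" is a scaled copy of "a", "c" is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
c = rng.normal(size=200)
values = np.column_stack([a, 2.0 * a + 1.0, c])

pairs = highly_correlated_pairs(values, ["a", "b", "c"])
print(pairs)
```

On the real data the call would be `highly_correlated_pairs(data_new[cols].values, list(cols))`, with `cols` chosen as in the heatmap cell.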
Feature selection 3: recursive feature elimination / L1-based feature selection / sequential backward selection / feature importance
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train.shape
(480, 130)
from sklearn.ensemble import RandomForestRegressor
feat_labels = data_new.columns[:130]
forest = RandomForestRegressor(random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
# Print the features ranked by importance, highest first
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
plt.rcParams['font.sans-serif']=['FangSong']
plt.figure(figsize=(18, 5))
plt.title('Feature importance')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('images/04_09.png', dpi=300)
plt.show()
1) RF-92 0.200760
2) RF-10 0.180728
3) RF-87 0.051799
4) RF-95 0.028264
5) RF-24 0.028129
6) RF-73 0.022808
7) RF-100 0.020915
8) RF-82 0.018629
9) RF-9 0.018139
10) RF-86 0.017640
11) RF-23 0.015947
12) RF-88 0.014931
13) RF-84 0.014429
14) RF-30 0.014212
15) RF-26 0.013331
16) RF-125 0.012515
17) RF-52 0.012292
18) RF-83 0.012036
19) RF-7 0.011987
20) RF-34 0.011529
21) RF-21 0.011237
22) RF-33 0.011077
23) RF-31 0.010535
24) RF-67 0.010051
25) RF-123 0.010015
26) RF-32 0.009059
27) RF-79 0.009013
28) RF-39 0.008055
29) RF-96 0.007899
30) XC-2 0.007668
31) RF-109 0.007457
32) RF-18 0.006936
33) RF-47 0.006934
34) RF-49 0.006903
35) RF-25 0.006135
36) RF-38 0.005785
37) XC-1 0.004617
38) RF-20 0.004595
39) RF-107 0.004383
40) XC-5 0.004339
41) RF-5 0.004297
42) RF-11 0.004213
43) RF-8 0.004203
44) RF-120 0.004015
45) RF-27 0.003573
46) RF-64 0.003397
47) RF-74 0.003380
48) RF-93 0.003163
49) RF-97 0.002960
50) RF-118 0.002928
51) RF-54 0.002877
52) RF-112 0.002807
53) RF-44 0.002696
54) RF-45 0.002664
55) RF-40 0.002663
56) RF-48 0.002653
57) RF-69 0.002637
58) RF-43 0.002630
59) RF-57 0.002541
60) RF-22 0.002534
61) RF-55 0.002529
62) RF-65 0.002446
63) RF-58 0.002433
64) XC-4 0.002097
65) RF-104 0.002049
66) RF-15 0.002011
67) RF-101 0.001984
68) RF-80 0.001779
69) RF-4 0.001730
70) RF-78 0.001726
71) RF-99 0.001721
72) RF-124 0.001704
73) RF-50 0.001680
74) RF-56 0.001673
75) RF-1 0.001652
76) RF-121 0.001644
77) RF-102 0.001643
78) RF-60 0.001586
79) RF-53 0.001452
80) RF-42 0.001424
81) RF-111 0.001396
82) RF-91 0.001379
83) RF-35 0.001346
84) RF-17 0.001324
85) RF-122 0.001320
86) RF-103 0.001314
87) RF-41 0.001214
88) RF-117 0.001205
89) RF-16 0.001203
90) RF-46 0.001163
91) RF-119 0.001126
92) RF-113 0.001126
93) RF-63 0.001057
94) RF-71 0.001048
95) RF-61 0.000935
96) RF-19 0.000858
97) RF-114 0.000819
98) RF-85 0.000786
99) RF-70 0.000708
100) RF-116 0.000691
101) RF-105 0.000685
102) RF-98 0.000570
103) RF-3 0.000515
104) RF-106 0.000429
105) RF-110 0.000421
106) RF-81 0.000414
107) RF-115 0.000270
108) RF-6 0.000260
109) XC-3 0.000243
110) RF-36 0.000191
111) RF-68 0.000176
112) RF-75 0.000128
113) RF-77 0.000108
114) RF-76 0.000046
115) RF-13 0.000010
116) RF-72 0.000005
117) RF-29 0.000002
118) RF-2 0.000001
119) RF-12 0.000000
120) RF-94 0.000000
121) RF-51 0.000000
122) RF-90 0.000000
123) RF-59 0.000000
124) RF-37 0.000000
125) RF-14 0.000000
126) RF-108 0.000000
127) RF-62 0.000000
128) RF-28 0.000000
129) RF-66 0.000000
130) RF-89 0.000000
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.001057, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features finally selected:',
      X_selected.shape[1])
Number of features finally selected: 92
# The selected features are the top entries of the importance ranking
list1 = []
for f in range(X_selected.shape[1]):
    list1.append(feat_labels[indices[f]])
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
1) RF-92 0.200760
2) RF-10 0.180728
3) RF-87 0.051799
4) RF-95 0.028264
5) RF-24 0.028129
6) RF-73 0.022808
7) RF-100 0.020915
8) RF-82 0.018629
9) RF-9 0.018139
10) RF-86 0.017640
11) RF-23 0.015947
12) RF-88 0.014931
13) RF-84 0.014429
14) RF-30 0.014212
15) RF-26 0.013331
16) RF-125 0.012515
17) RF-52 0.012292
18) RF-83 0.012036
19) RF-7 0.011987
20) RF-34 0.011529
21) RF-21 0.011237
22) RF-33 0.011077
23) RF-31 0.010535
24) RF-67 0.010051
25) RF-123 0.010015
26) RF-32 0.009059
27) RF-79 0.009013
28) RF-39 0.008055
29) RF-96 0.007899
30) XC-2 0.007668
31) RF-109 0.007457
32) RF-18 0.006936
33) RF-47 0.006934
34) RF-49 0.006903
35) RF-25 0.006135
36) RF-38 0.005785
37) XC-1 0.004617
38) RF-20 0.004595
39) RF-107 0.004383
40) XC-5 0.004339
41) RF-5 0.004297
42) RF-11 0.004213
43) RF-8 0.004203
44) RF-120 0.004015
45) RF-27 0.003573
46) RF-64 0.003397
47) RF-74 0.003380
48) RF-93 0.003163
49) RF-97 0.002960
50) RF-118 0.002928
51) RF-54 0.002877
52) RF-112 0.002807
53) RF-44 0.002696
54) RF-45 0.002664
55) RF-40 0.002663
56) RF-48 0.002653
57) RF-69 0.002637
58) RF-43 0.002630
59) RF-57 0.002541
60) RF-22 0.002534
61) RF-55 0.002529
62) RF-65 0.002446
63) RF-58 0.002433
64) XC-4 0.002097
65) RF-104 0.002049
66) RF-15 0.002011
67) RF-101 0.001984
68) RF-80 0.001779
69) RF-4 0.001730
70) RF-78 0.001726
71) RF-99 0.001721
72) RF-124 0.001704
73) RF-50 0.001680
74) RF-56 0.001673
75) RF-1 0.001652
76) RF-121 0.001644
77) RF-102 0.001643
78) RF-60 0.001586
79) RF-53 0.001452
80) RF-42 0.001424
81) RF-111 0.001396
82) RF-91 0.001379
83) RF-35 0.001346
84) RF-17 0.001324
85) RF-122 0.001320
86) RF-103 0.001314
87) RF-41 0.001214
88) RF-117 0.001205
89) RF-16 0.001203
90) RF-46 0.001163
91) RF-119 0.001126
92) RF-113 0.001126
print(list1)
['RF-92', 'RF-10', 'RF-87', 'RF-95', 'RF-24', 'RF-73', 'RF-100', 'RF-82', 'RF-9', 'RF-86', 'RF-23', 'RF-88', 'RF-84', 'RF-30', 'RF-26', 'RF-125', 'RF-52', 'RF-83', 'RF-7', 'RF-34', 'RF-21', 'RF-33', 'RF-31', 'RF-67', 'RF-123', 'RF-32', 'RF-79', 'RF-39', 'RF-96', 'XC-2', 'RF-109', 'RF-18', 'RF-47', 'RF-49', 'RF-25', 'RF-38', 'XC-1', 'RF-20', 'RF-107', 'XC-5', 'RF-5', 'RF-11', 'RF-8', 'RF-120', 'RF-27', 'RF-64', 'RF-74', 'RF-93', 'RF-97', 'RF-118', 'RF-54', 'RF-112', 'RF-44', 'RF-45', 'RF-40', 'RF-48', 'RF-69', 'RF-43', 'RF-57', 'RF-22', 'RF-55', 'RF-65', 'RF-58', 'XC-4', 'RF-104', 'RF-15', 'RF-101', 'RF-80', 'RF-4', 'RF-78', 'RF-99', 'RF-124', 'RF-50', 'RF-56', 'RF-1', 'RF-121', 'RF-102', 'RF-60', 'RF-53', 'RF-42', 'RF-111', 'RF-91', 'RF-35', 'RF-17', 'RF-122', 'RF-103', 'RF-41', 'RF-117', 'RF-16', 'RF-46', 'RF-119', 'RF-113']
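To check how far the data-driven selection agrees with the expert's deletion list, the two can be compared as sets. The sketch below uses toy feature names only; in the notebook the inputs would be the commented-out `del_col` list and the features that are absent from `list1`. The helper `agreement` is hypothetical.

```python
def agreement(expert_deleted, data_driven_dropped):
    """Compare two collections of removed feature names."""
    expert = set(expert_deleted)
    dropped = set(data_driven_dropped)
    both = expert & dropped
    return {
        "both": sorted(both),
        "expert_only": sorted(expert - dropped),
        "data_only": sorted(dropped - expert),
        # share of the expert's deletions that the data-driven step also drops
        "share_of_expert_list": len(both) / len(expert) if expert else 0.0,
    }

# Toy inputs for illustration only (not the real result of this experiment)
expert_deleted = ["RF-a", "RF-b", "RF-c", "RF-d"]
data_driven_dropped = ["RF-b", "RF-c", "RF-e"]

result = agreement(expert_deleted, data_driven_dropped)
print(result["both"])                  # ['RF-b', 'RF-c']
print(result["share_of_expert_list"])  # 0.5
```

Reporting the overlap as a number makes the "roughly agrees" statement checkable, and the `expert_only` entries show which expert-flagged sensors the model still relies on.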
Re-select the features and redefine X
X = data_new[list1]
X.columns
Index(['RF-92', 'RF-10', 'RF-87', 'RF-95', 'RF-24', 'RF-73', 'RF-100', 'RF-82',
'RF-9', 'RF-86', 'RF-23', 'RF-88', 'RF-84', 'RF-30', 'RF-26', 'RF-125',
'RF-52', 'RF-83', 'RF-7', 'RF-34', 'RF-21', 'RF-33', 'RF-31', 'RF-67',
'RF-123', 'RF-32', 'RF-79', 'RF-39', 'RF-96', 'XC-2', 'RF-109', 'RF-18',
'RF-47', 'RF-49', 'RF-25', 'RF-38', 'XC-1', 'RF-20', 'RF-107', 'XC-5',
'RF-5', 'RF-11', 'RF-8', 'RF-120', 'RF-27', 'RF-64', 'RF-74', 'RF-93',
'RF-97', 'RF-118', 'RF-54', 'RF-112', 'RF-44', 'RF-45', 'RF-40',
'RF-48', 'RF-69', 'RF-43', 'RF-57', 'RF-22', 'RF-55', 'RF-65', 'RF-58',
'XC-4', 'RF-104', 'RF-15', 'RF-101', 'RF-80', 'RF-4', 'RF-78', 'RF-99',
'RF-124', 'RF-50', 'RF-56', 'RF-1', 'RF-121', 'RF-102', 'RF-60',
'RF-53', 'RF-42', 'RF-111', 'RF-91', 'RF-35', 'RF-17', 'RF-122',
'RF-103', 'RF-41', 'RF-117', 'RF-16', 'RF-46', 'RF-119', 'RF-113'],
dtype='object')
# Split into training and test sets again
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train.shape
(480, 92)
Lasso regression
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.00001)
reg.fit(X_train,y_train)
D:\installation\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:532: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 1.6435372468214974e+23, tolerance: 1.4973870832681085e+20
positive)
Lasso(alpha=1e-05)
# How was the reference value 0.12467 determined here?
plt.plot(np.abs(reg.coef_),'--o')
plt.axhline(y=0.02,c='r')
np.mean(np.abs(reg.coef_))
18311770661.999977
reg.score(X_test,y_test)
0.732052935705332
reg.score(X_train,y_train)
0.7804793075636222
pred = reg.predict(X_test)
# Some of the predictions are negative
pred
array([-5.91279193e+09, 1.52970644e+11, 1.21356713e+11, 1.61144231e+09,
-2.76269753e+09, 1.66805503e+11, -2.55209259e+09, 6.57431795e+10,
2.09392306e+09, 3.00407735e+10, 5.10659695e+07, 1.44706078e+10,
2.75700557e+09, 8.36149121e+09, 1.09271166e+10, 1.53840924e+10,
4.89804223e+09, -3.27441451e+09, 1.01305009e+10, 5.01193482e+09,
1.13311659e+10, 7.10822012e+09, -1.91037528e+10, 1.12133120e+10,
2.94543418e+10, 3.37137129e+10, 6.84364644e+09, 7.92962844e+09,
1.39378767e+11, 6.76242943e+10, 6.81095813e+09, 2.35075869e+08,
-9.01481269e+09, -3.12739062e+09, 3.70687644e+10, 8.92374996e+10,
-9.79097338e+09, -1.36527924e+10, 2.95734998e+10, 1.21358170e+11,
3.69019313e+10, -7.57923541e+08, -8.41263434e+09, 7.08175212e+10,
-2.16577354e+10, 2.28445865e+09, 4.30734803e+10, -1.39146317e+10,
-8.80558734e+09, 1.21641591e+11, -1.02231574e+10, -2.09872847e+09,
1.00209332e+09, 1.35053067e+11, 2.42631162e+09, -1.21676250e+10,
-1.04680970e+10, 2.41489349e+10, 2.87759396e+09, -9.82787928e+09,
7.25166935e+08, 7.30268846e+09, 8.89993429e+10, 5.27150258e+10,
6.40587784e+10, 6.49945475e+08, 9.59509662e+10, 7.83788370e+09,
-6.21577580e+09, 7.91980402e+09, -2.44524065e+10, -2.00705094e+09,
1.53761308e+11, 1.37174202e+10, 3.12556649e+10, 1.22462383e+10,
-1.37108599e+10, 1.19086132e+10, 1.17209150e+11, 1.58853291e+10,
1.89075897e+11, 1.14837540e+11, 1.77801946e+10, 4.35574632e+10,
1.71481394e+10, -1.93642884e+10, 1.16004011e+10, -1.43264415e+10,
-4.38351832e+10, -9.74000150e+09, 1.16270513e+10, 5.39487296e+10,
1.88127992e+09, -1.13001965e+10, -1.69820432e+09, 1.52138838e+11,
-2.58193379e+10, -3.93855871e+09, 1.30366979e+11, -1.90334073e+10,
9.72897998e+09, -1.11751827e+10, 1.58294676e+10, 7.61421260e+10,
5.90676211e+07, 2.31508892e+10, -4.34778130e+09, 1.70370590e+11,
1.93214298e+10, 1.33169111e+11, 1.54414149e+10, 5.60941514e+10,
3.17653949e+10, 6.26557415e+09, -4.92238959e+10, 9.81001782e+10,
3.71165210e+10, 1.70241027e+10, 7.99393150e+10, 1.77629873e+09,
-1.55065335e+10])
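Because the target encodes 38 labels, valid values lie between 0 and 2**38 - 1, so negative predictions such as those above cannot be turned into a label string directly. One option, shown here as a sketch, is to round and clip before conversion. Clipping is an assumption of this example, not something the original experiment does, and the helper name `to_valid_code` is hypothetical.

```python
N_LABELS = 38
MAX_CODE = 2 ** N_LABELS - 1  # largest value representable with 38 label bits

def to_valid_code(prediction, max_code=MAX_CODE):
    """Round a regression output and clip it into the valid range [0, max_code]."""
    return min(max(int(round(prediction)), 0), max_code)

# The first three Lasso predictions printed above
samples = [-5.91279193e+09, 1.52970644e+11, 1.21356713e+11]
codes = [to_valid_code(p) for p in samples]

print(codes)
print([format(c, "038b") for c in codes])  # fixed-width 38-bit strings
```

Clipping only guarantees a valid code; it does not make a negative prediction any more accurate.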
plt.rcParams['axes.unicode_minus']=False
plt.plot(pred,'-o')
plt.plot(y_test.values,'-x')
Random forest regression
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators=500,random_state=1)
forest.fit(X_train, y_train)
RandomForestRegressor(n_estimators=500, random_state=1)
forest.score(X_test, y_test)
0.9261195509450422
forest.score(X_train, y_train)
0.9755027191067267
pred = forest.predict(X_test)
plt.figure(figsize=(9, 5))
plt.plot(pred, '-o')
plt.plot(y_test.values, '-x')
plt.legend(['pred','y_test'])
plt.savefig('./拟合结果3.png', dpi=300)
plt.show()
y_test.shape
(121,)
pred
array([1.24675352e+08, 2.05173475e+11, 1.12789236e+11, 8.00262968e+07,
7.54421520e+08, 1.85587425e+11, 1.55189248e+05, 9.81478164e+10,
5.74017449e+07, 4.24724722e+09, 4.66480466e+07, 5.08733195e+08,
1.00001649e+08, 6.44360553e+09, 1.42626958e+10, 3.13156145e+07,
4.31114181e+07, 1.26695267e+08, 7.76068297e+08, 7.59096300e+07,
8.48918886e+06, 7.27663689e+09, 1.17102611e+10, 3.46650597e+10,
1.10393074e+10, 7.40686389e+09, 4.81951760e+07, 4.17041954e+08,
1.18842789e+11, 1.97305119e+10, 3.22288935e+10, 3.89535361e+08,
6.65526165e+07, 5.75985656e+08, 6.27017229e+08, 6.49339062e+10,
3.85719023e+08, 1.89915162e+09, 1.55328249e+10, 9.74938349e+10,
8.37426616e+09, 8.95340757e+06, 2.69645769e+09, 2.85048298e+10,
1.87720648e+07, 7.47554743e+08, 6.02530517e+10, 2.15799840e+07,
4.55129420e+08, 8.61797050e+10, 8.86174570e+08, 5.78620744e+08,
2.09165917e+08, 1.16666346e+11, 1.08871650e+10, 2.74455762e+09,
3.28018528e+08, 6.68731939e+08, 5.17918113e+08, 4.66320301e+08,
7.33137992e+07, 6.73922735e+08, 3.76987720e+10, 1.56428215e+10,
5.41022623e+10, 2.90909762e+07, 1.28268144e+11, 8.92707179e+09,
4.63233331e+07, 1.17945429e+09, 9.14957227e+08, 1.12137744e+07,
1.83293585e+11, 4.09141865e+10, 1.33210105e+10, 1.31548227e+10,
6.86595383e+08, 1.04837817e+08, 1.25573677e+11, 8.05585407e+09,
1.80610749e+11, 4.69923831e+10, 4.03452763e+10, 1.16284253e+10,
8.46682973e+07, 5.45160184e+06, 1.06452752e+09, 4.64929293e+08,
2.83771817e+10, 3.62310379e+08, 4.31160930e+06, 3.97632956e+10,
1.96497034e+08, 6.50559268e+09, 1.02850940e+09, 1.63454052e+11,
1.93554624e+06, 1.53093184e+05, 1.34768527e+11, 8.28636442e+07,
2.03984846e+08, 3.90851214e+08, 4.71234506e+09, 8.92876697e+10,
4.98159381e+07, 1.76728362e+10, 3.07549350e+08, 1.28416273e+11,
3.70288798e+10, 1.36616852e+11, 5.65949200e+07, 5.09815668e+10,
1.16628258e+10, 8.99349129e+09, 3.09913240e+10, 1.34674082e+11,
3.16818562e+10, 3.90879421e+08, 1.28392593e+11, 6.20190259e+09,
4.31842726e+08])
list_forest_pred = list(pred)
list_forest_pred
[124675352.248,
205173474701.0255,
112789236039.8526,
80026296.82,
754421520.08,
185587425370.55,
155189.248,
98147816379.65141,
57401744.884,
4247247221.6748667,
46648046.592,
508733194.752,
100001649.256,
6443605531.820465,
14262695805.496351,
31315614.504,
43111418.068,
126695266.728,
776068297.328,
75909630.036,
8489188.864,
7276636891.948751,
11710261065.919668,
34665059729.64825,
11039307422.680023,
7406863889.814001,
48195175.96,
417041953.82,
118842788770.70879,
19730511943.936913,
32228893465.0654,
389535361.076,
66552616.464,
575985656.2351364,
627017229.124,
64933906221.420006,
385719023.424,
1899151623.344267,
15532824945.758196,
97493834936.33267,
8374266160.352725,
8953407.568,
2696457689.141733,
28504829788.50441,
18772064.84,
747554743.296,
60253051736.09195,
21579984.008,
455129419.776,
86179705011.769,
886174569.6234286,
578620744.4596444,
209165916.504,
116666345834.87895,
10887164990.000875,
2744557618.8492255,
328018527.82,
668731939.1013333,
517918112.6706667,
466320300.524,
73313799.18,
673922734.728,
37698771993.227325,
15642821490.815332,
54102262309.18782,
29090976.244,
128268143897.54721,
8927071788.246752,
46323333.06,
1179454292.77,
914957226.78,
11213774.4,
183293585492.91522,
40914186531.93481,
13321010536.4859,
13154822719.0016,
686595383.3788,
104837817.052,
125573677330.19968,
8055854065.595169,
180610749089.55646,
46992383090.550606,
40345276294.254875,
11628425323.005379,
84668297.288,
5451601.836,
1064527519.5970285,
464929293.44,
28377181747.692825,
362310378.664,
4311609.304,
39763295589.8338,
196497034.256,
6505592679.9942665,
1028509400.3388445,
163454051992.4498,
1935546.236,
153093.184,
134768526920.15146,
82863644.164,
203984845.772,
390851214.396,
4712345059.564934,
89287669687.39032,
49815938.1,
17672836223.857914,
307549349.88159996,
128416272544.25165,
37028879840.38972,
136616852034.36374,
56594920.04,
50981566750.73101,
11662825816.32518,
8993491285.912485,
30991324005.078377,
134674082283.40555,
31681856183.926003,
390879421.376,
128392592893.31665,
6201902588.575354,
431842726.056]
"""
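Rounding in Python: int() truncates toward zero, which equals rounding down for non-negative values;
to round up, use math.ceil()
"""
import math
# Round each prediction up and convert it to a binary string (with the '0b' prefix)
list_bin = []
for i in list_forest_pred:
    list_bin.append(bin(int(math.ceil(i))))
list_bin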
['0b111011011100110010100011001',
'0b10111111000101010010101100010110001110',
'0b1101001000010110000110010100101001000',
'0b100110001010001101010111001',
'0b101100111101111000111100010001',
'0b10101100110101110111110110000001011011',
'0b100101111000110110',
'0b1011011011010000100001101001110111100',
'0b11011010111110000110010001',
'0b11111101001001111101100101110110',
'0b10110001111100101011101111',
'0b11110010100101010011100001011',
'0b101111101011110011101110010',
'0b110000000000100011001111000011100',
'0b1101010010000111110111011101111110',
'0b1110111011101011010011111',
'0b10100100011101001111111011',
'0b111100011010011011101100011',
'0b101110010000011101110011001010',
'0b100100001100100100111111111',
'0b100000011000100011100101',
'0b110110001101110001010101011011100',
'0b1010111001111111000110011101001010',
'0b100000010010001100101101010110010010',
'0b1010010001111111100111011010011111',
'0b110111001011110111100011000010010',
'0b10110111110110011001101000',
'0b11000110110111000111000100010',
'0b1101110101011100101010000101110100011',
'0b10010011000000001111011100001001000',
'0b11110000000111111011110001100011010',
'0b10111001101111101011010000010',
'0b11111101111000001100101001',
'0b100010010101001101011111111001',
'0b100101010111111000011000001110',
'0b111100011110010111001100011100101110',
'0b10110111111011001101011110000',
'0b1110001001100101100000100001000',
'0b1110011101110101000001100101110010',
'0b1011010110011000101011101110010111001',
'0b111110011001001010010100100110001',
'0b100010001001111001000000',
'0b10100000101110001010110111011010',
'0b11010100011000001010010111101011101',
'0b1000111100111000001100001',
'0b101100100011101100011110111000',
'0b111000000111010111001001101101011001',
'0b1010010010100100011010001',
'0b11011001000001011100101001100',
'0b1010000010000101101011111000010110100',
'0b110100110100011111001101101010',
'0b100010011111010000110101001001',
'0b1100011101111001111001011101',
'0b1101100101001110110110010100101101011',
'0b1010001000111011001111010000111111',
'0b10100011100101101010000000110011',
'0b10011100011010010101001100000',
'0b100111110111000000101000100100',
'0b11110110111101100110110100001',
'0b11011110010110111101110101101',
'0b100010111101010111000001000',
'0b101000001010110011111010101111',
'0b100011000111000001011001110000011010',
'0b1110100100011000101000001101110011',
'0b110010011000101111110000001000100110',
'0b1101110111110010010100001',
'0b1110111011101011000001000110100011010',
'0b1000010100000110000100111000101101',
'0b10110000101101011010000110',
'0b1000110010011010000101101010101',
'0b110110100010010010001110101011',
'0b101010110001101111001111',
'0b10101010101101001001100010100001010101',
'0b100110000110101011001111000100100100',
'0b1100011001111111100111110101101001',
'0b1100010000000101101010101001000000',
'0b101000111011001001110100111000',
'0b110001111111011001010111010',
'0b1110100111100110001100100000100010011',
'0b111100000001010101001001111110010',
'0b10101000001101001111010101001010100010',
'0b101011110000111101101101110001110011',
'0b100101100100110001000000111110000111',
'0b1010110101000110111011000001101100',
'0b101000010111110111110001010',
'0b10100110010111101010010',
'0b111111011100110110011010100000',
'0b11011101101100100001000001110',
'0b11010011011011010010110111000110100',
'0b10101100110000110101011101011',
'0b10000011100101000111010',
'0b100101000010000100111011110101100110',
'0b1011101101100100111010001011',
'0b110000011110000110111011101101000',
'0b111101010011011100111011011001',
'0b10011000001110100111101110011010011001',
'0b111011000100010111011',
'0b100101011000000110',
'0b1111101100000110101001000011001001001',
'0b100111100000110011000011101',
'0b1100001010001000111111001110',
'0b10111010010111110101010001111',
'0b100011000111000001010110111100100',
'0b1010011001001111101011011011110111000',
'0b10111110000010000110000011',
'0b10000011101011000100001010010000000',
'0b10010010101001101010010100110',
'0b1110111100110001101001101000010100001',
'0b100010011111000101111101110111100001',
'0b1111111001110111111111011101001000011',
'0b11010111111001000111101001',
'0b101111011110101111001111100100011111',
'0b1010110111001010001001100101011001',
'0b1000011000000011011100100101010110',
'0b11100110111001110100001001101100110',
'0b1111101011011001100110110100111101100',
'0b11101100000011000101100001010111000',
'0b10111010011000101100010111110',
'0b1110111100100110010110111110111111110',
'0b101110001101010011000010111111101',
'0b11001101111010110010110100111']
len(list_bin)
121
(pd.DataFrame(y_test)).to_csv('./y_test.csv')
(pd.DataFrame(list_bin)).to_csv('./list_bin.csv')
print(int("0b111011011100110010100011001",2))
124675353
Pad the binary strings to 38 bits
str1 = '0b1010101'
print(str1.zfill(10))
00b1010101
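As the output above shows, `zfill` pads in front of the `0b` prefix, so the prefix has to be removed first. An alternative sketch is to format the integer directly to a fixed width and then compute multi-label metrics on the resulting bit strings. The helper names are hypothetical, and the two metrics (Hamming loss, subset accuracy) use their standard definitions, which may differ from the metrics used in the author's own evaluation.

```python
N_LABELS = 38

def to_bits(code, n_labels=N_LABELS):
    """Fixed-width binary string without the '0b' prefix."""
    return format(code, "0{}b".format(n_labels))

def hamming_loss(true_bits, pred_bits):
    """Fraction of label positions that differ, over all samples."""
    wrong = sum(t != p
                for tb, pb in zip(true_bits, pred_bits)
                for t, p in zip(tb, pb))
    return wrong / (len(true_bits) * len(true_bits[0]))

def subset_accuracy(true_bits, pred_bits):
    """Fraction of samples whose whole label string matches exactly."""
    return sum(tb == pb for tb, pb in zip(true_bits, pred_bits)) / len(true_bits)

y_true = [to_bits(c) for c in [0, 1048576]]
y_pred = [to_bits(c) for c in [0, 1048577]]  # second sample has one wrong bit

print(to_bits(0b1010101))
print(hamming_loss(y_true, y_pred))
print(subset_accuracy(y_true, y_pred))
```

Subset accuracy is the strictest of the two, since one wrong bit out of 38 makes the whole sample count as incorrect.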
# Strip the '0b' prefix from each binary string
list_bin_Q0b = []
for str_bin in list_bin:
    list_bin_Q0b.append(str_bin[2:])
list_bin_Q0b
['111011011100110010100011001',
'10111111000101010010101100010110001110',
'1101001000010110000110010100101001000',
'100110001010001101010111001',
'101100111101111000111100010001',
'10101100110101110111110110000001011011',
'100101111000110110',
'1011011011010000100001101001110111100',
'11011010111110000110010001',
'11111101001001111101100101110110',
'10110001111100101011101111',
'11110010100101010011100001011',
'101111101011110011101110010',
'110000000000100011001111000011100',
'1101010010000111110111011101111110',
'1110111011101011010011111',
'10100100011101001111111011',
'111100011010011011101100011',
'101110010000011101110011001010',
'100100001100100100111111111',
'100000011000100011100101',
'110110001101110001010101011011100',
'1010111001111111000110011101001010',
'100000010010001100101101010110010010',
'1010010001111111100111011010011111',
'110111001011110111100011000010010',
'10110111110110011001101000',
'11000110110111000111000100010',
'1101110101011100101010000101110100011',
'10010011000000001111011100001001000',
'11110000000111111011110001100011010',
'10111001101111101011010000010',
'11111101111000001100101001',
'100010010101001101011111111001',
'100101010111111000011000001110',
'111100011110010111001100011100101110',
'10110111111011001101011110000',
'1110001001100101100000100001000',
'1110011101110101000001100101110010',
'1011010110011000101011101110010111001',
'111110011001001010010100100110001',
'100010001001111001000000',
'10100000101110001010110111011010',
'11010100011000001010010111101011101',
'1000111100111000001100001',
'101100100011101100011110111000',
'111000000111010111001001101101011001',
'1010010010100100011010001',
'11011001000001011100101001100',
'1010000010000101101011111000010110100',
'110100110100011111001101101010',
'100010011111010000110101001001',
'1100011101111001111001011101',
'1101100101001110110110010100101101011',
'1010001000111011001111010000111111',
'10100011100101101010000000110011',
'10011100011010010101001100000',
'100111110111000000101000100100',
'11110110111101100110110100001',
'11011110010110111101110101101',
'100010111101010111000001000',
'101000001010110011111010101111',
'100011000111000001011001110000011010',
'1110100100011000101000001101110011',
'110010011000101111110000001000100110',
'1101110111110010010100001',
'1110111011101011000001000110100011010',
'1000010100000110000100111000101101',
'10110000101101011010000110',
'1000110010011010000101101010101',
'110110100010010010001110101011',
'101010110001101111001111',
'10101010101101001001100010100001010101',
'100110000110101011001111000100100100',
'1100011001111111100111110101101001',
'1100010000000101101010101001000000',
'101000111011001001110100111000',
'110001111111011001010111010',
'1110100111100110001100100000100010011',
'111100000001010101001001111110010',
'10101000001101001111010101001010100010',
'101011110000111101101101110001110011',
'100101100100110001000000111110000111',
'1010110101000110111011000001101100',
'101000010111110111110001010',
'10100110010111101010010',
'111111011100110110011010100000',
'11011101101100100001000001110',
'11010011011011010010110111000110100',
'10101100110000110101011101011',
'10000011100101000111010',
'100101000010000100111011110101100110',
'1011101101100100111010001011',
'110000011110000110111011101101000',
'111101010011011100111011011001',
'10011000001110100111101110011010011001',
'111011000100010111011',
'100101011000000110',
'1111101100000110101001000011001001001',
'100111100000110011000011101',
'1100001010001000111111001110',
'10111010010111110101010001111',
'100011000111000001010110111100100',
'1010011001001111101011011011110111000',
'10111110000010000110000011',
'10000011101011000100001010010000000',
'10010010101001101010010100110',
'1110111100110001101001101000010100001',
'100010011111000101111101110111100001',
'1111111001110111111111011101001000011',
'11010111111001000111101001',
'101111011110101111001111100100011111',
'1010110111001010001001100101011001',
'1000011000000011011100100101010110',
'11100110111001110100001001101100110',
'1111101011011001100110110100111101100',
'11101100000011000101100001010111000',
'10111010011000101100010111110',
'1110111100100110010110111110111111110',
'101110001101010011000010111111101',
'11001101111010110010110100111']
# Left-pad the binary strings above with zeros to 38 bits
list_bin_BQ = []
for str_bin in list_bin_Q0b:
    list_bin_BQ.append(str_bin.zfill(38))
list_bin_BQ
['00000000000111011011100110010100011001',
'10111111000101010010101100010110001110',
'01101001000010110000110010100101001000',
'00000000000100110001010001101010111001',
'00000000101100111101111000111100010001',
'10101100110101110111110110000001011011',
'00000000000000000000100101111000110110',
'01011011011010000100001101001110111100',
'00000000000011011010111110000110010001',
'00000011111101001001111101100101110110',
'00000000000010110001111100101011101111',
'00000000011110010100101010011100001011',
'00000000000101111101011110011101110010',
'00000110000000000100011001111000011100',
'00001101010010000111110111011101111110',
'00000000000001110111011101011010011111',
'00000000000010100100011101001111111011',
'00000000000111100011010011011101100011',
'00000000101110010000011101110011001010',
'00000000000100100001100100100111111111',
'00000000000000100000011000100011100101',
'00000110110001101110001010101011011100',
'00001010111001111111000110011101001010',
'00100000010010001100101101010110010010',
'00001010010001111111100111011010011111',
'00000110111001011110111100011000010010',
'00000000000010110111110110011001101000',
'00000000011000110110111000111000100010',
'01101110101011100101010000101110100011',
'00010010011000000001111011100001001000',
'00011110000000111111011110001100011010',
'00000000010111001101111101011010000010',
'00000000000011111101111000001100101001',
'00000000100010010101001101011111111001',
'00000000100101010111111000011000001110',
'00111100011110010111001100011100101110',
'00000000010110111111011001101011110000',
'00000001110001001100101100000100001000',
'00001110011101110101000001100101110010',
'01011010110011000101011101110010111001',
'00000111110011001001010010100100110001',
'00000000000000100010001001111001000000',
'00000010100000101110001010110111011010',
'00011010100011000001010010111101011101',
'00000000000001000111100111000001100001',
'00000000101100100011101100011110111000',
'00111000000111010111001001101101011001',
'00000000000001010010010100100011010001',
'00000000011011001000001011100101001100',
'01010000010000101101011111000010110100',
'00000000110100110100011111001101101010',
'00000000100010011111010000110101001001',
'00000000001100011101111001111001011101',
'01101100101001110110110010100101101011',
'00001010001000111011001111010000111111',
'00000010100011100101101010000000110011',
'00000000010011100011010010101001100000',
'00000000100111110111000000101000100100',
'00000000011110110111101100110110100001',
'00000000011011110010110111101110101101',
'00000000000100010111101010111000001000',
'00000000101000001010110011111010101111',
'00100011000111000001011001110000011010',
'00001110100100011000101000001101110011',
'00110010011000101111110000001000100110',
'00000000000001101110111110010010100001',
'01110111011101011000001000110100011010',
'00001000010100000110000100111000101101',
'00000000000010110000101101011010000110',
'00000001000110010011010000101101010101',
'00000000110110100010010010001110101011',
'00000000000000101010110001101111001111',
'10101010101101001001100010100001010101',
'00100110000110101011001111000100100100',
'00001100011001111111100111110101101001',
'00001100010000000101101010101001000000',
'00000000101000111011001001110100111000',
'00000000000110001111111011001010111010',
'01110100111100110001100100000100010011',
'00000111100000001010101001001111110010',
'10101000001101001111010101001010100010',
'00101011110000111101101101110001110011',
'00100101100100110001000000111110000111',
'00001010110101000110111011000001101100',
'00000000000101000010111110111110001010',
'00000000000000010100110010111101010010',
'00000000111111011100110110011010100000',
'00000000011011101101100100001000001110',
'00011010011011011010010110111000110100',
'00000000010101100110000110101011101011',
'00000000000000010000011100101000111010',
'00100101000010000100111011110101100110',
'00000000001011101101100100111010001011',
'00000110000011110000110111011101101000',
'00000000111101010011011100111011011001',
'10011000001110100111101110011010011001',
'00000000000000000111011000100010111011',
'00000000000000000000100101011000000110',
'01111101100000110101001000011001001001',
'00000000000100111100000110011000011101',
'00000000001100001010001000111111001110',
'00000000010111010010111110101010001111',
'00000100011000111000001010110111100100',
'01010011001001111101011011011110111000',
'00000000000010111110000010000110000011',
'00010000011101011000100001010010000000',
'00000000010010010101001101010010100110',
'01110111100110001101001101000010100001',
'00100010011111000101111101110111100001',
'01111111001110111111111011101001000011',
'00000000000011010111111001000111101001',
'00101111011110101111001111100100011111',
'00001010110111001010001001100101011001',
'00001000011000000011011100100101010110',
'00011100110111001110100001001101100110',
'01111101011011001100110110100111101100',
'00011101100000011000101100001010111000',
'00000000010111010011000101100010111110',
'01110111100100110010110111110111111110',
'00000101110001101010011000010111111101',
'00000000011001101111010110010110100111']
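The two steps above (strip the `0b` prefix, then pad to 38 bits) can also be done in one call with Python's format specification. The sketch below is only an illustration: the helper name `to_bit_string` is hypothetical and not part of the original notebook. Unlike `zfill`, it also rejects values that do not fit in 38 bits, which could happen if a regression output exceeds the largest valid label code.

```python
def to_bit_string(value: int, width: int = 38) -> str:
    """Return `value` as a zero-padded binary string of exactly `width` bits.

    Hypothetical helper for illustration; assumes a non-negative integer.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    bits = format(value, f"0{width}b")  # e.g. format(5, '08b') == '00000101'
    if len(bits) > width:
        # zfill would silently return a longer string in this case
        raise ValueError(f"{value} does not fit in {width} bits")
    return bits


# 1048576 is one of the Y_dec values shown in data.head()
example = to_bit_string(1048576)
print(example, len(example))
```

The result is identical to `bin(value)[2:].zfill(38)` for every value that fits in 38 bits.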
# Write the padded strings to a CSV file, one list element per line
# List to write: list_bin_BQ
# The file is opened in append mode, so re-running this cell appends the lines again
with open("./output.csv", 'a', encoding='utf8') as f:
    # A newline between elements, none after the last one
    f.write('\n'.join(list_bin_BQ))
# Convert y_test to binary strings
list_y_test = list(y_test)
list_y_test_bin = []
for i in list_y_test:
    list_y_test_bin.append(bin(i))
list_y_test_bin
['0b10000000000000000',
'0b11001100000010000101010000100100000001',
'0b10000000000100000000000000000000000000',
'0b0',
'0b110000000000000000000000000000',
'0b10110000100000000000100010000000000001',
'0b0',
'0b1101000100000000001000000100000000000',
'0b0',
'0b1101001000100001000010100000000',
'0b10000000000000000',
'0b10010000000010000100100000000',
'0b10000000000000000',
'0b10000000000100000000000000000',
'0b101000010000000100000000100000010101',
'0b1000000000000000000000',
'0b100000000000000000000000000',
'0b100000000000000000000',
'0b110000000000000000000000000000',
'0b100000100000000000000000000',
'0b100000000000000000000',
'0b1000010000100000100100000000000000',
'0b0',
'0b101000000100000001100000100000010010',
'0b10101000100000000100001000000',
'0b1000000000000100000000100000000',
'0b1000000000000000000000',
'0b11000000000000000100000000000',
'0b10000000000100010000110000000000001000',
'0b10000100001000000000000000000000000',
'0b101000000100000001100000100000000010',
'0b110000000010000000001000000000',
'0b100000100000000000000000000',
'0b1000000000',
'0b100000000000000000000000000000',
'0b1000000001000000001000010000000110000',
'0b11010000000000000100000000000',
'0b10110000101000100001000000001000',
'0b1100000010000000000100001000',
'0b110010000000000000000000000000001000',
'0b100010100000000000000100000000000',
'0b100000000000000000000000000',
'0b100000010000001000001100010',
'0b10000000000000000000000000100000000',
'0b0',
'0b110000000000000000000000000000',
'0b100010100000100001010000101000000000',
'0b0',
'0b100000001010000000000100000000',
'0b10000000001010000000101000000001000000',
'0b1000100000',
'0b1000100100',
'0b10000000000000001000100000000',
'0b10000000000100010000110000000000001000',
'0b100000100000000000',
'0b1000000001000100000000100101000',
'0b100000000100000000000000000000',
'0b110000000010000000001000000000',
'0b1000000000000000000000000000',
'0b100000000000000000100000000000',
'0b10000000000000000',
'0b110000000010000000010000000000',
'0b10000000010010001100000010100',
'0b0',
'0b100000000011000000100010000000001001',
'0b1000000000000000000000',
'0b10000000000100000000000001000000001000',
'0b10000000000100000000000000000',
'0b100000000000000000000000000',
'0b11000011000000000100000011100',
'0b1000100010010100110000',
'0b0',
'0b11001100000010000101010000100100000010',
'0b101000000000000100000100100000000100',
'0b101010000100000000010',
'0b1000000000000100000000100000000',
'0b1100011001010000100010011100',
'0b10000000000000',
'0b10000000000100010000000001000000001000',
'0b100000000001010000100000000000',
'0b10110000100000000000000010000001000011',
'0b10000101010000100100000010',
'0b101000000100000001100000100000010010',
'0b101000000000000100000000000100000',
'0b1000000000000000000000',
'0b0',
'0b100000111010000011000000000',
'0b110100000000000000000000',
'0b101000010000000100001000100000010101',
'0b100000000100000000000000000000',
'0b0',
'0b100000000010000000100010000100001001',
'0b1010000000000100000000',
'0b10000000001000000010000100010000',
'0b1000',
'0b10110000000000000000100000000000000000',
'0b10000000000000000000',
'0b0',
'0b10000000000000000000000000000000000010',
'0b100000100000000000000000000',
'0b10000000000000000000000000000',
'0b100000000100000000000000000000',
'0b1000000000000100000000100000000',
'0b10100000010000100000000001001000010000',
'0b0',
'0b10000100000000001010000100000000000',
'0b10010000100000000000000000000',
'0b10110000100000000000100000000000000001',
'0b10001000',
'0b10000000000000000000000000000000000000',
'0b100000000000',
'0b1001000010001001000000000100000000100',
'0b100000000000000001000000',
'0b1000010000100000000100000000000000',
'0b101000010000000100001000100000010101',
'0b10010000100000000000100001010000100010',
'0b100000000000100000000001001000000001',
'0b11100000101000000010000000000',
'0b10000001001010000000001000001000001100',
'0b1000010000000000000000000',
'0b11110000101000000000000000000']
# Strip the leading '0b' prefix from each binary string
list_y_test_bin_Q0b = []
for str_bin in list_y_test_bin:
    list_y_test_bin_Q0b.append(str_bin[2:])
list_y_test_bin_Q0b
['10000000000000000',
'11001100000010000101010000100100000001',
'10000000000100000000000000000000000000',
'0',
'110000000000000000000000000000',
'10110000100000000000100010000000000001',
'0',
'1101000100000000001000000100000000000',
'0',
'1101001000100001000010100000000',
'10000000000000000',
'10010000000010000100100000000',
'10000000000000000',
'10000000000100000000000000000',
'101000010000000100000000100000010101',
'1000000000000000000000',
'100000000000000000000000000',
'100000000000000000000',
'110000000000000000000000000000',
'100000100000000000000000000',
'100000000000000000000',
'1000010000100000100100000000000000',
'0',
'101000000100000001100000100000010010',
'10101000100000000100001000000',
'1000000000000100000000100000000',
'1000000000000000000000',
'11000000000000000100000000000',
'10000000000100010000110000000000001000',
'10000100001000000000000000000000000',
'101000000100000001100000100000000010',
'110000000010000000001000000000',
'100000100000000000000000000',
'1000000000',
'100000000000000000000000000000',
'1000000001000000001000010000000110000',
'11010000000000000100000000000',
'10110000101000100001000000001000',
'1100000010000000000100001000',
'110010000000000000000000000000001000',
'100010100000000000000100000000000',
'100000000000000000000000000',
'100000010000001000001100010',
'10000000000000000000000000100000000',
'0',
'110000000000000000000000000000',
'100010100000100001010000101000000000',
'0',
'100000001010000000000100000000',
'10000000001010000000101000000001000000',
'1000100000',
'1000100100',
'10000000000000001000100000000',
'10000000000100010000110000000000001000',
'100000100000000000',
'1000000001000100000000100101000',
'100000000100000000000000000000',
'110000000010000000001000000000',
'1000000000000000000000000000',
'100000000000000000100000000000',
'10000000000000000',
'110000000010000000010000000000',
'10000000010010001100000010100',
'0',
'100000000011000000100010000000001001',
'1000000000000000000000',
'10000000000100000000000001000000001000',
'10000000000100000000000000000',
'100000000000000000000000000',
'11000011000000000100000011100',
'1000100010010100110000',
'0',
'11001100000010000101010000100100000010',
'101000000000000100000100100000000100',
'101010000100000000010',
'1000000000000100000000100000000',
'1100011001010000100010011100',
'10000000000000',
'10000000000100010000000001000000001000',
'100000000001010000100000000000',
'10110000100000000000000010000001000011',
'10000101010000100100000010',
'101000000100000001100000100000010010',
'101000000000000100000000000100000',
'1000000000000000000000',
'0',
'100000111010000011000000000',
'110100000000000000000000',
'101000010000000100001000100000010101',
'100000000100000000000000000000',
'0',
'100000000010000000100010000100001001',
'1010000000000100000000',
'10000000001000000010000100010000',
'1000',
'10110000000000000000100000000000000000',
'10000000000000000000',
'0',
'10000000000000000000000000000000000010',
'100000100000000000000000000',
'10000000000000000000000000000',
'100000000100000000000000000000',
'1000000000000100000000100000000',
'10100000010000100000000001001000010000',
'0',
'10000100000000001010000100000000000',
'10010000100000000000000000000',
'10110000100000000000100000000000000001',
'10001000',
'10000000000000000000000000000000000000',
'100000000000',
'1001000010001001000000000100000000100',
'100000000000000001000000',
'1000010000100000000100000000000000',
'101000010000000100001000100000010101',
'10010000100000000000100001010000100010',
'100000000000100000000001001000000001',
'11100000101000000010000000000',
'10000001001010000000001000001000001100',
'1000010000000000000000000',
'11110000101000000000000000000']
# Left-pad the y_test binary strings with zeros to 38 bits
# List to pad: list_y_test_bin_Q0b
list_bin_y_test_BQ = []
for str_bin in list_y_test_bin_Q0b:
    list_bin_y_test_BQ.append(str_bin.zfill(38))
list_bin_y_test_BQ
['00000000000000000000010000000000000000',
'11001100000010000101010000100100000001',
'10000000000100000000000000000000000000',
'00000000000000000000000000000000000000',
'00000000110000000000000000000000000000',
'10110000100000000000100010000000000001',
'00000000000000000000000000000000000000',
'01101000100000000001000000100000000000',
'00000000000000000000000000000000000000',
'00000001101001000100001000010100000000',
'00000000000000000000010000000000000000',
'00000000010010000000010000100100000000',
'00000000000000000000010000000000000000',
'00000000010000000000100000000000000000',
'00101000010000000100000000100000010101',
'00000000000000001000000000000000000000',
'00000000000100000000000000000000000000',
'00000000000000000100000000000000000000',
'00000000110000000000000000000000000000',
'00000000000100000100000000000000000000',
'00000000000000000100000000000000000000',
'00001000010000100000100100000000000000',
'00000000000000000000000000000000000000',
'00101000000100000001100000100000010010',
'00000000010101000100000000100001000000',
'00000001000000000000100000000100000000',
'00000000000000001000000000000000000000',
'00000000011000000000000000100000000000',
'10000000000100010000110000000000001000',
'00010000100001000000000000000000000000',
'00101000000100000001100000100000000010',
'00000000110000000010000000001000000000',
'00000000000100000100000000000000000000',
'00000000000000000000000000001000000000',
'00000000100000000000000000000000000000',
'01000000001000000001000010000000110000',
'00000000011010000000000000100000000000',
'00000010110000101000100001000000001000',
'00000000001100000010000000000100001000',
'00110010000000000000000000000000001000',
'00000100010100000000000000100000000000',
'00000000000100000000000000000000000000',
'00000000000100000010000001000001100010',
'00010000000000000000000000000100000000',
'00000000000000000000000000000000000000',
'00000000110000000000000000000000000000',
'00100010100000100001010000101000000000',
'00000000000000000000000000000000000000',
'00000000100000001010000000000100000000',
'10000000001010000000101000000001000000',
'00000000000000000000000000001000100000',
'00000000000000000000000000001000100100',
'00000000010000000000000001000100000000',
'10000000000100010000110000000000001000',
'00000000000000000000100000100000000000',
'00000001000000001000100000000100101000',
'00000000100000000100000000000000000000',
'00000000110000000010000000001000000000',
'00000000001000000000000000000000000000',
'00000000100000000000000000100000000000',
'00000000000000000000010000000000000000',
'00000000110000000010000000010000000000',
'00000000010000000010010001100000010100',
'00000000000000000000000000000000000000',
'00100000000011000000100010000000001001',
'00000000000000001000000000000000000000',
'10000000000100000000000001000000001000',
'00000000010000000000100000000000000000',
'00000000000100000000000000000000000000',
'00000000011000011000000000100000011100',
'00000000000000001000100010010100110000',
'00000000000000000000000000000000000000',
'11001100000010000101010000100100000010',
'00101000000000000100000100100000000100',
'00000000000000000101010000100000000010',
'00000001000000000000100000000100000000',
'00000000001100011001010000100010011100',
'00000000000000000000000010000000000000',
'10000000000100010000000001000000001000',
'00000000100000000001010000100000000000',
'10110000100000000000000010000001000011',
'00000000000010000101010000100100000010',
'00101000000100000001100000100000010010',
'00000101000000000000100000000000100000',
'00000000000000001000000000000000000000',
'00000000000000000000000000000000000000',
'00000000000100000111010000011000000000',
'00000000000000110100000000000000000000',
'00101000010000000100001000100000010101',
'00000000100000000100000000000000000000',
'00000000000000000000000000000000000000',
'00100000000010000000100010000100001001',
'00000000000000001010000000000100000000',
'00000010000000001000000010000100010000',
'00000000000000000000000000000000001000',
'10110000000000000000100000000000000000',
'00000000000000000010000000000000000000',
'00000000000000000000000000000000000000',
'10000000000000000000000000000000000010',
'00000000000100000100000000000000000000',
'00000000010000000000000000000000000000',
'00000000100000000100000000000000000000',
'00000001000000000000100000000100000000',
'10100000010000100000000001001000010000',
'00000000000000000000000000000000000000',
'00010000100000000001010000100000000000',
'00000000010010000100000000000000000000',
'10110000100000000000100000000000000001',
'00000000000000000000000000000010001000',
'10000000000000000000000000000000000000',
'00000000000000000000000000100000000000',
'01001000010001001000000000100000000100',
'00000000000000100000000000000001000000',
'00001000010000100000000100000000000000',
'00101000010000000100001000100000010101',
'10010000100000000000100001010000100010',
'00100000000000100000000001001000000001',
'00000000011100000101000000010000000000',
'10000001001010000000001000001000001100',
'00000000000001000010000000000000000000',
'00000000011110000101000000000000000000']
# Write the padded strings to a CSV file, one list element per line
# List to write: list_bin_y_test_BQ
# The file is opened in append mode, so re-running this cell appends the lines again
with open("./output_y_test.csv", 'a', encoding='utf8') as f:
    # A newline between elements, none after the last one
    f.write('\n'.join(list_bin_y_test_BQ))
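Measuring how well the predicted strings match the true strings
Idea: if the predicted binary strings can be compared directly with the test-set binary strings, can that comparison serve as an evaluation?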
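# Idea 1: compute the Euclidean distance
# Idea 2: compute the Hamming distance
# Idea 3: compute the similarity of the two strings
# Compare the similarity of the y_pred and y_test binary strings
import difflib
def string_similar(s1, s2):
    # quick_ratio() returns an upper bound on SequenceMatcher.ratio()
    return difflib.SequenceMatcher(None, s1, s2).quick_ratio()
xiangsidu_list = []  # similarity scores
# Note: this loop scores every predicted string against every test string (all pairs),
# not only the i-th prediction against the i-th true string
for i in range(0, len(list_bin_BQ)):
    for j in range(0, len(list_bin_y_test_BQ)):
        similarity = string_similar(list_bin_BQ[i], list_bin_y_test_BQ[j])
        xiangsidu_list.append(similarity)
        print(similarity)
# The per-pair output printed above is omitted here
# Mean of the similarity scores
np.mean(xiangsidu_list)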
0.6696695293318331
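Idea 2 (Hamming distance) is listed above but not implemented, and the similarity loop scores all prediction/test pairs, not only the matching ones. The following is a minimal sketch of a pairwise comparison, assuming the i-th predicted string corresponds to the i-th true string. The function names and the 4-bit toy strings are hypothetical stand-ins for `list_bin_BQ` and `list_bin_y_test_BQ`, not data from the experiment.

```python
import difflib


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def paired_scores(preds, trues):
    """Compare the i-th prediction only with the i-th true string."""
    distances = [hamming_distance(p, t) for p, t in zip(preds, trues)]
    ratios = [difflib.SequenceMatcher(None, p, t).ratio()
              for p, t in zip(preds, trues)]
    return distances, ratios


# Toy 4-bit strings (illustrative only)
toy_pred = ["0110", "1000", "0001"]
toy_true = ["0100", "1000", "1110"]

distances, ratios = paired_scores(toy_pred, toy_true)
# Fraction of mismatched bits over all samples and positions
mean_bit_error = sum(distances) / (len(toy_pred) * len(toy_pred[0]))
print(distances, mean_bit_error)
```

The mean fraction of mismatched bits computed this way is the same quantity that the multilabel Hamming loss reports.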
Computing the Hamming loss
# Print the type of a predicted string
print(type(list_bin_BQ[0]))
# Print the type of a y_test string
print(type(list_bin_y_test_BQ[0]))
<class 'str'>
<class 'str'>
# Import hamming_loss
from sklearn.metrics import hamming_loss
# Sanity-check hamming_loss on a small example
hamming_loss(y_true=[1,2,3,4], y_pred=[2,2,3,4])
0.25
# Test snippet: split a string into its individual characters.
# The resulting list elements are one-character strings, so a type conversion is still needed later
s = 'xiaoyao'
print(list(s))
['x', 'i', 'a', 'o', 'y', 'a', 'o']
# Split each y_pred string into its characters
y_pred_list = [list(s) for s in list_bin_BQ]
# Split each y_test string into its characters
y_test_list = [list(s) for s in list_bin_y_test_BQ]
# Check the data types
print(type(y_pred_list), type(y_test_list))
# Inspect the first element (index 0) of each list
print(y_pred_list[0], y_test_list[0])
<class 'list'> <class 'list'>
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '1', '0', '1', '1', '1', '0', '0', '1', '1', '0', '0', '1', '0', '1', '0', '0', '0', '1', '1', '0', '0', '1'] ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
# Per the official documentation, hamming_loss takes (y_true, y_pred), both array-like, and supports multilabel classification
# Convert the two lists above to arrays
pred_array = np.array(y_pred_list)
test_array = np.array(y_test_list)
# Print the shapes of the predicted and true arrays
print(pred_array.shape, test_array.shape)
(121, 38) (121, 38)
# At this point the elements of both arrays are still of type str
type(pred_array[0][0])
numpy.str_
# Convert the elements of both arrays to int
int_pred_array = pred_array.astype(np.int32)
int_test_array = test_array.astype(np.int32)
# Compute hamming_loss; for comparison, the regression model above reached R^2 = 0.92 on the test set
hamming_loss(y_true=int_test_array, y_pred=int_pred_array)
0.4275772074815137
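One factor that may partly explain the gap between a high R^2 on the decimal code and a Hamming loss of about 0.43 is that numeric closeness and bit-level closeness are different things. The sketch below uses two hand-picked 38-bit values (not taken from the experiment) to show both directions: a tiny numeric error that flips every bit, and a huge numeric error that flips only one. The helper names are hypothetical.

```python
def to_bits(value: int, width: int = 38) -> str:
    """Zero-padded binary string of a non-negative integer."""
    return format(value, f"0{width}b")


def bit_mismatches(a: int, b: int) -> int:
    """Number of differing bit positions, via XOR and a popcount."""
    return bin(a ^ b).count("1")


# Case 1: numerically almost identical, yet every one of the 38 bits differs
y_true = 2 ** 37        # '1' followed by 37 zeros
y_pred = 2 ** 37 - 1    # '0' followed by 37 ones
relative_error = abs(y_true - y_pred) / y_true
all_bits_wrong = bit_mismatches(y_true, y_pred)

# Case 2: numerically far apart, yet only the leading bit differs
y_pred_far = 0
one_bit_wrong = bit_mismatches(y_true, y_pred_far)

print(to_bits(y_true))
print(to_bits(y_pred), relative_error, all_bits_wrong)
print(to_bits(y_pred_far), one_bit_wrong)
```

A regression loss on the decimal value mostly rewards getting the high-order labels right, while the Hamming loss weights all 38 labels equally.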
Computing the Jaccard similarity coefficient
from sklearn.metrics import jaccard_score
# Pass the int_pred_array and int_test_array from above directly as arguments
print(jaccard_score(int_test_array, int_pred_array, average='macro'))
print(jaccard_score(int_test_array, int_pred_array, average='samples'))
print(jaccard_score(int_test_array, int_pred_array, average=None))
0.11170386238918098
0.09724124663042615
[0.29411765 0.05555556 0.26470588 0.28 0.23684211 0.05882353
0.05263158 0.07142857 0.21153846 0.26984127 0.14583333 0.11940299
0.06557377 0.05084746 0.046875 0.06153846 0.09677419 0.15
0.07142857 0.13513514 0.16438356 0.07894737 0.02666667 0.03076923
0.0877193 0.05 0.18181818 0.03030303 0.11267606 0.14492754
0.01754386 0.03389831 0.08823529 0.11111111 0.08974359 0.1372549
0.08955224 0.03030303]
Computing the F1 score
from sklearn.metrics import f1_score
print(f1_score(int_test_array, int_pred_array, average='macro'))
print(f1_score(int_test_array, int_pred_array, average='samples'))
print(f1_score(int_test_array, int_pred_array, average=None))
0.19289818577352147
0.16674516791123403
[0.45454545 0.10526316 0.41860465 0.4375 0.38297872 0.11111111
0.1 0.13333333 0.34920635 0.425 0.25454545 0.21333333
0.12307692 0.09677419 0.08955224 0.11594203 0.17647059 0.26086957
0.13333333 0.23809524 0.28235294 0.14634146 0.05194805 0.05970149
0.16129032 0.0952381 0.30769231 0.05882353 0.20253165 0.25316456
0.03448276 0.06557377 0.16216216 0.2 0.16470588 0.24137931
0.16438356 0.05882353]
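For a single binary label, F1 and the Jaccard coefficient are tied by the identity F1 = 2J / (1 + J), so the per-label F1 values carry the same information as the per-label Jaccard values reported earlier. The sketch below checks this on a hand-made toy label column and on the first two per-label Jaccard values printed above. The function names are hypothetical, and the code assumes each label column has at least one true or predicted positive.

```python
def f1_from_jaccard(jaccard: float) -> float:
    """Per-label identity F1 = 2J / (1 + J) for a binary label."""
    return 2 * jaccard / (1 + jaccard)


def jaccard_and_f1(y_true, y_pred):
    """Jaccard and F1 for one binary label column, from TP/FP/FN counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return jaccard, f1


# Toy label column (illustrative only): TP = 2, FP = 2, FN = 1
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 1, 0, 1, 1, 0]
toy_jaccard, toy_f1 = jaccard_and_f1(toy_true, toy_pred)
print(toy_jaccard, toy_f1, f1_from_jaccard(toy_jaccard))

# First two per-label Jaccard values from the output above
for j in (0.29411765, 0.05555556):
    print(j, f1_from_jaccard(j))
```

The identity holds per label; the macro and samples averages are averages of these per-label or per-sample values, so they do not satisfy it exactly.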