title:Everyone Do this at the Beginning!!
1. Summary of content
Because the data in this Kaggle competition is highly dimensional, even small tasks are difficult. So, before starting any work, the author tried to reduce the number of columns by eliminating the less useful ones and selected 17 columns that you can drop right after loading the dataset.
- Selected mostly-missing features, which have more than 99% missing values.
- Selected too-skewed features, whose majority category covers more than 99% of occurrences.
- Selected highly-correlated features: tested correlations between columns, picked pairs whose correlation is greater than 0.99, compared the distributions of the features in each pair and their correlations with HasDetections, and selected the minor column for elimination.
Using the criteria above, the following 17 features can be filtered out:
1. (M) PuaMode
2. (M) Census_ProcessorClass
3. (S) Census_IsWIMBootEnabled
4. (S) IsBeta
5. (S) Census_IsFlightsDisabled
6. (S) Census_IsFlightingInternal
7. (S) AutoSampleOptIn
8. (S) Census_ThresholdOptIn
9. (S) SMode
10. (S) Census_IsPortableOperatingSystem
11. (S) Census_DeviceFamily
12. (S) UacLuaenable
13. (S) Census_IsVirtualDevice
14. (C) Platform
15. (C) Census_OSSkuName
16. (C) Census_OSInstallLanguageIdentifier
17. (C) Processor
Here, (M) denotes mostly-missing features, (S) means too-skewed features, and (C) indicates highly-correlated features.
Tip: this kernel uses only the training dataset, but the result was the same when the train + test datasets were used later.
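As a minimal sketch of how the list above could be applied, the snippet below drops the 17 columns from a frame. The helper name `drop_selected` and the toy values are assumptions for illustration; only the column names come from the list.

```python
import pandas as pd

# The 17 columns listed above.
DROP_COLS = [
    'PuaMode', 'Census_ProcessorClass', 'Census_IsWIMBootEnabled', 'IsBeta',
    'Census_IsFlightsDisabled', 'Census_IsFlightingInternal', 'AutoSampleOptIn',
    'Census_ThresholdOptIn', 'SMode', 'Census_IsPortableOperatingSystem',
    'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice', 'Platform',
    'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier', 'Processor',
]

def drop_selected(df, cols=DROP_COLS):
    """Return a copy of df without the listed columns (absent names are ignored)."""
    return df.drop(columns=cols, errors='ignore')

# Toy frame standing in for the real data (hypothetical values).
demo = pd.DataFrame({'IsBeta': [0, 0],
                     'Platform': ['windows10', 'windows8'],
                     'HasTpm': [1, 1]})
slim = drop_selected(demo)
print(slim.columns.tolist())
```

Using `errors='ignore'` lets the same call run on frames that lack some of the columns, for example after an earlier partial drop.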
2. Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float32',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int16',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float64',  # was 'float32'
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float32',
# was 'float16'
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float32',
# was 'float16'
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float64',
# was 'float32'
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float64',
# was 'float32'
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float32',
# was 'float16'
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float32',
# was 'float16'
'Census_InternalPrimaryDisplayResolutionVertical': 'float32',
# was 'float16'
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float64',
# was 'float32'
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
train = pd.read_csv('../input/train.csv', dtype=dtypes)
train.shape
#train.shape output
(8921483, 83)
# Empty list to hold the names of the features to be removed.
droppable_features = []
3. Feature Engineering
3.1 Mostly-missing columns
# Compute the proportion of missing values for each feature.
(train.isnull().sum()/train.shape[0]).sort_values(ascending=False)
PuaMode 0.999741
Census_ProcessorClass 0.995894
DefaultBrowsersIdentifier 0.951416
Census_IsFlightingInternal 0.830440
Census_InternalBatteryType 0.710468
Census_ThresholdOptIn 0.635245
Census_IsWIMBootEnabled 0.634390
SmartScreen 0.356108
OrganizationIdentifier 0.308415
SMode 0.060277
CityIdentifier 0.036475
Wdft_IsGamer 0.034014
Wdft_RegionIdentifier 0.034014
Census_InternalBatteryNumberOfCharges 0.030124
Census_FirmwareManufacturerIdentifier 0.020541
Census_IsFlightsDisabled 0.017993
Census_FirmwareVersionIdentifier 0.017949
Census_OEMModelIdentifier 0.011459
Census_OEMNameIdentifier 0.010702
Firewall 0.010239
Census_TotalPhysicalRAM 0.009027
Census_IsAlwaysOnAlwaysConnectedCapable 0.007997
Census_OSInstallLanguageIdentifier 0.006735
IeVerIdentifier 0.006601
Census_PrimaryDiskTotalCapacity 0.005943
Census_SystemVolumeTotalCapacity 0.005941
Census_InternalPrimaryDiagonalDisplaySizeInInches 0.005283
Census_InternalPrimaryDisplayResolutionHorizontal 0.005267
Census_InternalPrimaryDisplayResolutionVertical 0.005267
Census_ProcessorModelIdentifier 0.004634
...
ProductName 0.000000
HasTpm 0.000000
OsBuild 0.000000
IsBeta 0.000000
OsSuite 0.000000
IsSxsPassiveMode 0.000000
HasDetections 0.000000
SkuEdition 0.000000
Census_OSInstallTypeName 0.000000
Census_IsPenCapable 0.000000
Census_IsTouchEnabled 0.000000
Census_IsSecureBootEnabled 0.000000
Census_FlightRing 0.000000
Census_ActivationChannel 0.000000
Census_GenuineStateName 0.000000
Census_IsPortableOperatingSystem 0.000000
Census_OSWUAutoUpdateOptionsName 0.000000
Census_OSUILocaleIdentifier 0.000000
Census_OSSkuName 0.000000
AutoSampleOptIn 0.000000
Census_OSEdition 0.000000
Census_OSBuildRevision 0.000000
Census_OSBuildNumber 0.000000
Census_OSBranch 0.000000
Census_OSArchitecture 0.000000
Census_OSVersion 0.000000
Census_HasOpticalDiskDrive 0.000000
Census_DeviceFamily 0.000000
Census_MDC2FormFactor 0.000000
MachineIdentifier 0.000000
Length: 83, dtype: float64
- There are 2 columns with more than 99% missing values, and they are useless.
# Add these two features to the list defined earlier.
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
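Rather than reading the two names off the printed ratios, the same selection can be sketched with a threshold. The function name `mostly_missing` and the toy frame are hypothetical; the 0.99 cutoff is the one used in this section.

```python
import numpy as np
import pandas as pd

def mostly_missing(df, threshold=0.99):
    """Names of columns whose fraction of NaN values exceeds threshold."""
    ratio = df.isnull().mean()  # same as isnull().sum() / len(df)
    return ratio[ratio > threshold].index.tolist()

demo = pd.DataFrame({
    'a': [np.nan] * 199 + [1.0],        # 99.5% missing
    'b': [np.nan] * 100 + [1.0] * 100,  # 50% missing
    'c': range(200),                    # no missing values
})
missing_cols = mostly_missing(demo)
print(missing_cols)
```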
3.2 Too-skewed columns
3.2.1 Majority category covers more than 99% of occurrences
# pd.options.display lets you customize the display format.
'''
'{:,.4f}': keep 4 decimal places
'{:,.Kf}': keep K decimal places, as in C
'''
# train[c].nunique(): number of distinct values in the column
# .value_counts(normalize=True).values[0]
'''
value_counts(): number of occurrences of each value
value_counts(normalize=True): share of each value, sorted in descending order by default
value_counts(normalize=True).values[0]: share of the most frequent value
'''
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0] * 100} for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df
sk_df output:
index | column | skewness | uniq |
---|---|---|---|
75 | Census_IsWIMBootEnabled | 100.0000 | 2 |
5 | IsBeta | 99.9992 | 2 |
69 | Census_IsFlightsDisabled | 99.9990 | 2 |
68 | Census_IsFlightingInternal | 99.9986 | 2 |
27 | AutoSampleOptIn | 99.9971 | 2 |
71 | Census_ThresholdOptIn | 99.9749 | 2 |
29 | SMode | 99.9537 | 2 |
65 | Census_IsPortableOperatingSystem | 99.9455 | 2 |
28 | PuaMode | 99.9134 | 2 |
35 | Census_DeviceFamily | 99.8383 | 3 |
33 | UacLuaenable | 99.3925 | 11 |
76 | Census_IsVirtualDevice | 99.2961 | 2 |
1 | ProductName | 98.9356 | 6 |
12 | HasTpm | 98.7971 | 2 |
7 | IsSxsPassiveMode | 98.2666 | 2 |
32 | Firewall | 97.8583 | 2 |
11 | AVProductsEnabled | 97.3984 | 6 |
6 | RtpStateBitfield | 97.3262 | 7 |
20 | OsVer | 96.7613 | 58 |
18 | Platform | 96.6063 | 4 |
78 | Census_IsPenCapable | 96.1929 | 2 |
26 | IsProtected | 94.5624 | 2 |
79 | Census_IsAlwaysOnAlwaysConnectedCapable | 94.2581 | 2 |
70 | Census_FlightRing | 93.6580 | 10 |
45 | Census_HasOpticalDiskDrive | 92.2813 | 2 |
55 | Census_OSArchitecture | 90.8580 | 3 |
19 | Processor | 90.8530 | 3 |
66 | Census_GenuineStateName | 88.2992 | 5 |
39 | Census_ProcessorManufacturerIdentifier | 88.2789 | 7 |
77 | Census_IsTouchEnabled | 87.4457 | 2 |
… | … | … | … |
57 | Census_OSBuildNumber | 44.9351 | 165 |
64 | Census_OSWUAutoUpdateOptionsName | 44.3256 | 6 |
23 | OsPlatformSubRelease | 43.8887 | 9 |
21 | OsBuild | 43.8887 | 76 |
30 | IeVerIdentifier | 43.8454 | 303 |
2 | EngineVersion | 43.0990 | 70 |
24 | OsBuildLab | 41.0045 | 663 |
59 | Census_OSEdition | 38.8948 | 33 |
60 | Census_OSSkuName | 38.8934 | 30 |
62 | Census_OSInstallLanguageIdentifier | 35.8777 | 39 |
63 | Census_OSUILocaleIdentifier | 35.5414 | 147 |
48 | Census_InternalPrimaryDiagonalDisplaySizeInInches | 34.3398 | 785 |
42 | Census_PrimaryDiskTotalCapacity | 32.0408 | 5735 |
72 | Census_FirmwareManufacturerIdentifier | 30.8882 | 712 |
61 | Census_OSInstallTypeName | 29.2332 | 9 |
17 | LocaleEnglishNameIdentifier | 23.4780 | 276 |
81 | Wdft_RegionIdentifier | 20.8877 | 15 |
16 | GeoNameIdentifier | 17.1716 | 292 |
58 | Census_OSBuildRevision | 15.8453 | 285 |
54 | Census_OSVersion | 15.8452 | 469 |
36 | Census_OEMNameIdentifier | 14.5850 | 3832 |
8 | DefaultBrowsersIdentifier | 10.6257 | 2017 |
13 | CountryIdentifier | 4.4519 | 222 |
37 | Census_OEMModelIdentifier | 3.4559 | 175365 |
40 | Census_ProcessorModelIdentifier | 3.2576 | 3428 |
4 | AvSigVersion | 1.1469 | 8531 |
14 | CityIdentifier | 1.1030 | 107366 |
73 | Census_FirmwareVersionIdentifier | 1.0228 | 50494 |
44 | Census_SystemVolumeTotalCapacity | 0.5863 | 536848 |
0 | MachineIdentifier | 0.0000 | 8921483 |
83 rows × 3 columns
- There are 12 categorical columns whose majority category covers more than 99% of occurrences, and they are useless, too.
droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features
['PuaMode',
'Census_ProcessorClass',
'Census_IsWIMBootEnabled',
'IsBeta',
'Census_IsFlightsDisabled',
'Census_IsFlightingInternal',
'AutoSampleOptIn',
'Census_ThresholdOptIn',
'SMode',
'Census_IsPortableOperatingSystem',
'PuaMode',
'Census_DeviceFamily',
'UacLuaenable',
'Census_IsVirtualDevice']
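One detail worth keeping in mind when reading the skewness column: `value_counts` excludes NaN by default, so the share is computed over non-missing rows only. That is why PuaMode, which is almost entirely missing, still shows up here with 99.91. The sketch below contrasts the two ways of counting; `majority_share` is a hypothetical helper, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def majority_share(s, dropna=True):
    """Percentage of rows taken by the most frequent value of a Series."""
    counts = s.value_counts(normalize=True, dropna=dropna)
    return counts.iloc[0] * 100 if len(counts) else np.nan

s = pd.Series([1.0] * 3 + [np.nan] * 7)
share_non_null = majority_share(s)                # NaN rows excluded
share_all_rows = majority_share(s, dropna=False)  # NaN counted as a value
print(share_non_null, share_all_rows)
```

With the default, the single non-null value accounts for all counted rows; with `dropna=False`, NaN itself becomes the majority "value".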
PuaMode appears twice in the list, so one occurrence is removed.
# PuaMode is duplicated in the two categories.
droppable_features.remove('PuaMode')
# Drop these columns from the dataset.
train.drop(droppable_features, axis=1, inplace=True)
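If more than one name could end up duplicated, an order-preserving de-duplication is a possible alternative to calling `remove` once per duplicate. This is a sketch on a shortened, illustrative list, not the kernel's own approach.

```python
features = ['PuaMode', 'Census_ProcessorClass', 'IsBeta', 'PuaMode', 'SMode']

# dict keys keep insertion order (Python 3.7+), so this drops later repeats
# while preserving the position of each first occurrence.
unique_features = list(dict.fromkeys(features))
print(unique_features)
```

Note that `list.remove` deletes the first occurrence, while this variant keeps the first and drops the later ones, so the resulting order differs slightly.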
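3.2.2 Fill missing values for columns that have more than 10% of missing values
Many features contain missing values. Next, we fill the features that have more than 10% missing values and trim the features that have less than 10%: for the latter, we delete the rows in which the value is NaN. Note that rows are deleted, not columns, so only the samples containing NaN are removed.
Why fill the features with more than 10% missing values but delete rows for those below 10%? Features above 10% have too many rows containing NaN. A model needs enough data, and the NaN rows of those features (some are 30% missing, some 50% or more) probably add up to more than half of the samples. Deleting them would leave too little data and hurt training, so we fill only the features above 10% and apply row deletion to those below 10%.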
In[9]:
# Nan Values
null_counts = train.isnull().sum()
null_counts = null_counts / train.shape[0]
null_counts[null_counts > 0.1]
out[9]:
DefaultBrowsersIdentifier 0.9514
OrganizationIdentifier 0.3084
SmartScreen 0.3561
Census_InternalBatteryType 0.7105
dtype: float64
The 4 columns above need their missing values filled.
Replace missing values with 0.
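In [11]: Fill:
'''
.fillna(0, inplace=True): fill missing values with 0 and modify the original data, so every missing value is replaced by 0
.fillna(0, inplace=False): return a copy with missing values filled with 0 (handy for printing); the original data is unchanged and its missing values remain
'''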
train.DefaultBrowsersIdentifier.fillna(0, inplace=True)
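Depending on the pandas version, calling `fillna(..., inplace=True)` on a column reached through attribute access may emit a warning or, under copy-on-write, leave the parent frame unchanged. A version-independent sketch is to assign the filled column back; the toy values below are made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'DefaultBrowsersIdentifier': [239.0, np.nan, 3195.0, np.nan]})

# Assign the filled column back instead of calling fillna(inplace=True)
# on an attribute-accessed column.
demo['DefaultBrowsersIdentifier'] = demo['DefaultBrowsersIdentifier'].fillna(0)
print(demo['DefaultBrowsersIdentifier'].tolist())
```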
In [12]: The second feature:
# .value_counts(): returns the number of occurrences of each value in the feature
train.SmartScreen.value_counts()
Out[12]:
RequireAdmin 4316183
ExistsNotSet 1046183
Off 186553
Warn 135483
Prompt 34533
Block 22533
off 1350
On 731
&#x02; 416
&#x01; 335
on 147
requireadmin 10
OFF 4
0 3
Promt 2
requireAdmin 1
Enabled 1
prompt 1
warn 1
00000000 1
&#x03; 1
Name: SmartScreen, dtype: int64
In [13]: The values in 'SmartScreen' are messy, so we map them to normalized strings:
trans_dict = {
    'off': 'Off', '&#x02;': '2', '&#x01;': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off',
    'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn',
    '00000000': '0', '&#x03;': '3', np.nan: 'NoExist'
}
train.replace({'SmartScreen': trans_dict}, inplace=True)
In [14]:
train.SmartScreen.isnull().sum()
Out[14]: All missing values have been replaced with 'NoExist', so the null count is 0.
0
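The explicit dictionary above enumerates each variant by hand. As a hedged alternative sketch, case variants can be folded by lowercasing first and mapping to canonical labels. The names `CANONICAL` and `normalize_smartscreen` are hypothetical, and this is a different technique from the kernel's dictionary replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: fold case variants and one known typo onto canonical labels.
CANONICAL = {'off': 'Off', 'on': 'On', 'requireadmin': 'RequireAdmin',
             'prompt': 'Prompt', 'promt': 'Prompt', 'warn': 'Warn'}

def normalize_smartscreen(s):
    lowered = s.str.lower()
    cleaned = lowered.map(CANONICAL).fillna(s)  # keep values with no canonical form
    return cleaned.fillna('NoExist')            # what is still NaN was missing originally

demo = pd.Series(['off', 'OFF', 'Promt', 'requireAdmin', 'Block', np.nan])
result = normalize_smartscreen(demo).tolist()
print(result)
```

Values that have no entry in the mapping, such as `Block`, pass through unchanged.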
In [15]: The third feature:
train.OrganizationIdentifier.value_counts()
Out[15]:
27.0000 4196457
18.0000 1764175
48.0000 63845
50.0000 45502
11.0000 19436
37.0000 19398
49.0000 13627
46.0000 10974
14.0000 4713
32.0000 4045
36.0000 3909
52.0000 3043
33.0000 2896
2.0000 2595
5.0000 1990
40.0000 1648
28.0000 1591
4.0000 1385
10.0000 1083
51.0000 917
20.0000 915
1.0000 893
8.0000 723
22.0000 418
39.0000 413
6.0000 412
31.0000 398
21.0000 397
47.0000 385
3.0000 331
16.0000 242
19.0000 172
26.0000 160
44.0000 150
29.0000 135
42.0000 132
7.0000 98
41.0000 77
45.0000 73
30.0000 64
43.0000 60
35.0000 32
23.0000 20
15.0000 13
25.0000 12
12.0000 7
34.0000 2
38.0000 1
17.0000 1
Name: OrganizationIdentifier, dtype: int64
This feature stores an ID, so we can fill its missing values with 0:
train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)
The fourth feature:
In[17]:
pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()
Out[17]:
lion 2028256
li-i 245617
# 183998
lip 62099
liio 32635
li p 8383
li 6708
nimh 4614
real 2744
bq20 2302
pbac 2274
vbox 1454
unkn 533
lgi0 399
lipo 198
lhp0 182
4cel 170
lipp 83
ithi 79
batt 60
ram 35
bad 33
virt 33
pad0 22
lit 16
ca48 16
a132 10
ots0 9
lai0 8
ÿÿÿÿ 8
lio 5
4lio 4
lio 4
asmb 4
li-p 4
0x0b 3
lgs0 3
icp3 3
3ion 2
a140 2
h00j 2
5nm1 2
lhpo 2
a138 2
lilo 1
li-h 1
lp 1
li? 1
ion 1
pbso 1
3500 1
6ion 1
@i 1
li 1
sams 1
ip 1
8 1
#TAB# 1
l&#TAB# 1
lio 1
˙˙˙ 1
l 1
cl53 1
liÿÿ 1
pa50 1
í-i 1
÷ÿóö 1
li-l 1
h4°s 1
d 1
lgl0 1
4ion 1
0ts0 1
sail 1
p-sn 1
a130 1
2337 1
lÿÿÿ 1
Name: Census_InternalBatteryType, dtype: int64
Census_InternalBatteryType has about 71% missing values, as well as "˙˙˙" and "unkn" values, which seem to mean "unknown". So replace all of these with "unknown".
trans_dict = {
'˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)
3.2.3 Remove missing values from the train.
In [19]:
train.shape
Out[19]:
(8921483, 70)
In [20]:
# .dropna(inplace=True): drop every row that contains NaN; the remaining rows keep their original index values
train.dropna(inplace=True)
train.shape
Out[20]:
(7667789, 70)
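Because `dropna` keeps the original index labels, the index has gaps afterwards. A small sketch on made-up data shows the effect and one way to renumber the rows if positional labels are needed later:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, np.nan, 3.0, 4.0], 'y': ['a', 'b', None, 'd']})

kept = demo.dropna()                      # rows 1 and 2 each contain a missing value
print(kept.index.tolist())                # original labels survive, leaving gaps
renumbered = kept.reset_index(drop=True)  # optional: relabel rows 0..n-1
print(renumbered.index.tolist())
```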
MachineIdentifier is unique to each machine, so it is not useful for predicting malware detection.
train.drop('MachineIdentifier', axis=1, inplace=True)
Label Encoding for category columns
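To make the data usable for machine learning, we first convert some columns to the category type and then encode every category column as integers.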
train['SmartScreen'] = train.SmartScreen.astype('category')
train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')
cate_cols = train.select_dtypes(include='category').columns.tolist()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cate_cols:
    train[col] = le.fit_transform(train[col])
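The loop above refits the same `LabelEncoder` for every column, so only the last column's classes remain on `le`. If the integer codes need to be mapped back to labels later, one possible pandas-only sketch is to use the category codes and store a mapping per column. This swaps in `.cat.codes` for `LabelEncoder`; the toy data and the `mappings` name are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({'SmartScreen': ['Off', 'Warn', 'Off', 'Block'],
                     'HasTpm': [1, 1, 0, 1]})
demo['SmartScreen'] = demo['SmartScreen'].astype('category')

encoded = demo.copy()
mappings = {}
for col in encoded.select_dtypes(include='category').columns:
    # Keep the code -> label mapping so the encoding can be reversed later.
    mappings[col] = dict(enumerate(encoded[col].cat.categories))
    encoded[col] = encoded[col].cat.codes

print(encoded['SmartScreen'].tolist())
print(mappings['SmartScreen'])
```

Categories inferred by `astype('category')` are sorted, so for string columns without missing values the codes should match what `LabelEncoder` assigns.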
Reduce memory usage with code from https://www.kaggle.com/timon88/load-whole-data-without-any-dtypes
def reduce_mem_usage(df):
    """
    Iterate through all the columns of a dataframe and modify the data type
    to reduce memory usage.
    """
    # memory_usage() returns bytes per column; convert the total to MB.
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Pick the narrowest integer type whose range contains the column.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Same idea for floating-point columns.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
%time
train = reduce_mem_usage(train)
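The code above reduces memory usage by choosing an appropriately sized data type for each column.
Tip: a value such as 116 takes less memory when stored in an 8-bit type than in a 16-bit type.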
result:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
Memory usage of dataframe is 2464.34 MB
Memory usage after optimization is: 965.26 MB
Decreased by 60.8%
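The range test behind the integer branch can be sketched in isolation. The helper `smallest_int_dtype` is hypothetical and uses inclusive bounds, whereas the function above compares strictly. The last line illustrates a caveat of the float branch: a range check alone does not protect precision, since float16 represents integers exactly only up to 2048.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Smallest signed integer dtype whose range contains [lo, hi]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise OverflowError('range does not fit in int64')

print(smallest_int_dtype(0, 116).__name__)   # fits in 8 bits
print(smallest_int_dtype(0, 300).__name__)   # needs 16 bits
print(np.dtype(np.int8).itemsize, np.dtype(np.int16).itemsize)  # bytes per value
print(float(np.float16(2049)))               # not exactly representable in float16
```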
3.3 Highly correlated features
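Because there are still too many features, it is hard to compute and inspect all of them at once. So we group them 10 columns at a time and examine the correlations within each group, then finally compute all the correlations among the remaining features.
cols = train.columns.tolist()
corr_remove = []  # columns to drop because of high correlation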
import seaborn as sns
plt.figure(figsize=(10,10))
co_cols = cols[:10]
co_cols.append('HasDetections')
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 1 ~ 10th columns')
plt.show()
Result: no columns have a correlation of 0.99 or higher.
In [27]:
co_cols = cols[10:20]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 11 ~ 20th columns')
plt.show()
The plot shows two features whose correlation exceeds 0.99; we select the one with fewer unique values and drop it.
print(train.Platform.nunique()) #3
print(train.OsVer.nunique()) #45
Platform vs OsVer: remove Platform
corr_remove.append('Platform')
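After repeating this procedure for the remaining groups, there are 3 features to remove based on the correlations within the groups of 10: Platform, Census_OSSkuName, and Census_OSInstallLanguageIdentifier.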
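The repeated slicing into groups of 10 can be sketched as a small generator. This is a variation on the manual `cols[:10]`, `cols[10:20]` slices: it excludes the target from the feature groups before appending it. The function name and the placeholder column names are assumptions.

```python
def column_groups(cols, target='HasDetections', size=10):
    """Yield consecutive groups of `size` columns, each with the target appended."""
    features = [c for c in cols if c != target]
    for start in range(0, len(features), size):
        yield features[start:start + size] + [target]

cols = ['f{}'.format(i) for i in range(23)] + ['HasDetections']  # hypothetical names
groups = list(column_groups(cols))
print([len(g) for g in groups])
```

Each group could then be passed to `train[group].corr()` and plotted as in the cells above.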
Having analyzed the correlations within each group, we now analyze the correlations between features across groups:
corr = train.corr()
high_corr = (corr >= 0.99).astype('uint8')
plt.figure(figsize=(15,15))
sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)
Two features with a correlation >= 0.99 appear.
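With this many columns a heatmap is hard to read, so the highly correlated pairs can also be listed programmatically. The following is a sketch on synthetic data; `high_corr_pairs` is a hypothetical helper, and the 0.99 threshold mirrors the one used above.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.99):
    """List (col_a, col_b, corr) for column pairs with corr >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return [(a, b, float(v)) for (a, b), v in pairs.items() if v >= threshold]

rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({'x': x, 'x_copy': x * 2 + 1, 'noise': rng.normal(size=500)})
pairs = high_corr_pairs(demo)
print([(a, b) for a, b, _ in pairs])
```

Like the `corr >= 0.99` mask above, this only catches strong positive correlation; comparing `abs(v)` instead would also catch strongly negative pairs.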
print(train.Census_OSArchitecture.nunique())
print(train.Processor.nunique())
3
3
Census_OSArchitecture and Processor have the same number of unique values. Then which one? Let's compare their correlations with HasDetections.
train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()
feature | Census_OSArchitecture | Processor | HasDetections |
---|---|---|---|
Census_OSArchitecture | 1.0000 | 0.9951 | -0.0758 |
Processor | 0.9951 | 1.0000 | -0.0758 |
HasDetections | -0.0758 | -0.0758 | 1.0000 |
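Both features have the same correlation coefficient with the label HasDetections, so it makes no difference which one is removed; we pick one arbitrarily.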
corr_remove.append('Processor')
In [43]:
droppable_features.extend(corr_remove)
print(len(droppable_features))
droppable_features
Out[43]: The list of 17 features:
17
['Census_ProcessorClass',
'Census_IsWIMBootEnabled',
'IsBeta',
'Census_IsFlightsDisabled',
'Census_IsFlightingInternal',
'AutoSampleOptIn',
'Census_ThresholdOptIn',
'SMode',
'Census_IsPortableOperatingSystem',
'PuaMode',
'Census_DeviceFamily',
'UacLuaenable',
'Census_IsVirtualDevice',
'Platform',
'Census_OSSkuName',
'Census_OSInstallLanguageIdentifier',
'Processor']