Everyone Do this at the Beginning

Latest recommended article published on 2019-08-01 22:17:36

绝不挂科

Article link: https://blog.csdn.net/weixin_43866317/article/details/98054775

title：Everyone Do this at the Beginning!!

1. Summary of content

The data in this Kaggle competition is highly dimensional, which makes even small tasks hard to do. So before starting any real work, the author tried to reduce the number of columns by removing the less useful ones, and selected 17 columns that you can drop right after loading the dataset.

Selected mostly-missing features, which have more than 99% missing values.
Selected too-skewed features, whose majority category covers more than 99% of occurrences.
Selected highly correlated features. Tested correlations between columns, picked the pairs whose correlation is greater than 0.99, compared the distributions of the two features in each pair and their correlations with HasDetections, and selected the minor column for elimination.

These criteria filter out the following 17 features:

1.  (M) PuaMode
2.  (M) Census_ProcessorClass
3.  (S) Census_IsWIMBootEnabled
4.  (S) IsBeta
5.  (S) Census_IsFlightsDisabled
6.  (S) Census_IsFlightingInternal
7.  (S) AutoSampleOptIn
8.  (S) Census_ThresholdOptIn
9.  (S) SMode
10. (S) Census_IsPortableOperatingSystem
11. (S) Census_DeviceFamily
12. (S) UacLuaenable
13. (S) Census_IsVirtualDevice
14. (C) Platform
15. (C) Census_OSSkuName
16. (C) Census_OSInstallLanguageIdentifier
17. (C) Processor

Here, (M) denotes mostly-missing features, (S) means too-skewed features, and (C) indicates highly correlated features.

Tip: this kernel uses only the training dataset, but the result was the same when the train and test datasets were used together later.
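As a minimal sketch of how the 17 columns could be skipped at load time rather than dropped afterwards, the snippet below passes a callable to the `usecols` parameter of `pd.read_csv`. The helper name `load_without` and the tiny in-memory CSV are assumptions made for illustration; they are not part of the original kernel.

```python
import io
import pandas as pd

# The 17 columns listed in the summary above.
DROP_COLS = [
    'PuaMode', 'Census_ProcessorClass', 'Census_IsWIMBootEnabled', 'IsBeta',
    'Census_IsFlightsDisabled', 'Census_IsFlightingInternal', 'AutoSampleOptIn',
    'Census_ThresholdOptIn', 'SMode', 'Census_IsPortableOperatingSystem',
    'Census_DeviceFamily', 'UacLuaenable', 'Census_IsVirtualDevice',
    'Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier',
    'Processor',
]

def load_without(path_or_buffer, drop_cols=DROP_COLS, **kwargs):
    """Read a CSV while skipping the columns in drop_cols (hypothetical helper)."""
    skip = set(drop_cols)
    # usecols accepts a callable that is evaluated against each column name.
    return pd.read_csv(path_or_buffer, usecols=lambda c: c not in skip, **kwargs)

# Tiny in-memory stand-in for train.csv, for illustration only.
demo_csv = io.StringIO(
    "MachineIdentifier,IsBeta,Platform,HasDetections\n"
    "a,0,windows10,1\n"
    "b,0,windows10,0\n"
)
demo = load_without(demo_csv)
print(demo.columns.tolist())  # ['MachineIdentifier', 'HasDetections']
```

With the real data, the same call would take the path to `train.csv` plus the `dtype=dtypes` argument used in the next section.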

2. Load Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64',
        # was 'float32'
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32', 
        # was 'float16'
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32', 
        # was 'float16'
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64', 
        # was 'float32'
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64',
        # was 'float32'
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32',
        # was 'float16'
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', 
        # was 'float16'
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', 
        # was 'float16'
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float64',
        # was 'float32'
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }
train = pd.read_csv('../input/train.csv', dtype=dtypes)
train.shape

#train.shape output
(8921483, 83)

# This empty list collects the names of the features to be removed.
droppable_features = []

3. Feature Engineering

3.1 mostly-missing Columns

# Compute the proportion of missing values for each feature
(train.isnull().sum()/train.shape[0]).sort_values(ascending=False)

PuaMode                                              0.999741
Census_ProcessorClass                                0.995894
DefaultBrowsersIdentifier                            0.951416
Census_IsFlightingInternal                           0.830440
Census_InternalBatteryType                           0.710468
Census_ThresholdOptIn                                0.635245
Census_IsWIMBootEnabled                              0.634390
SmartScreen                                          0.356108
OrganizationIdentifier                               0.308415
SMode                                                0.060277
CityIdentifier                                       0.036475
Wdft_IsGamer                                         0.034014
Wdft_RegionIdentifier                                0.034014
Census_InternalBatteryNumberOfCharges                0.030124
Census_FirmwareManufacturerIdentifier                0.020541
Census_IsFlightsDisabled                             0.017993
Census_FirmwareVersionIdentifier                     0.017949
Census_OEMModelIdentifier                            0.011459
Census_OEMNameIdentifier                             0.010702
Firewall                                             0.010239
Census_TotalPhysicalRAM                              0.009027
Census_IsAlwaysOnAlwaysConnectedCapable              0.007997
Census_OSInstallLanguageIdentifier                   0.006735
IeVerIdentifier                                      0.006601
Census_PrimaryDiskTotalCapacity                      0.005943
Census_SystemVolumeTotalCapacity                     0.005941
Census_InternalPrimaryDiagonalDisplaySizeInInches    0.005283
Census_InternalPrimaryDisplayResolutionHorizontal    0.005267
Census_InternalPrimaryDisplayResolutionVertical      0.005267
Census_ProcessorModelIdentifier                      0.004634
                                                       ...   
ProductName                                          0.000000
HasTpm                                               0.000000
OsBuild                                              0.000000
IsBeta                                               0.000000
OsSuite                                              0.000000
IsSxsPassiveMode                                     0.000000
HasDetections                                        0.000000
SkuEdition                                           0.000000
Census_OSInstallTypeName                             0.000000
Census_IsPenCapable                                  0.000000
Census_IsTouchEnabled                                0.000000
Census_IsSecureBootEnabled                           0.000000
Census_FlightRing                                    0.000000
Census_ActivationChannel                             0.000000
Census_GenuineStateName                              0.000000
Census_IsPortableOperatingSystem                     0.000000
Census_OSWUAutoUpdateOptionsName                     0.000000
Census_OSUILocaleIdentifier                          0.000000
Census_OSSkuName                                     0.000000
AutoSampleOptIn                                      0.000000
Census_OSEdition                                     0.000000
Census_OSBuildRevision                               0.000000
Census_OSBuildNumber                                 0.000000
Census_OSBranch                                      0.000000
Census_OSArchitecture                                0.000000
Census_OSVersion                                     0.000000
Census_HasOpticalDiskDrive                           0.000000
Census_DeviceFamily                                  0.000000
Census_MDC2FormFactor                                0.000000
MachineIdentifier                                    0.000000
Length: 83, dtype: float64

There are 2 columns with more than 99% missing values, and they are useless.

# Put these two features into the list defined earlier
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
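Instead of reading the two names off the printed table, the same selection can be made programmatically. The sketch below is an illustration on a toy frame; the function name `mostly_missing` and the toy columns are assumptions, and the 0.99 threshold is the one stated above.

```python
import numpy as np
import pandas as pd

def mostly_missing(df, threshold=0.99):
    """Return the columns whose share of missing values exceeds threshold."""
    ratio = df.isnull().mean()  # equivalent to isnull().sum() / len(df)
    return ratio[ratio > threshold].index.tolist()

# Toy data for illustration only.
toy = pd.DataFrame({
    'almost_empty': [np.nan] * 199 + [1.0],  # 99.5% missing
    'half_empty': [np.nan, 1.0] * 100,       # 50% missing
    'full': list(range(200)),                # no missing values
})
print(mostly_missing(toy))  # ['almost_empty']
```

On the real `train` frame, `droppable_features.extend(mostly_missing(train))` should add the same two columns as the manual `append` calls.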

3.2 Too skewed columns

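3.2.1 Majority category covers more than 99% of occurrences

# pd.options.display.float_format sets a custom display format for floats:
#   '{:,.4f}' keeps 4 decimal places; in general '{:,.Kf}' keeps K decimal places, much like C format strings.
# train[c].nunique(): the number of distinct values in the column.
# value_counts(): how many times each value occurs.
# value_counts(normalize=True): the share of each value, sorted in descending order by default
#   (missing values are excluded by default).
# value_counts(normalize=True).values[0]: the share of the most frequent value.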
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0] * 100} for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df

sk_df output:

column	skewness	uniq
75	Census_IsWIMBootEnabled	100.0000	2
5	IsBeta	99.9992	2
69	Census_IsFlightsDisabled	99.9990	2
68	Census_IsFlightingInternal	99.9986	2
27	AutoSampleOptIn	99.9971	2
71	Census_ThresholdOptIn	99.9749	2
29	SMode	99.9537	2
65	Census_IsPortableOperatingSystem	99.9455	2
28	PuaMode	99.9134	2
35	Census_DeviceFamily	99.8383	3
33	UacLuaenable	99.3925	11
76	Census_IsVirtualDevice	99.2961	2
1	ProductName	98.9356	6
12	HasTpm	98.7971	2
7	IsSxsPassiveMode	98.2666	2
32	Firewall	97.8583	2
11	AVProductsEnabled	97.3984	6
6	RtpStateBitfield	97.3262	7
20	OsVer	96.7613	58
18	Platform	96.6063	4
78	Census_IsPenCapable	96.1929	2
26	IsProtected	94.5624	2
79	Census_IsAlwaysOnAlwaysConnectedCapable	94.2581	2
70	Census_FlightRing	93.6580	10
45	Census_HasOpticalDiskDrive	92.2813	2
55	Census_OSArchitecture	90.8580	3
19	Processor	90.8530	3
66	Census_GenuineStateName	88.2992	5
39	Census_ProcessorManufacturerIdentifier	88.2789	7
77	Census_IsTouchEnabled	87.4457	2
…	…	…	…
57	Census_OSBuildNumber	44.9351	165
64	Census_OSWUAutoUpdateOptionsName	44.3256	6
23	OsPlatformSubRelease	43.8887	9
21	OsBuild	43.8887	76
30	IeVerIdentifier	43.8454	303
2	EngineVersion	43.0990	70
24	OsBuildLab	41.0045	663
59	Census_OSEdition	38.8948	33
60	Census_OSSkuName	38.8934	30
62	Census_OSInstallLanguageIdentifier	35.8777	39
63	Census_OSUILocaleIdentifier	35.5414	147
48	Census_InternalPrimaryDiagonalDisplaySizeInInches	34.3398	785
42	Census_PrimaryDiskTotalCapacity	32.0408	5735
72	Census_FirmwareManufacturerIdentifier	30.8882	712
61	Census_OSInstallTypeName	29.2332	9
17	LocaleEnglishNameIdentifier	23.4780	276
81	Wdft_RegionIdentifier	20.8877	15
16	GeoNameIdentifier	17.1716	292
58	Census_OSBuildRevision	15.8453	285
54	Census_OSVersion	15.8452	469
36	Census_OEMNameIdentifier	14.5850	3832
8	DefaultBrowsersIdentifier	10.6257	2017
13	CountryIdentifier	4.4519	222
37	Census_OEMModelIdentifier	3.4559	175365
40	Census_ProcessorModelIdentifier	3.2576	3428
4	AvSigVersion	1.1469	8531
14	CityIdentifier	1.1030	107366
73	Census_FirmwareVersionIdentifier	1.0228	50494
44	Census_SystemVolumeTotalCapacity	0.5863	536848
0	MachineIdentifier	0.0000	8921483

83 rows × 3 columns

There are 12 categorical columns whose majority category covers more than 99% of occurrences, and they are useless, too.

droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features

['PuaMode',
 'Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice']

PuaMode appears twice in the list, so one occurrence is removed.

# PuaMode is duplicated in the two categories.
droppable_features.remove('PuaMode')

# Drop these columns from the dataset.
train.drop(droppable_features, axis=1, inplace=True)
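The "skewness" used in this section is the share of the most frequent value, and `value_counts` ignores missing values unless told otherwise. The sketch below illustrates that effect on toy data and shows one way to avoid the duplicate `PuaMode` entry in the first place. The helper names `majority_share` and `add_unique` are assumptions for illustration, not the kernel's code.

```python
import numpy as np
import pandas as pd

def majority_share(series, dropna=True):
    """Share (in percent) of the most frequent value in a column."""
    counts = series.value_counts(normalize=True, dropna=dropna)
    return float(counts.iloc[0]) * 100 if len(counts) else float('nan')

def add_unique(target, names):
    """Append names to target, skipping any that are already present."""
    for name in names:
        if name not in target:
            target.append(name)
    return target

# Toy column: 8 missing values and 2 identical non-missing values.
toy = pd.Series([np.nan] * 8 + ['on', 'on'])
print(majority_share(toy))                # NaN ignored, so 'on' covers everything
print(majority_share(toy, dropna=False))  # NaN counted as its own category

drops = ['PuaMode', 'Census_ProcessorClass']
add_unique(drops, ['IsBeta', 'PuaMode'])
print(drops)  # ['PuaMode', 'Census_ProcessorClass', 'IsBeta']
```

This is why a column that is almost entirely missing can still show a majority share close to 100%: the share is computed over the non-missing values only.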

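3.2.2 Fill missing values for columns that have more than 10% of missing values

Many features have missing values. Next, we fill the features that have more than 10% missing values, and trim the features that have less than 10%: we delete the rows where those features are NaN. Note that we delete rows, not columns, so we only remove the samples that contain NaN values.

Why fill features with more than 10% missing values but delete rows for those below 10%? Features with more than 10% missing values (some have 30%, some 50%, or even more) have too many rows containing NaN. A model needs enough data, and the rows with NaN in those features probably add up to more than half of the samples. Deleting them would keep too little data and hurt training. So we fill only the features with more than 10% missing values and apply row deletion to the rest.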

In[9]:

# Nan Values
null_counts = train.isnull().sum()
null_counts = null_counts / train.shape[0]
null_counts[null_counts > 0.1]

out[9]:

DefaultBrowsersIdentifier    0.9514
OrganizationIdentifier       0.3084
SmartScreen                  0.3561
Census_InternalBatteryType   0.7105
dtype: float64

The 4 columns above need their missing values filled.

Replace missing values with 0.

In [11]: Fill:

# .fillna(0, inplace=True): fills missing values with 0 and modifies the original data in place.
# .fillna(0, inplace=False): returns a filled copy, useful for inspection; the original data keeps its missing values.
train.DefaultBrowsersIdentifier.fillna(0, inplace=True)

In [12]: The second feature

# .value_counts(): returns how many times each value occurs in the feature
train.SmartScreen.value_counts()

Out[12]:

RequireAdmin    4316183
ExistsNotSet    1046183
Off              186553
Warn             135483
Prompt            34533
Block             22533
off                1350
On                  731
&#x02;              416
&#x01;              335
on                  147
requireadmin         10
OFF                   4
0                     3
Promt                 2
requireAdmin          1
Enabled               1
prompt                1
warn                  1
00000000              1
&#x03;                1
Name: SmartScreen, dtype: int64

In [13]: The values in 'SmartScreen' are messy, so we map them to more regular strings:

trans_dict = {
    'off': 'Off', '&#x02;': '2', '&#x01;': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off', 
    'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn', 
    '00000000': '0', '&#x03;': '3', np.nan: 'NoExist'
}
train.replace({'SmartScreen': trans_dict}, inplace=True)

In [14]:

train.SmartScreen.isnull().sum()

Out[14]: Because every missing value has been set to 'NoExist', the isnull count is 0.
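Most of the entries in the mapping above only fix letter case, so a case-insensitive lookup can replace the hand-written variants. The sketch below is an assumption-laden alternative, not the kernel's method: the names `CANONICAL`, `FIXES` and `normalize_smartscreen` are hypothetical, and values it does not recognize (such as the `&#x02;` strings) pass through unchanged rather than being mapped to digits.

```python
import numpy as np
import pandas as pd

# Canonical spellings taken from the value_counts output above.
CANONICAL = ['RequireAdmin', 'ExistsNotSet', 'Off', 'Warn', 'Prompt',
             'Block', 'On', 'Enabled']
# Known misspellings, keyed in lower case.
FIXES = {'promt': 'Prompt'}

def normalize_smartscreen(series, missing='NoExist'):
    """Map case variants and known typos to canonical labels (hypothetical helper)."""
    lookup = {name.lower(): name for name in CANONICAL}
    lookup.update(FIXES)

    def fix(value):
        if pd.isna(value):
            return missing
        # Unknown values are returned unchanged.
        return lookup.get(str(value).lower(), value)

    # Work on plain objects so the result does not depend on the category dtype.
    return series.astype(object).map(fix)

toy = pd.Series(['off', 'OFF', 'requireadmin', 'Promt', 'warn', np.nan, 'Block'])
print(normalize_smartscreen(toy).tolist())
# ['Off', 'Off', 'RequireAdmin', 'Prompt', 'Warn', 'NoExist', 'Block']
```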

In [15]: The third feature:

train.OrganizationIdentifier.value_counts()

Out[15]:

27.0000    4196457
18.0000    1764175
48.0000      63845
50.0000      45502
11.0000      19436
37.0000      19398
49.0000      13627
46.0000      10974
14.0000       4713
32.0000       4045
36.0000       3909
52.0000       3043
33.0000       2896
2.0000        2595
5.0000        1990
40.0000       1648
28.0000       1591
4.0000        1385
10.0000       1083
51.0000        917
20.0000        915
1.0000         893
8.0000         723
22.0000        418
39.0000        413
6.0000         412
31.0000        398
21.0000        397
47.0000        385
3.0000         331
16.0000        242
19.0000        172
26.0000        160
44.0000        150
29.0000        135
42.0000        132
7.0000          98
41.0000         77
45.0000         73
30.0000         64
43.0000         60
35.0000         32
23.0000         20
15.0000         13
25.0000         12
12.0000          7
34.0000          2
38.0000          1
17.0000          1
Name: OrganizationIdentifier, dtype: int64

This feature stores an ID, so we can fill its missing values with 0:

train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)

The fourth feature:

In[17]:

pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()

Out[17]:

lion        2028256
li-i         245617
#            183998
lip           62099
liio          32635
li p           8383
li             6708
nimh           4614
real           2744
bq20           2302
pbac           2274
vbox           1454
unkn            533
lgi0            399
lipo            198
lhp0            182
4cel            170
lipp             83
ithi             79
batt             60
ram              35
bad              33
virt             33
pad0             22
lit              16
ca48             16
a132             10
ots0              9
lai0              8
ÿÿÿÿ              8
lio               5
4lio              4
lio               4
asmb              4
li-p              4
0x0b              3
lgs0              3
icp3              3
3ion              2
a140              2
h00j              2
5nm1              2
lhpo              2
a138              2
lilo              1
li-h              1
lp                1
li？              1
ion               1
pbso              1
3500              1
6ion              1
@i              1
li               1
sams              1
ip               1
8                 1
#TAB#             1
l&#TAB#          1
lio              1
˙˙˙              1
l                1
cl53              1
liÿÿ              1
pa50              1
í-i              1
÷ÿóö              1
li-l              1
h4°s              1
d                 1
lgl0              1
4ion              1
0ts0              1
sail              1
p-sn              1
a130              1
2337              1
lÿÿÿ              1
Name: Census_InternalBatteryType, dtype: int64

Census_InternalBatteryType has about 71% missing values, as well as “˙˙˙” and “unkn” values which seem to mean “unknown”. So replace these values with “unknown”.

trans_dict = {
    '˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)

3.2.3 Remove missing values from the train.

In [19]:

train.shape

Out[19]:

(8921483, 70)

In [20]:

# .dropna(inplace=True): drops every row that contains NaN; the remaining rows keep their original index values
train.dropna(inplace=True)
train.shape

Out[20]:

(7667789, 70)
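The two shapes above show how many samples the row deletion costs. The following arithmetic uses only the numbers printed in Out[19] and Out[20]; the variable names are chosen for illustration.

```python
rows_before = 8921483  # train.shape[0] before dropna (Out[19])
rows_after = 7667789   # train.shape[0] after dropna (Out[20])

dropped = rows_before - rows_after
share = dropped / rows_before

print(dropped)                 # 1253694
print('{:.2%}'.format(share))  # share of the original rows that were removed
```

So roughly 14% of the rows are removed, which is the price of not imputing the columns that have less than 10% missing values.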

MachineIdentifier is not useful for predicting malware detection: it is an identifier unique to each machine.

train.drop('MachineIdentifier', axis=1, inplace=True)

Label Encoding for category columns

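To make the data usable for machine learning, we need to convert some columns to the category type and then label-encode the category columns.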

train['SmartScreen'] = train.SmartScreen.astype('category')
train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')

cate_cols = train.select_dtypes(include='category').columns.tolist()

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in cate_cols:
    train[col] = le.fit_transform(train[col])
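The loop above refits the same `LabelEncoder` for every column, so once it finishes only the classes of the last column remain on `le`. If the code-to-label mapping is needed later (for example to encode the test set consistently), one option is to keep a mapping per column. The sketch below uses pandas categorical codes instead of scikit-learn; this is a swapped-in technique, and the helper name `encode_categories` is an assumption.

```python
import pandas as pd

def encode_categories(df, cols):
    """Replace each column in cols with integer codes and return the code-to-label maps."""
    mappings = {}
    for col in cols:
        cat = df[col].astype('category')
        # Categories are sorted, so code i corresponds to the i-th sorted label.
        mappings[col] = dict(enumerate(cat.cat.categories))
        df[col] = cat.cat.codes
    return mappings

# Toy data for illustration only.
toy = pd.DataFrame({'SmartScreen': ['Off', 'Warn', 'Off', 'NoExist']})
maps = encode_categories(toy, ['SmartScreen'])
print(toy['SmartScreen'].tolist())  # [1, 2, 1, 0]
print(maps['SmartScreen'])          # {0: 'NoExist', 1: 'Off', 2: 'Warn'}
```

Note that categorical codes use -1 for missing values, so this sketch assumes the missing values were already filled or dropped, as they are at this point in the walkthrough.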

Reduce memory usage with code from https://www.kaggle.com/timon88/load-whole-data-without-any-dtypes

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.
    """
    # .memory_usage() returns the memory used by the index and each column, in bytes
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # same idea for floating-point columns
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

# A bare %time line times an empty statement, so the timing printed below does not measure the function call.
%time
train = reduce_mem_usage(train)

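The code above saves memory by choosing a suitable bit width for each column.

Tip: a value such as 116 takes less memory when stored in an 8-bit type than in a 16-bit type.

Result: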

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
Memory usage of dataframe is 2464.34 MB
Memory usage after optimization is: 965.26 MB
Decreased by 60.8%
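The core of the downcasting is a range check against `np.iinfo`. The sketch below isolates that check for integers. It is an illustration, not the referenced kernel's code: the function name `smallest_int_dtype` is an assumption, and it uses inclusive bounds, whereas the function above uses strict comparisons.

```python
import numpy as np

def smallest_int_dtype(c_min, c_max):
    """Return the narrowest signed integer dtype whose range contains [c_min, c_max]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= c_min and c_max <= info.max:
            return dtype
    raise ValueError('range does not fit in int64')

print(smallest_int_dtype(0, 116).__name__)  # int8
print(smallest_int_dtype(0, 300).__name__)  # int16

# Bytes per element: an int8 column needs half the memory of an int16 column.
print(np.dtype(np.int8).itemsize, np.dtype(np.int16).itemsize)  # 1 2
```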

3.3 Highly correlated features

There are still too many features to compute and view all at once. So we group them into sets of 10 columns and examine the correlations within each group, and finally compute all the correlations among the remaining features.

cols = train.columns.tolist()
import seaborn as sns

plt.figure(figsize=(10,10))
co_cols = cols[:10]
co_cols.append('HasDetections')
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 1 ~ 10th columns')
plt.show()

Result: there are no columns with a correlation of 0.99 or higher.

In [27]:

co_cols = cols[10:20]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 11 ~ 20th columns')
plt.show()

[Figure: correlation heatmap of the 11th to 20th columns; image unavailable]

The figure shows two features whose correlation exceeds 0.99; we delete the one with fewer unique values.

print(train.Platform.nunique())  #3
print(train.OsVer.nunique())     #45

Platform vs OsVer : remove Platform

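# corr_remove collects the features to drop because of high correlation
corr_remove = []
corr_remove.append('Platform')

After repeating the same steps for the other groups:

We now have 3 features to remove based on the correlations within the groups of 10.

In the end, the three features Platform, Census_OSSkuName, and Census_OSInstallLanguageIdentifier are removed.

Having analyzed the correlations within each group, we next analyze the correlations across groups: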

corr = train.corr()
high_corr = (corr >= 0.99).astype('uint8')
plt.figure(figsize=(15,15))
sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)

Two features have a correlation of at least 0.99.
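Reading pairs off a 15 by 15 heatmap is error-prone, so the same threshold can be applied in code. The sketch below lists every column pair whose correlation is at least 0.99, mirroring the `corr >= 0.99` test above. The function name `high_corr_pairs` and the random toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.99):
    """Return (col_a, col_b, corr) for each pair with correlation >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Walk the upper triangle only, so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if value >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy data: 'a_copy' is a linear function of 'a', 'noise' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({'a': a, 'a_copy': a * 2 + 1, 'noise': rng.normal(size=500)})
print([(x, y) for x, y, _ in high_corr_pairs(toy)])  # [('a', 'a_copy')]
```

Like the heatmap, this only catches strongly positive correlations; taking the absolute value of `value` would also catch strongly negative ones.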

print(train.Census_OSArchitecture.nunique())
print(train.Processor.nunique())

3
3

Census_OSArchitecture and Processor have the same number of unique values. Then which one? Let’s compare their correlation to the HasDetections.

train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()

	Census_OSArchitecture	Processor	HasDetections
Census_OSArchitecture	1.0000	0.9951	-0.0758
Processor	0.9951	1.0000	-0.0758
HasDetections	-0.0758	-0.0758	1.0000

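Both features have the same correlation coefficient with the label HasDetections, so it does not matter which one is removed; we pick one arbitrarily.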

corr_remove.append('Processor')
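The decisions in this section follow a pattern: drop the column with fewer unique values, and if that ties, compare the correlation with the target, and if that also ties, pick either. The sketch below encodes that order as one reading of the walkthrough; the function name `pick_column_to_drop`, the tie-break details and the toy data are all assumptions.

```python
import math
import pandas as pd

def pick_column_to_drop(df, col_a, col_b, target):
    """Choose which column of a highly correlated pair to drop (hypothetical rules)."""
    # 1. Prefer dropping the column with fewer unique values.
    n_a, n_b = df[col_a].nunique(), df[col_b].nunique()
    if n_a != n_b:
        return col_a if n_a < n_b else col_b
    # 2. Otherwise drop the one less correlated with the target.
    corr_a = abs(df[col_a].corr(df[target]))
    corr_b = abs(df[col_b].corr(df[target]))
    if not math.isclose(corr_a, corr_b):
        return col_a if corr_a < corr_b else col_b
    # 3. Full tie: the choice is arbitrary, so return the second column.
    return col_b

# Toy data for illustration only.
toy = pd.DataFrame({
    'coarse': [0, 0, 1, 1, 0, 1],
    'fine':   [0, 1, 2, 3, 0, 2],
    'x':      [0, 1, 0, 1, 1, 0],
    'y':      [0, 1, 0, 1, 1, 0],
    'label':  [1, 0, 1, 0, 1, 1],
})
print(pick_column_to_drop(toy, 'coarse', 'fine', 'label'))  # coarse
print(pick_column_to_drop(toy, 'x', 'y', 'label'))          # y
```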

In [43]:

droppable_features.extend(corr_remove)
print(len(droppable_features))
droppable_features

Out[43]: the list of 17 features

17
['Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice',
 'Platform',
 'Census_OSSkuName',
 'Census_OSInstallLanguageIdentifier',
 'Processor']