Everyone Do this at the Beginning

Title: Everyone Do this at the Beginning!!

Article link: (link)

1. Summary of content

Because the data in this Kaggle competition is high-dimensional, even small tasks are hard to carry out. So, before starting any other work, the author tried to reduce the number of columns by removing the less useful ones, and selected 17 columns that you can drop right after loading the dataset.

  • Selected mostly-missing features, which have more than 99% missing values.

  • Selected too-skewed features, whose majority category covers more than 99% of occurrences.

  • Selected highly correlated features. Tested correlations between columns, picked the pairs whose correlation is greater than 0.99, compared the distributions of the features in each pair and their correlation with HasDetections, and selected the minor column for elimination.

These criteria filter out the following 17 features:

1.  (M) PuaMode
2.  (M) Census_ProcessorClass
3.  (S) Census_IsWIMBootEnabled
4.  (S) IsBeta
5.  (S) Census_IsFlightsDisabled
6.  (S) Census_IsFlightingInternal
7.  (S) AutoSampleOptIn
8.  (S) Census_ThresholdOptIn
9.  (S) SMode
10. (S) Census_IsPortableOperatingSystem
11. (S) Census_DeviceFamily
12. (S) UacLuaenable
13. (S) Census_IsVirtualDevice
14. (C) Platform
15. (C) Census_OSSkuName
16. (C) Census_OSInstallLanguageIdentifier
17. (C) Processor

Here, (M) denotes mostly-missing features, (S) means too-skewed features, and (C) indicates highly correlated features.

Tip: this kernel uses only the training dataset, but the results were the same when the train + test datasets were used later.
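
To apply the list in practice, the 17 names can be dropped right after loading. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `drop_low_value_columns` and the toy frame are hypothetical, and `errors='ignore'` simply skips names that are not present.

```python
import pandas as pd

# The 17 columns listed above, grouped by the reason for removal.
DROPPABLE = [
    # (M) mostly missing
    'PuaMode', 'Census_ProcessorClass',
    # (S) too skewed
    'Census_IsWIMBootEnabled', 'IsBeta', 'Census_IsFlightsDisabled',
    'Census_IsFlightingInternal', 'AutoSampleOptIn', 'Census_ThresholdOptIn',
    'SMode', 'Census_IsPortableOperatingSystem', 'Census_DeviceFamily',
    'UacLuaenable', 'Census_IsVirtualDevice',
    # (C) highly correlated
    'Platform', 'Census_OSSkuName', 'Census_OSInstallLanguageIdentifier',
    'Processor',
]

def drop_low_value_columns(df):
    """Return a copy of df without the listed columns (hypothetical helper)."""
    return df.drop(columns=DROPPABLE, errors='ignore')

# Toy frame for illustration only.
demo = pd.DataFrame({'IsBeta': [0, 0],
                     'Platform': ['windows10', 'windows8'],
                     'HasTpm': [1, 1]})
print(drop_low_value_columns(demo).columns.tolist())
```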

2. Load Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# referred https://www.kaggle.com/theoviel/load-the-totality-of-the-data
dtypes = {
        'MachineIdentifier':                                    'category',
        'ProductName':                                          'category',
        'EngineVersion':                                        'category',
        'AppVersion':                                           'category',
        'AvSigVersion':                                         'category',
        'IsBeta':                                               'int8',
        'RtpStateBitfield':                                     'float16',
        'IsSxsPassiveMode':                                     'int8',
        'DefaultBrowsersIdentifier':                            'float32',
        'AVProductStatesIdentifier':                            'float32',
        'AVProductsInstalled':                                  'float16',
        'AVProductsEnabled':                                    'float16',
        'HasTpm':                                               'int8',
        'CountryIdentifier':                                    'int16',
        'CityIdentifier':                                       'float32',
        'OrganizationIdentifier':                               'float16',
        'GeoNameIdentifier':                                    'float16',
        'LocaleEnglishNameIdentifier':                          'int16',
        'Platform':                                             'category',
        'Processor':                                            'category',
        'OsVer':                                                'category',
        'OsBuild':                                              'int16',
        'OsSuite':                                              'int16',
        'OsPlatformSubRelease':                                 'category',
        'OsBuildLab':                                           'category',
        'SkuEdition':                                           'category',
        'IsProtected':                                          'float16',
        'AutoSampleOptIn':                                      'int8',
        'PuaMode':                                              'category',
        'SMode':                                                'float16',
        'IeVerIdentifier':                                      'float16',
        'SmartScreen':                                          'category',
        'Firewall':                                             'float16',
        'UacLuaenable':                                         'float64',
        # was 'float32'
        'Census_MDC2FormFactor':                                'category',
        'Census_DeviceFamily':                                  'category',
        'Census_OEMNameIdentifier':                             'float32', 
        # was 'float16'
        'Census_OEMModelIdentifier':                            'float32',
        'Census_ProcessorCoreCount':                            'float16',
        'Census_ProcessorManufacturerIdentifier':               'float16',
        'Census_ProcessorModelIdentifier':                      'float32', 
        # was 'float16'
        'Census_ProcessorClass':                                'category',
        'Census_PrimaryDiskTotalCapacity':                      'float64', 
        # was 'float32'
        'Census_PrimaryDiskTypeName':                           'category',
        'Census_SystemVolumeTotalCapacity':                     'float64',
        # was 'float32'
        'Census_HasOpticalDiskDrive':                           'int8',
        'Census_TotalPhysicalRAM':                              'float32',
        'Census_ChassisTypeName':                               'category',
        'Census_InternalPrimaryDiagonalDisplaySizeInInches':    'float32',
        # was 'float16'
        'Census_InternalPrimaryDisplayResolutionHorizontal':    'float32', 
        # was 'float16'
        'Census_InternalPrimaryDisplayResolutionVertical':      'float32', 
        # was 'float16'
        'Census_PowerPlatformRoleName':                         'category',
        'Census_InternalBatteryType':                           'category',
        'Census_InternalBatteryNumberOfCharges':                'float64',
        # was 'float32'
        'Census_OSVersion':                                     'category',
        'Census_OSArchitecture':                                'category',
        'Census_OSBranch':                                      'category',
        'Census_OSBuildNumber':                                 'int16',
        'Census_OSBuildRevision':                               'int32',
        'Census_OSEdition':                                     'category',
        'Census_OSSkuName':                                     'category',
        'Census_OSInstallTypeName':                             'category',
        'Census_OSInstallLanguageIdentifier':                   'float16',
        'Census_OSUILocaleIdentifier':                          'int16',
        'Census_OSWUAutoUpdateOptionsName':                     'category',
        'Census_IsPortableOperatingSystem':                     'int8',
        'Census_GenuineStateName':                              'category',
        'Census_ActivationChannel':                             'category',
        'Census_IsFlightingInternal':                           'float16',
        'Census_IsFlightsDisabled':                             'float16',
        'Census_FlightRing':                                    'category',
        'Census_ThresholdOptIn':                                'float16',
        'Census_FirmwareManufacturerIdentifier':                'float16',
        'Census_FirmwareVersionIdentifier':                     'float32',
        'Census_IsSecureBootEnabled':                           'int8',
        'Census_IsWIMBootEnabled':                              'float16',
        'Census_IsVirtualDevice':                               'float16',
        'Census_IsTouchEnabled':                                'int8',
        'Census_IsPenCapable':                                  'int8',
        'Census_IsAlwaysOnAlwaysConnectedCapable':              'float16',
        'Wdft_IsGamer':                                         'float16',
        'Wdft_RegionIdentifier':                                'float16',
        'HasDetections':                                        'int8'
        }
train = pd.read_csv('../input/train.csv', dtype=dtypes)
train.shape
# train.shape output:
# (8921483, 83)
# An empty list to collect the names of the features to be removed.
droppable_features = []
3. Feature Engineering
3.1 mostly-missing Columns
# Compute the proportion of missing values for each feature.
(train.isnull().sum()/train.shape[0]).sort_values(ascending=False)
PuaMode                                              0.999741
Census_ProcessorClass                                0.995894
DefaultBrowsersIdentifier                            0.951416
Census_IsFlightingInternal                           0.830440
Census_InternalBatteryType                           0.710468
Census_ThresholdOptIn                                0.635245
Census_IsWIMBootEnabled                              0.634390
SmartScreen                                          0.356108
OrganizationIdentifier                               0.308415
SMode                                                0.060277
CityIdentifier                                       0.036475
Wdft_IsGamer                                         0.034014
Wdft_RegionIdentifier                                0.034014
Census_InternalBatteryNumberOfCharges                0.030124
Census_FirmwareManufacturerIdentifier                0.020541
Census_IsFlightsDisabled                             0.017993
Census_FirmwareVersionIdentifier                     0.017949
Census_OEMModelIdentifier                            0.011459
Census_OEMNameIdentifier                             0.010702
Firewall                                             0.010239
Census_TotalPhysicalRAM                              0.009027
Census_IsAlwaysOnAlwaysConnectedCapable              0.007997
Census_OSInstallLanguageIdentifier                   0.006735
IeVerIdentifier                                      0.006601
Census_PrimaryDiskTotalCapacity                      0.005943
Census_SystemVolumeTotalCapacity                     0.005941
Census_InternalPrimaryDiagonalDisplaySizeInInches    0.005283
Census_InternalPrimaryDisplayResolutionHorizontal    0.005267
Census_InternalPrimaryDisplayResolutionVertical      0.005267
Census_ProcessorModelIdentifier                      0.004634
                                                       ...   
ProductName                                          0.000000
HasTpm                                               0.000000
OsBuild                                              0.000000
IsBeta                                               0.000000
OsSuite                                              0.000000
IsSxsPassiveMode                                     0.000000
HasDetections                                        0.000000
SkuEdition                                           0.000000
Census_OSInstallTypeName                             0.000000
Census_IsPenCapable                                  0.000000
Census_IsTouchEnabled                                0.000000
Census_IsSecureBootEnabled                           0.000000
Census_FlightRing                                    0.000000
Census_ActivationChannel                             0.000000
Census_GenuineStateName                              0.000000
Census_IsPortableOperatingSystem                     0.000000
Census_OSWUAutoUpdateOptionsName                     0.000000
Census_OSUILocaleIdentifier                          0.000000
Census_OSSkuName                                     0.000000
AutoSampleOptIn                                      0.000000
Census_OSEdition                                     0.000000
Census_OSBuildRevision                               0.000000
Census_OSBuildNumber                                 0.000000
Census_OSBranch                                      0.000000
Census_OSArchitecture                                0.000000
Census_OSVersion                                     0.000000
Census_HasOpticalDiskDrive                           0.000000
Census_DeviceFamily                                  0.000000
Census_MDC2FormFactor                                0.000000
MachineIdentifier                                    0.000000
Length: 83, dtype: float64
  • There are 2 columns with more than 99% missing values, and they are useless.

# Add these two features to the empty list defined earlier.
droppable_features.append('PuaMode')
droppable_features.append('Census_ProcessorClass')
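
The same selection can be made without reading the names off the printed output. This is a minimal sketch, assuming a pandas DataFrame; the helper name `mostly_missing_columns` and the toy data are hypothetical, and the 0.99 threshold follows the rule stated above.

```python
import numpy as np
import pandas as pd

def mostly_missing_columns(df, threshold=0.99):
    """Names of columns whose share of missing values exceeds threshold."""
    # isnull().mean() equals isnull().sum() / number of rows.
    ratio = df.isnull().mean()
    return ratio[ratio > threshold].index.tolist()

# Toy frame for illustration only.
demo = pd.DataFrame({
    'almost_empty': [np.nan] * 199 + [1.0],  # 99.5% missing
    'half_empty': [np.nan, 1.0] * 100,       # 50% missing
    'full': range(200),                      # nothing missing
})
print(mostly_missing_columns(demo))
```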
3.2 Too skewed columns
3.2.1 Majority category covers more than 99% of occurrences

# pd.options.display lets you customize how values are displayed:
#   '{:,.4f}' keeps 4 decimal places
#   '{:,.Kf}' keeps K decimal places, much like C format specifiers
# train[c].nunique(): the number of distinct values in column c
# value_counts(): the number of occurrences of each value
# value_counts(normalize=True): the share of each value, sorted in descending order by default
# value_counts(normalize=True).values[0]: the share of the most frequent value
pd.options.display.float_format = '{:,.4f}'.format
sk_df = pd.DataFrame([{'column': c, 'uniq': train[c].nunique(), 'skewness': train[c].value_counts(normalize=True).values[0] * 100} for c in train.columns])
sk_df = sk_df.sort_values('skewness', ascending=False)
sk_df

Output of sk_df:

    column                                             skewness  uniq
75  Census_IsWIMBootEnabled                            100.0000  2
 5  IsBeta                                              99.9992  2
69  Census_IsFlightsDisabled                            99.9990  2
68  Census_IsFlightingInternal                          99.9986  2
27  AutoSampleOptIn                                     99.9971  2
71  Census_ThresholdOptIn                               99.9749  2
29  SMode                                               99.9537  2
65  Census_IsPortableOperatingSystem                    99.9455  2
28  PuaMode                                             99.9134  2
35  Census_DeviceFamily                                 99.8383  3
33  UacLuaenable                                        99.3925  11
76  Census_IsVirtualDevice                              99.2961  2
 1  ProductName                                         98.9356  6
12  HasTpm                                              98.7971  2
 7  IsSxsPassiveMode                                    98.2666  2
32  Firewall                                            97.8583  2
11  AVProductsEnabled                                   97.3984  6
 6  RtpStateBitfield                                    97.3262  7
20  OsVer                                               96.7613  58
18  Platform                                            96.6063  4
78  Census_IsPenCapable                                 96.1929  2
26  IsProtected                                         94.5624  2
79  Census_IsAlwaysOnAlwaysConnectedCapable             94.2581  2
70  Census_FlightRing                                   93.6580  10
45  Census_HasOpticalDiskDrive                          92.2813  2
55  Census_OSArchitecture                               90.8580  3
19  Processor                                           90.8530  3
66  Census_GenuineStateName                             88.2992  5
39  Census_ProcessorManufacturerIdentifier              88.2789  7
77  Census_IsTouchEnabled                               87.4457  2
57  Census_OSBuildNumber                                44.9351  165
64  Census_OSWUAutoUpdateOptionsName                    44.3256  6
23  OsPlatformSubRelease                                43.8887  9
21  OsBuild                                             43.8887  76
30  IeVerIdentifier                                     43.8454  303
 2  EngineVersion                                       43.0990  70
24  OsBuildLab                                          41.0045  663
59  Census_OSEdition                                    38.8948  33
60  Census_OSSkuName                                    38.8934  30
62  Census_OSInstallLanguageIdentifier                  35.8777  39
63  Census_OSUILocaleIdentifier                         35.5414  147
48  Census_InternalPrimaryDiagonalDisplaySizeInInches   34.3398  785
42  Census_PrimaryDiskTotalCapacity                     32.0408  5735
72  Census_FirmwareManufacturerIdentifier               30.8882  712
61  Census_OSInstallTypeName                            29.2332  9
17  LocaleEnglishNameIdentifier                         23.4780  276
81  Wdft_RegionIdentifier                               20.8877  15
16  GeoNameIdentifier                                   17.1716  292
58  Census_OSBuildRevision                              15.8453  285
54  Census_OSVersion                                    15.8452  469
36  Census_OEMNameIdentifier                            14.5850  3832
 8  DefaultBrowsersIdentifier                           10.6257  2017
13  CountryIdentifier                                    4.4519  222
37  Census_OEMModelIdentifier                            3.4559  175365
40  Census_ProcessorModelIdentifier                      3.2576  3428
 4  AvSigVersion                                         1.1469  8531
14  CityIdentifier                                       1.1030  107366
73  Census_FirmwareVersionIdentifier                     1.0228  50494
44  Census_SystemVolumeTotalCapacity                     0.5863  536848
 0  MachineIdentifier                                    0.0000  8921483

83 rows × 3 columns

  • There are 12 categorical columns whose majority category covers more than 99% of occurrences, and they are useless, too.
droppable_features.extend(sk_df[sk_df.skewness > 99].column.tolist())
droppable_features
['PuaMode',
 'Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice']

PuaMode appears twice in the list, so one occurrence is removed.

# PuaMode is duplicated in the two categories.
droppable_features.remove('PuaMode')

# Drop these columns from the dataset.
train.drop(droppable_features, axis=1, inplace=True)
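
One detail of the skewness measure is worth spelling out: `value_counts(normalize=True)` ignores NaN by default, so the share is computed among non-missing values only. That is probably why a column with many missing values can still show a majority share close to 100%. A minimal sketch on a made-up column (the helper name `majority_share` is hypothetical):

```python
import numpy as np
import pandas as pd

def majority_share(series, dropna=True):
    """Percentage of rows taken by the most frequent value."""
    counts = series.value_counts(normalize=True, dropna=dropna)
    return counts.iloc[0] * 100 if len(counts) else float('nan')

# Toy column: 3 observed values (all equal) and 7 missing ones.
col = pd.Series([1.0] * 3 + [np.nan] * 7)
print(round(majority_share(col), 1))                # share among non-missing rows
print(round(majority_share(col, dropna=False), 1))  # share when NaN counts as a value
```

With `dropna=False`, NaN is treated as its own category, which gives a different picture for columns that are both sparse and skewed.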
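3.2.2 Fill missing values for columns that have more than 10% of missing values

Many features contain missing values. Next we fill the features whose missing-value ratio exceeds 10%, and for features below 10% we drop the rows where the value is NaN. Note that rows are dropped, not columns: only the samples containing NaN are removed.

Why fill the features with more than 10% missing values but drop rows for those below 10%? Features above 10% (some at 30%, some at 50% or more) have too many rows with NaN. A model needs enough data, and the NaN rows across those features probably add up to more than half of the samples. Dropping them would leave too little data for training, so only the features above 10% are filled, and the rest are handled by dropping samples.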

In[9]:

# Nan Values
null_counts = train.isnull().sum()
null_counts = null_counts / train.shape[0]
null_counts[null_counts > 0.1]

out[9]:

DefaultBrowsersIdentifier    0.9514
OrganizationIdentifier       0.3084
SmartScreen                  0.3561
Census_InternalBatteryType   0.7105
dtype: float64

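The 4 columns above need their missing values filled.

Replace missing values with 0.

In [11]: Fill:

# .fillna(0, inplace=True): fills missing values with 0 and modifies the original data, so every missing value is replaced by 0
# .fillna(0, inplace=False): returns a filled copy that can be printed for inspection; the original data is unchanged and its missing values remain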
train.DefaultBrowsersIdentifier.fillna(0, inplace=True)
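
In recent pandas versions, calling an in-place method on a column reached through attribute access counts as chained assignment: it emits a warning, and with copy-on-write enabled it may leave the DataFrame unchanged. A minimal sketch of the assignment form, on a toy frame made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only.
demo = pd.DataFrame({'DefaultBrowsersIdentifier': [np.nan, 239.0, np.nan]})

# Assign the filled column back instead of relying on inplace=True.
demo['DefaultBrowsersIdentifier'] = demo['DefaultBrowsersIdentifier'].fillna(0)
print(demo['DefaultBrowsersIdentifier'].tolist())
```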

In [12]: The second feature

# .value_counts(): returns the number of occurrences of each value of this feature
train.SmartScreen.value_counts()

Out[12]:

RequireAdmin    4316183
ExistsNotSet    1046183
Off              186553
Warn             135483
Prompt            34533
Block             22533
off                1350
On                  731
              416
              335
on                  147
requireadmin         10
OFF                   4
0                     3
Promt                 2
requireAdmin          1
Enabled               1
prompt                1
warn                  1
00000000              1
                1
Name: SmartScreen, dtype: int64

In [13]: The values in 'SmartScreen' are messy, so we map them to normalized strings:

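# The three '' keys below stand for non-printable SmartScreen values (the blank labels in Out[12]);
# the characters were lost, and as written the duplicate '' keys collapse into a single entry.
trans_dict = {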
    'off': 'Off', '': '2', '': '1', 'on': 'On', 'requireadmin': 'RequireAdmin', 'OFF': 'Off', 
    'Promt': 'Prompt', 'requireAdmin': 'RequireAdmin', 'prompt': 'Prompt', 'warn': 'Warn', 
    '00000000': '0', '': '3', np.nan: 'NoExist'
}
train.replace({'SmartScreen': trans_dict}, inplace=True)
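
Most entries in the mapping differ from a canonical value only by letter case, so a case-insensitive lookup is an alternative to listing every variant by hand. The following is a rough sketch, not the kernel's method: the helper `normalize_smartscreen` and the `CANONICAL` list are assumptions, and the sketch covers only case variants, the `Promt` misspelling, and missing values (other odd values are left as they are).

```python
import numpy as np
import pandas as pd

# Canonical spellings taken from the value counts above (assumed complete enough).
CANONICAL = ['RequireAdmin', 'ExistsNotSet', 'Off', 'Warn',
             'Prompt', 'Block', 'On', 'Enabled']
LOOKUP = {name.lower(): name for name in CANONICAL}
LOOKUP['promt'] = 'Prompt'  # misspelling seen in the data

def normalize_smartscreen(series):
    """Map case variants to canonical names and missing values to 'NoExist'."""
    original = series.astype(object)
    mapped = original.str.lower().map(LOOKUP)  # NaN where no canonical match
    # Keep the original value wherever no canonical spelling was found.
    cleaned = mapped.where(mapped.notna(), original)
    return cleaned.fillna('NoExist')

demo = pd.Series(['off', 'OFF', 'Promt', 'requireAdmin', np.nan, 'Block'])
print(normalize_smartscreen(demo).tolist())
```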

In [14]:

train.SmartScreen.isnull().sum()

Out[14]: All missing values have been set to 'NoExist', so the isnull count is 0

0

In [15]: The third feature:

train.OrganizationIdentifier.value_counts()

Out[15]:

27.0000    4196457
18.0000    1764175
48.0000      63845
50.0000      45502
11.0000      19436
37.0000      19398
49.0000      13627
46.0000      10974
14.0000       4713
32.0000       4045
36.0000       3909
52.0000       3043
33.0000       2896
2.0000        2595
5.0000        1990
40.0000       1648
28.0000       1591
4.0000        1385
10.0000       1083
51.0000        917
20.0000        915
1.0000         893
8.0000         723
22.0000        418
39.0000        413
6.0000         412
31.0000        398
21.0000        397
47.0000        385
3.0000         331
16.0000        242
19.0000        172
26.0000        160
44.0000        150
29.0000        135
42.0000        132
7.0000          98
41.0000         77
45.0000         73
30.0000         64
43.0000         60
35.0000         32
23.0000         20
15.0000         13
25.0000         12
12.0000          7
34.0000          2
38.0000          1
17.0000          1
Name: OrganizationIdentifier, dtype: int64

This feature stores an ID, so we can fill its missing values with 0:

train.replace({'OrganizationIdentifier': {np.nan: 0}}, inplace=True)

The fourth feature:

In[17]:

pd.options.display.max_rows = 99
train.Census_InternalBatteryType.value_counts()

Out[17]:

lion        2028256
li-i         245617
#            183998
lip           62099
liio          32635
li p           8383
li             6708
nimh           4614
real           2744
bq20           2302
pbac           2274
vbox           1454
unkn            533
lgi0            399
lipo            198
lhp0            182
4cel            170
lipp             83
ithi             79
batt             60
ram              35
bad              33
virt             33
pad0             22
lit              16
ca48             16
a132             10
ots0              9
lai0              8
ÿÿÿÿ              8
lio               5
4lio              4
lio               4
asmb              4
li-p              4
0x0b              3
lgs0              3
icp3              3
3ion              2
a140              2
h00j              2
5nm1              2
lhpo              2
a138              2
lilo              1
li-h              1
lp                1
li?              1
ion               1
pbso              1
3500              1
6ion              1
@i              1
li               1
sams              1
ip               1
8                 1
#TAB#             1
l&#TAB#          1
lio              1
˙˙˙              1
l                1
cl53              1
liÿÿ              1
pa50              1
í-i              1
÷ÿóö              1
li-l              1
h4°s              1
d                 1
lgl0              1
4ion              1
0ts0              1
sail              1
p-sn              1
a130              1
2337              1
lÿÿÿ              1
Name: Census_InternalBatteryType, dtype: int64

Census_InternalBatteryType has about 71% missing values (0.7105 in Out[9]), as well as “˙˙˙” and “unkn” values which seem to mean “unknown”. So replace all of these with “unknown”.

trans_dict = {
    '˙˙˙': 'unknown', 'unkn': 'unknown', np.nan: 'unknown'
}
train.replace({'Census_InternalBatteryType': trans_dict}, inplace=True)
3.2.3 Remove missing values from the train.

In [19]:

train.shape

Out[19]:

(8921483, 70)

In [20]:

# .dropna(inplace=True): drops every row that contains NaN; the remaining rows keep their original index labels
train.dropna(inplace=True)
train.shape

Out[20]:

(7667789, 70)
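
The fill-then-drop policy described in 3.2.2 can be sketched on a toy frame to show why the order matters. This is an illustration under assumptions, not the kernel's code: the helper `fill_then_drop` is hypothetical and fills every high-missing column with a single value, whereas the kernel chooses a fill value per column.

```python
import numpy as np
import pandas as pd

def fill_then_drop(df, fill_threshold=0.1, fill_value=0):
    """Fill columns above the missing-ratio threshold, then drop rows with NaN."""
    ratio = df.isnull().mean()
    to_fill = ratio[ratio > fill_threshold].index
    filled = df.fillna({col: fill_value for col in to_fill})
    return filled.dropna()

# Toy frame for illustration only.
demo = pd.DataFrame({
    'sparse': [np.nan] * 10 + [1.0] * 10,                 # 50% missing: filled
    'dense': [np.nan] + [float(i) for i in range(19)],    # 5% missing: rows dropped
})
print(len(demo.dropna()))         # rows left if nothing is filled first
print(len(fill_then_drop(demo)))  # rows left after filling the sparse column
```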

MachineIdentifier is a unique identifier for each machine, so it is not useful for predicting malware detection.

train.drop('MachineIdentifier', axis=1, inplace=True)

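Label Encoding for category columns

To make the data usable for machine learning, we first convert these columns to the category type and then label-encode every category column.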

train['SmartScreen'] = train.SmartScreen.astype('category')
train['Census_InternalBatteryType'] = train.Census_InternalBatteryType.astype('category')

cate_cols = train.select_dtypes(include='category').columns.tolist()

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in cate_cols:
    train[col] = le.fit_transform(train[col])
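
Because the loop refits a single `LabelEncoder`, only the last column's classes remain available afterwards, which makes it hard to decode values or encode a test set consistently. One way to keep a mapping per column is sketched below using pandas categorical codes instead of `LabelEncoder` (a swapped-in technique; the helper name `encode_categories` is hypothetical). For string values, both approaches should assign codes in sorted order of the unique values.

```python
import pandas as pd

def encode_categories(df, columns):
    """Integer-encode the given columns and return (encoded_df, mappings)."""
    out = df.copy()
    mappings = {}
    for col in columns:
        cat = out[col].astype('category')
        # code -> original value, so the encoding can be reversed or reused
        mappings[col] = dict(enumerate(cat.cat.categories))
        out[col] = cat.cat.codes
    return out, mappings

# Toy frame for illustration only.
demo = pd.DataFrame({'SmartScreen': ['Off', 'Warn', 'Off', 'Block']})
encoded, mappings = encode_categories(demo, ['SmartScreen'])
print(encoded['SmartScreen'].tolist())
print(mappings['SmartScreen'])
```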

Reduce memory usage with code from https://www.kaggle.com/timon88/load-whole-data-without-any-dtypes

def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    # memory_usage() returns bytes per column; convert the total to MB
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the narrowest integer type whose range contains the column
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # same idea for floating-point columns
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            # object columns are stored as category
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
# Note: a bare %time on its own line times an empty statement, so the CPU and wall times
# reported below do not measure reduce_mem_usage.
%time

train = reduce_mem_usage(train)

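The code above reduces memory usage by choosing an appropriately sized data type for each column.

Tip: a value such as 116 takes less memory when stored in an 8-bit type than in a 16-bit type.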

result:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 5.48 µs
Memory usage of dataframe is 2464.34 MB
Memory usage after optimization is: 965.26 MB
Decreased by 60.8%
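
The tip about 116 can be checked directly with `np.iinfo`. A minimal sketch follows; the helper `smallest_int_dtype` is hypothetical and uses inclusive bounds, while the function above uses strict inequalities.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Narrowest signed integer dtype whose range contains [lo, hi]."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return dtype
    raise ValueError('range does not fit in int64')

print(smallest_int_dtype(0, 116).__name__)
print(smallest_int_dtype(0, 200).__name__)

# The same 1000 values stored in 8-bit and in 16-bit integers.
values = [116] * 1000
print(np.array(values, dtype=np.int8).nbytes, np.array(values, dtype=np.int16).nbytes)
```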
3.3 Highly correlated features

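There are still too many features to compute and view all at once. So we split them into groups of 10 columns, examine the correlations within each group, and finally compute all correlations among the remaining features.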

cols = train.columns.tolist()
corr_remove = []  # names of columns to drop because of high correlation
import seaborn as sns

plt.figure(figsize=(10,10))
co_cols = cols[:10]
co_cols.append('HasDetections')
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 1 ~ 10th columns')

plt.show()

Result: no columns have a correlation of 0.99 or higher.

In [27]:

co_cols = cols[10:20]
co_cols.append('HasDetections')
plt.figure(figsize=(10,10))
sns.heatmap(train[co_cols].corr(), cmap='RdBu_r', annot=True, center=0.0)
plt.title('Correlation between 11 ~ 20th columns')
plt.show()

[Figure: heatmap "Correlation between 11 ~ 20th columns"]

The figure shows two features whose correlation exceeds 0.99; select and remove the one with fewer unique values.

print(train.Platform.nunique())  #3
print(train.OsVer.nunique())     #45
  • Platform vs OsVer : remove Platform
corr_remove.append('Platform')

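After repeating this procedure for the other groups:

We now have 3 features to remove based on the correlations within the groups of 10.

In the end, the three features Platform, Census_OSSkuName, and Census_OSInstallLanguageIdentifier were removed.

Having analyzed the correlations within each group, we next analyze the correlations across groups: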

corr = train.corr()
high_corr = (corr >= 0.99).astype('uint8')
plt.figure(figsize=(15,15))
sns.heatmap(high_corr, cmap='RdBu_r', annot=True, center=0.0)

Two features with a correlation >= 0.99 show up.
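
Reading pairs off a large heatmap is error-prone, so the pairs can also be listed programmatically. A minimal sketch, assuming a numeric DataFrame; the helper `high_corr_pairs` and the toy data are hypothetical, and the `>= threshold` test mirrors the expression used for `high_corr` above.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.99):
    """List (col_a, col_b, corr) for every column pair at or above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            value = corr.iloc[i, j]
            if value >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs

# Toy frame: 'b' is almost a multiple of 'a', 'c' is unrelated.
demo = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [2, 4, 6, 8, 10.5],
                     'c': [5, 1, 4, 2, 3]})
for col_a, col_b, value in high_corr_pairs(demo):
    print(col_a, col_b, round(value, 4))
```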

print(train.Census_OSArchitecture.nunique())
print(train.Processor.nunique())
3
3

Census_OSArchitecture and Processor have the same number of unique values. Then which one? Let’s compare their correlation with HasDetections.

train[['Census_OSArchitecture', 'Processor', 'HasDetections']].corr()
                       Census_OSArchitecture  Processor  HasDetections
Census_OSArchitecture                 1.0000     0.9951        -0.0758
Processor                             0.9951     1.0000        -0.0758
HasDetections                        -0.0758    -0.0758         1.0000

Both features have the same correlation coefficient with the label HasDetections, so it makes no difference which one is removed; one is picked arbitrarily.

corr_remove.append('Processor')
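
The choices made so far suggest a simple tie-break: drop the column with fewer unique values, then the one less correlated with the target, and otherwise pick arbitrarily. The article states this rule only informally, so the sketch below is one possible reading; the helper `column_to_drop` and the toy data are hypothetical.

```python
import pandas as pd

def column_to_drop(df, col_a, col_b, target):
    """Pick which column of a highly correlated pair to drop (one possible rule)."""
    n_a, n_b = df[col_a].nunique(), df[col_b].nunique()
    if n_a != n_b:
        return col_a if n_a < n_b else col_b  # fewer unique values
    corr_a = abs(df[col_a].corr(df[target]))
    corr_b = abs(df[col_b].corr(df[target]))
    if corr_a != corr_b:
        return col_a if corr_a < corr_b else col_b  # weaker link to the target
    return col_b  # full tie: the choice is arbitrary

# Toy frame for illustration only.
demo = pd.DataFrame({'few': [0, 0, 1, 1, 0, 1],
                     'many': [0, 1, 4, 5, 2, 3],
                     'p': [0, 1, 2, 0, 1, 2],
                     'q': [0, 1, 2, 0, 1, 2],
                     'target': [0, 0, 1, 1, 0, 1]})
print(column_to_drop(demo, 'few', 'many', 'target'))
print(column_to_drop(demo, 'p', 'q', 'target'))
```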

In [43]:

droppable_features.extend(corr_remove)
print(len(droppable_features))
droppable_features

Out[43]: the list of 17 features

17
['Census_ProcessorClass',
 'Census_IsWIMBootEnabled',
 'IsBeta',
 'Census_IsFlightsDisabled',
 'Census_IsFlightingInternal',
 'AutoSampleOptIn',
 'Census_ThresholdOptIn',
 'SMode',
 'Census_IsPortableOperatingSystem',
 'PuaMode',
 'Census_DeviceFamily',
 'UacLuaenable',
 'Census_IsVirtualDevice',
 'Platform',
 'Census_OSSkuName',
 'Census_OSInstallLanguageIdentifier',
 'Processor']