My EDA - I want to see all!

1.summary of article content

Data visualization

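The main goal of this article is to look closely at the data and at the relationships within it; it is purely about data visualization.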

2.module

import datetime
import gc
import sys
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Silence warnings so the notebook output stays readable.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

# On matplotlib 3.6 and later this style is named 'seaborn-v0_8'.
plt.style.use('seaborn')
sns.set(font_scale=2)
pd.set_option('display.max_columns', 500)

3.read and check dataset

3.1Read dataset
%time train = pd.read_csv("../input/train.csv", dtype=dtypes)
%time test = pd.read_csv("../input/test.csv", dtype=dtypes)
CPU times: user 2min 50s, sys: 19.3 s, total: 3min 10s
Wall time: 3min 11s
CPU times: user 2min 33s, sys: 10.7 s, total: 2min 44s
Wall time: 2min 44s

In [4]:

print(train.shape, test.shape)
(8921483, 83) (7853253, 82)
  • You can see that the datasets are large: about 8.9 million rows in train and 7.9 million rows in test.
3.2Check the target

In:

train['HasDetections'].value_counts().plot.bar()
plt.title('HasDetections(target)')

out:

Text(0.5,1,'HasDetections(target)')

[Figure: bar chart of the HasDetections value counts]

  • Wow, the target is very well balanced!
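
The balance can also be read as numbers rather than from the bar chart. A small sketch, using a made-up target column in place of train['HasDetections']:

```python
import pandas as pd

# Made-up target column; in the notebook this would be train['HasDetections'].
target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='HasDetections')

# normalize=True returns the share of each class instead of raw counts.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```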

3.3Check the dataset
%%time
# checking missing data
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
CPU times: user 46.2 s, sys: 7.41 s, total: 53.6 s
Wall time: 53.4 s
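
The cell above calls isnull() three times on a very large frame. A possible variant, shown here on a made-up frame, computes the null mask once and derives both columns from it; the frame and its column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for train.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': ['x', None, 'y', 'z'],
    'c': [1, 2, 3, 4],
})

null_mask = df.isnull()  # computed once, reused for both columns
missing = pd.DataFrame({
    'Total': null_mask.sum(),             # number of missing values per column
    'Percent': null_mask.mean() * 100,    # share of missing values in percent
}).sort_values('Total', ascending=False)
print(missing)
```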

In [7]:

missing_train_data.head(50)

Out[7]:

                                                     Total    Percent
PuaMode                                            8919174  99.974119
Census_ProcessorClass                              8884852  99.589407
DefaultBrowsersIdentifier                          8488045  95.141637
Census_IsFlightingInternal                         7408759  83.044030
Census_InternalBatteryType                         6338429  71.046809
Census_ThresholdOptIn                              5667325  63.524472
Census_IsWIMBootEnabled                            5659703  63.439038
SmartScreen                                        3177011  35.610795
OrganizationIdentifier                             2751518  30.841487
SMode                                               537759   6.027686
CityIdentifier                                      325409   3.647477
Wdft_IsGamer                                        303451   3.401352
Wdft_RegionIdentifier                               303451   3.401352
Census_InternalBatteryNumberOfCharges               268755   3.012448
Census_FirmwareManufacturerIdentifier               183257   2.054109
Census_IsFlightsDisabled                            160523   1.799286
Census_FirmwareVersionIdentifier                    160133   1.794915
Census_OEMModelIdentifier                           102233   1.145919
Census_OEMNameIdentifier                             95478   1.070203
Firewall                                             91350   1.023933
Census_TotalPhysicalRAM                              80533   0.902686
Census_IsAlwaysOnAlwaysConnectedCapable              71343   0.799676
Census_OSInstallLanguageIdentifier                   60084   0.673475
IeVerIdentifier                                      58894   0.660137
Census_PrimaryDiskTotalCapacity                      53016   0.594251
Census_SystemVolumeTotalCapacity                     53002   0.594094
Census_InternalPrimaryDiagonalDisplaySizeInInches    47134   0.528320
Census_InternalPrimaryDisplayResolutionHorizontal    46986   0.526661
Census_InternalPrimaryDisplayResolutionVertical      46986   0.526661
Census_ProcessorModelIdentifier                      41343   0.463410
Census_ProcessorManufacturerIdentifier               41313   0.463073
Census_ProcessorCoreCount                            41306   0.462995
AVProductsEnabled                                    36221   0.405998
AVProductsInstalled                                  36221   0.405998
AVProductStatesIdentifier                            36221   0.405998
IsProtected                                          36044   0.404014
RtpStateBitfield                                     32318   0.362249
Census_IsVirtualDevice                               15953   0.178816
Census_PrimaryDiskTypeName                           12844   0.143967
UacLuaenable                                         10838   0.121482
Census_ChassisTypeName                                 623   0.006983
GeoNameIdentifier                                      213   0.002387
Census_PowerPlatformRoleName                            55   0.000616
OsBuildLab                                              21   0.000235
LocaleEnglishNameIdentifier                              0   0.000000
AvSigVersion                                             0   0.000000
OsPlatformSubRelease                                     0   0.000000
Processor                                                0   0.000000
OsVer                                                    0   0.000000
AppVersion                                               0   0.000000
  • PuaMode, Census_ProcessorClass, DefaultBrowsersIdentifier, Census_IsFlightingInternal and Census_InternalBatteryType each have more than 70% null values.

  • Let’s check how they are distributed with respect to the target.

  • Because the datasets are large, the plan is to compare the distributions on a sample of train (about 10%). In the line below the sampling call is commented out, so train_small currently refers to the full training set.

train_small = train # train.sample(frac=0.2).copy() # not small for now
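
If the sampling is switched on, a sketch along these lines could be used. The frame, the names train_demo and train_small_demo, and the random_state value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for train.
train_demo = pd.DataFrame({
    'feature': np.arange(1000),
    'HasDetections': np.arange(1000) % 2,
})

# frac=0.1 keeps 10% of the rows; random_state makes the draw repeatable.
# .copy() gives an independent frame, so later edits do not touch train_demo.
train_small_demo = train_demo.sample(frac=0.1, random_state=42).copy()

# Build masks from the sample itself so the index always lines up.
mask = train_small_demo['HasDetections'] == 1
print(len(train_small_demo), int(mask.sum()))
```

Without a copy, as in `train_small = train`, both names point to the same DataFrame, so any column that is overwritten or dropped through one name also changes under the other.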
3.3.1 PuaMode

In [9]:

print(train_small['PuaMode'].dtypes)
category

In [10]:

# sns.countplot() draws a bar chart of the number of rows per category
sns.countplot(x='PuaMode', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of PuaMode, split by HasDetections]

There is some difference between the two classes, but very few rows have a value at all, so remove this feature.

3.3.2 Census_ProcessorClass
print(train_small['Census_ProcessorClass'].dtypes)
category

In [12]:

sns.countplot(x='Census_ProcessorClass', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of Census_ProcessorClass, split by HasDetections]

  • The author reads ‘Census_ProcessorClass’ as ‘Number of logical cores in the processor’. In the competition's data description that wording belongs to Census_ProcessorCoreCount; Census_ProcessorClass is a categorical processor tier, which matches the category dtype printed above.

  • The author's takeaway from the plot is that the higher the value (read as more logical cores), the more probable a malware infection.

  • This feature could be useful on its own or as a component in combinations with other features. Keep it and think about it!

3.3.3 DefaultBrowsersIdentifier
print(train_small['DefaultBrowsersIdentifier'].dtypes)
float16

In [14]:

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
# Left panel: kernel density estimate per target class.
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'], ax=ax[0], label='HasDetection(1)')

# Right panel: histogram per target class.
train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])

plt.show()

[Figure: KDE plot and histogram of DefaultBrowsersIdentifier, split by HasDetections]

  • DefaultBrowsersIdentifier means ‘ID for the machine’s default browser’.

  • Is this feature meaningful?

3.3.4 Census_IsFlightingInternal

In [15]:

print(train_small['Census_IsFlightingInternal'].dtypes)
float16

In [16]:

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
# Left panel: kernel density estimate per target class.
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'], ax=ax[0], label='HasDetection(1)')

# Right panel: histogram per target class.
train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])

plt.show()

[Figure: KDE plot and histogram of Census_IsFlightingInternal, split by HasDetections]

In [17]:

train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].value_counts()

Out[17]:

0.0    737583
1.0        13
Name: Census_IsFlightingInternal, dtype: int64

In [18]:

train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].value_counts()

Out[18]:

0.0    775120
1.0         8
Name: Census_IsFlightingInternal, dtype: int64
  • As you can see, almost every non-null value of ‘Census_IsFlightingInternal’ is 0.0 (only 21 rows have 1.0), so just remove it.
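
The two value_counts() calls above can be combined into a single table. A sketch with pd.crosstab on made-up data (the column name flag is hypothetical); rows where the feature is missing are left out of the table:

```python
import pandas as pd

# Made-up binary feature with one missing value, plus the target.
df = pd.DataFrame({
    'flag': [0.0, 0.0, 0.0, 1.0, 0.0, None, 0.0, 1.0],
    'HasDetections': [0, 1, 1, 1, 0, 0, 1, 0],
})

# Rows: feature value, columns: target class, cells: number of machines.
table = pd.crosstab(df['flag'], df['HasDetections'])
print(table)
```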

3.3.5 Census_InternalBatteryType
print(train_small['Census_InternalBatteryType'].dtypes)
category

In [20]:

train_small['Census_InternalBatteryType'].value_counts()

Out[20]:

lion        2028256
li-i         245617
#            183998
lip           62099
liio          32635
li p           8383
li             6708
nimh           4614
real           2744
bq20           2302
pbac           2274
vbox           1454
unkn            533
lgi0            399
lipo            198
lhp0            182
4cel            170
lipp             83
ithi             79
batt             60
ram              35
bad              33
virt             33
pad0             22
lit              16
ca48             16
a132             10
ots0              9
lai0              8
ÿÿÿÿ              8
             ...   
ion              1
pbso              1
3500              1
6ion              1
@i              1
li               1
sams              1
ip               1
8                 1
#TAB#             1
l&#TAB#          1
lio              1
˙˙˙              1
l                1
cl53              1
liÿÿ              1
pa50              1
í-i              1
÷ÿóö              1
li-l              1
h4°s              1
d                 1
lgl0              1
4ion              1
0ts0              1
sail              1
p-sn              1
a130              1
2337              1
lÿÿÿ              1
Name: Census_InternalBatteryType, Length: 78, dtype: int64
  • I think this feature is the battery type of each machine.

  • Oh, no… These days, most batteries are lithium-ion batteries.

  • So, let’s group them into a lithium-battery group and a non-lithium-battery group.

In [21]:

def group_battery(x):
    x = x.lower()
    if 'li' in x:
        return 1
    else:
        return 0
    
train_small['Census_InternalBatteryType'] = train_small['Census_InternalBatteryType'].apply(group_battery)
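
About 71% of this column is missing, and calling .lower() on a missing value would fail if the function ever receives one. A defensive variant is sketched below; the name group_battery_safe and the sample values are illustrative, and the 'li' substring test is the same heuristic as above.

```python
import numpy as np
import pandas as pd

def group_battery_safe(value):
    """Return 1 for lithium-like labels, 0 for other labels, NaN for missing."""
    if not isinstance(value, str):
        return np.nan  # keep missing values missing instead of raising
    return 1 if 'li' in value.lower() else 0

# Made-up labels, including one missing entry.
battery = pd.Series(['lion', 'Li-i', '#', 'nimh', None, 'lip'], dtype='object')
grouped = battery.map(group_battery_safe)
print(grouped)
```

Keeping missing values as NaN means the later count plot only compares machines that actually report a battery type.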

In [22]:

sns.countplot(x='Census_InternalBatteryType', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of the grouped Census_InternalBatteryType, split by HasDetections]

  • The difference is quite small. Do you think that some malware recognizes and selects machines based on the type of battery?

  • The battery is a very important part of a machine's life, but I think malware will focus on other hardware and software parts of the machine. Remove this feature.

null_cols_to_remove = ['DefaultBrowsersIdentifier', 'PuaMode',
                       'Census_IsFlightingInternal', 'Census_InternalBatteryType']

train.drop(null_cols_to_remove, axis=1, inplace=True)
test.drop(null_cols_to_remove, axis=1, inplace=True)

4.Exploratory data analysis

4.1Categorical features
categorical_features = [
        'ProductName',                                          
        'EngineVersion',                                        
        'AppVersion',                                           
        'AvSigVersion',                                         
        'Platform',                                             
        'Processor',                                            
        'OsVer',                                                
        'OsPlatformSubRelease',                                 
        'OsBuildLab',                                           
        'SkuEdition',                                           
        'SmartScreen',                                          
        'Census_MDC2FormFactor',                                
        'Census_DeviceFamily',                                  
        'Census_PrimaryDiskTypeName',                           
        'Census_ChassisTypeName',                               
        'Census_PowerPlatformRoleName',                         
        'Census_OSVersion',                                     
        'Census_OSArchitecture',                                
        'Census_OSBranch',                                      
        'Census_OSEdition',                                     
        'Census_OSSkuName',                                     
        'Census_OSInstallTypeName',                             
        'Census_OSWUAutoUpdateOptionsName',                     
        'Census_GenuineStateName',                              
        'Census_ActivationChannel',                             
        'Census_FlightRing',                                    
]
def plot_category_percent_of_target(col):
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))
    # share of rows with HasDetections == 1 for each category of this feature
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    sns.barplot(ax=ax, x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])

    for i, p in enumerate(ax.patches):
        ax.annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)

    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.show()
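
The table behind this plot can be inspected without drawing anything. A minimal sketch on made-up rows, using one groupby with named aggregation instead of the groupby plus merge above; the sample values are illustrative only.

```python
import pandas as pd

# Made-up rows standing in for train_small.
df = pd.DataFrame({
    'ProductName': ['win8defender'] * 4 + ['mse'] * 2,
    'HasDetections': [1, 0, 1, 1, 0, 1],
})

summary = (
    df.groupby('ProductName')['HasDetections']
      .agg(rate='mean', count='size')       # detection rate and row count
      .sort_values('count', ascending=False)
      .head(20)                             # keep the 20 most frequent categories
      .reset_index()
)
print(summary)
```

Note that the rate is a fraction between 0 and 1, not a percentage, even though the plot's x label reads "% of HasDetections".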
4.1.1 ProductName - Defender state information e.g. win8defender
col = categorical_features[0]
plot_category_percent_of_target(col)

[Figure: detection rate for the most frequent ProductName values, annotated with counts]

4.1.2 EngineVersion - Defender state information e.g. 1.1.12603.0
col = categorical_features[1]
plot_category_percent_of_target(col)

[Figure: detection rate for the most frequent EngineVersion values, annotated with counts]

The remaining categorical features follow the same pattern and are plotted the same way.

4.2 numeric features

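The author defines two kinds of plots: one is the same bar plot used for the categorical features above, the other is a kdeplot (kernel density estimate plot).

kdeplot (kernel density estimate plot)

Kernel density estimation is used in probability theory to estimate an unknown density function; it is a nonparametric estimation method. A kernel density plot gives a fairly intuitive view of how the data sample itself is distributed.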
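
To make the idea concrete, here is a hand-written Gaussian kernel density estimate in NumPy. It is a sketch of the general technique, not the code seaborn runs internally; the function name gaussian_kde_1d, the fixed bandwidth and the random sample are assumptions for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)  # made-up data
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth follows the sample more closely and gives a bumpier curve; a larger one smooths more. Plotting libraries usually pick the bandwidth automatically.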

numeric_features = [
        'IsBeta',                                               
        'RtpStateBitfield',                                     
        'IsSxsPassiveMode',                                     
        'DefaultBrowsersIdentifier',  # dropped in section 3.3.5, so skip it when plotting
        'AVProductStatesIdentifier',                            
        'AVProductsInstalled',                                  
        'AVProductsEnabled',                                    
        'HasTpm',                                               
        'CountryIdentifier',                                    
        'CityIdentifier',                                       
        'OrganizationIdentifier',                               
        'GeoNameIdentifier',                                    
        'LocaleEnglishNameIdentifier',                          
        'OsBuild',                                              
        'OsSuite',                                              
        'IsProtected',                                          
        'AutoSampleOptIn',                                      
        'SMode',                                                
        'IeVerIdentifier',                                      
        'Firewall',                                             
        'UacLuaenable',                                         
        'Census_OEMNameIdentifier',                             
        'Census_OEMModelIdentifier',                            
        'Census_ProcessorCoreCount',                            
        'Census_ProcessorManufacturerIdentifier',               
        'Census_ProcessorModelIdentifier',                      
        'Census_PrimaryDiskTotalCapacity',                      
        'Census_SystemVolumeTotalCapacity',                     
        'Census_HasOpticalDiskDrive',                           
        'Census_TotalPhysicalRAM',                              
        'Census_InternalPrimaryDiagonalDisplaySizeInInches',    
        'Census_InternalPrimaryDisplayResolutionHorizontal',    
        'Census_InternalPrimaryDisplayResolutionVertical',      
        'Census_InternalBatteryNumberOfCharges',                
        'Census_OSBuildNumber',                                 
        'Census_OSBuildRevision',                               
        'Census_OSInstallLanguageIdentifier',                   
        'Census_OSUILocaleIdentifier',                          
        'Census_IsPortableOperatingSystem',                     
        'Census_IsFlightsDisabled',                             
        'Census_ThresholdOptIn',                                
        'Census_FirmwareManufacturerIdentifier',                
        'Census_FirmwareVersionIdentifier',                     
        'Census_IsSecureBootEnabled',                           
        'Census_IsWIMBootEnabled',                              
        'Census_IsVirtualDevice',                               
        'Census_IsTouchEnabled',                                
        'Census_IsPenCapable',                                  
        'Census_IsAlwaysOnAlwaysConnectedCapable',              
        'Wdft_IsGamer',                                         
        'Wdft_RegionIdentifier',                                
]
def plot_category_percent_of_target_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(20, 8))
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    cat_percent[col] = cat_percent[col].astype('category')
    sns.barplot(ax=ax[0], x='HasDetections', y=col, data=cat_percent,  order=cat_percent[col])

    for i, p in enumerate(ax[0].patches):
        ax[0].annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)

    ax[0].set_title('Barplot sorted by count', fontsize=20)

    sns.barplot(ax=ax[1], x='HasDetections', y=col, data=cat_percent)
    # Annotate each bar of the right-hand panel with its row count.
    for i, p in enumerate(ax[1].patches):
        ax[1].annotate('{}'.format(cat_percent['count'].sort_index().values[i]), (0, p.get_y()+0.6), fontsize=20)
    ax[1].set_title('Barplot sorted by index', fontsize=20)

    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.subplots_adjust(wspace=0.5, hspace=0)
    plt.show()

def plot_kde_hist_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, col], ax=ax[0], label='NoDetection(0)')
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, col], ax=ax[0], label='HasDetection(1)')

    train_small.loc[train_small['HasDetections'] == 0, col].hist(ax=ax[1], bins=100)
    train_small.loc[train_small['HasDetections'] == 1, col].hist(ax=ax[1], bins=100)

    plt.suptitle(col, fontsize=30)
    ax[0].set_yscale('log')
    ax[0].set_title('KDE plot')
    
    ax[1].set_title('Histogram')
    ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
    ax[1].set_yscale('log')
    plt.show()

4.2.1 IsBeta - Defender state information e.g. false

col = numeric_features[0]

plot_kde_hist_for_numeric(col)
plot_category_percent_of_target_for_numeric(col)