My EDA - I want to see all!

绝不挂科

Article link: https://blog.csdn.net/weixin_43866317/article/details/98055858

1. Summary of article content

Data visualization

The main purpose of this post is to get a clear look at the data and the relationships within it. It is purely data visualization.

2. Modules

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
import warnings
import gc
import time
import sys
import datetime
from sklearn.metrics import mean_squared_error
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
from sklearn import metrics

plt.style.use('seaborn')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8'
sns.set(font_scale=2)
pd.set_option('display.max_columns', 500)

3. Read and check the dataset

3.1 Read dataset

This part was taken from a helpful kernel: https://www.kaggle.com/theoviel/load-the-totality-of-the-data

%time train = pd.read_csv("../input/train.csv", dtype=dtypes)
%time test = pd.read_csv("../input/test.csv", dtype=dtypes)

CPU times: user 2min 50s, sys: 19.3 s, total: 3min 10s
Wall time: 3min 11s
CPU times: user 2min 33s, sys: 10.7 s, total: 2min 44s
Wall time: 2min 44s

In [4]:

print(train.shape, test.shape)

(8921483, 83) (7853253, 82)

You can see that the datasets are large.

3.2 Check the target

In:

train['HasDetections'].value_counts().plot.bar()
plt.title('HasDetections(target)')

out:

Text(0.5,1,'HasDetections(target)')

[Figure: bar plot of HasDetections value counts]

Wow, a very well-balanced target!
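
The balance can also be read off as numbers rather than from a bar plot. A minimal sketch on made-up labels (the `target` Series here is a stand-in for `train['HasDetections']`, not the real data):

```python
import pandas as pd

# Made-up labels standing in for train['HasDetections'].
target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='HasDetections')

# normalize=True returns the share of each class instead of raw counts.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```

Shares close to 0.5 for both classes correspond to the balanced bars in the plot.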

3.3 Check the dataset

%%time
# checking missing data
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

CPU times: user 46.2 s, sys: 7.41 s, total: 53.6 s
Wall time: 53.4 s
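
The same summary can be written a little more compactly, since `isnull().mean()` is the fraction of missing rows per column. A minimal sketch on a made-up frame (`df` and the 70% cutoff are illustrative assumptions, not part of the original kernel):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for train.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() equals isnull().sum() / isnull().count().
missing = pd.DataFrame({
    'Total': df.isnull().sum(),
    'Percent': df.isnull().mean() * 100,
}).sort_values('Percent', ascending=False)

# Columns above a hypothetical 70% missing threshold.
mostly_null = missing.index[missing['Percent'] > 70].tolist()
print(missing)
print(mostly_null)
```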

In [7]:

missing_train_data.head(50)

Out[7]:

	Total	Percent
PuaMode	8919174	99.974119
Census_ProcessorClass	8884852	99.589407
DefaultBrowsersIdentifier	8488045	95.141637
Census_IsFlightingInternal	7408759	83.044030
Census_InternalBatteryType	6338429	71.046809
Census_ThresholdOptIn	5667325	63.524472
Census_IsWIMBootEnabled	5659703	63.439038
SmartScreen	3177011	35.610795
OrganizationIdentifier	2751518	30.841487
SMode	537759	6.027686
CityIdentifier	325409	3.647477
Wdft_IsGamer	303451	3.401352
Wdft_RegionIdentifier	303451	3.401352
Census_InternalBatteryNumberOfCharges	268755	3.012448
Census_FirmwareManufacturerIdentifier	183257	2.054109
Census_IsFlightsDisabled	160523	1.799286
Census_FirmwareVersionIdentifier	160133	1.794915
Census_OEMModelIdentifier	102233	1.145919
Census_OEMNameIdentifier	95478	1.070203
Firewall	91350	1.023933
Census_TotalPhysicalRAM	80533	0.902686
Census_IsAlwaysOnAlwaysConnectedCapable	71343	0.799676
Census_OSInstallLanguageIdentifier	60084	0.673475
IeVerIdentifier	58894	0.660137
Census_PrimaryDiskTotalCapacity	53016	0.594251
Census_SystemVolumeTotalCapacity	53002	0.594094
Census_InternalPrimaryDiagonalDisplaySizeInInches	47134	0.528320
Census_InternalPrimaryDisplayResolutionHorizontal	46986	0.526661
Census_InternalPrimaryDisplayResolutionVertical	46986	0.526661
Census_ProcessorModelIdentifier	41343	0.463410
Census_ProcessorManufacturerIdentifier	41313	0.463073
Census_ProcessorCoreCount	41306	0.462995
AVProductsEnabled	36221	0.405998
AVProductsInstalled	36221	0.405998
AVProductStatesIdentifier	36221	0.405998
IsProtected	36044	0.404014
RtpStateBitfield	32318	0.362249
Census_IsVirtualDevice	15953	0.178816
Census_PrimaryDiskTypeName	12844	0.143967
UacLuaenable	10838	0.121482
Census_ChassisTypeName	623	0.006983
GeoNameIdentifier	213	0.002387
Census_PowerPlatformRoleName	55	0.000616
OsBuildLab	21	0.000235
LocaleEnglishNameIdentifier	0	0.000000
AvSigVersion	0	0.000000
OsPlatformSubRelease	0	0.000000
Processor	0	0.000000
OsVer	0	0.000000
AppVersion	0	0.000000

PuaMode, Census_ProcessorClass, DefaultBrowsersIdentifier, Census_IsFlightingInternal and Census_InternalBatteryType each have over 70% null data.

Let's check their distributions with respect to the target.

Because the datasets are large, a sample of train could be used to compare the distributions. For now the sampling call is commented out and the full training set is used.

train_small = train # train.sample(frac=0.2).copy() # not small for now
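
If the full frame is too slow to plot, the commented-out sampling can be switched on. A minimal sketch on a made-up frame, assuming a 10% sample and an arbitrary `random_state` (the names `train_demo` and `train_small_demo` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for train.
rng = np.random.default_rng(0)
train_demo = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'HasDetections': rng.integers(0, 2, size=1000),
})

# Draw 10% of the rows; random_state makes the draw repeatable.
train_small_demo = train_demo.sample(frac=0.1, random_state=42)

# Build boolean masks from the sample itself so the indexes line up.
detected = train_small_demo.loc[train_small_demo['HasDetections'] == 1, 'feature']
print(len(train_small_demo), len(detected))
```

Once a real sample is used, any row mask has to come from the sampled frame and not from the full one, otherwise the indexes no longer match.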

3.3.1 PuaMode

In [9]:

print(train_small['PuaMode'].dtypes)

category

In [10]:

# sns.countplot() draws a bar chart of the counts per category
sns.countplot(x='PuaMode', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of PuaMode by HasDetections]

Some difference exists there, but the non-null samples are very few, so remove this feature.

3.3.2 Census_ProcessorClass

print(train_small['Census_ProcessorClass'].dtypes)

category

In [12]:

sns.countplot(x='Census_ProcessorClass', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of Census_ProcessorClass by HasDetections]

According to the competition's data description, 'Census_ProcessorClass' is a classification of processors into high/medium/low. The description 'Number of logical cores in the processor' belongs to 'Census_ProcessorCoreCount'.

Reading the plot with that in mind, the higher the processor class, the more probable an infection with malware appears to be. Keep in mind that this column is about 99.6% null, so the pattern rests on few samples.

This feature could be useful on its own or as a component in combinations with other features. Keep it and think about it!

3.3.3 DefaultBrowsersIdentifier

print(train_small['DefaultBrowsersIdentifier'].dtypes)

float16

In [14]:

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
# Masks are built from train_small itself so they stay aligned if it becomes a sample
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'], ax=ax[0], label='HasDetection(1)')

train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])

plt.show()

[Figure: KDE plot and histogram of DefaultBrowsersIdentifier by HasDetections]

DefaultBrowsersIdentifier means 'ID for the machine's default browser'.

Is this feature meaningful?

3.3.4 Census_IsFlightingInternal

In [15]:

print(train_small['Census_IsFlightingInternal'].dtypes)

float16

In [16]:

fig, ax = plt.subplots(1, 2, figsize=(16, 8))
# Masks are built from train_small itself so they stay aligned if it becomes a sample
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'], ax=ax[0], label='HasDetection(1)')

train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])

plt.show()

[Figure: KDE plot and histogram of Census_IsFlightingInternal by HasDetections]

train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].value_counts()

Out[17]:

0.0    737583
1.0        13
Name: Census_IsFlightingInternal, dtype: int64

In [18]:

train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].value_counts()

Out[18]:

0.0    775120
1.0         8
Name: Census_IsFlightingInternal, dtype: int64

As you can see, almost every non-null value of 'Census_IsFlightingInternal' is 0.0, so just remove it.

3.3.5 Census_InternalBatteryType

print(train_small['Census_InternalBatteryType'].dtypes)

category

In [20]:

train_small['Census_InternalBatteryType'].value_counts()

Out[20]:

lion        2028256
li-i         245617
#            183998
lip           62099
liio          32635
li p           8383
li             6708
nimh           4614
real           2744
bq20           2302
pbac           2274
vbox           1454
unkn            533
lgi0            399
lipo            198
lhp0            182
4cel            170
lipp             83
ithi             79
batt             60
ram              35
bad              33
virt             33
pad0             22
lit              16
ca48             16
a132             10
ots0              9
lai0              8
ÿÿÿÿ              8
             ...   
ion              1
pbso              1
3500              1
6ion              1
@i              1
li               1
sams              1
ip               1
8                 1
#TAB#             1
l&#TAB#          1
lio              1
˙˙˙              1
l                1
cl53              1
liÿÿ              1
pa50              1
í-i              1
÷ÿóö              1
li-l              1
h4°s              1
d                 1
lgl0              1
4ion              1
0ts0              1
sail              1
p-sn              1
a130              1
2337              1
lÿÿÿ              1
Name: Census_InternalBatteryType, Length: 78, dtype: int64

I think this feature is the battery type of each machine.

These days, most batteries are lithium-ion batteries.

So, let's group them into a lithium battery group and a non-lithium battery group.

In [21]:

def group_battery(x):
    x = x.lower()
    if 'li' in x:
        return 1
    else:
        return 0
    
train_small['Census_InternalBatteryType'] = train_small['Census_InternalBatteryType'].apply(group_battery)
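
To see what the substring rule does, here is a small sketch on made-up labels. The function name `group_battery_safe` and its guard for missing values are additions for plain object columns and are not part of the original function:

```python
import numpy as np
import pandas as pd

def group_battery_safe(x):
    """Return 1 if the label contains 'li', else 0; missing values stay missing."""
    if not isinstance(x, str):
        return np.nan
    return 1 if 'li' in x.lower() else 0

# Made-up labels in the style of the value counts above.
labels = pd.Series(['lion', 'Li-I', 'nimh', 'vbox', np.nan, 'real'])
grouped = labels.apply(group_battery_safe)
print(grouped.tolist())
```

Note that a plain substring test also puts any non-lithium label that happens to contain 'li' into the lithium group, so the grouping is only approximate.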

In [22]:

sns.countplot(x='Census_InternalBatteryType', hue='HasDetections',data=train_small)
plt.show()

[Figure: count plot of the grouped Census_InternalBatteryType by HasDetections]

The difference is quite small. Do you think that malware recognizes and selects machines based on the type of battery?

The battery is a very important part for the life of a machine, but I think malware will focus on other hardware and software parts of the machine. Remove this feature.

null_cols_to_remove = ['DefaultBrowsersIdentifier', 'PuaMode',
                       'Census_IsFlightingInternal', 'Census_InternalBatteryType']

train.drop(null_cols_to_remove, axis=1, inplace=True)
test.drop(null_cols_to_remove, axis=1, inplace=True)

4. Exploratory data analysis

4.1 Categorical features

categorical_features = [
        'ProductName',                                          
        'EngineVersion',                                        
        'AppVersion',                                           
        'AvSigVersion',                                         
        'Platform',                                             
        'Processor',                                            
        'OsVer',                                                
        'OsPlatformSubRelease',                                 
        'OsBuildLab',                                           
        'SkuEdition',                                           
        'SmartScreen',                                          
        'Census_MDC2FormFactor',                                
        'Census_DeviceFamily',                                  
        'Census_PrimaryDiskTypeName',                           
        'Census_ChassisTypeName',                               
        'Census_PowerPlatformRoleName',                         
        'Census_OSVersion',                                     
        'Census_OSArchitecture',                                
        'Census_OSBranch',                                      
        'Census_OSEdition',                                     
        'Census_OSSkuName',                                     
        'Census_OSInstallTypeName',                             
        'Census_OSWUAutoUpdateOptionsName',                     
        'Census_GenuineStateName',                              
        'Census_ActivationChannel',                             
        'Census_FlightRing',                                    
]

def plot_category_percent_of_target(col):
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))
    # Share of rows with HasDetections == 1 within each category of this feature
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    sns.barplot(ax=ax, x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])

    for i, p in enumerate(ax.patches):
        ax.annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)

    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.show()

4.1.1 ProductName - Defender state information, e.g. win8defender

col = categorical_features[0]
plot_category_percent_of_target(col)

[Figure: detection rate per ProductName category, annotated with counts]

4.1.2 EngineVersion - Defender state information, e.g. 1.1.12603.0

col = categorical_features[1]
plot_category_percent_of_target(col)

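[Figure: detection rate per EngineVersion category, annotated with counts]

The remaining categorical features are plotted in the same way.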
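
The numbers behind each of these bar plots can also be inspected as a table without plotting. A minimal sketch on a made-up frame (the helper `category_rate_table` and the `demo` data are hypothetical, not from the original kernel):

```python
import pandas as pd

# Made-up frame standing in for train_small.
demo = pd.DataFrame({
    'ProductName': ['win8defender'] * 4 + ['mse'] * 2,
    'HasDetections': [1, 0, 1, 1, 0, 1],
})

def category_rate_table(df, col, target='HasDetections', top=20):
    """Detection rate and row count per category, largest categories first."""
    table = df.groupby(col)[target].agg(rate='mean', count='size')
    return table.sort_values('count', ascending=False).head(top)

table = category_rate_table(demo, 'ProductName')
print(table)
```

Looping over `categorical_features` and calling `plot_category_percent_of_target(col)` for each name produces the full set of plots.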

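4.2 Numeric features

The author defines two kinds of views: one is the same as the view used for the categorical features above, and the other is a kdeplot (kernel density estimate plot).

kdeplot (kernel density estimate plot)

Kernel density estimation is used in probability theory to estimate an unknown density function, and it is a non-parametric method. A KDE plot gives a fairly intuitive view of how the data sample itself is distributed.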
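
For n samples x_i and bandwidth h, the estimate at a point x is the average of kernel bumps centred on the samples: f(x) = 1/(n*h) * sum_i K((x - x_i)/h). A minimal NumPy sketch with a Gaussian kernel follows; the fixed bandwidth and the sample values are assumptions for illustration, whereas sns.kdeplot picks a bandwidth automatically:

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

samples = np.array([0.0, 0.2, 0.9, 1.0, 1.1])  # made-up data
grid = np.linspace(-4.0, 5.0, 901)
density = gaussian_kde_1d(samples, grid, bandwidth=0.4)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth follows the samples more closely, while a larger one smooths them out.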

numeric_features = [
        'IsBeta',                                               
        'RtpStateBitfield',                                     
        'IsSxsPassiveMode',                                     
        'DefaultBrowsersIdentifier',                            
        'AVProductStatesIdentifier',                            
        'AVProductsInstalled',                                  
        'AVProductsEnabled',                                    
        'HasTpm',                                               
        'CountryIdentifier',                                    
        'CityIdentifier',                                       
        'OrganizationIdentifier',                               
        'GeoNameIdentifier',                                    
        'LocaleEnglishNameIdentifier',                          
        'OsBuild',                                              
        'OsSuite',                                              
        'IsProtected',                                          
        'AutoSampleOptIn',                                      
        'SMode',                                                
        'IeVerIdentifier',                                      
        'Firewall',                                             
        'UacLuaenable',                                         
        'Census_OEMNameIdentifier',                             
        'Census_OEMModelIdentifier',                            
        'Census_ProcessorCoreCount',                            
        'Census_ProcessorManufacturerIdentifier',               
        'Census_ProcessorModelIdentifier',                      
        'Census_PrimaryDiskTotalCapacity',                      
        'Census_SystemVolumeTotalCapacity',                     
        'Census_HasOpticalDiskDrive',                           
        'Census_TotalPhysicalRAM',                              
        'Census_InternalPrimaryDiagonalDisplaySizeInInches',    
        'Census_InternalPrimaryDisplayResolutionHorizontal',    
        'Census_InternalPrimaryDisplayResolutionVertical',      
        'Census_InternalBatteryNumberOfCharges',                
        'Census_OSBuildNumber',                                 
        'Census_OSBuildRevision',                               
        'Census_OSInstallLanguageIdentifier',                   
        'Census_OSUILocaleIdentifier',                          
        'Census_IsPortableOperatingSystem',                     
        'Census_IsFlightsDisabled',                             
        'Census_ThresholdOptIn',                                
        'Census_FirmwareManufacturerIdentifier',                
        'Census_FirmwareVersionIdentifier',                     
        'Census_IsSecureBootEnabled',                           
        'Census_IsWIMBootEnabled',                              
        'Census_IsVirtualDevice',                               
        'Census_IsTouchEnabled',                                
        'Census_IsPenCapable',                                  
        'Census_IsAlwaysOnAlwaysConnectedCapable',              
        'Wdft_IsGamer',                                         
        'Wdft_RegionIdentifier',                                
]

def plot_category_percent_of_target_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(20, 8))
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    cat_percent[col] = cat_percent[col].astype('category')
    sns.barplot(ax=ax[0], x='HasDetections', y=col, data=cat_percent,  order=cat_percent[col])

    for i, p in enumerate(ax[0].patches):
        ax[0].annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)

    ax[0].set_title('Barplot sorted by count', fontsize=20)

    sns.barplot(ax=ax[1], x='HasDetections', y=col, data=cat_percent)
    for i, p in enumerate(ax[1].patches):
        ax[1].annotate('{}'.format(cat_percent['count'].sort_index().values[i]), (0, p.get_y()+0.6), fontsize=20)
    ax[1].set_title('Barplot sorted by index', fontsize=20)

    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.subplots_adjust(wspace=0.5, hspace=0)
    plt.show()

def plot_kde_hist_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    # Masks are built from train_small itself so they stay aligned if it becomes a sample
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, col], ax=ax[0], label='NoDetection(0)')
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, col], ax=ax[0], label='HasDetection(1)')

    train_small.loc[train_small['HasDetections'] == 0, col].hist(ax=ax[1], bins=100)
    train_small.loc[train_small['HasDetections'] == 1, col].hist(ax=ax[1], bins=100)

    plt.suptitle(col, fontsize=30)
    ax[0].set_yscale('log')
    ax[0].set_title('KDE plot')
    
    ax[1].set_title('Histogram')
    ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
    ax[1].set_yscale('log')
    plt.show()

4.2.1 IsBeta - Defender state information e.g. false

col = numeric_features[0]

plot_kde_hist_for_numeric(col)
plot_category_percent_of_target_for_numeric(col)