title:My EDA - I want to see all!
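1. Summary of article content
Data visualization
The main goal of this article is to get a clear look at the data and the relationships within it; it is purely an exercise in data visualization.
2. Modules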
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import lightgbm as lgb
from sklearn.model_selection import KFold
import warnings
import gc
import time
import sys
import datetime
from sklearn.metrics import mean_squared_error
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')
from sklearn import metrics
plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in matplotlib 3.6 and later
sns.set(font_scale=2)
pd.set_option('display.max_columns', 500)
3. Read and check the dataset
3.1 Read dataset
- This part was taken from a helpful kernel: https://www.kaggle.com/theoviel/load-the-totality-of-the-data
%time train = pd.read_csv("../input/train.csv", dtype=dtypes)
%time test = pd.read_csv("../input/test.csv", dtype=dtypes)
CPU times: user 2min 50s, sys: 19.3 s, total: 3min 10s
Wall time: 3min 11s
CPU times: user 2min 33s, sys: 10.7 s, total: 2min 44s
Wall time: 2min 44s
In [4]:
print(train.shape, test.shape)
(8921483, 83) (7853253, 82)
- You can see that the datasets are large.
3.2 Check the target
In:
train['HasDetections'].value_counts().plot.bar()
plt.title('HasDetections(target)')
out:
Text(0.5,1,'HasDetections(target)')
- Wow, a very well-balanced target!
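To put a number on the balance, the class shares can be printed alongside the bar chart. A minimal sketch on a made-up stand-in Series (on the real data the same call would be made on `train['HasDetections']`):

```python
import pandas as pd

# Made-up stand-in for train['HasDetections'].
target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name='HasDetections')

# normalize=True returns fractions instead of raw counts.
shares = target.value_counts(normalize=True).sort_index()
print(shares)
```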
3.3 Check the dataset
%%time
# checking missing data
total = train.isnull().sum().sort_values(ascending = False)
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
CPU times: user 46.2 s, sys: 7.41 s, total: 53.6 s
Wall time: 53.4 s
In [7]:
missing_train_data.head(50)
Out[7]:
Feature | Total | Percent |
---|---|---|
PuaMode | 8919174 | 99.974119 |
Census_ProcessorClass | 8884852 | 99.589407 |
DefaultBrowsersIdentifier | 8488045 | 95.141637 |
Census_IsFlightingInternal | 7408759 | 83.044030 |
Census_InternalBatteryType | 6338429 | 71.046809 |
Census_ThresholdOptIn | 5667325 | 63.524472 |
Census_IsWIMBootEnabled | 5659703 | 63.439038 |
SmartScreen | 3177011 | 35.610795 |
OrganizationIdentifier | 2751518 | 30.841487 |
SMode | 537759 | 6.027686 |
CityIdentifier | 325409 | 3.647477 |
Wdft_IsGamer | 303451 | 3.401352 |
Wdft_RegionIdentifier | 303451 | 3.401352 |
Census_InternalBatteryNumberOfCharges | 268755 | 3.012448 |
Census_FirmwareManufacturerIdentifier | 183257 | 2.054109 |
Census_IsFlightsDisabled | 160523 | 1.799286 |
Census_FirmwareVersionIdentifier | 160133 | 1.794915 |
Census_OEMModelIdentifier | 102233 | 1.145919 |
Census_OEMNameIdentifier | 95478 | 1.070203 |
Firewall | 91350 | 1.023933 |
Census_TotalPhysicalRAM | 80533 | 0.902686 |
Census_IsAlwaysOnAlwaysConnectedCapable | 71343 | 0.799676 |
Census_OSInstallLanguageIdentifier | 60084 | 0.673475 |
IeVerIdentifier | 58894 | 0.660137 |
Census_PrimaryDiskTotalCapacity | 53016 | 0.594251 |
Census_SystemVolumeTotalCapacity | 53002 | 0.594094 |
Census_InternalPrimaryDiagonalDisplaySizeInInches | 47134 | 0.528320 |
Census_InternalPrimaryDisplayResolutionHorizontal | 46986 | 0.526661 |
Census_InternalPrimaryDisplayResolutionVertical | 46986 | 0.526661 |
Census_ProcessorModelIdentifier | 41343 | 0.463410 |
Census_ProcessorManufacturerIdentifier | 41313 | 0.463073 |
Census_ProcessorCoreCount | 41306 | 0.462995 |
AVProductsEnabled | 36221 | 0.405998 |
AVProductsInstalled | 36221 | 0.405998 |
AVProductStatesIdentifier | 36221 | 0.405998 |
IsProtected | 36044 | 0.404014 |
RtpStateBitfield | 32318 | 0.362249 |
Census_IsVirtualDevice | 15953 | 0.178816 |
Census_PrimaryDiskTypeName | 12844 | 0.143967 |
UacLuaenable | 10838 | 0.121482 |
Census_ChassisTypeName | 623 | 0.006983 |
GeoNameIdentifier | 213 | 0.002387 |
Census_PowerPlatformRoleName | 55 | 0.000616 |
OsBuildLab | 21 | 0.000235 |
LocaleEnglishNameIdentifier | 0 | 0.000000 |
AvSigVersion | 0 | 0.000000 |
OsPlatformSubRelease | 0 | 0.000000 |
Processor | 0 | 0.000000 |
OsVer | 0 | 0.000000 |
AppVersion | 0 | 0.000000 |
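The table above can also be filtered programmatically instead of being read by eye. A minimal sketch, assuming a 70% cut-off and a made-up frame whose column names `a`, `b`, `c` are placeholders:

```python
import numpy as np
import pandas as pd

# Made-up stand-in frame; column names 'a', 'b', 'c' are hypothetical.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# isnull().mean() gives the null fraction per column directly.
percent = (df.isnull().mean() * 100).sort_values(ascending=False)

threshold = 70  # percent; an assumed cut-off
high_null_cols = percent[percent > threshold].index.tolist()
print(high_null_cols)
```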
- PuaMode, Census_ProcessorClass, DefaultBrowsersIdentifier, Census_IsFlightingInternal and Census_InternalBatteryType have over 70% null data.
- Let's check their distributions with respect to the target.
- Because the datasets are large, the idea is to compare the distributions on a sample of train (about 10%). Note that the line below currently keeps the full training set; the sampling call is commented out.
train_small = train # train.sample(frac=0.2).copy() # not small for now
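If the sampling call is switched on, the boolean masks used later should be built from the sample itself rather than from `train`, because the two frames no longer share the same index. A minimal sketch, assuming a fixed `random_state` for reproducibility (the seed and the toy frame are illustrative):

```python
import pandas as pd

# Made-up stand-in for train.
train_demo = pd.DataFrame({
    'HasDetections': [0, 1] * 50,
    'feature': range(100),
})

# Draw 10% of the rows; random_state makes the draw repeatable.
train_small_demo = train_demo.sample(frac=0.1, random_state=0).copy()

# Build the mask from the sample so its index matches the rows being selected.
mask = train_small_demo['HasDetections'] == 1
detected = train_small_demo.loc[mask, 'feature']
print(len(train_small_demo), len(detected))
```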
3.3.1 PuaMode
In [9]:
print(train_small['PuaMode'].dtypes)
category
In [10]:
# sns.countplot() draws a bar chart of the counts in each category
sns.countplot(x='PuaMode', hue='HasDetections',data=train_small)
plt.show()
Some difference exists, but there are very few non-null samples, so remove this feature.
3.3.2 Census_ProcessorClass
print(train_small['Census_ProcessorClass'].dtypes)
category
In [12]:
sns.countplot(x='Census_ProcessorClass', hue='HasDetections',data=train_small)
plt.show()
- 'Census_ProcessorClass' is given here as 'Number of logical cores in the processor', but in the competition's data description that phrase describes 'Census_ProcessorCoreCount'; 'Census_ProcessorClass' is a coarse classification of processors into high/medium/low.
- Read with that definition, the original observation (more logical cores, more infections) becomes: the higher the processor class, the more likely an infection with malware.
- This feature could be useful on its own or as a component in combinations with other features. Keep it and think about it!
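The comparison read off the count plot can also be expressed as a detection rate per class. A minimal sketch on made-up rows (the class labels and values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows; labels and outcomes are illustrative only.
demo = pd.DataFrame({
    'Census_ProcessorClass': ['low', 'low', 'mid', 'mid', 'high', 'high', 'high', 'low'],
    'HasDetections':         [0,     0,     1,     0,     1,      1,      0,      1],
})

# The mean of a 0/1 target within each group is the share of detections in that group.
rate = demo.groupby('Census_ProcessorClass')['HasDetections'].agg(['mean', 'count'])
print(rate)
```

Keeping the count next to the rate matters here: with over 99% of the column missing, a rate computed on a handful of rows is easy to over-read.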
3.3.3 DefaultBrowsersIdentifier
print(train_small['DefaultBrowsersIdentifier'].dtypes)
float16
In [14]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'], ax=ax[0], label='HasDetection(1)')
train_small.loc[train_small['HasDetections'] == 0, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'DefaultBrowsersIdentifier'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
plt.show()
- DefaultBrowsersIdentifier means 'ID for the machine's default browser'.
- Is this feature meaningful?
3.3.4 Census_IsFlightingInternal
In [15]:
print(train_small['Census_IsFlightingInternal'].dtypes)
float16
In [16]:
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'], ax=ax[0], label='NoDetection(0)')
sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'], ax=ax[0], label='HasDetection(1)')
train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].hist(ax=ax[1])
train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].hist(ax=ax[1])
ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
plt.show()
train_small.loc[train_small['HasDetections'] == 1, 'Census_IsFlightingInternal'].value_counts()
Out[17]:
0.0 737583
1.0 13
Name: Census_IsFlightingInternal, dtype: int64
In [18]:
train_small.loc[train_small['HasDetections'] == 0, 'Census_IsFlightingInternal'].value_counts()
Out[18]:
0.0 775120
1.0 8
Name: Census_IsFlightingInternal, dtype: int64
- As you can see, almost every value of 'Census_IsFlightingInternal' is 0.0. Just remove it.
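The two `value_counts()` calls can be combined into one table, and the share of the dominant value makes the "almost always 0.0" claim explicit. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the real columns.
demo = pd.DataFrame({
    'Census_IsFlightingInternal': [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    'HasDetections':              [1,   0,   1,   1,   0,   0],
})

# Rows: feature value, columns: target value, cells: counts.
table = pd.crosstab(demo['Census_IsFlightingInternal'], demo['HasDetections'])
print(table)

# Share of the most frequent non-null value.
dominant_share = demo['Census_IsFlightingInternal'].value_counts(normalize=True).iloc[0]
print(dominant_share)
```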
3.3.5 Census_InternalBatteryType
print(train_small['Census_InternalBatteryType'].dtypes)
category
In [20]:
train_small['Census_InternalBatteryType'].value_counts()
Out[20]:
lion 2028256
li-i 245617
# 183998
lip 62099
liio 32635
li p 8383
li 6708
nimh 4614
real 2744
bq20 2302
pbac 2274
vbox 1454
unkn 533
lgi0 399
lipo 198
lhp0 182
4cel 170
lipp 83
ithi 79
batt 60
ram 35
bad 33
virt 33
pad0 22
lit 16
ca48 16
a132 10
ots0 9
lai0 8
ÿÿÿÿ 8
...
ion 1
pbso 1
3500 1
6ion 1
@i 1
li 1
sams 1
ip 1
8 1
#TAB# 1
l&#TAB# 1
lio 1
˙˙˙ 1
l 1
cl53 1
liÿÿ 1
pa50 1
í-i 1
÷ÿóö 1
li-l 1
h4°s 1
d 1
lgl0 1
4ion 1
0ts0 1
sail 1
p-sn 1
a130 1
2337 1
lÿÿÿ 1
Name: Census_InternalBatteryType, Length: 78, dtype: int64
- I think this feature means the type of battery in each machine.
- Oh, no... These days, most batteries are lithium-ion batteries.
- So, let's group them into a lithium-battery group and a non-lithium-battery group.
In [21]:
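def group_battery(x):
    # Leave missing (non-string) values untouched.
    if not isinstance(x, str):
        return x
    # Any label containing 'li' (lion, li-i, lip, ...) counts as lithium.
    return 1 if 'li' in x.lower() else 0
train_small['Census_InternalBatteryType'] = train_small['Census_InternalBatteryType'].apply(group_battery)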
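The same grouping can be written with pandas string methods instead of a Python function. A minimal sketch on a made-up Series; note one assumption that differs from leaving gaps alone: `na=False` counts missing values as non-lithium.

```python
import numpy as np
import pandas as pd

# Made-up sample of battery labels, including a missing value and mixed case.
battery = pd.Series(['lion', 'li-i', '#', 'nimh', np.nan, 'LIP'])

# regex=False does a plain substring test; na=False maps missing values to False.
is_lithium = battery.str.lower().str.contains('li', na=False, regex=False).astype(int)
print(is_lithium.tolist())  # [1, 1, 0, 0, 0, 1]
```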
In [22]:
sns.countplot(x='Census_InternalBatteryType', hue='HasDetections',data=train_small)
plt.show()
- The difference is quite small. Do you think some malware recognizes and selects machines based on the type of battery?
- The battery is a very important part for the life of a machine, but I think malware will focus on other hardware and software parts of the machine. Remove this feature.
null_cols_to_remove = ['DefaultBrowsersIdentifier', 'PuaMode',
'Census_IsFlightingInternal', 'Census_InternalBatteryType']
train.drop(null_cols_to_remove, axis=1, inplace=True)
test.drop(null_cols_to_remove, axis=1, inplace=True)
4. Exploratory data analysis
4.1 Categorical features
categorical_features = [
'ProductName',
'EngineVersion',
'AppVersion',
'AvSigVersion',
'Platform',
'Processor',
'OsVer',
'OsPlatformSubRelease',
'OsBuildLab',
'SkuEdition',
'SmartScreen',
'Census_MDC2FormFactor',
'Census_DeviceFamily',
'Census_PrimaryDiskTypeName',
'Census_ChassisTypeName',
'Census_PowerPlatformRoleName',
'Census_OSVersion',
'Census_OSArchitecture',
'Census_OSBranch',
'Census_OSEdition',
'Census_OSSkuName',
'Census_OSInstallTypeName',
'Census_OSWUAutoUpdateOptionsName',
'Census_GenuineStateName',
'Census_ActivationChannel',
'Census_FlightRing',
]
def plot_category_percent_of_target(col):
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))
    # share of rows with HasDetections == 1 within each category of this feature
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    sns.barplot(ax=ax, x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])
    for i, p in enumerate(ax.patches):
        ax.annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)
    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.show()
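The table that `plot_category_percent_of_target` draws can be inspected without plotting. A minimal sketch of the same groupby and merge steps on made-up rows (the frame and its values are illustrative):

```python
import pandas as pd

# Made-up rows; 'Platform' is used only as an example column.
demo = pd.DataFrame({
    'Platform': ['windows10', 'windows10', 'windows10', 'windows8', 'windows8', 'windows7'],
    'HasDetections': [1, 0, 1, 0, 0, 1],
})
col = 'Platform'

# Share of rows with HasDetections == 1 for each category.
cat_percent = demo.groupby(col, as_index=False)['HasDetections'].mean()

# Number of rows per category, with explicit column names.
cat_size = demo[col].value_counts().rename_axis(col).reset_index(name='count')

# Attach the counts and keep the most frequent categories first.
cat_percent = cat_percent.merge(cat_size, on=col, how='left')
cat_percent = cat_percent.sort_values(by='count', ascending=False).head(20)
print(cat_percent)
```

The plotted values are fractions between 0 and 1, so the x-axis label "% of HasDetections" is best read as a share rather than a percentage.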
4.1.1 ProductName - Defender state information e.g. win8defender
col = categorical_features[0]
plot_category_percent_of_target(col)
4.1.2 EngineVersion - Defender state information e.g. 1.1.12603.0
col = categorical_features[1]
plot_category_percent_of_target(col)
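The remaining categorical features are plotted in the same way.
4.2 Numeric features
The author defines two plotting functions: one produces the same kind of view used for the categorical features above, and the other draws a kdeplot (kernel density estimate plot).
kdeplot (kernel density estimate plot)
Kernel density estimation is used in probability theory to estimate an unknown density function and is a nonparametric method. A kernel density plot gives a fairly intuitive view of how the sample itself is distributed.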
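For reference, a kernel density estimate at a point x averages kernels centered on the samples: f(x) = (1 / (n h)) * sum over i of K((x - x_i) / h). A minimal NumPy sketch of that formula, assuming a Gaussian kernel and a hand-picked bandwidth `h` (seaborn chooses its bandwidth automatically, so its curves will differ in detail; the function name is hypothetical):

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, h):
    """Evaluate a Gaussian KDE with bandwidth h at each point of grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Pairwise scaled distances, shape (len(grid), len(samples)).
    u = (grid[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    # Average over samples and rescale by the bandwidth.
    return kernel.sum(axis=1) / (len(samples) * h)

samples = np.array([0.0, 0.0, 1.0, 1.0, 1.0])  # made-up 0/1 feature values
grid = np.linspace(-3, 4, 701)
density = gaussian_kde_sketch(samples, grid, h=0.3)

# A density should integrate to roughly 1 over a wide enough grid.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

For a binary feature such as IsBeta the estimate is just two bumps around 0 and 1, which is why the bar-style view is often the more readable of the two.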
numeric_features = [
'IsBeta',
'RtpStateBitfield',
'IsSxsPassiveMode',
# 'DefaultBrowsersIdentifier' is left out here because it was dropped in section 3.3
'AVProductStatesIdentifier',
'AVProductsInstalled',
'AVProductsEnabled',
'HasTpm',
'CountryIdentifier',
'CityIdentifier',
'OrganizationIdentifier',
'GeoNameIdentifier',
'LocaleEnglishNameIdentifier',
'OsBuild',
'OsSuite',
'IsProtected',
'AutoSampleOptIn',
'SMode',
'IeVerIdentifier',
'Firewall',
'UacLuaenable',
'Census_OEMNameIdentifier',
'Census_OEMModelIdentifier',
'Census_ProcessorCoreCount',
'Census_ProcessorManufacturerIdentifier',
'Census_ProcessorModelIdentifier',
'Census_PrimaryDiskTotalCapacity',
'Census_SystemVolumeTotalCapacity',
'Census_HasOpticalDiskDrive',
'Census_TotalPhysicalRAM',
'Census_InternalPrimaryDiagonalDisplaySizeInInches',
'Census_InternalPrimaryDisplayResolutionHorizontal',
'Census_InternalPrimaryDisplayResolutionVertical',
'Census_InternalBatteryNumberOfCharges',
'Census_OSBuildNumber',
'Census_OSBuildRevision',
'Census_OSInstallLanguageIdentifier',
'Census_OSUILocaleIdentifier',
'Census_IsPortableOperatingSystem',
'Census_IsFlightsDisabled',
'Census_ThresholdOptIn',
'Census_FirmwareManufacturerIdentifier',
'Census_FirmwareVersionIdentifier',
'Census_IsSecureBootEnabled',
'Census_IsWIMBootEnabled',
'Census_IsVirtualDevice',
'Census_IsTouchEnabled',
'Census_IsPenCapable',
'Census_IsAlwaysOnAlwaysConnectedCapable',
'Wdft_IsGamer',
'Wdft_RegionIdentifier',
]
def plot_category_percent_of_target_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(20, 8))
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    cat_percent[col] = cat_percent[col].astype('category')
    sns.barplot(ax=ax[0], x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])
    for i, p in enumerate(ax[0].patches):
        ax[0].annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)
    ax[0].set_title('Barplot sorted by count', fontsize=20)
    sns.barplot(ax=ax[1], x='HasDetections', y=col, data=cat_percent)
    # annotate the right-hand plot using its own bars
    for i, p in enumerate(ax[1].patches):
        ax[1].annotate('{}'.format(cat_percent['count'].sort_index().values[i]), (0, p.get_y()+0.6), fontsize=20)
    ax[1].set_title('Barplot sorted by index', fontsize=20)
    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.subplots_adjust(wspace=0.5, hspace=0)
    plt.show()
def plot_kde_hist_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    # build the masks from train_small so they stay aligned if it is a sample
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, col], ax=ax[0], label='NoDetection(0)')
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, col], ax=ax[0], label='HasDetection(1)')
    train_small.loc[train_small['HasDetections'] == 0, col].hist(ax=ax[1], bins=100)
    train_small.loc[train_small['HasDetections'] == 1, col].hist(ax=ax[1], bins=100)
    plt.suptitle(col, fontsize=30)
    ax[0].set_yscale('log')
    ax[0].set_title('KDE plot')
    ax[1].set_title('Histogram')
    ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
    ax[1].set_yscale('log')
    plt.show()
4.2.1 IsBeta - Defender state information e.g. false
col = numeric_features[0]
plot_kde_hist_for_numeric(col)
plot_category_percent_of_target_for_numeric(col)