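This post is about getting a clear look at the data and the relationships within it; it is purely data visualization.
Code walkthrough:
The author first looks at the label distribution of the training set and finds that the two classes are very well balanced.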
train['HasDetections'].value_counts().plot.bar()
plt.title('HasDetections(target)')
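The bar chart shows the balance visually; the same check can be done numerically. The following is a minimal sketch on a toy DataFrame (the toy data is an assumption, not the competition data) that prints the share of each class.

```python
import pandas as pd

# Toy stand-in for the real training labels (assumption: a binary 0/1 column).
toy = pd.DataFrame({'HasDetections': [0, 1, 1, 0, 1, 0, 0, 1]})

# Share of each class; values close to 0.5 indicate a balanced target.
class_share = toy['HasDetections'].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same call on train['HasDetections'] should return two shares close to 0.5 if the target is as balanced as the plot suggests.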
Next, check which features contain missing values.
# checking missing data
# total number of missing values per feature
total = train.isnull().sum().sort_values(ascending = False)
# percentage of each feature's values that are missing
percent = (train.isnull().sum()/train.isnull().count()*100).sort_values(ascending = False)
missing_train_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
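The missing-value table can also be used to pick out the mostly empty features programmatically. The sketch below wraps the same computation in a helper and filters by a percentage threshold; the helper name missing_summary and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, sorted descending."""
    total = df.isnull().sum()
    percent = df.isnull().mean() * 100  # mean of a boolean mask = fraction missing
    out = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return out.sort_values('Total', ascending=False)

# Toy data: column 'a' is 75% missing, 'b' is 25% missing, 'c' is complete.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, 4.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)

# Names of the columns whose missing percentage exceeds 70.
mostly_missing = summary.index[summary['Percent'] > 70].tolist()
print(summary)
print(mostly_missing)
```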
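Next, take the features with more than 70% missing values and analyze their effect on the label.
Note: the author uses train_small in place of train here, but at this point the two still hold the same data.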
sns.countplot(x='PuaMode', hue='HasDetections',data=train_small)
plt.show()
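The plot shows that this feature does have some influence on the result, but the number of non-missing values it has is far too small relative to the whole dataset, so we drop this feature.
Next: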
sns.countplot(x='Census_ProcessorClass', hue='HasDetections',data=train_small)
plt.show()
The author notes:
The meaning of 'Census_ProcessorClassr' is 'Number of logical cores in the processor'. You can check that the more logical cores, the more probable infection with malwares. This feature could be a good features only or component for the combinations with other features. Keep this and think it!
In short, this feature can be kept.
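For features whose values are of float type, the author uses a density curve plus a histogram to show their effect on the label.
For example:
During the analysis, when a single value accounts for the vast majority of a feature, that column can also be dropped outright.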
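One way to make the "a single value dominates the feature" check concrete is to measure the share of the most frequent value in each column. This is only a sketch: the helper name dominant_share, the toy columns, and the 0.95 cutoff are all assumptions, since the source does not state a threshold.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of rows taken by the most frequent value (NaN counts as a value)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

toy = pd.DataFrame({
    'almost_constant': [0] * 99 + [1],   # one value covers 99% of rows
    'varied': list(range(100)),          # every value is unique
})

shares = toy.apply(dominant_share)
cutoff = 0.95  # hypothetical threshold, not taken from the source
to_drop = shares[shares > cutoff].index.tolist()
print(shares)
print(to_drop)
```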
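Take the Census_InternalBatteryType feature, which describes the type of battery in the machine.
Its values are as follows:
Since batteries in today's computers are mostly either lithium or something else, the values are collapsed into two classes: lithium batteries and all other batteries.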
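def group_battery(x):
    # str() keeps missing values (NaN) from raising on .lower(); they fall into class 0
    x = str(x).lower()
    # any battery type containing 'li' is treated as lithium
    if 'li' in x:
        return 1
    else:
        return 0
train_small['Census_InternalBatteryType'] = train_small['Census_InternalBatteryType'].apply(group_battery)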
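The same lithium-versus-other grouping can be written with pandas string methods instead of a row-wise apply. This is a sketch on a toy Series (the sample values are assumptions chosen for illustration); na=False sends missing values to class 0.

```python
import numpy as np
import pandas as pd

# Toy battery-type values, including a missing entry.
battery = pd.Series(['lion', 'Li-I', 'nimh', np.nan, 'LIPO', '#'])

# Lower-case, then test for the substring 'li'; missing values become False.
is_lithium = (battery.str.lower()
                     .str.contains('li', na=False, regex=False)
                     .astype(int))
print(is_lithium.tolist())
```

A substring test is crude: any value that merely contains the letters "li" would be counted as lithium, so it is worth inspecting the distinct values first.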
Analyzing the regrouped feature once more gives:
The author comments:
The difference is quite small. Do you think that some malwares recognize and select machine based on the type of battery?
Battery is very important part for life of machine. I think that malware will focus on other hardware and software parts of machine. remove this.
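That is, this feature contributes very little to the label, and an attacker would not decide whether to target a machine based on its battery type, so the feature can be dropped.
The author analyzed every feature with more than 70% missing values in this way.
Next come the categorical features,
namely the features in the list below: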
categorical_features = [
'ProductName',
'EngineVersion',
'AppVersion',
'AvSigVersion',
'Platform',
'Processor',
'OsVer',
'OsPlatformSubRelease',
'OsBuildLab',
'SkuEdition',
'SmartScreen',
'Census_MDC2FormFactor',
'Census_DeviceFamily',
'Census_PrimaryDiskTypeName',
'Census_ChassisTypeName',
'Census_PowerPlatformRoleName',
'Census_OSVersion',
'Census_OSArchitecture',
'Census_OSBranch',
'Census_OSEdition',
'Census_OSSkuName',
'Census_OSInstallTypeName',
'Census_OSWUAutoUpdateOptionsName',
'Census_GenuineStateName',
'Census_ActivationChannel',
'Census_FlightRing',
]
First, the author defines a function that visualizes, for each value of a feature, the rate of positive labels.
def plot_category_percent_of_target(col):
    fig, ax = plt.subplots(1, 1, figsize=(15, 10))
    # share of rows with HasDetections == 1 for each value of this feature
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    # number of rows for each value of this feature
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    # keep the 20 most frequent values
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    sns.barplot(ax=ax, x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])
    # annotate each bar with the row count of its value
    for i, p in enumerate(ax.patches):
        ax.annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)
    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.show()
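The numbers behind this bar chart can also be inspected as a table. The sketch below computes the detection rate and row count per category without plotting; the helper name category_rate_table and the toy data are assumptions made for illustration.

```python
import pandas as pd

def category_rate_table(df, col, target='HasDetections', top=20):
    """Detection rate and row count per category, most frequent first."""
    return (df.groupby(col, observed=True)[target]
              .agg(rate='mean', count='count')
              .sort_values('count', ascending=False)
              .head(top))

toy = pd.DataFrame({
    'Platform': ['w10', 'w10', 'w10', 'w8', 'w8', 'w7'],
    'HasDetections': [1, 1, 0, 0, 1, 0],
})
table = category_rate_table(toy, 'Platform')
print(table)
```

Reading the rate next to the count helps avoid over-interpreting categories that have a striking rate but only a handful of rows.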
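Each feature name in the list is then passed to the function one by one to show how strongly each of its values affects the label.
Example:
Next come the numeric features.
The list of numeric features is: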
numeric_features = [
'IsBeta',
'RtpStateBitfield',
'IsSxsPassiveMode',
'DefaultBrowsersIdentifier',
'AVProductStatesIdentifier',
'AVProductsInstalled',
'AVProductsEnabled',
'HasTpm',
'CountryIdentifier',
'CityIdentifier',
'OrganizationIdentifier',
'GeoNameIdentifier',
'LocaleEnglishNameIdentifier',
'OsBuild',
'OsSuite',
'IsProtected',
'AutoSampleOptIn',
'SMode',
'IeVerIdentifier',
'Firewall',
'UacLuaenable',
'Census_OEMNameIdentifier',
'Census_OEMModelIdentifier',
'Census_ProcessorCoreCount',
'Census_ProcessorManufacturerIdentifier',
'Census_ProcessorModelIdentifier',
'Census_PrimaryDiskTotalCapacity',
'Census_SystemVolumeTotalCapacity',
'Census_HasOpticalDiskDrive',
'Census_TotalPhysicalRAM',
'Census_InternalPrimaryDiagonalDisplaySizeInInches',
'Census_InternalPrimaryDisplayResolutionHorizontal',
'Census_InternalPrimaryDisplayResolutionVertical',
'Census_InternalBatteryNumberOfCharges',
'Census_OSBuildNumber',
'Census_OSBuildRevision',
'Census_OSInstallLanguageIdentifier',
'Census_OSUILocaleIdentifier',
'Census_IsPortableOperatingSystem',
'Census_IsFlightsDisabled',
'Census_ThresholdOptIn',
'Census_FirmwareManufacturerIdentifier',
'Census_FirmwareVersionIdentifier',
'Census_IsSecureBootEnabled',
'Census_IsWIMBootEnabled',
'Census_IsVirtualDevice',
'Census_IsTouchEnabled',
'Census_IsPenCapable',
'Census_IsAlwaysOnAlwaysConnectedCapable',
'Wdft_IsGamer',
'Wdft_RegionIdentifier',
]
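The author defines two kinds of plots: one is the same as the plot used above for categorical features, and the other is a kdeplot (kernel density estimate plot).
kdeplot (kernel density estimate plot)
Kernel density estimation is used in probability theory to estimate an unknown density function, and it is a nonparametric estimation method. A KDE plot gives a fairly intuitive view of how the data sample itself is distributed.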
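To make the idea concrete: with a Gaussian kernel K and bandwidth h, the estimate at a point x is the average of kernels centred on the n samples, f(x) = (1 / (n h)) * sum_i K((x - x_i) / h). The following is a minimal NumPy sketch of that formula; the function name, the sample values, and the fixed bandwidth are assumptions for illustration, and seaborn chooses its own bandwidth by default.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distances between every grid point and every sample,
    # shape (len(grid), len(samples)).
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth.
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

samples = np.array([1.0, 2.0, 2.5, 4.0])
grid = np.linspace(-5.0, 11.0, 1601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# A density should integrate to about 1; check with a simple Riemann sum.
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 4))
```

A smaller bandwidth gives a spikier curve that follows individual samples, while a larger one smooths them together.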
The plotting code is as follows:
def plot_category_percent_of_target_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(20, 8))
    # detection rate and row count for each value of the feature
    cat_percent = train_small[[col, 'HasDetections']].groupby(col, as_index=False).mean()
    cat_size = train_small[col].value_counts().reset_index(drop=False)
    cat_size.columns = [col, 'count']
    cat_percent = cat_percent.merge(cat_size, on=col, how='left')
    cat_percent['HasDetections'] = cat_percent['HasDetections'].fillna(0)
    # keep the 20 most frequent values and treat them as categories
    cat_percent = cat_percent.sort_values(by='count', ascending=False)[:20]
    cat_percent[col] = cat_percent[col].astype('category')
    # left panel: bars ordered by count
    sns.barplot(ax=ax[0], x='HasDetections', y=col, data=cat_percent, order=cat_percent[col])
    for i, p in enumerate(ax[0].patches):
        ax[0].annotate('{}'.format(cat_percent['count'].values[i]), (p.get_width(), p.get_y()+0.5), fontsize=20)
    ax[0].set_title('Barplot sorted by count', fontsize=20)
    # right panel: bars ordered by value; iterate over this panel's own patches
    sns.barplot(ax=ax[1], x='HasDetections', y=col, data=cat_percent)
    for i, p in enumerate(ax[1].patches):
        ax[1].annotate('{}'.format(cat_percent['count'].sort_index().values[i]), (0, p.get_y()+0.6), fontsize=20)
    ax[1].set_title('Barplot sorted by index', fontsize=20)
    plt.xlabel('% of HasDetections(target)')
    plt.ylabel(col)
    plt.subplots_adjust(wspace=0.5, hspace=0)
    plt.show()
def plot_kde_hist_for_numeric(col):
    fig, ax = plt.subplots(1, 2, figsize=(16, 8))
    # build the masks from train_small itself so they always align with the rows being plotted
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 0, col], ax=ax[0], label='NoDetection(0)')
    sns.kdeplot(train_small.loc[train_small['HasDetections'] == 1, col], ax=ax[0], label='HasDetection(1)')
    ax[0].legend()
    train_small.loc[train_small['HasDetections'] == 0, col].hist(ax=ax[1], bins=100)
    train_small.loc[train_small['HasDetections'] == 1, col].hist(ax=ax[1], bins=100)
    plt.suptitle(col, fontsize=30)
    ax[0].set_yscale('log')
    ax[0].set_title('KDE plot')
    ax[1].set_title('Histogram')
    ax[1].legend(['NoDetection(0)', 'HasDetection(1)'])
    ax[1].set_yscale('log')
    plt.show()
Each feature in the list is then analyzed in turn, for example:
col = numeric_features[0]
plot_kde_hist_for_numeric(col)
plot_category_percent_of_target_for_numeric(col)
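From these plots the author works out what each numeric feature means, how it is distributed, and how it affects the label.
Next comes correlation, that is, the correlation between each feature and the label.
The correlations are first shown as a Series.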
# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns
corr = train_small.corr(numeric_only=True)['HasDetections']
corr.abs().sort_values(ascending=False)
HasDetections 1.000000
AVProductsInstalled 0.149626
AVProductStatesIdentifier 0.117404
Census_IsAlwaysOnAlwaysConnectedCapable 0.062780
Census_TotalPhysicalRAM 0.057069
IsProtected 0.057045
Census_ProcessorCoreCount 0.054299
Wdft_IsGamer 0.053891
Census_IsVirtualDevice 0.051464
AVProductsEnabled 0.041985
RtpStateBitfield 0.041486
Census_IsTouchEnabled 0.040410
IsSxsPassiveMode 0.035066
Census_InternalPrimaryDiagonalDisplaySizeInInches 0.034240
Census_InternalPrimaryDisplayResolutionHorizontal 0.031920
Census_OSBuildNumber 0.029486
Census_FirmwareManufacturerIdentifier 0.025924
OsBuild 0.024754
Wdft_RegionIdentifier 0.022855
Census_ProcessorModelIdentifier 0.022711
Census_HasOpticalDiskDrive 0.020842
OsSuite 0.020301
Census_InternalBatteryNumberOfCharges 0.020147
Census_IsPenCapable 0.017177
IeVerIdentifier 0.015907
Census_OEMNameIdentifier 0.015541
SMode 0.014536
Census_SystemVolumeTotalCapacity 0.014481
Census_InternalPrimaryDisplayResolutionVertical 0.013927
LocaleEnglishNameIdentifier 0.009981
Census_OSBuildRevision 0.009342
CountryIdentifier 0.007099
Census_ProcessorManufacturerIdentifier 0.006873
HasTpm 0.005490
Census_OEMModelIdentifier 0.004512
GeoNameIdentifier 0.003975
OrganizationIdentifier 0.003243
Firewall 0.003036
Census_IsFlightsDisabled 0.002807
Census_OSInstallLanguageIdentifier 0.002546
Census_IsPortableOperatingSystem 0.002497
CityIdentifier 0.002282
Census_FirmwareVersionIdentifier 0.002047
Census_OSUILocaleIdentifier 0.001786
Census_IsSecureBootEnabled 0.001711
Census_ThresholdOptIn 0.000757
Census_IsWIMBootEnabled 0.000545
AutoSampleOptIn 0.000502
UacLuaenable 0.000351
Census_PrimaryDiskTotalCapacity 0.000170
IsBeta 0.000040
Name: HasDetections, dtype: float64
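Because the listing above ranks absolute values, the direction of each relationship is hidden. The sketch below, on toy data (all column names except HasDetections are hypothetical), keeps the signed correlations alongside the absolute ranking using DataFrame.corrwith.

```python
import pandas as pd

toy = pd.DataFrame({
    'f_pos': [1, 2, 3, 4, 5, 6],    # rises with the target
    'f_neg': [6, 5, 4, 3, 2, 1],    # falls with the target
    'f_weak': [1, 2, 1, 2, 1, 2],   # only loosely related
    'HasDetections': [0, 0, 0, 1, 1, 1],
})

features = toy.drop(columns='HasDetections')
# Signed Pearson correlation of every feature with the target.
corr_to_target = features.corrwith(toy['HasDetections'])
# Ranking by strength, ignoring sign, as in the listing above.
ranked = corr_to_target.abs().sort_values(ascending=False)
print(corr_to_target)
print(ranked)
```

Note that Pearson correlation only captures linear association; identifier-like columns can carry signal that a near-zero coefficient does not reveal.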
The correlations are then shown as a heatmap.
Heatmap code:
def corr_heatmap(cols):
    correlations = train_small[cols+['HasDetections']].corr()
    # Create color map ranging between two colors
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(correlations, cmap=cmap, vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .75})
    plt.show()
Example:
The following code is then used to look at the correlations among all features:
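# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns
corr = train_small.corr(numeric_only=True)
# keep only the upper triangle of the correlation matrix (k=1 excludes the diagonal);
# the built-in bool is used because the np.bool alias is not available in every NumPy release
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))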
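Next, regression plots are drawn for the feature pairs whose correlation coefficient exceeds 0.3 in absolute value.
Plotting code:
threshold = 0.3
for i, df in (upper.iterrows()):
    for ele in df[df.abs() > threshold].items():
        # skip a feature paired with itself
        if ele[0] == i:
            continue
        # lmplot is a figure-level function and creates its own figure, so its size is set with height
        sns.lmplot(x=i, y=ele[0], data=train_small[:100000], hue='HasDetections', palette='Set1', scatter_kws={'alpha':0.3}, height=7)
        plt.show()
        print('{:50}, {:50} : {}'.format(i, ele[0], ele[1]))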
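The strongly correlated pairs can also be listed without drawing anything. The sketch below applies the same upper-triangle mask to a toy DataFrame (the toy columns are assumptions) and uses stack() to turn the matrix into one row per feature pair.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
toy = pd.DataFrame({
    'x': x,
    'x_copy': 2.0 * x + 1.0,                    # linear function of x
    'other': [1, -1, -1, 1, 1, -1, -1, 1],      # uncorrelated with x
})

corr = toy.corr()
# True strictly above the diagonal, so each pair appears once and self-pairs vanish.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# One entry per (feature, feature) pair; masked cells are NaN and are dropped.
pairs = upper.stack().dropna()
strong = pairs[pairs.abs() > 0.3]
print(strong)
```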
Example plot:
Finally, the author concludes:
Many features are categorical and the pairs which have high correlations are also composed of categorial features.
I think that the keypoint is to make some features which have categories with high probability of infection from malwares.
some features are redundant.
In other words, I think the author means that we can build new features from the existing ones that correlate very strongly with the label.
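One possible way to act on this idea is to replace each category with its observed detection rate, shrunk toward the overall rate so that rare categories do not get extreme values. This is my own illustration rather than the author's method; the function name, the weight parameter, and the toy data are assumptions.

```python
import pandas as pd

def smoothed_rate_encoding(df, col, target='HasDetections', weight=10.0):
    """Map each category to its target rate, shrunk toward the global rate."""
    prior = df[target].mean()
    stats = df.groupby(col, observed=True)[target].agg(['mean', 'count'])
    # Weighted average of the category rate and the global rate:
    # categories with few rows stay close to the prior.
    smoothed = ((stats['count'] * stats['mean'] + weight * prior)
                / (stats['count'] + weight))
    return df[col].map(smoothed)

toy = pd.DataFrame({
    'SmartScreen': ['Off', 'Off', 'Off', 'Off', 'On', 'On', 'Warn'],
    'HasDetections': [1, 1, 1, 0, 0, 0, 1],
})
toy['SmartScreen_rate'] = smoothed_rate_encoding(toy, 'SmartScreen', weight=2.0)
print(toy)
```

Computing the rates on the same rows a model is trained on leaks the label into the feature, so in practice the rates would normally be computed out of fold.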