Competition Code 2 (Is this Malware?)

Latest recommended article published on 2021-10-10 20:13:12

-Ausen


Category: Malware Detection

Article link: https://blog.csdn.net/weixin_43971116/article/details/97304961


Code link

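The author's main idea: first examine the features one by one. If a single value accounts for more than 90% of a feature, or missing values account for more than 90% of it, drop that column outright, then analyze the filtered DataFrame. The analysis starts by pulling out one feature that is strongly related to the others and then goes through the remaining features one at a time, using plots. If a feature is categorical, its dtype is changed to 'category', which corresponds to a categorical variable in statistics. If a feature shows no categorical structure on its own, it is analyzed together with the feature that was pulled out first, to look for categorical structure in the combination.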
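The column filter described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `drop_uninformative_columns`, the `threshold` parameter, and the toy data are hypothetical and do not come from the original kernel.

```python
import numpy as np
import pandas as pd


def drop_uninformative_columns(df, threshold=0.9):
    """Drop columns where NaN or one single value covers more than `threshold` of the rows."""
    to_drop = []
    for col in df.columns:
        # Share of missing values in this column.
        nan_share = df[col].isna().mean()
        # Share of the most frequent non-missing value (0.0 if the column is all NaN).
        counts = df[col].value_counts(dropna=True)
        top_share = counts.iloc[0] / len(df) if len(counts) else 0.0
        if nan_share > threshold or top_share > threshold:
            to_drop.append(col)
    return df.drop(columns=to_drop), to_drop


# Toy data: 'constant' is dominated by one value, 'mostly_nan' is almost empty.
toy = pd.DataFrame({
    'constant': [1] * 19 + [0],
    'mostly_nan': [np.nan] * 19 + [3.0],
    'useful': list(range(20)),
})
filtered, dropped = drop_uninformative_columns(toy)
print(dropped)
```

Returning the list of dropped names alongside the filtered frame makes it easy to drop the same columns from the test set.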

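(Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes one of a limited, usually fixed, number of possible values, for example gender, social class, blood type, nationality, observation period, or degree of approval.
Unlike other statistical variables, categorical data can have a specific order, for example by degree, as in "strongly agree" vs. "agree" or "first observation" vs. "second observation", but numerical operations such as addition or division are not possible. The order of a categorical is set by hand when it is created and is static.
Every value of categorical data is either one of the predefined categories or a missing value (np.nan). The order is defined by the order of the predefined category set, not by the lexical order of its elements. Internally, a categorical consists of an array of category names and an integer array of codes that point to the real value in the category array.
For details, see the detailed category documentation.)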
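A small sketch of these properties, using made-up survey answers (the variable names and data are illustrative only):

```python
import numpy as np
import pandas as pd

# The order of the categories is fixed by hand at creation time.
levels = ['disagree', 'agree', 'strongly agree']
answers = pd.Series(
    pd.Categorical(['agree', 'strongly agree', 'disagree', np.nan, 'agree'],
                   categories=levels, ordered=True)
)

# Integer codes point into the category array; -1 marks a missing value.
codes = answers.cat.codes.tolist()

# Sorting follows the category order, not alphabetical order; NaN goes last.
sorted_answers = answers.sort_values().tolist()

print(codes)
print(sorted_answers)
```

An alphabetical sort would put 'agree' before 'disagree'; here 'disagree' comes first because it is the first entry in `levels`.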

After the analysis, define a function that splits feature values on the dot '.':

def fe(df):
    # Split the dotted version strings on '.' and keep selected parts as categorical features
    df['EngineVersion_2'] = df['EngineVersion'].apply(lambda x: x.split('.')[2]).astype('category')
    df['EngineVersion_3'] = df['EngineVersion'].apply(lambda x: x.split('.')[3]).astype('category')

    df['AppVersion_1'] = df['AppVersion'].apply(lambda x: x.split('.')[1]).astype('category')
    df['AppVersion_2'] = df['AppVersion'].apply(lambda x: x.split('.')[2]).astype('category')
    df['AppVersion_3'] = df['AppVersion'].apply(lambda x: x.split('.')[3]).astype('category')

    df['AvSigVersion_0'] = df['AvSigVersion'].apply(lambda x: x.split('.')[0]).astype('category')
    df['AvSigVersion_1'] = df['AvSigVersion'].apply(lambda x: x.split('.')[1]).astype('category')
    df['AvSigVersion_2'] = df['AvSigVersion'].apply(lambda x: x.split('.')[2]).astype('category')

    df['OsBuildLab_0'] = df['OsBuildLab'].apply(lambda x: x.split('.')[0]).astype('category')
    df['OsBuildLab_1'] = df['OsBuildLab'].apply(lambda x: x.split('.')[1]).astype('category')
    df['OsBuildLab_2'] = df['OsBuildLab'].apply(lambda x: x.split('.')[2]).astype('category')
    df['OsBuildLab_3'] = df['OsBuildLab'].apply(lambda x: x.split('.')[3]).astype('category')
    # df['OsBuildLab_40'] = df['OsBuildLab'].apply(lambda x: x.split('.')[-1].split('-')[0]).astype('category')
    # df['OsBuildLab_41'] = df['OsBuildLab'].apply(lambda x: x.split('.')[-1].split('-')[1]).astype('category')

    df['Census_OSVersion_0'] = df['Census_OSVersion'].apply(lambda x: x.split('.')[0]).astype('category')
    df['Census_OSVersion_1'] = df['Census_OSVersion'].apply(lambda x: x.split('.')[1]).astype('category')
    df['Census_OSVersion_2'] = df['Census_OSVersion'].apply(lambda x: x.split('.')[2]).astype('category')
    df['Census_OSVersion_3'] = df['Census_OSVersion'].apply(lambda x: x.split('.')[3]).astype('category')

    # Numeric features adapted from the following kernel:
    # https://www.kaggle.com/adityaecdrid/simple-feature-engineering-xd
    df['primary_drive_c_ratio'] = df['Census_SystemVolumeTotalCapacity']/ df['Census_PrimaryDiskTotalCapacity']
    df['non_primary_drive_MB'] = df['Census_PrimaryDiskTotalCapacity'] - df['Census_SystemVolumeTotalCapacity']

    df['aspect_ratio'] = df['Census_InternalPrimaryDisplayResolutionHorizontal']/ df['Census_InternalPrimaryDisplayResolutionVertical']

    df['monitor_dims'] = df['Census_InternalPrimaryDisplayResolutionHorizontal'].astype(str) + '*' + df['Census_InternalPrimaryDisplayResolutionVertical'].astype('str')
    df['monitor_dims'] = df['monitor_dims'].astype('category')

    df['dpi'] = ((df['Census_InternalPrimaryDisplayResolutionHorizontal']**2 + df['Census_InternalPrimaryDisplayResolutionVertical']**2)**.5)/(df['Census_InternalPrimaryDiagonalDisplaySizeInInches'])

    df['dpi_square'] = df['dpi'] ** 2

    df['MegaPixels'] = (df['Census_InternalPrimaryDisplayResolutionHorizontal'] * df['Census_InternalPrimaryDisplayResolutionVertical'])/1e6

    df['Screen_Area'] = (df['aspect_ratio']* (df['Census_InternalPrimaryDiagonalDisplaySizeInInches']**2))/(df['aspect_ratio']**2 + 1)

    df['ram_per_processor'] = df['Census_TotalPhysicalRAM']/ df['Census_ProcessorCoreCount']

    df['new_num_0'] = df['Census_InternalPrimaryDiagonalDisplaySizeInInches'] / df['Census_ProcessorCoreCount']

    df['new_num_1'] = df['Census_ProcessorCoreCount'] * df['Census_InternalPrimaryDiagonalDisplaySizeInInches']
    
    df['Census_IsFlightingInternal'] = df['Census_IsFlightingInternal'].fillna(1)
    df['Census_ThresholdOptIn'] = df['Census_ThresholdOptIn'].fillna(1)
    df['Census_IsWIMBootEnabled'] = df['Census_IsWIMBootEnabled'].fillna(1)
    df['Wdft_IsGamer'] = df['Wdft_IsGamer'].fillna(0)
    
    return df
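The splitting step relies on `x.split('.')` inside `apply`, which raises an error if a value is missing. A possible null-tolerant variant, shown here only as a sketch, uses the pandas string accessor instead. The helper name `split_version` and the toy values are assumptions, not part of the original code.

```python
import pandas as pd


def split_version(df, col, positions):
    """Add `<col>_<i>` category columns holding the i-th dot-separated part of `col`."""
    # One output column per part; rows with a missing value stay missing.
    parts = df[col].str.split('.', expand=True)
    for i in positions:
        df[f'{col}_{i}'] = parts[i].astype('category')
    return df


toy = pd.DataFrame({'EngineVersion': ['1.1.15100.1', '1.1.14600.4', None]})
toy = split_version(toy, 'EngineVersion', positions=[2, 3])
print(toy)
```

Calling `split_version(df, 'AppVersion', [1, 2, 3])` would produce the same column names as the corresponding lines in `fe`.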

Then encode the categorical features:

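import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

# train, test, cat_cols, frequency_encoding and reduce_mem_usage are assumed
# to be defined earlier (they are not shown in this post).

# Collect the categorical columns with more than 1000 distinct values for frequency encoding
to_encode = []
for col in cat_cols:
    if train[col].nunique() > 1000:
        print(col, train[col].nunique())
        to_encode.append(col)

# Frequency-encode the high-cardinality columns; values missing from the mapping become NaN
for col in tqdm_notebook(to_encode):
    freq_enc_dict = frequency_encoding(col)
    train[col] = train[col].map(lambda x: freq_enc_dict.get(x, np.nan))
    test[col] = test[col].map(lambda x: freq_enc_dict.get(x, np.nan))
    cat_cols.remove(col)

# Ordinary label encoding for the remaining categorical features;
# cat_cols now holds only the columns that were not frequency-encoded
indexer = {}
for col in cat_cols:
    _, indexer[col] = pd.factorize(train[col].astype(str), sort=True)

for col in tqdm_notebook(cat_cols):
    # Categories that appear only in test are encoded as -1
    train[col] = indexer[col].get_indexer(train[col].astype(str))
    test[col] = indexer[col].get_indexer(test[col].astype(str))

    train = reduce_mem_usage(train, verbose=False)
    test = reduce_mem_usage(test, verbose=False)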
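To see what the two encodings do, here is a toy sketch. The post does not show the body of `frequency_encoding`, so the `frequency_map` helper below is a hypothetical stand-in that maps each value to its relative frequency in the training data; the author's helper may work differently. The column names and values are made up.

```python
import numpy as np
import pandas as pd


def frequency_map(values):
    """Return {value: relative frequency} computed from `values` (hypothetical helper)."""
    return values.value_counts(normalize=True).to_dict()


train = pd.DataFrame({'city': ['a', 'a', 'a', 'b'],
                      'os': ['win10', 'win7', 'win10', 'win8']})
test = pd.DataFrame({'city': ['a', 'c'],
                     'os': ['win7', 'win11']})

# Frequency encoding: values never seen in train map to NaN.
freq = frequency_map(train['city'])
train['city'] = train['city'].map(lambda x: freq.get(x, np.nan))
test['city'] = test['city'].map(lambda x: freq.get(x, np.nan))

# Label encoding: build the sorted category index on train,
# then look up positions; values unseen in train become -1.
_, index = pd.factorize(train['os'].astype(str), sort=True)
train['os'] = index.get_indexer(train['os'].astype(str))
test['os'] = index.get_indexer(test['os'].astype(str))

print(train)
print(test)
```

Fitting both mappings on the training data only means that test-only values show up as NaN or -1, which tree-based models can treat as their own group.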

After that comes the decision-tree-based modeling.
