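Preface
Reading other people's well-written code is a good way to improve your own: you pick up programming knowledge, good habits, and the occasional clever trick. This post collects my personal notes and opinions on the seventh-place solution to Microsoft's 2019 malware detection competition. Some of the code already carried my comments, and I add further comments as I go. My ability is limited, so mistakes are inevitable; corrections from fellow enthusiasts are welcome.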
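Main text
Overview
Building a machine learning classification model consists of two main parts:
1. Data preprocessing (data cleaning, feature engineering, and so on)
2. Model building (training and hyperparameter tuning)
Data preprocessing is the groundwork for model building, and the quality of the training data largely determines the quality of the final model. That is why most of the code in a typical machine learning project deals with data, and this code is no exception. In my opinion the data processing here is not great, but it is acceptable (for more interesting preprocessing code, see another of my posts). The machine learning algorithm used is LightGBM.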
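Code walkthrough
Note:
I explain the code piece by piece, but because of hardware limits I cannot show the output of each step. Readers who have the resources can copy the code and run it themselves; the data files can be downloaded from the competition's official site. Although the explanation is split into steps, joining the pieces from top to bottom gives the complete program.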
- Importing libraries
#imports
import numpy as np
import pandas as pd
import gc # Python's garbage collection interface
import time # apparently unused in this code
import random # random numbers
from lightgbm import LGBMClassifier # LightGBM classifier
from sklearn.metrics import roc_auc_score, roc_curve # AUC / ROC: a measure of how well the model separates the classes
from sklearn.model_selection import StratifiedKFold # splits the data into training and validation folds
import matplotlib.pyplot as plot # visualization
import seaborn as sb # visualization
- Preparation before the main logic
#vars
dataFolder = '../input/'
submissionFileName = 'submission'
trainFile='train.csv'
testFile='test.csv'
#used 4000000 nr of rows in stead of 8000000 because of Kernel memory issue
numberOfRows = 4000000
seed = 6001
np.random.seed(seed)
random.seed(seed)
def displayImportances(featureImportanceDf, submissionFileName):
    # Sort the features by mean importance in descending order and keep the sorted feature names in cols
    cols = featureImportanceDf[["feature", "importance"]].groupby("feature").mean().sort_values(by = "importance", ascending = False).index
    # .loc accepts a boolean mask (one True/False per row) as well as index labels
    # isin() takes a list and returns a boolean per element; cols holds every feature, so this mask keeps all rows
    bestFeatures = featureImportanceDf.loc[featureImportanceDf.feature.isin(cols)]
    plot.figure(figsize = (14, 14))
    sb.barplot(x = "importance", y = "feature", data = bestFeatures.sort_values(by = "importance", ascending = False))
    plot.title('LightGBM Features')
    plot.tight_layout()
    plot.savefig(submissionFileName + '.png')
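In this snippet I don't think the paths really need to be split across several variables (perhaps it is just the code author's habit). The use of numberOfRows = 4000000 only becomes clear further down: it is passed as nrows to pd.read_csv, so only the first 4,000,000 rows of train.csv are loaded (they are later split into training and validation folds). The code's own comment says this is a workaround for the Kernel memory limit. seed = 6001 and the two lines after it seed the random number generators. I wondered at first why random.seed(seed) is needed after np.random.seed(seed): the two calls seed different generators, NumPy's global generator and the one in Python's standard random module, and seeding one has no effect on the other. The custom function displayImportances is used at the very end to plot the feature importances and save the figure as a PNG.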
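The point about the two seeds can be checked with a short experiment. This is a minimal sketch, not part of the original kernel; the variable names first_py, first_np, second_py and second_np are made up for the illustration.

```python
import random
import numpy as np

seed = 6001

# Seed both generators, as the original code does, and draw one number from each.
random.seed(seed)
np.random.seed(seed)
first_py = random.random()
first_np = np.random.rand()

# Reseed only NumPy's global generator.
np.random.seed(seed)
second_np = np.random.rand()   # NumPy restarts its sequence
second_py = random.random()    # Python's generator simply continues its own sequence

print("NumPy value repeated: ", first_np == second_np)
print("Python value repeated:", first_py == second_py)
```

Reseeding NumPy leaves the standard library generator untouched, which is why code that uses both modules seeds both.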
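- Setting a type for each feature in the official files
A CSV file stores only values and carries no type information, so the column types have to be specified by hand. Choosing compact types (int8, float16, category) instead of the pandas defaults also cuts memory usage considerably.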
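To see what explicit types buy, the sketch below reads the same small CSV twice, once with the pandas defaults and once with compact types. The data is synthetic and made up for the illustration; only the three column names come from the dtypes dictionary, and float32 is used here in place of float16 purely as an assumption to keep the example conservative.

```python
import io
import pandas as pd

# Synthetic CSV with made-up values; only the column names come from the dtypes dict.
csv_text = "ProductName,IsBeta,AVProductsInstalled\n" + "win8defender,0,1.0\n" * 1000

# Read once with default type inference and once with explicit compact types.
default_df = pd.read_csv(io.StringIO(csv_text))
typed_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"ProductName": "category", "IsBeta": "int8", "AVProductsInstalled": "float32"},
)

# deep=True also counts the memory held by the string values themselves.
default_bytes = int(default_df.memory_usage(deep=True).sum())
typed_bytes = int(typed_df.memory_usage(deep=True).sum())

print(default_df.dtypes)
print(typed_df.dtypes)
print("default:", default_bytes, "bytes; typed:", typed_bytes, "bytes")
```

On millions of rows the same effect decides whether the data fits in the Kernel's memory at all.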
dtypes = {
'MachineIdentifier': 'category',
'ProductName': 'category',
'EngineVersion': 'category',
'AppVersion': 'category',
'AvSigVersion': 'category',
'IsBeta': 'int8',
'RtpStateBitfield': 'float16',
'IsSxsPassiveMode': 'int8',
'DefaultBrowsersIdentifier': 'float16',
'AVProductStatesIdentifier': 'float32',
'AVProductsInstalled': 'float16',
'AVProductsEnabled': 'float16',
'HasTpm': 'int8',
'CountryIdentifier': 'int16',
'CityIdentifier': 'float32',
'OrganizationIdentifier': 'float16',
'GeoNameIdentifier': 'float16',
'LocaleEnglishNameIdentifier': 'int8',
'Platform': 'category',
'Processor': 'category',
'OsVer': 'category',
'OsBuild': 'int16',
'OsSuite': 'int16',
'OsPlatformSubRelease': 'category',
'OsBuildLab': 'category',
'SkuEdition': 'category',
'IsProtected': 'float16',
'AutoSampleOptIn': 'int8',
'PuaMode': 'category',
'SMode': 'float16',
'IeVerIdentifier': 'float16',
'SmartScreen': 'category',
'Firewall': 'float16',
'UacLuaenable': 'float32',
'Census_MDC2FormFactor': 'category',
'Census_DeviceFamily': 'category',
'Census_OEMNameIdentifier': 'float16',
'Census_OEMModelIdentifier': 'float32',
'Census_ProcessorCoreCount': 'float16',
'Census_ProcessorManufacturerIdentifier': 'float16',
'Census_ProcessorModelIdentifier': 'float16',
'Census_ProcessorClass': 'category',
'Census_PrimaryDiskTotalCapacity': 'float32',
'Census_PrimaryDiskTypeName': 'category',
'Census_SystemVolumeTotalCapacity': 'float32',
'Census_HasOpticalDiskDrive': 'int8',
'Census_TotalPhysicalRAM': 'float32',
'Census_ChassisTypeName': 'category',
'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
'Census_PowerPlatformRoleName': 'category',
'Census_InternalBatteryType': 'category',
'Census_InternalBatteryNumberOfCharges': 'float32',
'Census_OSVersion': 'category',
'Census_OSArchitecture': 'category',
'Census_OSBranch': 'category',
'Census_OSBuildNumber': 'int16',
'Census_OSBuildRevision': 'int32',
'Census_OSEdition': 'category',
'Census_OSSkuName': 'category',
'Census_OSInstallTypeName': 'category',
'Census_OSInstallLanguageIdentifier': 'float16',
'Census_OSUILocaleIdentifier': 'int16',
'Census_OSWUAutoUpdateOptionsName': 'category',
'Census_IsPortableOperatingSystem': 'int8',
'Census_GenuineStateName': 'category',
'Census_ActivationChannel': 'category',
'Census_IsFlightingInternal': 'float16',
'Census_IsFlightsDisabled': 'float16',
'Census_FlightRing': 'category',
'Census_ThresholdOptIn': 'float16',
'Census_FirmwareManufacturerIdentifier': 'float16',
'Census_FirmwareVersionIdentifier': 'float32',
'Census_IsSecureBootEnabled': 'int8',
'Census_IsWIMBootEnabled': 'float16',
'Census_IsVirtualDevice': 'float16',
'Census_IsTouchEnabled': 'int8',
'Census_IsPenCapable': 'int8',
'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
'Wdft_IsGamer': 'float16',
'Wdft_RegionIdentifier': 'float16',
'HasDetections': 'int8'
}
- Feature selection
selectedFeatures = [
'AVProductStatesIdentifier'
,'AVProductsEnabled'
,'IsProtected'
,'Processor'
,'OsSuite'
,'RtpStateBitfield'
,'AVProductsInstalled'
,'Wdft_IsGamer'
,'DefaultBrowsersIdentifier'
,'OsBuild'
,'Wdft_RegionIdentifier'
,'SmartScreen'
,'CityIdentifier'
,'AppVersion'
,'Census_IsSecureBootEnabled'
,'Census_PrimaryDiskTypeName'
,'Census_SystemVolumeTotalCapacity'
,'Census_HasOpticalDiskDrive'
,'Census_IsWIMBootEnabled'
,'Census_IsVirtualDevice'
,'Census_IsTouchEnabled'
,'Census_FirmwareVersionIdentifier'
,'GeoNameIdentifier'
,'IeVerIdentifier'
,'Census_FirmwareManufacturerIdentifier'
,'Census_InternalPrimaryDisplayResolutionHorizontal'
,'Census_InternalPrimaryDisplayResolutionVertical'
,'Census_OEMModelIdentifier'
,'Census_ProcessorModelIdentifier'
,'Census_OSVersion'
,'Census_InternalPrimaryDiagonalDisplaySizeInInches'
,'Census_OEMNameIdentifier'
,'Census_ChassisTypeName'
,'Census_OSInstallLanguageIdentifier'
,'EngineVersion'
,'OrganizationIdentifier'
,'CountryIdentifier'
,'Census_ActivationChannel'
,'Census_ProcessorCoreCount'
,'Census_OSWUAutoUpdateOptionsName'
,'Census_InternalBatteryType'
]
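The code's author evidently has a very strong background in data processing and may have picked these features directly from earlier experience with malware data; the code itself does not show how they were chosen. Features should not be picked at random. Without that kind of experience, it is better to borrow an established feature-selection method from someone else's preprocessing work.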
- Loading the data
# Load Data with selected features
trainDf = pd.read_csv(dataFolder + trainFile, dtype=dtypes,
                      usecols=selectedFeatures,
                      low_memory=True, nrows = numberOfRows) # training set
labels = pd.read_csv(dataFolder + trainFile,
                     usecols = ['HasDetections'], nrows = numberOfRows) # labels
testDf = pd.read_csv(dataFolder + testFile,
                     dtype=dtypes, usecols=selectedFeatures, low_memory=True) # test set
print('== Dataset Shapes ==')
print('Train : ' + str(trainDf.shape)) # trainDf.shape is a tuple
print('Labels : ' + str(labels.shape))
print('Test : ' + str(testDf.shape))
# Append Datasets and Cleanup
# The original kernel called trainDf.append(testDf); DataFrame.append was removed in pandas 2.0, so pd.concat is used here.
# The rows are stacked vertically, and reset_index() adds an 'index' column that keeps each row's original index.
df = pd.concat([trainDf, testDf]).reset_index()
del trainDf, testDf # delete trainDf and testDf to free memory
gc.collect()
df is the new DataFrame obtained by concatenating train and test.
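The effect of stacking two frames and then resetting the index can be seen on a tiny example. This is a sketch with made-up values; train_part and test_part are hypothetical stand-ins for the real frames.

```python
import pandas as pd

# Two small frames with made-up values standing in for train and test.
train_part = pd.DataFrame({"OsBuild": [17134, 16299]})
test_part = pd.DataFrame({"OsBuild": [15063]})

# Stack them vertically; reset_index() moves the old row labels into a new 'index' column.
combined = pd.concat([train_part, test_part]).reset_index()

print(combined)
print(combined.columns.tolist())
```

The extra 'index' column restarts at 0 where the second frame begins, which is why the later steps exclude it from the feature list.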
- Cleaning the values of the 'SmartScreen' feature
# Modify SmartScreen Feature
df.loc[df.SmartScreen == 'off', 'SmartScreen'] = 'Off' # df.SmartScreen == 'off' is the row condition
df.loc[df.SmartScreen == 'of', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'OFF', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '00000000', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == '0', 'SmartScreen'] = 'Off'
df.loc[df.SmartScreen == 'ON', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'on', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'Enabled', 'SmartScreen'] = 'On'
df.loc[df.SmartScreen == 'BLOCK', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == 'requireadmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'requireAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'RequiredAdmin', 'SmartScreen'] = 'RequireAdmin'
df.loc[df.SmartScreen == 'Promt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'Promprt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'prompt', 'SmartScreen'] = 'Prompt'
df.loc[df.SmartScreen == 'warn', 'SmartScreen'] = 'Warn'
df.loc[df.SmartScreen == 'Deny', 'SmartScreen'] = 'Block'
df.loc[df.SmartScreen == '', 'SmartScreen'] = 'Off'
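This shows one way to pick out specific values of a feature: build a boolean condition on the column and use it with .loc to select, and overwrite, only the matching rows.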
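The condition-based assignment can be tried on a toy column, next to a dictionary-based alternative that expresses the same cleanup in one call. This is a sketch with a made-up frame; the name canonical and the use of Series.replace are my own suggestion, not what the original kernel does.

```python
import pandas as pd

# Made-up sample of inconsistent spellings.
toy = pd.DataFrame({"SmartScreen": ["off", "OFF", "on", "Promt", "Warn"]})

# Style used in the walkthrough: one boolean mask per misspelling.
by_mask = toy.copy()
by_mask.loc[by_mask.SmartScreen == "off", "SmartScreen"] = "Off"
by_mask.loc[by_mask.SmartScreen == "OFF", "SmartScreen"] = "Off"
by_mask.loc[by_mask.SmartScreen == "on", "SmartScreen"] = "On"
by_mask.loc[by_mask.SmartScreen == "Promt", "SmartScreen"] = "Prompt"

# Alternative: collect the corrections in a dict and apply them in a single pass.
canonical = {"off": "Off", "OFF": "Off", "on": "On", "Promt": "Prompt"}
by_dict = toy.copy()
by_dict["SmartScreen"] = by_dict["SmartScreen"].replace(canonical)

print(by_mask["SmartScreen"].tolist())
print(by_dict["SmartScreen"].tolist())
```

Both versions leave values without a rule, such as 'Warn', unchanged.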
- Counting how often each value of each feature occurs and building a new DataFrame
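#Count Encoding (with exceptions)
for col in [f for f in df.columns if f not in ['index','HasDetections','Census_SystemVolumeTotalCapacity']]:
    df[col]=df[col].map(df[col].value_counts()) # replace each value in col with the number of times it occurs in that column
dfDummy = pd.get_dummies(df, dummy_na=True) # one-hot encode df (numeric columns pass through unchanged); dummy_na=True adds an indicator for missing values (NaN)
print('Dummy: ' + str(dfDummy.shape))
# Cleanup
del df
gc.collect()
# Split back into train and test. This step is missing from the code as posted and is reconstructed here as an assumption:
# the first len(labels) rows of the concatenated frame came from train.csv, the rest from test.csv.
train = dfDummy.iloc[:labels.shape[0]]
test = dfDummy.iloc[labels.shape[0]:]
# Summary Shape
print('== Dataset Shapes ==')
print('Train: ' + str(train.shape))
print('Test: ' + str(test.shape))
# Summary Columns
print('== Dataset Columns ==')
features = [f for f in train.columns if f not in ['index']]
for feature in features:
    print(feature)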
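- df[col].map(df[col].value_counts())
.map() writes each value's occurrence count back into the position that held the value (if the function is unfamiliar, I suggest looking it up; only the purpose of the code is given here). The line is neat: a single statement replaces every value in a column with the number of times that value occurs. Strictly speaking this is a count rather than a frequency; the frequency would be the count divided by the total number of rows. Why encode values this way? It is not that LightGBM itself is based on frequencies. Count encoding is a common way to turn categorical values, including high-cardinality ones, into numbers that a tree model such as LightGBM can split on. Because df holds train and test together, the counts are taken over both sets.
- features
When train and test were concatenated above, .reset_index() added a new 'index' column holding the original index, so that column is excluded here with not in ['index'].
- df[col]=df[col].map(df[col].value_counts())
This line is the hardest one in this step. As a small example, for a column holding ['a', 'b', 'a'], value_counts() gives a: 2 and b: 1, and .map() turns the column into [2, 1, 2].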
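The count encoding step can be reproduced on a toy column to see exactly what .map() and value_counts() do together. This is a sketch with made-up values; the Processor_count and Processor_freq columns are hypothetical names added for the illustration, whereas the original code overwrites the column in place.

```python
import pandas as pd

# Made-up values, including one missing entry.
toy = pd.DataFrame({"Processor": ["x64", "x86", "x64", "x64", "arm64", None]})

# value_counts() returns a Series indexed by value; missing values are left out by default.
counts = toy["Processor"].value_counts()

# map() looks each value up in that Series; values with no entry (here the missing one) become NaN.
toy["Processor_count"] = toy["Processor"].map(counts)

# A true frequency divides the count by the number of rows.
toy["Processor_freq"] = toy["Processor_count"] / len(toy)

print(counts)
print(toy)
```

Note that two different values with the same count become indistinguishable after this encoding, which is one of its known trade-offs.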
Building the machine learning model
- Training module
from lightgbm import early_stopping, log_evaluation # callbacks used in fit() below
# CV Folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = seed)
# Create arrays and dataframes to store results
oofPreds = np.zeros(train.shape[0]) # numpy.ndarray
subPreds = np.zeros(test.shape[0]) # numpy.ndarray
featureImportanceDf = pd.DataFrame()
# Loop through all Folds.
for n_fold, (trainXId, validXId) in enumerate(folds.split(train[features], labels)): # enumerate pairs each (train indices, validation indices) split with its position; there are 5 such pairs, one per fold
    # Create TrainXY and ValidationXY set based on fold-indexes
    trainX, trainY = train[features].iloc[trainXId], labels.iloc[trainXId]
    validX, validY = train[features].iloc[validXId], labels.iloc[validXId]
    print('== Fold: ' + str(n_fold)) # str() is required because '+' cannot join a str and an int; an f-string would also work
    # LightGBM parameters
    lgbm = LGBMClassifier(
        objective = 'binary',
        boosting_type = 'gbdt',
        n_estimators = 2500,
        learning_rate = 0.05,
        num_leaves = 250,
        min_data_in_leaf = 125,
        bagging_fraction = 0.901,
        max_depth = 13,
        reg_alpha = 2.5,
        reg_lambda = 2.5,
        min_split_gain = 0.0001,
        min_child_weight = 25,
        feature_fraction = 0.5,
        silent = -1,
        verbose = -1,
        #n_jobs is set to -1 instead of 4 otherwise the kernell will time out
        n_jobs = -1)
    lgbm.fit(trainX, trainY,
             eval_set=[(trainX, trainY), (validX, validY)],
             eval_metric = 'auc',
             # The original kernel passed verbose = 250 and early_stopping_rounds = 100 to fit();
             # LightGBM 4.0 removed those arguments, so the equivalent callbacks are used instead.
             callbacks = [early_stopping(stopping_rounds = 100), log_evaluation(period = 250)])
    # Evaluate the classifier: compute the AUC from the predicted positive-class probabilities and the true labels of the validation fold
    oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1] # probability that each validation sample is class 1 (positive)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(validY, oofPreds[validXId]))) # AUC from the validation labels and the predicted positive-class probabilities
    # cleanup
    print('Cleanup')
    del trainX, trainY, validX, validY
    gc.collect()
    subPreds += lgbm.predict_proba(test[features], num_iteration = lgbm.best_iteration_)[:, 1] / folds.n_splits # predict positive-class probabilities on the test set and average over the folds; folds.n_splits = 5
    # Feature Importance
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = features
    fold_importance_df["importance"] = lgbm.feature_importances_ # feature importance: the more important the feature, the larger the value
    fold_importance_df["fold"] = n_fold + 1
    featureImportanceDf = pd.concat([featureImportanceDf, fold_importance_df], axis=0) # vertical concatenation that keeps the original index
    # cleanup
    print('Cleanup. Post-Fold')
    del lgbm
    gc.collect()
print('Full AUC score %.6f' % roc_auc_score(labels, oofPreds)) # AUC over all training samples
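1. oofPreds = np.zeros(train.shape[0]) creates an all-zero array with as many elements as train has rows; subPreds = np.zeros(test.shape[0]) does the same for test.
2. Both are numpy.ndarray objects. This lets each fold write its validation predictions into oofPreds by position (oofPreds[validXId] = ...) and lets subPreds accumulate the averaged test predictions. roc_auc_score() accepts any array-like input, ndarrays included.
3. After training, the AUC measures how well the classifier separates the classes. oofPreds[validXId] = lgbm.predict_proba(validX, num_iteration = lgbm.best_iteration_)[:, 1] stores, for each validation sample, the predicted probability of class 1 (the positive class); roc_auc_score(validY, oofPreds[validXId]) then computes the AUC from the validation labels and those probabilities.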
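The out-of-fold bookkeeping described in the three notes above can be sketched on synthetic data. As an assumption to keep the example small and fast, LogisticRegression stands in for LGBMClassifier and the data comes from make_classification; the names oof_preds and sub_preds mirror oofPreds and subPreds but are otherwise hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data: the first 500 rows play the role of train, the last 100 the role of test.
X_all, y_all = make_classification(n_samples=600, n_features=10, random_state=6001)
X, y = X_all[:500], y_all[:500]
X_test = X_all[500:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=6001)
oof_preds = np.full(len(X), np.nan)   # NaN start makes any unfilled slot easy to spot
sub_preds = np.zeros(len(X_test))

for fold, (train_idx, valid_idx) in enumerate(folds.split(X, y)):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # Each row is in exactly one validation fold, so every slot is written once.
    oof_preds[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    # Each fold contributes 1/5 of the final test prediction.
    sub_preds += model.predict_proba(X_test)[:, 1] / folds.n_splits
    print("Fold %d AUC: %.4f" % (fold + 1, roc_auc_score(y[valid_idx], oof_preds[valid_idx])))

full_auc = roc_auc_score(y, oof_preds)
print("Full AUC: %.4f" % full_auc)
```

Because every out-of-fold prediction comes from a model that never saw that row, the full AUC is an honest estimate of performance on unseen data.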
- Saving the files and visualization (the plotting function is defined at the top of the code)
# Feature Importance
displayImportances(featureImportanceDf, submissionFileName)
# Generate Submission
kaggleSubmission = pd.read_csv(dataFolder + 'sample_submission.csv')
kaggleSubmission['HasDetections'] = subPreds
kaggleSubmission.to_csv(submissionFileName + '.csv', index = False)
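What displayImportances ranks can be seen without plotting anything. The sketch below uses made-up importance values for two features over two folds and repeats the groupby, mean and sort steps from the function; toy_importance and ranked_features are hypothetical names.

```python
import pandas as pd

# Made-up importances for two features across two folds.
toy_importance = pd.DataFrame({
    "feature": ["SmartScreen", "AppVersion", "SmartScreen", "AppVersion"],
    "importance": [50, 30, 70, 10],
    "fold": [1, 1, 2, 2],
})

# Average each feature's importance over the folds, then sort from most to least important.
mean_importance = (
    toy_importance.groupby("feature")["importance"].mean().sort_values(ascending=False)
)
ranked_features = mean_importance.index.tolist()

print(mean_importance)
print(ranked_features)
```

In the real function the per-fold rows are passed to seaborn's barplot, whose default estimator is also the mean, so each bar shows a feature's average importance across the folds.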