[Datawhale] Task 3: Feature Engineering

weixin_43954971

Article link: https://blog.csdn.net/weixin_43954971/article/details/105040615

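Common feature engineering steps include:

Outlier handling:

Remove outliers identified with a box plot (or the 3-sigma rule);
Box-Cox transformation (for skewed distributions);
Long-tail truncation;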
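
As a rough illustration of the 3-sigma rule and long-tail truncation, the sketch below flags extreme values and clips a tail at chosen quantiles. The synthetic values, the helper names `three_sigma_mask` and `truncate_tail`, and the 1%/99% cut-offs are assumptions made for this example, not part of the original post.

```python
import numpy as np
import pandas as pd


def three_sigma_mask(ser: pd.Series) -> pd.Series:
    """Return True for values farther than 3 standard deviations from the mean."""
    mean, std = ser.mean(), ser.std()
    return (ser - mean).abs() > 3 * std


def truncate_tail(ser: pd.Series, lower_q: float = 0.01, upper_q: float = 0.99) -> pd.Series:
    """Clip values outside the given quantiles instead of dropping rows."""
    return ser.clip(lower=ser.quantile(lower_q), upper=ser.quantile(upper_q))


# Synthetic, hypothetical data: 99 ordinary values plus one extreme value.
rng = np.random.default_rng(0)
power = pd.Series(np.append(rng.normal(100, 10, size=99), 5000.0))

mask = three_sigma_mask(power)   # boolean mask of 3-sigma outliers
cleaned = power[~mask]           # option 1: drop the flagged rows
clipped = truncate_tail(power)   # option 2: keep every row, cap the tails
```

Dropping rows shrinks the training set, while clipping keeps every row; which one fits better depends on whether the extreme values are errors or genuine but rare observations.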

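Feature normalization/standardization:

Standardization (rescale to zero mean and unit variance);
Normalization (rescale to the [0, 1] interval);
For power-law distributions, a log transform such as log((1 + x) / (1 + median)) can be used;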
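
The three rescaling options above can be sketched in a few lines of pandas. The sample values are hypothetical, and the median-relative log transform is an assumption about which formula is meant for power-law data.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed feature values.
x = pd.Series([1.0, 2.0, 4.0, 8.0, 100.0])

# Standardization: zero mean and unit (population) standard deviation.
standardized = (x - x.mean()) / x.std(ddof=0)

# Min-max normalization: rescale to the [0, 1] interval.
normalized = (x - x.min()) / (x.max() - x.min())

# Log transform relative to the median (assumed form for power-law-like data).
log_scaled = np.log((1 + x) / (1 + x.median()))
```

With this transform the median maps to 0, values above it become positive, and values below it become negative.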

Data binning:

Equal-frequency binning;
Equal-width binning;
Best-KS binning (similar to binary splitting with the Gini index);
Chi-square binning;
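
Equal-width and equal-frequency binning map directly onto `pandas.cut` and `pandas.qcut`. The following is a minimal sketch with hypothetical mileage values; the bin counts are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical mileage values.
kilometer = pd.Series([0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.5, 15.0])

# Equal-width binning: 3 bins that each span the same value range.
equal_width = pd.cut(kilometer, bins=3, labels=False)

# Equal-frequency binning: 4 bins that each hold about the same number of rows.
equal_freq = pd.qcut(kilometer, q=4, labels=False)
```

Equal-width bins can end up nearly empty on skewed data, whereas equal-frequency bins stay balanced in row count at the cost of uneven bin widths.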

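Missing value handling:

Leave as-is (for tree models such as XGBoost);
Delete (when too much of the data is missing);
Imputation, including mean/median/mode, model-based prediction, multiple imputation, compressed-sensing completion, and matrix completion;
Binning, with missing values assigned their own bin;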
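
A short sketch of the simpler options in this list: median and mode imputation, and binning with a dedicated bin for missing values. The frame, the bin edges, and the `-1` code for the missing bin are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries.
df = pd.DataFrame({
    "power": [75.0, np.nan, 150.0, 110.0, np.nan],
    "fuelType": [0.0, 1.0, np.nan, 0.0, 0.0],
})

# Median imputation for a numeric column.
df["power_filled"] = df["power"].fillna(df["power"].median())

# Mode imputation for a categorical column.
df["fuelType_filled"] = df["fuelType"].fillna(df["fuelType"].mode()[0])

# Binning with a dedicated bin for missing values (coded as -1 here).
bins = [0, 100, 200]
df["power_bin"] = pd.cut(df["power"], bins=bins, labels=False)
df["power_bin"] = df["power_bin"].fillna(-1).astype(int)
```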

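Feature construction:

Statistical features, including counts, sums, ratios, and standard deviations;
Time features, including relative and absolute time, holidays, and weekends;
Geographic information, including binning and distribution encoding;
Nonlinear transformations, including log, square, and square root;
Feature combinations and feature crosses;
Beyond these, the choice is a matter of judgment and varies by practitioner.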

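Feature selection

Filter: select features first, then train the learner; common methods include Relief, variance thresholding, the correlation coefficient, the chi-square test, and mutual information;
Wrapper: use the performance of the final learner directly as the evaluation criterion for a feature subset; a common method is LVW (Las Vegas Wrapper);
Embedded: combines the filter and wrapper approaches, with feature selection performed automatically while the learner is trained; a common example is lasso regression;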
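
As one small example of the filter approach, the sketch below applies a variance threshold with plain pandas. The feature values and the threshold of 0 are assumptions for illustration; a real threshold would be tuned to the data.

```python
import pandas as pd

# Hypothetical feature matrix: "seller" is constant and carries no information.
features = pd.DataFrame({
    "power": [60, 75, 110, 150, 90, 200],
    "kilometer": [15.0, 12.5, 8.0, 3.0, 10.0, 1.0],
    "seller": [0, 0, 0, 0, 0, 0],
})

# Variance-based filter: keep only columns whose variance exceeds the threshold.
threshold = 0.0  # assumed cut-off; tune per data set
variances = features.var()
selected = variances[variances > threshold].index.tolist()
filtered = features[selected]
```

Because the filter never looks at the learner, it is cheap to run but can discard features that only matter in combination with others.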

Dimensionality reduction

PCA/LDA/ICA;
Feature selection is also a form of dimensionality reduction.
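
To make PCA concrete, here is a minimal NumPy-only sketch based on the singular value decomposition. The synthetic data (three features, one of them nearly redundant) and the choice of two components are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 3 features, the third nearly a copy of the first.
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=200)])

# Center the data, then take the SVD; rows of vt are the principal directions.
X_centered = X - X.mean(axis=0)
u, s, vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance explained by each component.
explained_ratio = s**2 / np.sum(s**2)

# Project onto the first two principal components.
n_components = 2
X_reduced = X_centered @ vt[:n_components].T
```

Since the third column is almost a copy of the first, the first two components retain nearly all of the variance.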

Data types

train: Index(['name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype='object')
test: Index(['name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype='object')

Outlier handling

# A reusable wrapper for outlier handling.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def outliers_proc(data, col_name, scale=3):
    """
    Remove outliers from one column; uses the box-plot rule with scale=3 by default.
    :param data: pandas DataFrame
    :param col_name: name of the column to clean
    :param scale: multiplier applied to the interquartile range
    :return: a copy of data with the outlier rows dropped and the index reset
    """

    def box_plot_outliers(data_ser, box_scale):
        """
        Flag outliers with the box-plot (IQR) rule.
        :param data_ser: pandas.Series
        :param box_scale: multiplier applied to the interquartile range
        :return: (lower mask, upper mask), (lower bound, upper bound)
        """
        iqr = box_scale * (data_ser.quantile(0.75) - data_ser.quantile(0.25))
        val_low = data_ser.quantile(0.25) - iqr
        val_up = data_ser.quantile(0.75) + iqr
        rule_low = (data_ser < val_low)
        rule_up = (data_ser > val_up)
        return (rule_low, rule_up), (val_low, val_up)

    data_n = data.copy()
    data_series = data_n[col_name]
    rule, value = box_plot_outliers(data_series, box_scale=scale)
    # Positional indices of every row outside either bound.
    index = np.arange(data_series.shape[0])[rule[0] | rule[1]]
    print("Delete number is: {}".format(len(index)))
    # Map positions to labels so the drop also works with a non-default index.
    data_n = data_n.drop(data_n.index[index])
    data_n.reset_index(drop=True, inplace=True)
    print("Now row number is: {}".format(data_n.shape[0]))
    index_low = np.arange(data_series.shape[0])[rule[0]]
    outliers = data_series.iloc[index_low]
    print("Description of data less than the lower bound is:")
    print(pd.Series(outliers).describe())
    index_up = np.arange(data_series.shape[0])[rule[1]]
    outliers = data_series.iloc[index_up]
    print("Description of data larger than the upper bound is:")
    print(pd.Series(outliers).describe())
    # Box plots before (left) and after (right) removing the outliers.
    fig, ax = plt.subplots(1, 2, figsize=(10, 7))
    sns.boxplot(y=data[col_name], data=data, palette="Set1", ax=ax[0])
    sns.boxplot(y=data_n[col_name], data=data_n, palette="Set1", ax=ax[1])
    return data_n

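Feature construction

Data set for tree models

Usage time, data['used_time'] = data['creatDate'] - data['regDate'] (sale date minus registration date), reflects how long the car has been in use; in general, the longer a car has been used, the lower its price;
Extract city information from the postal code; because the data is from Germany, this follows the German postal code format, which in effect adds prior knowledge.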

data['city'] = data['regionCode'].apply(lambda x : str(x)[:-3])
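
The usage-time feature described above can be sketched as follows. This assumes the two date columns are stored as YYYYMMDD integers; the sample rows, including the malformed date in the last row, are hypothetical.

```python
import pandas as pd

# Hypothetical rows; dates are assumed to be stored as YYYYMMDD integers.
data = pd.DataFrame({
    "regDate": [20040402, 20030301, 20070009],   # last value has month 00 (invalid)
    "creatDate": [20160404, 20160309, 20160402],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
reg = pd.to_datetime(data["regDate"].astype(str), format="%Y%m%d", errors="coerce")
created = pd.to_datetime(data["creatDate"].astype(str), format="%Y%m%d", errors="coerce")

# Usage time in days; rows with an invalid date end up as NaN.
data["used_time"] = (created - reg).dt.days
```

Rows where the result is NaN can be left as they are for tree models or handled with one of the missing-value strategies listed earlier.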

Compute sales statistics for each brand; the same can be done for other features.

# Assumes `train` (rows with a price column) and `data` (the frame being built) already exist.
train_gb = train.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    # Ignore rows with a non-positive price.
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    # The +1 in the denominator makes this a smoothed mean rather than the plain mean.
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
data = data.merge(brand_fe, how='left', on='brand')
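
Most of these per-brand statistics can also be computed in a single `groupby(...).agg(...)` call. The sketch below is an alternative formulation under assumed sample data, not the post's own code; it leaves out the smoothed average.

```python
import pandas as pd

# Hypothetical training sample.
train = pd.DataFrame({
    "brand": [0, 0, 1, 1, 1],
    "price": [1000, 3000, 500, 0, 1500],
})

# Keep rows with a positive price, then aggregate per brand in one call.
brand_fe = (
    train[train["price"] > 0]
    .groupby("brand")["price"]
    .agg(
        brand_amount="count",
        brand_price_max="max",
        brand_price_median="median",
        brand_price_min="min",
        brand_price_sum="sum",
        brand_price_std="std",
    )
    .reset_index()
)
```

The resulting frame has one row per brand and can be merged back on `brand` in the same way.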

Data binning
Drop the original columns

data = data.drop(['creatDate', 'regDate', 'regionCode'], axis=1)

# At this point the data can already be used by tree models, so export it.
data.to_csv('data_for_tree.csv', index=False)

Data set for LR and NN models

power: take the log, then normalize;
kilometer: binned data; normalize;
Brand statistics: normalize;
Categorical features: one-hot (0/1) encode with pandas get_dummies.
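
The preparation steps for LR and NN models could look roughly like this. The sample frame, the `min_max` helper, and the choice of `gearbox` and `fuelType` as the categorical columns are assumptions for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the columns discussed above.
data = pd.DataFrame({
    "power": [0, 60, 110, 150, 600],
    "kilometer": [15.0, 12.5, 8.0, 3.0, 0.5],
    "gearbox": [0, 1, 0, 1, 0],
    "fuelType": [0, 1, 2, 0, 1],
})


def min_max(ser: pd.Series) -> pd.Series:
    """Rescale a Series to the [0, 1] interval."""
    return (ser - ser.min()) / (ser.max() - ser.min())


# power: log-transform (log1p handles zeros), then min-max normalize.
data["power"] = min_max(np.log1p(data["power"]))

# kilometer: min-max normalize.
data["kilometer"] = min_max(data["kilometer"])

# Categorical features: one-hot encode with pandas.get_dummies.
data = pd.get_dummies(data, columns=["gearbox", "fuelType"])
```

Linear models and neural networks are sensitive to feature scale and treat integer category codes as ordered quantities, which is why this data set needs steps that the tree-model data set does not.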

Feature selection

Correlation analysis (corr, heatmap)
Boundary effects
Embedded feature selection
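
For the correlation analysis, a minimal sketch with `DataFrame.corr` is shown below. The sample values are hypothetical, and Spearman rank correlation is an assumed choice of method.

```python
import pandas as pd

# Hypothetical sample: price rises with power and falls with kilometer.
data = pd.DataFrame({
    "power": [60, 75, 110, 150, 90, 200],
    "kilometer": [15.0, 12.5, 8.0, 3.0, 10.0, 1.0],
    "price": [1500, 2600, 6000, 12000, 4000, 20000],
})

# Full rank-correlation matrix; this is what a heatmap would display.
corr_matrix = data.corr(method="spearman")

# Correlation of each feature with the target, strongest first.
corr_with_price = corr_matrix["price"].drop("price")
ranked = corr_with_price.abs().sort_values(ascending=False)
```

The matrix can be passed to `seaborn.heatmap` for the visual check; features with near-zero correlation to the target are candidates for removal, and highly correlated feature pairs are candidates for merging.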

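Key takeaways

Transform the data into features that better represent the underlying problem
Outlier handling (removes noise) and missing-value imputation (adds prior knowledge)
Anonymous features: binning, feature statistics with groupby and agg, log/exp transforms, and arithmetic combinations of several features
Non-anonymous features: build more meaningful features based on signal processing, frequency-domain extraction, kurtosis, skewness, and so on. The same holds in recommender systems, with click-through-rate statistics by type, statistics by time period, and statistics combined with user attributes. Building such features usually requires a deep analysis of the underlying business logic or physical principles in order to find the "magic" features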