Kaggle Course (6): Feature Engineering

阿尔基亚

Article link: https://blog.csdn.net/freja110/article/details/107735586

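Feature engineering is the work of processing, transforming, and filtering data before modeling. In essence, it reworks the raw data to produce the features that are fed into the model.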

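I. Baseline Model

We first build a baseline model so that models with additional features can be compared against it.

1. The TalkingData AdTracking project
We start with the data from the TalkingData AdTracking project: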

import pandas as pd
click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head()

2. Converting timestamps

# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

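3. Preparing categorical variables
The categorical variables ('ip', 'app', 'device', 'os', 'channel') must be converted to integers before the model can use them. For this I use scikit-learn's LabelEncoder, which assigns an integer to each distinct value of a categorical feature and replaces the values with those integers.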

from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
label_encoder = preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = label_encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded
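
To make the mapping concrete, here is a minimal sketch on a toy column. The values in `toy` are made up for illustration and are not taken from the TalkingData data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for a categorical feature such as 'os' (hypothetical values).
toy = pd.Series(["android", "ios", "android", "windows", "ios"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the sorted unique values; each code is an index into it.
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(list(codes))
print(mapping)
```

The integers carry no meaningful order. They only identify categories, which tree-based models such as LightGBM can usually cope with.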

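4. Creating training, validation, and test sets

We need separate datasets for training, validation, and testing. We take a fairly simple approach and split the data with slices after sorting by time: 10% of the data for validation, 10% for testing, and the remaining 80% for training.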

feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]
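
The same slicing logic can be wrapped in a small reusable function. This is a sketch only: `time_ordered_split` and the `toy_clicks` frame are hypothetical names introduced here, not part of the course code.

```python
import pandas as pd

def time_ordered_split(df, time_col, holdout_fraction=0.1):
    """Split df by time; valid and test each get holdout_fraction of the rows."""
    ordered = df.sort_values(time_col)
    n_holdout = int(len(ordered) * holdout_fraction)
    if n_holdout == 0:
        raise ValueError("holdout_fraction is too small for this many rows")
    train = ordered.iloc[:-2 * n_holdout]
    valid = ordered.iloc[-2 * n_holdout:-n_holdout]
    test = ordered.iloc[-n_holdout:]
    return train, valid, test

# Toy data in reverse time order, to show that the split sorts first.
toy_clicks = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=20, freq="D")[::-1],
    "is_attributed": [0, 1] * 10,
})
toy_train, toy_valid, toy_test = time_ordered_split(toy_clicks, "click_time")
print(len(toy_train), len(toy_valid), len(toy_test))
```

Because the rows are sorted by time first, every validation row is later than every training row, and every test row is later still. This mimics predicting future clicks from past ones.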

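5. Training a LightGBM model
We will use a LightGBM model. It is a tree-based model that often gives very strong performance, comparable to XGBoost, and it is relatively fast to train. We do no hyperparameter tuning because that is not the goal of this course, so this model will not reach the best performance achievable. You will still see the model improve as features are added.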

import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
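# Stop when the validation AUC has not improved for 10 rounds.
# The early_stopping_rounds keyword of lgb.train was removed in LightGBM 4.0, so the callback form is used.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])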

6. Evaluating the model
Finally, we use the model to predict on the test set and see how well it performs.

from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

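II. Categorical Encodings

Machine learning models, like the deep learning models in Keras, require all input and output variables to be numeric.
If your data contains categorical values, you must encode them as numbers before you can fit and evaluate a model.
One-hot encoding and label encoding were covered earlier. This part covers count encoding, target encoding, and CatBoost encoding.

1. Count Encoding

Count encoding replaces each categorical value with the number of times it appears in the dataset. For example, if the value "GB" appears 10 times in the country feature, every "GB" is replaced with the number 10.

We use the category_encoders package for this encoding. The encoder is available as CountEncoder. Like the other encoders in the package, it works like a scikit-learn transformer with .fit and .transform methods. Note that the code from here on uses a different dataset: ks and baseline_data are not defined in this post and are assumed to come from the course notebook. The code is as follows: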

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder
count_enc = ce.CountEncoder()

# Transform the features, rename the columns with the _count suffix, and join to the dataframe
count_encoded = count_enc.fit_transform(ks[cat_features])
data = baseline_data.join(count_encoded.add_suffix("_count"))

# Train a model (get_data_splits and train_model are helpers from the course notebook, not shown here)
train, valid, test = get_data_splits(data)
train_model(train, valid)

With this encoding the validation score rises from 0.7467 to 0.7486, only a slight improvement.
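
The idea behind count encoding can also be reproduced in plain pandas. The sketch below uses a made-up `toy` frame. It illustrates the technique and is not how CountEncoder is implemented internally.

```python
import pandas as pd

# Hypothetical toy data: "GB" appears 3 times, "US" twice, "CA" once.
toy = pd.DataFrame({"country": ["GB", "US", "GB", "CA", "GB", "US"]})

# Count how often each value occurs, then replace each value with its count.
counts = toy["country"].value_counts()
toy["country_count"] = toy["country"].map(counts)
print(toy)
```

Rare values end up with similar small counts, which lets the model treat them as a group.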

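2. Target Encoding
Target encoding replaces a categorical value with the mean of the target for that value of the feature. For example, for the country value "CA", you compute the mean outcome over all rows where country == 'CA', which is about 0.28. This is usually blended with the target probability over the whole dataset to reduce the variance of values that occur only rarely.

This technique uses the target to create new features, so including validation or test data when computing the encoding would be a form of target leakage. Instead, learn the target encoding from the training set only and then apply it to the other datasets.

The category_encoders package provides TargetEncoder for target encoding. It is used in much the same way as CountEncoder.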

# Create the encoder
target_enc = ce.TargetEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train_TE = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_TE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

# Train a model
train_model(train_TE, valid_TE)

The validation score rises again, from 0.7467 to 0.7491.
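
The blending with the overall target probability can be sketched by hand. The version below uses simple additive smoothing with a weight `m`. This is one common scheme chosen for illustration, and TargetEncoder may weight the two terms differently. The function names, `m`, and the toy frames are all hypothetical.

```python
import pandas as pd

def fit_target_encoding(train_df, col, target, m=5.0):
    """Learn smoothed per-category target means from the training rows only."""
    prior = train_df[target].mean()
    stats = train_df.groupby(col)[target].agg(["sum", "count"])
    # Blend each category mean with the global mean; rare categories stay near the prior.
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    return encoding, prior

def apply_target_encoding(df, col, encoding, prior):
    """Map categories to the learned values; unseen categories fall back to the prior."""
    return df[col].map(encoding).fillna(prior)

toy_train = pd.DataFrame({
    "country": ["CA", "CA", "CA", "CA", "US", "US", "GB"],
    "outcome": [1, 0, 0, 0, 1, 1, 1],
})
toy_valid = pd.DataFrame({"country": ["CA", "GB", "FR"]})

enc, prior = fit_target_encoding(toy_train, "country", "outcome", m=2.0)
toy_valid["country_target"] = apply_target_encoding(toy_valid, "country", enc, prior)
print(toy_valid)
```

"GB" appears once in the training rows with outcome 1, yet its encoded value is pulled well below 1 toward the global mean. "FR" never appears in training and simply receives the global mean. The validation targets are never used, which avoids the leakage described above.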

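3. CatBoost Encoding
CatBoost encoding is similar to target encoding in that it is based on the target probability for a given value. With CatBoost, however, the target probability for each row is computed only from the rows that precede it.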

# Create the encoder
target_enc = ce.CatBoostEncoder(cols=cat_features)
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename columns with _cb suffix, and join to dataframe
train_CBE = train.join(target_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_CBE = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_cb'))

# Train a model
train_model(train_CBE, valid_CBE)

The validation score is 0.7492, again only a slight improvement.
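
The "only preceding rows" idea can be sketched with cumulative sums. This is a simplified illustration under stated assumptions, not the category_encoders implementation: the function name, the `prior_weight` parameter, and the use of the global mean as the prior are choices made here.

```python
import pandas as pd

def ordered_target_encode(df, col, target, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    prior = df[target].mean()
    grouped = df.groupby(col)[target]
    # cumsum includes the current row, so subtract it to get the sum over preceding rows.
    prev_sum = grouped.cumsum() - df[target]
    prev_count = grouped.cumcount()  # number of earlier rows in the same category
    return (prev_sum + prior_weight * prior) / (prev_count + prior_weight)

toy = pd.DataFrame({
    "country": ["CA", "CA", "US", "CA", "US"],
    "outcome": [1, 0, 1, 1, 0],
})
toy["country_cb"] = ordered_target_encode(toy, "country", "outcome")
print(toy)
```

The first row of each category has no history and gets the prior. Later rows with the same category get different values depending on what came before them, so the result depends on row order.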

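III. Feature Generation

Creating new features from the raw data is one of the best ways to improve a model.

1. Interactions
One of the simplest ways to create new features is to combine categorical variables. For example, if a record has the country "CA" and the category "Music", you can create the new value "CA_Music". This is a new categorical feature that can carry information about how categorical variables relate to each other. Features of this kind are usually called interactions.
Typically you build interaction features from all pairs of categorical features. You can also build interactions from three or more features, but the returns diminish.
pandas lets us add string columns together just like ordinary Python strings.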

interactions = ks['category'] + "_" + ks['country']
print(interactions.head(5))

We then label-encode the interaction feature and add it to the data.

from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
data_interaction = baseline_data.assign(category_country=label_enc.fit_transform(interactions))
data_interaction.head()
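
Building interactions from all pairs of categorical features can be sketched with itertools.combinations. The `toy` frame and the column naming scheme below are assumptions made for illustration.

```python
from itertools import combinations

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with three categorical columns.
toy = pd.DataFrame({
    "category": ["Music", "Film", "Music", "Games"],
    "currency": ["CAD", "USD", "USD", "GBP"],
    "country": ["CA", "US", "US", "GB"],
})
cat_features = ["category", "currency", "country"]

interaction_cols = {}
for left, right in combinations(cat_features, 2):
    # Concatenate the two string columns, then label-encode the combined values.
    combined = toy[left] + "_" + toy[right]
    interaction_cols[left + "_" + right] = LabelEncoder().fit_transform(combined)

interactions = pd.DataFrame(interaction_cols, index=toy.index)
print(interactions)
```

Three features give three pairs. With many categorical features the number of pairs grows quickly, which is one reason feature selection matters later on.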

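2. Number of projects in the past 7 days

launched = pd.Series(ks.index, index=ks.launched, name="count_7_days").sort_index()
# Values are the original row index, the new index is the launch time, the name is the new feature name; sorted by index (time)
launched.head(20)

# .rolling('7d') sets up a 7-day time window
count_7_days = launched.rolling('7d').count() - 1  # subtract 1 so the current project is not counted
print(count_7_days.head(20))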

%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(count_7_days[7:]);
plt.title("Number of projects launched in the past 7 days")
plt.show()

# Restore the original row index on the new feature so it can be merged with the original data
count_7_days.index = launched.values
count_7_days = count_7_days.reindex(ks.index)
count_7_days.head(10)

# Merge with join
baseline_data.join(count_7_days).head(10)
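
The rolling-window steps are easier to follow on a tiny example. The five launch dates below are invented for illustration. The code mirrors the same steps: index by time, count within the window, restore the row index.

```python
import pandas as pd

# Hypothetical launch dates: three close together, then two more after a gap.
toy = pd.DataFrame({
    "launched": pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-05", "2020-01-20", "2020-01-21",
    ]),
})

# Row ids indexed by launch time, sorted so the time window is well defined.
launched = pd.Series(toy.index, index=toy["launched"], name="count_7_days").sort_index()

# Rows inside the trailing 7-day window, minus 1 so a project does not count itself.
count_7_days = launched.rolling("7d").count() - 1

# Put the result back on the original row index before joining.
count_7_days.index = launched.values
toy = toy.join(count_7_days.reindex(toy.index))
print(toy)
```

The project launched on 2020-01-20 gets a count of 0 because the previous launch is more than seven days earlier.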

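3. Time since the last project in the same category
For example, as with film releases, when many projects of the same type launch close together, competitors may take part of the share.

def time_since_last_project(series):
    # Hours between each launch and the previous one
    return series.diff().dt.total_seconds() / 3600

# Sort by launch time
df = ks[['category', 'launched']].sort_values('launched')
# Group by category and apply the function to get the gap, in hours, to the previous project in the same category
timedeltas = df.groupby('category').transform(time_since_last_project)
timedeltas.head(20)

# Fill missing gaps with the median and align the index before merging with the other data
timedeltas = timedeltas.fillna(timedeltas.median()).reindex(baseline_data.index)
timedeltas.head(20)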
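
A small worked example shows what this feature looks like. The sketch uses groupby(...).diff() directly rather than transform, and the `toy` data is invented, so treat it as an equivalent illustration rather than the course code.

```python
import pandas as pd

# Hypothetical launches in two categories, deliberately not in time order.
toy = pd.DataFrame({
    "category": ["Music", "Film", "Music", "Film", "Music"],
    "launched": pd.to_datetime([
        "2020-01-01 00:00", "2020-01-01 06:00", "2020-01-02 00:00",
        "2020-01-01 18:00", "2020-01-02 12:00",
    ]),
})

ordered = toy.sort_values("launched")
# Gap to the previous launch within the same category, converted to hours.
hours = ordered.groupby("category")["launched"].diff().dt.total_seconds() / 3600

# The first project in each category has no predecessor; fill with the median gap.
toy["hours_since_last"] = hours.fillna(hours.median()).reindex(toy.index)
print(toy)
```

The two rows that open a category have no previous launch, so they receive the median gap of 12 hours.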

4. Transforming numerical features
Some models work better when the features are normally distributed, so it can help to transform the data with a square root or a logarithm.

import numpy as np

plt.hist(ks.goal, range=(0, 100000), bins=50);
plt.title('Goal');

plt.hist(np.sqrt(ks.goal), range=(0, 400), bins=50);
plt.title('Sqrt(Goal)');

plt.hist(np.log(ks.goal), range=(0, 25), bins=50);
plt.title('Log(Goal)');

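The log transform does little for tree-based models, but it helps linear models and neural networks.
We need to turn the transformed values into new features, run some tests, and keep the transformation that works best.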
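
One quick way to compare transformations numerically is to look at skewness. The sketch below uses synthetic log-normal values as a stand-in for goal. The distribution, its parameters, and the use of log1p instead of log are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "goal" amounts (an assumption, not the real data).
goal = pd.Series(rng.lognormal(mean=8.0, sigma=1.5, size=10_000), name="goal")

transformed = pd.DataFrame({
    "goal": goal,
    "goal_sqrt": np.sqrt(goal),
    "goal_log": np.log1p(goal),  # log1p also handles zero-valued goals
})

# Skewness near 0 indicates a roughly symmetric distribution.
print(transformed.skew())
```

On data like this the raw values are heavily skewed, the square root reduces the skew, and the log brings it close to zero. Whether that helps still has to be checked against the validation score.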

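IV. Feature Selection

After the various encodings and feature generation steps, you will often have hundreds or thousands of features. This can cause two problems. First, the more features you have, the more likely you are to overfit to the training and validation sets, so the model generalizes worse to new data.

Second, the more features you have, the longer it takes to train models and tune hyperparameters. In addition, when building user-facing products you want inference to be as fast as possible. Using fewer features speeds up inference, at the cost of some predictive performance.

To help with these problems, use feature selection techniques to keep the most informative features for your model.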

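1. Univariate feature selection
Univariate feature selection computes a statistic for each variable separately, uses that statistic to judge which features are important, and drops the unimportant ones.

For classification problems (discrete y), you can use:
the chi-squared test, f_classif, mutual_info_classif (mutual information)
For regression problems (continuous y), you can use:
the Pearson correlation coefficient, f_regression, mutual_info_regression, the maximal information coefficient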

from sklearn.feature_selection import SelectKBest, f_classif

feature_cols = baseline_data.columns.drop('outcome')

# Keep 5 features
selector = SelectKBest(f_classif, k=5)

X_new = selector.fit_transform(baseline_data[feature_cols], baseline_data['outcome'])
X_new
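
fit_transform returns a bare array, so the column names are lost. It also uses every row it is given, including rows you may later validate on. A minimal sketch of one way to handle both follows: fit on the training rows only and recover the names with get_support(). The synthetic `toy` data and its column names are assumptions made here.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "signal_a": y + rng.normal(scale=0.5, size=n),  # strongly related to the target
    "signal_b": y + rng.normal(scale=1.0, size=n),  # more weakly related
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
    "outcome": y,
})
toy_train, toy_valid = toy.iloc[:400], toy.iloc[400:]
feature_cols = toy.columns.drop("outcome")

# Fit on training rows only, so validation targets never influence the choice.
selector = SelectKBest(f_classif, k=2)
selector.fit(toy_train[feature_cols], toy_train["outcome"])

# get_support() returns a boolean mask over the input columns.
selected_cols = feature_cols[selector.get_support()]
print(list(selected_cols))
valid_selected = toy_valid[selected_cols]
```

The same list of selected columns can then be applied to the validation and test frames.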

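2. L1 regularization
Univariate methods consider only one feature at a time when making a selection decision. Instead, we can make the selection using all of the features at once by including them in a linear model with L1 regularization.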

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

train, valid, _ = get_data_splits(baseline_data)

X, y = train[train.columns.drop("outcome")], train['outcome']

# Set the regularization parameter C=1
logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, y)
model = SelectFromModel(logistic, prefit=True)

X_new = model.transform(X)
X_new
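
How many features survive depends on the regularization parameter C: a smaller C means a stronger penalty and more coefficients driven to zero. The sketch below shows this on synthetic data where only two of six features matter. The data, the feature names, and the C values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame(rng.normal(size=(n, 6)), columns=[f"f{i}" for i in range(6)])
# Only f0 and f1 drive the synthetic target; f2..f5 are pure noise.
logits = 2.0 * X["f0"] - 2.0 * X["f1"]
y = (logits + rng.logistic(size=n) > 0).astype(int)

kept_by_C = {}
for C in (1.0, 0.01):
    logistic = LogisticRegression(C=C, penalty="l1", solver="liblinear",
                                  random_state=7).fit(X, y)
    selector = SelectFromModel(logistic, prefit=True)
    # Columns whose coefficients were not shrunk to (near) zero.
    kept_by_C[C] = list(X.columns[selector.get_support()])
    print(f"C={C}: kept {kept_by_C[C]}")
```

In practice C is tuned like any other hyperparameter, by checking which setting gives the best validation score.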
