Kaggle Feature Engineering Part II: Creating Features

This article covers the essential feature engineering process in machine learning, including standardization, normalization, log transforms, exponentiation, one-hot encoding, frequency encoding, interaction features, and custom transformations tailored to the data and the problem. Python examples show how to apply these techniques to improve model performance.

In machine learning, feature engineering is the crucial process of transforming raw data into meaningful features that a machine learning model can effectively learn from. It involves creating new features from existing ones to capture more informative patterns and improve model performance.

Common Feature Engineering Techniques in Python:

  1. Mathematical Transforms:

    • Scaling: Standardizing (using StandardScaler from scikit-learn) or normalizing (using MinMaxScaler) features to a common range can improve the performance of some machine learning algorithms.
    • Logarithms: Applying the logarithm (e.g., np.log1p(x)) can be useful for features with skewed distributions.
    • Exponentiation: Raising features to a power (e.g., x**2) can create non-linear relationships.
  2. Interaction with Categorical and Count Features:

    • One-Hot Encoding: Convert categorical features (like colors) into binary columns representing each category (e.g., using pd.get_dummies from pandas).
    • Frequency Encoding: Encode categorical features by how often each category appears in the training data (a custom mapping with pandas, or an encoder such as CountEncoder from the category_encoders library).
    • Interaction Features: Create new features by multiplying existing features to capture potential interactions between them (e.g., df['interaction'] = df['feature_a'] * df['feature_b']). To interact a categorical feature with a numerical one, multiply its one-hot encoded columns by the numerical column, since raw string categories cannot be multiplied directly.
  3. Breaking Down Categorical Features:

    • If a categorical feature has many categories, it might be beneficial to create separate binary features for each category using one-hot encoding.
    • Consider domain knowledge to create more meaningful sub-features from a categorical feature (e.g., splitting a combined "location" feature into "city" and "state").
  4. Grouped Transforms:

    • Aggregation: Apply aggregate functions (e.g., mean, sum, max) on groupby operations to create new features based on groups (e.g., average purchase amount per customer).
    • Custom Transformations: Define custom functions to create new features specific to your data and problem (e.g., a function to calculate rolling averages or custom distance metrics).
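A minimal sketch of the mathematical transforms from point 1, assuming a hypothetical skewed 'price' column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, 12.0, 15.0, 200.0, 1800.0]})

# log1p(x) = log(1 + x) compresses the long right tail and is safe at 0
df['price_log'] = np.log1p(df['price'])

# Squaring creates a non-linear version of the feature
df['price_sq'] = df['price'] ** 2
```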
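Frequency encoding from point 2 can be sketched with plain pandas, no extra library required; the 'city' column here is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'city': ['NY', 'LA', 'NY', 'SF', 'NY', 'LA']})

# Map each category to its relative frequency in the training data
freq = df['city'].value_counts(normalize=True)
df['city_freq'] = df['city'].map(freq)
```

On new data you would reuse the `freq` mapping fitted on the training set rather than recomputing it, so unseen rows are encoded consistently.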
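The grouped transforms in point 4 can be sketched with groupby plus transform, which broadcasts the group aggregate back onto every row; the 'customer' and 'amount' columns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'b', 'b', 'b'],
                   'amount': [10.0, 20.0, 5.0, 5.0, 20.0]})

# Average purchase amount per customer, aligned to the original rows
df['avg_amount'] = df.groupby('customer')['amount'].transform('mean')
```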

Example (using pandas and scikit-learn):

import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
        'size': [10, 8, 12, 9, 11],
        'count': [2, 1, 3, 4, 1]}
df = pd.DataFrame(data)

# One-hot encode the 'color' feature
# (scikit-learn >= 1.2 renamed the `sparse` parameter to `sparse_output`)
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
# Categories are ordered alphabetically ('blue', 'green', 'red'),
# so take the column names from the fitted encoder
color_cols = encoder.get_feature_names_out(['color'])
df_encoded = pd.concat([df[['size', 'count']],
                        pd.DataFrame(color_encoded, columns=color_cols)],
                       axis=1)

# Scale the 'size' feature
scaler = StandardScaler()
df_encoded['size_scaled'] = scaler.fit_transform(df_encoded[['size']])

# Create an interaction feature
df_encoded['interaction'] = df_encoded['size_scaled'] * df_encoded['count']

# Example of a custom function for a rolling average
def rolling_average(data, window):
    return data.rolling(window=window).mean()

df_encoded['count_rolling_avg'] = rolling_average(df_encoded['count'], window=2)

print(df_encoded)