In machine learning, feature engineering is the process of transforming raw data into meaningful features that a model can learn from effectively. It involves creating new features from existing ones to capture more informative patterns and improve model performance.
Common Feature Engineering Techniques in Python:
- Mathematical Transforms:
  - Scaling: standardizing (using `StandardScaler` from scikit-learn) or normalizing (using `MinMaxScaler`) features to a common range can improve the performance of some machine learning algorithms.
  - Logarithms: applying the logarithm (e.g., `np.log1p(x)`) can be useful for features with skewed distributions.
  - Exponentiation: raising features to a power (e.g., `x**2`) can create non-linear relationships.
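A minimal sketch of the transforms above, using a made-up skewed `income` column (the column name and values are illustrative, not from the original):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical right-skewed feature (e.g., incomes)
df = pd.DataFrame({'income': [20_000, 35_000, 50_000, 120_000, 1_000_000]})

# Log transform compresses the long right tail; log1p handles zeros safely
df['income_log'] = np.log1p(df['income'])

# Standardize to zero mean and unit variance
df['income_std'] = StandardScaler().fit_transform(df[['income']])

# Normalize to the [0, 1] range
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']])

# Squaring creates a simple non-linear feature
df['income_sq'] = df['income'] ** 2
```

Which transform helps depends on the model: tree-based models are largely insensitive to monotonic scaling, while distance- and gradient-based models often benefit from it.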
- Interaction with Categorical and Count Features:
  - One-Hot Encoding: convert categorical features (like colors) into binary columns representing each category (e.g., using `pd.get_dummies` from pandas).
  - Frequency Encoding: encode categorical features by their frequency in the training data (a custom implementation, or `CountEncoder` from the `category_encoders` library).
  - Interaction Features: create new features by multiplying existing features, such as an encoded categorical feature with a numerical one, to capture interactions between them (e.g., `df['interaction'] = df['encoded_feature'] * df['numerical_feature']`).
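A small sketch of these three encodings using only pandas (the `color`/`size` toy data is made up; `category_encoders` is an optional alternative for the frequency step):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'red'],
                   'size': [10, 8, 12, 9, 11, 7]})

# One-hot encoding with pandas: one binary column per category
dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, dummies], axis=1)

# Frequency encoding: map each category to its count in the training data
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)

# Interaction between an encoded categorical and a numerical feature
df['red_x_size'] = df['color_red'] * df['size']
```

Note that the raw string column cannot be multiplied directly; the interaction uses the one-hot column, so `red_x_size` is `size` where the row is red and 0 elsewhere.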
- Breaking Down Categorical Features:
  - If a categorical feature has many categories, it can be beneficial to create separate binary features for each category using one-hot encoding.
  - Use domain knowledge to derive more meaningful sub-features from a categorical feature (e.g., mapping a "city" feature to coarser "country" and "state" features).
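One way to sketch the domain-knowledge idea is a lookup table that maps a high-cardinality feature to a coarser one (the cities and the `city_to_country` mapping here are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Berlin', 'Munich']})

# Hypothetical lookup table encoding domain knowledge about each city
city_to_country = {'Paris': 'France', 'Lyon': 'France',
                   'Berlin': 'Germany', 'Munich': 'Germany'}

# Derive a coarser sub-feature from the high-cardinality one
df['country'] = df['city'].map(city_to_country)
```

The coarser feature often generalizes better, since rare cities may appear only once in training while their country appears many times.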
- Grouped Transforms:
  - Aggregation: apply aggregate functions (e.g., `mean`, `sum`, `max`) on `groupby` operations to create new features based on groups (e.g., average purchase amount per customer).
  - Custom Transformations: define custom functions to create new features specific to your data and problem (e.g., a function to calculate rolling averages or custom distance metrics).
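A short sketch of the "average purchase amount per customer" idea (the `customer`/`amount` columns are made-up): `groupby(...).transform` broadcasts the group aggregate back onto every row, while `agg` produces one summary row per group.

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'b', 'b', 'b'],
                   'amount': [10.0, 20.0, 5.0, 15.0, 10.0]})

# Broadcast a group-level aggregate back onto every row as a new feature
df['avg_amount_per_customer'] = df.groupby('customer')['amount'].transform('mean')

# Or build a group-level summary table (one row per customer)
summary = df.groupby('customer')['amount'].agg(['mean', 'sum', 'max'])
```

The `transform` form is the one you want for feature engineering, since the result aligns with the original rows and can be added directly as a column.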
Example (using pandas and scikit-learn):
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
        'size': [10, 8, 12, 9, 11],
        'count': [2, 1, 3, 4, 1]}
df = pd.DataFrame(data)

# One-hot encode the 'color' feature.
# Note: sparse_output replaces the deprecated sparse argument in scikit-learn >= 1.2.
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
# Take column names from the encoder so they match its (sorted) category order
color_cols = encoder.get_feature_names_out(['color'])
df_encoded = pd.concat(
    [df[['size', 'count']], pd.DataFrame(color_encoded, columns=color_cols)],
    axis=1)

# Scale the 'size' feature
scaler = StandardScaler()
df_encoded['size_scaled'] = scaler.fit_transform(df_encoded[['size']])

# Create an interaction feature
df_encoded['interaction'] = df_encoded['size_scaled'] * df_encoded['count']

# Example of a custom function for a rolling average
def rolling_average(data, window):
    return data.rolling(window=window).mean()

df_encoded['count_rolling_avg'] = rolling_average(df_encoded['count'], window=2)
print(df_encoded)
```