In machine learning, feature engineering is the process of transforming raw data into meaningful features that a model can learn from effectively. It involves creating new features from existing ones to capture more informative patterns and improve model performance.
Common Feature Engineering Techniques in Python:
- Mathematical Transforms:
  - Scaling: standardizing (using `StandardScaler` from scikit-learn) or normalizing (using `MinMaxScaler`) features to a common range can improve the performance of some machine learning algorithms.
  - Logarithms: applying the logarithm (e.g., `np.log1p(x)`) can be useful for features with skewed distributions.
  - Exponentiation: raising features to a power (e.g., `x**2`) can create non-linear relationships.
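A minimal sketch of the transforms above, using a made-up skewed `income` column (the column name and values are illustrative, not from the original):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical right-skewed feature (e.g., incomes)
df = pd.DataFrame({'income': [20_000, 35_000, 50_000, 120_000, 1_000_000]})

# Log transform compresses the long right tail; log1p handles zeros safely
df['income_log'] = np.log1p(df['income'])

# Standardize to zero mean and unit variance
df['income_std'] = StandardScaler().fit_transform(df[['income']])

# Normalize to the [0, 1] range
df['income_minmax'] = MinMaxScaler().fit_transform(df[['income']])

# Squaring creates a simple non-linear feature
df['income_sq'] = df['income'] ** 2
```

Which transform helps depends on the model: tree-based models are largely insensitive to monotonic scaling, while distance- and gradient-based models often benefit from it.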
- Interaction with Categorical and Count Features:
  - One-Hot Encoding: convert categorical features (like colors) into binary columns representing each category (e.g., using `pd.get_dummies` from pandas).
  - Frequency Encoding: encode categorical features by their frequency in the training data (a custom implementation, or `CountEncoder` from the `category_encoders` library).
  - Interaction Features: create new features by multiplying existing features, such as an encoded categorical feature with a numerical one, to capture interactions between them (e.g., `df['interaction'] = df['encoded_feature'] * df['numerical_feature']`).
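A small sketch of these three encodings using only pandas (the `color`/`size` toy data is made up; `category_encoders` is an optional alternative for the frequency step):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green', 'red'],
                   'size': [10, 8, 12, 9, 11, 7]})

# One-hot encoding with pandas: one binary column per category
dummies = pd.get_dummies(df['color'], prefix='color')
df = pd.concat([df, dummies], axis=1)

# Frequency encoding: map each category to its count in the training data
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)

# Interaction between an encoded categorical and a numerical feature
df['red_x_size'] = df['color_red'] * df['size']
```

Note that the raw string column cannot be multiplied directly; the interaction uses the one-hot column, so `red_x_size` is `size` where the row is red and 0 elsewhere.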
- Breaking Down Categorical Features:
  - If a categorical feature has many categories, it can be beneficial to create separate binary features for each category using one-hot encoding.
  - Use domain knowledge to derive more meaningful sub-features from a categorical feature (e.g., mapping a "city" feature to coarser "country" and "state" features).
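One way to sketch the domain-knowledge idea is a lookup table that maps a high-cardinality feature to a coarser one (the cities and the `city_to_country` mapping here are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Berlin', 'Munich']})

# Hypothetical lookup table encoding domain knowledge about each city
city_to_country = {'Paris': 'France', 'Lyon': 'France',
                   'Berlin': 'Germany', 'Munich': 'Germany'}

# Derive a coarser sub-feature from the high-cardinality one
df['country'] = df['city'].map(city_to_country)
```

The coarser feature often generalizes better, since rare cities may appear only once in training while their country appears many times.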
- Grouped Transforms:
  - Aggregation: apply aggregate functions (e.g., `mean`, `sum`, `max`) on `groupby` operations to create new features based on groups (e.g., average purchase amount per customer).
  - Custom Transformations: define custom functions to create new features specific to your data and problem (e.g., a function to calculate rolling averages or custom distance metrics).
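A short sketch of the "average purchase amount per customer" idea (the `customer`/`amount` columns are made-up): `groupby(...).transform` broadcasts the group aggregate back onto every row, while `agg` produces one summary row per group.

```python
import pandas as pd

df = pd.DataFrame({'customer': ['a', 'a', 'b', 'b', 'b'],
                   'amount': [10.0, 20.0, 5.0, 15.0, 10.0]})

# Broadcast a group-level aggregate back onto every row as a new feature
df['avg_amount_per_customer'] = df.groupby('customer')['amount'].transform('mean')

# Or build a group-level summary table (one row per customer)
summary = df.groupby('customer')['amount'].agg(['mean', 'sum', 'max'])
```

The `transform` form is the one you want for feature engineering, since the result aligns with the original rows and can be added directly as a column.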
Example (using pandas and scikit-learn):
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Sample data
data = {'color': ['red', 'green', 'blue', 'red', 'green'],
        'size': [10, 8, 12, 9, 11],
        'count': [2, 1, 3, 4, 1]}
df = pd.DataFrame(data)

# One-hot encode the 'color' feature.
# Note: sparse_output replaces the deprecated sparse argument in scikit-learn >= 1.2.
encoder = OneHotEncoder(sparse_output=False)
color_encoded = encoder.fit_transform(df[['color']])
# Take column names from the encoder so they match its (sorted) category order
color_cols = encoder.get_feature_names_out(['color'])
df_encoded = pd.concat(
    [df[['size', 'count']], pd.DataFrame(color_encoded, columns=color_cols)],
    axis=1)

# Scale the 'size' feature
scaler = StandardScaler()
df_encoded['size_scaled'] = scaler.fit_transform(df_encoded[['size']])

# Create an interaction feature
df_encoded['interaction'] = df_encoded['size_scaled'] * df_encoded['count']

# Example of a custom function for a rolling average
def rolling_average(data, window):
    return data.rolling(window=window).mean()

df_encoded['count_rolling_avg'] = rolling_average(df_encoded['count'], window=2)
print(df_encoded)
```