Feature Engineering (II) -- Data Preprocessing

There is a saying widely circulated in the industry: data and features determine the upper limit of machine learning, while models and algorithms merely approximate that limit. Clearly, feature engineering occupies a very important place in machine learning, and in practice it is fair to say that feature engineering is the key to successful machine learning.

Feature engineering is the most time- and effort-consuming part of data analysis. Unlike algorithms and models, it is not a fixed sequence of steps but rather a matter of engineering experience and trade-offs, so there is no single standard method. This article simply summarizes some commonly used techniques.

Feature engineering comprises sub-problems such as Data Preprocessing, Feature Extraction, Feature Selection and Feature Construction.

Jupyter Notebook code link: feature_engineering_demo_p2_preproccessing

Import the necessary packages

import numpy as np
import pandas as pd
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_X_y, check_is_fitted
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_transformer, make_column_selector
from sklearn.pipeline import FeatureUnion, make_union, Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc

# Setting configuration.
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')

SEED = 42

Data Preprocessing

Data preprocessing is the essential first step of feature engineering: features must be preprocessed into a form that machine learning models can accept. The sklearn.preprocessing module solves most data preprocessing problems.

This chapter processes the data along two parallel tracks:

  • function wrappers built on pandas
  • a pipeline built on sklearn

First, define a timer so we can evaluate performance later.

def timer(func):
    import time
    import functools
    def strfdelta(tdelta, fmt):
        hours, remainder = divmod(tdelta, 3600)
        minutes, seconds = divmod(remainder, 60)
        return fmt.format(hours, minutes, seconds)
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        click = time.time()
        result = func(*args, **kwargs)
        delta = strfdelta(time.time() - click, "{:.0f} hours {:.0f} minutes {:.0f} seconds")
        print(f"{func.__name__} cost time {delta}")
        return result
    return wrapper
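A quick usage sketch (the data-loading helper below is purely illustrative and not part of the original notebook): decorating any function with @timer prints its elapsed time after each call.

# Hedged sketch: apply the timer decorator to an illustrative loader function.
@timer
def load_application_data(path='../datasets/Home-Credit-Default-Risk/application_train.csv'):
    return pd.read_csv(path)

# df = load_application_data()  # prints something like "load_application_data cost time 0 hours 0 minutes 5 seconds"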

Data Cleaning

Data cleaning is the process of re-examining and verifying data in order to remove duplicate information, correct errors, and ensure consistency.

Removing Duplicates

First, drop duplicates based on the sample ID formed by one or more feature values.

df = pd.read_csv('../datasets/Home-Credit-Default-Risk/application_train.csv')
if df['SK_ID_CURR'].nunique() < df.shape[0]:
    df = df.drop_duplicates(subset=['SK_ID_CURR'], keep='last')

Data Type Conversion

Automatically convert numbers stored as strings to numeric types.

df.dtypes.value_counts()
float64    65
int64      41
object     16
Name: count, dtype: int64
df.apply(pd.to_numeric,  errors='ignore').dtypes.value_counts()
float64    65
int64      41
object     16
Name: count, dtype: int64

Sometimes a numeric feature merely labels different categories, and the magnitude of the numbers carries no real meaning, so we convert it to a categorical feature.
This project has no such feature; HOUR_APPR_PROCESS_START is shown only as an illustration:

# df['HOUR_APPR_PROCESS_START'] = df['HOUR_APPR_PROCESS_START'].astype(str)

Cleaning Erroneous Values

Next, based on business knowledge, or with tools such as (but not limited to) box plots, we identify and clean unreasonable feature values.
During data exploration we noticed that the numbers in the DAYS_BIRTH column (age) are negative, because they are recorded relative to the current loan application, so we convert them to positive values and look at the distribution.

(df['DAYS_BIRTH'] / -365).describe()
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64

Those ages look reasonable; there are no outliers.
Next, we run the same analysis on the other DAYS features.

for feature in ['DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH']:
    print(f'{feature} info: ')
    print((df[feature] / -365).describe())
DAYS_BIRTH info: 
count    307511.000000
mean         43.936973
std          11.956133
min          20.517808
25%          34.008219
50%          43.150685
75%          53.923288
max          69.120548
Name: DAYS_BIRTH, dtype: float64
DAYS_EMPLOYED info: 
count    307511.000000
mean       -174.835742
std         387.056895
min       -1000.665753
25%           0.791781
50%           3.323288
75%           7.561644
max          49.073973
Name: DAYS_EMPLOYED, dtype: float64
DAYS_REGISTRATION info: 
count    307511.000000
mean         13.660604
std           9.651743
min          -0.000000
25%           5.506849
50%          12.339726
75%          20.491781
max          67.594521
Name: DAYS_REGISTRATION, dtype: float64
DAYS_ID_PUBLISH info: 
count    307511.000000
mean          8.203294
std           4.135481
min          -0.000000
25%           4.712329
50%           8.915068
75%          11.778082
max          19.717808
Name: DAYS_ID_PUBLISH, dtype: float64
pd.cut(df['DAYS_EMPLOYED'] / -365, bins=5).value_counts().sort_index()
DAYS_EMPLOYED
(-1001.715, -790.718]     55374
(-790.718, -580.77]           0
(-580.77, -370.822]           0
(-370.822, -160.874]          0
(-160.874, 49.074]       252137
Name: count, dtype: int64

More than 50,000 clients have a DAYS_EMPLOYED of roughly 1000 years (the sentinel 365243); we can guess that this is simply a missing-value marker.

# Replace the anomalous values with nan
df_emp = df['DAYS_EMPLOYED'].where(df['DAYS_EMPLOYED'].abs()<365243, np.nan)

df_emp.plot.hist(title = 'Days Employment Histogram')
plt.xlabel('Days Employment')
Text(0.5, 0, 'Days Employment')

As we can see, the distribution now looks basically normal.
Likewise, we convert the missing-value markers of other features ('XNA') into NaN so they can be handled uniformly later.

df.map(lambda x:x=='XNA').sum().sum()/df.size
0.0014761034004861135

Cleaning Boolean Features

for col in df.select_dtypes(exclude="number").columns:
    if df[col].nunique() == 2:
        print(df[col].value_counts())
NAME_CONTRACT_TYPE
Cash loans         278232
Revolving loans     29279
Name: count, dtype: int64
FLAG_OWN_CAR
N    202924
Y    104587
Name: count, dtype: int64
FLAG_OWN_REALTY
Y    213312
N     94199
Name: count, dtype: int64
EMERGENCYSTATE_MODE
No     159428
Yes      2328
Name: count, dtype: int64
df[['FLAG_OWN_CAR', 'FLAG_OWN_REALTY']].replace({'Y': 1, 'N': 0}).value_counts()
FLAG_OWN_CAR  FLAG_OWN_REALTY
0             1                  140952
1             1                   72360
0             0                   61972
1             0                   32227
Name: count, dtype: int64
df['EMERGENCYSTATE_MODE'].replace({'Yes': 1, 'No': 0}).value_counts()
EMERGENCYSTATE_MODE
0.0    159428
1.0      2328
Name: count, dtype: int64

Function Wrapper

Finally, wrap the above steps into a function:

id_col = "SK_ID_CURR"
target = "TARGET"

# Data cleaning
def clean(df):
    # remove duplicates and keep last occurrences
    if df[id_col].nunique() < df.shape[0]:
        df = df.drop_duplicates(subset=[id_col], keep='last')
    
    # convert data to specified dtypes
    df = df.apply(pd.to_numeric,  errors='ignore')
    
    # transform
    for col in ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'EMERGENCYSTATE_MODE']:
        df[col] = df[col].replace({'Y': 1, 'N': 0, 'Yes': 1, 'No': 0})
    
    # Replace the anomalous values with nan
    df['DAYS_EMPLOYED'] = df['DAYS_EMPLOYED'].where(df['DAYS_EMPLOYED'].abs()<365243, np.nan)
    df = df.replace('XNA', np.nan)
    
    X = df.drop([id_col, target], axis=1)
    y = df[target]
    return X, y

X, y = clean(df)

Handling Missing Values

Missing feature values are very common, and most machine learning models require missing values to be handled before fitting.

| Method | Function | Python package |
| --- | --- | --- |
| Statistic imputation | SimpleImputer | sklearn.impute |
| Statistic / random imputation | df.fillna() | pandas |
| Multiple imputation | IterativeImputer | sklearn.impute |
| Nearest-neighbor imputation | KNNImputer | sklearn.impute |
| Dropping missing values | df.dropna() | pandas |
| Missing-value indicators | MissingIndicator | sklearn.impute |
| Missing-value indicators | df.isna(), df.isnull() | pandas |
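Multiple imputation (IterativeImputer) and nearest-neighbor imputation (KNNImputer) from the table are not used in the pipeline below; here is a minimal, hedged sketch of how they could be applied to a couple of numeric columns (the column choice and subsample size are illustrative):

# Hedged sketch: model-based imputation on two illustrative numeric columns.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to expose IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

cols = ['AMT_ANNUITY', 'AMT_GOODS_PRICE']            # illustrative columns with a few missing values
sample = X[cols].sample(20_000, random_state=SEED)   # subsample: KNNImputer is expensive on 300k+ rows
X_mice = IterativeImputer(max_iter=10, random_state=SEED).fit_transform(sample)
X_knn = KNNImputer(n_neighbors=5).fit_transform(sample)
print(np.isnan(X_mice).sum(), np.isnan(X_knn).sum())  # both should be 0 after imputation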

Missing Value Statistics

# Function to calculate missing values by column
def display_missing(df, threshold=None, verbose=True):
    missing_df = pd.DataFrame({
        "missing_number": df.isna().sum(),  # Total missing values
        "missing_rate": df.isna().mean()   # Proportion of missing values
        }, index=df.columns)
    missing_df = missing_df.query("missing_rate>0").sort_values("missing_rate", ascending=False)
    return_all = threshold is None
    threshold = 0.25 if threshold is None else threshold
    high_missing = missing_df.query(f"missing_rate>{threshold}")
    # Print some summary information
    if verbose:
        print(f"Your selected dataframe has {missing_df.shape[0]} out of {df.shape[1]} columns that have missing values.")
        print(f"There are {high_missing.shape[0]} columns with more than {threshold:.1%} missing values.")
        print("Columns with high missing rate:", high_missing.index.tolist())
    # Return the full missing-value summary if no threshold was given, otherwise only the high-missing columns
    if return_all:
        return missing_df
    else:
        return high_missing
# Missing values statistics
print(display_missing(df).head(10))
Your selected dataframe has 67 out of 122 columns that have missing values.
There are 50 columns with more than 25.0% missing values.
Columns with high missing rate: ['COMMONAREA_MEDI', 'COMMONAREA_AVG', 'COMMONAREA_MODE', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_AVG', 'FONDKAPREMONT_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAPARTMENTS_AVG', 'FLOORSMIN_MODE', 'FLOORSMIN_MEDI', 'FLOORSMIN_AVG', 'YEARS_BUILD_MODE', 'YEARS_BUILD_MEDI', 'YEARS_BUILD_AVG', 'OWN_CAR_AGE', 'LANDAREA_AVG', 'LANDAREA_MEDI', 'LANDAREA_MODE', 'BASEMENTAREA_MEDI', 'BASEMENTAREA_AVG', 'BASEMENTAREA_MODE', 'EXT_SOURCE_1', 'NONLIVINGAREA_MEDI', 'NONLIVINGAREA_MODE', 'NONLIVINGAREA_AVG', 'ELEVATORS_MEDI', 'ELEVATORS_MODE', 'ELEVATORS_AVG', 'WALLSMATERIAL_MODE', 'APARTMENTS_MODE', 'APARTMENTS_MEDI', 'APARTMENTS_AVG', 'ENTRANCES_MODE', 'ENTRANCES_AVG', 'ENTRANCES_MEDI', 'LIVINGAREA_MEDI', 'LIVINGAREA_MODE', 'LIVINGAREA_AVG', 'HOUSETYPE_MODE', 'FLOORSMAX_MEDI', 'FLOORSMAX_AVG', 'FLOORSMAX_MODE', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_MODE', 'TOTALAREA_MODE', 'EMERGENCYSTATE_MODE', 'OCCUPATION_TYPE']
                          missing_number  missing_rate
COMMONAREA_MEDI                   214865      0.698723
COMMONAREA_AVG                    214865      0.698723
COMMONAREA_MODE                   214865      0.698723
NONLIVINGAPARTMENTS_MEDI          213514      0.694330
NONLIVINGAPARTMENTS_MODE          213514      0.694330
NONLIVINGAPARTMENTS_AVG           213514      0.694330
FONDKAPREMONT_MODE                210295      0.683862
LIVINGAPARTMENTS_MODE             210199      0.683550
LIVINGAPARTMENTS_MEDI             210199      0.683550
LIVINGAPARTMENTS_AVG              210199      0.683550

Visualizing the Missing Rate

high_missing = display_missing(X, verbose=False).head(10)

plt.figure(figsize = (8, 4))
sns.barplot(x="missing_rate", y=high_missing.index, data=high_missing)
plt.xlabel('Percent of missing values')
plt.ylabel('Features')
plt.title('Percent missing data by feature')
Text(0.5, 1.0, 'Percent missing data by feature')

Dropping Missing Values

If a feature is missing in more than a given proportion of samples (e.g. 80%), its contribution to the model is limited, so it can usually be dropped.

threshold = int(df.shape[0]*0.2)
X.dropna(axis=1, thresh=threshold).shape
(307511, 120)
# Remove variables with high missing rate

def drop_missing_data(X, threshold=0.8):
    X = X.copy()
    # Remove variables with missing more than threshold(default 20%)
    thresh = int(X.shape[0] * threshold)
    X_new = X.dropna(axis=1, thresh=thresh)
    print(f"Removed {X.shape[1]-X_new.shape[1]} variables with missing more than {1 - threshold:.1%}")
    return X_new
drop_missing_data(X, threshold=0.2).shape
Removed 0 variables with missing more than 80.0%

(307511, 120)

Missing Value Indicators

Sometimes, for each column containing missing values, we add an extra column that marks where the values are missing; in some applications this works quite well.
Continuing with the DAYS_EMPLOYED anomaly cleaned earlier, we flag the missing entries and check whether they are related to client default.

y.groupby(X['DAYS_EMPLOYED'].isna()).mean()
DAYS_EMPLOYED
False    0.086600
True     0.053996
Name: TARGET, dtype: float64

The default rate for missing values (5.4%) is lower than the default rate for normal values (8.66%), so the missingness is clearly related to the target; we therefore add a DAYS_EMPLOYED_MISSING indicator column. This treatment mainly helps linear methods, whereas tree-based methods can pick up the pattern automatically.

# Adds a binary variable to flag missing observations.

from sklearn.feature_selection import chi2
def flag_missing(X, alpha=0.05):
    """
    Adds a binary variable to flag missing observations(one indicator per variable). 
    The added variables (missing indicators) are named with the original variable name plus '_missing'.
    
    Parameters:
    ----------
    alpha: float, default=0.05
        Features with p-values more than alpha are selected.
    """
    X = X.copy()
    
    # Compute chi-squared stats between each missing indicator and y.
    chi2_stats, p_values = chi2(X.isna(), y)
    # find variables for which an indicator should be added.
    missing_indicator = X.loc[:, p_values > alpha].isna()
    indicator_names = missing_indicator.columns.map(lambda x: x + "_missing")
    X[indicator_names] = missing_indicator
    print(f"Added {missing_indicator.shape[1]} missing indicators")
    return X
flag_missing(X).shape
Added 6 missing indicators

(307511, 126)

Manual Imputation

Fill values by hand based on business knowledge.

If a variable is discrete with only a few distinct values, it can be turned into dummy variables at encoding time. For example, the gender variable CODE_GENDER:

pd.get_dummies(X["CODE_GENDER"], dummy_na=True).head()
       F      M    NaN
0  False   True  False
1   True  False  False
2  False   True  False
3   True  False  False
4  False   True  False

If a variable is boolean, it can, depending on the situation, simply be filled with zero.

len([col for col in X if set(X[col].unique()) == {0, 1}])
34

If we look closely at the field descriptions, many missing values turn out to be explainable. For example, a missing NAME_TYPE_SUITE means the client was unaccompanied when applying for the loan, so it can be filled with 'unknow' (the placeholder used in the code below). The counts of 30/60-day past-due contacts in the client's social circle and the numbers of credit bureau enquiries in the hour/day/week/month/quarter/year before the application can all be filled with 0.

def impute_manually(X):
    """
    Replaces missing values by an arbitrary value
    """
    X = X.copy()
    # boolean
    boolean_features = [col for col in X if set(X[col].unique()) == {0, 1}]
    X[boolean_features] = X[boolean_features].fillna(0)
    # fill none
    features_fill_none = ["NAME_TYPE_SUITE", "OCCUPATION_TYPE"]
    X[features_fill_none] = X[features_fill_none].fillna('unknow')
    # fill 0
    features_fill_zero = [
        "OBS_30_CNT_SOCIAL_CIRCLE",  
        "DEF_30_CNT_SOCIAL_CIRCLE",
        "OBS_60_CNT_SOCIAL_CIRCLE",
        "DEF_60_CNT_SOCIAL_CIRCLE",
        "AMT_REQ_CREDIT_BUREAU_HOUR",
        "AMT_REQ_CREDIT_BUREAU_DAY",
        "AMT_REQ_CREDIT_BUREAU_WEEK",
        "AMT_REQ_CREDIT_BUREAU_MON",
        "AMT_REQ_CREDIT_BUREAU_QRT",
        "AMT_REQ_CREDIT_BUREAU_YEAR"
    ]
    X[features_fill_zero] = X[features_fill_zero].fillna(0)
    return X
impute_manually(X).isna().sum().gt(0).sum()
58

Conditional Mean Imputation

From the earlier correlation analysis we know that AMT_ANNUITY is strongly related to AMT_CREDIT and AMT_INCOME_TOTAL, so here we impute it with the median within groups defined by those two features; this is known as conditional mean imputation (Conditional Mean Completer).

X[['AMT_CREDIT', 'AMT_INCOME_TOTAL']].corrwith(X["AMT_ANNUITY"])
AMT_CREDIT          0.770138
AMT_INCOME_TOTAL    0.191657
dtype: float64
# conditional statistic completer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
from sklearn.feature_selection import f_classif, chi2
from sklearn.feature_selection import r_regression, f_regression

def fillna_by_groups(X, threshold=(0.0, 0.8), groupby=None, k=2, min_categories=2, bins=10, verbose=False):
    """
    Replaces missing values by groups.
    
    threshold: tuple (lower, upper), default=(0.0, 0.8)
        Only columns whose fraction of non-NA values lies within [lower, upper] are imputed.
    k: int, default=2
        Number of top features to group by.
    min_categories: int, default=2
        Specifies a lower limit on the number of categories a feature must have to be used for grouping.
    bins: int, default=10
        Number of bins used to discretize numeric grouping features.
    """
    print("Conditional Mean Completer:")
    lower, upper = threshold
    if not (0 <= lower < upper <= 1):
        raise ValueError("threshold bounds must satisfy 0 <= lower < upper <= 1.")
    
    X = pd.DataFrame(X.copy())
    X_bin = X.copy()
    na_size = X.isna().sum().sum()
    
    features_num = X.select_dtypes(include='number').columns.tolist()
    features_cat = X.select_dtypes(exclude='number').columns.tolist()
    X_bin[features_num] = X_bin[features_num].apply(pd.qcut, q=bins, duplicates="drop")
    X_bin[features_cat] = X_bin[features_cat].astype('category')
    X_bin = X_bin.transform(lambda x: x.cat.codes)
    X_bin = X_bin.transform(lambda x: x - x.min()) # for chi-squared to stats each non-negative feature
    
    if groupby is None:
        features_groupby = X_bin.columns.tolist()
    else:
        features_groupby = list(groupby)
    features_groupby = [colname for colname in features_groupby 
                        if X[colname].nunique() >= min_categories]
    
    # Estimate mutual information for a target variable.
    variables = X.columns[X.notna().mean().between(lower, upper)].tolist()
    for colname in variables:
        other_features = list(set(features_groupby) - {colname})
        if colname in features_num:
            score_func = f_regression
        elif colname in features_cat:
            score_func = chi2
        Xy = pd.concat([X_bin[other_features], X[colname]], axis=1).dropna(axis=0,how='any')
        scores, _ = score_func(Xy[other_features], Xy[colname])
        scores = pd.Series(scores, index=other_features).sort_values(ascending=False)
        vars_top_k = scores[:k].index.tolist()
        groups = [X_bin[col] for col in vars_top_k]
        if colname in features_num:
            # Replaces missing values by the mean or median
            X[colname] = X.groupby(groups)[colname].transform(lambda x:x.fillna(x.median()))
            if verbose:
                print(f"Filling the missing values in {colname} with the medians of {vars_top_k} groups.")
        elif colname in features_cat:
            # Replaces missing values by the most frequent category
            X[colname] = X[colname].groupby(groups).transform(lambda x:x.fillna(x.mode(dropna=False)[0]))
            if verbose:
                print(f"Filling the missing values in {colname} with the modes of {vars_top_k} groups.")

    fillna_size = na_size - X.isna().sum().sum()
    print(f"Filled {fillna_size} missing values ({fillna_size/na_size:.1%}).")
    print(f"Transformed {len(variables)} variables with missing (threshold = [{lower:.1%}, {upper:.1%}]).")
    print(f"And then, there are {X.isna().sum().gt(0).sum()} variables with missing.")
    return X
fillna_by_groups(X).isna().sum().sum()
Conditional Mean Completer:
Filled 759918 missing values (8.2%).
Transformed 50 variables with missing (threshold = [0.0%, 80.0%]).
And then, there are 67 variables with missing.

8503299

Simple Imputation

Numeric features with a low missing rate (below 10%) can be imputed with the median or mean, and discrete features with a low missing rate (below 10%) can be imputed with the mode.

# Simple imputer

def impute_simply(X, threshold=0.8):
    """
    Univariate imputer for completing missing values with simple strategies.
    """
    print("Simple imputer:")
    X = X.copy()
    variables = X.columns[X.isna().mean().between(0, 1-threshold, "right")].tolist()
    features_num = X[variables].select_dtypes('number').columns.to_list()
    features_cat = X[variables].select_dtypes(exclude='number').columns.to_list()
    # Replaces missing values by the median or mode
    medians = X[features_num].median().to_dict()
    modes = X[features_cat].apply(lambda x: x.mode()[0]).to_dict()
    impute_dict = {**medians, **modes}
    X[variables] = X[variables].fillna(impute_dict)
    print(f"Transformed {len(variables)} variables with missing (threshold={threshold:.1%}).")
    print(f"And then, there are {X.isna().sum().gt(0).sum()} variables with missing.")
    return X
impute_simply(X).isna().sum().gt(0).sum()
Simple imputer:
Transformed 20 variables with missing (threshold=80.0%).
And then, there are 50 variables with missing.

50

Pipeline Implementation

Finally, a summary of our missing-value handling strategy:

  • drop features with a missing rate above 80%
  • add missing-value indicators
  • impute features with clear business meaning by hand
  • use multiple imputation or conditional mean imputation for features with a missing rate between 10% and 80%
  • use simple statistic imputation for features with a missing rate below 10%

Implementation with pandas

print(X.pipe(drop_missing_data, threshold=0.2) 
     .pipe(flag_missing)
     .pipe(impute_manually)
     .pipe(fillna_by_groups, threshold=(0.2, 0.8)) 
     .pipe(impute_simply, threshold=0.2) 
     .isna().sum().gt(0).sum())
Removed 0 variables with missing more than 80.0%
Added 6 missing indicators
Conditional Mean Completer:
Filled 726563 missing values (8.2%).
Transformed 49 variables with missing (threshold = [20.0%, 80.0%]).
And then, there are 61 variables with missing.
Simple imputer:
Transformed 61 variables with missing (threshold=20.0%).
And then, there are 0 variables with missing.
0

Implementation with sklearn

First, define a few custom transformers.

class DropMissingData(BaseEstimator, TransformerMixin):
    """
    Remove features from data.
    
    Parameters
    ----------
    threshold: float, default=0.8
        The minimum fraction of non-NA values a column must have in order to be kept.
    """
    def __init__(self, threshold=0.8):
        if 0 < threshold <= 1:
            self.threshold = threshold
        else:
            raise ValueError("threshold must be a value between 0 < x <= 1. ")
    
    def fit(self, X, y=None):
        """
        Find the rows for which missing data should be evaluated to decide if a
        variable should be dropped.

        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
            The training data set.

        y: pandas Series, default=None
            y is not needed. You can pass None or y.
        """
        
        # check input dataframe
        # X, y = check_X_y(X, y)
        
        # Get the names and number of features in the train set (the dataframe used during fit).
        self.feature_names_in_ = X.columns.to_list()
        self.n_features_in_ = X.shape[1]
        
        # Find the features to drop
        self.variables = X.columns[X.isna().mean().gt(1-self.threshold)].tolist()
        return self
    
    def transform(self, X, y=None):     
        """
        Remove variables with missing more than threshold.

        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
            The dataframe to be transformed.

        Returns
        -------
        X_new: pandas dataframe
            The complete case dataframe for the selected variables.
        """
        # Remove variables with missing more than threshold.
        print(f"Removed {len(self.variables)} variables with missing more than {1-self.threshold:.1%}")
        return X.drop(self.variables, axis=1)
    
    def get_feature_names_out(self, input_features=None):
        """
        Get output feature names for transformation. In other words, returns the
        variable names of transformed dataframe.

        Parameters
        ----------
        input_features : array or list, default=None
            This parameter exists only for compatibility with the Scikit-learn pipeline.

            - If `None`, then `feature_names_in_` is used as feature names in.
            - If an array or list, then `input_features` must match `feature_names_in_`.

        Returns
        -------
        feature_names_out: list
            Transformed feature names.
        """
        check_is_fitted(self)
        
        if input_features is None:
            feature_names_in = self.feature_names_in_
        elif len(input_features) == self.n_features_in_:
            # If the input was an array, we let the user enter the variable names.
            feature_names_in = list(input_features)
        else:
            raise ValueError(
                "The number of input_features does not match the number of "
                "features seen in the dataframe used in fit."
                )      
        
        # Remove features.
        feature_names_out = [var for var in feature_names_in if var not in self.variables]
        return feature_names_out
DropMissingData(threshold=0.4).fit(X).variables
['OWN_CAR_AGE',
 'YEARS_BUILD_AVG',
 'COMMONAREA_AVG',
 'FLOORSMIN_AVG',
 'LIVINGAPARTMENTS_AVG',
 'NONLIVINGAPARTMENTS_AVG',
 'YEARS_BUILD_MODE',
 'COMMONAREA_MODE',
 'FLOORSMIN_MODE',
 'LIVINGAPARTMENTS_MODE',
 'NONLIVINGAPARTMENTS_MODE',
 'YEARS_BUILD_MEDI',
 'COMMONAREA_MEDI',
 'FLOORSMIN_MEDI',
 'LIVINGAPARTMENTS_MEDI',
 'NONLIVINGAPARTMENTS_MEDI',
 'FONDKAPREMONT_MODE']

Integrate everything into a pipeline

# Transformers for missing value imputation

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.impute import MissingIndicator

constant_imputer = FunctionTransformer(
    func=impute_manually,
    accept_sparse=True,
    feature_names_out="one-to-one")

conditional_statistic_imputer = FunctionTransformer(
    func=lambda X:fillna_by_groups(X, threshold=(0.2, 0.8)), 
    accept_sparse=True,
    feature_names_out="one-to-one")

simple_imputer = make_column_transformer(
    (SimpleImputer(strategy="median"), make_column_selector(dtype_include=np.number)),
    (SimpleImputer(strategy="most_frequent"), make_column_selector(dtype_include=object)),
    remainder="passthrough",
    verbose_feature_names_out=False,
    verbose=True)

missing_imputation =  make_pipeline(
    DropMissingData(threshold=0.2),
    constant_imputer,
    conditional_statistic_imputer,
    simple_imputer,
    verbose=True
)

# find variables for which indicator should be added.

add_missing_indicator = make_pipeline(
    MissingIndicator(features='all', sparse=False),
    SelectKBest(k=10),
    verbose=True
)

features = X.columns.tolist()
handle_missing = make_column_transformer(
    (missing_imputation, features),
    (add_missing_indicator, features),
    remainder="passthrough",
    verbose_feature_names_out=False,
    verbose=True
)

Because FeatureUnion prefixes the feature names when get_feature_names_out is called, we use ColumnTransformer here instead, which achieves the same effect.
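For comparison, a minimal sketch of the FeatureUnion alternative (not used further here); its get_feature_names_out prefixes every output name with the step name, which is exactly the behavior we want to avoid:

# Hedged sketch only: the make_union equivalent of handle_missing.
union = make_union(missing_imputation, add_missing_indicator, verbose=True)
# union.fit_transform(X, y)
# union.get_feature_names_out()  # names come back prefixed, e.g. 'pipeline-1__AMT_CREDIT'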

X_imputed = handle_missing.fit_transform(X, y)
X_imputed = pd.DataFrame(
    X_imputed,
    columns=handle_missing.get_feature_names_out()
)

X_imputed = X_imputed.astype('category')
X_imputed = X_imputed.apply(pd.to_numeric, errors='ignore')
Removed 0 variables with missing more than 80.0%
[Pipeline] ... (step 1 of 4) Processing dropmissingdata, total=   0.2s
[Pipeline]  (step 2 of 4) Processing functiontransformer-1, total=   0.6s
Conditional Mean Completer:
Filled 726563 missing values (8.2%).
Transformed 49 variables with missing (threshold = [20.0%, 80.0%]).
And then, there are 55 variables with missing.
[Pipeline]  (step 3 of 4) Processing functiontransformer-2, total=  10.8s
[ColumnTransformer]  (1 of 2) Processing simpleimputer-1, total=   2.3s
[ColumnTransformer]  (2 of 2) Processing simpleimputer-2, total=   0.5s
[Pipeline] . (step 4 of 4) Processing columntransformer, total=   3.2s
[ColumnTransformer] .... (1 of 2) Processing pipeline-1, total=  14.9s
[Pipeline] .. (step 1 of 2) Processing missingindicator, total=   1.9s
[Pipeline] ....... (step 2 of 2) Processing selectkbest, total=   0.1s
[ColumnTransformer] .... (2 of 2) Processing pipeline-2, total=   2.0s

Confirm that all missing values have been handled:

X_imputed.isna().sum().sum()
0

Feature Re-encoding

Many machine learning algorithms, such as linear regression and logistic regression, accept only numeric inputs and cannot handle categorical features directly, so we need to re-encode categorical features as numeric variables.

| Method | Function | Python package |
| --- | --- | --- |
| Ordinal encoding | OrdinalEncoder | sklearn.preprocessing |
| Ordinal encoding | CategoricalDtype | pandas.api.types |
| Dummy (one-hot) encoding | OneHotEncoder | sklearn.preprocessing |
| Dummy (one-hot) encoding | pd.get_dummies | pandas |
| Mean encoding | MeanEncoder | feature_engine.encoding |

Different types of categorical features call for different encodings.

X.dtypes.value_counts()
float64    67
int64      40
object     13
Name: count, dtype: int64
features = X.columns.tolist() 
numeric_cols = X.select_dtypes("number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
len(features) == len(numeric_cols + categorical_cols)
True
int_cols = X.select_dtypes("int64").columns.tolist()
float_cols = X.select_dtypes("float64").columns.tolist()

Ordinal Encoding

In most cases, integer variables store a limited set of discrete values.

Ordered categorical features carry an underlying ranking; we map their categories to ordered numbers, so ordinal encoding is appropriate.

Let's manually specify the ordered levels among the categorical features:

# The ordinal (ordered) categorical features
# Pandas calls the categories "levels"

ordered_levels = {
    "NAME_EDUCATION_TYPE": ["Lower secondary", 
                            "Secondary / secondary special", 
                            "Incomplete higher", 
                            "Higher education"]
}
from pandas.api.types import CategoricalDtype

def ordinal_encode(X, levels: dict = None):
    X = X.copy()
    if levels is None:
        variables = X.select_dtypes(exclude="number").columns
        X[variables] = X[variables].astype("category")
    else:
        variables = list(levels)
        dtypes = {name: CategoricalDtype(levels[name], ordered=True) for name in levels}
        X[variables] = X[variables].astype(dtypes)
    
    # The `cat.codes` attribute holds the category levels.
    # For missing values, -1 is the default code.
    X[variables] = X[variables].transform(lambda x: x.cat.codes)
    print(f'{len(variables):d} columns were ordinal encoded')
    return X
ordinal_encode(X, ordered_levels)[list(ordered_levels)].value_counts()
1 columns were ordinal encoded

NAME_EDUCATION_TYPE
 1                     218391
 3                      74863
 2                      10277
 0                       3816
-1                        164
Name: count, dtype: int64

Dummy Variable (One-Hot) Encoding

Unordered categorical features can be consumed directly by tree ensembles (like XGBoost), but for linear models (like Lasso or Ridge) they must be one-hot encoded.

Now let's look at how many categories each categorical feature has:

# The nominative (unordered) categorical features
nominal_categories = [col for col in categorical_cols if col not in ordered_levels.keys()]

X[nominal_categories].nunique()
NAME_CONTRACT_TYPE             2
CODE_GENDER                    2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             57
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
dtype: int64
# Using pandas to encode categorical features
from pandas.api.types import CategoricalDtype

def onehot_encode(X, variables=None, dummy_na=True):
    """
    Replace the categorical variables by the binary variables.
    
    Parameters
    ----------
    X: pd.DataFrame of shape = [n_samples, n_features]
        The data to encode.
        Can be the entire dataframe, not just selected variables.
    variables: list, default=None
        The list of categorical variables that will be encoded. 
        If None, the encoder will find and encode all variables of type object or categorical by default.
    dummy_na: boolean, default=True

    Returns
    -------
    X_new: pd.DataFrame.
        The encoded dataframe. The shape of the dataframe will be different from
        the original as it includes the dummy variables in place of the original
        categorical ones.
    """
    
    # pd.get_dummies automatically convert the categorical column into dummy variables
    if variables is None:
        variables = X.select_dtypes(exclude='number').columns.tolist()
        X = pd.get_dummies(X, dummy_na=True)
    else:
        X_dummy = pd.get_dummies(X[variables].astype(str), dummy_na=True)
        X = pd.concat([X, X_dummy], axis=1)
        # drop the original non-encoded variables.
        X = X.drop(variables, axis=1)
    print(f'{len(variables):d} columns were one-hot encoded')
    print(f'Dataset shape: {X.shape}')
    return X
print(X.pipe(onehot_encode, variables=nominal_categories)
       .pipe(ordinal_encode, levels=ordered_levels)
       .dtypes.value_counts())
12 columns were one-hot encoded
Dataset shape: (307511, 254)
1 columns were ordinal encoded
bool       146
float64     67
int64       40
int8         1
Name: count, dtype: int64

Mean Encoding

In general, sklearn's OneHotEncoder or OrdinalEncoder is enough for categorical features, and that simple preprocessing satisfies most data-mining algorithms. However, when a categorical feature has very many possible values (high cardinality), one-hot encoding leads to a dimensionality explosion. Mean encoding is an efficient alternative that, in practice, can substantially improve model performance.
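The idea behind mean encoding is simply to replace each category with the mean of the target within that category. A minimal pandas sketch of what it computes (in practice the means should be learned on the training folds only, to avoid target leakage):

# Hedged sketch: category -> mean(TARGET) mapping computed by hand.
occupation_target_mean = y.groupby(X['OCCUPATION_TYPE']).mean()
print(occupation_target_mean.sort_values(ascending=False).head())
# Encoding the column is then just mapping these means back onto the original values.
occupation_encoded = X['OCCUPATION_TYPE'].map(occupation_target_mean)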

We can implement mean encoding with the open-source feature-engine package, which wraps many commonly used feature engineering methods.

Among the categorical variables, OCCUPATION_TYPE (occupation) and ORGANIZATION_TYPE have many categories, so we plan to mean-encode them.

# replace categories by the mean value of the target for each category.

from feature_engine.encoding import MeanEncoder
mean_encoder = MeanEncoder(
    missing_values='ignore', 
    ignore_format=True, 
    unseen='ignore'
)
mean_encoder.fit_transform(X[['OCCUPATION_TYPE', 'ORGANIZATION_TYPE']], y).head()
   OCCUPATION_TYPE  ORGANIZATION_TYPE
0         0.105788           0.092996
1         0.063040           0.059148
2         0.105788           0.069781
3         0.105788           0.092996
4         0.063040           0.058824

Binning Continuous Features

In practice, we often discretize continuous features during model training; this removes the effect of feature scale, greatly reduces the impact of outliers, and makes the features more stable.

| Method | Function | Python package |
| --- | --- | --- |
| Binarization | Binarizer | sklearn.preprocessing |
| Binning | KBinsDiscretizer | sklearn.preprocessing |
| Equal-frequency binning | pd.qcut | pandas |
| Equal-width binning | pd.cut | pandas |

Binning mainly comes in three flavors: equal-frequency, equal-width and cluster binning. Equal-frequency binning is somewhat affected by outliers, while equal-width binning tends to ignore them entirely, which can lose information; to better respect the original distribution of a variable, consider cluster binning, which first clusters the continuous variable (often with k-Means) and then uses the cluster membership as the discretized value (a minimal sketch follows).
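A minimal sketch of cluster binning under these assumptions (the column and the number of clusters are illustrative):

# Hedged sketch: fit KMeans on a single column and use the cluster labels as bins.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_init=10, random_state=SEED)
age_cluster = pd.Series(
    kmeans.fit_predict(X[['DAYS_BIRTH']]),   # DAYS_BIRTH has no missing values
    index=X.index, name='AGE_CLUSTER'
)
print(age_cluster.value_counts())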

Take the effect of age on repayment as an example.

# Find the correlation of the positive days since birth and target
X['DAYS_BIRTH'].abs().corr(y)
-0.07823930830982709

As we can see, client age is negatively correlated with the target, i.e. as clients get older they tend to repay their loans on time more often. Next we draw a kernel density estimate (KDE) plot to look at the effect of age on the target directly.

plt.figure(figsize = (5, 3))
sns.kdeplot(x=X['DAYS_BIRTH'] / -365, hue=y, common_norm=False)
plt.xlabel('Age (years)')
plt.ylabel('Density')
plt.title('Distribution of Ages')
Text(0.5, 1.0, 'Distribution of Ages')

If we bin the ages:

# Bin the age data
age_binned = pd.cut(X['DAYS_BIRTH']/-365, bins = np.linspace(20, 70, num = 11))
age_groups  = y.groupby(age_binned).mean()

plt.figure(figsize = (8, 3))
# Graph the age bins and the average of the target as a bar plot
sns.barplot(x=age_groups.index, y=age_groups*100)
# Plot labeling
plt.xticks(rotation = 30)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Age Group');

There is a clear trend: younger applicants are more likely to default! The failure-to-repay rate is above 10% for the three youngest age groups and about 5% for the oldest group.

# Using pandas
def discretize(X, variables=None, bins=10, strategy="uniform", encoding=None):
    """
    Parameters
    ----------
    variables: list, default=None
        Numeric columns to discretize. If None, all numeric columns are used.
    bins: int, default=10
        Number of bins.
    strategy: str, 'uniform' (equal-width) or 'quantile' (equal-frequency), default='uniform'
    encoding: None, 'onehot' or 'ordinal', default=None
        Optional encoding applied to the binned columns.
    """
    X = X.copy()
    
    if strategy not in ["uniform", "quantile"]:
        raise ValueError("strategy takes only values 'uniform' or 'quantile'")
    if encoding not in [None, "onehot", "ordinal"]:
        raise ValueError("encoding takes only values None, 'onehot' or 'ordinal'")
    
    if variables is None:
        variables = X.select_dtypes("number").columns.tolist()
    
    if strategy == "uniform":
        X_binned = X[variables].apply(pd.cut, bins=bins, duplicates='drop')
    elif strategy == "quantile":
        X_binned = X[variables].apply(pd.qcut, q=bins, duplicates='drop')
    
    if encoding == "onehot":
        X_binned = pd.get_dummies(X_binned, dummy_na=True)
    elif encoding == "ordinal":
        X_binned = X_binned.apply(lambda x: x.cat.codes)

    X = pd.concat([X.drop(variables, axis=1), X_binned], axis=1)
    return X
discretize(X)[['DAYS_BIRTH', 'DAYS_EMPLOYED']].head()
             DAYS_BIRTH       DAYS_EMPLOYED
0   (-11037.0, -9263.0]      (-1791.2, 0.0]
1  (-18133.0, -16359.0]      (-1791.2, 0.0]
2  (-19907.0, -18133.0]      (-1791.2, 0.0]
3  (-19907.0, -18133.0]  (-3582.4, -1791.2]
4  (-21681.0, -19907.0]  (-3582.4, -1791.2]

KBinsDiscretizer in sklearn.preprocessing supports equal-frequency, equal-width and k-means binning, and can additionally one-hot or ordinal encode the resulting bins.

from sklearn.preprocessing import KBinsDiscretizer

equal_width_discretiser = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
equal_frequency_discretiser = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
kmeans_cluster_discretiser = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
equal_frequency_discretiser.fit(X[['DAYS_BIRTH', 'DAYS_EMPLOYED']].fillna(0))
for i, col in enumerate(equal_frequency_discretiser.get_feature_names_out()):
    print(f"{col}'s bin_edges: ")
    print(equal_frequency_discretiser.bin_edges_[i])
DAYS_BIRTH's bin_edges: 
[-25197.  -22162.  -20463.  -18866.  -17208.  -15740.  -14411.  -13129.7
 -11679.  -10275.   -7489. ]
DAYS_EMPLOYED's bin_edges: 
[-17912.  -4880.  -3239.  -2367.  -1699.  -1221.   -828.   -463.   -147.
      0.]

Pipeline Implementation

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# The ordinal (ordered) categorical features
# Pandas calls the categories "levels"
ordered_levels = {
    "NAME_EDUCATION_TYPE": ["Lower secondary", 
                            "Secondary / secondary special", 
                            "Incomplete higher", 
                            "Higher education"]
}

ordinal_encoder = OrdinalEncoder(
    categories=[np.array(levels) for levels in ordered_levels.values()],
    handle_unknown='use_encoded_value', 
    unknown_value=-1,
    encoded_missing_value=-1)

# replace categories by the mean value of the target for each category.
mean_encoder = MeanEncoder(
    missing_values='ignore', 
    ignore_format=True, 
    unseen='ignore')

# The nominative (unordered) categorical features
nominal_categories = [col for col in categorical_cols if col not in ordered_levels]
features_onehot = [col for col in nominal_categories if col not in ['OCCUPATION_TYPE', 'ORGANIZATION_TYPE']]

onehot_encoder = OneHotEncoder(
    drop='if_binary', 
    min_frequency=0.02, 
    max_categories=20, 
    sparse_output=False,
    handle_unknown='ignore')

# Encode categorical features
categorical_encoding = make_column_transformer(
    (mean_encoder, ['OCCUPATION_TYPE', 'ORGANIZATION_TYPE']),
    (ordinal_encoder, list(ordered_levels)), 
    (onehot_encoder, features_onehot),
    remainder='passthrough', 
    verbose_feature_names_out=False,
    verbose=True
)
X_encoded = categorical_encoding.fit_transform(X_imputed, y)
X_encoded = pd.DataFrame(
    X_encoded, 
    columns=categorical_encoding.get_feature_names_out()
)

# X_encoded = X_encoded.astype('category')
X_encoded = X_encoded.apply(pd.to_numeric, errors='ignore')
[ColumnTransformer] ... (1 of 4) Processing meanencoder, total=   0.0s
[ColumnTransformer]  (2 of 4) Processing ordinalencoder, total=   0.1s
[ColumnTransformer] . (3 of 4) Processing onehotencoder, total=   1.7s
[ColumnTransformer] ..... (4 of 4) Processing remainder, total=   0.0s
X_encoded.dtypes.value_counts()
float64    146
bool        10
Name: count, dtype: int64

Outlier Detection

Real-world data often contains quite a few anomalous records, which can badly bias a model. There are many detection methods, such as univariate flagging with the 3-sigma rule or box plots, and unsupervised approaches such as clustering, Isolation Forest and LocalOutlierFactor.

| Method | Python module |
| --- | --- |
| Box-plot detection | feature_engine.outliers.Winsorizer |
| 3-sigma rule | feature_engine.outliers.Winsorizer |
| Clustering-based detection | self-defined |
| One-Class SVM | sklearn.svm.OneClassSVM |
| One-Class SVM (SGD) | sklearn.linear_model.SGDOneClassSVM |
| Elliptic Envelope | sklearn.covariance.EllipticEnvelope |
| Isolation Forest | sklearn.ensemble.IsolationForest |
| LOF | sklearn.neighbors.LocalOutlierFactor |
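The Winsorizer listed above is the feature_engine counterpart of the capping transformer implemented by hand below; a minimal sketch (the parameter and column choices are illustrative):

# Hedged sketch: IQR-based capping with feature_engine's Winsorizer.
from feature_engine.outliers import Winsorizer

winsorizer = Winsorizer(capping_method='iqr', tail='both', fold=3,
                        variables=['DAYS_EMPLOYED', 'AMT_CREDIT'],
                        missing_values='ignore')
X_capped_fe = winsorizer.fit_transform(X[['DAYS_EMPLOYED', 'AMT_CREDIT']])
print(winsorizer.right_tail_caps_, winsorizer.left_tail_caps_)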

Box-Plot Detection

Box-plot detection judges outliers by the quartiles, which are robust statistics unaffected by extreme values. Points below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$ are usually considered outliers.

X_outlier = X[['DAYS_EMPLOYED', 'AMT_CREDIT']]

fig = plt.figure(figsize=(8,3))
for i, col in enumerate(X_outlier.columns.tolist()):
    ax = fig.add_subplot(1, 2, i+1)
    sns.boxplot(data=X_outlier, y=col, ax=ax)

The 3-Sigma Rule

The 3-sigma rule: assuming the data follow a normal distribution, points more than $3\sigma$ away from the mean are usually treated as outliers, since $\mathbb{P}(|X-\mu|<3\sigma)=99.73\%$. If the data are not normally distributed, outliers can still be described as points lying some multiple of the standard deviation away from the mean.

Implement it with pandas and wrap it in a transformer.

class OutlierCapper(BaseEstimator, TransformerMixin):
    """
    Caps maximum and/or minimum values of a variable at automatically
    determined values.
    Works only with numerical variables. A list of variables can be indicated. 
    
    Parameters
    ----------
    method: str, 'gaussian' or 'iqr', default='iqr'
        If method='gaussian': 
            - upper limit: mean + 3 * std
            - lower limit: mean - 3 * std
        If method='iqr': 
            - upper limit: 75th quantile + 3 * IQR
            - lower limit: 25th quantile - 3 * IQR
            where IQR is the inter-quartile range: 75th quantile - 25th quantile.
    fold: int, default=3   
        You can select how far out to cap the maximum or minimum values.
    """

    def __init__(self, method='iqr', fold=3, variables=None):
        self.method = method
        self.fold = fold
        self.variables = variables

    def fit(self, X, y=None):
        """
        Learn the values that should be used to replace outliers.

        Parameters
        ----------
        X : pandas dataframe of shape = [n_samples, n_features]
            The training input samples.

        y : pandas Series, default=None
            y is not needed in this transformer. You can pass y or None.
        """
        
        
        # Get the names and number of features in the train set.
        self.feature_names_in_ = X.columns.to_list()
        self.n_features_in_ = X.shape[1]
        
        # find or check for numerical variables
        numeric_vars = X.select_dtypes("number").columns.tolist()
        if self.variables is None:
            self.variables = numeric_vars
        else:
            self.variables = list(set(numeric_vars) & set(self.variables))

        if self.method == "gaussian":
            mean = X[self.variables].mean()
            bias= [mean, mean]
            scale = X[self.variables].std(ddof=0)
        elif self.method == "iqr":
            Q1 = X[self.variables].quantile(q=0.25)
            Q3 = X[self.variables].quantile(q=0.75)
            bias = [Q1, Q3]
            scale = Q3 - Q1         
        
        # estimate the end values
        if (scale == 0).any():
            raise ValueError(
                f"Input columns {scale[scale == 0].index.tolist()!r}"
                f" have low variation for method {self.method!r}."
                f" Try other capping methods or drop these columns."
            )
        else:
            self.upper_limit = bias[1] + self.fold * scale
            self.lower_limit = bias[0] - self.fold * scale   

        return self # always return self!

    def transform(self, X, y=None):
        """
        Cap the variable values.

        Parameters
        ----------
        X: pandas dataframe of shape = [n_samples, n_features]
            The data to be transformed.

        Returns
        -------
        X_new: pandas dataframe of shape = [n_samples, n_features]
            The dataframe with the capped variables.
        """
        X = X.copy()
        
        # check if class was fitted
        check_is_fitted(self)
        
        outliers = (X[self.variables].gt(self.upper_limit) | 
                    X[self.variables].lt(self.lower_limit))
        n = outliers.sum().gt(0).sum()
        print(f"Your selected dataframe has {n} out of {outliers.shape[1]} columns that have outliers.")
        
        # replace outliers
        X[self.variables] = X[self.variables].clip(
            axis=1,
            upper=self.upper_limit,
            lower=self.lower_limit
        )
        return X

    def get_feature_names_out(self, input_features=None):
        check_is_fitted(self)

        if input_features is None:
            return self.feature_names_in_
        elif len(input_features) == self.n_features_in_:
            # If the input was an array, we let the user enter the variable names.
            return list(input_features)
        else:
            raise ValueError(
                "The number of input_features does not match the number of "
                "features seen in the dataframe used in fit."
                ) 
outlier_capper = OutlierCapper()
X_capped = outlier_capper.fit_transform(X_outlier)

fig = plt.figure(figsize=(8,3))
for i, col in enumerate(X_capped.columns.tolist()):
    ax = fig.add_subplot(1, 2, i+1)
    sns.boxplot(data=X_capped, y=col, ax=ax)
Your selected dataframe has 2 out of 2 columns that have outliers.

outlier_capper = OutlierCapper(method="gaussian")
outlier_capper.fit_transform(X.select_dtypes('number')).shape
Your selected dataframe has 92 out of 107 columns that have outliers.

(307511, 107)

Outlier Detection Algorithms in sklearn

Outlier detection algorithms currently supported by the sklearn package (a minimal sketch follows the list):

  • One-Class SVM: based on the SVM idea (with a Gaussian kernel), it learns a hypersphere in feature space; points outside the boundary are treated as outliers.
  • Elliptic Envelope: assumes the data follow a normal distribution and fits an elliptical envelope; points outside the boundary are outliers.
  • Isolation Forest: an efficient anomaly detection algorithm similar to random forests, except that the feature and split point of each split are chosen at random rather than by information gain or the Gini index.
  • LOF: a density-based algorithm; an outlier's local density is significantly lower than that of most of its neighbors, which makes it suitable for unevenly distributed data.
  • Clustering-based detection: cluster the training samples (commonly with KMeans); if a cluster contains very few samples and its centroid is far away from all other clusters, the samples in that cluster are very likely anomalous.
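A minimal sketch of two of these detectors on a couple of numeric columns (the columns, subsample size and parameters are illustrative):

# Hedged sketch: unsupervised outlier detection with LOF and EllipticEnvelope.
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

X_num = X[['DAYS_EMPLOYED', 'AMT_CREDIT']].fillna(0).sample(20_000, random_state=SEED)

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_num)               # -1 = outlier, 1 = inlier
ee_labels = EllipticEnvelope(contamination=0.01, random_state=SEED).fit_predict(X_num)
print((lof_labels == -1).sum(), (ee_labels == -1).sum())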

Pipeline Implementation

The samples flagged as outliers need to be handled according to their actual meaning:

  • Depending on the number of outliers in a record and their impact, consider deleting that record.
  • Apply a log-scale transformation to dampen the outliers.
  • Smooth outliers via binning.
  • Replace outliers with the mean/median/mode; simple and effective.
  • Flag the outliers, or add an anomaly-score column.
  • Tree models are fairly robust to outliers, so they can also simply be ignored.

Next, we compute an iForest anomaly score on the numeric variables and flag anomalous samples.

from sklearn.ensemble import IsolationForest

class CustomIsolationForest(IsolationForest, TransformerMixin):
    """
    Isolation Forest Algorithm.
    Compute the anomaly score of each sample using the IsolationForest algorithm.
    """
    def __init__(self, drop_outliers=False, **kwargs):
        super().__init__(**kwargs)
        self.drop_outliers = drop_outliers
    def transform(self, X, y=None):  
        anomaly_scores = super().decision_function(X)
        pred = super().predict(X)
        n_outliers = pred[pred == -1].size
        if self.drop_outliers:
            print(f"Remove {n_outliers} outliers from the dataset")
            return X.loc[pred == 1, :]
        else:
            # Return the anomaly score of each sample.
            print(f"The number of outliers: {n_outliers} ({n_outliers/X.size:.1%})")
            return anomaly_scores.reshape(-1, 1)
    def get_feature_names_out(self, input_features=None):
        if self.drop_outliers:
            return self.feature_names_in_
        else:
            return ["anomaly_score"]
# fit the model for anomaly detection
iforest = CustomIsolationForest()
anomaly_score = pd.DataFrame(
    iforest.fit_transform(X_encoded),
    columns=["anomaly_score"]
)
anomaly_score.head()
The number of outliers: 14477 (0.0%)
   anomaly_score
0       0.061131
1       0.055532
2       0.086801
3       0.140955
4       0.111094

Standardization / Normalization

Standardization and normalization can improve the accuracy of some algorithms and speed up gradient-descent convergence. Many models do not need them, chiefly models based on probability distributions, such as CART in the decision-tree family and random forests.

  • z-score standardization is the most common preprocessing step; essentially all linear models standardize before fitting. It assumes the feature follows a normal distribution; after standardization it becomes a standard normal with mean 0 and standard deviation 1.
  • min-max scaling (also called range scaling) maps feature values into [0, 1]. Its drawback is that if the test or prediction data contain values below the training min or above the training max, min and max change and must be recomputed, so unless you specifically need a bounded value range, z-score standardization is usually the better choice.
  • L1/L2-norm normalization: if the goal is only to unify scales, the data can be normalized as a whole by the L2 norm (a short sketch follows the table below).
| sklearn.preprocessing | Description |
| --- | --- |
| StandardScaler() | z-score standardization |
| Normalizer(norm='l2') | normalization by the l1, l2 or max norm |
| MinMaxScaler() | min-max scaling |
| MaxAbsScaler() | max-abs scaling, the recommended way to scale sparse data |
| RobustScaler() | quantile-based scaling, recommended for data with outliers |
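A minimal sketch comparing a few of these scalers (the column choice is illustrative):

# Hedged sketch: z-score and min-max scaling of the same column.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

amt = X[['AMT_CREDIT']]
print(pd.DataFrame({
    'z-score': StandardScaler().fit_transform(amt).ravel(),
    'min-max': MinMaxScaler().fit_transform(amt).ravel(),
}).describe())

# Normalizer works row-wise (unit norm per sample), so it is applied to several columns at once.
l2_scaled = Normalizer(norm='l2').fit_transform(X[['AMT_CREDIT', 'AMT_ANNUITY']].fillna(0))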

The pandas implementations of z-score standardization and quantile scaling already appeared in the outlier-capping function above; the other scaling methods are less commonly used.

Since the dataset still contains some outliers, we use RobustScaler to standardize the data.

from sklearn.preprocessing import RobustScaler
pd.DataFrame(RobustScaler().fit_transform(X[['DAYS_EMPLOYED', 'AMT_CREDIT']])).describe()
                   0              1
count  252137.000000  307511.000000
mean       -0.305718       0.158721
std         0.971080       0.747221
min        -6.754153      -0.869825
25%        -0.634136      -0.452114
50%         0.000000       0.000000
75%         0.365864       0.547886
max         0.684385       6.565430

Normality Transformations

Skewness

Many regression algorithms, especially linear models, assume that numeric features follow a normal distribution. Let's first compute the skewness of each numeric feature:

# Check the skew of all numerical features
skewness = X.select_dtypes('number').skew().sort_values()
skewness[abs(skewness) > 0.75].head(20)
FLAG_MOBIL                     -554.536744
FLAG_CONT_MOBILE                -23.081172
YEARS_BEGINEXPLUATATION_MEDI    -15.573124
YEARS_BEGINEXPLUATATION_AVG     -15.515264
YEARS_BEGINEXPLUATATION_MODE    -14.755318
DAYS_EMPLOYED                    -1.968316
FLAG_EMP_PHONE                   -1.664886
YEARS_BUILD_MODE                 -1.002305
YEARS_BUILD_MEDI                 -0.962784
YEARS_BUILD_AVG                  -0.962485
FLAG_DOCUMENT_3                  -0.925725
FLAG_OWN_REALTY                  -0.840293
EXT_SOURCE_2                     -0.793576
FLOORSMIN_AVG                     0.954197
FLOORSMIN_MEDI                    0.960226
FLOORSMIN_MODE                    0.963835
FLAG_PHONE                        0.974083
CNT_FAM_MEMBERS                   0.987543
FLOORSMAX_AVG                     1.226454
AMT_CREDIT                        1.234778
dtype: float64

These features are noticeably skewed, so we try transformations to bring the data closer to a normal distribution.

Q-Q Plot

Taking the AMT_CREDIT feature as an example, we plot its distribution and its Q-Q plot (using the helper function defined below).

A quantile-quantile (Q-Q) plot is a common statistical graphic for comparing two distributions: it is a scatter plot with the quantiles of the standard normal distribution on the x-axis and the sample values on the y-axis. If the points lie close to a straight line, the data are approximately normal; the line's slope is the standard deviation and its intercept is the mean.

import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import probplot, norm

def norm_comparison_plot(series):
    series = pd.Series(series)
    mu, sigma = norm.fit(series)
    kurt, skew = series.kurt(), series.skew()
    print(f"Kurtosis: {kurt:.2f}", f"Skewness: {skew:.2f}", sep='\t')
    
    fig = plt.figure(figsize=(10, 4))
    # Now plot the distribution
    ax1 = fig.add_subplot(121)
    ax1.set_title('Distribution')
    ax1.set_ylabel('Frequency')
    sns.distplot(series, fit=norm, ax=ax1)
    ax1.legend([rf'Normal dist. ($\mu=$ {mu:.2f} and $\sigma=$ {sigma:.2f})'], loc='best')
    # Get also the QQ-plot
    ax2 = fig.add_subplot(122)
    probplot(series, plot=plt)
norm_comparison_plot(X['AMT_CREDIT'])
plt.show() 
Kurtosis: 1.93	Skewness: 1.23

Non-linear Transformations

Non-linear transformations currently supported by the sklearn.preprocessing module:

| Method | Description |
| --- | --- |
| QuantileTransformer | quantile transformation; maps to a uniform distribution on [0, 1] or to a normal distribution |
| PowerTransformer | power transformation; maps data from any distribution as close to a Gaussian as possible, to stabilize variance and minimize skewness |

Besides these, the log transform is the most commonly used. For features containing negative values, min-max scale them to [0, 1] first and then transform (see the sketch below).
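A minimal sketch of the log transform (np.log1p handles zeros; for a feature with negative values, min-max scale to [0, 1] first):

# Hedged sketch: log1p on a right-skewed, non-negative feature.
amt_income_log = np.log1p(X['AMT_INCOME_TOTAL'])
print(f"skew before: {X['AMT_INCOME_TOTAL'].skew():.2f}, after log1p: {amt_income_log.skew():.2f}")

# For a feature with negative values, scale to [0, 1] before the transform.
days_scaled = (X['DAYS_BIRTH'] - X['DAYS_BIRTH'].min()) / (X['DAYS_BIRTH'].max() - X['DAYS_BIRTH'].min())
days_log = np.log1p(days_scaled)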

Here we apply a Box-Cox transformation to the AMT_INCOME_TOTAL feature.

The Box-Cox transformation, proposed in 1964, helps a linear regression model satisfy linearity, independence, homoscedasticity and normality without losing information.
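For reference, the Box-Cox transform is defined for positive $y$ as

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
$$

where PowerTransformer estimates $\lambda$ by maximum likelihood and, by default, also standardizes the output.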

from sklearn.preprocessing import PowerTransformer
# Box Cox Transformation of skewed features (instead of log-transformation)
norm_trans = PowerTransformer("box-cox")
amt_credit_transformed = norm_trans.fit_transform(X['AMT_INCOME_TOTAL'].values.reshape(-1, 1))
norm_comparison_plot(amt_credit_transformed[:,0])
plt.show()
Kurtosis: 0.49	Skewness: -0.01

After the Box-Cox transformation, the feature is roughly normally distributed.

Baseline

At this point, the data preprocessing is essentially complete.

X_prepared = pd.concat([X_encoded, anomaly_score], axis=1) 
print(X_prepared.shape)
print(X_prepared.dtypes.value_counts())
(307511, 157)
float64    147
bool        10
Name: count, dtype: int64

Normalizing Feature Names

X_prepared.columns = X_prepared.columns.str.replace('/','or').str.replace(' ','_').str.replace(',','_or')

Cross-Validation

We can now choose a model and start training. We will use the results of a LightGBM model as the baseline.

Define a dataset scoring function:

def score_dataset(X, y, nfold=5):
    # Create Dataset object for lightgbm
    dtrain = lgb.Dataset(X, label=y, free_raw_data=False)
    
    #  Use a dictionary to set Parameters.
    params = dict(
        objective='binary',
        is_unbalance=True,
        metric='auc',
        n_estimators=500,
        verbose=0
    )
    
    # Training with 5-fold CV:
    print('Starting training...')
    eval_results = lgb.cv(
        params, 
        dtrain, 
        nfold=nfold,
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(50)],
        return_cvbooster=True
    )
    boosters = eval_results['cvbooster'].boosters
    # Initialize an empty dataframe to hold feature importances
    feature_importances = pd.DataFrame(index=X.columns)
    for i in range(nfold):
        feature_importances[f'cv_{i}'] = boosters[i].feature_importance()
    feature_importances['score'] = feature_importances.mean(axis=1)
    # Sort features according to importance
    feature_importances = feature_importances.sort_values('score', ascending=False)
    return eval_results, feature_importances
eval_results, feature_importances = score_dataset(X_prepared, y)
Starting training...
Training until validation scores don't improve for 50 rounds
[50]	cv_agg's valid auc: 0.754568 + 0.00339662
[100]	cv_agg's valid auc: 0.756576 + 0.00312062
[150]	cv_agg's valid auc: 0.756414 + 0.00303009
Early stopping, best iteration is:
[124]	cv_agg's valid auc: 0.756627 + 0.00292137
prepared_data = pd.concat([df[id_col], X_prepared, y], axis=1)
prepared_data.to_csv('../datasets/Home-Credit-Default-Risk/prepared_data.csv', index=False)

Feature Importance

feature_importances['score'].head(20)
EXT_SOURCE_3                  230.4
EXT_SOURCE_1                  227.6
AMT_CREDIT                    213.2
EXT_SOURCE_2                  205.4
AMT_ANNUITY                   188.0
DAYS_BIRTH                    172.8
AMT_GOODS_PRICE               156.2
DAYS_ID_PUBLISH               140.0
DAYS_EMPLOYED                 137.0
DAYS_LAST_PHONE_CHANGE        121.6
DAYS_REGISTRATION             103.6
ORGANIZATION_TYPE              97.8
AMT_INCOME_TOTAL               90.0
REGION_POPULATION_RELATIVE     77.4
anomaly_score                  74.2
OWN_CAR_AGE                    63.2
OCCUPATION_TYPE                58.8
HOUR_APPR_PROCESS_START        48.0
CODE_GENDER_M                  46.8
NAME_EDUCATION_TYPE            42.6
Name: score, dtype: float64