Python Advanced Guide
Introduction
In this guide, I will walk through how to use data manipulation to extract features manually.
Manual feature engineering can be exhausting and needs plenty of time, experience, and domain knowledge to develop the right features. There are many automatic feature engineering tools available, such as FeatureTools and AutoFeat. However, manual feature engineering is essential for understanding those advanced tools, and it helps build a robust and generic model. I will use the home-credit-default-risk dataset available on the Kaggle platform, and only two tables from the main folder: bureau and bureau_balance. According to the dataset description on the competition page, the tables are the following:
bureau.csv
- This table includes all clients’ previous credits from other financial institutions that were reported to the Credit Bureau.
bureau_balance.csv
- Monthly balances of earlier loans in the Credit Bureau.
- This table has one row for each month of history of every previous loan reported to the Credit Bureau.
Topics covered in this tutorial
- Reading and Munging the data — customizing the KDE plot
- Investigate correlation
- Aggregate numeric columns
- Get stats for bureau_balance
- Investigating the categorical variables
- Insert computed features into the train dataset
- Check the missing data
- Correlations
- Collinearity
1. Reading and Munging the data
I will start by importing some important libraries that will help in understanding the data.
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
I will start by analyzing bureau.csv first:
# Read in bureau
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head()
This table has 1,716,428 observations and 17 features.
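The shape and the column types shown below can be checked directly with pandas; this is a minimal check of my own, since the exact calls are not shown in the original snippet:
# quick sanity check (not part of the original snippet)
print(bureau.shape)   # (1716428, 17)
bureau.dtypes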
SK_ID_CURR int64
SK_ID_BUREAU int64
CREDIT_ACTIVE object
CREDIT_CURRENCY object
DAYS_CREDIT int64
CREDIT_DAY_OVERDUE int64
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG int64
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float64
CREDIT_TYPE object
DAYS_CREDIT_UPDATE int64
AMT_ANNUITY float64
dtype: object
We need to get the number of previous loans per client id, which is SK_ID_CURR. We can get that using the pandas aggregation functions groupby and count(). Then we store the result in a new dataframe after renaming SK_ID_BUREAU to previous_loan_count for readability.
# group by client id and count the number of previous loans
prev_loan_count = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_count'})
The new prev_loan_count dataframe has only 305,811 observations. Now, I will merge the prev_loan_count dataframe into the train dataset on the client id SK_ID_CURR, then fill the missing values with 0. Finally, I check that the new column has been added by inspecting its dtypes.
# join with the training dataframe
# read train.csv
pd.set_option('display.max_columns', None)
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
train = train.merge(prev_loan_count, on = 'SK_ID_CURR', how = 'left')

# fill the missing values with 0
train['previous_loan_count'] = train['previous_loan_count'].fillna(0)

train['previous_loan_count'].dtypes
# dtype('float64')
It is already there!
2. Investigate correlation
The next step is to explore the Pearson correlation coefficient (r-value) between the attributes and the target as a rough form of feature importance. It is not a true measure of importance for new variables; however, it provides a reference for whether a variable will be helpful to the model or not.
A higher correlation with the dependent variable means that a change in that variable is associated with a significant change in the dependent variable. So, in the next step, I will look at the highest absolute r-values relative to the dependent variable.
A kernel density estimate (KDE) plot is a good way to describe the relationship between the dependent and an independent variable.
# Plots the distribution of a variable colored by value of the dependent variable
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    plt.figure(figsize = (12, 6))

    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend();

    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))

    # Print out average values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
Then check the distribution of previous_loan_count against the TARGET:
kde_target('previous_loan_count', train)

It is hard to see any significant correlation between the TARGET and previous_loan_count; no meaningful relationship can be detected from the diagram. So, more variables need to be investigated using aggregation functions.
3. Aggregate numeric columns
I will pick the numeric columns grouped by client id, then apply the statistics functions min, max, sum, mean, and count to get summary statistics for each numeric feature.
# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'min','max','sum']).reset_index()
For readability's sake, I create a new name for each column, then merge with the train dataset.
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variable names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

# Assign the list of column names as the dataframe column names
bureau_agg.columns = columns

# merge with the train dataset
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
Get the correlation with the TARGET variable, then sort the correlations by absolute value using the sort_values() function.
# Calculate correlation between variables and the dependent variable
# Sort the correlations by the absolute value
new_corrs = train.drop(columns = ['TARGET']).corrwith(train['TARGET']).sort_values(ascending = False)

new_corrs[:15]
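Strictly speaking, sort_values(ascending = False) above orders by the signed correlation rather than by absolute value. A small alternative of my own (not in the original notebook) that ranks the correlations by absolute value would be:
# rank the correlations by absolute value (alternative, not from the original notebook)
new_corrs_abs = new_corrs.reindex(new_corrs.abs().sort_values(ascending = False).index)
new_corrs_abs[:15]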

Now check the KDE plot for the newly created variables:
kde_target('bureau_DAYS_CREDIT_mean', train)

As illustrated, again the correlation is very weak and could be just noise. Furthermore, a larger negative number indicates the loan was taken further in the past relative to the current loan application.
4. Get stats for the bureau_balance
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()
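The merging step in the next section uses a bureau_balance_agg dataframe and an agg_numeric helper that are not shown in this excerpt. The following is a minimal sketch of what such a helper could look like (a reconstruction of my own that mirrors the groupby/agg pattern from section 3, not the author's exact code):
# A sketch of a generic numeric-aggregation helper (assumed implementation;
# it mirrors the groupby/agg pattern used for the bureau table above).
def agg_numeric(df, group_var, col_name):
    """Aggregate the numeric columns of `df` grouped by `group_var`,
    prefixing the new column names with `col_name`."""
    # keep only the numeric columns (the id columns in these tables are numeric)
    numeric_df = df.select_dtypes('number')
    # compute the summary statistics per group
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'min', 'max', 'sum'])
    # flatten the MultiIndex columns, e.g. ('MONTHS_BALANCE', 'mean') -> 'bureau_balance_MONTHS_BALANCE_mean'
    agg.columns = ['%s_%s_%s' % (col_name, var, stat) for var, stat in agg.columns]
    return agg.reset_index()

# Aggregate the monthly balance records per previous loan; this produces the
# bureau_balance_agg dataframe used in the merging step of the next section.
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')
bureau_balance_agg.head()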

5. Investigating the categorical variables
The following function iterates over the dataframe, picks the categorical columns, and creates dummy variables for them.
def process_categorical(df, group_var, col_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row.
    col_name : string
        Variable added to the front of column names to keep track of columns.

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category
        in every categorical variable, with one row for every unique value
        of the `group_var`.
    """
    # pick the categorical columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('O'))

    # put an id for each column
    categorical[group_var] = df[group_var]

    # aggregate by the group_var
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    columns_name = []

    # iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # iterate through level 1 for stats
        for stat in ['count', 'count_norm']:
            # make new column name
            columns_name.append('%s_%s_%s' % (col_name, var, stat))

    categorical.columns = columns_name

    return categorical
This function returns sum and mean statistics for each categorical column: the sum of the one-hot columns gives the count of each category, and the mean gives the normalized count.
bureau_counts = process_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')
Do the same for bureau_balance:
bureau_balance_counts = process_categorical(df = bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')
Now we have the calculations for each loan, and we need to aggregate them for each client. I will merge all the previous dataframes together, then aggregate the statistics again, grouped by SK_ID_CURR.
# dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', col_name = 'client')
6. Insert computed features into the train dataset
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))
The output: Original Number of Features: 122
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
The output is: Number of features using previous loans from other institutions data: 333
7. Check the missing data
It is very important to check for missing data in the training set after merging in the new features.
# Function to calculate missing values by column
def missing_percent(df):
    """Computes the count and percentage of missing values for each column.

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the missing values for.

    Return
    --------
    mis_columns : dataframe
        A dataframe with the missing value information.
    """
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)

    # Rename the columns
    mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})

    # Sort the table by percentage of missing descending
    mis_columns = mis_columns[
        mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)

    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_columns.shape[0]) +
           " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_columns

train_missing = missing_percent(train)
train_missing.head()

There are quite a few columns with a lot of missing data. I am going to drop any column with more than 90% of its data missing.
missing_vars_train = train_missing.loc[train_missing['Percent of Total Values'] > 90, 'Percent of Total Values']
len(missing_vars_train)
# 0
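No column crosses the threshold here, so there is nothing to drop. If some columns did exceed it, one simple way to remove them (an addition of mine, not part of the original code) would be:
# drop any columns above the missing-data threshold (a no-op here, since the list is empty)
train = train.drop(columns = list(missing_vars_train.index))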
I will do the same for the test data.
# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')

# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
Then I will align the train and test datasets so that they share the same columns, and check their shapes.
# create a train target label
train_label = train['TARGET']

# align both dataframes, this will remove the TARGET column
train, test = train.align(test, join = 'inner', axis = 1)
train['TARGET'] = train_label

print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)

# Training Data Shape: (307511, 333)
# Testing Data Shape: (48744, 332)
Let’s check the missing percentage on the test set.
test_missing = missing_percent(test)
test_missing.head()

8. Correlations
I will check the correlation between the TARGET variable and the newly created features.
# calculate correlations for the whole dataframe
corr_train = train.corr()

# Sort the resulting values in descending order
corr_train = corr_train.sort_values('TARGET', ascending = False)

# show the ten most positive correlations
pd.DataFrame(corr_train['TARGET'].head(10))

As observed from the sample above, the most correlated variables are the ones that were engineered earlier. However, correlation does not imply causation, which is why we need to assess these correlations and pick the variables that have a deeper influence on the TARGET. To do so, I will stick with the KDE plot.
kde_target('bureau_DAYS_CREDIT_mean', train)

The plot suggests that applicants with a greater number of monthly records per loan tend to repay the new loan. Let’s look at the bureau_CREDIT_ACTIVE_Active_count_norm variable to see if this holds.
kde_target('bureau_CREDIT_ACTIVE_Active_count_norm', train)

The correlation here is very weak; we can’t notice any significance.
9. Collinearity
I will set a threshold of 0.8 (80%) to identify and remove variables that are highly correlated (collinear) with one another.
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corr_train:
    above_threshold_vars[col] = list(corr_train.index[corr_train[col] > threshold])

# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
The output is: Number of columns to remove: 134
Then we can remove those columns from the dataset as a preparation step for model building.
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)

# Training Corrs Removed Shape: (307511, 199)
# Testing Corrs Removed Shape: (48744, 198)
Summary
The purpose of this tutorial was to introduce you to several concepts that may seem confusing at first:
- Feature engineering using pandas functions.
- Customizing the kernel density estimator plot.
- Assessing the newly extracted features.
- Eliminating collinearity in the data.
Translated from: https://towardsdatascience.com/hands-on-guide-to-feature-engineering-de793efc785