Python Advanced Guide
Introduction
In this guide, I will walk through how to use data manipulation to extract features manually.
Manual feature engineering can be exhausting and needs plenty of time, experience, and domain knowledge to develop the right features. There are many automatic feature engineering tools available, such as FeatureTools and AutoFeat. However, manual feature engineering is essential for understanding those advanced tools, and it helps build a robust and generic model. I will use the home-credit-default-risk dataset available on the Kaggle platform, and only two tables from the main folder: bureau and bureau_balance. According to the dataset description on the competition page, the tables are the following:
bureau.csv
- This table includes all clients’ previous credits from other financial institutions that were reported to the Credit Bureau.
bureau_balance.csv
- Monthly balances of earlier loans in the Credit Bureau.
- This table has one row for each month of history of every previous loan reported to the Credit Bureau.
Topics covered in this tutorial
- Reading and Munging the data — customizing the KDE plot
- Investigate correlation
- Aggregate numeric columns
- Get stats for bureau_balance
- Investigating the categorical variables
- Insert computed features into the train dataset
- Check the missing data
- Correlations
- Collinearity
1. Reading and Munging the data
I will start by importing some important libraries that will help in understanding the data.
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
I will start by analyzing bureau.csv first:
# Read in bureau
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head()
This table has 1,716,428 observations and 17 features.
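The shape and the column types shown below can be checked directly with pandas; this is a minimal check of my own, since the exact calls are not shown in the original snippet:
# quick sanity check (not part of the original snippet)
print(bureau.shape)   # (1716428, 17)
bureau.dtypes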
SK_ID_CURR int64
SK_ID_BUREAU int64
CREDIT_ACTIVE object
CREDIT_CURRENCY object
DAYS_CREDIT int64
CREDIT_DAY_OVERDUE int64
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG int64
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float64
CREDIT_TYPE object
DAYS_CREDIT_UPDATE int64
AMT_ANNUITY float64
dtype: object
We need to get the number of previous loans per client id, which is SK_ID_CURR. We can get that using the pandas aggregation functions groupby and count(). Then we store the result in a new dataframe after renaming SK_ID_BUREAU to previous_loan_count for readability.
# group by client id and count the number of previous loans
prev_loan_count = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_count'})
The new prev_loan_count dataframe has only 305,811 observations. Now, I will merge the prev_loan_count dataframe into the train dataset on the client id SK_ID_CURR, then fill the missing values with 0. Finally, I check that the new column has been added by inspecting its dtypes.
# join with the training dataframe
# read train.csv
pd.set_option('display.max_columns', None)
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
train = train.merge(prev_loan_count, on = 'SK_ID_CURR', how = 'left')

# fill the missing values with 0
train['previous_loan_count'] = train['previous_loan_count'].fillna(0)

train['previous_loan_count'].dtypes
# dtype('float64')
It is already there!
2. Investigate correlation
The next step is to explore the Pearson correlation coefficient (r-value) between the attributes and the target as a rough form of feature importance. It is not a true measure of importance for new variables; however, it provides a reference for whether a variable will be helpful to the model or not.
A higher correlation with the dependent variable means that a change in that variable is associated with a significant change in the dependent variable. So, in the next step, I will look at the highest absolute r-values relative to the dependent variable.
A kernel density estimate (KDE) plot is a good way to describe the relationship between the dependent and an independent variable.
# Plots the distribution of a variable colored by value of the dependent variable
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    plt.figure(figsize = (12, 6))

    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend();

    # print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))

    # Print out average values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
Then check the distribution of previous_loan_count against the TARGET:
kde_target('previous_loan_count', train)

It is hard to see any significant correlation between the TARGET and previous_loan_count; no meaningful relationship can be detected from the diagram. So, more variables need to be investigated using aggregation functions.
3. Aggregate numeric columns
I will pick the numeric columns grouped by client id, then apply the statistics functions min, max, sum, mean, and count to get summary statistics for each numeric feature.
# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'min','max','sum']).reset_index()
For readability's sake, I create a new name for each column, then merge with the train dataset.
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variable names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

# Assign the list of column names as the dataframe column names
bureau_agg.columns = columns

# merge with the train dataset
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
Get the correlation with the TARGET variable, then sort the correlations by absolute value using the sort_values() function.
# Calculate correlation between variables and the dependent variable
# Sort the correlations by the absolute value
new_corrs = train.drop(columns = ['TARGET']).corrwith(train['TARGET']).sort_values(ascending = False)

new_corrs[:15]
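Strictly speaking, sort_values(ascending = False) above orders by the signed correlation rather than by absolute value. A small alternative of my own (not in the original notebook) that ranks the correlations by absolute value would be:
# rank the correlations by absolute value (alternative, not from the original notebook)
new_corrs_abs = new_corrs.reindex(new_corrs.abs().sort_values(ascending = False).index)
new_corrs_abs[:15]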

Now check the KDE plot for the newly created variables:
kde_target('bureau_DAYS_CREDIT_mean', train)

As illustrated, again the correlation is very weak and could be just noise. Furthermore, a larger negative number indicates the loan was taken further in the past relative to the current loan application.
4. Get stats for the bureau_balance
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()
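The merging step in the next section uses a bureau_balance_agg dataframe and an agg_numeric helper that are not shown in this excerpt. The following is a minimal sketch of what such a helper could look like (a reconstruction of my own that mirrors the groupby/agg pattern from section 3, not the author's exact code):
# A sketch of a generic numeric-aggregation helper (assumed implementation;
# it mirrors the groupby/agg pattern used for the bureau table above).
def agg_numeric(df, group_var, col_name):
    """Aggregate the numeric columns of `df` grouped by `group_var`,
    prefixing the new column names with `col_name`."""
    # keep only the numeric columns (the id columns in these tables are numeric)
    numeric_df = df.select_dtypes('number')
    # compute the summary statistics per group
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'min', 'max', 'sum'])
    # flatten the MultiIndex columns, e.g. ('MONTHS_BALANCE', 'mean') -> 'bureau_balance_MONTHS_BALANCE_mean'
    agg.columns = ['%s_%s_%s' % (col_name, var, stat) for var, stat in agg.columns]
    return agg.reset_index()

# Aggregate the monthly balance records per previous loan; this produces the
# bureau_balance_agg dataframe used in the merging step of the next section.
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')
bureau_balance_agg.head()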

5. Investigating the categorical variables
The following function iterates over the dataframe, picks the categorical columns, and creates dummy variables for them.
def process_categorical(df, group_var, col_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row.
    col_name : string
        Variable added to the front of column names to keep track of columns.

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category
        in every categorical variable, with one row for every unique value
        of the `group_var`.
    """
    # pick the categorical columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('O'))

    # put an id for each column
    categorical[group_var] = df[group_var]

    # aggregate by the group_var
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    columns_name = []

    # iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # iterate through level 1 for stats
        for stat in ['count', 'count_norm']:
            # make new column name
            columns_name.append('%s_%s_%s' % (col_name, var, stat))

    categorical.columns = columns_name

    return categorical
This function returns sum and mean statistics for each categorical column: the sum of the one-hot columns gives the count of each category, and the mean gives the normalized count.
bureau_counts = process_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')
Do the same for bureau_balance:
bureau_balance_counts = process_categorical(df = bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')
Now we have the calculations for each loan, and we need to aggregate them for each client. I will merge all the previous dataframes together, then aggregate the statistics again, grouped by SK_ID_CURR.
# dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', col_name = 'client')
6. Insert computed features into the train dataset
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))
The output: Original Number of Features: 122
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
The output is: Number of features using previous loans from other institutions data: 333
7. Check the missing data
It is very important to check for missing data in the training set after merging in the new features.
# Function to calculate missing values by column
def missing_percent(df):
    """Computes the count and percentage of missing values for each column.

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the missing values for.

    Return
    --------
    mis_columns : dataframe
        A dataframe with the missing value information.
    """
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)

    # Rename the columns
    mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})

    # Sort the table by percentage of missing descending
    mis_columns = mis_columns[
        mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)

    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_columns.shape[0]) +
           " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_columns

train_missing = missing_percent(train)
train_missing.head()

There are quite a few columns with a lot of missing data. I am going to drop any column with more than 90% of its data missing.
missing_vars_train = train_missing.loc[train_missing['Percent of Total Values'] > 90, 'Percent of Total Values']
len(missing_vars_train)
# 0
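No column crosses the threshold here, so there is nothing to drop. If some columns did exceed it, one simple way to remove them (an addition of mine, not part of the original code) would be:
# drop any columns above the missing-data threshold (a no-op here, since the list is empty)
train = train.drop(columns = list(missing_vars_train.index))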
I will do the same for the test data.
# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')

# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
Then I will align the train and test datasets so that they share the same columns, and check their shapes.
# create a train target label
train_label = train['TARGET']

# align both dataframes, this will remove the TARGET column
train, test = train.align(test, join = 'inner', axis = 1)
train['TARGET'] = train_label

print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)

# Training Data Shape: (307511, 333)
# Testing Data Shape: (48744, 332)
Let’s check the missing percentage on the test set.
test_missing = missing_percent(test)
test_missing.head()

8. Correlations
I will check the correlation between the TARGET variable and the newly created features.
# calculate correlations for the whole dataframe
corr_train = train.corr()

# Sort the resulting values in descending order
corr_train = corr_train.sort_values('TARGET', ascending = False)

# show the ten most positive correlations
pd.DataFrame(corr_train['TARGET'].head(10))

As observed from the sample above, the most correlated variables are the ones that were engineered earlier. However, correlation does not imply causation, which is why we need to assess these correlations and pick the variables that have a deeper influence on the TARGET. To do so, I will stick with the KDE plot.
kde_target('bureau_DAYS_CREDIT_mean', train)

The plot suggests that applicants with a greater number of monthly records per loan tend to repay the new loan. Let’s look at the bureau_CREDIT_ACTIVE_Active_count_norm variable to see if this holds.
kde_target('bureau_CREDIT_ACTIVE_Active_count_norm', train)

The correlation here is very weak; we can’t notice any significance.
9. Collinearity
I will set a threshold of 0.8 (80%) to identify and remove variables that are highly correlated (collinear) with one another.
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corr_train:
    above_threshold_vars[col] = list(corr_train.index[corr_train[col] > threshold])

# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        else:
            # Only want to remove one in a pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
The output is: Number of columns to remove: 134
Then we can remove those columns from the dataset as a preparation step for model building.
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)

# Training Corrs Removed Shape: (307511, 199)
# Testing Corrs Removed Shape: (48744, 198)
Summary
The purpose of this tutorial was to introduce you to several concepts that may seem confusing at first:
- Feature engineering using pandas functions.
- Customizing the kernel density estimator plot.
- Assessing the newly extracted features.
- Eliminating collinearity in the data.
Translated from: https://towardsdatascience.com/hands-on-guide-to-feature-engineering-de793efc785