Hands-On Guide to Feature Engineering in Python

Introduction

In this guide, I will walk through how to use data manipulation to extract features manually.

Manual feature engineering can be exhausting and needs plenty of time, experience, and domain knowledge to develop the right features. There are many automatic feature engineering tools available, such as FeatureTools and AutoFeat. However, manual feature engineering is essential to understanding those advanced tools, and it helps in building a robust and generic model. I will use the home-credit-default-risk dataset available on the Kaggle platform, and only two tables from the main folder: bureau and bureau_balance. According to the dataset description on the competition page, the tables are the following:

bureau.csv

  • This table includes all of the clients’ previous credits from other financial institutions that were reported to the Credit Bureau.

bureau_balance.csv

  • Monthly balances of earlier loans in the Credit Bureau.
  • This table has one row for each month of the history of every previous loan reported to the Credit Bureau.
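
Both tables link back to the application data through ids: each client (SK_ID_CURR) can have several previous loans (SK_ID_BUREAU), and each previous loan can have many monthly rows in bureau_balance. A quick sanity check of that one-to-many structure, assuming the two CSV files have been loaded as shown in the next section, might look like this:

# loans per client and monthly records per loan (illustrative check only)
loans_per_client = bureau.groupby('SK_ID_CURR')['SK_ID_BUREAU'].nunique()
months_per_loan = bureau_balance.groupby('SK_ID_BUREAU')['MONTHS_BALANCE'].count()
print(loans_per_client.describe())
print(months_per_loan.describe())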

Topics covered in this tutorial

  1. Reading and munging the data: customizing the KDE plot
  2. Investigating correlation
  3. Aggregating numeric columns
  4. Getting stats for bureau_balance
  5. Investigating the categorical variables
  6. Inserting the computed features into the train dataset
  7. Checking the missing data
  8. Correlations
  9. Collinearity

1. Reading and Munging the data

I will start by importing some important libraries that will help in understanding the data.

# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')

I will start by analyzing bureau.csv:

# Read in bureau
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head()

This table has 1,716,428 observations and 17 features.
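
Only the output of the inspection appears in the original; the shape and the column types listed below can be reproduced with calls like these (the exact calls are an assumption):

bureau.shape    # (1716428, 17)
bureau.dtypes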

SK_ID_CURR                  int64
SK_ID_BUREAU                int64
CREDIT_ACTIVE              object
CREDIT_CURRENCY            object
DAYS_CREDIT                 int64
CREDIT_DAY_OVERDUE          int64
DAYS_CREDIT_ENDDATE       float64
DAYS_ENDDATE_FACT         float64
AMT_CREDIT_MAX_OVERDUE    float64
CNT_CREDIT_PROLONG          int64
AMT_CREDIT_SUM            float64
AMT_CREDIT_SUM_DEBT       float64
AMT_CREDIT_SUM_LIMIT      float64
AMT_CREDIT_SUM_OVERDUE    float64
CREDIT_TYPE                object
DAYS_CREDIT_UPDATE          int64
AMT_ANNUITY               float64
dtype: object

We need to count how many previous loans each client (identified by SK_ID_CURR) has. We can get that using the pandas aggregation functions groupby and count(), then store the result in a new dataframe after renaming SK_ID_BUREAU to previous_loan_count for readability.

# group by client id and count the number of previous loans
prev_loan_count = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(
    columns = {'SK_ID_BUREAU': 'previous_loan_count'})

The new prev_loan_count dataframe has only 305,811 observations. Now I will merge prev_loan_count into the train dataset on the client id SK_ID_CURR, then fill the missing values with 0. Finally, I check that the new column has been added using the dtypes attribute.

# read train.csv and join with the training dataframe
pd.set_option('display.max_columns', None)
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
train = train.merge(prev_loan_count, on = 'SK_ID_CURR', how = 'left')

# fill the missing values with 0
train['previous_loan_count'] = train['previous_loan_count'].fillna(0)
train['previous_loan_count'].dtypes
# dtype('float64')

It is already there!

2. Investigate correlation

The next step is to explore the Pearson correlation coefficient (r-value) between the attributes and the target as a rough measure of feature importance. It is not a true measure of how important a new variable is; however, it provides a reference for whether a variable will be helpful to the model or not.

A higher absolute correlation with the dependent variable means that changes in that variable tend to go together with changes in the dependent variable. So, in the next step, I will look at the variables with the highest absolute r-value relative to the dependent variable.

A kernel density estimate (KDE) plot is a good way to describe the relation between the dependent variable and an independent variable.

# Plots the distribution of a variable colored by value of the dependent variable
def kde_target(var_name, df):

    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    plt.figure(figsize = (12, 6))

    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # Label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend()

    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))

    # Print out the median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)

Then check the distribution of previous_loan_count against the TARGET:

kde_target('previous_loan_count', train)
[Figure: KDE plot of previous_loan_count by TARGET]

It is hard to see any significant relationship between the TARGET and previous_loan_count; no meaningful correlation can be detected from the plot. So, more variables need to be engineered using aggregation functions.

3. Aggregate numeric columns

I will pick the numeric columns, group them by client id, and then apply the statistics min, max, sum, mean, and count to get summary statistics for each numeric feature.

# Group by the client id, calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(
    ['count', 'mean', 'min', 'max', 'sum']).reset_index()

Create a new name for each column for readability's sake, then merge with the train dataset.

# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variable names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

# Assign the list of column names as the dataframe column names
bureau_agg.columns = columns

# Merge with the train dataset
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

Get the correlation of each variable with the TARGET, then sort the correlations in descending order using the sort_values() function.

# Calculate the correlation of each variable with the dependent variable
# and sort the correlations in descending order
new_corrs = train.drop(columns = ['TARGET']).corrwith(train['TARGET']).sort_values(ascending = False)
new_corrs[:15]
[Figure: correlations of the new features with the TARGET variable (top 15)]

Now check the KDE plot for one of the newly created variables:

kde_target('bureau_DAYS_CREDIT_mean', train)
[Figure: KDE plot of bureau_DAYS_CREDIT_mean by TARGET]

As illustrated, the correlation is again very weak and could be just noise. Also note that DAYS_CREDIT is measured relative to the current application, so a larger negative number means the previous loan was taken out further before the current loan application.
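
For intuition, DAYS_CREDIT is stored as a negative number of days relative to the current application, so dividing the engineered mean by -365 expresses it as "years before the application". This is an illustrative transformation only, not part of the original pipeline:

# express the average credit age in years before the application (illustrative)
days_credit_years = train['bureau_DAYS_CREDIT_mean'] / -365
days_credit_years.describe()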

4. Get stats for the bureau_balance

bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()
[Figure: the first rows of bureau_balance.csv]
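
The merging step in the next section uses a bureau_balance_agg dataframe and an agg_numeric helper that are not shown in this excerpt. A minimal sketch of both, assuming the helper mirrors the per-column numeric aggregation used for bureau in step 3 (the function name, its arguments, and the column-naming scheme are assumptions):

def agg_numeric(df, group_var, col_name):
    """Aggregate the numeric columns of `df` for each value of `group_var`."""
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[group_var] = df[group_var]

    # per-group summary statistics, as in step 3
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'min', 'max', 'sum']).reset_index()

    # flatten the MultiIndex columns into '<col_name>_<variable>_<stat>'
    agg.columns = [group_var] + ['%s_%s_%s' % (col_name, var, stat)
                                 for var, stat in agg.columns[1:]]
    return agg

# monthly-balance statistics for each previous loan (assumed construction)
bureau_balance_agg = agg_numeric(bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')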

5. Investigating the categorical variables

The following function iterates over the dataframe, picks the categorical columns, and creates dummy variables for them.

def process_categorical(df, group_var, col_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable.

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.

    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row.

    col_name : string
        Variable added to the front of column names to keep track of columns.

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category
        in every categorical variable, with one row for every unique value of
        the `group_var`.
    """
    # pick the categorical columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('O'))

    # put an id on each row so we can group on it
    categorical[group_var] = df[group_var]

    # aggregate by the group_var
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    columns_name = []

    # iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # iterate through level 1 for stats
        for stat in ['count', 'count_norm']:
            # make a new column name
            columns_name.append('%s_%s_%s' % (col_name, var, stat))

    categorical.columns = columns_name

    return categorical

This function returns sum and mean statistics for each dummy-encoded categorical column.

bureau_counts = process_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')

Do the same for bureau_balance:

bureau_balance_counts = process_categorical(df = bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')

Now we have the calculations at the level of each loan; we need to aggregate them for each client. I will merge all the previous dataframes together and then aggregate the statistics again, grouped by SK_ID_CURR.

# dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', col_name = 'client')

6. Insert computed features into the train dataset
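
Note that the feature count of 122 below corresponds to a fresh copy of application_train.csv; the train dataframe built above already carries the columns merged in the earlier steps. A minimal re-read, assuming that is what the original workflow intends:

train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')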

original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))

The output is: Original Number of Features: 122

# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
# Number of features using previous loans from other institutions data: 333

Output is: Number of features using previous loans from other institutions data: 333

7. Check the missing data

It is very important to check for missing data in the training set after merging the new features.

# Function to calculate missing values by column
def missing_percent(df):
    """Computes the number and percentage of missing values in each column.

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the missing values for.

    Return
    --------
    mis_columns : dataframe
        A dataframe with the missing-value information.
    """
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)

    # Rename the columns
    mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})

    # Sort the table by percentage of missing values, descending
    mis_columns = mis_columns[
        mis_columns.iloc[:, 1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_columns


train_missing = missing_percent(train)
train_missing.head()
[Figure: train_missing.head() output]

There are quite a few columns with plenty of missing data. I am going to drop any column with more than 90% missing data.

missing_vars_train = train_missing.loc[train_missing['Percent of Total Values'] > 90, 'Percent of Total Values']
len(missing_vars_train)
# 0
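
Here no column crosses the 90% threshold, so nothing needs to be removed. If some columns did, a minimal drop using the index of the selection above could look like this (illustrative only):

# drop columns whose missing-value share exceeds the threshold (no-op here)
train = train.drop(columns = list(missing_vars_train.index))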

I will do the same for the test data:

# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')
# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')
# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
# Merge with the value counts of bureau balance
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

Then, I will align the train and test datasets so they keep only their common columns, and check their shapes.

# keep the train target label
train_label = train['TARGET']

# align both dataframes; this will remove the TARGET column
train, test = train.align(test, join = 'inner', axis = 1)
train['TARGET'] = train_label

print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
# Training Data Shape: (307511, 333)
# Testing Data Shape: (48744, 332)

Let's check the missing percentages on the test set.

test_missing = missing_percent(test) 
test_missing.head()
[Figure: test_missing.head() output]

8. Correlations

I will check the correlations between the TARGET variable and the newly created features.

# calculate the correlation matrix for the train dataframe
corr_train = train.corr()
# sort by the correlation with TARGET in descending order
corr_train = corr_train.sort_values('TARGET', ascending = False)
# show the ten most positive correlations
pd.DataFrame(corr_train['TARGET'].head(10))
[Figure: the 10 features most positively correlated with the TARGET]

As observed from the sample above, the most correlated variables are the ones that were engineered earlier. However, correlation does not mean causation, which is why we need to assess those correlations and pick the variables that have a real influence on the TARGET. To do so, I will stick with the KDE plot.

kde_target('client_bureau_balance_MONTHS_BALANCE_count_mean', train)

[Figure: KDE plot of client_bureau_balance_MONTHS_BALANCE_count_mean by TARGET]

The plot suggests that applicants with a greater number of monthly records per loan tend to repay the new loan. Let's look at the bureau_CREDIT_ACTIVE_Active_count_norm variable to see if this holds there too.

kde_target('bureau_CREDIT_ACTIVE_Active_count_norm', train)
[Figure: KDE plot of bureau_CREDIT_ACTIVE_Active_count_norm by TARGET]

The correlation here is very weak; we cannot see anything significant.

9. Collinearity

I will set a threshold of 0.8 and, for every pair of variables whose mutual correlation exceeds it, remove one of the two collinear variables.

# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corr_train:
    above_threshold_vars[col] = list(corr_train.index[corr_train[col] > threshold])

# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and their correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue
        # Only want to remove one variable of each correlated pair
        if x not in cols_seen:
            cols_to_remove.append(x)
            cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))

The output is: Number of columns to remove: 134

Then, we can remove those columns from the dataset as a preparation step for model building:

train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)
print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)

Training Corrs Removed Shape: (307511, 199)
Testing Corrs Removed Shape: (48744, 198)
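
A common follow-up is to persist the prepared datasets for the modeling stage; a minimal sketch, with the file names being assumptions:

# save the engineered train and test sets for later model building
train_corrs_removed.to_csv('train_bureau_corrs_removed.csv', index = False)
test_corrs_removed.to_csv('test_bureau_corrs_removed.csv', index = False)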

Summary

The purpose of this tutorial was to introduce you to many concepts that may seem confusing at the beginning:

  1. Feature engineering using pandas functions.
  2. Customizing the kernel density estimator plot.
  3. Assessing the newly extracted features.
  4. Eliminating collinearity in the data.

Translated from: https://towardsdatascience.com/hands-on-guide-to-feature-engineering-de793efc785
