Programming Best Practices for Data Science

The data science life cycle generally consists of the following components:

  • data retrieval
  • data cleaning
  • data exploration and visualization
  • statistical or predictive modeling

While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.

Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in a Jupyter Notebook, or as a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.

But there’s a better way! In this post, I’ll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

Prototype mindset prioritizes:

  • iteration speed on small pieces of code
  • less abstraction (directly modifying code and data objects)
  • less structure to the code (less modularity)
  • helping you and others understand the code and data

Production mindset prioritizes:

  • iteration speed on the full pipeline
  • more abstraction (modifying parameter values instead)
  • more structure to the code (more modularity)
  • helping a computer run code automatically

I personally use JupyterLab for the entire process (both prototyping and productionizing). I recommend using JupyterLab at least for prototyping.

Lending Club data

To help more concretely understand the difference between the prototyping and the production mindset, let’s work with some real data. We’ll work with lending data from the peer-to-peer lending site, Lending Club. Unlike a bank, Lending Club doesn’t lend money itself. Lending Club is instead a marketplace for lenders to lend money to individuals who are seeking loans for a variety of reasons (home repairs, wedding costs, etc.). We can use this data to build models that will predict if a given loan application will be successful or not. We won’t dive into building a machine learning pipeline for making predictions in this post, but we cover it in our Machine Learning Project Walkthrough Course.

Lending Club offers detailed, historical data on both completed loans (loan applications that Lending Club approved and found lenders for) and declined loans (loan applications that Lending Club declined, so money never changed hands). Navigate to their data download page and select 2007-2011 under DOWNLOAD LOAN DATA.

Prototype mindset

In the prototype mindset, we’re interested in quickly iterating and trying to understand some properties and truths about the data. Create a new Jupyter notebook and add a Markdown cell that explains:

  • Any research you did on Lending Club to better understand the platform
  • Any information on the data set you downloaded

First things first, let’s read the CSV file into pandas.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv')
loans_2007.head(2)

We get two pieces of output: first, a warning.

Then the first few rows of the dataframe, which we'll avoid showing here (because the output is quite long).

We also got the following dataframe output:

[DataFrame output abridged: pandas parses the prospectus note line ("Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)") as the header, so the actual column names (id, member_id, loan_amnt, funded_amnt, ..., settlement_term) show up as data values instead of column headers.]

The warning lets us know that pandas' type inference for each column would be improved if we set the low_memory parameter to False when calling pandas.read_csv().

The second output is more problematic, because there are issues with how the DataFrame is storing the data. JupyterLab has a terminal environment built in, so we can open it and use the bash command head to observe the first two lines of the raw file:

head -2 LoanStats3a.csv

While the second line contains the column names we expect in a CSV file, it looks like the first line is throwing off the formatting of the DataFrame when pandas tries to parse the file.

Add a Markdown cell that details your observations and add a code cell that factors in the observations.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1, low_memory=False)

Read the data dictionary from the Lending Club download page to understand which columns don't contain useful information for features. The desc and url columns seem to immediately fit this criterion.

The next step is to drop any columns where more than 50% of the rows are missing values. Use one cell to explore which columns meet that criterion, and another to actually drop them.

loans_2007.isnull().sum()/len(loans_2007)
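
The second cell, which actually drops those columns, might look something like this minimal sketch (keeping the 50% threshold and the desc / url drop from above):

# Drop the columns we identified as uninformative from the data dictionary
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
# Keep only columns with at least 50% non-missing values
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)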

Because we're using a Jupyter notebook to track our thoughts and our code, we're relying on the environment (via the IPython kernel) to keep track of changes to state. This frees us up to be freeform: move cells around, run the same code multiple times, and so on.

In general, code in the prototyping mindset should focus on:

通常,原型思维方式中的代码应重点关注:

  • Understandability
    • Markdown cells to describe our observations and assumptions
    • Small pieces of code for the actual logic
    • Lots of visualizations and counts (see the short example after this list)
  • Minimal abstractions
    • Most code shouldn’t be in functions (should feel more object-oriented)
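
For instance, prototype-style cells might be as simple as the following (hypothetical quick checks, one per notebook cell, on the loans_2007 DataFrame):

# Counts and summaries rather than functions -- one small, inspectable step per cell
loans_2007['loan_status'].value_counts()
loans_2007['loan_amnt'].describe()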

Let’s say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did. We can then switch over to the production mindset and make the code more robust.

Production mindset

In the production mindset, we want to focus on writing code that will generalize to more situations. In our case, we want our data cleaning code to work for any of the data sets from Lending Club (from other time periods). The best way to generalize our code is to turn it into a data pipeline. A data pipeline is designed using principles from functional programming, where data is modified within functions and then passed between functions.

Here’s a first iteration of this pipeline using a single function to encapsulate data cleaning code:

import pandas as pd

def import_clean(file_list):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        half_count = len(loans)/2
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'])

In the code above, we abstracted the code from earlier into a single function. The input to this function is a list of filenames and the output is a list of DataFrame objects.

In general, the production mindset should focus on:

  • Healthy abstractions
    • Code should generalize to be compatible with similar data sources
    • Code shouldn’t be so general that it becomes cumbersome to understand
  • Pipeline stability
    • Reliability should match how frequently it's run (daily? weekly? monthly?)

Switching between mindsets

Let’s say we tried to run the function for all of the data sets from Lending Club and Python returned errors. Some potential sources for errors:

  • Variance in column names in some of the files
  • Variance in columns being dropped because of the 50% missing value threshold
  • Different column types based on pandas type inference for that file

In those cases, we should actually switch back to our prototype notebook and investigate further. When we’ve determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.

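Back in the prototype notebook, a quick way to check the first source of error might look like the following sketch (the second file name and the header-only read are illustrative assumptions):

import pandas as pd

files = ['LoanStats3a.csv', 'LoanStats2012.csv']
columns_by_file = {}
for file in files:
    # Read only the header row (skipping the prospectus note) to compare column names
    header = pd.read_csv(file, skiprows=1, nrows=0)
    columns_by_file[file] = set(header.columns)

# Columns that don't appear in every file
all_columns = set.union(*columns_by_file.values())
shared_columns = set.intersection(*columns_by_file.values())
print(all_columns - shared_columns)
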
Here's an example where we adapted the function to accommodate a different drop threshold value:

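A minimal sketch of that adaptation, assuming threshold is the fraction of missing values above which a column gets dropped (the feature-group drops from the earlier version are elided):

import pandas as pd

def import_clean(file_list, threshold=0.5):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        # Keep columns with at least (1 - threshold) non-missing values
        keep_count = int(len(loans) * (1 - threshold))
        loans = loans.dropna(thresh=keep_count, axis=1)
        loans = loans.drop_duplicates()
        # ... drop the same feature groups as in the earlier version ...
        frames.append(loans)
    return frames

frames = import_clean(['LoanStats3a.csv'], threshold=0.7)
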
The default value is still 0.5, but we can override it to 0.7 if we want.

Here are a few ways to make the pipeline more flexible, in decreasing priority:

  • Use optional, positional, and required arguments
  • Use if / then statements along with Boolean input values within the functions
  • Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets (see the sketch after this list)
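
For example, the third approach might use a dictionary of per-file column drops, as in this hypothetical sketch (the extra_drops parameter and the file-specific column lists are assumptions, not part of the original pipeline):

import pandas as pd

# Hypothetical custom actions: extra columns to drop for specific datasets
extra_drops = {
    'LoanStats3a.csv': ['desc', 'url'],
    'LoanStats2012.csv': ['desc', 'url', 'member_id'],
}

def import_clean(file_list, threshold=0.5, extra_drops=None):
    extra_drops = extra_drops or {}
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        # Only drop the custom columns that actually exist in this file
        to_drop = [col for col in extra_drops.get(file, []) if col in loans.columns]
        loans = loans.drop(to_drop, axis=1)
        loans = loans.dropna(thresh=int(len(loans) * (1 - threshold)), axis=1)
        loans = loans.drop_duplicates()
        frames.append(loans)
    return frames

frames = import_clean(['LoanStats3a.csv'], extra_drops=extra_drops)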

This pipeline can scale to all phases of the data science workflow. Here’s some skeleton code that previews how this looks.

import pandas as pd
import matplotlib.pyplot as plt

def import_clean(file_list, threshold=0.5):
    ## Code
    pass

def visualize(df_list):
    # Find the most important features and generate pairwise scatter plots
    # Display visualizations and write to file.
    plt.savefig("scatter_plots.png")

def combine(df_list):
    # Combine dataframes and generate train and test sets
    # Drop features all dataframes don't share
    # Return both train and test dataframes
    return train, test

def train(train_df):
    # Train model
    return model

def validate(train_df, test_df):
    # K-fold cross validation
    # Return metrics dictionary
    return metrics_dict

frames = import_clean(['LoanStats3a.csv', 'LoanStats2012.csv'], threshold=0.7)
visualize(frames)
train_df, test_df = combine(frames)
model = train(train_df)
metrics = validate(train_df, test_df)
print(metrics)
 
Next steps

If you’re interested in deepening your understanding and practicing further, I recommend the following next steps:

Translated from: https://www.pybloggers.com/2018/06/programming-best-practices-for-data-science/
