Programming Best Practices for Data Science

The data science life cycle generally consists of the following components:

  • data retrieval
  • data cleaning
  • data exploration and visualization
  • statistical or predictive modeling

While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.

Often, the entire data science life cycle ends up as an arbitrary mess of notebook cells in a Jupyter Notebook, or as a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.

But there’s a better way! In this post, I’ll go over the two mindsets most people switch between when doing programming work specifically for data science: the prototype mindset and the production mindset.

Prototype mindset prioritizes:

  • iteration speed on small pieces of code
  • less abstraction (directly modifying code and data objects)
  • less structure to the code (less modularity)
  • helping you and others understand the code and data

Production mindset prioritizes:

  • iteration speed on the full pipeline
  • more abstraction (modifying parameter values instead)
  • more structure to the code (more modularity)
  • helping a computer run code automatically

I personally use JupyterLab for the entire process (both prototyping and productionizing). I recommend using JupyterLab at least for prototyping.

Lending Club data

To help more concretely understand the difference between the prototyping and the production mindset, let’s work with some real data. We’ll work with lending data from the peer-to-peer lending site, Lending Club. Unlike a bank, Lending Club doesn’t lend money itself. Lending Club is instead a marketplace for lenders to lend money to individuals who are seeking loans for a variety of reasons (home repairs, wedding costs, etc.). We can use this data to build models that will predict if a given loan application will be successful or not. We won’t dive into building a machine learning pipeline for making predictions in this post, but we cover it in our Machine Learning Project Walkthrough Course.

Lending Club offers detailed, historical data on both completed loans (loan applications that Lending Club approved and found lenders for) and declined loans (loan applications that Lending Club declined, so money never changed hands). Navigate to their data download page and select 2007-2011 under DOWNLOAD LOAN DATA.

Prototype mindset

In the prototype mindset, we’re interested in quickly iterating and trying to understand some properties and truths about the data. Create a new Jupyter notebook and add a Markdown cell that explains:

  • Any research you did on Lending Club to better understand the platform
  • Any information on the data set you downloaded

First things first, let’s read the CSV file into pandas.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv')
loans_2007.head(2)

We get two pieces of output: first, a warning.

Then the first few rows of the dataframe, which we'll avoid showing here (because the output is quite long).

We also got the following dataframe output:

[DataFrame output abridged: pandas parses the prospectus note line ("Notes offered by Prospectus (https://www.lendingclub.com/info/prospectus.action)") as the header, so the actual column names (id, member_id, loan_amnt, funded_amnt, ..., settlement_term) show up as data values instead of column headers.]

The warning lets us know that pandas' type inference for each column would be improved if we set the low_memory parameter to False when calling pandas.read_csv().

The second output is more problematic, because there are issues with how the DataFrame is storing the data. JupyterLab has a terminal environment built in, so we can open it and use the bash command head to observe the first two lines of the raw file:

head -2 LoanStats3a.csv

While the second line contains the column names we expect in a CSV file, it looks like the first line is throwing off the formatting of the DataFrame when pandas tries to parse the file.

Add a Markdown cell that details your observations and add a code cell that factors in the observations.

import pandas as pd
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1, low_memory=False)

Read the data dictionary from the Lending Club download page to understand which columns don't contain useful information for features. The desc and url columns seem to immediately fit this criterion.

The next step is to drop any columns where more than 50% of the rows are missing values. Use one cell to explore which columns meet that criterion, and another to actually drop them.

loans_2007.isnull().sum()/len(loans_2007)
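
The second cell, which actually drops those columns, might look something like this minimal sketch (keeping the 50% threshold and the desc / url drop from above):

# Drop the columns we identified as uninformative from the data dictionary
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
# Keep only columns with at least 50% non-missing values
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)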

Because we're using a Jupyter notebook to track our thoughts and our code, we're relying on the environment (via the IPython kernel) to keep track of changes to state. This frees us up to be freeform: move cells around, run the same code multiple times, and so on.

In general, code in the prototyping mindset should focus on:

通常,原型思维方式中的代码应重点关注:

  • Understandability
    • Markdown cells to describe our observations and assumptions
    • Small pieces of code for the actual logic
    • Lots of visualizations and counts (see the short example after this list)
  • Minimal abstractions
    • Most code shouldn’t be in functions (should feel more object-oriented)
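
For instance, prototype-style cells might be as simple as the following (hypothetical quick checks, one per notebook cell, on the loans_2007 DataFrame):

# Counts and summaries rather than functions -- one small, inspectable step per cell
loans_2007['loan_status'].value_counts()
loans_2007['loan_amnt'].describe()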

Let’s say we spent another hour exploring the data and writing markdown cells that describe the data cleaning we did. We can then switch over to the production mindset and make the code more robust.

Production mindset

In the production mindset, we want to focus on writing code that will generalize to more situations. In our case, we want our data cleaning code to work for any of the data sets from Lending Club (from other time periods). The best way to generalize our code is to turn it into a data pipeline. A data pipeline is designed using principles from functional programming, where data is modified within functions and then passed between functions.

Here’s a first iteration of this pipeline using a single function to encapsulate data cleaning code:

import pandas as pd

def import_clean(file_list):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        half_count = len(loans)/2
        loans = loans.dropna(thresh=half_count, axis=1)
        loans = loans.drop_duplicates()
        # Drop first group of features
        loans = loans.drop(["funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
        # Drop second group of features
        loans = loans.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
        # Drop third group of features
        loans = loans.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
        frames.append(loans)
    return frames
    
frames = import_clean(['LoanStats3a.csv'])

In the code above, we abstracted the code from earlier into a single function. The input to this function is a list of filenames and the output is a list of DataFrame objects.

In general, the production mindset should focus on:

  • Healthy abstractions
    • Code should generalize to be compatible with similar data sources
    • Code shouldn’t be so general that it becomes cumbersome to understand
  • Pipeline stability
    • Reliability should match how frequently it's run (daily? weekly? monthly?)

Switching between mindsets

Let’s say we tried to run the function for all of the data sets from Lending Club and Python returned errors. Some potential sources for errors:

  • Variance in column names in some of the files
  • Variance in columns being dropped because of the 50% missing value threshold
  • Different column types based on pandas type inference for that file

In those cases, we should actually switch back to our prototype notebook and investigate further. When we’ve determined that we want our pipeline to be more flexible and account for specific variations in the data, we can re-incorporate that back into the pipeline logic.

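Back in the prototype notebook, a quick way to check the first source of error might look like the following sketch (the second file name and the header-only read are illustrative assumptions):

import pandas as pd

files = ['LoanStats3a.csv', 'LoanStats2012.csv']
columns_by_file = {}
for file in files:
    # Read only the header row (skipping the prospectus note) to compare column names
    header = pd.read_csv(file, skiprows=1, nrows=0)
    columns_by_file[file] = set(header.columns)

# Columns that don't appear in every file
all_columns = set.union(*columns_by_file.values())
shared_columns = set.intersection(*columns_by_file.values())
print(all_columns - shared_columns)
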
Here's an example where we adapted the function to accommodate a different drop threshold value:

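A minimal sketch of that adaptation, assuming threshold is the fraction of missing values above which a column gets dropped (the feature-group drops from the earlier version are elided):

import pandas as pd

def import_clean(file_list, threshold=0.5):
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        loans = loans.drop(['desc', 'url'], axis=1)
        # Keep columns with at least (1 - threshold) non-missing values
        keep_count = int(len(loans) * (1 - threshold))
        loans = loans.dropna(thresh=keep_count, axis=1)
        loans = loans.drop_duplicates()
        # ... drop the same feature groups as in the earlier version ...
        frames.append(loans)
    return frames

frames = import_clean(['LoanStats3a.csv'], threshold=0.7)
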
The default value is still 0.5, but we can override it to 0.7 if we want.

Here are a few ways to make the pipeline more flexible, in decreasing priority:

  • Use optional, positional, and required arguments
  • Use if / then statements along with Boolean input values within the functions
  • Use new data structures (dictionaries, lists, etc.) to represent custom actions for specific datasets (see the sketch after this list)
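
For example, the third approach might use a dictionary of per-file column drops, as in this hypothetical sketch (the extra_drops parameter and the file-specific column lists are assumptions, not part of the original pipeline):

import pandas as pd

# Hypothetical custom actions: extra columns to drop for specific datasets
extra_drops = {
    'LoanStats3a.csv': ['desc', 'url'],
    'LoanStats2012.csv': ['desc', 'url', 'member_id'],
}

def import_clean(file_list, threshold=0.5, extra_drops=None):
    extra_drops = extra_drops or {}
    frames = []
    for file in file_list:
        loans = pd.read_csv(file, skiprows=1, low_memory=False)
        # Only drop the custom columns that actually exist in this file
        to_drop = [col for col in extra_drops.get(file, []) if col in loans.columns]
        loans = loans.drop(to_drop, axis=1)
        loans = loans.dropna(thresh=int(len(loans) * (1 - threshold)), axis=1)
        loans = loans.drop_duplicates()
        frames.append(loans)
    return frames

frames = import_clean(['LoanStats3a.csv'], extra_drops=extra_drops)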

This pipeline can scale to all phases of the data science workflow. Here’s some skeleton code that previews how this looks.

import pandas as pd
import matplotlib.pyplot as plt

def import_clean(file_list, threshold=0.5):
    ## Code
    pass

def visualize(df_list):
    # Find the most important features and generate pairwise scatter plots
    # Display visualizations and write to file.
    plt.savefig("scatter_plots.png")

def combine(df_list):
    # Combine dataframes and generate train and test sets
    # Drop features all dataframes don't share
    # Return both train and test dataframes
    return train, test

def train(train_df):
    # Train model
    return model

def validate(train_df, test_df):
    # K-fold cross validation
    # Return metrics dictionary
    return metrics_dict

frames = import_clean(['LoanStats3a.csv', 'LoanStats2012.csv'], threshold=0.7)
visualize(frames)
train_df, test_df = combine(frames)
model = train(train_df)
metrics = validate(train_df, test_df)
print(metrics)
 
Next steps

If you’re interested in deepening your understanding and practicing further, I recommend the following next steps:

Translated from: https://www.pybloggers.com/2018/06/programming-best-practices-for-data-science/
