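Pandas Data Manipulation Exercises

Use Pandas, NumPy, and related libraries to complete the following data manipulation tasks.

1. Reading CSV Data

  • Read the CSV data file from the given path and set Loan_ID as the index
  • The data file train.csv is located under the "./data/" path
  • Print the first 10 rows of the dataset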

Loan_ID   Gender  Married  Dependents  Education     Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
LP001002  Male    No       0           Graduate      No             5849             0.0                NaN         360.0             1.0             Urban          Y
LP001003  Male    Yes      1           Graduate      No             4583             1508.0             128.0       360.0             1.0             Rural          N
LP001005  Male    Yes      0           Graduate      Yes            3000             0.0                66.0        360.0             1.0             Urban          Y
LP001006  Male    Yes      0           Not Graduate  No             2583             2358.0             120.0       360.0             1.0             Urban          Y
LP001008  Male    No       0           Graduate      No             6000             0.0                141.0       360.0             1.0             Urban          Y
LP001011  Male    Yes      2           Graduate      Yes            5417             4196.0             267.0       360.0             1.0             Urban          Y
LP001013  Male    Yes      0           Not Graduate  No             2333             1516.0             95.0        360.0             1.0             Urban          Y
LP001014  Male    Yes      3+          Graduate      No             3036             2504.0             158.0       360.0             0.0             Semiurban      N
LP001018  Male    Yes      2           Graduate      No             4006             1526.0             168.0       360.0             1.0             Urban          Y
LP001020  Male    Yes      1           Graduate      No             12841            10968.0            349.0       360.0             1.0             Semiurban      N
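
The reading step can be sketched as follows. Because train.csv is not available here, the sketch reads three rows (and only a subset of the columns) from an in-memory string; `csv_text` is a stand-in introduced for illustration, not part of the exercise.

```python
import io

import pandas as pd

# Stand-in for "./data/train.csv": three rows and a few columns taken
# from the output shown above (assumption: a reduced column set).
csv_text = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Loan_Status
LP001002,Male,No,5849,,Y
LP001003,Male,Yes,4583,128.0,N
LP001005,Male,Yes,3000,66.0,Y
"""

# With the real file this would read:
#   data = pd.read_csv("./data/train.csv", index_col="Loan_ID")
data = pd.read_csv(io.StringIO(csv_text), index_col="Loan_ID")

# head(10) returns at most the first 10 rows.
first_rows = data.head(10)
print(first_rows)
```

Passing `index_col="Loan_ID"` sets the index while reading, which avoids a separate `set_index` call afterwards.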

2. Selecting Data

  • From the dataset, select all women (Gender: Female) who have not graduated (Education: Not Graduate) and who received a loan (Loan_Status: Y), and output the Gender, Education, and Loan_Status columns.

Loan_ID   Gender  Education     Loan_Status
LP001155  Female  Not Graduate  Y
LP001669  Female  Not Graduate  Y
LP001692  Female  Not Graduate  Y
LP001908  Female  Not Graduate  Y
LP002300  Female  Not Graduate  Y
LP002314  Female  Not Graduate  Y
LP002407  Female  Not Graduate  Y
LP002489  Female  Not Graduate  Y
LP002502  Female  Not Graduate  Y
LP002534  Female  Not Graduate  Y
LP002582  Female  Not Graduate  Y
LP002731  Female  Not Graduate  Y
LP002917  Female  Not Graduate  Y
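
One way to express this selection is a boolean mask combined with `.loc`. The sketch below runs on a small made-up frame (the IDs and rows are hypothetical), so its output differs from the table above.

```python
import pandas as pd

# Hypothetical toy data standing in for the loan dataset.
data = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Not Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "Y", "N"],
    },
    index=pd.Index(["ID1", "ID2", "ID3", "ID4"], name="Loan_ID"),
)

# Each condition is wrapped in parentheses because & binds tighter than ==.
mask = (
    (data["Gender"] == "Female")
    & (data["Education"] == "Not Graduate")
    & (data["Loan_Status"] == "Y")
)

# Row filter and column selection in a single .loc call.
selected = data.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```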

3. Applying a Custom Function to the Dataset with apply

def num_missing(x):
    return sum(x.isnull())

3.1 Use apply with num_missing to count the missing values in each column of the dataset

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

3.2 Use apply with num_missing to count the missing values in each row of the dataset, and print the first 10 rows

Loan_ID
LP001002    1
LP001003    0
LP001005    0
LP001006    0
LP001008    0
LP001011    0
LP001013    0
LP001014    0
LP001018    0
LP001020    0
dtype: int64
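
Both counts come from the same `apply` call with a different `axis`. The sketch below uses a hypothetical three-row frame to show the two directions side by side.

```python
import numpy as np
import pandas as pd


def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return sum(x.isnull())


# Hypothetical toy data with a few missing entries.
data = pd.DataFrame(
    {
        "Gender": ["Male", None, None],
        "LoanAmount": [np.nan, 128.0, np.nan],
    },
    index=pd.Index(["ID1", "ID2", "ID3"], name="Loan_ID"),
)

# axis=0 applies the function to each column.
missing_per_column = data.apply(num_missing, axis=0)

# axis=1 applies the function to each row.
missing_per_row = data.apply(num_missing, axis=1)

print(missing_per_column)
print(missing_per_row.head(10))
```

For this particular function, `data.isnull().sum()` and `data.isnull().sum(axis=1)` give the same counts without `apply`; the exercise uses `apply` to practise passing a custom function.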

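4. Filling Missing Values

4.1 For the three categorical variables Gender, Married, and Self_Employed, fill missing values with each variable's most frequent category
4.2 Fill the missing values of the LoanAmount variable
  • Compute the mean of LoanAmount within each group defined by the combination of Gender, Married, and Self_Employed
  • Fill the missing LoanAmount values with the mean of the corresponding group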
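
A minimal sketch of both steps, on hypothetical toy data. It uses `mode()` for the most frequent category and `groupby(...).transform("mean")` for the per-group mean; this is one possible approach, and the source does not show which calls were actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column.
data = pd.DataFrame(
    {
        "Gender": ["Male", "Male", None, "Female", "Male", "Female"],
        "Married": ["Yes", None, "Yes", "No", "Yes", "No"],
        "Self_Employed": ["No", "No", "No", None, "No", "No"],
        "LoanAmount": [100.0, np.nan, 140.0, 80.0, 120.0, np.nan],
    }
)

# 4.1: fill each categorical column with its most frequent value.
# mode() ignores missing values and may return ties; [0] takes the first.
for col in ["Gender", "Married", "Self_Employed"]:
    data[col] = data[col].fillna(data[col].mode()[0])

# 4.2: transform("mean") returns the group mean aligned to every row,
# so it can be passed directly to fillna.
group_mean = data.groupby(["Gender", "Married", "Self_Employed"])[
    "LoanAmount"
].transform("mean")
data["LoanAmount"] = data["LoanAmount"].fillna(group_mean)

print(data)
```

The categorical columns are filled first so that no row is dropped from the grouping because of a missing key. A group whose LoanAmount values are all missing would still be left unfilled by this approach.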

5. Pivot Table

Based on the data dataset, produce the following table:

Loan_Status       N    Y  All
Credit_History
0.0             num  num  num
1.0             num  num  num
All             num  num  num

where num stands for a count.

pd.crosstab(data['Credit_History'], data['Loan_Status'], margins=True)
Loan_Status       N    Y  All
Credit_History
0.0              82    7   89
1.0              97  378  475
All             179  385  564
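
The same call on a hypothetical toy frame makes the margins easier to check by hand. It also shows that rows with a missing Credit_History are left out of the table, which is consistent with the total of 564 above being smaller than the number of rows in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one row has no Credit_History.
data = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
        "Loan_Status": ["Y", "N", "N", "Y", "Y", "Y"],
    }
)

# margins=True appends the "All" row and column with the totals.
table = pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)
print(table)
```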

6. Merging Datasets

  • Merge the prop_rates dataset with the data dataset
  • On the merged dataset, count the Credit_History samples within each group defined by the combination of Property_Area and rates
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural', 'Semiurban', 'Urban'], columns=['rates'])
data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)  # inner join; sort=False leaves the result unsorted
Property_Area  rates  Credit_History
Rural          1000   179.0
Semiurban      5000   233.0
Urban          12000  202.0
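
The merge and the grouped count can be sketched together on hypothetical toy data. The sketch uses `size()`, which counts rows per group, on the assumption that the counts above include rows with a missing Credit_History; `count()` is shown next to it for contrast because it skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; one Urban row has no Credit_History.
data = pd.DataFrame(
    {
        "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Rural"],
        "Credit_History": [1.0, 0.0, np.nan, 1.0, 1.0],
    }
)

prop_rates = pd.DataFrame(
    [1000, 5000, 12000],
    index=["Rural", "Semiurban", "Urban"],
    columns=["rates"],
)

# Match the Property_Area column on the left with the index on the right.
merged = data.merge(
    right=prop_rates,
    how="inner",
    left_on="Property_Area",
    right_index=True,
    sort=False,
)

grouped = merged.groupby(["Property_Area", "rates"])["Credit_History"]
row_counts = grouped.size()        # all rows in each group
non_null_counts = grouped.count()  # only rows where Credit_History is present

print(row_counts)
print(non_null_counts)
```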

7. Sorting the Dataset

7.1 Sort the data dataset by the ApplicantIncome and CoapplicantIncome columns in descending order, and output the first 10 rows of the sorted dataset.

Loan_ID   ApplicantIncome  CoapplicantIncome
LP002317  81000            0.0
LP002101  63337            0.0
LP001585  51763            0.0
LP001536  39999            0.0
LP001640  39147            4750.0
LP002422  37719            0.0
LP001637  33846            0.0
LP001448  23803            0.0
LP002624  20833            6667.0
LP001922  20667            0.0
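
Sorting on two columns can be sketched with `sort_values`. The frame below is hypothetical and includes a tie on ApplicantIncome, so the effect of the second key is visible.

```python
import pandas as pd

# Hypothetical toy data; ID1 and ID3 tie on ApplicantIncome.
data = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 81000, 5000, 39999],
        "CoapplicantIncome": [0.0, 0.0, 1500.0, 0.0],
    },
    index=pd.Index(["ID1", "ID2", "ID3", "ID4"], name="Loan_ID"),
)

sort_cols = ["ApplicantIncome", "CoapplicantIncome"]

# ascending=False applies to both keys; the second key only breaks ties.
top = data.sort_values(sort_cols, ascending=False)[sort_cols].head(10)
print(top)
```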

8. Discretizing a Variable

  • Discretize the LoanAmount variable to obtain a new variable LoanAmount_Bin
  • Apply the following rules:
    • [min,90): low
    • [90,140): medium
    • [140,190): high
    • [190,max]: very_high

[9.0, 90, 140, 190, 700.0]

low           98
medium       274
high         150
very_high     91
Name: LoanAmount_Bin, dtype: int64
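
A minimal sketch of the binning with `pd.cut`, on hypothetical values chosen to sit on the interval edges. It follows the left-closed intervals as written in the rules, takes the fourth label very_high from the counts shown above, and uses an open upper edge (`np.inf`) so the maximum is not dropped; these are assumptions, and the original may well have used the default right-closed intervals instead.

```python
import numpy as np
import pandas as pd

# Hypothetical values placed on and around the cut points.
data = pd.DataFrame({"LoanAmount": [9.0, 89.0, 90.0, 139.0, 140.0, 190.0, 700.0]})

# Lower edge is the column minimum; np.inf keeps the maximum inside the
# last bin when right=False is used (assumption, see lead-in).
cut_points = [data["LoanAmount"].min(), 90, 140, 190, np.inf]
labels = ["low", "medium", "high", "very_high"]

# right=False makes every interval left-closed: [a, b).
data["LoanAmount_Bin"] = pd.cut(
    data["LoanAmount"], bins=cut_points, labels=labels, right=False
)

# sort=False keeps the category order instead of sorting by frequency.
print(data["LoanAmount_Bin"].value_counts(sort=False))
```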

9. Mapping a Variable

  • Map the characters in Loan_Status to numbers to obtain a new variable Loan_Status_Coded
  • The mapping is: {'N': 0, 'Y': 1}
  • Output the value counts of the Loan_Status_Coded variable

1    422
0    192
Name: Loan_Status_Coded, dtype: int64
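
The mapping can be sketched with `Series.map` and a dictionary. The four-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
data = pd.DataFrame({"Loan_Status": ["Y", "N", "Y", "Y"]})

# Values missing from the dictionary would become NaN.
data["Loan_Status_Coded"] = data["Loan_Status"].map({"N": 0, "Y": 1})

counts = data["Loan_Status_Coded"].value_counts()
print(counts)
```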

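10. One-Hot Encoding

  • One-hot encode the LoanAmount_Bin variable
  • This yields the new variables LoanAmount_low, LoanAmount_medium, LoanAmount_high, and LoanAmount_very_high
  • Merge the new variables into the data dataset and print its first 10 rows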

Loan_ID   Gender  Married  Dependents  Education     Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status  LoanAmount_Bin  Loan_Status_Coded  LoanAmount_low  LoanAmount_medium  LoanAmount_high  LoanAmount_very_high
LP001002  Male    No       0           Graduate      No             5849             0.0                129.936937  360.0             1.0             Urban          Y            medium          1                  0               1                  0                0
LP001003  Male    Yes      1           Graduate      No             4583             1508.0             128.000000  360.0             1.0             Rural          N            medium          0                  0               1                  0                0
LP001005  Male    Yes      0           Graduate      Yes            3000             0.0                66.000000   360.0             1.0             Urban          Y            low             1                  1               0                  0                0
LP001006  Male    Yes      0           Not Graduate  No             2583             2358.0             120.000000  360.0             1.0             Urban          Y            medium          1                  0               1                  0                0
LP001008  Male    No       0           Graduate      No             6000             0.0                141.000000  360.0             1.0             Urban          Y            high            1                  0               0                  1                0
LP001011  Male    Yes      2           Graduate      Yes            5417             4196.0             267.000000  360.0             1.0             Urban          Y            very_high       1                  0               0                  0                1
LP001013  Male    Yes      0           Not Graduate  No             2333             1516.0             95.000000   360.0             1.0             Urban          Y            medium          1                  0               1                  0                0
LP001014  Male    Yes      3+          Graduate      No             3036             2504.0             158.000000  360.0             0.0             Semiurban      N            high            0                  0               0                  1                0
LP001018  Male    Yes      2           Graduate      No             4006             1526.0             168.000000  360.0             1.0             Urban          Y            high            1                  0               0                  1                0
LP001020  Male    Yes      1           Graduate      No             12841            10968.0            349.000000  360.0             1.0             Semiurban      N            very_high       0                  0               0                  0                1
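
One-hot encoding can be sketched with `pd.get_dummies` followed by `pd.concat`. The frame is hypothetical; the prefix `LoanAmount` is chosen so the new column names match the headers in the table above.

```python
import pandas as pd

# Hypothetical toy data with one row per bin.
data = pd.DataFrame(
    {"LoanAmount_Bin": ["medium", "low", "very_high", "high"]},
    index=pd.Index(["ID1", "ID2", "ID3", "ID4"], name="Loan_ID"),
)

# prefix + "_" + category gives names such as LoanAmount_low.
dummies = pd.get_dummies(data["LoanAmount_Bin"], prefix="LoanAmount")

# Reorder from alphabetical to low -> very_high, and cast to 0/1 integers
# because recent pandas versions return booleans.
ordered = [
    "LoanAmount_low",
    "LoanAmount_medium",
    "LoanAmount_high",
    "LoanAmount_very_high",
]
dummies = dummies[ordered].astype(int)

# Append the indicator columns to the original frame, aligned on the index.
data = pd.concat([data, dummies], axis=1)
print(data.head(10))
```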