上课Pandas数据操作练习

使用Pandas、Numpy等工具库,完成以下数据操作

一、CSV数据读入

  • 从指定路径下读取CSV数据文件,并将Loan_ID设为Index
  • 数据文件train.csv在"./data/"路径下
  • 打印输出前该数据集10行
import pandas as pd
import numpy as np

data = pd.read_csv("C:/Users/user/Desktop/train.csv")
data.index = data['Loan_ID']
data.head(10)
Loan_IDGenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_Status
Loan_ID
LP001002LP001002MaleNo0GraduateNo58490.0NaN360.01.0UrbanY
LP001003LP001003MaleYes1GraduateNo45831508.0128.0360.01.0RuralN
LP001005LP001005MaleYes0GraduateYes30000.066.0360.01.0UrbanY
LP001006LP001006MaleYes0Not GraduateNo25832358.0120.0360.01.0UrbanY
LP001008LP001008MaleNo0GraduateNo60000.0141.0360.01.0UrbanY
LP001011LP001011MaleYes2GraduateYes54174196.0267.0360.01.0UrbanY
LP001013LP001013MaleYes0Not GraduateNo23331516.095.0360.01.0UrbanY
LP001014LP001014MaleYes3+GraduateNo30362504.0158.0360.00.0SemiurbanN
LP001018LP001018MaleYes2GraduateNo40061526.0168.0360.01.0UrbanY
LP001020LP001020MaleYes1GraduateNo1284110968.0349.0360.01.0SemiurbanN

二、数据选择

  • 从数据集中得到“所有没有毕业”(Education: Not Graduate)并且“获得贷款”(Loan_Status: Y)的“女性”(Gender: Female),并输出“性别”(Gender)、“教育状况”(Education)及“贷款状态”(Loan_Status)。
data.loc[(data.Education=='Not Graduate') & (data.Loan_Status=='Y') & (data.Gender=='Female'),["Gender","Education","Loan_Status"]]
GenderEducationLoan_Status
Loan_ID
LP001155FemaleNot GraduateY
LP001669FemaleNot GraduateY
LP001692FemaleNot GraduateY
LP001908FemaleNot GraduateY
LP002300FemaleNot GraduateY
LP002314FemaleNot GraduateY
LP002407FemaleNot GraduateY
LP002489FemaleNot GraduateY
LP002502FemaleNot GraduateY
LP002534FemaleNot GraduateY
LP002582FemaleNot GraduateY
LP002731FemaleNot GraduateY
LP002757FemaleNot GraduateY
LP002917FemaleNot GraduateY
data.loc[:,'ki_ki'] = 'nope'
data
Loan_IDGenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_Statusnew_colki_ki
Loan_ID
LP001002LP001002MaleNo0GraduateNo58490.0NaN360.01.0UrbanYtestnope
LP001003LP001003MaleYes1GraduateNo45831508.0128.0360.01.0RuralNtestnope
LP001005LP001005MaleYes0GraduateYes30000.066.0360.01.0UrbanYtestnope
LP001006LP001006MaleYes0Not GraduateNo25832358.0120.0360.01.0UrbanYtestnope
LP001008LP001008MaleNo0GraduateNo60000.0141.0360.01.0UrbanYtestnope
LP001011LP001011MaleYes2GraduateYes54174196.0267.0360.01.0UrbanYtestnope
LP001013LP001013MaleYes0Not GraduateNo23331516.095.0360.01.0UrbanYtestnope
LP001014LP001014MaleYes3+GraduateNo30362504.0158.0360.00.0SemiurbanNtestnope
LP001018LP001018MaleYes2GraduateNo40061526.0168.0360.01.0UrbanYtestnope
LP001020LP001020MaleYes1GraduateNo1284110968.0349.0360.01.0SemiurbanNtestnope
LP001024LP001024MaleYes2GraduateNo3200700.070.0360.01.0UrbanYtestnope
LP001027LP001027MaleYes2GraduateNaN25001840.0109.0360.01.0UrbanYtestnope
LP001028LP001028MaleYes2GraduateNo30738106.0200.0360.01.0UrbanYtestnope
LP001029LP001029MaleNo0GraduateNo18532840.0114.0360.01.0RuralNtestnope
LP001030LP001030MaleYes2GraduateNo12991086.017.0120.01.0UrbanYtestnope
LP001032LP001032MaleNo0GraduateNo49500.0125.0360.01.0UrbanYtestnope
LP001034LP001034MaleNo1Not GraduateNo35960.0100.0240.0NaNUrbanYtestnope
LP001036LP001036FemaleNo0GraduateNo35100.076.0360.00.0UrbanNtestnope
LP001038LP001038MaleYes0Not GraduateNo48870.0133.0360.01.0RuralNtestnope
LP001041LP001041MaleYes0GraduateNaN26003500.0115.0NaN1.0UrbanYtestnope
LP001043LP001043MaleYes0Not GraduateNo76600.0104.0360.00.0UrbanNtestnope
LP001046LP001046MaleYes1GraduateNo59555625.0315.0360.01.0UrbanYtestnope
LP001047LP001047MaleYes0Not GraduateNo26001911.0116.0360.00.0SemiurbanNtestnope
LP001050LP001050NaNYes2Not GraduateNo33651917.0112.0360.00.0RuralNtestnope
LP001052LP001052MaleYes1GraduateNaN37172925.0151.0360.0NaNSemiurbanNtestnope
LP001066LP001066MaleYes0GraduateYes95600.0191.0360.01.0SemiurbanYtestnope
LP001068LP001068MaleYes0GraduateNo27992253.0122.0360.01.0SemiurbanYtestnope
LP001073LP001073MaleYes2Not GraduateNo42261040.0110.0360.01.0UrbanYtestnope
LP001086LP001086MaleNo0Not GraduateNo14420.035.0360.01.0UrbanNtestnope
LP001087LP001087FemaleNo2GraduateNaN37502083.0120.0360.01.0SemiurbanYtestnope
................................................
LP002911LP002911MaleYes1GraduateNo27871917.0146.0360.00.0RuralNtestnope
LP002912LP002912MaleYes1GraduateNo42833000.0172.084.01.0RuralNtestnope
LP002916LP002916MaleYes0GraduateNo22971522.0104.0360.01.0UrbanYtestnope
LP002917LP002917FemaleNo0Not GraduateNo21650.070.0360.01.0SemiurbanYtestnope
LP002925LP002925NaNNo0GraduateNo47500.094.0360.01.0SemiurbanYtestnope
LP002926LP002926MaleYes2GraduateYes27260.0106.0360.00.0SemiurbanNtestnope
LP002928LP002928MaleYes0GraduateNo30003416.056.0180.01.0SemiurbanYtestnope
LP002931LP002931MaleYes2GraduateYes60000.0205.0240.01.0SemiurbanNtestnope
LP002933LP002933NaNNo3+GraduateYes93570.0292.0360.01.0SemiurbanYtestnope
LP002936LP002936MaleYes0GraduateNo38593300.0142.0180.01.0RuralYtestnope
LP002938LP002938MaleYes0GraduateYes161200.0260.0360.01.0UrbanYtestnope
LP002940LP002940MaleNo0Not GraduateNo38330.0110.0360.01.0RuralYtestnope
LP002941LP002941MaleYes2Not GraduateYes63831000.0187.0360.01.0RuralNtestnope
LP002943LP002943MaleNoNaNGraduateNo29870.088.0360.00.0SemiurbanNtestnope
LP002945LP002945MaleYes0GraduateYes99630.0180.0360.01.0RuralYtestnope
LP002948LP002948MaleYes2GraduateNo57800.0192.0360.01.0UrbanYtestnope
LP002949LP002949FemaleNo3+GraduateNaN41641667.0350.0180.0NaNUrbanNtestnope
LP002950LP002950MaleYes0Not GraduateNaN28942792.0155.0360.01.0RuralYtestnope
LP002953LP002953MaleYes3+GraduateNo57030.0128.0360.01.0UrbanYtestnope
LP002958LP002958MaleNo0GraduateNo36764301.0172.0360.01.0RuralYtestnope
LP002959LP002959FemaleYes1GraduateNo120000.0496.0360.01.0SemiurbanYtestnope
LP002960LP002960MaleYes0Not GraduateNo24003800.0NaN180.01.0UrbanNtestnope
LP002961LP002961MaleYes1GraduateNo34002500.0173.0360.01.0SemiurbanYtestnope
LP002964LP002964MaleYes2Not GraduateNo39871411.0157.0360.01.0RuralYtestnope
LP002974LP002974MaleYes0GraduateNo32321950.0108.0360.01.0RuralYtestnope
LP002978LP002978FemaleNo0GraduateNo29000.071.0360.01.0RuralYtestnope
LP002979LP002979MaleYes3+GraduateNo41060.040.0180.01.0RuralYtestnope
LP002983LP002983MaleYes1GraduateNo8072240.0253.0360.01.0UrbanYtestnope
LP002984LP002984MaleYes2GraduateNo75830.0187.0360.01.0UrbanYtestnope
LP002990LP002990FemaleNo0GraduateYes45830.0133.0360.00.0SemiurbanNtestnope

614 rows × 15 columns

三、使用apply对数据集应用自定义函数

def num_missing(x):
    return sum(x.isnull())
sum(data['Gender'].isnull())
13
3.1 使用apply函数将num_missing函数用于统计数据集的每列缺失值数量
data.apply(num_missing , axis=0)
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
new_col               0
ki_ki                 0
dtype: int64
3.2 使用apply函数将num_missing函数用于统计数据集每行缺失值数量,并打印前10行
data.apply(num_missing , axis=1)[:10]
Loan_ID
LP001002    1
LP001003    0
LP001005    0
LP001006    0
LP001008    0
LP001011    0
LP001013    0
LP001014    0
LP001018    0
LP001020    0
dtype: int64

四、缺失值填充

4.1 对于Gender、Married、Self_Employed三个因子型变量,使用各自最常见的因子进行缺失值填充
gender_mode = data['Gender'].mode().iloc[0]
data['Gender'].fillna(gender_mode)
Loan_ID
LP001002      Male
LP001003      Male
LP001005      Male
LP001006      Male
LP001008      Male
LP001011      Male
LP001013      Male
LP001014      Male
LP001018      Male
LP001020      Male
LP001024      Male
LP001027      Male
LP001028      Male
LP001029      Male
LP001030      Male
LP001032      Male
LP001034      Male
LP001036    Female
LP001038      Male
LP001041      Male
LP001043      Male
LP001046      Male
LP001047      Male
LP001050      Male
LP001052      Male
LP001066      Male
LP001068      Male
LP001073      Male
LP001086      Male
LP001087    Female
             ...  
LP002911      Male
LP002912      Male
LP002916      Male
LP002917    Female
LP002925      Male
LP002926      Male
LP002928      Male
LP002931      Male
LP002933      Male
LP002936      Male
LP002938      Male
LP002940      Male
LP002941      Male
LP002943      Male
LP002945      Male
LP002948      Male
LP002949    Female
LP002950      Male
LP002953      Male
LP002958      Male
LP002959    Female
LP002960      Male
LP002961      Male
LP002964      Male
LP002974      Male
LP002978    Female
LP002979      Male
LP002983      Male
LP002984      Male
LP002990    Female
Name: Gender, Length: 614, dtype: object
data['Gender'].fillna(data['Gender'].mode().iloc[0],inplace=True) #mode()由于取众数,iloc是按行数取值的,inplace是表明在当前数据上进行修改。
data['Married'].fillna(data['Married'].iloc[0],inplace=True)
data['Self_Employed'].fillna(data['Self_Employed'].iloc[0],inplace=True)
data.apply(num_missing,axis=0)
Loan_ID               0
Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
new_col               0
ki_ki                 0
dtype: int64
4.2 对于LoanAmount变量进行缺失值填充处理
  • 按照“Gender”、“Married”及“Self_Employed”的组合下的每个组群进行LoanAmount变量的均值统计
  • 按照每组统计得到的平均值,对“LoanAmount”中缺失值进行填充
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender", "Married", "Self_Employed"], aggfunc=np.mean)#pivot_table 建立一个数据透视表(还有groupbyhecross_table),index为索引
                                                                                                      #利用“values”域显式地定义我们关心的列,就可以实现移除那些不关心的列。
                                                                                                      #aggfunc用于添加功能,统计方法
for i, row in data.loc[data["LoanAmount"].isnull(),:].iterrows(): #生产器iterrows index=i row=值
    ind = tuple([row["Gender"], row["Married"], row["Self_Employed"]])#组成条件
    data.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]
grps = data.groupby(["Gender", "Married", "Self_Employed"]) #先分组
grps["LoanAmount"].mean()  #然后再进行其他操作,通用化
Gender  Married  Self_Employed
Female  No       No               114.691176
                 Yes              125.800000
        Yes      No               134.222222
                 Yes              282.250000
Male    No       No               130.451327
                 Yes              180.588235
        Yes      No               153.849180
                 Yes              169.395833
Name: LoanAmount, dtype: float64
answer = data.pivot_table(values=['LoanAmount'],index=['Gender','Married','Self_Employed'],aggfunc=np.mean)
data.apply(num_missing,axis=0)

五、数据透视表

基于data数据,得到下表:

Loan_StatusNYALL
Credit_History
0.0numnumnum
1.0numnumnum
Allnumnumnum

其中num代表统计数量。


Loan_StatusNYAll
Credit_History
0.082789
1.097378475
All179385564

六、合并数据集

  • 将prop_rates数据集与data数据集合并
  • 基于合并后的数据集,按照“Property_Area”、“rates”的组合下的每个组群下Credit_History变量的样本数量统计
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])

Credit_History
Property_Arearates
Rural1000179.0
Semiurban5000233.0
Urban12000202.0

七、数据集排序

7.1 将data数据集按照ApplicantIncome、CoapplicantIncome 两列变量值进行降序排列,并输出排序后数据集的前10行。

ApplicantIncomeCoapplicantIncome
Loan_ID
LP002317810000.0
LP002101633370.0
LP001585517630.0
LP001536399990.0
LP001640391474750.0
LP002422377190.0
LP001637338460.0
LP001448238030.0
LP002624208336667.0
LP001922206670.0

八、变量离散化

  • 将LoanAmount变量离散化,得到新的变量LoanAmount_Bin
  • 按照以下条件进行处理:
    • [min,90):low
    • [90,140):medium
    • [140,190): high

[9.0, 90, 140, 190, 700.0]





low           98
medium       274
high         150
very_high     91
Name: LoanAmount_Bin, dtype: int64

九、变量映射

  • 将变量Loan_Status中的字符映射为数字,得到新的变量Loan_Status_Coded
  • 映射方法为:{‘N’: 0; ‘Y’: 1}
  • 输出Loan_Status_Coded变量的类型统计数值

1    422
0    192
Name: Loan_Status_Coded, dtype: int64

十、独热编码

  • 将LoanAmount_Bin变量进行独热编码
  • 得到新变量:LoanAmount_low, Loan_Amount_medium, Loan_Amount_high, Loan_Amount_very_high
  • 将新变量合并到data数据集,并打印数据集前10行

GenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_StatusLoanAmount_BinLoan_Status_CodedLoanAmount_lowLoanAmount_mediumLoanAmount_highLoanAmount_very_high
Loan_ID
LP001002MaleNo0GraduateNo58490.0129.936937360.01.0UrbanYmedium10100
LP001003MaleYes1GraduateNo45831508.0128.000000360.01.0RuralNmedium00100
LP001005MaleYes0GraduateYes30000.066.000000360.01.0UrbanYlow11000
LP001006MaleYes0Not GraduateNo25832358.0120.000000360.01.0UrbanYmedium10100
LP001008MaleNo0GraduateNo60000.0141.000000360.01.0UrbanYhigh10010
LP001011MaleYes2GraduateYes54174196.0267.000000360.01.0UrbanYvery_high10001
LP001013MaleYes0Not GraduateNo23331516.095.000000360.01.0UrbanYmedium10100
LP001014MaleYes3+GraduateNo30362504.0158.000000360.00.0SemiurbanNhigh00010
LP001018MaleYes2GraduateNo40061526.0168.000000360.01.0UrbanYhigh10010
LP001020MaleYes1GraduateNo1284110968.0349.000000360.01.0SemiurbanNvery_high00001
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值