使用Pandas、Numpy等工具库,完成以下数据操作
一、CSV数据读入
从指定路径下读取CSV数据文件,并将Loan_ID设为Index 数据文件train.csv在"./data/"路径下 打印输出前该数据集10行
import pandas as pd
import numpy as np
data = pd. read_csv( "C:/Users/user/Desktop/train.csv" )
data. index = data[ 'Loan_ID' ]
data. head( 10 )
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status Loan_ID LP001002 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y LP001003 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N LP001005 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0 360.0 1.0 Urban Y LP001006 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.0 360.0 1.0 Urban Y LP001008 LP001008 Male No 0 Graduate No 6000 0.0 141.0 360.0 1.0 Urban Y LP001011 LP001011 Male Yes 2 Graduate Yes 5417 4196.0 267.0 360.0 1.0 Urban Y LP001013 LP001013 Male Yes 0 Not Graduate No 2333 1516.0 95.0 360.0 1.0 Urban Y LP001014 LP001014 Male Yes 3+ Graduate No 3036 2504.0 158.0 360.0 0.0 Semiurban N LP001018 LP001018 Male Yes 2 Graduate No 4006 1526.0 168.0 360.0 1.0 Urban Y LP001020 LP001020 Male Yes 1 Graduate No 12841 10968.0 349.0 360.0 1.0 Semiurban N
二、数据选择
从数据集中得到“所有没有毕业”(Education: Not Graduate)并且“获得贷款”(Loan_Status: Y)的“女性”(Gender: Female),并输出“性别”(Gender)、“教育状况”(Education)及“贷款状态”(Loan_Status)。
data. loc[ ( data. Education== 'Not Graduate' ) & ( data. Loan_Status== 'Y' ) & ( data. Gender== 'Female' ) , [ "Gender" , "Education" , "Loan_Status" ] ]
Gender Education Loan_Status Loan_ID LP001155 Female Not Graduate Y LP001669 Female Not Graduate Y LP001692 Female Not Graduate Y LP001908 Female Not Graduate Y LP002300 Female Not Graduate Y LP002314 Female Not Graduate Y LP002407 Female Not Graduate Y LP002489 Female Not Graduate Y LP002502 Female Not Graduate Y LP002534 Female Not Graduate Y LP002582 Female Not Graduate Y LP002731 Female Not Graduate Y LP002757 Female Not Graduate Y LP002917 Female Not Graduate Y
data. loc[ : , 'ki_ki' ] = 'nope'
data
Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status new_col ki_ki Loan_ID LP001002 LP001002 Male No 0 Graduate No 5849 0.0 NaN 360.0 1.0 Urban Y test nope LP001003 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0 360.0 1.0 Rural N test nope LP001005 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0 360.0 1.0 Urban Y test nope LP001006 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.0 360.0 1.0 Urban Y test nope LP001008 LP001008 Male No 0 Graduate No 6000 0.0 141.0 360.0 1.0 Urban Y test nope LP001011 LP001011 Male Yes 2 Graduate Yes 5417 4196.0 267.0 360.0 1.0 Urban Y test nope LP001013 LP001013 Male Yes 0 Not Graduate No 2333 1516.0 95.0 360.0 1.0 Urban Y test nope LP001014 LP001014 Male Yes 3+ Graduate No 3036 2504.0 158.0 360.0 0.0 Semiurban N test nope LP001018 LP001018 Male Yes 2 Graduate No 4006 1526.0 168.0 360.0 1.0 Urban Y test nope LP001020 LP001020 Male Yes 1 Graduate No 12841 10968.0 349.0 360.0 1.0 Semiurban N test nope LP001024 LP001024 Male Yes 2 Graduate No 3200 700.0 70.0 360.0 1.0 Urban Y test nope LP001027 LP001027 Male Yes 2 Graduate NaN 2500 1840.0 109.0 360.0 1.0 Urban Y test nope LP001028 LP001028 Male Yes 2 Graduate No 3073 8106.0 200.0 360.0 1.0 Urban Y test nope LP001029 LP001029 Male No 0 Graduate No 1853 2840.0 114.0 360.0 1.0 Rural N test nope LP001030 LP001030 Male Yes 2 Graduate No 1299 1086.0 17.0 120.0 1.0 Urban Y test nope LP001032 LP001032 Male No 0 Graduate No 4950 0.0 125.0 360.0 1.0 Urban Y test nope LP001034 LP001034 Male No 1 Not Graduate No 3596 0.0 100.0 240.0 NaN Urban Y test nope LP001036 LP001036 Female No 0 Graduate No 3510 0.0 76.0 360.0 0.0 Urban N test nope LP001038 LP001038 Male Yes 0 Not Graduate No 4887 0.0 133.0 360.0 1.0 Rural N test nope LP001041 LP001041 Male Yes 0 Graduate NaN 2600 3500.0 115.0 NaN 1.0 Urban Y test nope LP001043 LP001043 Male Yes 0 Not Graduate No 7660 0.0 104.0 360.0 0.0 Urban N test nope LP001046 LP001046 Male Yes 1 Graduate No 5955 5625.0 315.0 360.0 1.0 Urban Y test nope LP001047 LP001047 Male Yes 0 Not Graduate No 2600 1911.0 116.0 360.0 0.0 Semiurban N test nope LP001050 LP001050 NaN Yes 2 Not Graduate No 3365 1917.0 112.0 360.0 0.0 Rural N test nope LP001052 LP001052 Male Yes 1 Graduate NaN 3717 2925.0 151.0 360.0 NaN Semiurban N test nope LP001066 LP001066 Male Yes 0 Graduate Yes 9560 0.0 191.0 360.0 1.0 Semiurban Y test nope LP001068 LP001068 Male Yes 0 Graduate No 2799 2253.0 122.0 360.0 1.0 Semiurban Y test nope LP001073 LP001073 Male Yes 2 Not Graduate No 4226 1040.0 110.0 360.0 1.0 Urban Y test nope LP001086 LP001086 Male No 0 Not Graduate No 1442 0.0 35.0 360.0 1.0 Urban N test nope LP001087 LP001087 Female No 2 Graduate NaN 3750 2083.0 120.0 360.0 1.0 Semiurban Y test nope ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... LP002911 LP002911 Male Yes 1 Graduate No 2787 1917.0 146.0 360.0 0.0 Rural N test nope LP002912 LP002912 Male Yes 1 Graduate No 4283 3000.0 172.0 84.0 1.0 Rural N test nope LP002916 LP002916 Male Yes 0 Graduate No 2297 1522.0 104.0 360.0 1.0 Urban Y test nope LP002917 LP002917 Female No 0 Not Graduate No 2165 0.0 70.0 360.0 1.0 Semiurban Y test nope LP002925 LP002925 NaN No 0 Graduate No 4750 0.0 94.0 360.0 1.0 Semiurban Y test nope LP002926 LP002926 Male Yes 2 Graduate Yes 2726 0.0 106.0 360.0 0.0 Semiurban N test nope LP002928 LP002928 Male Yes 0 Graduate No 3000 3416.0 56.0 180.0 1.0 Semiurban Y test nope LP002931 LP002931 Male Yes 2 Graduate Yes 6000 0.0 205.0 240.0 1.0 Semiurban N test nope LP002933 LP002933 NaN No 3+ Graduate Yes 9357 0.0 292.0 360.0 1.0 Semiurban Y test nope LP002936 LP002936 Male Yes 0 Graduate No 3859 3300.0 142.0 180.0 1.0 Rural Y test nope LP002938 LP002938 Male Yes 0 Graduate Yes 16120 0.0 260.0 360.0 1.0 Urban Y test nope LP002940 LP002940 Male No 0 Not Graduate No 3833 0.0 110.0 360.0 1.0 Rural Y test nope LP002941 LP002941 Male Yes 2 Not Graduate Yes 6383 1000.0 187.0 360.0 1.0 Rural N test nope LP002943 LP002943 Male No NaN Graduate No 2987 0.0 88.0 360.0 0.0 Semiurban N test nope LP002945 LP002945 Male Yes 0 Graduate Yes 9963 0.0 180.0 360.0 1.0 Rural Y test nope LP002948 LP002948 Male Yes 2 Graduate No 5780 0.0 192.0 360.0 1.0 Urban Y test nope LP002949 LP002949 Female No 3+ Graduate NaN 416 41667.0 350.0 180.0 NaN Urban N test nope LP002950 LP002950 Male Yes 0 Not Graduate NaN 2894 2792.0 155.0 360.0 1.0 Rural Y test nope LP002953 LP002953 Male Yes 3+ Graduate No 5703 0.0 128.0 360.0 1.0 Urban Y test nope LP002958 LP002958 Male No 0 Graduate No 3676 4301.0 172.0 360.0 1.0 Rural Y test nope LP002959 LP002959 Female Yes 1 Graduate No 12000 0.0 496.0 360.0 1.0 Semiurban Y test nope LP002960 LP002960 Male Yes 0 Not Graduate No 2400 3800.0 NaN 180.0 1.0 Urban N test nope LP002961 LP002961 Male Yes 1 Graduate No 3400 2500.0 173.0 360.0 1.0 Semiurban Y test nope LP002964 LP002964 Male Yes 2 Not Graduate No 3987 1411.0 157.0 360.0 1.0 Rural Y test nope LP002974 LP002974 Male Yes 0 Graduate No 3232 1950.0 108.0 360.0 1.0 Rural Y test nope LP002978 LP002978 Female No 0 Graduate No 2900 0.0 71.0 360.0 1.0 Rural Y test nope LP002979 LP002979 Male Yes 3+ Graduate No 4106 0.0 40.0 180.0 1.0 Rural Y test nope LP002983 LP002983 Male Yes 1 Graduate No 8072 240.0 253.0 360.0 1.0 Urban Y test nope LP002984 LP002984 Male Yes 2 Graduate No 7583 0.0 187.0 360.0 1.0 Urban Y test nope LP002990 LP002990 Female No 0 Graduate Yes 4583 0.0 133.0 360.0 0.0 Semiurban N test nope
614 rows × 15 columns
三、使用apply对数据集应用自定义函数
def num_missing ( x) :
return sum ( x. isnull( ) )
sum ( data[ 'Gender' ] . isnull( ) )
13
3.1 使用apply函数将num_missing函数用于统计数据集的每列缺失值数量
data. apply ( num_missing , axis= 0 )
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
new_col 0
ki_ki 0
dtype: int64
3.2 使用apply函数将num_missing函数用于统计数据集每行缺失值数量,并打印前10行
data. apply ( num_missing , axis= 1 ) [ : 10 ]
Loan_ID
LP001002 1
LP001003 0
LP001005 0
LP001006 0
LP001008 0
LP001011 0
LP001013 0
LP001014 0
LP001018 0
LP001020 0
dtype: int64
四、缺失值填充
4.1 对于Gender、Married、Self_Employed三个因子型变量,使用各自最常见的因子进行缺失值填充
gender_mode = data[ 'Gender' ] . mode( ) . iloc[ 0 ]
data[ 'Gender' ] . fillna( gender_mode)
Loan_ID
LP001002 Male
LP001003 Male
LP001005 Male
LP001006 Male
LP001008 Male
LP001011 Male
LP001013 Male
LP001014 Male
LP001018 Male
LP001020 Male
LP001024 Male
LP001027 Male
LP001028 Male
LP001029 Male
LP001030 Male
LP001032 Male
LP001034 Male
LP001036 Female
LP001038 Male
LP001041 Male
LP001043 Male
LP001046 Male
LP001047 Male
LP001050 Male
LP001052 Male
LP001066 Male
LP001068 Male
LP001073 Male
LP001086 Male
LP001087 Female
...
LP002911 Male
LP002912 Male
LP002916 Male
LP002917 Female
LP002925 Male
LP002926 Male
LP002928 Male
LP002931 Male
LP002933 Male
LP002936 Male
LP002938 Male
LP002940 Male
LP002941 Male
LP002943 Male
LP002945 Male
LP002948 Male
LP002949 Female
LP002950 Male
LP002953 Male
LP002958 Male
LP002959 Female
LP002960 Male
LP002961 Male
LP002964 Male
LP002974 Male
LP002978 Female
LP002979 Male
LP002983 Male
LP002984 Male
LP002990 Female
Name: Gender, Length: 614, dtype: object
data[ 'Gender' ] . fillna( data[ 'Gender' ] . mode( ) . iloc[ 0 ] , inplace= True )
data[ 'Married' ] . fillna( data[ 'Married' ] . iloc[ 0 ] , inplace= True )
data[ 'Self_Employed' ] . fillna( data[ 'Self_Employed' ] . iloc[ 0 ] , inplace= True )
data. apply ( num_missing, axis= 0 )
Loan_ID 0
Gender 0
Married 0
Dependents 15
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
new_col 0
ki_ki 0
dtype: int64
4.2 对于LoanAmount变量进行缺失值填充处理
按照“Gender”、“Married”及“Self_Employed”的组合下的每个组群进行LoanAmount变量的均值统计 按照每组统计得到的平均值,对“LoanAmount”中缺失值进行填充
impute_grps = data. pivot_table( values= [ "LoanAmount" ] , index= [ "Gender" , "Married" , "Self_Employed" ] , aggfunc= np. mean)
for i, row in data. loc[ data[ "LoanAmount" ] . isnull( ) , : ] . iterrows( ) :
ind = tuple ( [ row[ "Gender" ] , row[ "Married" ] , row[ "Self_Employed" ] ] )
data. loc[ i, "LoanAmount" ] = impute_grps. loc[ ind] . values[ 0 ]
grps = data. groupby( [ "Gender" , "Married" , "Self_Employed" ] )
grps[ "LoanAmount" ] . mean( )
Gender Married Self_Employed
Female No No 114.691176
Yes 125.800000
Yes No 134.222222
Yes 282.250000
Male No No 130.451327
Yes 180.588235
Yes No 153.849180
Yes 169.395833
Name: LoanAmount, dtype: float64
answer = data. pivot_table( values= [ 'LoanAmount' ] , index= [ 'Gender' , 'Married' , 'Self_Employed' ] , aggfunc= np. mean)
data. apply ( num_missing, axis= 0 )
五、数据透视表
基于data数据,得到下表:
Loan_Status N Y ALL Credit_History 0.0 num num num 1.0 num num num All num num num
其中num代表统计数量。
Loan_Status N Y All Credit_History 0.0 82 7 89 1.0 97 378 475 All 179 385 564
六、合并数据集
将prop_rates数据集与data数据集合并 基于合并后的数据集,按照“Property_Area”、“rates”的组合下的每个组群下Credit_History变量的样本数量统计
prop_rates = pd. DataFrame( [ 1000 , 5000 , 12000 ] , index= [ 'Rural' , 'Semiurban' , 'Urban' ] , columns= [ 'rates' ] )
Credit_History Property_Area rates Rural 1000 179.0 Semiurban 5000 233.0 Urban 12000 202.0
七、数据集排序
7.1 将data数据集按照ApplicantIncome、CoapplicantIncome 两列变量值进行降序排列,并输出排序后数据集的前10行。
ApplicantIncome CoapplicantIncome Loan_ID LP002317 81000 0.0 LP002101 63337 0.0 LP001585 51763 0.0 LP001536 39999 0.0 LP001640 39147 4750.0 LP002422 37719 0.0 LP001637 33846 0.0 LP001448 23803 0.0 LP002624 20833 6667.0 LP001922 20667 0.0
八、变量离散化
将LoanAmount变量离散化,得到新的变量LoanAmount_Bin 按照以下条件进行处理:
[min,90):low [90,140):medium [140,190): high
[9.0, 90, 140, 190, 700.0]
low 98
medium 274
high 150
very_high 91
Name: LoanAmount_Bin, dtype: int64
九、变量映射
将变量Loan_Status中的字符映射为数字,得到新的变量Loan_Status_Coded 映射方法为:{‘N’: 0; ‘Y’: 1} 输出Loan_Status_Coded变量的类型统计数值
1 422
0 192
Name: Loan_Status_Coded, dtype: int64
十、独热编码
将LoanAmount_Bin变量进行独热编码 得到新变量:LoanAmount_low, Loan_Amount_medium, Loan_Amount_high, Loan_Amount_very_high 将新变量合并到data数据集,并打印数据集前10行
Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History Property_Area Loan_Status LoanAmount_Bin Loan_Status_Coded LoanAmount_low LoanAmount_medium LoanAmount_high LoanAmount_very_high Loan_ID LP001002 Male No 0 Graduate No 5849 0.0 129.936937 360.0 1.0 Urban Y medium 1 0 1 0 0 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.000000 360.0 1.0 Rural N medium 0 0 1 0 0 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.000000 360.0 1.0 Urban Y low 1 1 0 0 0 LP001006 Male Yes 0 Not Graduate No 2583 2358.0 120.000000 360.0 1.0 Urban Y medium 1 0 1 0 0 LP001008 Male No 0 Graduate No 6000 0.0 141.000000 360.0 1.0 Urban Y high 1 0 0 1 0 LP001011 Male Yes 2 Graduate Yes 5417 4196.0 267.000000 360.0 1.0 Urban Y very_high 1 0 0 0 1 LP001013 Male Yes 0 Not Graduate No 2333 1516.0 95.000000 360.0 1.0 Urban Y medium 1 0 1 0 0 LP001014 Male Yes 3+ Graduate No 3036 2504.0 158.000000 360.0 0.0 Semiurban N high 0 0 0 1 0 LP001018 Male Yes 2 Graduate No 4006 1526.0 168.000000 360.0 1.0 Urban Y high 1 0 0 1 0 LP001020 Male Yes 1 Graduate No 12841 10968.0 349.000000 360.0 1.0 Semiurban N very_high 0 0 0 0 1