使用Pandas、Numpy等工具库,完成以下数据操作
一、CSV数据读入
从指定路径下读取CSV数据文件,并将Loan_ID设为Index
数据文件train.csv在"./data/"路径下
打印输出前该数据集10行
import pandas as pd
import numpy as np
data= pd. read_csv( "train.csv" , index_col= 0 )
data. head( 10 )
Gender
Married
Dependents
Education
Self_Employed
ApplicantIncome
CoapplicantIncome
LoanAmount
Loan_Amount_Term
Credit_History
Property_Area
Loan_Status
Loan_ID
LP001002
Male
No
0
Graduate
No
5849
0.0
NaN
360.0
1.0
Urban
Y
LP001003
Male
Yes
1
Graduate
No
4583
1508.0
128.0
360.0
1.0
Rural
N
LP001005
Male
Yes
0
Graduate
Yes
3000
0.0
66.0
360.0
1.0
Urban
Y
LP001006
Male
Yes
0
Not Graduate
No
2583
2358.0
120.0
360.0
1.0
Urban
Y
LP001008
Male
No
0
Graduate
No
6000
0.0
141.0
360.0
1.0
Urban
Y
LP001011
Male
Yes
2
Graduate
Yes
5417
4196.0
267.0
360.0
1.0
Urban
Y
LP001013
Male
Yes
0
Not Graduate
No
2333
1516.0
95.0
360.0
1.0
Urban
Y
LP001014
Male
Yes
3+
Graduate
No
3036
2504.0
158.0
360.0
0.0
Semiurban
N
LP001018
Male
Yes
2
Graduate
No
4006
1526.0
168.0
360.0
1.0
Urban
Y
LP001020
Male
Yes
1
Graduate
No
12841
10968.0
349.0
360.0
1.0
Semiurban
N
二、数据选择
从数据集中得到“所有没有毕业”(Education: Not Graduate)并且“获得贷款”(Loan_Status: Y)的“女性”(Gender: Female),并输出“性别”(Gender)、“教育状况”(Education)及“贷款状态”(Loan_Status)。
Gender
Education
Loan_Status
Loan_ID
LP001155
Female
Not Graduate
Y
LP001669
Female
Not Graduate
Y
LP001692
Female
Not Graduate
Y
LP001908
Female
Not Graduate
Y
LP002300
Female
Not Graduate
Y
LP002314
Female
Not Graduate
Y
LP002407
Female
Not Graduate
Y
LP002489
Female
Not Graduate
Y
LP002502
Female
Not Graduate
Y
LP002534
Female
Not Graduate
Y
LP002582
Female
Not Graduate
Y
LP002731
Female
Not Graduate
Y
LP002757
Female
Not Graduate
Y
LP002917
Female
Not Graduate
Y
三、使用apply对数据集应用自定义函数
def num_missing ( x) :
return sum ( x. isnull( ) )
3.1 使用apply函数将num_missing函数用于统计数据集的每列缺失值数量
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
3.2 使用apply函数将num_missing函数用于统计数据集每行缺失值数量,并打印前10行
Loan_ID
LP001002 1
LP001003 0
LP001005 0
LP001006 0
LP001008 0
LP001011 0
LP001013 0
LP001014 0
LP001018 0
LP001020 0
dtype: int64
四、缺失值填充
4.1 对于Gender、Married、Self_Employed三个因子型变量,使用各自最常见的因子进行缺失值填充
4.2 对于LoanAmount变量进行缺失值填充处理
按照“Gender”、“Married”及“Self_Employed”的组合下的每个组群进行LoanAmount变量的均值统计
按照每组统计得到的平均值,对“LoanAmount”中缺失值进行填充
五、数据透视表
基于data数据,得到下表:
Loan_Status
N
Y
ALL
Credit_History
0.0
num
num
num
1.0
num
num
num
All
num
num
num
其中num代表统计数量。
pd. crosstab( data[ 'Credit_History' ] , data[ 'Loan_Status' ] , margins= True )
Loan_Status
N
Y
All
Credit_History
0.0
82
7
89
1.0
97
378
475
All
179
385
564
六、合并数据集
将prop_rates数据集与data数据集合并
基于合并后的数据集,按照“Property_Area”、“rates”的组合下的每个组群下Credit_History变量的样本数量统计
prop_rates = pd. DataFrame( [ 1000 , 5000 , 12000 ] , index= [ 'Rural' , 'Semiurban' , 'Urban' ] , columns= [ 'rates' ] )
prop_rates