使用Pandas、Numpy等工具库,完成以下数据操作
一、CSV数据读入
- 从指定路径下读取CSV数据文件,并将Loan_ID设为Index
- 数据文件train.csv在"./data/"路径下
- 打印输出前该数据集10行
| Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
---|
Loan_ID | | | | | | | | | | | | |
---|
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
---|
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
---|
LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.0 | 360.0 | 0.0 | Semiurban | N |
---|
LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.0 | 360.0 | 1.0 | Urban | Y |
---|
LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.0 | 360.0 | 1.0 | Semiurban | N |
---|
* 从数据集中得到“所有没有毕业”(Education: Not Graduate)并且“获得贷款”(Loan_Status: Y)的“女性”(Gender: Female),
并输出“性别”(Gender)、“教育状况”(Education)及“贷款状态”(Loan_Status)。
| Gender | Education | Loan_Status |
---|
Loan_ID | | | |
---|
LP001155 | Female | Not Graduate | Y |
---|
LP001669 | Female | Not Graduate | Y |
---|
LP001692 | Female | Not Graduate | Y |
---|
LP001908 | Female | Not Graduate | Y |
---|
LP002300 | Female | Not Graduate | Y |
---|
LP002314 | Female | Not Graduate | Y |
---|
LP002407 | Female | Not Graduate | Y |
---|
LP002489 | Female | Not Graduate | Y |
---|
LP002502 | Female | Not Graduate | Y |
---|
LP002534 | Female | Not Graduate | Y |
---|
LP002582 | Female | Not Graduate | Y |
---|
LP002731 | Female | Not Graduate | Y |
---|
LP002757 | Female | Not Graduate | Y |
---|
LP002917 | Female | Not Graduate | Y |
---|
三、使用apply对数据集应用自定义函数
def num_missing(x):
return sum(x.isnull())
3.1 使用apply函数将num_missing函数用于统计数据集的每列缺失值数量
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
3.2 使用apply函数将num_missing函数用于统计数据集每行缺失值数量,并打印前10行
Loan_ID
LP001002 1
LP001003 0
LP001005 0
LP001006 0
LP001008 0
LP001011 0
LP001013 0
LP001014 0
LP001018 0
LP001020 0
dtype: int64
四、缺失值填充
4.1 对于Gender、Married、Self_Employed三个因子型变量,使用各自最常见的因子进行缺失值填充
4.2 对于LoanAmount变量进行缺失值填充处理
- 按照“Gender”、“Married”及“Self_Employed”的组合下的每个组群进行LoanAmount变量的均值统计
- 按照每组统计得到的平均值,对“LoanAmount”中缺失值进行填充
五、数据透视表
基于data数据,得到下表:
Loan_Status | N | Y | ALL |
---|
Credit_History | | | |
0.0 | num | num | num |
1.0 | num | num | num |
All | num | num | num |
其中num代表统计数量。
Loan_Status | N | Y | All |
---|
Credit_History | | | |
---|
0.0 | 82 | 7 | 89 |
---|
1.0 | 97 | 378 | 475 |
---|
All | 179 | 385 | 564 |
---|
六、合并数据集
- 将prop_rates数据集与data数据集合并
- 基于合并后的数据集,按照“Property_Area”、“rates”的组合下的每个组群下Credit_History变量的样本数量统计
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
| | Credit_History |
---|
Property_Area | rates | |
---|
Rural | 1000 | 179.0 |
---|
Semiurban | 5000 | 233.0 |
---|
Urban | 12000 | 202.0 |
---|
七、数据集排序
7.1 将data数据集按照ApplicantIncome、CoapplicantIncome 两列变量值进行降序排列,并输出排序后数据集的前10行。
| ApplicantIncome | CoapplicantIncome |
---|
Loan_ID | | |
---|
LP002317 | 81000 | 0.0 |
---|
LP002101 | 63337 | 0.0 |
---|
LP001585 | 51763 | 0.0 |
---|
LP001536 | 39999 | 0.0 |
---|
LP001640 | 39147 | 4750.0 |
---|
LP002422 | 37719 | 0.0 |
---|
LP001637 | 33846 | 0.0 |
---|
LP001448 | 23803 | 0.0 |
---|
LP002624 | 20833 | 6667.0 |
---|
LP001922 | 20667 | 0.0 |
---|
八、变量离散化
- 将LoanAmount变量离散化,得到新的变量LoanAmount_Bin
- 按照以下条件进行处理:
- [min,90):low
- [90,140):medium
- [140,190): high
[9.0, 90, 140, 190, 700.0]
low 98
medium 274
high 150
very_high 91
Name: LoanAmount_Bin, dtype: int64
九、变量映射
- 将变量Loan_Status中的字符映射为数字,得到新的变量Loan_Status_Coded
- 映射方法为:{‘N’: 0; ‘Y’: 1}
- 输出Loan_Status_Coded变量的类型统计数值
1 422
0 192
Name: Loan_Status_Coded, dtype: int64
十、独热编码
- 将LoanAmount_Bin变量进行独热编码
- 得到新变量:LoanAmount_low, Loan_Amount_medium, Loan_Amount_high, Loan_Amount_very_high
- 将新变量合并到data数据集,并打印数据集前10行
| Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status | LoanAmount_Bin | Loan_Status_Coded | LoanAmount_low | LoanAmount_medium | LoanAmount_high | LoanAmount_very_high |
---|
Loan_ID | | | | | | | | | | | | | | | | | | |
---|
LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 129.936937 | 360.0 | 1.0 | Urban | Y | medium | 1 | 0 | 1 | 0 | 0 |
---|
LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.000000 | 360.0 | 1.0 | Rural | N | medium | 0 | 0 | 1 | 0 | 0 |
---|
LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.000000 | 360.0 | 1.0 | Urban | Y | low | 1 | 1 | 0 | 0 | 0 |
---|
LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.000000 | 360.0 | 1.0 | Urban | Y | medium | 1 | 0 | 1 | 0 | 0 |
---|
LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.000000 | 360.0 | 1.0 | Urban | Y | high | 1 | 0 | 0 | 1 | 0 |
---|
LP001011 | Male | Yes | 2 | Graduate | Yes | 5417 | 4196.0 | 267.000000 | 360.0 | 1.0 | Urban | Y | very_high | 1 | 0 | 0 | 0 | 1 |
---|
LP001013 | Male | Yes | 0 | Not Graduate | No | 2333 | 1516.0 | 95.000000 | 360.0 | 1.0 | Urban | Y | medium | 1 | 0 | 1 | 0 | 0 |
---|
LP001014 | Male | Yes | 3+ | Graduate | No | 3036 | 2504.0 | 158.000000 | 360.0 | 0.0 | Semiurban | N | high | 0 | 0 | 0 | 1 | 0 |
---|
LP001018 | Male | Yes | 2 | Graduate | No | 4006 | 1526.0 | 168.000000 | 360.0 | 1.0 | Urban | Y | high | 1 | 0 | 0 | 1 | 0 |
---|
LP001020 | Male | Yes | 1 | Graduate | No | 12841 | 10968.0 | 349.000000 | 360.0 | 1.0 | Semiurban | N | very_high | 0 | 0 | 0 | 0 | 1 |
---|