Tianchi Competition 01: A Random Forest Baseline for Loan Default Prediction

DataScienceZone

Last modified: 2022-09-24 11:38:06

Category: Data Mining

First published: 2022-01-08 16:13:32

Article link: https://blog.csdn.net/qq_35487917/article/details/121388561

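This post builds a baseline for a loan default prediction data mining project. The project comes from a Tianchi data science competition and is a binary classification problem.

Competition link: https://tianchi.aliyun.com/competition/entrance/531830/introduction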

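1. Competition and Data Overview

1.1 Competition Background

The competition is set in the context of personal credit in financial risk control. Participants must use each loan applicant's data to predict whether the applicant is likely to default, and that prediction informs whether the loan should be approved. This is a typical classification problem.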
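To make the task concrete, the following is a minimal sketch of a random forest baseline for a binary target. It runs on synthetic data from `make_classification`, not on the competition data, and the class ratio, the hyperparameters, and the use of AUC as the validation metric are all assumptions made for illustration rather than the author's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption: roughly 20% positives,
# where the positive class plays the role of isDefault == 1).
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=5,
    weights=[0.8, 0.2],
    random_state=42,
)

# Hold out 20% of the rows for validation, keeping the class ratio.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Illustrative hyperparameters, not tuned values.
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Score with the predicted probability of the positive class.
valid_proba = model.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, valid_proba)
print(f"validation AUC: {auc:.3f}")
```

Scoring on predicted probabilities rather than hard labels keeps the evaluation independent of any particular decision threshold, which is convenient when the classes are imbalanced.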

1.2 Competition Data

The fields in the dataset have the following meanings:
![Dataset field descriptions](https://img-blog.csdnimg.cn/1d034e8402594ae799f0dde8409f7955.png)

2. Exploratory Data Analysis and Preprocessing

2.1 Exploratory Data Analysis

First, import the required modules:

import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder

Next, read the training set and check the data type and missing values of each feature:

df = pd.read_csv('train.csv')
df.info()

Running the code produces the following output:

Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000
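The `df.info()` output above already points at two preprocessing concerns. Some columns have fewer than 800000 non-null entries (for example, `employmentLength` has 753201, so 46799 rows, about 5.8%, are missing), and `grade`, `subGrade`, `employmentLength`, and `issueDate` are stored as `object` rather than as numbers. The sketch below shows one way to pull both facts out programmatically. It uses a small made-up frame named `toy` instead of `train.csv`: the column names come from the output above, but every value is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv. Column names match the df.info() output,
# but all values (including the "N years" strings) are invented.
toy = pd.DataFrame({
    "loanAmnt": [35000.0, 18000.0, 12000.0, 11000.0],
    "grade": ["E", "D", "D", "A"],
    "employmentLength": ["2 years", np.nan, "8 years", "10+ years"],
    "dti": [17.05, 27.83, np.nan, 22.77],
})

# Fraction of missing values per column, keeping only columns with gaps.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
missing_ratio = missing_ratio[missing_ratio > 0]
print(missing_ratio)

# Non-numeric columns, which will need encoding before modelling.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

On the real data, the same two calls can be applied to `df` directly to decide which columns need imputation and which need encoding.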
