Task 1: Understanding the Competition Problem

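1 Understanding the Problem

The competition is set in personal lending within financial risk control. Participants use a loan applicant's data to predict whether the applicant is likely to default, and that prediction informs whether the loan should be approved. This is a typical classification problem.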

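1.1 Learning Objectives

Understand the competition data and objective, and be clear about the scoring system.
Register for the competition, download the data, and submit a result to check in (submitting the sample result is fine), so that you become familiar with the competition workflow.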

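1.2 Competition Overview

The task is to predict financial risk. The dataset becomes visible and downloadable after registration. It comes from the loan records of a credit platform and contains more than 1.2 million records with 47 columns of variables, 15 of which are anonymous. To keep the competition fair, 800,000 records are drawn as the training set, 200,000 as test set A, and 200,000 as test set B. Fields such as employmentTitle, purpose, postCode, and title are desensitized.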

train.csv

Feature variables:

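  • id: unique identifier assigned to the loan listing
  • loanAmnt: loan amount
  • term: loan term (years)
  • interestRate: loan interest rate
  • installment: installment payment amount
  • grade: loan grade
  • subGrade: loan sub-grade
  • employmentTitle: job title
  • employmentLength: length of employment (years)
  • homeOwnership: home ownership status provided by the borrower at registration
  • annualIncome: annual income
  • verificationStatus: verification status
  • issueDate: month in which the loan was issued
  • purpose: loan purpose category given by the borrower in the loan application
  • postCode: first 3 digits of the postal code given by the borrower in the loan application
  • regionCode: region code
  • dti: debt-to-income ratio
  • delinquency_2years: number of delinquency events more than 30 days past due in the borrower's credit file over the past 2 years
  • ficoRangeLow: lower bound of the FICO range the borrower belonged to when the loan was issued
  • ficoRangeHigh: upper bound of the FICO range the borrower belonged to when the loan was issued
  • openAcc: number of open credit lines in the borrower's credit file
  • pubRec: number of derogatory public records
  • pubRecBankruptcies: number of public record bankruptcies
  • revolBal: total revolving credit balance
  • revolUtil: revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
  • totalAcc: total number of credit lines currently in the borrower's credit file
  • initialListStatus: initial listing status of the loan
  • applicationType: whether the loan is an individual application or a joint application with two co-borrowers
  • earliesCreditLine: month in which the borrower's earliest reported credit line was opened
  • title: loan title provided by the borrower
  • policyCode: policy code (1 = publicly available, 2 = new product not publicly available)
  • n-series anonymous features: anonymous features n0-n14, processed count features of borrower behavior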

Target variable: isDefault

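1.3 Evaluation Metric

The competition uses AUC as the evaluation metric. AUC (Area Under the Curve) is defined as the area enclosed between the ROC curve and the coordinate axis. It lies between 0 and 1, and a higher value means the model ranks defaulting loans above non-defaulting ones more reliably.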
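To make the metric concrete, here is a minimal sketch of AUC computed from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The function name auc_pairwise and the toy labels and scores are made up for illustration and are not competition data. In practice one would more likely call scikit-learn's roc_auc_score(y_true, y_score); the quadratic loop below is only meant to show what the number measures.

```python
def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted default probabilities (hypothetical, for illustration only)
y_true = [1, 0, 0, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.5, 0.2]
print(auc_pairwise(y_true, y_score))  # 5 of the 6 pairs are ordered correctly
```

Because only the ordering of the scores matters, a submission can contain predicted probabilities rather than hard 0/1 labels.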

1.4 Competition Workflow

Exploratory data analysis → feature engineering → modeling and parameter tuning → model ensembling

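1.5 First Look at the Data

import pandas as pd

# Local data directory on the author's machine; point this at your own download folder
path = r'C:\Users\gnzha\工作\工作\bonc\python\学习资料\202009datawhale资料\数据\\'
train_data = pd.read_csv(path + 'train.csv')
train_data.info()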
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n2.1                759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
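The non-null counts above show which columns have missing values. As a rough sketch, the missing rate per column can be worked out directly from those counts. The numbers below are copied from the info() output for a subset of columns; the helper name missing_rates is hypothetical.

```python
# Non-null counts copied from the train_data.info() output (subset of columns)
n_rows = 800000
non_null = {
    "employmentLength": 753201,
    "dti": 799761,
    "pubRecBankruptcies": 799595,
    "revolUtil": 799469,
    "n0": 759730,
    "n4": 766761,
    "n11": 730248,
}


def missing_rates(non_null_counts, total_rows):
    """Return {column: fraction missing}, sorted from most to least missing."""
    rates = {col: (total_rows - cnt) / total_rows for col, cnt in non_null_counts.items()}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


for col, rate in missing_rates(non_null, n_rows).items():
    print(f"{col:<20s} {rate:.2%}")
```

With the DataFrame already loaded, train_data.isnull().mean() should give the same fractions for every column at once.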
# Load test set A from the same directory
testA_data = pd.read_csv(path + 'testA.csv')
testA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 48 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  200000 non-null  int64  
 1   loanAmnt            200000 non-null  float64
 2   term                200000 non-null  int64  
 3   interestRate        200000 non-null  float64
 4   installment         200000 non-null  float64
 5   grade               200000 non-null  object 
 6   subGrade            200000 non-null  object 
 7   employmentTitle     200000 non-null  float64
 8   employmentLength    188258 non-null  object 
 9   homeOwnership       200000 non-null  int64  
 10  annualIncome        200000 non-null  float64
 11  verificationStatus  200000 non-null  int64  
 12  issueDate           200000 non-null  object 
 13  purpose             200000 non-null  int64  
 14  postCode            200000 non-null  float64
 15  regionCode          200000 non-null  int64  
 16  dti                 199939 non-null  float64
 17  delinquency_2years  200000 non-null  float64
 18  ficoRangeLow        200000 non-null  float64
 19  ficoRangeHigh       200000 non-null  float64
 20  openAcc             200000 non-null  float64
 21  pubRec              200000 non-null  float64
 22  pubRecBankruptcies  199884 non-null  float64
 23  revolBal            200000 non-null  float64
 24  revolUtil           199873 non-null  float64
 25  totalAcc            200000 non-null  float64
 26  initialListStatus   200000 non-null  int64  
 27  applicationType     200000 non-null  int64  
 28  earliesCreditLine   200000 non-null  object 
 29  title               200000 non-null  float64
 30  policyCode          200000 non-null  float64
 31  n0                  189889 non-null  float64
 32  n1                  189889 non-null  float64
 33  n2                  189889 non-null  float64
 34  n2.1                189889 non-null  float64
 35  n2.2                189889 non-null  float64
 36  n2.3                189889 non-null  float64
 37  n4                  191606 non-null  float64
 38  n5                  189889 non-null  float64
 39  n6                  189889 non-null  float64
 40  n7                  189889 non-null  float64
 41  n8                  189889 non-null  float64
 42  n9                  189889 non-null  float64
 43  n10                 191606 non-null  float64
 44  n11                 182425 non-null  float64
 45  n12                 189889 non-null  float64
 46  n13                 189889 non-null  float64
 47  n14                 189889 non-null  float64
dtypes: float64(35), int64(8), object(5)
memory usage: 73.2+ MB
train_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
       'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
       'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
       'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
       'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
       'policyCode', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
       'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
      dtype='object')
testA_data.columns
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
       'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
       'annualIncome', 'verificationStatus', 'issueDate', 'purpose',
       'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow',
       'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal',
       'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType',
       'earliesCreditLine', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n2.1',
       'n2.2', 'n2.3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12',
       'n13', 'n14'],
      dtype='object')
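The two column lists are not identical: the training set has 47 columns and test set A has 48. A small sketch for spotting the differences follows. The column names are copied from the two Index outputs above; the helper name compare_columns is hypothetical.

```python
def compare_columns(train_cols, test_cols):
    """Return (only_in_train, only_in_test), each keeping its original order."""
    train_set, test_set = set(train_cols), set(test_cols)
    only_train = [c for c in train_cols if c not in test_set]
    only_test = [c for c in test_cols if c not in train_set]
    return only_train, only_test


# Column names copied from train_data.columns and testA_data.columns
head_cols = ['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
             'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
             'annualIncome', 'verificationStatus', 'issueDate']
mid_cols = ['purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
            'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
            'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
            'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
            'policyCode', 'n0', 'n1', 'n2', 'n2.1']
tail_cols = [f'n{i}' for i in range(4, 15)]  # n4 ... n14

train_cols = head_cols + ['isDefault'] + mid_cols + tail_cols
test_cols = head_cols + mid_cols + ['n2.2', 'n2.3'] + tail_cols

only_train, only_test = compare_columns(train_cols, test_cols)
print(len(train_cols), len(test_cols))
print("only in train:", only_train)
print("only in testA:", only_test)
```

The only column missing from test set A is the target isDefault, as expected. Test set A additionally contains n2.2 and n2.3, and neither file has a column named n3 (n2.1 appears in its place). The source does not explain these names, so a cautious approach is to train only on the columns the two files share. With the DataFrames loaded, the same check is compare_columns(list(train_data.columns), list(testA_data.columns)).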
train_data.head()
   id  loanAmnt  term  interestRate  installment  grade  subGrade  employmentTitle  employmentLength  homeOwnership  ...    n5    n6    n7    n8   n9   n10  n11  n12  n13  n14
0   0   35000.0     5         19.52       917.97      E        E2            320.0           2 years              2  ...   9.0   8.0   4.0  12.0  2.0   7.0  0.0  0.0  0.0  2.0
1   1   18000.0     5         18.49       461.90      D        D2         219843.0           5 years              0  ...   NaN   NaN   NaN   NaN  NaN  13.0  NaN  NaN  NaN  NaN
2   2   12000.0     5         16.99       298.17      D        D3          31698.0           8 years              0  ...   0.0  21.0   4.0   5.0  3.0  11.0  0.0  0.0  0.0  4.0
3   3   11000.0     3          7.26       340.96      A        A4          46854.0         10+ years              1  ...  16.0   4.0   7.0  21.0  6.0   9.0  0.0  0.0  0.0  1.0
4   4    3000.0     3         12.99       101.07      C        C2             54.0               NaN              1  ...   4.0   9.0  10.0  15.0  7.0  12.0  0.0  0.0  0.0  4.0

5 rows × 47 columns
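The preview shows that employmentLength is stored as text ("2 years", "10+ years") with missing values, which is why info() reports it as object. A minimal sketch of turning such strings into numbers is given below. The function name parse_employment_length is hypothetical, and mapping "10+ years" to 10 is an assumption of this sketch. Only the formats visible in the preview are covered, so the full column should be inspected before relying on it.

```python
import re


def parse_employment_length(value):
    """Turn strings such as '2 years' or '10+ years' into an int; missing values become None."""
    if value is None or value != value:  # value != value is True only for NaN
        return None
    match = re.search(r"\d+", str(value))
    return int(match.group()) if match else None


# Values taken from the employmentLength column of the preview above
samples = ["2 years", "5 years", "8 years", "10+ years", float("nan")]
print([parse_employment_length(v) for v in samples])  # [2, 5, 8, 10, None]
```

On the DataFrame this could be applied with train_data['employmentLength'].map(parse_employment_length).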

train_data['isDefault'].head()
0    1
1    0
2    0
3    0
4    0
Name: isDefault, dtype: int64
train_data['isDefault'].describe()
count    800000.000000
mean          0.199513
std           0.399634
min           0.000000
25%           0.000000
50%           0.000000
75%           0.000000
max           1.000000
Name: isDefault, dtype: float64
train_data['isDefault'].value_counts()
0    640390
1    159610
Name: isDefault, dtype: int64
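The mean reported by describe() and the counts reported by value_counts() describe the same thing: for a 0/1 column, the mean is the share of rows equal to 1. A short check using the counts printed above (variable names are illustrative):

```python
# Counts copied from the value_counts() output above
counts = {0: 640390, 1: 159610}

total = sum(counts.values())
default_rate = counts[1] / total     # share of defaulted loans, same quantity as the mean in describe()
neg_per_pos = counts[0] / counts[1]  # non-default loans per default loan

print(total)
print(default_rate)
print(round(neg_per_pos, 2))
```

Roughly one loan in five defaults, so the classes are imbalanced at about 4 to 1. A model that always predicts "no default" would reach about 80% accuracy while being useless, which is one reason a ranking metric such as AUC suits this task better than accuracy.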
train_data['isDefault'].values
array([1, 0, 0, ..., 1, 0, 0], dtype=int64)

2 Submitting Results

