主要目标
- 了解数据集的一些基本情况,如缺失值,异常值,数据集大小,原始特征维度等
- 了解变量间的相互关系、变量与预测值之间存在的关系
- 观察变量的分布情况,为特征工程做准备
初始化
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
import pandas.util.testing as tm
warnings.filterwarnings('ignore')
# 文件目录,绝对路径
INPUT_PATH = 'G:\代码\数据挖掘实践\\'
读取文件
data_train = pd.read_csv(INPUT_PATH +'train.csv')
data_test_a = pd.read_csv(INPUT_PATH +'testA.csv')
分块读取
当需要读取的数据文件特别大时,可以采用分块读取的方式,分块读取返回的类型为 TextFileReader,是一个可迭代的对象。
#设置chunksize参数,来控制每次读取数据的大小
chunker = pd.read_csv(INPUT_PATH +"train.csv",chunksize=100000,iterator=True)
temp = chunker.get_chunk(200000) # 读取指定长度的数据
len(temp)
200000
# 对每一块进行迭代
for item in chunker: # 每个item是一个长度为chunksize的DataFrame
print(type(item))
print(len(item))
print(item['loanAmnt'].mean())
<class 'pandas.core.frame.DataFrame'>
100000
14405.10325
<class 'pandas.core.frame.DataFrame'>
100000
14449.50525
<class 'pandas.core.frame.DataFrame'>
100000
14368.34225
<class 'pandas.core.frame.DataFrame'>
100000
14422.87375
<class 'pandas.core.frame.DataFrame'>
100000
14426.4795
<class 'pandas.core.frame.DataFrame'>
100000
14417.336
统计分析
查看数据集的样本个数和原始特征维度:
print(data_train.shape)
print(data_train.columns)
(800000, 47)
Index(['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'grade',
'subGrade', 'employmentTitle', 'employmentLength', 'homeOwnership',
'annualIncome', 'verificationStatus', 'issueDate', 'isDefault',
'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years',
'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec',
'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc',
'initialListStatus', 'applicationType', 'earliesCreditLine', 'title',
'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8',
'n9', 'n10', 'n11', 'n12', 'n13', 'n14'],
dtype='object')
查看数据类型以及是否有空值:
data_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
memory usage: 286.9+ MB
查看数据的统计描述:
data_train.describe()
id | loanAmnt | term | interestRate | installment | employmentTitle | homeOwnership | annualIncome | verificationStatus | isDefault | ... | n5 | n6 | n7 | n8 | n9 | n10 | n11 | n12 | n13 | n14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 800000.000000 | 799999.000000 | 800000.000000 | 8.000000e+05 | 800000.000000 | 800000.000000 | ... | 759730.000000 | 759730.000000 | 759730.000000 | 759729.000000 | 759730.000000 | 766761.000000 | 730248.000000 | 759730.000000 | 759730.000000 | 759730.000000 |
mean | 399999.500000 | 14416.818875 | 3.482745 | 13.238391 | 437.947723 | 72005.351714 | 0.614213 | 7.613391e+04 | 1.009683 | 0.199513 | ... | 8.107937 | 8.575994 | 8.282953 | 14.622488 | 5.592345 | 11.643896 | 0.000815 | 0.003384 | 0.089366 | 2.178606 |
std | 230940.252013 | 8716.086178 | 0.855832 | 4.765757 | 261.460393 | 106585.640204 | 0.675749 | 6.894751e+04 | 0.782716 | 0.399634 | ... | 4.799210 | 7.400536 | 4.561689 | 8.124610 | 3.216184 | 5.484104 | 0.030075 | 0.062041 | 0.509069 | 1.844377 |
min | 0.000000 | 500.000000 | 3.000000 | 5.310000 | 15.690000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 199999.750000 | 8000.000000 | 3.000000 | 9.750000 | 248.450000 | 427.000000 | 0.000000 | 4.560000e+04 | 0.000000 | 0.000000 | ... | 5.000000 | 4.000000 | 5.000000 | 9.000000 | 3.000000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
50% | 399999.500000 | 12000.000000 | 3.000000 | 12.740000 | 375.135000 | 7755.000000 | 1.000000 | 6.500000e+04 | 1.000000 | 0.000000 | ... | 7.000000 | 7.000000 | 7.000000 | 13.000000 | 5.000000 | 11.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
75% | 599999.250000 | 20000.000000 | 3.000000 | 15.990000 | 580.710000 | 117663.500000 | 1.000000 | 9.000000e+04 | 2.000000 | 0.000000 | ... | 11.000000 | 11.000000 | 10.000000 | 19.000000 | 7.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
max | 799999.000000 | 40000.000000 | 5.000000 | 30.990000 | 1715.420000 | 378351.000000 | 5.000000 | 1.099920e+07 | 2.000000 | 1.000000 | ... | 70.000000 | 132.000000 | 79.000000 | 128.000000 | 45.000000 | 82.000000 | 4.000000 | 4.000000 | 39.000000 | 30.000000 |
8 rows × 42 columns
查看存在缺失值的情况:
print(f'有{data_train.isnull().any().sum()}列数据存在缺失值') # .any()表示任意一列中存在缺失值
有22列数据存在缺失值
查看缺失特征中缺失率大于50%的特征:
have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
if value > 0.5:
fea_null_moreThanHalf[key] = value
fea_null_moreThanHalf
{}
具体的查看缺失特征及缺失率:
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
如果缺失值的数量较小一般选择填充,如使用 lgb 等树模型可以直接填充空缺,让树自己去优化。如果存在的缺失值过多,则可以考虑删掉该特征。
统计取值都相同的特征:
one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea
['policyCode']
查看特征的数据类型:
- 特征一般都是由类别型特征和数值型特征组成。
- 类别型特征有时具有非数值关系,有时也具有数值关系。比如’grade’中的等级 A,B,C 等,是否只是单纯的分类,还是A优于其他要结合业务判断。
- 数值型特征一般是可以直接入模的,但风控人员往往还要对其做分箱、转化为 WOE 编码进而做标准评分卡等操作。从模型效果上来看,特征分箱主要是为了降低变量的复杂性,减少变量噪音对模型的影响,提高自变量和因变量的相关度。从而使模型更加稳定。
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(data_train.columns)))
print('数值型特征')
print(numerical_fea)
print('类别型特征')
print(category_fea)
数值型特征
['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'policyCode', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
类别型特征
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
对数值型变量进行分析
- 判断数值特征是离散型还是连续型
# 过滤数值型类别特征
def get_numerical_serial_fea(data,feas):
numerical_serial_fea = []
numerical_noserial_fea = []
for fea in feas:
temp = data[fea].nunique()
if temp <= 10: # 取值少于10个的特征视为离散型特征,取值多于10个的视为连续性特征
numerical_noserial_fea.append(fea)
continue
numerical_serial_fea.append(fea)
return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
print('连续性数值特征')
print(numerical_serial_fea)
print('离散型数值特征')
print(numerical_noserial_fea)
连续性数值特征
['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14']
离散型数值特征
['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'policyCode', 'n11', 'n12']
- 查看离散型数值特征的类别数
# 查看离散型数值特征的类别数
print('term')
print(data_train['term'].value_counts())
print('\nhomeOwnership')
print(data_train['homeOwnership'].value_counts())
print('\nverificationStatus')
print(data_train['verificationStatus'].value_counts())
print('\ninitialListStatus')
print(data_train['initialListStatus'].value_counts())
print('\napplicationType')
print(data_train['applicationType'].value_counts())
print('\npolicyCode')
print(data_train['policyCode'].value_counts())
print('\nn11')
print(data_train['n11'].value_counts())
print('\nn12')
print(data_train['n12'].value_counts())
term
3 606902
5 193098
Name: term, dtype: int64
homeOwnership
0 395732
1 317660
2 86309
3 185
5 81
4 33
Name: homeOwnership, dtype: int64
verificationStatus
1 309810
2 248968
0 241222
Name: verificationStatus, dtype: int64
initialListStatus
0 466438
1 333562
Name: initialListStatus, dtype: int64
applicationType
0 784586
1 15414
Name: applicationType, dtype: int64
policyCode
1.0 800000
Name: policyCode, dtype: int64
n11
0.0 729682
1.0 540
2.0 24
4.0 1
3.0 1
Name: n11, dtype: int64
n12
0.0 757315
1.0 2281
2.0 115
3.0 16
4.0 3
Name: n12, dtype: int64
存在取值全部相同的情况,如’policyCode’,这在实际分析中是没有价值的。此外,有些变量存在取值数量相差较大的情况,如’n11’,‘n12’.
- 连续型数值变量可视化
# 每个数值特征的分布可视化
f = pd.melt(data_train, value_vars=numerical_serial_fea) # pd.melt 列转行
g = sns.FacetGrid(f, col="variable", col_wrap=5, sharex=False, sharey=False)
# FacetGrid 是sns的库的一个类,在初始化的时候只需要传入一个DataFrame将其示例化,实例化之后就可以直接使用这个类绘制需要的图形
g = g.map(sns.distplot, "value")
- 查看某一个数值型变量的分布,查看变量是否符合正态分布,如果不符合正态分布的变量可以取对数后再观察是否符合正态分布。
- 取对数的作用相当于标准化,可以将变化范围较大的数据限制在较小的变化范围内。
# 画出 loanAmnt 与 log(loanAmnt) 的分布
plt.figure(figsize=(16,12))
plt.suptitle('Transaction Values Distribution', fontsize=22)
plt.subplot(221)
sub_plot_1 = sns.distplot(data_train['loanAmnt'])
sub_plot_1.set_title("loanAmnt Distribuition", fontsize=18)
sub_plot_1.set_xlabel("")
sub_plot_1.set_ylabel("Probability", fontsize=15)
plt.subplot(222)
sub_plot_2 = sns.distplot(np.log(data_train['loanAmnt']))
sub_plot_2.set_title("loanAmnt (Log) Distribuition", fontsize=18)
sub_plot_2.set_xlabel("")
sub_plot_2.set_ylabel("Probability", fontsize=15)
对类别型特征进行分析
print('grade')
print(data_train['grade'].value_counts())
print('\nsubGrade')
print(data_train['subGrade'].value_counts())
print('\nemploymentLength')
print(data_train['employmentLength'].value_counts())
print('\nissueDate')
print(data_train['issueDate'].value_counts())
print('\nearliesCreditLine')
print(data_train['earliesCreditLine'].value_counts())
print('\nisDefault')
print(data_train['isDefault'].value_counts())
grade
B 233690
C 227118
A 139661
D 119453
E 55661
F 19053
G 5364
Name: grade, dtype: int64
subGrade
C1 50763
B4 49516
B5 48965
B3 48600
C2 47068
C3 44751
C4 44272
B2 44227
B1 42382
C5 40264
A5 38045
A4 30928
D1 30538
D2 26528
A1 25909
D3 23410
A3 22655
A2 22124
D4 21139
D5 17838
E1 14064
E2 12746
E3 10925
E4 9273
E5 8653
F1 5925
F2 4340
F3 3577
F4 2859
F5 2352
G1 1759
G2 1231
G3 978
G4 751
G5 645
Name: subGrade, dtype: int64
employmentLength
10+ years 262753
2 years 72358
< 1 year 64237
3 years 64152
1 year 52489
5 years 50102
4 years 47985
6 years 37254
8 years 36192
7 years 35407
9 years 30272
Name: employmentLength, dtype: int64
issueDate
2016-03-01 29066
2015-10-01 25525
2015-07-01 24496
2015-12-01 23245
2014-10-01 21461
...
2007-08-01 23
2007-07-01 21
2008-09-01 19
2007-09-01 7
2007-06-01 1
Name: issueDate, Length: 139, dtype: int64
earliesCreditLine
Aug-2001 5567
Aug-2002 5403
Sep-2003 5403
Oct-2001 5258
Aug-2000 5246
...
Feb-1960 1
Jul-1955 1
Oct-1957 1
Aug-1946 1
Aug-1958 1
Name: earliesCreditLine, Length: 720, dtype: int64
isDefault
0 640390
1 159610
Name: isDefault, dtype: int64
变量分布可视化
单一变量分布可视化
plt.figure(figsize=(8, 8))
sns.barplot(data_train["employmentLength"].value_counts(dropna=False)[:20],
data_train["employmentLength"].value_counts(dropna=False).keys()[:20])
plt.show()
可视化某个特征x在不同y值上的分布
- 查看类别型变量在不同y值上的分布
train_loan_fr = data_train.loc[data_train['isDefault'] == 1]
train_loan_nofr = data_train.loc[data_train['isDefault'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
train_loan_fr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax1, title='Count of grade fraud')
train_loan_nofr.groupby('grade')['grade'].count().plot(kind='barh', ax=ax2, title='Count of grade non-fraud')
train_loan_fr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh',ax=ax3, title='Count of employmentLength fraud')
train_loan_nofr.groupby('employmentLength')['employmentLength'].count().plot(kind='barh',ax=ax4, title='Count of employmentLength non-fraud')
plt.show()
- 查看连续型变量在不同y值上的分布
fig, ((ax1, ax2)) = plt.subplots(1, 2, figsize=(15, 6))
data_train.loc[data_train['isDefault'] == 1] \
['loanAmnt'].apply(np.log) \
.plot(kind='hist',
bins=100,
title='Log Loan Amt - Fraud',
color='r',
xlim=(-3, 10),
ax= ax1)
data_train.loc[data_train['isDefault'] == 0] \
['loanAmnt'].apply(np.log) \
.plot(kind='hist',
bins=100,
title='Log Loan Amt - Not Fraud',
color='b',
xlim=(-3, 10),
ax=ax2)
total = len(data_train)
total_amt = data_train.groupby(['isDefault'])['loanAmnt'].sum().sum()
plt.figure(figsize=(12,5))
plt.subplot(121) # 1代表行,2代表列,所以一共有2个图,1代表此时绘制第一个图。
plot_tr = sns.countplot(x='isDefault',data=data_train)
plot_tr.set_title("Fraud Loan Distribution \n 0: good user | 1: bad user", fontsize=14)
plot_tr.set_xlabel("Is fraud by count", fontsize=16)
plot_tr.set_ylabel('Count', fontsize=16)
for p in plot_tr.patches:
height = p.get_height()
plot_tr.text(p.get_x()+p.get_width()/2.,
height + 3,
'{:1.2f}%'.format(height/total*100),
ha="center", fontsize=15)
percent_amt = (data_train.groupby(['isDefault'])['loanAmnt'].sum())
percent_amt = percent_amt.reset_index()
plt.subplot(122)
plot_tr_2 = sns.barplot(x='isDefault', y='loanAmnt', dodge=True, data=percent_amt)
plot_tr_2.set_title("Total Amount in loanAmnt \n 0: good user | 1: bad user", fontsize=14)
plot_tr_2.set_xlabel("Is fraud by percent", fontsize=16)
plot_tr_2.set_ylabel('Total Loan Amount Scalar', fontsize=16)
for p in plot_tr_2.patches:
height = p.get_height()
plot_tr_2.text(p.get_x()+p.get_width()/2.,
height + 3,
'{:1.2f}%'.format(height/total_amt * 100),
ha="center", fontsize=15)
时间格式数据处理及查看
#转化成时间格式
data_train['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_train['issueDateDT'] = data_train['issueDate'].apply(lambda x: x-startdate).dt.days
data_test_a['issueDate'] = pd.to_datetime(data_train['issueDate'],format='%Y-%m-%d')
startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
data_test_a['issueDateDT'] = data_test_a['issueDate'].apply(lambda x: x-startdate).dt.days
plt.hist(data_train['issueDateDT'], label='train');
plt.hist(data_test_a['issueDateDT'], label='test');
plt.legend();
plt.title('Distribution of issueDateDT dates');
# train 和 test issueDateDT 日期有重叠,所以使用基于时间的分割进行验证是不明智的
透视图
# 透视图的索引可以有多个,'columns'是可选的,聚合函数 aggfunc 最后是被应用到了变量'values'中所列举的项目上
pivot = pd.pivot_table(data_train, index=['grade'], columns=['issueDateDT'], values=['loanAmnt'], aggfunc=np.sum)
pivot
loanAmnt | |||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
issueDateDT | 0 | 30 | 61 | 92 | 122 | 153 | 183 | 214 | 245 | 274 | ... | 3926 | 3957 | 3987 | 4018 | 4048 | 4079 | 4110 | 4140 | 4171 | 4201 |
grade | |||||||||||||||||||||
A | NaN | 53650.0 | 42000.0 | 19500.0 | 34425.0 | 63950.0 | 43500.0 | 168825.0 | 85600.0 | 101825.0 | ... | 13093850.0 | 11757325.0 | 11945975.0 | 9144000.0 | 7977650.0 | 6888900.0 | 5109800.0 | 3919275.0 | 2694025.0 | 2245625.0 |
B | NaN | 13000.0 | 24000.0 | 32125.0 | 7025.0 | 95750.0 | 164300.0 | 303175.0 | 434425.0 | 538450.0 | ... | 16863100.0 | 17275175.0 | 16217500.0 | 11431350.0 | 8967750.0 | 7572725.0 | 4884600.0 | 4329400.0 | 3922575.0 | 3257100.0 |
C | NaN | 68750.0 | 8175.0 | 10000.0 | 61800.0 | 52550.0 | 175375.0 | 151100.0 | 243725.0 | 393150.0 | ... | 17502375.0 | 17471500.0 | 16111225.0 | 11973675.0 | 10184450.0 | 7765000.0 | 5354450.0 | 4552600.0 | 2870050.0 | 2246250.0 |
D | NaN | NaN | 5500.0 | 2850.0 | 28625.0 | NaN | 167975.0 | 171325.0 | 192900.0 | 269325.0 | ... | 11403075.0 | 10964150.0 | 10747675.0 | 7082050.0 | 7189625.0 | 5195700.0 | 3455175.0 | 3038500.0 | 2452375.0 | 1771750.0 |
E | 7500.0 | NaN | 10000.0 | NaN | 17975.0 | 1500.0 | 94375.0 | 116450.0 | 42000.0 | 139775.0 | ... | 3983050.0 | 3410125.0 | 3107150.0 | 2341825.0 | 2225675.0 | 1643675.0 | 1091025.0 | 1131625.0 | 883950.0 | 802425.0 |
F | NaN | NaN | 31250.0 | 2125.0 | NaN | NaN | NaN | 49000.0 | 27000.0 | 43000.0 | ... | 1074175.0 | 868925.0 | 761675.0 | 685325.0 | 665750.0 | 685200.0 | 316700.0 | 315075.0 | 72300.0 | NaN |
G | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 24625.0 | NaN | NaN | ... | 56100.0 | 243275.0 | 224825.0 | 64050.0 | 198575.0 | 245825.0 | 53125.0 | 23750.0 | 25100.0 | 1000.0 |
7 rows × 139 columns
用 pandas_profiling 生成数据报告
import pandas_profiling
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("example.html")
参考资料
关于特征分箱可以参考:特征的分箱