Risk Control Models: A Risk Early-Warning Model
A friend of mine recently interviewed for an algorithm position at a bank. The first round was a take-home test: the company sends out a problem set, the candidate completes it at home and then walks through it in the interview. The market really does seem soft this year; it is rare for a senior algorithm role to include such a round. The data turned out to be quite interesting, so this post uses it to walk through the modeling pipeline and to show how WOE and LR (logistic regression) are combined in practice.
The value of WOE (Weight of Evidence)
- Handling missing values: binning lets null be treated as its own bin, so even a feature with only 30% effective coverage can still be used.
- Handling outliers: binning puts outliers into a bin of their own, which makes the variable more robust. For example, age is entered by users by hand, so values like 200 can appear; these can be folded into the age > 60 bin.
- Business interpretability: the business is used to reading a variable's effect linearly (as x grows, y grows), but the relationship between x and y is often nonlinear; WOE transforms the variable accordingly (the exact definition is given below).
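For reference, the quantities computed later in the post follow the standard definitions: with good = non-fraud and bad = fraud, the WOE of bin $i$ and the information value (IV) used to rank features are

$$\mathrm{WOE}_i = \ln\frac{\mathrm{good}_i/\mathrm{good}_{\mathrm{total}}}{\mathrm{bad}_i/\mathrm{bad}_{\mathrm{total}}}, \qquad \mathrm{IV} = \sum_i \left(\frac{\mathrm{good}_i}{\mathrm{good}_{\mathrm{total}}} - \frac{\mathrm{bad}_i}{\mathrm{bad}_{\mathrm{total}}}\right)\mathrm{WOE}_i$$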
Modeling pipeline
The pipeline:
Load data -> EDA -> Feature Generation -> Model Establishment -> Release Online
EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from scipy.stats import chi2_contingency
pd.set_option('display.max_columns', None)
sns.set_theme(style="darkgrid")
sns.set(rc = {'figure.figsize':(20,15)})
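The cell that loads the data is not shown in the original; a placeholder (the file name and format are assumptions, not from the source):
dataset = pd.read_json('transactions.txt', lines=True)  # assumed file name/format; the loading cell isn't shown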
dataset.head()
| | accountNumber | customerId | creditLimit | availableMoney | transactionDateTime | transactionAmount | merchantName | acqCountry | merchantCountryCode | posEntryMode | posConditionCode | merchantCategoryCode | currentExpDate | accountOpenDate | dateOfLastAddressChange | cardCVV | enteredCVV | cardLast4Digits | transactionType | echoBuffer | currentBalance | merchantCity | merchantState | merchantZip | cardPresent | posOnPremises | recurringAuthInd | expirationDateKeyInMatch | isFraud | transactionDate | transactionHour | transactionMonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 737265056 | 737265056 | 5000.0 | 5000.0 | 2016-08-13 14:27:32 | 98.55 | Uber | US | US | 02 | 01 | rideshare | 06/2023 | 2015-03-14 | 2015-03-14 | 414 | 414 | 1803 | PURCHASE | NaN | 0.0 | NaN | NaN | NaN | False | NaN | NaN | False | False | 2016-08-13 | 14 | 8 |
1 | 737265056 | 737265056 | 5000.0 | 5000.0 | 2016-10-11 05:05:54 | 74.51 | AMC #191138 | US | US | 09 | 01 | entertainment | 02/2024 | 2015-03-14 | 2015-03-14 | 486 | 486 | 767 | PURCHASE | NaN | 0.0 | NaN | NaN | NaN | True | NaN | NaN | False | False | 2016-10-11 | 5 | 10 |
2 | 737265056 | 737265056 | 5000.0 | 5000.0 | 2016-11-08 09:18:39 | 7.47 | Play Store | US | US | 09 | 01 | mobileapps | 08/2025 | 2015-03-14 | 2015-03-14 | 486 | 486 | 767 | PURCHASE | NaN | 0.0 | NaN | NaN | NaN | False | NaN | NaN | False | False | 2016-11-08 | 9 | 11 |
3 | 737265056 | 737265056 | 5000.0 | 5000.0 | 2016-12-10 02:14:50 | 7.47 | Play Store | US | US | 09 | 01 | mobileapps | 08/2025 | 2015-03-14 | 2015-03-14 | 486 | 486 | 767 | PURCHASE | NaN | 0.0 | NaN | NaN | NaN | False | NaN | NaN | False | False | 2016-12-10 | 2 | 12 |
4 | 830329091 | 830329091 | 5000.0 | 5000.0 | 2016-03-24 21:04:46 | 71.18 | Tim Hortons #947751 | US | US | 02 | 01 | fastfood | 10/2029 | 2015-08-06 | 2015-08-06 | 885 | 885 | 3143 | PURCHASE | NaN | 0.0 | NaN | NaN | NaN | True | NaN | NaN | False | False | 2016-03-24 | 21 | 3 |
fraud = dataset['isFraud'].value_counts().to_frame()
fraud['pct'] = fraud['isFraud']/fraud['isFraud'].sum()
display(fraud)
| | isFraud | pct |
|---|---|---|
False | 773946 | 0.98421 |
True | 12417 | 0.01579 |
This is an extremely imbalanced dataset. For unbalanced data there are two common approaches, upsampling and downsampling; that topic deserves its own article, so this post will not cover it.
With card payments, network glitches and the like can cause the same purchase to be charged several times, which leaves duplicate records in the dataset. First drop the REVERSAL transactions, then remove the duplicated multi-swipe records:
dataset = dataset[~(dataset['transactionType'].isin(['REVERSAL']))]
mult_swipe = dataset[dataset.duplicated(keep='first',subset=['customerId','transactionDate','transactionAmount','cardLast4Digits','transactionHour'])]
print('multi-swipe transaction number:{0},amount:{1}'.format(len(mult_swipe),sum(mult_swipe['transactionAmount'])))
dataset = dataset[~(dataset.index.isin(mult_swipe.index))]
multi-swipe transaction number:7565,amount:1076660.0299999956
First, a look at the distribution over time.
plt.figure(figsize = (15,8))
sns.barplot(data = dataset, x='transactionMonth',y='transactionAmount',estimator=sum)  # total amount per month
plt.figure(figsize = (15,8))
sns.barplot(data = dataset, x='transactionMonth',y='transactionAmount',estimator=len)  # transaction count per month
The transaction amount and transaction count are roughly flat across months.
plt.figure(figsize = (15,8))
sns.boxplot(data = dataset,x='transactionMonth',y='transactionAmount',notch=True,showcaps=True,
flierprops={'marker':'x'},medianprops={'color':'coral'})
The boxplot shows that the outliers are mainly on the high-amount side.
Next, deal with the columns that are mostly null.
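The cell that builds data_null is not shown; a minimal sketch consistent with the table below (null count, row total, and null percentage per column):
data_null = pd.DataFrame({'null_num': dataset.isnull().sum(), 'total': len(dataset)})
data_null['pct'] = data_null['null_num'] / data_null['total']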
data_null
| | null_num | total | pct |
|---|---|---|---|
accountNumber | 0 | 758495 | 0.000000 |
customerId | 0 | 758495 | 0.000000 |
creditLimit | 0 | 758495 | 0.000000 |
availableMoney | 0 | 758495 | 0.000000 |
transactionDateTime | 0 | 758495 | 0.000000 |
transactionAmount | 0 | 758495 | 0.000000 |
merchantName | 0 | 758495 | 0.000000 |
acqCountry | 4401 | 758495 | 0.005802 |
merchantCountryCode | 703 | 758495 | 0.000927 |
posEntryMode | 3904 | 758495 | 0.005147 |
posConditionCode | 396 | 758495 | 0.000522 |
merchantCategoryCode | 0 | 758495 | 0.000000 |
currentExpDate | 0 | 758495 | 0.000000 |
accountOpenDate | 0 | 758495 | 0.000000 |
dateOfLastAddressChange | 0 | 758495 | 0.000000 |
cardCVV | 0 | 758495 | 0.000000 |
enteredCVV | 0 | 758495 | 0.000000 |
cardLast4Digits | 0 | 758495 | 0.000000 |
transactionType | 690 | 758495 | 0.000910 |
echoBuffer | 758495 | 758495 | 1.000000 |
currentBalance | 0 | 758495 | 0.000000 |
merchantCity | 758495 | 758495 | 1.000000 |
merchantState | 758495 | 758495 | 1.000000 |
merchantZip | 758495 | 758495 | 1.000000 |
cardPresent | 0 | 758495 | 0.000000 |
posOnPremises | 758495 | 758495 | 1.000000 |
recurringAuthInd | 758495 | 758495 | 1.000000 |
expirationDateKeyInMatch | 0 | 758495 | 0.000000 |
isFraud | 0 | 758495 | 0.000000 |
transactionDate | 0 | 758495 | 0.000000 |
transactionHour | 0 | 758495 | 0.000000 |
transactionMonth | 0 | 758495 | 0.000000 |
Remove the columns whose null percentage is >= 0.5:
data_df = dataset[data_null[data_null['pct']<0.5].index.tolist()]
data_df = data_df[['customerId', 'creditLimit', 'availableMoney','transactionDateTime','transactionAmount', 'merchantName','acqCountry', 'merchantCountryCode', 'posEntryMode', 'posConditionCode','merchantCategoryCode','accountOpenDate','dateOfLastAddressChange', 'cardCVV', 'enteredCVV', 'cardLast4Digits',
'transactionType', 'currentBalance','expirationDateKeyInMatch', 'isFraud', 'transactionDate','transactionHour','transactionMonth']]
Next, compare how fraud and normal transactions are distributed over time.
fig, axes = plt.subplots(3,4)
# one panel per month, fraud transactions only
for month in range(1,13):
    ax = axes[(month-1)//4, (month-1)%4]
    sns.histplot(ax=ax, data = data_df[(data_df['transactionMonth']==month)&(data_df['isFraud']==True)], x='transactionAmount',binwidth = 100,stat='probability').set(title=str(month))
plt.tight_layout()
fig, axes = plt.subplots(3,4)
# one panel per month, normal transactions only
for month in range(1,13):
    ax = axes[(month-1)//4, (month-1)%4]
    sns.histplot(ax=ax, data = data_df[(data_df['transactionMonth']==month)&(data_df['isFraud']==False)], x='transactionAmount',binwidth = 100,stat='probability').set(title=str(month))
plt.tight_layout()
For fraud transactions, the 0-100 and 100-200 amount ranges take a relatively large share. For normal transactions, 0-100 clearly dominates the other ranges, and the fourth quarter's share is slightly higher than the other quarters'.
plt.figure(figsize = (15,8))
g = sns.FacetGrid(data_df,col='transactionTime_',row='isFraud',margin_titles=True)
g.map(sns.histplot,'transactionAmount',stat='probability',binwidth = 50)
plt.tight_layout()
Judging from these proportions, normal transactions are insensitive to the time of day, showing no real differences. Fraud transactions in the afternoon are distributed differently from the other periods, with the 100-200 range taking the largest share.
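The construction of cus_df is also omitted; a sketch consistent with the output below, with per-customer totals sorted descending, the cumulative share of amount (cumpct) and of customers (cus_pct), and the average ticket size TA used further down:
cus_df = (data_df.groupby('customerId', as_index=False)
                 .agg({'transactionAmount': 'sum', 'transactionDate': 'count'})
                 .sort_values('transactionAmount', ascending=False)
                 .reset_index(drop=True))
cus_df['cumpct'] = cus_df['transactionAmount'].cumsum() / cus_df['transactionAmount'].sum()  # cumulative amount share
cus_df['cus_pct'] = (cus_df.index + 1) / len(cus_df)  # cumulative customer share
cus_df['TA'] = cus_df['transactionAmount'] / cus_df['transactionDate']  # average ticket size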
cus_df.head()
| | customerId | transactionAmount | transactionDate | cumpct |
|---|---|---|---|---|
0 | 380680241 | 4589985.93 | 31554 | 0.044210 |
1 | 882815134 | 1842601.52 | 12665 | 0.061958 |
2 | 570884863 | 1514931.43 | 10452 | 0.076549 |
3 | 246251253 | 1425588.84 | 9806 | 0.090280 |
4 | 369308035 | 1012414.42 | 6928 | 0.100032 |
sns.lineplot(x='cus_pct',y='cumpct',data=cus_df )
20% of the customers contribute 80% of the transaction amount.
sns.histplot(data = cus_df, x='TA',stat='count',binwidth=10).set(title='TA')
The average ticket size concentrates around 150.
# `df` below is the feature-engineered frame; derived columns such as
# days_tranz_open, merchantName_, CVV and days_address are built in cells not shown here
df_ttl = df[['customerId', 'creditLimit','transactionAmount','acqCountry','merchantCategoryCode','transactionType','currentBalance','transactionMonth','transactionTime_',
'days_tranz_open','merchantName_', 'CVV', 'days_address', 'isFraud']]
df_ttl['label'] = df_ttl['isFraud'].astype(int)  # assumed: the 0/1 label used below is derived from isFraud (cell not shown)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV
# 'label' stays in the X frame because the WOE tables below are computed on the training split
trained_cols = ['customerId', 'creditLimit','transactionAmount','acqCountry','merchantCategoryCode','transactionType','currentBalance','transactionMonth','transactionTime_',
'days_tranz_open','merchantName_', 'CVV', 'days_address','label']
s_train_X,s_test_X,s_train_y,s_test_y = train_test_split(df_ttl[trained_cols],\
df_ttl['label'],train_size=0.8,random_state=123)
cus_df = s_train_X.groupby('customerId',as_index=False).agg({'transactionAmount':'sum','acqCountry':'count'})
cus_df = cus_df.rename(columns = {'acqCountry':'Frequency'})
cus_df['TA'] = cus_df['transactionAmount']/cus_df['Frequency']
s_train_X = s_train_X.merge(cus_df[['customerId','Frequency','TA']],on='customerId',how='left')
s_test_X = s_test_X.merge(cus_df[['customerId','Frequency','TA']],on='customerId',how='left')
s_train_X['TA_Tranz'] = s_train_X['transactionAmount'] - s_train_X['TA']
s_test_X['TA_Tranz'] = s_test_X['transactionAmount'] - s_test_X['TA']
Feature generation
class WOE(Analysis):  # Analysis (not shown in the original) provides group_by_feature, which returns good/bad counts per bin
    @staticmethod
    def __perc_share(df,group_name):
        return df[group_name]/df[group_name].sum()

    def __calculate_perc_share(self,feat):
        df = self.group_by_feature(feat)
        df['perc_good'] = self.__perc_share(df,'good')
        df['perc_bad'] = self.__perc_share(df,'bad')
        df['perc_diff'] = df['perc_good'] - df['perc_bad']
        return df

    def calculate_woe(self,feat):
        df = self.__calculate_perc_share(feat)
        df['woe'] = np.log(df['perc_good']/df['perc_bad'])
        df['woe'] = df['woe'].replace([np.inf,-np.inf],np.nan).fillna(0)
        df['iv'] = df['perc_diff'] * df['woe']  # per-bin IV; print_iv below sums this column, but the original never computed it
        return df
class CategoricalFeature():
    def __init__(self,df,feature):
        self.df = df
        self.feature = feature

    @property
    def _df_woe(self):
        df_woe = self.df.copy()
        df_woe['bin'] = df_woe[self.feature].fillna('missing')
        return df_woe[['bin','label']]
def draw_woe(woe_df):
    fig, ax = plt.subplots(figsize=(10,6))
    sns.barplot(x=woe_df.columns[0], y='woe', data=woe_df, palette=sns.cubehelix_palette(len(woe_df),start=0.5,rot=0.75,reverse=True))
    ax.set_title('WOE visualization for: ' + woe_df.columns[0])  # the original referenced an undefined `feature` variable
    plt.xticks(rotation=30)
    plt.show()
def print_iv(woe_df):
    iv = woe_df['iv'].sum()
    if iv < 0.02:
        interpre = 'useless'
    elif iv < 0.1:
        interpre = 'weak'
    elif iv < 0.3:
        interpre = 'medium'
    elif iv < 0.5:
        interpre = 'strong'
    else:
        interpre = 'toogood'
    return iv,interpre
Categorical features
feature_cat = ['creditLimit','acqCountry', 'transactionType',
'transactionMonth', 'transactionTime_', 'CVV','merchantName_','merchantCategoryCode']
iv_dic = {}
iv_dic['feature'] = []
iv_dic['iv'] = []
iv_dic['interpretation'] = []
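The loop that fills iv_dic is omitted in the original (it presumably goes through the WOE / CategoricalFeature classes above, whose Analysis base class is not shown). A self-contained sketch of the same computation, assuming label is the 0/1 fraud flag:
def bin_iv(frame, bin_col='bin'):
    # good = non-fraud (label 0), bad = fraud (label 1), counted per bin
    grp = frame.groupby(bin_col)['label'].agg(total='count', bad='sum')
    grp['good'] = grp['total'] - grp['bad']
    grp['perc_good'] = grp['good'] / grp['good'].sum()
    grp['perc_bad'] = grp['bad'] / grp['bad'].sum()
    grp['perc_diff'] = grp['perc_good'] - grp['perc_bad']
    grp['woe'] = np.log(grp['perc_good'] / grp['perc_bad']).replace([np.inf, -np.inf], np.nan).fillna(0)
    grp['iv'] = grp['perc_diff'] * grp['woe']
    return grp.reset_index()

for feat in feature_cat:
    tmp = s_train_X[[feat, 'label']].copy()
    tmp['bin'] = tmp[feat].fillna('missing')  # nulls get their own bin
    iv, interpre = print_iv(bin_iv(tmp))
    iv_dic['feature'].append(feat)
    iv_dic['iv'].append(iv)
    iv_dic['interpretation'].append(interpre)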
iv_df = pd.DataFrame(iv_dic)
display(iv_df)
| | feature | iv | interpretation |
|---|---|---|---|
0 | creditLimit | 0.019578 | useless |
1 | acqCountry | 0.000674 | useless |
2 | transactionType | 0.016641 | useless |
3 | transactionMonth | 0.003295 | useless |
4 | transactionTime_ | 0.000690 | useless |
5 | CVV | 0.004968 | useless |
6 | merchantName_ | 0.766754 | toogood |
7 | merchantCategoryCode | 0.222249 | medium |
Continuous features
import scipy.stats as stats
feature_conti = ['transactionAmount','currentBalance','days_tranz_open','days_address', 'Frequency', 'TA','TA_Tranz']
iv_con_dic = {}
iv_con_dic['feature'] = []
iv_con_dic['iv'] = []
iv_con_dic['interpretation'] = []
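Same idea for the continuous features; the original's binning scheme is not shown (the scipy.stats import hints at a statistical binning), so plain quantile bins stand in here:
for feat in feature_conti:
    tmp = s_train_X[[feat, 'label']].copy()
    tmp['bin'] = pd.qcut(tmp[feat], q=10, duplicates='drop').astype(str)  # decile bins as a stand-in
    woe_df = bin_iv(tmp)  # this is also the table handed to draw_woe below
    iv, interpre = print_iv(woe_df)
    iv_con_dic['feature'].append(feat)
    iv_con_dic['iv'].append(iv)
    iv_con_dic['interpretation'].append(interpre)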
iv_con_df = pd.DataFrame(iv_con_dic)
iv_con_df
| | feature | iv | interpretation |
|---|---|---|---|
0 | transactionAmount | 0.377574 | strong |
1 | currentBalance | 0.003375 | useless |
2 | days_tranz_open | 0.000420 | useless |
3 | days_address | 0.003841 | useless |
4 | Frequency | 0.023392 | weak |
5 | TA | 0.022107 | weak |
6 | TA_Tranz | 0.322500 | strong |
draw_woe(woe_df)
Here TA_Tranz, the difference between a single transaction's amount and that customer's average ticket size, serves as the example: once the difference exceeds 10, the transaction leans toward fraud.
col_model = iv_df[~(iv_df['interpretation'].isin(['useless']))]['feature'].values.tolist()
col_model.extend(iv_con_df[~(iv_con_df['interpretation'].isin(['useless']))]['feature'].values.tolist())
trained_col = []
for col in col_model:
trained_col.append('woe'+col)
trained_col
['woemerchantName_',
'woemerchantCategoryCode',
'woetransactionAmount',
'woeFrequency',
'woeTA',
'woeTA_Tranz']
The modeling below uses these six WOE features. The step that builds the woe* columns is not shown in the original; a sketch follows.
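A sketch of the WOE transform, reusing bin_iv and the quantile binning from the sketches above (the helper name and binning choice are assumptions):
def add_woe_column(train, test, feat, continuous):
    # learn the bin -> WOE mapping on the training split, apply it to both splits
    if continuous:
        train_bins, edges = pd.qcut(train[feat], q=10, duplicates='drop', retbins=True)
        test_bins = pd.cut(test[feat], bins=edges)
    else:
        train_bins = train[feat].fillna('missing')
        test_bins = test[feat].fillna('missing')
    woe_map = bin_iv(pd.DataFrame({'bin': train_bins, 'label': train['label'].values})).set_index('bin')['woe']
    train['woe' + feat] = train_bins.map(woe_map).fillna(0).values
    test['woe' + feat] = test_bins.map(woe_map).fillna(0).values  # unseen bins fall back to WOE 0

for feat in ['merchantName_', 'merchantCategoryCode']:
    add_woe_column(s_train_X, s_test_X, feat, continuous=False)
for feat in ['transactionAmount', 'Frequency', 'TA', 'TA_Tranz']:
    add_woe_column(s_train_X, s_test_X, feat, continuous=True)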
Model
Use GridSearchCV to sweep the hyperparameter grid:
lr_paras = [{'penalty':['l2'],
'C':[0.01,0.05,0.1,0.5,1,5,10,50,100],
'class_weight':[{0:0.1,1:0.9},{0:0.2,1:0.8},{0:0.3,1:0.7},{0:0.4,1:0.6},{0:0.5,1:0.5},{0:0.6,1:0.4},{0:0.7,1:0.3},{0:0.8,1:0.2},{0:0.9,1:0.1}],
'solver':['liblinear'],
'multi_class':['ovr']}]
modelLR = GridSearchCV(LogisticRegression(tol=1e-6),lr_paras,cv=5,verbose=1)
modelLR.fit(s_train_X[trained_col],s_train_y)  # fit on the six WOE features; the original fit on the full s_train_X, which still holds raw (string) columns
Fitting 5 folds for each of 81 candidates, totalling 405 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed: 7.0min finished
GridSearchCV(cv=5, estimator=LogisticRegression(tol=1e-06),
param_grid=[{'C': [0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100],
'class_weight': [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8},
{0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6},
{0: 0.5, 1: 0.5}, {0: 0.6, 1: 0.4},
{0: 0.7, 1: 0.3}, {0: 0.8, 1: 0.2},
{0: 0.9, 1: 0.1}],
'multi_class': ['ovr'], 'penalty': ['l2'],
'solver': ['liblinear']}],
verbose=1)
# LR_ and lr_score are helper functions not shown in the original post; LR_
# apparently returns the fitted coefficients and intercept, and lr_score
# applies them to a frame to produce probabilities
coef_,intercept_ = LR_(modelLR.best_estimator_,s_train_X,s_train_y)
s_test_X.drop(columns='pred',inplace=True)
s_train_X.drop(columns='pred',inplace=True)
s_y_pred_prob = lr_score(s_test_X,coef_,intercept_)
s_y_pred_prob_train = lr_score(s_train_X,coef_,intercept_)
LR outputs probabilities, so sweep the threshold in small steps to find the optimal cut-off:
from sklearn.metrics import f1_score  # missing from the imports in the original

thre = np.linspace(0,0.5,50)
score_dic = {}
score_dic['thre'] = []
score_dic['score'] = []
for item in thre:
    s_y_pred_train = [1 if i >= item else 0 for i in s_y_pred_prob_train]
    score_f1 = f1_score(s_train_y,s_y_pred_train,average='macro')
    score_dic['thre'].append(item)
    score_dic['score'].append(score_f1)
thresh = score_dic['thre'][int(np.argmax(score_dic['score']))]  # best macro-F1 threshold; the original used `thresh` without defining it
s_y_pred_train = [1 if i >= thresh else 0 for i in s_y_pred_prob_train]
s_y_pred = [1 if i >= thresh else 0 for i in s_y_pred_prob]
print('Training set performance')
print(metrics.confusion_matrix(s_train_y, s_y_pred_train))
print(metrics.classification_report(s_train_y, s_y_pred_train))
print('Test set performance')
print(metrics.confusion_matrix(s_test_y, s_y_pred))
print(metrics.classification_report(s_test_y, s_y_pred))
Training set performance
[[576091 14082]
[ 7969 1162]]
precision recall f1-score support
0 0.99 0.98 0.98 590173
1 0.08 0.13 0.10 9131
accuracy 0.96 599304
macro avg 0.53 0.55 0.54 599304
weighted avg 0.97 0.96 0.97 599304
Test set performance
[[144053 3385]
[ 2091 297]]
precision recall f1-score support
0 0.99 0.98 0.98 147438
1 0.08 0.12 0.10 2388
accuracy 0.96 149826
macro avg 0.53 0.55 0.54 149826
weighted avg 0.97 0.96 0.97 149826
fpr, tpr, _ = metrics.roc_curve(s_test_y, s_y_pred_prob)  # the original referenced s_test_X_.pred, which is never defined; the test-set probabilities are used here
auc = metrics.roc_auc_score(s_test_y, s_y_pred_prob)
plt.figure(figsize=(8,6))
sns.lineplot(x=fpr, y=tpr, label='Model AUC %0.2f' % auc, color='palevioletred', lw = 2)
plt.plot([0, 1], [0, 1], color='lightgrey', lw=1.5, linestyle='--')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate',fontsize=12)
plt.ylabel('True Positive Rate',fontsize=12)
plt.title('ROC - Test Set',fontsize=13)
plt.legend(loc="lower right",fontsize=12)
plt.rc_context({'axes.edgecolor':'darkgrey','xtick.color':'black','ytick.color':'black','figure.facecolor':'white'})
plt.show()
Although the AUC reaches 0.75, the macro F1 score is only 0.54, so the model actually performs quite poorly. This is mainly the effect of the extremely imbalanced dataset, and it also shows how important it is to pick suitable metrics for this kind of extreme class imbalance.