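Data Mining: Financial Data (Part 1)

Data Preprocessing

Task 1: Explore and analyze the data. Time allowed: 2 days
Data type analysis
Removal of irrelevant features
Data type conversion
Missing value handling
Any other analysis or processing steps you can think of or borrow
Requirement: split the data 70/30 (70% training set, 30% test set) with the random seed set to 2018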

# Import the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
# Suppress warnings
import warnings
warnings.filterwarnings('ignore')
# Load the data and inspect it
data = pd.read_csv('data.csv', encoding='gbk')
print(data.iloc[0])
print('===================')
print(data.shape[1])
Unnamed: 0                                                              5
custid                                                            2791858
trade_no                                 20180507115231274000000023057383
bank_card_no                                                          卡号1
low_volume_percent                                                   0.01
middle_volume_percent                                                0.99
take_amount_in_later_12_month_highest                                   0
trans_amount_increase_rate_lately                                     0.9
trans_activity_month                                                 0.55
trans_activity_day                                                  0.313
transd_mcc                                                             17
trans_days_interval_filter                                             27
trans_days_interval                                                    26
regional_mobility                                                       3
student_feature                                                       NaN
repayment_capability                                                19890
is_high_user                                                            0
number_of_trans_from_2011                                              30
first_transaction_time                                        2.01308e+07
historical_trans_amount                                            149050
historical_trans_day                                                  151
rank_trad_1_month                                                     0.4
trans_amount_3_month                                                34030
avg_consume_less_12_valid_month                                         7
abs                                                                  3920
top_trans_count_last_1_month                                         0.15
avg_price_last_12_month                                              1020
avg_price_top_last_12_valid_month                                    0.55
reg_preference_for_trad                                              一线城市
trans_top_time_last_1_month                                             4
                                                       ...               
loans_credibility_behavior                                             73
loans_count                                                            37
loans_settle_count                                                     34
loans_overdue_count                                                     2
loans_org_count_behavior                                               10
consfin_org_count_behavior                                              1
loans_cash_count                                                        9
latest_one_month_loan                                                   1
latest_three_month_loan                                                 1
latest_six_month_loan                                                  13
history_suc_fee                                                        37
history_fail_fee                                                        7
latest_one_month_suc                                                    1
latest_one_month_fail                                                   0
loans_long_time                                                       341
loans_latest_time                                              2018-04-19
loans_credit_limit                                                   2200
loans_credibility_limit                                                72
loans_org_count_current                                                 9
loans_product_count                                                    10
loans_max_limit                                                      2900
loans_avg_limit                                                      1688
consfin_credit_limit                                                 1200
consfin_credibility                                                    75
consfin_org_count_current                                               1
consfin_product_count                                                   2
consfin_max_limit                                                    1200
consfin_avg_limit                                                    1200
latest_query_day                                                       12
loans_latest_day                                                       18
Name: 0, dtype: object
===================
90
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4754 entries, 0 to 4753
Data columns (total 90 columns):
Unnamed: 0                                    4754 non-null int64
custid                                        4754 non-null int64
trade_no                                      4754 non-null object
bank_card_no                                  4754 non-null object
low_volume_percent                            4752 non-null float64
middle_volume_percent                         4752 non-null float64
take_amount_in_later_12_month_highest         4754 non-null int64
trans_amount_increase_rate_lately             4751 non-null float64
trans_activity_month                          4752 non-null float64
trans_activity_day                            4752 non-null float64
transd_mcc                                    4752 non-null float64
trans_days_interval_filter                    4746 non-null float64
trans_days_interval                           4752 non-null float64
regional_mobility                             4752 non-null float64
student_feature                               1756 non-null float64
repayment_capability                          4754 non-null int64
is_high_user                                  4754 non-null int64
number_of_trans_from_2011                     4752 non-null float64
first_transaction_time                        4752 non-null float64
historical_trans_amount                       4754 non-null int64
historical_trans_day                          4752 non-null float64
rank_trad_1_month                             4752 non-null float64
trans_amount_3_month                          4754 non-null int64
avg_consume_less_12_valid_month               4752 non-null float64
abs                                           4754 non-null int64
top_trans_count_last_1_month                  4752 non-null float64
avg_price_last_12_month                       4754 non-null int64
avg_price_top_last_12_valid_month             4650 non-null float64
reg_preference_for_trad                       4752 non-null object
trans_top_time_last_1_month                   4746 non-null float64
trans_top_time_last_6_month                   4746 non-null float64
consume_top_time_last_1_month                 4746 non-null float64
consume_top_time_last_6_month                 4746 non-null float64
cross_consume_count_last_1_month              4328 non-null float64
trans_fail_top_count_enum_last_1_month        4738 non-null float64
trans_fail_top_count_enum_last_6_month        4738 non-null float64
trans_fail_top_count_enum_last_12_month       4738 non-null float64
consume_mini_time_last_1_month                4728 non-null float64
max_cumulative_consume_later_1_month          4754 non-null int64
max_consume_count_later_6_month               4746 non-null float64
railway_consume_count_last_12_month           4742 non-null float64
pawns_auctions_trusts_consume_last_1_month    4754 non-null int64
pawns_auctions_trusts_consume_last_6_month    4754 non-null int64
jewelry_consume_count_last_6_month            4742 non-null float64
status                                        4754 non-null int64
source                                        4754 non-null object
first_transaction_day                         4752 non-null float64
trans_day_last_12_month                       4752 non-null float64
id_name                                       4478 non-null object
apply_score                                   4450 non-null float64
apply_credibility                             4450 non-null float64
query_org_count                               4450 non-null float64
query_finance_count                           4450 non-null float64
query_cash_count                              4450 non-null float64
query_sum_count                               4450 non-null float64
latest_query_time                             4450 non-null object
latest_one_month_apply                        4450 non-null float64
latest_three_month_apply                      4450 non-null float64
latest_six_month_apply                        4450 non-null float64
loans_score                                   4457 non-null float64
loans_credibility_behavior                    4457 non-null float64
loans_count                                   4457 non-null float64
loans_settle_count                            4457 non-null float64
loans_overdue_count                           4457 non-null float64
loans_org_count_behavior                      4457 non-null float64
consfin_org_count_behavior                    4457 non-null float64
loans_cash_count                              4457 non-null float64
latest_one_month_loan                         4457 non-null float64
latest_three_month_loan                       4457 non-null float64
latest_six_month_loan                         4457 non-null float64
history_suc_fee                               4457 non-null float64
history_fail_fee                              4457 non-null float64
latest_one_month_suc                          4457 non-null float64
latest_one_month_fail                         4457 non-null float64
loans_long_time                               4457 non-null float64
loans_latest_time                             4457 non-null object
loans_credit_limit                            4457 non-null float64
loans_credibility_limit                       4457 non-null float64
loans_org_count_current                       4457 non-null float64
loans_product_count                           4457 non-null float64
loans_max_limit                               4457 non-null float64
loans_avg_limit                               4457 non-null float64
consfin_credit_limit                          4457 non-null float64
consfin_credibility                           4457 non-null float64
consfin_org_count_current                     4457 non-null float64
consfin_product_count                         4457 non-null float64
consfin_max_limit                             4457 non-null float64
consfin_avg_limit                             4457 non-null float64
latest_query_day                              4450 non-null float64
loans_latest_day                              4457 non-null float64
dtypes: float64(70), int64(13), object(7)
memory usage: 3.3+ MB

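From the output above: the data set has many kinds of columns, and some irrelevant ones need to be removed
Column-by-column analysis:
Unnamed: 0: a non-contiguous sequence number, which may mean some records were deleted; noted for now, then dropped as a feature below
custid: an ID number, drop
trade_no: the leading part encodes a time; noted for now, then dropped as a feature below
bank_card_no: always 卡号1 ("card no. 1"), identical in every row, drop
student_feature: high missing rate (only 1756 of 4754 values are non-null, about 63% missing), drop
source: always xs, drop
id_name, latest_query_time, loans_latest_time: also dropped in the code below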

useless_columns = ['Unnamed: 0','custid','trade_no','bank_card_no','id_name',
                   'student_feature', 'source','latest_query_time','loans_latest_time']
data = data.drop(useless_columns,axis=1)
print(data.iloc[0])
print(data.shape[1])
low_volume_percent                               0.01
middle_volume_percent                            0.99
take_amount_in_later_12_month_highest               0
trans_amount_increase_rate_lately                 0.9
trans_activity_month                             0.55
trans_activity_day                              0.313
transd_mcc                                         17
trans_days_interval_filter                         27
trans_days_interval                                26
regional_mobility                                   3
repayment_capability                            19890
is_high_user                                        0
number_of_trans_from_2011                          30
first_transaction_time                    2.01308e+07
historical_trans_amount                        149050
historical_trans_day                              151
rank_trad_1_month                                 0.4
trans_amount_3_month                            34030
avg_consume_less_12_valid_month                     7
abs                                              3920
top_trans_count_last_1_month                     0.15
avg_price_last_12_month                          1020
avg_price_top_last_12_valid_month                0.55
reg_preference_for_trad                          一线城市
trans_top_time_last_1_month                         4
trans_top_time_last_6_month                        19
consume_top_time_last_1_month                       4
consume_top_time_last_6_month                      19
cross_consume_count_last_1_month                    1
trans_fail_top_count_enum_last_1_month              1
                                             ...     
loans_score                                       552
loans_credibility_behavior                         73
loans_count                                        37
loans_settle_count                                 34
loans_overdue_count                                 2
loans_org_count_behavior                           10
consfin_org_count_behavior                          1
loans_cash_count                                    9
latest_one_month_loan                               1
latest_three_month_loan                             1
latest_six_month_loan                              13
history_suc_fee                                    37
history_fail_fee                                    7
latest_one_month_suc                                1
latest_one_month_fail                               0
loans_long_time                                   341
loans_credit_limit                               2200
loans_credibility_limit                            72
loans_org_count_current                             9
loans_product_count                                10
loans_max_limit                                  2900
loans_avg_limit                                  1688
consfin_credit_limit                             1200
consfin_credibility                                75
consfin_org_count_current                           1
consfin_product_count                               2
consfin_max_limit                                1200
consfin_avg_limit                                1200
latest_query_day                                   12
loans_latest_day                                   18
Name: 0, dtype: object
81
data.head(10).T
                                            0      1      2      3      4      5      6      7      8      9
low_volume_percent                       0.01   0.02   0.04      0   0.01   0.02   0.02   0.02   0.03   0.01
middle_volume_percent                    0.99   0.94   0.96   0.96   0.99   0.98   0.98   0.98   0.65   0.99
take_amount_in_later_12_month_highest       0   2000      0   2000      0   2000      0      0      0    500
trans_amount_increase_rate_lately         0.9   1.28      1   0.13   0.46   7.59  23.67   0.25   0.31    0.8
trans_activity_month                     0.55      1      1   0.57      1      1   0.94   0.88   0.76      1
trans_activity_day                      0.313  0.458  0.114  0.777  0.175  0.733  0.087  0.302  0.472  0.088
transd_mcc                                 17     19     13     22     13     27     10     19     15     15
trans_days_interval_filter                 27     30     68     14     66      8     54     20     21     36
trans_days_interval                        26     14     22      6     42     11     53     20     14     35
regional_mobility                           3      4      1      3      1      3      2      2      2      2
...                                       ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
loans_max_limit                          2900   3500   1600   3200   2300   5300   2200    NaN   5300   2800
loans_avg_limit                          1688   1758   1250   1541   1630   1941   2200    NaN   4750   1520
consfin_credit_limit                     1200  15100   4200  16300   8300  11200   7600    NaN   5500      0
consfin_credibility                        75     80     87     80     79     80     73    NaN     79      0
consfin_org_count_current                   1      5      1      5      2     10      2    NaN      8      0
consfin_product_count                       2      6      1      5      2     12      2    NaN     11      0
consfin_max_limit                        1200  22800   4200  30000   8400  20400  16800    NaN  19200      0
consfin_avg_limit                        1200   9360   4200  12180   8250   8130   8900    NaN   7987      0
latest_query_day                           12      4      2      2     22      3      1    NaN     24     18
loans_latest_day                           18      2      6      4    120      4      3    NaN      7    142

81 rows × 10 columns

data['status'].value_counts()
0    3561
1    1193
Name: status, dtype: int64
# Drop columns that hold a single distinct value (ignoring NaN)
orig_columns = data.columns
drop_columns = []
for col in orig_columns:
    col_series = data[col].dropna().unique()
    if len(col_series) == 1:
        drop_columns.append(col)
data = data.drop(drop_columns, axis=1)
print(drop_columns)
[]
# Count the missing values in each column
null_counts = data.isnull().sum()
null_counts
low_volume_percent                          2
middle_volume_percent                       2
take_amount_in_later_12_month_highest       0
trans_amount_increase_rate_lately           3
trans_activity_month                        2
trans_activity_day                          2
transd_mcc                                  2
trans_days_interval_filter                  8
trans_days_interval                         2
regional_mobility                           2
repayment_capability                        0
is_high_user                                0
number_of_trans_from_2011                   2
first_transaction_time                      2
historical_trans_amount                     0
historical_trans_day                        2
rank_trad_1_month                           2
trans_amount_3_month                        0
avg_consume_less_12_valid_month             2
abs                                         0
top_trans_count_last_1_month                2
avg_price_last_12_month                     0
avg_price_top_last_12_valid_month         104
reg_preference_for_trad                     2
trans_top_time_last_1_month                 8
trans_top_time_last_6_month                 8
consume_top_time_last_1_month               8
consume_top_time_last_6_month               8
cross_consume_count_last_1_month          426
trans_fail_top_count_enum_last_1_month     16
                                         ... 
loans_score                               297
loans_credibility_behavior                297
loans_count                               297
loans_settle_count                        297
loans_overdue_count                       297
loans_org_count_behavior                  297
consfin_org_count_behavior                297
loans_cash_count                          297
latest_one_month_loan                     297
latest_three_month_loan                   297
latest_six_month_loan                     297
history_suc_fee                           297
history_fail_fee                          297
latest_one_month_suc                      297
latest_one_month_fail                     297
loans_long_time                           297
loans_credit_limit                        297
loans_credibility_limit                   297
loans_org_count_current                   297
loans_product_count                       297
loans_max_limit                           297
loans_avg_limit                           297
consfin_credit_limit                      297
consfin_credibility                       297
consfin_org_count_current                 297
consfin_product_count                     297
consfin_max_limit                         297
consfin_avg_limit                         297
latest_query_day                          304
loans_latest_day                          297
dtype: int64
# Check the data types
data.dtypes.value_counts()
float64    69
int64      11
object      1
dtype: int64
# Handle the non-numeric data
object_columns_df = data.select_dtypes(include=['object'])
object_columns_df.iloc[0]
reg_preference_for_trad    一线城市
Name: 0, dtype: object
data['reg_preference_for_trad'].unique()
array(['一线城市', '三线城市', '境外', '二线城市', '其他城市', nan], dtype=object)
# Map the city categories to integers
mapping_dict = {
    'reg_preference_for_trad': {
        '一线城市': 1,   # tier-1 city
        '二线城市': 2,   # tier-2 city
        '三线城市': 3,   # tier-3 city
        '其他城市': 4,   # other cities
        '境外': 0        # overseas
    }
}
data = data.replace(mapping_dict)
data['reg_preference_for_trad'].unique()
array([  1.,   3.,   0.,   2.,   4.,  nan])
data.head(10)
   low_volume_percent  middle_volume_percent  take_amount_in_later_12_month_highest
0                0.01                   0.99                                      0
1                0.02                   0.94                                   2000
2                0.04                   0.96                                      0
3                0.00                   0.96                                   2000
4                0.01                   0.99                                      0
5                0.02                   0.98                                   2000
6                0.02                   0.98                                      0
7                0.02                   0.98                                      0
8                0.03                   0.65                                      0
9                0.01                   0.99                                    500

   trans_amount_increase_rate_lately  trans_activity_month  trans_activity_day
0                               0.90                  0.55               0.313
1                               1.28                  1.00               0.458
2                               1.00                  1.00               0.114
3                               0.13                  0.57               0.777
4                               0.46                  1.00               0.175
5                               7.59                  1.00               0.733
6                              23.67                  0.94               0.087
7                               0.25                  0.88               0.302
8                               0.31                  0.76               0.472
9                               0.80                  1.00               0.088

   transd_mcc  trans_days_interval_filter  trans_days_interval  regional_mobility  ...
0        17.0                        27.0                 26.0                3.0  ...
1        19.0                        30.0                 14.0                4.0  ...
2        13.0                        68.0                 22.0                1.0  ...
3        22.0                        14.0                  6.0                3.0  ...
4        13.0                        66.0                 42.0                1.0  ...
5        27.0                         8.0                 11.0                3.0  ...
6        10.0                        54.0                 53.0                2.0  ...
7        19.0                        20.0                 20.0                2.0  ...
8        15.0                        21.0                 14.0                2.0  ...
9        15.0                        36.0                 35.0                2.0  ...

   loans_max_limit  loans_avg_limit  consfin_credit_limit  consfin_credibility
0           2900.0           1688.0                1200.0                 75.0
1           3500.0           1758.0               15100.0                 80.0
2           1600.0           1250.0                4200.0                 87.0
3           3200.0           1541.0               16300.0                 80.0
4           2300.0           1630.0                8300.0                 79.0
5           5300.0           1941.0               11200.0                 80.0
6           2200.0           2200.0                7600.0                 73.0
7              NaN              NaN                   NaN                  NaN
8           5300.0           4750.0                5500.0                 79.0
9           2800.0           1520.0                   0.0                  0.0

   consfin_org_count_current  consfin_product_count  consfin_max_limit  consfin_avg_limit
0                        1.0                    2.0             1200.0             1200.0
1                        5.0                    6.0            22800.0             9360.0
2                        1.0                    1.0             4200.0             4200.0
3                        5.0                    5.0            30000.0            12180.0
4                        2.0                    2.0             8400.0             8250.0
5                       10.0                   12.0            20400.0             8130.0
6                        2.0                    2.0            16800.0             8900.0
7                        NaN                    NaN                NaN                NaN
8                        8.0                   11.0            19200.0             7987.0
9                        0.0                    0.0                0.0                0.0

   latest_query_day  loans_latest_day
0              12.0              18.0
1               4.0               2.0
2               2.0               6.0
3               2.0               4.0
4              22.0             120.0
5               3.0               4.0
6               1.0               3.0
7               NaN               NaN
8              24.0               7.0
9              18.0             142.0

10 rows × 81 columns

# Split into training and test sets
y = data['status']
x = data.drop(['status'], axis=1)
X_train,X_test,y_train,y_test=train_test_split(x, y, test_size=0.3, random_state=2018)
print("Training Size:{}".format(X_train.shape))
print('Testing Size:{}'.format(X_test.shape))
Training Size:(3327, 80)
Testing Size:(1427, 80)
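
The status counts above (3561 zeros, 1193 ones) show a roughly 3:1 imbalance, so a plain random split can leave the two parts with slightly different class ratios. The sketch below repeats the same 70/30 split with seed 2018 and adds `stratify=y`, which is an addition not used in the original code. The single `feature` column is synthetic and only stands in for the real table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real table: 4754 rows with the same class counts.
rng = np.random.default_rng(0)
y = pd.Series([0] * 3561 + [1] * 1193, name="status")
x = pd.DataFrame({"feature": rng.normal(size=len(y))})

# Same 70/30 split and seed as above; stratify=y is the only change (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=2018, stratify=y
)

print(X_train.shape, X_test.shape)
# Share of positive labels in each part; stratification keeps them close.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

The row counts match the split above (3327 and 1427); only the assignment of rows to each part changes.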

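Discussion

1. Missing data
  • Why do missing values need to be handled?
  • Above what missing rate should a feature be discarded?
  • What methods are there for filling missing values?
  • What are the effects, or pros and cons, of mean filling?
  • What criteria should guide the choice of method?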

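1 Why do missing values need to be handled?
How missing values are handled affects feature extraction, modeling, and model training. When a feature has too many missing values, consider dropping it outright; keeping it and handling it badly can introduce noise. When the missing rate is low (below some threshold), dropping the feature loses information, so an appropriate fill method is the better choice.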

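2 Above what missing rate should a feature be discarded?
A common rule of thumb is 70%, but also weigh how important the feature is to the prediction target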
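
A minimal sketch of the threshold rule, assuming the 70% figure from the answer above. The helper name `drop_sparse_columns` and the toy columns are made up for illustration. Note that student_feature is about 63% missing, below this threshold, yet the notebook drops it, so the cutoff is a judgment call rather than a fixed rule.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_rate=0.7):
    """Drop columns whose share of missing values exceeds the threshold."""
    missing_rate = df.isnull().mean()          # fraction of NaN per column
    to_drop = missing_rate[missing_rate > max_missing_rate].index.tolist()
    return df.drop(columns=to_drop), missing_rate

# Hypothetical toy data, not taken from data.csv.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 80% missing
    "few_missing": [1.0, np.nan, 3.0, 4.0, 5.0],              # 20% missing
    "complete": [1, 2, 3, 4, 5],
})
kept, rates = drop_sparse_columns(toy, max_missing_rate=0.7)
print(rates)
print(list(kept.columns))
```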

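3 What methods are there for filling missing values?
(1) Simple fills
- A special value, the mean, the median, the mode, and so on
(2) Imputation
- Random imputation: draw a random sample from the population and use it in place of the missing one
- Multiple imputation: predict the missing data from the relationships between variables, use Monte Carlo methods to generate several complete data sets, analyze each of them, and then pool the results
- Hot-deck imputation: find a sample in the complete part of the data that resembles the sample with the missing value (a matched sample) and use its observed value to fill the gap
- Lagrange interpolation and Newton interpolation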
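
The notebook above counts missing values but never fills them, so here is a small sketch of two of the listed options: a simple fill (median for a continuous column, mode for a discrete one) and random imputation. The toy columns `amount` and `city_tier` are hypothetical. Computing the fill values on the training part only and reusing them on the test part is an assumption about good practice, not something the original states.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test parts with gaps.
train = pd.DataFrame({
    "amount": [10.0, 12.0, np.nan, 11.0, 100.0],
    "city_tier": [1.0, 3.0, 1.0, np.nan, 1.0],
})
test = pd.DataFrame({"amount": [np.nan, 9.0], "city_tier": [2.0, np.nan]})

# Simple fill: statistics come from the training part and are reused for the test part.
fill_values = {
    "amount": train["amount"].median(),         # continuous: median
    "city_tier": train["city_tier"].mode()[0],  # discrete: most frequent value
}
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

# Random imputation: replace each gap with a randomly drawn observed value.
rng = np.random.default_rng(2018)
observed = train["amount"].dropna().to_numpy()
random_filled = train["amount"].copy()
mask = random_filled.isnull()
random_filled.loc[mask] = rng.choice(observed, size=int(mask.sum()))

print(fill_values)
print(test_filled)
print(random_filled.tolist())
```

The median is used for `amount` because the value 100.0 would pull the mean far above the typical values.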

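4 What are the effects, or pros and cons, of mean filling?
Advantage: it is simple and leaves the column mean unchanged
Drawback: it sharply reduces the variance of the data, that is, its natural variability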
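
The variance effect is easy to check on a tiny made-up series: filling with the mean adds points with zero deviation, so the mean stays put while the sample variance shrinks.

```python
import numpy as np
import pandas as pd

# Hypothetical column: four observed values and four gaps.
values = pd.Series([2.0, 4.0, 6.0, 8.0, np.nan, np.nan, np.nan, np.nan])

observed_var = values.var()            # variance of the observed values only
filled = values.fillna(values.mean())  # mean imputation
filled_var = filled.var()

print(values.mean(), filled.mean())    # the mean is unchanged
print(observed_var, filled_var)        # the variance is smaller after filling
```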

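5 What criteria should guide the choice of method?
(1) Delete
- If the share of missing values in a row or column reaches a certain level, drop the whole row or column
(2) Impute
- Column-wise: fill continuous features with the mean and discrete features with the mode
- Row-wise: bring in a predictive model, for example an auxiliary regression, and predict the missing data from the relationships between variables
(3) Leave as is
- When missing values have little effect on the model, mine the data with the nulls left in place. Many models tolerate missing values or handle them flexibly; the discussion listed KNN, decision trees and random forests, neural networks, naive Bayes, and DBSCAN. Whether NaN input is actually accepted depends on the implementation, and many scikit-learn estimators reject it.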

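2. Data exploration

With only a few fields, plots are the usual way to inspect distributions directly and then treat the data accordingly. With many fields, plotting them one by one takes too long, so in what order and by what logic should the fields be handled?

This situation is common in banking, and probably in other fields too. In my own experience, production tables can have more than 300 fields, and deciding which to keep is genuinely tedious. My approach is to first find the fields whose values are highly repetitive, which a SQL GROUP BY shows directly. Then I run every feature through SPSS, which offers both plots and simple summary analyses. These two steps cover roughly 70 to 80 percent of the work; if this is only the data exploration stage, that is essentially enough.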
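
The first step described above (spotting highly repetitive fields) can be approximated in pandas without plotting anything. This is a sketch only: `profile_columns` is a hypothetical helper, and the toy table reuses two column names from the data set with made-up values.

```python
import numpy as np
import pandas as pd

def profile_columns(df):
    """One row per column: dtype, missing rate, distinct values, share of the top value."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)  # sorted, most frequent first
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "missing_rate": df[col].isnull().mean(),
            "n_unique": int(counts.size),
            # Share of rows taken by the most frequent value (like GROUP BY with COUNT(*)).
            "top_share": counts.iloc[0] / len(df) if counts.size else 0.0,
        })
    report = pd.DataFrame(rows).set_index("column")
    return report.sort_values("top_share", ascending=False)

toy = pd.DataFrame({
    "source": ["xs"] * 6,
    "is_high_user": [0, 0, 0, 0, 0, 1],
    "amount": [10.5, 20.0, np.nan, 31.2, 8.0, 15.0],
})
report = profile_columns(toy)
print(report)
```

Columns at the top of the report (a `top_share` near 1) are candidates for removal; the rest can then be inspected in more detail.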

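3. Time series

How should time series fields be handled? Beyond extracting a number of days, what else can be done?

It depends on the case. The main question is how much a time field, on its own or combined with other fields, contributes to the analysis target; choose the feature extraction accordingly. For example, combine the time field with other attributes and examine how often a feature occurs per day, per week, per month, or on particular weekdays. In the end it comes down to the goal of the analysis.

Besides treating the time field as a feature on its own, it is common to analyze it together with other time-dependent features and check how the combined features affect the quantity being analyzed.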
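
A small sketch of such calendar features, using the first record shown at the top of the notebook. It assumes that the first 14 characters of trade_no are a YYYYMMDDHHMMSS timestamp, which the column analysis suggests but does not confirm; the output column names are made up.

```python
import pandas as pd

# Values copied from the first record printed above.
toy = pd.DataFrame({
    "trade_no": ["20180507115231274000000023057383"],
    "loans_latest_time": ["2018-04-19"],
})

# Assumption: the leading 14 characters of trade_no are YYYYMMDDHHMMSS.
trade_time = pd.to_datetime(toy["trade_no"].str[:14], format="%Y%m%d%H%M%S")
loan_time = pd.to_datetime(toy["loans_latest_time"], format="%Y-%m-%d")

features = pd.DataFrame({
    "trade_month": trade_time.dt.month,
    "trade_weekday": trade_time.dt.dayofweek,  # Monday = 0
    "trade_hour": trade_time.dt.hour,
    # Whole days between the last loan date and the trade date.
    "days_since_last_loan": (trade_time.dt.normalize() - loan_time).dt.days,
})
print(features)
```

For this record the day difference is 18, the same as its loans_latest_day value, which is consistent with that column having been derived in a similar way.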

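4. Anomalies and outliers

How do you identify outliers, and should they be deleted or replaced, and how? (For example, outliers caused by manual data entry errors.)

Outliers can be found with quantiles
Check how many standard deviations a value lies from the mean
The LOF (local outlier factor) algorithm
Draw a box plot alongside describe()

Most parametric statistics, such as the mean, standard deviation, and correlation coefficient, and any analysis built on them, are highly sensitive to outliers. Outliers can therefore distort a data analysis badly.

See also this blog post: https://www.jianshu.com/p/0c967a1526ef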
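
The first two checks in the list (quantiles and standard deviations) can be sketched as follows. The cutoffs `k=1.5` and `threshold=3.0` are common conventions assumed here, not values from the discussion, and the series is made up: 100 ordinary values plus one obvious entry error.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return z.abs() > threshold

# Hypothetical column: the values 100..199 and one mistyped amount.
amounts = pd.Series(list(range(100, 200)) + [5000])

print(amounts[iqr_outliers(amounts)].tolist())
print(amounts[zscore_outliers(amounts)].tolist())
```

Because the mean and standard deviation are themselves pulled by the outlier, the z-score rule becomes unreliable on small samples; the quantile rule is less affected.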

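5. Encoding categorical data

The city tiers here are clearly not suited to one-hot encoding. In other cases, after re-encoding a text feature with LabelBinarizer from sklearn.preprocessing, how is the encoded feature then used for prediction?

One-hot encoding: for categories with no relationship to one another
Integer (ordinal) codes: for features whose categories have an order
Keep in mind that the codes are just numbers, and that a model in effect computes distances between categories from them
So the choice depends on the actual situation: does a distance relationship between the categories really exist?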
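
The two options can be compared side by side on the city column. The integer mapping is the one used in the notebook; placing 境外 (overseas) at 0 is that notebook's choice and implies it sits next to tier 1, which may or may not be intended. The `city` prefix is made up.

```python
import pandas as pd

cities = pd.Series(
    ["一线城市", "三线城市", "境外", "二线城市", "其他城市", None],
    name="reg_preference_for_trad",
)

# Option 1: integer codes, which assume the tiers are ordered and evenly spaced.
tier_order = {"境外": 0, "一线城市": 1, "二线城市": 2, "三线城市": 3, "其他城市": 4}
as_ordinal = cities.map(tier_order)

# Option 2: one-hot columns, which assume no order or distance between categories.
# dummy_na=True adds a separate indicator column for missing values.
as_one_hot = pd.get_dummies(cities, prefix="city", dummy_na=True)

print(as_ordinal.tolist())
print(as_one_hot.shape)
```

Either result is then passed to the model as ordinary numeric columns, which is how an encoded text feature ends up being used for prediction.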

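6. Class imbalance

Real-world examples:
1. Credit card fraud: there may be one fraud case among several hundred or several thousand customers
2. Click-through rates for web advertising, where the ratio is even more extreme
3. Medical misdiagnosis

Handling imbalanced data

  1. Change the amount of data so that the classes become balanced
  2. Keep the amount of data and compensate with a cost matrix or cost function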
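
For the second option, one simple cost scheme weights each class inversely to its frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which to my knowledge is also what scikit-learn applies for `class_weight="balanced"`; the helper name is made up, and the labels reproduce the status counts printed earlier.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Class counts taken from the status column above: 3561 zeros and 1193 ones.
y = np.array([0] * 3561 + [1] * 1193)
weights = balanced_class_weights(y)
print(weights)
```

With these weights, both classes contribute the same total weight to the loss, so errors on the minority class cost about three times as much per sample.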

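These questions were raised and worked through together by the group. I learned a lot and hope to keep taking part in the discussion and learning.
Below are the materials compiled by the group organizer.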

Handling missing values: https://blog.csdn.net/w352986331qq/article/details/78639233
Data encoding: https://mp.weixin.qq.com/s/U93vvFwZ8vSJuswk24yc6w

Class imbalance
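What?
It usually arises in classification problems, when the amount of data differs greatly between classes
For example: 100,000 positive examples and 100 negative examples

Why?
Why does it occur
- The collected data is too small or incomplete
- It is a property of the data itself, as with credit card fraud, customer complaints, or machine failures: out of 1,000 customers, perhaps only 1 commits fraud.

Why does it need handling
- Imbalance between classes easily biases the model's predictions. Take credit card fraud: a decision tree trained directly on the data may reach very high accuracy (about 99.9% in the 1-in-1,000 example) simply by predicting "no fraud" for everyone. Such a result is meaningless.
- For imbalanced classes like this, some strategy is needed to adjust the data.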
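
The accuracy problem can be reproduced in a few lines with made-up labels at the 1-in-1,000 rate: a "model" that never predicts fraud scores very high on accuracy and zero on recall.

```python
import numpy as np

# Hypothetical population: 10 fraud cases among 10,000 customers (1 per 1,000).
y_true = np.array([1] * 10 + [0] * 9990)
y_pred = np.zeros_like(y_true)  # always predicts "no fraud"

accuracy = (y_pred == y_true).mean()
# Recall: share of actual fraud cases that were predicted as fraud.
recall = (y_pred[y_true == 1] == 1).mean()

print(accuracy, recall)
```

This is why metrics such as recall or precision on the minority class are more informative than accuracy for this kind of data.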

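How?
Goals:
- Change the amount of data so that the classes are as equal in size as possible: number of positive examples = number of negative examples
- Keep the amount of data and remove the effect of the imbalance by choosing a different model or evaluation method

Strategies:
- 1. Enlarge the data set, to obtain more information about the distribution; more data also makes later resampling easier.
- 2. Look at the problem from a new angle. The minority-class samples can be treated as outliers, which turns the task into anomaly detection or change detection.

Anomaly detection identifies rare events, for example detecting machine failure from the vibration of its parts, or detecting malware from sequences of system calls. Such events are rare compared with normal operation.

Change detection is similar to anomaly detection, except that it works by spotting unusual trends, for example detecting unusual changes in user behavior by watching usage patterns or bank transactions.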

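Sampling methods
Increase the smaller class: oversampling
① Simple random duplication.
② SMOTE. Synthesizes new minority samples within the region the existing minority samples occupy.
Reduce the larger class: undersampling
① Simple random sampling.
② Sampling with boundary cleaning, which tries to preserve the distribution information.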
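
The two simple random variants (item ① under each heading) can be sketched with pandas alone. `random_resample` is a hypothetical helper for a binary label; SMOTE and boundary cleaning need more machinery and are not shown. Resampling would normally be applied to the training split only, which is an assumption rather than something stated above.

```python
import pandas as pd

def random_resample(df, label_col, mode="over", seed=2018):
    """Balance a binary table by random over- or undersampling."""
    counts = df[label_col].value_counts()
    minority, majority = counts.idxmin(), counts.idxmax()
    small = df[df[label_col] == minority]
    large = df[df[label_col] == majority]
    if mode == "over":
        # Repeat minority rows (sampling with replacement) up to the majority size.
        small = small.sample(n=len(large), replace=True, random_state=seed)
    else:
        # Keep a random subset of majority rows the size of the minority class.
        large = large.sample(n=len(small), replace=False, random_state=seed)
    # Shuffle so the classes are not stored as two contiguous blocks.
    combined = pd.concat([small, large]).sample(frac=1, random_state=seed)
    return combined.reset_index(drop=True)

# Hypothetical table with a 10:90 class ratio.
toy = pd.DataFrame({"x": range(100), "status": [1] * 10 + [0] * 90})
over = random_resample(toy, "status", mode="over")
under = random_resample(toy, "status", mode="under")
print(over["status"].value_counts().to_dict())
print(under["status"].value_counts().to_dict())
```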

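Using the ensemble learning idea:
Partition the majority class, combine each part with the minority class to form several smaller training sets, train a learner on each, and then combine the learners.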
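
A sketch of the partitioning and combining steps, leaving out the learners themselves. Both helper names are made up, the number of parts is chosen as the rounded majority-to-minority ratio (an assumption), and majority voting is just one possible way to combine the learners. The labels reproduce the status counts from the notebook.

```python
import numpy as np

def balanced_subsets(y, seed=2018):
    """Split majority-class indices into parts and pair each part with all minority indices."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    rng = np.random.default_rng(seed)
    rng.shuffle(majority_idx)
    # Number of parts: the rounded majority-to-minority ratio, at least 1.
    n_parts = max(1, round(len(majority_idx) / len(minority_idx)))
    return [np.concatenate([part, minority_idx])
            for part in np.array_split(majority_idx, n_parts)]

def majority_vote(predictions):
    """Combine 0/1 predictions from several learners (one row per learner)."""
    return (np.mean(predictions, axis=0) >= 0.5).astype(int)

y = np.array([0] * 3561 + [1] * 1193)
subsets = balanced_subsets(y)
print(len(subsets), [len(s) for s in subsets])

# Three hypothetical learners voting on three samples.
print(majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1]]))
```

Each index array selects the rows for one learner's training set; every majority row is used exactly once, and every minority row is used in all parts.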
