Kaggle: Santander Customer Transaction Prediction


Competition page:
https://www.kaggle.com/c/santander-customer-transaction-prediction

1. Post-Competition Summary

1.1 Learning from Others

1.1.1 List of Fake Samples and Public/Private LB split

https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
The test set and the training set look statistically very similar, yet their unique-value statistics differ a lot. The guess is therefore that the test set was generated by resampling feature values from real samples. On that basis one can identify 100,000 fake examples and 100,000 real examples in the test set. Assuming further that the sampling was done after the public/private LB split, the real examples can be split into 50,000 public + 50,000 private. A rough sketch of the unique-value trick is given below.
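Below is a minimal sketch of how the fake/real split can be detected (my own reconstruction, not the kernel's exact code; the file path and names are assumptions). A test row that contains at least one value occurring only once in its column cannot have been synthesized by resampling values from other rows, so it is treated as real; rows with no unique value in any column are treated as fake.

import numpy as np
import pandas as pd

test_df = pd.read_csv('test.csv')  # assumed path
features = [c for c in test_df.columns if c.startswith('var_')]

has_unique = np.zeros((len(test_df), len(features)), dtype=bool)
for i, var in enumerate(features):
    # values that appear exactly once in this column across the whole test set
    vals, counts = np.unique(test_df[var].values, return_counts=True)
    unique_vals = set(vals[counts == 1])
    has_unique[:, i] = test_df[var].isin(unique_vals).values

real_idx = np.flatnonzero(has_unique.any(axis=1))   # roughly 100,000 real samples
fake_idx = np.flatnonzero(~has_unique.any(axis=1))  # roughly 100,000 fake samples
print(len(real_idx), len(fake_idx))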

1.1.2 giba single model public 0.9245 private 0.9234

https://www.kaggle.com/titericz/giba-single-model-public-0-9245-private-0-9234

>Reverse features
I don't fully understand why negatively correlated features are reversed (presumably so that, once all 200 variables are stacked into a single column, each one relates to the target in the same direction and one shared model can fit them all).

# Reverse features: flip any variable that correlates negatively with the target
for var in features:
    if np.corrcoef( train_df['target'], train_df[var] )[1][0] < 0:
        train_df[var] = train_df[var] * -1
        test_df[var]  = test_df[var]  * -1

>Feature generation
Each original variable is expanded into four features: the raw value, its count (value frequency), a feature_id, and its rank. (I don't quite understand the role of feature_id.) The raw value is also normalized later. Note that what gets built is a (40000000, 4) matrix, i.e. the 200 variables (200,000 rows each) are stacked vertically rather than side by side.

def var_to_feat(vr, var_stats, feat_id ):
    # Build the 4-column representation of a single variable:
    # raw value, value frequency (count), variable id, rank scaled to [0, 1]
    new_df = pd.DataFrame()
    new_df["var"] = vr.values
    new_df["hist"] = pd.Series(vr).map(var_stats)
    new_df["feature_id"] = feat_id
    new_df["var_rank"] = new_df["var"].rank()/200000.
    return new_df.values

# The 200 variables are stacked vertically, so the target is repeated 200 times
TARGET = np.array( list(train_df['target'].values) * 200 )

TRAIN = []
var_mean = {}
var_var  = {}
for var in features:
    tmp = var_to_feat(train_df[var], var_stats[var], int(var[4:]) )
    var_mean[var] = np.mean(tmp[:,0])
    var_var[var]  = np.var(tmp[:,0])
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]  # note: divided by the variance, not the std
    TRAIN.append( tmp )
TRAIN = np.vstack( TRAIN )

del train_df
_=gc.collect()

print( TRAIN.shape, len( TARGET ) )

>LGBM model
Train with an LGBM model:

model = lgb.LGBMClassifier(**{
     'learning_rate': 0.04,
     'num_leaves': 31,
     'max_bin': 1023,
     'min_child_samples': 1000,
     'reg_alpha': 0.1,
     'reg_lambda': 0.2,
     'feature_fraction': 1.0,
     'bagging_freq': 1,
     'bagging_fraction': 0.85,
     'objective': 'binary',
     'n_jobs': -1,
     'n_estimators':200,})

MODELS = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_ )
    model = model.fit( TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set = (TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose = 10,
                      eval_metric='auc',
                      early_stopping_rounds=25,
                      categorical_feature = [2] )  # column 2 is feature_id
    MODELS.append( model )
    
del TRAIN, TARGET
_=gc.collect()

>Prediction
Apply the same feature engineering to the test data and predict with every model for every variable, then take log x - log(1-x) (the logit) of the per-variable probabilities before averaging. Why? Presumably averaging in logit space combines the 200 per-variable predictions multiplicatively (like multiplying odds) instead of averaging probabilities directly.

Why is there also sub['target'] = sub['target'].rank() / 200000. at the end?
The author's answer: "rank or not it produces the same score since the metric is rank based (AUC). I used rank just to normalize to the range [0-1]". A quick sanity check of this claim is sketched after the code below.

from scipy.special import logit  # logit(x) = log(x) - log(1-x); assumed import, as in the original kernel

ypred = np.zeros( (200000,200) )
for feat,var in enumerate(features):
    tmp = var_to_feat(test_df[var], var_stats[var], int(var[4:]) )
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    for model_id in range(10):
        model = MODELS[model_id]
        ypred[:,feat] += model.predict_proba( tmp )[:,1] / 10.  # average over the 10 folds
ypred = np.mean( logit(ypred), axis=1 )  # average the 200 per-variable predictions in logit space

sub = test_df[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.  # rank-normalize to [0, 1]
sub.to_csv('golden_sub.csv', index=False)
print( sub.head(10) )
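
As a small check of the rank question above, here is a toy example (my own, not from the kernel) showing that AUC is unchanged by the final rank transform, since AUC depends only on the ordering of the scores.

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # toy labels
scores = rng.normal(size=1000) + y_true           # toy scores correlated with the labels

ranked = pd.Series(scores).rank() / len(scores)   # same monotone transform as the kernel
print(roc_auc_score(y_true, scores))              # both prints give the same AUC
print(roc_auc_score(y_true, ranked))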

A comment from the kernel's discussion thread (quoted):

I studied your code some more. This is a brilliant solution !! Reversing some variables and stacking all of them into 4 columns is really ingenious. It simulates ideas from an NN convolution where the model can use patterns it learns from one variable to assist in its pattern detection of another variable. This also prevents LGBM from modeling spurious interactions between variables. But it’s more advanced than a convolution (that uses the same weights for all variables) because you provide column 3 which has the original variable’s number (0-199), so your LGBM can customize its prediction for each variable. Lastly you combine everything back together mathematically accurate by using mean logit. Very very nice. Setting the frequency count as a categorical value is a nice touch which allows LGBM to efficiently divide the different distributions. You maximized the modeling ability of an LGBM duplicating other participants’ success with NNs. I am quite impressed !!

To be continued...
