Wutong Cup (梧桐杯), Chongqing Track: Open-Source Code of the Leaderboard-B Runner-Up

starry0001


Article link: https://blog.csdn.net/qq_39158406/article/details/116573374


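Following our earlier open-source release of the Top-1 solution for the finance track, this time we are sharing the code that placed second on leaderboard B of the city track. The code comes from a ChallengeHub team (不是吧asir, Winto, and 初九); thanks to them for open-sourcing it (the full code, nothing held back).
Anyone who took part in this competition knows that the gaps among the top teams on this track were very small and that leaderboards A and B differed noticeably, so the final score depended partly on luck. The code is therefore shared for reference only, mainly as a way to study the feature-engineering ideas. Without further ado, let's get started.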

Competition link: https://js.dclab.run/v2/cmptDetail.html?id=464

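Now straight to the code:

Data loading:

This is standard data loading. To make the business meaning of each feature easier to understand during feature engineering, all columns were renamed to their original (Chinese) names.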

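import pandas as pd
import numpy as np

# Load the labels, the training features, and the leaderboard-B test set
train_label=pd.read_csv('train_label.csv')
test=pd.read_csv('result_predict_B.csv')
train_set=pd.read_csv('train_set.csv')
# Attach the label to each training user
train=pd.merge(train_set,train_label,on='user_id',how='left')
# Mark test rows with label -1 so train and test can be processed together
test['label']=-1
data=pd.concat([train,test],ignore_index=True)
# Rename all columns by position to their original business names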
fea=['用户id','用户号码','性别','年龄','星级','在网时长','细分市场','3月消费','2月消费','1月消费','3月流量','2月流量','1月流量',
     '3月语音','2月语音','1月语音','三个消费平均','三个流量平均','三个语音平均','3月语音超套金额','2月语音超套金额','1月语音超套金额'
    ,'3月流量超套金额','2月流量超套金额','1月流量超套金额','是否本网','是否异网','带宽','是否激活','宽带捆绑标识','终端捆绑标识'
,'话费签约标识','套餐签约标识','用户套餐价值','用户主资费套餐','3月流量饱和度','2月流量饱和度','1月流量饱和度','是否家庭用户','是否产生5G'
     ,'终端类型','是否低消保号用户','是否当月换机','居住地5G','工作地5G','label'
    ]
data.columns=fea
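
Because `data.columns=fea` assigns names purely by position, it may be worth guarding the rename with a length check. The following is a minimal sketch on toy data, not part of the team's code: the helper `rename_by_position`, the toy columns `user_id` and `x1`, and the name `特征1` are all hypothetical.

```python
import pandas as pd

def rename_by_position(df, new_names):
    """Return a copy of df with columns renamed by position, after a length check."""
    if len(new_names) != df.shape[1]:
        raise ValueError(f"expected {df.shape[1]} names, got {len(new_names)}")
    out = df.copy()
    out.columns = list(new_names)
    return out

# Toy stand-ins for the train and test tables (hypothetical columns and values).
train = pd.DataFrame({"user_id": [1, 2], "x1": [10.0, 20.0], "label": [0, 1]})
test = pd.DataFrame({"user_id": [3], "x1": [30.0]})
test["label"] = -1  # same sentinel as in the post: -1 marks test rows

toy = pd.concat([train, test], ignore_index=True)
toy = rename_by_position(toy, ["用户id", "特征1", "label"])
```

A length check does not catch columns that arrive in a different order, so comparing a few renamed columns against the data dictionary by eye is still advisable.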

Feature engineering

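In this competition the team approached feature engineering from two main angles.
1. Interactions between features
LightGBM handles interactions between categorical features well, so the categorical user-profile features are easy for it to pick up. It does not readily learn arithmetic combinations of numeric features, however, such as sums, differences, products, and ratios of the various consumption figures. The code below therefore expands mainly on users' spending amounts and data usage, which are also what 5G users care about most.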

for i in range(1,4):
    m=str(i)
    over=f'{m}月语音超套金额_add_月流量超套金额'
    # Total overage charge: voice overage + data overage
    data[over]=data[f'{m}月语音超套金额']+data[f'{m}月流量超套金额']
    # Monthly spend with the overage charges removed
    data[f'{m}月消费_min_月语音_add_月流量']=data[f'{m}月消费']-data[over]
    # Data usage scaled by the saturation (divided by 100) and by its complement
    data[f'{m}月流量_mu_月流量饱和度']=data[f'{m}月流量']*data[f'{m}月流量饱和度']/100
    data[f'{m}月流量_mu_月流量饱和度2']=data[f'{m}月流量']*(100-data[f'{m}月流量饱和度'])/100
    # Spend above the main tariff plan and above the plan value
    data[f'{m}月超套餐消费']=data[f'{m}月消费']-data['用户主资费套餐']
    data[f'{m}月超套餐消费2']=data[f'{m}月消费']-data['用户套餐价值']
    # Above-plan spend that is not explained by overage charges
    data[f'{m}月额外消费']=data[f'{m}月超套餐消费']-data[over]
    data[f'{m}月额外消费2']=data[f'{m}月超套餐消费2']-data[over]
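
To make the arithmetic concrete, here is a small worked example for month 3 only. The numbers are made up for illustration, and reading `流量饱和度` as a percentage is an assumption suggested by the division by 100 in the loop above.

```python
import pandas as pd

# Toy rows for month 3 (all values are invented for illustration).
toy = pd.DataFrame({
    "3月消费": [100.0, 58.0],
    "3月语音超套金额": [5.0, 0.0],
    "3月流量超套金额": [15.0, 0.0],
    "3月流量": [2000.0, 500.0],
    "3月流量饱和度": [80.0, 25.0],   # assumed to be a percentage
    "用户主资费套餐": [68.0, 58.0],
})

over = toy["3月语音超套金额"] + toy["3月流量超套金额"]  # total overage charge
base_spend = toy["3月消费"] - over                      # spend excluding overage
above_plan = toy["3月消费"] - toy["用户主资费套餐"]      # spend above the plan price
extra = above_plan - over                               # above-plan spend not due to overage
scaled = toy["3月流量"] * toy["3月流量饱和度"] / 100     # data usage scaled by saturation
```

The first user spends 32 above the plan price, of which 20 is overage, leaving 12 of other extra spend; the second user pays exactly the plan price, so every derived value is zero except the scaled usage.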

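2. Cross features between months
The data gives us figures for the current month, the previous month, and the month before that. To judge whether a user is a potential 5G user, we can start from the user's spending trend and behavior and build cross features over the three months of consumption data, as in the code below. (In the notation of the original illustration, F stands for any feature that carries a month prefix, and 1, 2, 3 stand for two months ago, last month, and the current month.)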

f1=['月消费','月流量','月语音','月语音超套金额']
f2=['月消费','月流量','月语音','月语音超套金额','月流量超套金额','月流量饱和度',
   '月语音超套金额_add_月流量超套金额','月消费_min_月语音_add_月流量','月流量_mu_月流量饱和度','月流量_mu_月流量饱和度2'
   ,'月超套餐消费','月超套餐消费2','月额外消费','月额外消费2']
for f in f2:
    m1,m2,m3='1'+f,'2'+f,'3'+f
    # Change of month 3 relative to month 1 and to month 2
    data['{}_min_{}'.format(m3,m1)]=data[m3]-data[m1]
    data['{}_min_{}'.format(m3,m2)]=data[m3]-data[m2]
    # Row-wise maximum and minimum over the three months
    data['max_{}'.format(f)]=data[[m1,m2,m3]].max(axis=1)
    data['min_{}'.format(f)]=data[[m1,m2,m3]].min(axis=1)
    # Three-month mean, computed only for features outside f1
    if f not in f1:
        data['mean_{}'.format(f)]=data[[m1,m2,m3]].mean(axis=1)
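
The same per-feature logic can be tried on a toy frame. This is a sketch only: the helper name `add_month_cross` is hypothetical, the values are made up, and unlike the loop above it always computes the mean.

```python
import pandas as pd

def add_month_cross(df, f):
    """Add month-3 differences and row-wise max/min/mean for the feature suffix f."""
    m1, m2, m3 = "1" + f, "2" + f, "3" + f
    out = df.copy()
    out[f"{m3}_min_{m1}"] = out[m3] - out[m1]          # month 3 minus month 1
    out[f"{m3}_min_{m2}"] = out[m3] - out[m2]          # month 3 minus month 2
    out[f"max_{f}"] = out[[m1, m2, m3]].max(axis=1)    # largest of the three months
    out[f"min_{f}"] = out[[m1, m2, m3]].min(axis=1)    # smallest of the three months
    out[f"mean_{f}"] = out[[m1, m2, m3]].mean(axis=1)  # three-month average
    return out

toy = pd.DataFrame({
    "1月流量": [1.0, 9.0],
    "2月流量": [2.0, 6.0],
    "3月流量": [6.0, 3.0],
})
res = add_month_cross(toy, "月流量")
```

The first toy user has growing usage (positive differences), the second a shrinking one (negative differences), which is the kind of trend signal these features are meant to expose.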

Next come a few less important, speculative features. They gave an almost negligible gain on leaderboard A and were not verified on leaderboard B.

data['套餐价值_主资费套餐']=data['用户套餐价值']-data['用户主资费套餐']
data['套餐价值=主资费套餐']=(data['用户套餐价值']==data['用户主资费套餐']).astype(int)
data['用户号码1']=data['用户号码'].apply(lambda x:str(x)[0:2])
data['用户id1']=data['用户id'].apply(lambda x:str(x)[0:3])
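
For integer-valued columns, the two prefix features can also be written with the vectorized `.str` accessor instead of `apply`. A minimal sketch on made-up numbers (the phone numbers and ids below are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "用户号码": [13912345678, 18887654321],
    "用户id": [100234, 205999],
})

# Same idea as apply(lambda x: str(x)[0:2]): keep the leading characters as a string.
toy["用户号码1"] = toy["用户号码"].astype(str).str[:2]
toy["用户id1"] = toy["用户id"].astype(str).str[:3]
```

Both new columns hold strings, so they can later be cast to the `category` dtype like the other categorical features.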

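After feature engineering, some features were also dropped: for judging whether a user is a potential 5G user, the third month's consumption and the cross-month information matter more, while the month-1 and month-2 consumption features are not very important. (Dropping them likewise gave a negligible gain on leaderboard A and was not verified on leaderboard B.)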

# Month-1 and month-2 versions of every feature in f2 are excluded from training
nofea=[str(m)+f for f in f2 for m in (1,2)]
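
How the exclusion list then filters the training columns can be seen on a shortened example. The lists `f2_toy` and `all_cols` below are hypothetical stand-ins, not the real column set.

```python
# Shortened, hypothetical stand-in for f2.
f2_toy = ["月消费", "月流量"]

# Month-1 and month-2 columns to drop, in the same order as the loop above.
nofea_toy = [str(m) + f for f in f2_toy for m in (1, 2)]

all_cols = ["用户id", "1月消费", "2月消费", "3月消费",
            "1月流量", "2月流量", "3月流量", "max_月消费", "label"]
excluded = {"用户id", "label"}  # identifier and target are never used as features

kept = [c for c in all_cols if c not in excluded and c not in nofea_toy]
```

Only the month-3 columns and the cross-month aggregates survive, which matches the reasoning given above for dropping the earlier months.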

Modeling

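The model is LightGBM, which everyone knows, and it is the only algorithm used. Because this task shows a fair amount of variance and the gaps among the top teams are very small, two sets of parameters with multiple seeds each were used to stabilize the result. The code is as follows: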

L=['性别','星级','细分市场','终端类型','用户套餐价值','用户主资费套餐']+['用户号码1','用户id1']
# Cast the categorical columns to str first; they are converted to 'category' dtype below,
# where the train/test split is also built
for col in L:
    data[col]=data[col].astype(str)

import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
parameters = {
    'learning_rate': 0.02,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 63,
    'feature_fraction': 0.85,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'seed': 2020,
    'bagging_seed': 1,
    'feature_fraction_seed': 20201101,
    'min_data_in_leaf': 50,
    'n_jobs': -1, 
    'verbose': -1, }
    
cate_fea=['性别','星级','细分市场','终端类型','用户套餐价值','用户主资费套餐']+['用户号码1','用户id1']#,'用户号码2','用户号码3']
for col in cate_fea:
    data[col]=data[col].astype('category')
col=[i for i in data.columns if i not in ['用户id','用户号码','label','是否激活'] and i not in nofea]
X_train=data[data['label']!=-1].reset_index(drop=True)
X_testp=data[data['label']==-1].reset_index(drop=True)

y=X_train.label
X_train=X_train[col]
X_test=X_testp[col]
oof1 = np.zeros(X_train.shape[0])
prediction1 = np.zeros(X_test.shape[0])
seeds = [5, 4, 2, 209 * 2 + 1024, 4096, 2048]
num_model_seed = len(seeds)
num_cv= 10
## Model 1
for model_seed in range(num_model_seed):
    print('--' * 20 + str(seeds[model_seed]) + '--' * 20)
    oof_cat = np.zeros(X_train.shape[0])
    prediction_cat = np.zeros(X_test.shape[0])
    skf = StratifiedKFold(n_splits=num_cv, random_state=seeds[model_seed], shuffle=True)
    for index, (train_index, test_index) in enumerate(skf.split(X_train, y)):
        train_x, test_x, train_y, test_y = X_train.iloc[train_index], X_train.iloc[
            test_index], y.iloc[train_index], y.iloc[test_index]
        dtrain = lgb.Dataset(train_x, label=train_y)
        dval = lgb.Dataset(test_x, label=test_y)
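        lgb_model = lgb.train(
            parameters,
            dtrain,
            num_boost_round=10000,
            valid_sets=[dtrain,dval],
            # LightGBM 4.0 removed the early_stopping_rounds / verbose_eval arguments;
            # the equivalent callbacks are available from version 3.3 onward
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(200)],
            #feval=myeval,
        )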
        x2 = lgb_model.predict(test_x, num_iteration=lgb_model.best_iteration)
        y2 = lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
        oof_cat[test_index] += x2
        prediction_cat += y2 / num_cv
    print(roc_auc_score(y,oof_cat))

    oof1 += oof_cat / num_model_seed
    prediction1 += prediction_cat / num_model_seed
print(roc_auc_score(y,oof1))

## Model 2 (same setup with a lower learning rate and a different seed list)
parameters = {
    'learning_rate': 0.01,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 63,
    'feature_fraction': 0.85,
    'bagging_fraction': 0.85,
    'bagging_freq': 5,
    'seed': 2020,
    'bagging_seed': 1,
    'feature_fraction_seed': 20201101,
    'min_data_in_leaf': 50,
    'n_jobs': -1, 
    'verbose': -1, }
oof2 = np.zeros(X_train.shape[0])
prediction2 = np.zeros(X_test.shape[0])
seeds = [1024,123,234,345,456,567,678,789]
num_model_seed = len(seeds)
num_cv= 10
for model_seed in range(num_model_seed):
    print('--' * 20 + str(seeds[model_seed]) + '--' * 20)
    oof_cat = np.zeros(X_train.shape[0])
    prediction_cat = np.zeros(X_test.shape[0])
    skf = StratifiedKFold(n_splits=num_cv, random_state=seeds[model_seed], shuffle=True)
    for index, (train_index, test_index) in enumerate(skf.split(X_train, y)):
        train_x, test_x, train_y, test_y = X_train.iloc[train_index], X_train.iloc[
            test_index], y.iloc[train_index], y.iloc[test_index]
        dtrain = lgb.Dataset(train_x, label=train_y)
        dval = lgb.Dataset(test_x, label=test_y)
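        lgb_model = lgb.train(
            parameters,
            dtrain,
            num_boost_round=10000,
            valid_sets=[dtrain,dval],
            # LightGBM 4.0 removed the early_stopping_rounds / verbose_eval arguments;
            # the equivalent callbacks are available from version 3.3 onward
            callbacks=[lgb.early_stopping(50), lgb.log_evaluation(200)],
            #feval=myeval,
        )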
        x2 = lgb_model.predict(test_x, num_iteration=lgb_model.best_iteration)
        y2 = lgb_model.predict(X_test, num_iteration=lgb_model.best_iteration)
        oof_cat[test_index] += x2
        prediction_cat += y2 / num_cv
    print(roc_auc_score(y,oof_cat))

    oof2 += oof_cat / num_model_seed
    prediction2 += prediction_cat / num_model_seed
print(roc_auc_score(y,oof2))
# Equal-weight blend of the two models (test predictions and out-of-fold predictions)
pred_union1=prediction1/2+prediction2/2
oof5=oof1/2+oof2/2
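
The bookkeeping shared by both models (out-of-fold predictions filled fold by fold, test predictions averaged over folds and then over seeds, and a final equal-weight blend) can be sketched without LightGBM. This is an illustration under stated assumptions, not the team's code: `seed_averaged_oof` and `mean_model` are hypothetical names, the stand-in model simply predicts the positive rate of its training fold, and the folds are plain shuffled splits rather than the stratified ones used above.

```python
import numpy as np

def seed_averaged_oof(X, y, X_test, seeds, n_splits, fit_predict):
    """Average out-of-fold and test predictions over several shuffled K-fold splits."""
    oof = np.zeros(len(X))
    pred = np.zeros(len(X_test))
    for seed in seeds:
        rng = np.random.default_rng(seed)
        # Plain shuffled folds (not stratified, unlike StratifiedKFold in the post).
        folds = np.array_split(rng.permutation(len(X)), n_splits)
        oof_seed = np.zeros(len(X))
        pred_seed = np.zeros(len(X_test))
        for k, val_idx in enumerate(folds):
            trn_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
            val_pred, test_pred = fit_predict(X[trn_idx], y[trn_idx], X[val_idx], X_test)
            oof_seed[val_idx] = val_pred        # each row is predicted once per seed
            pred_seed += test_pred / n_splits   # average test predictions over folds
        oof += oof_seed / len(seeds)            # average over seeds
        pred += pred_seed / len(seeds)
    return oof, pred

def mean_model(X_trn, y_trn, X_val, X_test):
    """Stand-in 'model': predicts the training-fold positive rate for every row."""
    p = y_trn.mean()
    return np.full(len(X_val), p), np.full(len(X_test), p)

# Toy data: 20 rows with balanced labels, 5 test rows.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.array([0, 1] * 10)
X_test = np.zeros((5, 1))

oof_a, pred_a = seed_averaged_oof(X, y, X_test, seeds=[1, 2, 3], n_splits=5, fit_predict=mean_model)
oof_b, pred_b = seed_averaged_oof(X, y, X_test, seeds=[4, 5], n_splits=5, fit_predict=mean_model)
blend = pred_a / 2 + pred_b / 2  # equal-weight blend of the two runs
```

Swapping `mean_model` for a function that trains a real model on each fold gives the same structure as the two loops above; averaging over seeds mainly reduces the variance that comes from how the folds happen to be drawn.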