2022科大讯飞商品销量智能预测挑战赛—参赛总结

最新推荐文章于 2022-12-21 10:47:00 发布

七里香还是稻香

最新推荐文章于 2022-12-21 10:47:00 发布

阅读量2.9k

点赞数 5

分类专栏：那些年打过的比赛文章标签：科大讯飞商品销量智能预测挑战赛比赛总结

本文链接：https://blog.csdn.net/sinat_39629323/article/details/126500790

版权

那些年打过的比赛专栏收录该内容

5 篇文章 2 订阅

订阅专栏

2022科大讯飞商品销量智能预测挑战赛—参赛总结

摘要

比赛网址：https://challenge.xfyun.cn/topic/info?type=product-sales
赛题任务：本次大赛提供了商品销量历史数据作为训练样本，参赛选手需基于提供的样本构建模型，预测商品未来三个月的销售量。初赛提供了2018年2月1日至2020年12月31日的若干商品历史销量数据和订单数据，预测其2021年1月至3月的销量数据。决赛提供了2018年2月1日至2021年3月31日的若干商品历史销量数据和订单数据，预测其2021年4月至6月的销量数据。相当于复赛时提供了初赛测试集的标签，这一点在复赛时可以利用起来，能够节省不少提交次数。而且赛后也可以继续研究。
数据分析：提供的数据集的销量是细分到每一天的，但预测是要细分到月份的，因此，可以有两种角度去预测销量。第一是按天进行预测，但是有些特征是没有提供细分到天的数据的，特征相对按月预测会少一点；第二种是按月预测，这就需要先将销量按月汇总起来构建标签。最后我是选择了第二种。
特征构建：主要用到了lag和rolling聚合的mean、std、median特征。
模型选择：采用比赛常用模型：lightgbm、xgboost和catboost分别构建回归模型，最后并用TPE搜索最优参数组合。
训练方案：每个模型均使用5折交叉验证，以product_id均衡划分训练集和验证集，对测试集的预测值取平均即得到该模型最终的预测值。
模型融合：对三个5折交叉验证后的模型结果，采用stacking进行模型融合。
赛题得分：初赛最终成绩：0.70854分，排名：第71名；复赛最终成绩：0.74399分，排名：第46名。
总结结论：这是我参加比赛以来第二次进入复赛阶段的比赛，很有纪念意义，虽然跟前排大佬还有很大的差距，但是从中也学习到了很多知识，有机会下次再来~

赛题任务

特征构建

并没有构建过多的特征，因为我发现即使构建更多的lag或者rolling特征，也不会有提升，甚至是降分的。期间有通过根据特征重要性、shap、与标签的相关性来筛选特征，但筛选后的效果也不是很好。一直都没有找到比较强的特征。下面是部分代码：

feats_cols=[]
    
for i in range(1, 17):
    for f in ['label', 'order', 'start_stock', 'end_stock']:
        data[f+'_shift_%d'%i] = data.groupby('product_id')[f].shift(i+2)
        if i <= 12:
            feats_cols.append(f+'_shift_%d'%i)

for i in [3, 6, 12]:
    for f in ['label', 'order', 'start_stock', 'end_stock']:
        data[f+'_mean_%d'%i] = data[[f+'_shift_%d'%i for i in range(1, i+1)]].mean(axis=1)
        data[f+'_std_%d'%i] = data[[f+'_shift_%d'%i for i in range(1, i+1)]].std(axis=1)
        data[f+'_median_%d'%i] = data[[f+'_shift_%d'%i for i in range(1, i+1)]].median(axis=1)
        feats_cols.extend([f+'_mean_%d'%i,f+'_std_%d'%i,f+'_median_%d'%i])

训练方案

使用StratifiedKFold按product_id均衡划分训练集和验证集进行5折交叉验证，期间对比过使用KFold来划分，实验验证StratifiedKFold要更胜一筹，分别训练lightgbm、xgboost和catboost三个模型。训练代码如下：

def train_model_with_nfold(data,train,test,feat_cols,feats_cols,category_cols,
    lgb_params=None,xgb_params=None,cat_params=None,model_types=['lgb','xgb','cat'],fold_num=5,
    seeds=[2022],stratified=True,num_boost_round=10000,early_stopping_rounds=200,verbose=200,
    un_select_cols=[]
    ):

    score_lgb = np.zeros(fold_num)
    score_xgb = np.zeros(fold_num)
    score_cat = np.zeros(fold_num)
    score = np.zeros(fold_num)
    oof_lgb = np.zeros(len(train))
    oof_xgb = np.zeros(len(train))
    oof_cat = np.zeros(len(train))
    oof = np.zeros(len(train))
    pred_y = pd.DataFrame()

    for seed in seeds:
        for model_type in model_types:
            if stratified:
                kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
            else:
                kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
            if model_type == 'cat':
                feat_cols = [col for col in feats_cols if col not in un_select_cols+['label']]
                np.random.shuffle(feat_cols)
            if model_type == 'xgb':
                if 1001 in train['product_id'].values:
                    LE=LabelEncoder()
                    for col in category_cols:
                        LE.fit(data[col].astype(str))
                        train[col]=LE.transform(train[col].astype(str))
                        test[col]=LE.transform(test[col].astype(str))
            else:
                train = data[data['label'].notna()].reset_index(drop=True)
                test = data[data['label'].isna()].reset_index(drop=True)
            for fold, (train_idx, val_idx) in enumerate(kf.split(train[feat_cols], train['product_id'])):
                print(f'-----------------fold：{fold+1} -----------------seed：{seed} -----------------model_type：{model_type}')
                if model_type == 'lgb':
                    lgb_params['seed']=seed
                    tra = lgb.Dataset(train.loc[train_idx, feat_cols],train.loc[train_idx, 'label'])
                    val = lgb.Dataset(train.loc[val_idx, feat_cols],train.loc[val_idx, 'label'])
                    model = lgb.train(lgb_params, tra, valid_sets=[val], num_boost_round=num_boost_round,categorical_feature=category_cols,
                                    callbacks=[lgb.early_stopping(early_stopping_rounds), lgb.log_evaluation(verbose)])

                    score_lgb[fold]=model.best_score['valid_0']['rmse'] / len(seeds)
                    score[fold]=model.best_score['valid_0']['rmse'] / len(seeds) / len(model_types)
                    oof_lgb[val_idx] += model.predict(train.loc[val_idx, feat_cols], num_iteration=model.best_iteration) / len(seeds)
                    oof[val_idx] += model.predict(train.loc[val_idx, feat_cols], num_iteration=model.best_iteration) / len(seeds) / len(model_types)
                    pred_y[f'fold{fold}_seed{seed}_{model_type}'] = model.predict(test[feat_cols], num_iteration=model.best_iteration)
                elif model_type == 'xgb':
                    xgb_params['seed']=seed
                    train_matrix = xgb.DMatrix(train.loc[train_idx, feat_cols] , label=train.loc[train_idx, 'label'])
                    valid_matrix = xgb.DMatrix(train.loc[val_idx, feat_cols] , label=train.loc[val_idx, 'label'])
                    watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
                    model = xgb.train(xgb_params, train_matrix, num_boost_round=num_boost_round, evals=watchlist, verbose_eval=verbose, early_stopping_rounds=early_stopping_rounds)

                    score_xgb[fold]=model.best_score / len(seeds)
                    score[fold]=model.best_score / len(seeds) / len(model_types)
                    oof_xgb[val_idx] += model.predict(valid_matrix,iteration_range=(0,model.best_iteration+1)) / len(seeds)
                    oof[val_idx] += model.predict(valid_matrix,iteration_range=(0,model.best_iteration+1)) / len(seeds) / len(model_types)
                    pred_y[f'fold{fold}_seed{seed}_{model_type}'] = model.predict(xgb.DMatrix(test[feat_cols]),iteration_range=(0,model.best_iteration+1))
                else:
                    cat_params['random_seed']=seed
                    model = cat.CatBoostRegressor(num_boost_round=num_boost_round, **cat_params)
                    trn_x = train.loc[train_idx, feat_cols]
                    trn_y = train.loc[train_idx, 'label']
                    val_x = train.loc[val_idx, feat_cols]
                    val_y = train.loc[val_idx, 'label']
                    model.fit(trn_x, trn_y, eval_set=(val_x, val_y),cat_features=category_cols, use_best_model=True, verbose=verbose)
                    
                    score_cat[fold] = model.best_score_['validation']['RMSE'] / len(seeds)
                    score[fold] = model.best_score_['validation']['RMSE'] / len(seeds) / len(model_types)
                    oof_cat[val_idx] += model.predict(val_x) / len(seeds)
                    oof[val_idx] += model.predict(val_x) / len(seeds) / len(model_types)
                    pred_y[f'fold{fold}_seed{seed}_{model_type}'] = model.predict(test[feat_cols])

    print(f'score_lgb={score_lgb.mean()}\nscore_xgb={score_xgb.mean()}\nscore_cat={score_cat.mean()}\nscore={score.mean()}')

    return oof_lgb,oof_xgb,oof_cat,oof,pred_y

可以看到，加入了随机数种子seed的循环，但其实只用了一个2022，因为试验过多个seed的情况，效果不升反降，所以就没采用了；
还有，针对xgboost模型，如果是类别特征，最好是将它们进行LabelEncoder编码之后再输入到模型中训练，因为xgboost不支持处理类别特征，并且，在我的试验当中，即使是像本赛题中的product_id等数值型类别特征，编码与不编码，分数分别是0.654与0.577，相差还是很大的；
还有，针对catboost模型，发现输入特征的排列顺序对结果的预测有非常大的影响（对lgb和xgb的影响相对较小），所以用shuffle对特征进行随机排序，并对排序后所预测出的结果是否与其他两个模型的预测结果相当，而选择合适的shuffle次数后的特征顺序，作为catboost模型的特征输入。

模型融合

一开始的模型融合都是采用简单的算术平均融合，后来试了下stacking融合，确实效果要更好一点点（好4个千分点左右）。模型融合可谓是竞赛中最后的杀手锏，初赛中，lightgbm、xgboost和catboost三个模型最优的线上分数分别为：0.68144，0.69092，0.66796。采用stacking融合后，分数为0.70854，融合代码如下：

def RidgeCV(train,oof_lgb,oof_xgb,oof_cat,pred_y,fold_num=5,seed=2022):
    oof=pd.DataFrame([oof_lgb,oof_xgb,oof_cat]).T
    oof.columns=['lgb','xgb','cat']
    oof['product_id']=train['product_id']
    oof['label']=train['label']

    y=pd.DataFrame([pred_y[[i for i in pred_y.columns if 'lgb' in i]].mean(1),pred_y[[i for i in pred_y.columns if 'xgb' in i]].mean(1),pred_y[[i for i in pred_y.columns if 'cat' in i]].mean(1)]).T
    y.columns=['lgb','xgb','cat']

    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    oof_res=np.zeros(len(oof))
    prediction=np.zeros(len(y))
    for fold, (train_idx, val_idx) in enumerate(kf.split(oof[['lgb','xgb','cat']], oof['product_id'])):
        train_x=oof.loc[train_idx,['lgb','xgb','cat']]
        train_y=oof.loc[train_idx,'label']
        
        valid_x=oof.loc[val_idx,['lgb','xgb','cat']]
        valid_y=oof.loc[val_idx,'label']
        
        model = Ridge(random_state=2022)
        model.fit(train_x,train_y)
        
        oof_res[val_idx]=model.predict(valid_x)
        prediction+=model.predict(y)/fold_num

    return oof_res,prediction

赛题得分

初赛

baseline：https://github.com/datawhalechina/competition-baseline/blob/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B2022/%E5%95%86%E5%93%81%E9%94%80%E9%87%8F%E6%99%BA%E8%83%BD%E9%A2%84%E6%B5%8B%E6%8C%91%E6%88%98%E8%B5%9B_baseline.ipynb 线上得分为：0.41662
lightgbm：将baseline的特征采用我上述的代码构建之后，线上分数为0.65769；然后随机一次特征顺序后，为0.66355；再经过TPE参数搜索之后，为0.68144。
xgboost：将baseline的特征采用我上述的代码构建之后，线上分数为0.57729；然后将类别特征经过编码后，为0.65408；再经过TPE参数搜索之后，为0.69092，上分显著！
catboost：将baseline的特征采用我上述的代码构建之后，并且再随机一次特征顺序即两次后，线上分数为0.66796。期间也尝试了TPE参数搜索，并没有找到且很难找到最优的参数组合，原因可能是特征顺序对其影响本来就很大，所以不确定的因素太多了，参数的搜索也就更加困难了。
最后将三个模型最优的线上分数的结果，采用stacking融合，得到初赛最后的分数为：0.70854，排名：71/507（再一次成功晋级复赛~）