KDD Cup 2020 Recommender System Track: An Introduction (with a 0.25 Baseline)

工藤旧一

Article link: https://blog.csdn.net/weixin_45459911/article/details/106148695

Understanding the KDD Debiasing Task

I. What problem does it address?

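A. Big picture: tackle the data coordination, balancing, and mining challenges that recommender systems face in multimodal data settings (new retail scenarios such as short videos, Weibo, and celebrity-endorsed selling).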

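B. Specific problem: counter the Matthew effect in today's recommender systems (exposure becoming narrower and narrower). A winning solution must predict well on items that rarely appear in the historical data, so removing the bias in the click data is essential. To better match the real-world setup of offline training and online testing, the training and test sets also differ somewhat, mainly in trends and item popularity. The timestamps in the train and test data span many periods, including some sales campaigns (the resulting shift may need to be removed).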

II. What does the data look like?

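Data format:
1. item_id: unique identifier of the item
2. txt_vec: text feature of the item, a 128-dimensional real-valued vector produced by a pre-trained model
3. img_vec: image feature of the item, a 128-dimensional real-valued vector produced by a pre-trained model
4. user_id: unique identifier of the user
5. time: timestamp of the click event, i.e. (unix_timestamp - random_number_1) / random_number_2
6. user_age_level: age bracket of the user
7. user_gender: gender of the user; may be empty
8. user_city_level: tier of the city where the user lives

The data covers more than 10 days and includes one sales campaign.
It involves more than 1 million clicks, 100k items, and 30k users. The total size of the dataset is around 500MB.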
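
Because the time field is a shifted and scaled version of the Unix timestamp, its absolute value is not meaningful, but its ordering should still be usable as long as random_number_2 is positive (an assumption; the post does not say so). A minimal sketch of turning click rows into per-user click sequences follows. The column names match the ones used in the baseline later in this post, while `build_click_sequences` and the toy rows are hypothetical and only for illustration.

```python
import pandas as pd


def build_click_sequences(clicks: pd.DataFrame) -> dict:
    """Return {user_id: [item_id, ...]} with each user's items ordered by click time."""
    # Sort by user first, then by the (transformed) time within each user.
    ordered = clicks.sort_values(["user_id", "time"], kind="stable")
    sequences = {}
    for user_id, item_id in zip(ordered["user_id"], ordered["item_id"]):
        sequences.setdefault(int(user_id), []).append(int(item_id))
    return sequences


# Toy rows standing in for a click file; the values are made up.
toy_clicks = pd.DataFrame(
    {
        "user_id": [1, 2, 1, 1, 2],
        "item_id": [10, 20, 30, 40, 50],
        "time": [0.98, 0.97, 0.95, 0.99, 0.96],
    }
)
sequences = build_click_sequences(toy_clicks)
print(sequences)
```

The last element of each sequence is the user's most recent observed click, which is the natural context for predicting the next one.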

III. What is predicted?
“Query_time is the timestamp when the user clicks the next item. The task of this competition is to predict the next item clicked by each user that appeared in underexpose_test_qtime-T.csv. In particular, the participants need to recommend fifty items for each user. The participants will receive a positive score if any one of the fifty recommended items matches the ground-truth click. We ensure that the ground-truth next item is in underexpose_item_feat.csv. However, there can be zero observed clicks in the training data, albeit not very likely.”
The Tianchi organizers release the data passwords according to the timeline of the competition's nine phases. Following the required timeline (that is, within the seven days before the competition moves on to phase T+1), participants submit predictions for all phases up to and including phase T. Once phase T is released, its training data may be used to retrain and redo the predictions for phase 0 through phase T-1. Roughly 2/3 of the data overlaps between phase T and phase T-1.

IV. How to submit?
Following the required timeline, submit predictions for all phases up to and including phase T. The most recent one is phase 7: its data is released at 2020-05-22 23:59:59, and results must be submitted before 2020-05-29 23:59:59.
In other words, the phase 7 submission requires training on and predicting for every phase so far (phase 0 through phase 7).
“Filename of the submission: underexpose_submit-T.csv. The submitted file should be a CSV file with 51 columns. There is no need to include the headers, i.e., the names of the columns. The 51 columns of the submitted file should be:user_id, item_id_01, item_id_02, …, item_50. The ordering of these fifty items matters. Please place the items that are deemed most likely to be clicked by the user in the front. We have ensured that each user_id will not appear in more than one phase. Therefore, you don’t need to specify which phase each row of your submission is for.”
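
A quick format check before uploading can catch the most common mistakes. The sketch below is not an official validator: `check_submission` is a hypothetical helper that only checks what the quoted rules state (51 columns, no header row), plus two assumptions of my own, namely that ids are non-negative integers and that a user's fifty items should not repeat.

```python
import csv
import io


def check_submission(csv_text: str, top_n: int = 50) -> list:
    """Return a list of problems found; an empty list means the text passed these checks."""
    problems = []
    seen_users = set()
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        # Rule from the competition text: user_id plus fifty item ids per row.
        if len(row) != top_n + 1:
            problems.append(f"line {line_no}: expected {top_n + 1} columns, got {len(row)}")
            continue
        # Assumption: ids are non-negative integers, so a header row fails this test.
        if not all(cell.strip().isdigit() for cell in row):
            problems.append(f"line {line_no}: non-integer value (header row?)")
            continue
        user_id, items = row[0], row[1:]
        if user_id in seen_users:
            problems.append(f"line {line_no}: duplicate user_id {user_id}")
        seen_users.add(user_id)
        # Assumption: repeating an item wastes a slot, so flag it.
        if len(set(items)) != len(items):
            problems.append(f"line {line_no}: repeated item_id for user {user_id}")
    return problems


good_row = ",".join(str(v) for v in [7] + list(range(100, 150)))
header_row = "user_id," + ",".join(f"item_id_{i:02d}" for i in range(1, 51))
print(check_submission(good_row))
print(check_submission(header_row + "\n" + good_row))
```

The first call reports no problems, while the second flags the header line.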

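V. How is it evaluated?
There are four metrics in total:
(1) ndcg_50_full: the sum of the NDCG of the predicted clicks over all users
(2) ndcg_50_half: the same NDCG sum, but restricted to unpopular items (those in the bottom 50% by total click count), in order to measure click prediction on unpopular items
(3) hitrate_50_full: recommendation accuracy over all users. For example, if a user clicked item A and A is among the top 50 recommended items, that user counts as a hit; the number of hit users divided by the total number of users can be viewed as a precision.
(4) hitrate_50_half: same as (3), but only over users who clicked an unpopular item, i.e. (3) with a filter applied to the users.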
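
To make the four metrics concrete, here is a rough sketch under stated assumptions; the official evaluation code linked at the end of this post is the authoritative version. With a single ground-truth item per user, NDCG@50 reduces to 1 / log2(rank + 2) for a 0-based rank inside the top 50 and 0 otherwise. The function `evaluate`, the per-user averaging, and passing the rare items in as a precomputed set are all choices made for this illustration.

```python
import math


def evaluate(predictions: dict, answers: dict, rare_items: set, top_n: int = 50) -> dict:
    """predictions: {user: [item, ...]}, answers: {user: true next item}."""
    sums = {"ndcg_full": 0.0, "hit_full": 0.0, "ndcg_half": 0.0, "hit_half": 0.0}
    n_full = 0
    n_half = 0
    for user, truth in answers.items():
        ranked = predictions.get(user, [])[:top_n]
        gain, hit = 0.0, 0.0
        if truth in ranked:
            rank = ranked.index(truth)          # 0-based position of the hit
            gain = 1.0 / math.log2(rank + 2)    # single-item NDCG
            hit = 1.0
        n_full += 1
        sums["ndcg_full"] += gain
        sums["hit_full"] += hit
        # "half" metrics only look at users whose true item is rare.
        if truth in rare_items:
            n_half += 1
            sums["ndcg_half"] += gain
            sums["hit_half"] += hit
    return {
        "ndcg_50_full": sums["ndcg_full"] / n_full if n_full else 0.0,
        "hitrate_50_full": sums["hit_full"] / n_full if n_full else 0.0,
        "ndcg_50_half": sums["ndcg_half"] / n_half if n_half else 0.0,
        "hitrate_50_half": sums["hit_half"] / n_half if n_half else 0.0,
    }


# Toy data: user 1 hits at rank 0, user 2 hits at rank 1, user 3 misses.
toy_predictions = {1: [10, 20, 30], 2: [40, 50], 3: [60]}
toy_answers = {1: 10, 2: 50, 3: 99}
toy_scores = evaluate(toy_predictions, toy_answers, rare_items={50, 99})
print(toy_scores)
```

Because the gain decays with rank, placing the true item first is worth 1.0, while placing it second is worth about 0.63, which is why the ordering of the fifty items matters.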

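My understanding is that the test data of phases 0-6 is the "test data A" that the timeline describes as being used to improve the model.
The test data of phases 7-9 is the "test data B" used for final scoring. After the 9 phases end, teams in the top 10% by NDCG@50-full qualify, and are then compared on their NDCG@50-rare scores.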

VI. Baseline: based on Item_CF

import pandas as pd
from tqdm import tqdm
from collections import defaultdict
import math

# NOTE: get_sim_item() and recommend() are called below but are not included in this post;
# they must be defined before this script will run.

# Pad every user's recommendation list up to 50 items with the most popular items.
def get_predict(df, pred_col, top_fill):
    top_fill = [int(t) for t in top_fill.split(',')]
    # Negative scores keep the filler items below any positively scored recommendation.
    scores = [-1 * i for i in range(1, len(top_fill) + 1)]
    ids = list(df['user_id'].unique())
    fill_df = pd.DataFrame(ids * len(top_fill), columns=['user_id'])
    fill_df.sort_values('user_id', inplace=True)
    fill_df['item_id'] = top_fill * len(ids)
    fill_df[pred_col] = scores * len(ids)
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    df = pd.concat([df, fill_df], ignore_index=True)
    df.sort_values(pred_col, ascending=False, inplace=True)
    df = df.drop_duplicates(subset=['user_id', 'item_id'], keep='first')
    df['rank'] = df.groupby('user_id')[pred_col].rank(method='first', ascending=False)
    df = df[df['rank'] <= 50]
    # One row per user: user_id followed by the 50 item ids in ranked order.
    df = df.groupby('user_id')['item_id'].apply(lambda x: ','.join([str(i) for i in x])).str.split(',', expand=True).reset_index()
    return df

now_phase = 4
train_path = './data/underexpose_train'
test_path = './data/underexpose_test'
recom_item = []

whole_click = pd.DataFrame()
for c in range(now_phase + 1):
    print('phase:', c)
    click_train = pd.read_csv(train_path + '/underexpose_train_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time'])
    click_test = pd.read_csv(test_path + '/underexpose_test_click-{}.csv'.format(c), header=None, names=['user_id', 'item_id', 'time'])

    # Accumulate the clicks of every phase seen so far; phases overlap, so drop duplicates.
    all_click = pd.concat([click_train, click_test], ignore_index=True)
    whole_click = pd.concat([whole_click, all_click], ignore_index=True)
    whole_click = whole_click.drop_duplicates(subset=['user_id', 'item_id', 'time'], keep='last')
    whole_click = whole_click.sort_values('time')

    item_sim_list, user_item = get_sim_item(whole_click, 'user_id', 'item_id', use_iif=False)

    # Recommend only for the users of the current phase's test set.
    for i in tqdm(click_test['user_id'].unique()):
        rank_item = recommend(item_sim_list, user_item, i, 500, 500)
        for j in rank_item:
            recom_item.append([i, j[0], j[1]])

# find most popular items
top50_click = whole_click['item_id'].value_counts().index[:50].values
top50_click = ','.join([str(i) for i in top50_click])

recom_df = pd.DataFrame(recom_item, columns=['user_id', 'item_id', 'sim'])
result = get_predict(recom_df, 'sim', top50_click)
result.to_csv('baseline.csv', index=False, header=False)
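
The script above calls get_sim_item and recommend, but this post does not show them. The sketch below is one plausible way to fill them in, assuming a standard item-based collaborative filter: co-click counts normalized by the square root of the two items' popularity, with an optional inverse-user-frequency weight for use_iif. The signatures are inferred from the call sites; the bodies are my assumption and may differ from the original baseline.

```python
import math
from collections import defaultdict

import pandas as pd


def get_sim_item(df, user_col, item_col, use_iif=False):
    """Return (item-item similarity dict, {user: set of clicked items})."""
    user_items = defaultdict(set)
    for user, item in zip(df[user_col], df[item_col]):
        user_items[user].add(item)

    co_counts = defaultdict(lambda: defaultdict(float))
    item_cnt = defaultdict(int)
    for items in user_items.values():
        # With use_iif, very active users contribute less to each co-click pair.
        weight = 1.0 / math.log(1 + len(items)) if use_iif else 1.0
        for i in items:
            item_cnt[i] += 1
            for j in items:
                if i != j:
                    co_counts[i][j] += weight

    # Cosine-style normalization by the popularity of both items.
    sim = {
        i: {j: c / math.sqrt(item_cnt[i] * item_cnt[j]) for j, c in related.items()}
        for i, related in co_counts.items()
    }
    return sim, dict(user_items)


def recommend(sim, user_items, user_id, top_k, item_num):
    """Score unseen items by their summed similarity to the user's clicked items."""
    seen = user_items.get(user_id, set())
    scores = defaultdict(float)
    for i in seen:
        neighbours = sorted(sim.get(i, {}).items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for j, w in neighbours:
            if j not in seen:
                scores[j] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:item_num]


# Toy clicks: items 1 and 2 are co-clicked by two users, item 3 by fewer.
toy = pd.DataFrame({"user_id": [1, 1, 2, 2, 2, 3], "item_id": [1, 2, 1, 2, 3, 3]})
toy_sim, toy_user_items = get_sim_item(toy, "user_id", "item_id")
print(recommend(toy_sim, toy_user_items, 1, 10, 10))
```

Each returned entry is an (item_id, score) pair, which matches how the main loop reads j[0] and j[1].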

Official FAQ and evaluation code:
https://tianchi.aliyun.com/forum/postDetail?postId=102089

Zhihu topic:
https://www.zhihu.com/topic/19683418/hot

Phase timeline: [figure: phase timeline; image not preserved]
The code comes from 鱼佬's baseline.
