Data Understanding


*Major*

Article link: https://blog.csdn.net/qq_41375318/article/details/106134618


Task goal: given a user ID, recommend 50 items for that user.

1. Dataset:

All features:

item_id (item ID): the unique identifier of the item
txt_vec (item text feature): the item's text feature, which is a 128-dimensional real-valued vector produced by a pre-trained model
img_vec (item image feature): the item's image feature, which is a 128-dimensional real-valued vector produced by a pre-trained model

user_id (user ID): the unique identifier of the user
time (click timestamp): timestamp when the click event happens, i.e., (unix_timestamp - random_number_1) / random_number_2
user_age_level (user age group): the age group to which the user belongs
user_gender (user gender): the gender of the user, which can be empty
user_city_level (user city tier): the tier to which the user's city belongs

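Background on data collection

The data was collected over more than 10 days, a period that included a sales promotion event. It contains more than 1 million clicks, covering 100,000 items and 30,000 users. The dataset is about 500 MB in total.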
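A quick way to check these figures against the downloaded files is to count clicks, distinct users and distinct items in the click logs. The sketch below is only an illustration: the function name `summarize_clicks` and the tuple layout `(user_id, item_id, time)` are assumptions, and the toy data is hand-made, not competition data.

```python
def summarize_clicks(clicks):
    """Count clicks, distinct users and distinct items.

    clicks: iterable of (user_id, item_id, time) tuples (assumed layout).
    """
    clicks = list(clicks)
    return {
        "n_clicks": len(clicks),
        "n_users": len({user_id for user_id, _, _ in clicks}),
        "n_items": len({item_id for _, item_id, _ in clicks}),
    }


# Hand-made toy data, not taken from the competition files.
toy_clicks = [(1, 10, 0.10), (1, 11, 0.20), (2, 10, 0.15)]
summary = summarize_clicks(toy_clicks)
```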

Dataset files

(1) underexpose_item_feat.csv (item features), with columns: item_id, txt_vec, img_vec

Each item row has 257 values: the first is the item ID, and the remaining 256 are the text and image features (128 each).
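To make the 257-value layout concrete, here is a minimal sketch that splits one row into the item ID and the two vectors. It assumes the row is comma-separated and that the vectors may be wrapped in square brackets; that is an assumption about the raw file, so check a few lines of the real CSV first. The helper name `parse_item_row` is hypothetical.

```python
def parse_item_row(line, dim=128):
    """Split one row into (item_id, txt_vec, img_vec).

    Assumes 1 + 2 * dim comma-separated values. Square brackets around
    the vectors, if present, are stripped before parsing (assumed layout).
    """
    cleaned = line.replace("[", "").replace("]", "")
    fields = [field.strip() for field in cleaned.split(",") if field.strip()]
    if len(fields) != 1 + 2 * dim:
        raise ValueError(f"expected {1 + 2 * dim} values, got {len(fields)}")
    item_id = int(fields[0])
    values = [float(x) for x in fields[1:]]
    # First dim values are the text vector, the next dim the image vector.
    return item_id, values[:dim], values[dim:]


# Toy row with dim=2 instead of 128 to keep the example short.
item_id, txt_vec, img_vec = parse_item_row("42,[0.1,0.2],[0.3,0.4]", dim=2)
```

With the real file, `dim` stays at its default of 128, so each parsed row yields two 128-element lists.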

(2) underexpose_user_feat.csv (user features), with columns: user_id, user_age_level, user_gender, user_city_level
(3) underexpose_train_click-T.zip (which items each user clicked, and when): user_id, item_id, time

(4) underexpose_test_click-T.csv (which items each user clicked, and when): user_id, item_id, time

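(5) underexpose_test_qtime-T.csv (user query times): user_id, query_time (the timestamp of the user's next click)

(6) query_time is in fact the timestamp of the user's next click. The goal is to recommend 50 items as of the moment just before that click; the recommendations are then scored against the organizers' unreleased record of what each user actually clicked.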
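Since the prediction must be made before query_time, a natural first step is to collect each test user's clicks that happened earlier than that user's query_time, in time order. The following is a minimal sketch under assumed data layouts: clicks as `(user_id, item_id, time)` tuples and query times as a `{user_id: query_time}` dict. The function name is hypothetical.

```python
from collections import defaultdict


def history_before_query(clicks, query_times):
    """Return {user_id: [item_id, ...]} ordered by click time.

    Only clicks strictly earlier than the user's query_time are kept,
    and users without a query_time are skipped.
    """
    per_user = defaultdict(list)
    for user_id, item_id, t in clicks:
        if user_id in query_times and t < query_times[user_id]:
            per_user[user_id].append((t, item_id))
    return {
        user_id: [item_id for _, item_id in sorted(events)]
        for user_id, events in per_user.items()
    }


# Hand-made toy data, not taken from the competition files.
toy_clicks = [(1, 10, 0.30), (1, 11, 0.10), (1, 12, 0.90), (2, 10, 0.20)]
toy_query_times = {1: 0.50, 2: 0.25}
history = history_before_query(toy_clicks, toy_query_times)
```

In the toy data, user 1's click on item 12 at time 0.90 is dropped because it falls after that user's query_time of 0.50.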

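2. Submission requirements:

Participants should make predictions for the users in underexpose_test_qtime, i.e., predict the item each user will click next.

Submission file name: underexpose_submit-T.csv
Submission file format: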

user_id	item_id_01	item_id_02	…	item_id_50
1	666	888	…	6

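The recommendations item_id_01, item_id_02, …, item_id_50 are expected to be ordered from highest to lowest predicted probability, i.e., item_id_01 is the item the user is most likely to click.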
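The ordering rule above can be sketched as follows: sort a user's candidate items by score, keep the top 50, and write one row per user. The delimiter and whether a header row is expected are not specified in this post, so both are left to the caller here; check the official rules before submitting. The helper names `top_k_row` and `write_submission` are hypothetical.

```python
import csv
import io


def top_k_row(user_id, scores, k=50):
    """Return [user_id, item_1, ..., item_k], items sorted by descending score.

    scores: {item_id: predicted score} for one user (assumed layout).
    """
    ranked = sorted(scores, key=lambda item: scores[item], reverse=True)[:k]
    if len(ranked) < k:
        raise ValueError(f"need at least {k} candidates, got {len(ranked)}")
    return [user_id] + ranked


def write_submission(rows, handle, delimiter=","):
    """Write one row per user; the delimiter is an assumption to verify."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# Toy example with k=3 instead of 50.
toy_scores = {666: 0.9, 6: 0.1, 888: 0.5}
buffer = io.StringIO()
write_submission([top_k_row(1, toy_scores, k=3)], buffer)
submission_text = buffer.getvalue()
```

For a real submission, the `io.StringIO` buffer would be replaced by a file opened as underexpose_submit-T.csv.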

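3. Evaluation metric:

The competition uses NDCG@50 to measure the quality of the recommendation list.

NDCG stands for Normalized Discounted Cumulative Gain.
NDCG allows relevance scores to be real-valued.
The name NDCG may sound intimidating, but the idea behind it is simple. A recommender system returns a list of items, and we want to measure how good that list is. Each item has a relevance score, usually a non-negative number. This is the gain. Items with no user feedback are usually assigned a gain of 0.
Summing these scores gives the Cumulative Gain. Because we care most about the relevant items near the top of the list, before summing we divide each score by an increasing number (usually a logarithm of the item's position, commonly $\log_2(i+1)$ for position $i$), which is the discount; the result is the DCG. Dividing the DCG by the DCG of the ideal ordering normalizes it to the range [0, 1], which gives the NDCG.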
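The description above can be sketched in a few lines. This is a minimal illustration rather than the organizers' scoring script: it uses the common $\log_2(i+1)$ discount and, as a simplification, takes the ideal ordering from the same gain list. In this task each test case has a single true next item, so the gain list holds at most one 1 and the score reduces to $1/\log_2(\text{rank}+1)$ when the item is in the top 50.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (1-based)."""
    return sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same gains sorted best-first.

    Returns 0.0 when no item in the list has positive gain.
    """
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# The true next item sits at position 3 of the recommended list.
score_hit_at_3 = ndcg_at_k([0.0, 0.0, 1.0], k=50)
# The true next item is not in the list at all.
score_miss = ndcg_at_k([0.0, 0.0, 0.0], k=50)
```

A hit at position 3 gives 1 / log2(4) = 0.5, while a hit at position 1 gives 1.0.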

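The specific metrics are NDCG@50-full and NDCG@50-rare.
NDCG@50-full is computed on the whole test set, i.e., all test cases in underexpose_test_qtime-T.csv.
NDCG@50-rare is computed on half of the test cases in underexpose_test_qtime-T.csv. The selected half consists of the cases whose next item to be predicted was less exposed in the past training data (underexpose_train_click-0.zip, underexpose_train_click-1.zip, …, underexpose_train_click-T.zip) than the next items of the other half.
(I found this definition confusing.)
Phases 0 to 6 of the competition are the development phases; the final ranking is based on the data from phases 7, 8 and 9.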
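One way to read the NDCG@50-rare definition is: count how often each item appears in the training clicks, sort the test cases by the count of their true next item, and keep the less exposed half. The sketch below follows that reading only. The exact exposure measure and tie-breaking used by the organizers are not given here, and the true next items are unreleased, so this is useful only on a locally held-out validation split. The function name `rare_half` is hypothetical.

```python
from collections import Counter


def rare_half(test_cases, train_clicks):
    """Return the half of test_cases whose true next item is least exposed.

    test_cases: list of (user_id, true_next_item) pairs (assumed layout).
    train_clicks: iterable of (user_id, item_id, time) tuples.
    Exposure is taken to be the item's click count in train_clicks;
    items never seen in training count as 0.
    """
    exposure = Counter(item_id for _, item_id, _ in train_clicks)
    ranked = sorted(test_cases, key=lambda case: exposure[case[1]])
    return ranked[: len(ranked) // 2]


# Hand-made toy data: item 10 is clicked 3 times, 12 twice, 11 once, 13 never.
toy_train = [
    (1, 10, 0.1), (2, 10, 0.2), (3, 10, 0.3),
    (1, 11, 0.4), (2, 12, 0.5), (3, 12, 0.6),
]
toy_cases = [(1, 10), (2, 11), (3, 12), (4, 13)]
rare = rare_half(toy_cases, toy_train)
```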
