Jester Dataset

Latest recommended article published on 2022-07-07 16:44:46

不务正业的猿

Categories: Downloads, Datasets. Tags: Jester dataset, Jester, dataset, user reviews

Article link: https://blog.csdn.net/ispeasant/article/details/108833399

Original text:

4.1 Million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users: collected between April 1999 - May 2003.

Freely available for research use when acknowledged with the following reference:

Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.

(Aside: many papers, including ours, report Normalized Mean Absolute Error (NMAE) rates of approx 20%. How good is this compared with random guessing? In the Appendix to our paper, we show that if user ratings are uniformly distributed, random guessing yields NMAE = 33%.)
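The 33% figure in the aside can be checked numerically. The sketch below assumes the usual definition of NMAE (mean absolute error divided by the rating range, here 20) and assumes that both the true rating and the random guess are drawn independently and uniformly from [-10, +10]; neither assumption is spelled out in the text above. Under those assumptions the expected absolute difference of two uniform draws is one third of the range, so the simulated value should land near 1/3. The function name `nmae` is only illustrative.

```python
import random

R_MIN, R_MAX = -10.0, 10.0  # rating scale stated in the dataset description


def nmae(predicted, actual, r_min=R_MIN, r_max=R_MAX):
    """Mean absolute error divided by the rating range (assumed NMAE definition)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
    return mae / (r_max - r_min)


# Monte Carlo check: uniform "true" ratings against independent uniform guesses.
rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000
actual = [rng.uniform(R_MIN, R_MAX) for _ in range(n)]
guesses = [rng.uniform(R_MIN, R_MAX) for _ in range(n)]

random_nmae = nmae(guesses, actual)
print(f"NMAE of random guessing: {random_nmae:.3f}")  # expected to be close to 1/3
```

Against this baseline, a reported NMAE of about 20% corresponds to an average error of roughly 4 points on the 20-point scale, compared with roughly 6.7 points for random guessing.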

As a courtesy, if you use the data, I would appreciate knowing your name, what research group you are in, and the publications that may result.

The Jester Dataset (save to disk, then unzip to obtain Excel files):

jester-data-1.zip : (3.9MB) Data from 24,983 users who have rated 36 or more jokes, a matrix with dimensions 24983 X 101.

jester-data-2.zip : (3.6MB) Data from 23,500 users who have rated 36 or more jokes, a matrix with dimensions 23500 X 101.

jester-data-3.zip : (2.1MB) Data from 24,938 users who have rated between 15 and 35 jokes, a matrix with dimensions 24,938 X 101.

Format:

Three data files contain anonymous ratings data from 73,421 users.

Data files are in .zip format; when unzipped, they are in Excel (.xls) format.

Ratings are real values ranging from -10.00 to +10.00 (the value "99" corresponds to "null" = "not rated").

One row per user

The first column gives the number of jokes rated by that user. The next 100 columns give the ratings for jokes 01 - 100.

The sub-matrix including only columns {5, 7, 8, 13, 15, 16, 17, 18, 19, 20} is dense. Almost all users have rated those jokes (see discussion of "universal queries" in the above paper).
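The format rules above can be applied in a few lines once a file is loaded. The following is a minimal sketch, not the authors' code: it assumes the spreadsheet has already been read into a users × 101 NumPy array (for example with `pandas.read_excel(path, header=None).to_numpy()`, which needs an `.xls`-capable engine installed), and it assumes the numbers in the dense set are 1-based joke numbers rather than raw column indices. The helper names `split_ratings` and `dense_submatrix` are hypothetical, and the demo matrix is synthetic.

```python
import numpy as np

NOT_RATED = 99  # sentinel value the data files use for "not rated"
# Assumed to be 1-based joke numbers (not raw column indices of the 101-column matrix).
DENSE_JOKES = [5, 7, 8, 13, 15, 16, 17, 18, 19, 20]


def split_ratings(raw):
    """Split a (users x 101) array into per-user rating counts and a NaN-masked rating matrix."""
    raw = np.asarray(raw, dtype=float)
    counts = raw[:, 0].astype(int)         # first column: number of jokes rated
    ratings = raw[:, 1:].copy()            # remaining 100 columns: jokes 1..100
    ratings[ratings == NOT_RATED] = np.nan  # replace the sentinel with a real missing value
    return counts, ratings


def dense_submatrix(ratings):
    """Select the columns for the jokes that almost all users rated."""
    cols = [joke - 1 for joke in DENSE_JOKES]  # joke j sits at 0-based column j - 1
    return ratings[:, cols]


# Tiny synthetic example (not real Jester data): 2 users, 100 jokes.
demo = np.full((2, 101), NOT_RATED, dtype=float)
demo[0, 0] = 2       # user 0 rated two jokes
demo[0, 5] = -3.25   # joke 5
demo[0, 7] = 9.5     # joke 7
demo[1, 0] = 1       # user 1 rated one joke
demo[1, 20] = 0.0    # joke 20

counts, ratings = split_ratings(demo)
dense = dense_submatrix(ratings)
print(counts, dense.shape)
```

Masking 99 before computing any statistic matters because the sentinel lies far outside the -10.00 to +10.00 scale and would otherwise distort means and errors; NaN-aware functions such as `np.nanmean` can then be used on `ratings`.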
