Yahoo Releases a Large-Scale Machine Learning Dataset for Researchers

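According to foreign media reports, Yahoo recently released a new "Yahoo News Recommendation" dataset, described as the largest machine learning dataset ever made public. Yahoo says the dataset is aimed mainly at the academic research community, so that researchers no longer have to worry about lacking access to large-scale data.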

The released dataset reportedly contains 110 billion events and totals 13.5 TB uncompressed.

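Researchers will find data such as anonymized user interactions with news items in the dataset. These records were collected from 20 million users during the early months of last year.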

The Yahoo News Feed dataset contains user interactions with several Yahoo properties, such as Yahoo Movies, Yahoo News, and Yahoo Finance.

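In addition, Yahoo included some demographic data in the dataset, such as gender, age, and geographic location. Yahoo said in a statement: "Our goal is to promote independent research in large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research."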





Author: Anonymous
Source: 51CTO
AG's News Topic Classification Dataset
Version 3, Updated 09/09/2015

ORIGIN

AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc), information retrieval (ranking, search, etc), xml, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

DESCRIPTION

The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

The file classes.txt contains a list of classes corresponding to each label.

The files train.csv and test.csv contain the training and testing samples, respectively, as comma-separated values. There are 3 columns in them, corresponding to class index (1 to 4), title and description. The title and description are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
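The file format described above can be read with Python's standard csv module, which handles the doubled double-quote escaping by default. The following is a minimal sketch, not the dataset authors' own loader: the function name parse_ag_news and the two sample rows are hypothetical, made up purely to illustrate the escaping rules.

```python
import csv
import io

# Hypothetical sample rows written in the format described above.
# These are made-up examples, not rows from the actual dataset.
SAMPLE = (
    '"3","Example ""quoted"" title","First line\\nSecond line"\n'
    '"1","Another title","Plain description"\n'
)


def parse_ag_news(text):
    """Parse (class index, title, description) rows from CSV text."""
    rows = []
    # csv.reader turns a doubled quote ("") back into a single quote (").
    for label, title, description in csv.reader(io.StringIO(text)):
        rows.append({
            "label": int(label),  # class index, 1 to 4
            # Undo the backslash-n escaping used for new lines.
            "title": title.replace("\\n", "\n"),
            "description": description.replace("\\n", "\n"),
        })
    return rows


rows = parse_ag_news(SAMPLE)
```

To read a real file, one would typically pass an open file object to csv.reader instead of the in-memory string, for example one opened with open("train.csv", newline="", encoding="utf-8"). The encoding is an assumption, since the description does not state it.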