Advanced Topics in Data Mining Spring 2011

Books (PDFs):

Datasets:

SNAP network datasets

Wikipedia

Ratings and purchases (movies, music, etc.)

Yahoo! Webscope Catalog of datasets

  • Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
  • Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.

Co-authorship and Citation Networks

Internet (Autonomous Systems) topology

Who trusts whom data at Trustlet

Stanford only datasets

  • Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
  • Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
  • Memetracker2. 1 million blog posts, news media articles, tweets, and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181 GB of compressed data.
  • The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
  • TheFind: product information data (price, category, related products) extracted from 239 different websites.
  • Twitter: About 500 million tweets over a 7-month period. Data description.
  • Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
  • Wikipedia webserver logs: Hourly Wikipedia page access statistics.
  • Yahoo! Messenger: Instant Messenger graph with some additional information.

Data can be accessed here. Email Jure if you do not have a password.

Other Datasets

  • Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
  • The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.