Books (PDFs):
- Mining Massive Datasets by A. Rajaraman, J. Ullman.
- Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg.
- Data-Intensive Text Processing with MapReduce by J. Lin, C. Dyer.
Datasets:
SNAP network datasets
Wikipedia
- Complete edit history of Wikipedia articles: Which user edited what article at what time.
- Wikipedia page to page link data
- DBpedia: A richly labeled graph of Wikipedia entities.
- Freebase: An entity graph of people, places and things.
Ratings and purchases (movies, music, etc.)
- Amazon product co-purchasing network: 600k products and all their metadata.
- KDD Cup 2011: 300M ratings from 1M users on 600k songs, albums and artists.
- IMDB database: Everything about every movie ever made.
- Movielens: User movie rating data.
Yahoo! Webscope Catalog of datasets
- Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
- Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.
Co-authorship and Citation Networks
- DBLP: Digital Bibliography & Library Project. More info.
- Arxiv citation and co-authorship networks: Data is from KDD Cup 2003.
Internet (Autonomous Systems) topology
Who trusts whom data at Trustlet
- Trust network datasets from Trustlet.org
Stanford-only datasets
- Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
- Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
- Memetracker2. 1 million blog posts, news media articles, tweets and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181GB of compressed data.
- The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
- TheFind: product information data (price, category, related products) extracted from 239 different websites.
- Twitter: About 500 million tweets over a 7-month period. Data description.
- Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
- Wikipedia webserver logs: Hourly Wikipedia page access statistics.
- Yahoo! Messenger: Instant Messenger graph with some additional information.
Data can be accessed here. Email Jure if you do not have a password.
Other Datasets
- Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
- The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.