7 Important Data Science Papers

Reposted December 6, 2013, 11:21:13


It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.

Google Search

  • PageRank – This is the paper that explains the algorithm behind Google search.
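As a rough illustration of the idea behind PageRank, here is a minimal power-iteration sketch in plain Python. It is not the code from the paper: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the four-page toy graph are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by repeated rank redistribution.

    links: dict mapping each page to the list of pages it links to
    (a hypothetical toy representation of the web graph).
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank
                # evenly over all pages so the total stays at 1.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: C is linked to by A, B and D; nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links (C) ends up with the highest score, and the page nobody links to (D) ends up with the lowest.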


  • MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
  • Google File System – HDFS, one of the components of Hadoop, is an open-source version of the distributed file system explained in this paper.
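To give a feel for the MapReduce programming model, the sketch below simulates its three stages (map, shuffle, reduce) for a word count inside a single Python process. It does not use the Hadoop API and does no distribution at all. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count`, and the sample documents, are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data science"])
print(counts)
```

In a real cluster the map and reduce functions would run in parallel on many machines, and the framework would perform the shuffle over the network. The appeal of the model is that the programmer mainly writes the two small functions.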


These are two of the papers that helped start the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.

Machine Learning

Bonus Paper

  • Random Forests – This paper introduces one of the most popular machine learning techniques. It is heavily used in Kaggle competitions, including by winning entries.

Are there any other papers you feel should be on the list?

