Machine Learning and Data Mining: A Roundup of Software, Websites, and Course Resources

Latest recommended article published on 2024-05-04 02:20:34

数据娃掘

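The following article is reposted from: http://blog.csdn.net/zouxy09/article/details/8102252 . Thanks to the original author!

Studying Key Machine Learning Topics

zouxy09@qq.com

http://blog.csdn.net/zouxy09

While studying machine learning, I came across the Machine Learning column on JerryLead's cnblogs blog, which explains a number of machine learning algorithms and concepts clearly and thoroughly. I am bookmarking it here for further study.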

http://www.cnblogs.com/jerrylead/tag/Machine%20Learning/

Partial Least Squares Regression

Canonical Correlation Analysis

Reinforcement Learning and Control

Factor Analysis

Linear Discriminant Analysis (Part 2)

Linear Discriminant Analysis (Part 1)

Extended Notes on ICA

Independent Component Analysis

Principal Components Analysis: The Minimum Squared Error Interpretation

Principal Components Analysis: The Maximum Variance Interpretation

Online Learning

The EM Algorithm

Mixtures of Gaussians and the EM Algorithm

K-means Clustering

Regularization and Model Selection

Support Vector Machines (Part 5): The SMO Algorithm

Support Vector Machines (Part 4)

Support Vector Machines (Part 3): Kernel Functions

Support Vector Machines, SVM (Part 2)

Support Vector Machines, SVM (Part 1)

Discriminative Models, Generative Models, and Naive Bayes

Understanding Linear Regression, Logistic Regression, and General Regression

===========================================================================================

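Original source: http://suanfazu.com/discussion/27/18%E5%90%8D%E6%A0%A1%E6%95%B0%E6%8D%AE%E6%8C%96%E6%8E%98%E5%8F%8A%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E8%AF%BE%E7%A8%8B%E8%B5%84%E6%BA%90%E6%B1%87%E6%80%BB , with thanks.

A roundup of data mining, data analysis, artificial intelligence, and machine learning course resources from 18 leading universities in North America and Germany.

Quora Q&A

Online Courses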

Concepts in Computing with Data, Berkeley
Practical Machine Learning, Berkeley
Artificial Intelligence, Berkeley
Visualization, Berkeley
Data Mining and Analytics in Intelligent Business Services, Berkeley
Data Science and Analytics: Thought Leaders, Berkeley
Machine Learning, Stanford
Paradigms for Computing with Data, Stanford
Mining Massive Data Sets, Stanford
Data Visualization, Stanford
Algorithms for Massive Data Set Analysis, Stanford
Research Topics in Interactive Data Analysis, Stanford
Data Mining, Stanford
Machine Learning, CMU
Statistical Computing, CMU
Machine Learning with Large Datasets, CMU
Machine Learning, MIT
Data Mining, MIT
Statistical Learning Theory and Applications, MIT
Data Literacy, MIT
Introduction to Data Mining, UIUC
Learning from Data, Caltech
Introduction to Statistics, Harvard
Data-Intensive Information Processing Applications, University of Maryland
Dealing with Massive Data, Columbia
Data-Driven Modeling, Columbia
Introduction to Data Mining and Analysis, Georgia Tech
Computational Data Analysis: Foundations of Machine Learning and Da..., Georgia Tech
Applied Statistical Computing, Iowa State
Data Visualization, Rice
Data Warehousing and Data Mining, NYU
Data Mining in Engineering, Toronto
Machine Learning and Data Mining, UC Irvine
Knowledge Discovery from Data, Cal Poly
Large Scale Learning, University of Chicago
Data Science: Large-scale Advanced Data Analysis, University of Florida
Strategies for Statistical Data Analysis, Universität Leipzig

Talks and Conferences

Data Bootcamp, Strata 2011
Machine Learning Summer School, Purdue 2011
Looking at Data

Books

Online Videos

====================================================================================

I. Open-Source C++ Machine Learning Libraries

1) mlpack is a C++ machine learning library.

2) PLearn is a C++ library aimed at research and development in the field of statistical machine learning algorithms. Its distinguishing feature is that it lets you express complex non-linear functions to be optimized directly and straightforwardly in C++.

3) Waffles: C++ machine learning.
4) Torch7 provides a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation.

5）SHARK is a modular C++ library for the design and optimization of adaptive systems. It provides methods for linear and nonlinear optimization, in particular evolutionary and gradient-based algorithms, kernel-based learning algorithms and neural networks, and various other machine learning techniques. SHARK serves as a toolbox to support real world applications as well as research in different domains of computational intelligence and machine learning. The sources are compatible with the following platforms: Windows, Solaris, MacOS X, and Linux.

6) Dlib-ml is an open source library, targeted at both engineers and research scientists, which aims to provide a similarly rich environment for developing machine learning software in the C++ language.

7) Eblearn is an object-oriented C++ library that implements various machine learning models, including energy-based learning and gradient-based learning for machines composed of multiple heterogeneous modules. In particular, the library provides a complete set of tools for building, training, and running convolutional networks.

8) Machine Learning Open Source Software, Journal of Machine Learning Research: http://jmlr.csail.mit.edu/mloss/

9) Search Google for: c++ site:jmlr.csail.mit.edu filetype:pdf , Machine Learning Toolkit

10) SIGMA: Large-Scale and Parallel Machine-Learning Tool Kit

11) http://sourceforge.net/directory/science-

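1. The machine learning open source software site (collects academic and commercial open-source machine learning software in all kinds of programming languages)

http://mloss.org

2. A machine learning resource directory I stumbled upon (also very comprehensive; between them, 1 and 2 cover essentially all the classic open-source ML software)

http://www.dmoz.org/Computers/Artificial_Intelligence/Machine_Learning/Software/

3. libsvm (the top package in the support vector machine world, needing little introduction; the work of Professor Lin at National Taiwan University)

http://www.csie.ntu.edu.tw/~cjlin/libsvm/

4. WEKA (a Java-based open-source package, the most comprehensive and easiest to use for machine learning algorithms)

http://www.cs.waikato.ac.nz/ml/weka/

5. scikit (my personal favorite Python-based machine learning package: the code is very well written, the official documentation is very complete with examples for everything, the algorithm coverage is broad, and development is active; strongly recommended)

https://pypi.python.org/pypi/scikit-learn/

6. OpenCV (the top open-source computer vision library, with a very bright future; a must for anyone doing image processing or pattern recognition, since running experiments only in MATLAB all day leaves you out of touch with industry; it does have a learning curve, though)

http://opencv.willowgarage.com/wiki/

7. Orange (machine learning software built on C++ with a Python interface; attractive UI, easy to call, lets you learn C++ and Python at the same time, and includes visualization features)

http://orange.biolab.si/

8. Mallet (a Java machine learning library, mainly for natural language processing; its strengths are Markov models and random fields, and it complements WEKA)

http://mallet.cs.umass.edu/

9. NLTK (an open-source natural language processing library for Python; very easy to use yet powerful, and accompanied by several classic O'Reilly tutorials)

http://nltk.org/

10. lucene (Java-based; the full family includes nutch, solr, hadoop, mahout, and more. Required learning for anyone working on information retrieval or search engines, and for anyone learning Java)

http://lucene.apache.org/

Additional:

1. mlpy (a Python module for machine learning; supports SVM, kNN, k-means, and more)

http://mlpy.sourceforge.net/

2. mahout (an Apache Foundation project; its main advantage is natural integration with hadoop for parallel execution, and it is very robust)

http://mahout.apache.org/

3. milk (a Python machine learning toolkit focused on supervised learning, including SVM, kNN, and decision trees)

http://pypi.python.org/pypi/milk/

4. Octave (recommended in Andrew Ng's course; similar to MATLAB)

http://www.gnu.org/software/octave/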

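The following is reposted from: http://cvchina.net/thread-667-1-1.html, with thanks.

The great majority of the tools below are open source, under licenses such as the GPL or Apache; please read each tool's license statement carefully before use.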

I. Information Retrieval
1. Lemur/Indri
The Lemur Toolkit for Language Modeling and Information Retrieval
http://www.lemurproject.org/
Indri:
Lemur's latest search engine

2. Lucene/Nutch
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Lucene is a top-level Apache open-source project, licensed under the Apache License 2.0 and written entirely in Java, with ports to Perl, C/C++, .NET, and other languages.
http://lucene.apache.org/
http://www.nutch.org/

3. WGet
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
http://www.gnu.org/software/wget/wget.html

II. Natural Language Processing
1. EGYPT: A Statistical Machine Translation Toolkit
http://www.clsp.jhu.edu/ws99/projects/mt/
Includes four tools, among them GIZA.

2. GIZA++ (Statistical Machine Translation)
http://www.fjoch.com/GIZA++.html
GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT) which was developed by the Statistical Machine Translation team during the summer workshop in 1999 at the Center for Language and Speech Processing at Johns-Hopkins University (CLSP/JHU). GIZA++ includes a lot of additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.
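Franz Josef Och has worked successively at Aachen University in Germany, ISI (the Information Sciences Institute at the University of Southern California), and Google. GIZA++ now has a Windows port and provides good support for IBM Models 1-5.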

3. PHARAOH (Statistical Machine Translation)
http://www.isi.edu/licensed-sw/pharaoh/
a beam search decoder for phrase-based statistical machine translation models

4. OpenNLP:
http://opennlp.sourceforge.net/
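Includes more than 20 tools, such as Maxent.

btw: these SMT tools all seem to like Egypt-themed names, such as GIZA, PHARAOH, and Cairo. Och developed GIZA++ while at ISI, and PHARAOH was developed by Philipp Koehn, also from ISI; the connections really are tangled.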

5. MINIPAR by Dekang Lin (Univ. of Alberta, Canada)
MINIPAR is a broad-coverage parser for the English language. An evaluation with the SUSANNE corpus shows that MINIPAR achieves about 88% precision and 80% recall with respect to dependency relationships. MINIPAR is very efficient, on a Pentium II 300 with 128MB memory, it parses about 300 words per second.
The binary can be downloaded free of charge after filling in a form.
http://www.cs.ualberta.ca/~lindek/minipar.htm

6. WordNet
http://wordnet.princeton.edu/
WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets.
WordNet was developed by the Cognitive Science Laboratory at Princeton University under the direction of Professor George A. Miller (Principal Investigator).
The latest version of WordNet is 2.1 (for Windows and Unix-like OSes), available as binaries, source, and documentation.
The online version of WordNet is at http://wordnet.princeton.edu/perl/webwn

7. HowNet
http://www.keenage.com/
HowNet is an on-line common-sense knowledge base unveiling inter-conceptual relations and inter-attribute relations of concepts as connoting in lexicons of the Chinese and their English equivalents.
Developed by Zhendong Dong and Qiang Dong of CAS; it is a resource similar to WordNet.

8. Statistical Language Modeling Toolkit
http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html
The CMU-Cambridge Statistical Language Modeling toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.

9. SRI Language Modeling Toolkit
www.speech.sri.com/projects/srilm/
SRILM is a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation. It has been under development in the SRI Speech Technology and Research Laboratory since 1995.

10. ReWrite Decoder
http://www.isi.edu/licensed-sw/rewrite-decoder/
The ISI ReWrite Decoder Release 1.0.0a by Daniel Marcu and Ulrich Germann. It is a program that translates from one natural language into another using statistical machine translation.

11. GATE (General Architecture for Text Engineering)
http://gate.ac.uk/
A Java Library for Text Engineering

III. Machine Learning
1. YASMET: Yet Another Small MaxEnt Toolkit (Statistical Machine Learning)
http://www.fjoch.com/YASMET.html
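Written by Franz Josef Och. In addition, the OpenNLP project includes a Java MaxEnt tool that estimates parameters with GIS; it was ported to C++ by Zhang Le (张乐) of Northeastern University (东北大学), who is currently studying in the UK.

2. LibSVM
Developed by Chih-Jen Lin of National Taiwan University (NTU); available in C++, Java, Perl, C#, and other language versions.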
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC ), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM ). It supports multi-class classification.

3. SVM Light
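Developed by Thorsten Joachims of Cornell while he was at the University of Dortmund; it has become the best-known SVM package after LibSVM. Open source, written in C, and usable for ranking problems.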
http://svmlight.joachims.org/

4. CLUTO
http://www-users.cs.umn.edu/~karypis/cluto/
a software package for clustering low- and high-dimensional datasets
This package is distributed only as an executable or a library; the source code is not available for download.

5. CRF++
http://chasen.org/~taku/software/CRF++/
Yet Another CRF toolkit for segmenting/labelling sequential data
CRFs (Conditional Random Fields) developed out of HMMs and MEMMs and are widely used in IE, IR, and NLP.

6. SVM Struct
http://www.cs.cornell.edu/People/tj/svm_light/svm_struct.html
Like SVM Light, developed by Thorsten Joachims of Cornell.
SVMstruct is a Support Vector Machine (SVM) algorithm for predicting multivariate outputs. It performs supervised learning by approximating a mapping
h: X --> Y
using labeled training examples (x1,y1), ..., (xn,yn).
Unlike regular SVMs, however, which consider only univariate predictions like in classification and regression, SVMstruct can predict complex objects y like trees, sequences, or sets. Examples of problems with complex outputs are natural language parsing, sequence alignment in protein homology detection, and markov models for part-of-speech tagging.
SVMstruct can be thought of as an API for implementing different kinds of complex prediction algorithms. Currently, we have implemented the following learning tasks:
SVMmulticlass: Multi-class classification. Learns to predict one of k mutually exclusive classes. This is probably the simplest possible instance of SVMstruct and serves as a tutorial example of how to use the programming interface.
SVMcfg: Learns a weighted context free grammar from examples. Training examples (e.g. for natural language parsing) specify the sentence along with the correct parse tree. The goal is to predict the parse tree of new sentences.
SVMalign: Learning to align sequences. Given examples of how sequence pairs align, the goal is to learn the substitution matrix as well as the insertion and deletion costs of operations so that one can predict alignments of new sequences.
SVMhmm: Learns a Markov model from examples. Training examples (e.g. for part-of-speech tagging) specify the sequence of words along with the correct assignment of tags (i.e. states). The goal is to predict the tag sequences for new sentences.

IV. Misc:
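1. Notepad++: an open-source editor that supports the keywords of dozens of languages, including C#, Perl, and CSS; its features rival recent versions of UltraEdit and Visual Studio .NET
http://notepad-plus.sourceforge.net

2. WinMerge: compares text content to find the differences between two versions of a program
winmerge.sourceforge.net/

3. OpenPerlIDE: an open-source Perl editor with built-in compilation and line-by-line debugging
open-perl-ide.sourceforge.net/
ps: the best editor I have come across is still VS .NET: a +/- marker in front of each function supports expand/collapse; it supports block copy/cut/paste; ctrl+c/ctrl+x/ctrl+v can operate on a whole line at a time; ctrl+k+c/ctrl+k+u comments/uncomments multiple lines; and more... Visual Studio .NET is really kool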

4. Berkeley DB
http://www.sleepycat.com/
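Berkeley DB is not a relational database; it is described as an embedded database: in client/server terms, its client and server share a single address space. Because databases originally grew out of file systems, it resembles a dictionary-style database of key-value pairs. Its database files are persisted to disk, so it is not limited by memory size. BDB has a variant, Berkeley DB XML, which is an XML database (presumably storing data as XML documents?). BDB has been embedded in products from companies including Microsoft, Google, HP, Ford, and Motorola.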
Berkeley DB (libdb) is a programmatic toolkit that provides embedded database support for both traditional and client/server applications. It includes b+tree, queue, extended linear hashing, fixed, and variable-length record access methods, transactions, locking, logging, shared memory caching, database recovery, and replication for highly available systems. DB supports C, C++, Java, PHP, and Perl APIs.
It turns out that at a basic level Berkeley DB is just a very high performance, reliable way of persisting dictionary style data structures - anything where a piece of data can be stored and looked up using a unique key. The key and the value can each be up to 4 gigabytes in length and can consist of anything that can be crammed in to a string of bytes, so what you do with it is completely up to you. The only operations available are "store this value under this key", "check if this key exists" and "retrieve the value for this key" so conceptually it's pretty simple - the complicated stuff all happens under the hood.
case study:
Ask Jeeves uses Berkeley DB to provide an easy-to-use tool for searching the Internet.
Microsoft uses Berkeley DB for the Groove collaboration software
AOL uses Berkeley DB for search tool meta-data and other services.
Hitachi uses Berkeley DB in its directory services server product.
Ford uses Berkeley DB to authenticate partners who access Ford's Web applications.
Hewlett Packard uses Berkeley DB in several products, including storage, security and wireless software.
Google uses Berkeley DB High Availability for Google Accounts.
Motorola uses Berkeley DB to track mobile units in its wireless radio network products.
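
The three dictionary-style operations described for Berkeley DB above ("store this value under this key", "check if this key exists", "retrieve the value for this key") can be tried out with a minimal sketch. Note the assumptions: this uses Python's standard-library dbm.dumb module as a stand-in for an embedded key-value store, not Berkeley DB itself, and the function name, file name, and keys are hypothetical.

```python
import dbm.dumb
import os
import tempfile


def demo_key_value_store(directory):
    """Exercise the three dictionary-style operations on an on-disk store."""
    # Hypothetical file name; dbm.dumb is a stand-in, not Berkeley DB.
    path = os.path.join(directory, "example_store")

    # "c" opens the store for reading and writing, creating it if needed.
    with dbm.dumb.open(path, "c") as db:
        db[b"user:42"] = b"alice"      # store this value under this key
        exists = b"user:42" in db      # check if this key exists
        missing = b"user:99" in db     # a key that was never stored
        value = db[b"user:42"]         # retrieve the value for this key

    # Reopen read-only to confirm that the data was persisted to disk.
    with dbm.dumb.open(path, "r") as db:
        persisted = db[b"user:42"]

    return exists, missing, value, persisted


with tempfile.TemporaryDirectory() as tmp:
    result = demo_key_value_store(tmp)
    print(result)
```

Keys and values are plain byte strings, which matches the description above of storing "anything that can be crammed in to a string of bytes".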

11. R
http://www.r-project.org/
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
The R statistical software, like MATLAB, is used for scientific computing.

Reposted from: http://kapoc.blogdriver.com/kapoc/1268927.html

==========================================================================

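The following is reposted from: http://www.cvchina.info/2011/05/01/website-machinelearning/#comment-1868; the original source is said to be demonstrate's blog.

This is a collection of commonly used websites related to machine learning, grouped by topic.

Gaussian Processes

http://www.gaussianprocess.org includes related books (among them Carl Edward Rasmussen's book), related software, and a categorized list of papers. The site is maintained by Carl himself, a student of Hinton, who is probably one of the first people to have brought GPs into machine learning.

Nonparametric Bayesian Methods

http://www.cs.berkeley.edu/~jordan/npb.html As the URL makes obvious, this one is maintained by Jordan. It mainly covers how Dirichlet processes and related stochastic processes are used for modeling in machine learning, and how to perform approximate inference. It is mostly a list of papers.

Probabilistic Graphical Model

http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html is Kevin Murphy's introduction to Bayesian belief networks, with the basic concepts and links to related literature and software. He is a rare UCB graduate who was not Jordan's student (his advisor was Stuart Russell).
http://www.cs.berkeley.edu/~jordan/graphical.html is the Jordan group's collection of papers on this topic.
http://www.inference.phy.cam.ac.uk/hmw26/crf/ collects papers and software on Conditional Random Fields, maintained by Hanna Wallach.

Compressed Sensing

http://www-dsp.rice.edu/cs is a categorized list of papers, software links, and so on, maintained by Rice University. The tutorial written by Emmanuel Candès, a student of David Donoho, is recommended.

Tensor

http://csmr.ca.sandia.gov/~tgkolda/pubs/index.html Some fairly mathematical papers on tensors.

Deep Belief Network

http://www.cs.toronto.edu/~hinton/csc2515/deeprefs.html is the DBN reading list for the graduate machine learning course taught by Geoffrey Hinton.

Kernel Methods

http://www.cs.berkeley.edu/~jordan/kernels.html is a list of papers on kernel methods maintained by Jordan.

Markov Logic

http://ai.cs.washington.edu/pubs lists the papers of the UW AI group; many of them concern Markov logic, because Pedro Domingos belongs to this group.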

=========================================================================================

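A Guide to Learning Resources for Machine Learning and Artificial Intelligence

This article is reposted from: http://mindhacks.cn/2008/09/11/machine-learning-and-ai-resources/ , with many thanks!

I often recommend books on the TopLanguage discussion group, and I often ask the experts there to dig up relevant material. Artificial intelligence, machine learning, natural language processing, knowledge discovery (data mining in particular), and information retrieval are without doubt the most fun branches of CS (and they are closely interrelated). Here I gather some recent learning resources on machine learning and artificial intelligence in one place:

First, two excellent Wikipedia entries. I am a heavy Wikipedia user: when learning a subject I often find that I start from Wikipedia, go through a number of Google searches along the way, and end up at one or a few books.

The first is "History of Artificial Intelligence", about which I wrote on the discussion group:

The article I came across today is the best I have read on Wikipedia so far. Titled "History of Artificial Intelligence", it follows the timeline of AI's development, weaving in countless stories of remarkable people, with twists and turns on a grand scale; truly a case of "fact is stranger than fiction". Artificial intelligence began with philosophical speculation and then went through a period without help from psychology (especially cognitive neuroscience), in which researchers explored the field only by generalizing from and introspecting on the outward manifestations of human thought, together with mathematical tools. The most exciting episode is the automated theorem prover written by Herbert Simon (father of decision theory, Nobel laureate, and a polymath across fields), which proved twenty-odd theorems from Russell's Principia Mathematica, one of them more elegantly than in the original book. Simon's program used heuristic search, because a proof in an axiomatic system can be reduced to a tree search from premises to conclusion (and, because of combinatorial explosion, heuristic pruning is required). Simon later wrote GPS (General Problem Solver), which reportedly could solve some well-formalized problems such as the Tower of Hanoi. In the end, though, Simon's research touched only a very small aspect of human thought: formal logic, or more narrowly deductive reasoning (that is, excluding inductive reasoning and transductive reasoning, commonly called analogical thinking). Many others, such as common sense, vision, and especially the most complex ones, language and consciousness, remain unsolved mysteries. Another interesting point is that some people hold that AI must be grounded in a physical body: a body that can sense the physical rules of the world is itself a powerful source of information, and from that source humans can keep distilling so-called common-sense knowledge as they go (this is the Embodied Mind theory). By contrast, building a common-sense knowledge base directly by hand, as some have tried, is rather naive: the way humans acquire knowledge from nature through their perceptual systems is a dynamic, self-updating system, whereas a hand-built common-sense base is no different from the old expert-system approach. Of course, the above covers only a small part that I personally found interesting or novel. Everyone will find different parts interesting; for example, the article describes the rise and fall of neural network theory in considerable detail. So I strongly recommend reading it yourself, and don't forget to follow its links to other pages.

By the way, Xu You (徐宥) plans to find time to translate this entry. It is quite a long one; those who struggle with English can wait for the translation :)

The second is "Artificial Intelligence". And of course there are entries on machine learning and so on. Starting from these entries you can find many very useful and reliable references for deeper study.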

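Next, some books

Books:

1. "Programming Collective Intelligence": a good introductory book from recent years. Building interest is the most important step, and starting with a massive tome can easily scare people off :P

2. Peter Norvig's "Artificial Intelligence: A Modern Approach" (2nd ed.), the undisputed classic of the field.

3. "The Elements of Statistical Learning": fairly mathematical; good as a reference.

4. "Foundations of Statistical Natural Language Processing": an acknowledged classic in natural language processing.

5. "Data Mining: Concepts and Techniques": written by a scientist of Chinese descent; explains deep material quite accessibly.

6. "Managing Gigabytes": a good book on information retrieval.

7. "Information Theory, Inference, and Learning Algorithms": more of a reference book; fairly deep.

Related mathematical foundations (reference books, not meant to be read cover to cover):

1. Linear algebra: no specific references listed here; there are many.

2. Matrix mathematics: "Matrix Analysis" by Roger Horn, the undisputed classic of matrix analysis.

3. Probability and statistics: "An Introduction to Probability Theory and Its Applications" by William Feller. Also an outstanding book, but too mathematical in flavor and not well suited to people doing machine learning. So Du Lei from the discussion group recommended "All of Statistics", saying:

For machine learning, statistics is just as important. I recommend All of Statistics, a very concise textbook from CMU that emphasizes concepts, simplifies computation, and trims the concepts and statistical content unrelated to machine learning; it is excellent quick-start material.

4. Optimization: "Nonlinear Programming, 2nd" is a reference on nonlinear programming, and "Convex Optimization" is a reference on convex optimization. Further books can be found via the Wikipedia entry on optimization. A deep understanding of the technical details of machine learning methods (SVM, for example) often requires optimization as groundwork.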

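Wang Ning (王宁) recommended several books:

"Machine Learning", Tom Mitchell, 1997.
An old book by a leading figure. By today's standards it does not go very deep, and many chapters only touch on their topics, but it suits newcomers well (though not so new that they know nothing of algorithms or probability). The decision tree part, for example, is excellent, and since there has been no especially large progress there in recent years, it is not out of date. The book also serves as a broad survey of the decades of machine learning work before 1997, and its bibliography is extremely valuable. Translated and reprint editions exist in China; I do not know whether they are out of print.

"Modern Information Retrieval", Ricardo Baeza-Yates et al., 1999.
An old book by a leading figure. Apparently the first book to cover IR comprehensively. Unfortunately IR has advanced rapidly in recent years, so the book is somewhat dated, though still fine to browse as a reference. Incidentally, Ricardo now heads Yahoo Research for Europe and Latin America.

"Pattern Classification" (2nd ed.), Richard O. Duda, Peter E. Hart, David G. Stork
A hefty volume, also from around 2001; there is a reprint edition, in color. I have not finished it, but for anyone who wants to study ML and IR in depth, the first three chapters (introduction, Bayesian learning, linear classifiers) are required reading.

There are other classics I have only glanced at and am not qualified to judge. There are also two small volumes, collections of papers in nature, that cover quite a few frontier topics and details, such as how to compress an index. Unfortunately I have forgotten their titles and they are buried at the bottom of a box, unlikely to see daylight before my next move.

(Ah, one comes to mind: "Mining the Web – Discovering Knowledge from Hypertext Data")

One well-known book deserves a mention: "Data Mining: Practical Machine Learning Tools and Techniques", by the authors of Weka. Unfortunately the content is mediocre: the theory part is too thin, and the practical part is far removed from real practice. There are already plenty of introductory DM books, so this one can probably be skipped. To learn Weka, reading the documentation is enough. A second edition is out; I have not read it and cannot comment.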

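On information retrieval, Du Lei made another recommendation:

For information retrieval, I now suggest Stanford's "Introduction to Information Retrieval". It has only just been formally published, so the content is naturally up to date. In addition, Croft, the leading authority in information retrieval, is writing a textbook that should appear soon; it is said to be a very practical book.

For those interested in information retrieval, I strongly recommend Dr. ChengXiang Zhai's summer school course at Peking University; the full slides and reading material are here: http://net.pku.edu.cn/~course/cs410/schedule.html

maximzhao recommended a machine learning book:

One more book: Bishop, "Pattern Recognition and Machine Learning". There is no reprint edition, but it can be found online. A classic among classics. Pattern Classification and this book are the two must-reads. "Pattern Recognition and Machine Learning" is very recent (2007), explains deep material accessibly, and is hard to put down.

Finally, on artificial intelligence (decision making and judgment in particular), two more interesting books:

One is "Simple Heuristics That Make Us Smart"

The other is "Bounded Rationality: The Adaptive Toolbox"

Unlike the statistical machine learning methods used in computer science, these two books focus more on the cognitive strategies humans actually use. Here is the introduction I wrote on the discussion group: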

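Both were written collectively by the ABC Research Group in Germany (an interdisciplinary group of computer scientists, cognitive scientists, neuroscientists, economists, mathematicians, statisticians, and others), and both attracted wide attention in the field, especially the first; the second extends the model of human rationality proposed by Herbert Simon (father of decision science, Nobel laureate). Together they can be said to have put the question of what real human intelligence is on the table. The core idea is that our brains simply cannot perform large amounts of statistical computation or use fancy mathematics to explain and predict the world; instead, we face an uncertain world with simple and robust heuristics (for example, two heuristics from the first book that later became very well known: the recognition heuristic and Take the Best). That said, the two books do not reject statistical methods: with large amounts of data the advantage of statistics shows, whereas with little data statistical methods perform very poorly. Simple human heuristics exploit the regularities of the ecological environment and are both computationally cheap and robust.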
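
To make Take the Best concrete, here is a minimal sketch based on the common description of the heuristic: compare two options on cues ordered from most to least valid, and let the first cue that tells them apart decide. The cue names, the two cities, and the cue ordering are hypothetical illustrations, not data from the book.

```python
def take_the_best(option_a, option_b, cues_by_validity):
    """Return the option favoured by the first discriminating cue, or None.

    Cues are binary (1 = present, 0 = absent) and are examined in order of
    validity; everything after the first discriminating cue is ignored.
    """
    for cue in cues_by_validity:
        a, b = option_a[cue], option_b[cue]
        if a != b:
            return option_a if a > b else option_b
    return None  # no cue discriminates: the decision maker has to guess


# Hypothetical task: which of two cities has the larger population?
cues_by_validity = ["is_capital", "has_major_airport", "has_university"]
city_x = {"name": "X", "is_capital": 0, "has_major_airport": 1, "has_university": 1}
city_y = {"name": "Y", "is_capital": 0, "has_major_airport": 0, "has_university": 1}

choice = take_the_best(city_x, city_y, cues_by_validity)
print(choice["name"])
```

The decision uses a single cue and no weighting or summation, which is where the low computational cost mentioned above comes from.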

An introduction to the second book:

1. Who is Herbert Simon

2. What is Bounded Rationality

3. What this book is about:

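I have always found human decision making and judgment a fascinating subject. Put simply, this book can be seen as a more comprehensive and more theoretical version of "Decision Making and Judgment" (《决策与判断》). It presents, systematically and theoretically, the various heuristics used in human decision making and judgment along with their pros and cons (why they are fast and robust approximations to optimization methods when information is insufficient, and why in some situations they lead to bad outcomes). For example, anyone who has studied machine learning knows that naive Bayes often performs no worse than a Bayesian network while also being faster; likewise, the higher the degree of a polynomial interpolant, the more easily it overfits, whereas piecewise spline interpolation based on low-order polynomials has proven to be a very robust scheme.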
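
The point about high-degree polynomial interpolation can be checked numerically. The sketch below is an illustration under stated assumptions, none of them from the book: the test function is the classic Runge function 1/(1 + 25x^2) on [-1, 1], the nodes are equally spaced, and plain piecewise linear interpolation stands in as the simplest low-order piecewise scheme.

```python
def runge(x):
    """Test function that is well known to defeat equispaced interpolation."""
    return 1.0 / (1.0 + 25.0 * x * x)


def lagrange_interpolate(xs, ys, x):
    """Evaluate the single polynomial through all nodes (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def piecewise_linear(xs, ys, x):
    """Evaluate the degree-1 piecewise interpolant through the same nodes."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the interpolation range")


def max_error(interpolant, xs, ys, samples=1001):
    """Largest absolute error over a fine grid on [-1, 1]."""
    worst = 0.0
    for k in range(samples):
        x = -1.0 + 2.0 * k / (samples - 1)
        worst = max(worst, abs(interpolant(xs, ys, x) - runge(x)))
    return worst


def compare(num_nodes):
    """Return (global polynomial error, piecewise linear error)."""
    xs = [-1.0 + 2.0 * i / (num_nodes - 1) for i in range(num_nodes)]
    ys = [runge(x) for x in xs]
    return (max_error(lagrange_interpolate, xs, ys),
            max_error(piecewise_linear, xs, ys))


errors = {n: compare(n) for n in (5, 11, 21)}
for n, (poly_err, linear_err) in errors.items():
    print(f"{n:2d} nodes: polynomial {poly_err:.3f}, piecewise linear {linear_err:.3f}")
```

With more nodes the single high-degree polynomial oscillates ever more wildly near the ends of the interval, while the piecewise scheme keeps improving.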

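Here is one example from the book that I find very interesting: two teams are sent to design a robot that can catch a baseball thrown across a field. The first team does a detailed mathematical analysis and builds a fairly complex approximate parabola model (since factors such as air resistance must be considered, the path is not a strict parabola) to compute where the ball will land so that the robot can catch it. This approach is clearly expensive, and the computation itself takes time; as is well known, signals in biological neural networks travel at no more than about a hundred meters per second, so computational complexity is a precious resource for living things. The approach is feasible, but not good enough. The second team interviews real players, listens to how they describe actually catching a ball, and then builds the following robot: it does nothing during the first half of the ball's flight and starts running only once the ball is fairly close, keeping the angle of gaze between its eyes and the ball constant while it runs. Holding that angle constant ensures that the robot's path will intersect the ball's trajectory; throughout, the robot makes only a very rough estimate of the trajectory. Think about how you catch a ball: do you keep your eyes on the ball the whole time and adjust your running direction according to the angle of your line of sight? That is in fact what humans do, and that is the power of heuristics.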
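
The second team's strategy can be sketched as a small simulation. Everything here is a simplifying assumption rather than the book's model: motion is along one line with no air resistance, the fielder waits until the ball starts to descend, and it then runs at a fixed hypothetical top speed, backing away when the ball appears above the remembered gaze angle and running in when it appears below. The fielder never computes a landing point.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def simulate_catch(fielder_x, ball_vx=8.0, ball_vy=20.0, max_speed=6.0, dt=0.01):
    """Return (fielder position, ball position) just before the ball lands.

    All parameter values are hypothetical. The ball is thrown from x = 0
    toward positive x; fielder_x is the fielder's starting position.
    """
    gaze_angle = None  # angle to hold, fixed once the ball starts to descend
    ball_x = 0.0
    step = 0
    while True:
        step += 1
        t = step * dt
        x = ball_vx * t
        y = ball_vy * t - 0.5 * G * t * t
        if y <= 0.0:
            break  # the ball has reached the ground
        ball_x = x
        # Elevation angle of the ball as seen from the fielder's position.
        angle = math.atan2(y, fielder_x - x)
        descending = ball_vy - G * t < 0.0
        if gaze_angle is None:
            if descending:
                gaze_angle = angle  # start tracking from this moment on
        elif angle > gaze_angle:
            fielder_x += max_speed * dt  # ball looks too high: back away
        else:
            fielder_x -= max_speed * dt  # ball looks too low: run in
    return fielder_x, ball_x


for start in (26.0, 30.0):
    fielder, ball = simulate_catch(start)
    print(f"start {start:.0f} m: fielder at {fielder:.2f} m, ball at {ball:.2f} m")
```

Because the ball's height goes to zero as it lands, the only position that keeps the angle constant at that moment is the landing point itself, so a fielder who manages to hold the angle ends up where the ball comes down.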

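Compared with "Decision Making and Judgment" (《决策与判断》), which leans toward psychology and popular science, this book is more theoretical, cites many classic references, and overlaps with both artificial intelligence and machine learning; it also contains a fair amount of mathematics. The book consists of a dozen or so chapters, each written by different authors in a paper-like style: rigorous, with no padding, similar to "Psychology of Problem Solving". Well suited to geeks.

Also, those who cannot get through the technical details of the theory are advised to read books like "Decision Making and Judgment" (《决策与判断》), as well as popular primers such as "Don't Be a Normal Fool" (《别做正常的傻瓜》); they are of great benefit for making decisions in your own life. Human decision making and judgment rely on many heuristics, and unfortunately many of them were formed by adapting to the social environment of hundreds of thousands of years ago and are ill suited to modern society. Understanding these flaws and blind spots in our thinking therefore helps a great deal in becoming a good decision maker, and it is a very interesting field in its own right.

(End)

P.S. If you have good material, please leave a message on the discussion group.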
