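I have long wanted to write an analysis of the algorithms in "Top 10 Algorithms in Data Mining", partly as a review for myself. I never had the time, but I have finally set some aside to write something down.
First, some background. The selection took place at the 2006 IEEE International Conference on Data Mining (ICDM), and the results were later written up as a paper. The PDF version briefly introduces the origin and contribution of each of the 10 selected algorithms, and also describes in detail how the top 10 were chosen. For convenience, I summarize the process here: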
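The process had three steps:
A. Nomination
For ICDM 2006, winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award were invited to nominate candidates for the top 10. Each could nominate up to 10 algorithms they considered most important, giving a justification and a representative paper for each. Nominated algorithms had to be widely studied and cited in the field.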
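B. Verification
The citations of each nominated algorithm were checked on Google Scholar, and nominations with fewer than 50 citations were removed. This left 18 algorithms.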
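C. Voting
The following were invited to vote:
(a). Program committee members of KDD06, ICDM06, and SDM06
(b). Winners of the ACM KDD Innovation Award and the IEEE ICDM Research Contributions Award
The top 10 algorithms were then selected by ranking the votes.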
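To make the last two steps concrete, here is a minimal Python sketch of a citation filter followed by a vote tally. It is only an illustration of the idea, not the organizers' actual procedure: the function names, the placeholder algorithm names, the citation counts, and the ballots are all made up, and the rule of counting each name once per ballot is my own assumption.

```python
from collections import Counter

MIN_CITATIONS = 50  # threshold used in the verification step
TOP_N = 10          # size of the final list


def filter_nominations(citations, min_citations=MIN_CITATIONS):
    """Keep nominations whose citation count reaches the threshold."""
    return {name for name, count in citations.items() if count >= min_citations}


def rank_by_votes(ballots, candidates, top_n=TOP_N):
    """Tally ballots (each a list of names) and return the top_n candidates."""
    tally = Counter()
    for ballot in ballots:
        # Assumption: a name counts once per ballot, and names that did not
        # survive the verification step are ignored.
        valid = [name for name in dict.fromkeys(ballot) if name in candidates]
        tally.update(valid)
    return tally.most_common(top_n)


# Hypothetical data, invented purely for illustration.
citations = {"algo_a": 1200, "algo_b": 49, "algo_c": 50, "algo_d": 300}
ballots = [
    ["algo_a", "algo_c"],
    ["algo_a", "algo_b"],            # algo_b was filtered out, so it is ignored
    ["algo_d", "algo_a", "algo_a"],  # the duplicate algo_a counts once
    ["algo_d"],
]

candidates = filter_nominations(citations)
ranking = rank_by_votes(ballots, candidates, top_n=2)
print(ranking)
```

With the real data, `citations` would hold the Google Scholar counts of the nominated papers and `ballots` the votes cast by the invited committee members and award winners.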
For reference, here are the 18 algorithms that remained after the verification step:
A. Classification
- C4.5 (1993) C4.5: Programs for Machine Learning
- CART (1984) Classification and Regression Trees
- K Nearest Neighbors (KNN) (1996) Discriminant Adaptive Nearest Neighbor Classification
- Naïve Bayes (2001) Idiot's Bayes: Not So Stupid After All?
B. Statistical Learning
- SVM (1995) The Nature of Statistical Learning Theory
- EM (2000) Finite Mixture Models
C. Association Analysis
- Apriori (1994) Fast Algorithms for Mining Association Rules
- FP-Tree (2000) Mining Frequent Patterns without Candidate Generation
D. Link Mining
- PageRank (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine
- HITS (1998) Authoritative Sources in a Hyperlinked Environment
E. Clustering
- K-Means (1967) Some Methods for Classification and Analysis of Multivariate Observations
- BIRCH (1996) BIRCH: An Efficient Data Clustering Method for Very Large Databases
F. Bagging and Boosting
- AdaBoost (1997) A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting
G. Sequential Patterns
- GSP (1996) Mining Sequential Patterns: Generalizations and Performance Improvements
- PrefixSpan (2001) PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
H. Integrated Mining
- CBA (1998) Integrating Classification and Association Rule Mining
I. Rough Sets
- Finding reduct (1992) Rough Sets: Theoretical Aspects of Reasoning about Data
J. Graph Mining
- gSpan (2002) gSpan: Graph-Based Substructure Pattern Mining
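The top 10 algorithms produced by the final vote were:
- C4.5
- K-Means
- SVM
- Apriori
- EM
- PageRank
- AdaBoost
- KNN
- Naïve Bayes
- CART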
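A guide to these 10 algorithms has also been published as a book under the same name, "The Top Ten Algorithms in Data Mining".
The top 10 from the vote held at the ICDM 2006 conference itself matched this result, which suggests that the list is widely accepted in the data mining community. In later posts I will introduce the algorithms in ranked order.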