This paper considers the problems of learning concepts from large-scale data sets. The way we take is completely classification algorithm independent. Firstly, the original problem is decomposed into a series of smaller two-class sub-problems which are easier to be solved. Secondly we present two principles, namely the shrink and expansion principles, to restore the global solution from the intermediate results learned from the sub-problems. In the theoretical analysis, this procedure of integration is described as a statistical inference of a posterior probability and is degraded as the min-max principles in the special case considering 0-1 outputs. We also propose a revised approach which reduces the computational complexity of the training and testing stage to a linear level . Finally, experiments on both the synthetic and text-classification data are demonstrated. The results indicate that our methods are effective to large scale problems.
Publications
Learning Concepts from Large-Scale Data Sets by Pairwise Coupling with Probabilistic Outputs F. Zhou and B. Lu International Joint Conference on Neural Networks (IJCNN), 2007 [Paper 1M] [Slides 1M]
Research on Ensemble Learning F. Zhou and B. Lu Master Thesis, 2007 [Paper 2MB (in Chinese)] [Slides 3MB (in English)]
Ensemble learning techniques have been demonstrated to be an effective way to reduce the error of a base learner across a wide variety of tasks. The basic idea is to vote together the predictions of a set of classifiers that have been trained slightly differently for the same task. This work proposed a novel ensemble learning algorithm called Triskel, which learns an ensemble of classifiers that are biased to have high precision for one particular class. Triskel outperforms boosting on a variety of real-world tasks, in terms of both accuracy and training time. It represents a middle ground between covering algorithms and ensemble techniques such as boosting. See recent publications for more details.
A Probabilistic Framework for Ensemble Learning AbstractThis paper considers the problems of learning concepts from large-scale data sets. The way we take