Paper title: Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples
Paper strengths: several of its concepts were fairly new at the time.
Paper weaknesses (from today's perspective): the dataset is small and the methods are relatively simple.
It can serve as a baseline algorithm to compare against in our work.
- Why semi-supervised?
Unlabeled data can help improve the performance of the learner.
- Why co-training?
Different views have their own pros and cons.
- How to create different views?
Natural (e.g., an image and its text explanation)
Feature selection (100 -> 30, 25, 28)
What if a view is insufficient? It does not matter; it depends on your objective: theoretical support (ICML), or performance only (ICDM)?
- Why ensemble?
4.1 Why bagging? Simply construct a number of classifiers and vote.
4.2 Why boosting? Theoretical support. AdaBoost.
- Why CV data sampling?
Slightly change the data distribution. More importantly, remove some unwanted samples (outliers).
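The bagging idea in 4.1 (construct a number of classifiers and vote) can be sketched with a toy one-feature threshold learner. The stump learner and the data set below are illustrative assumptions, not the base learner used in the paper:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """One bagging round: random sampling with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical one-feature threshold classifier: predict the
    majority label on each side of the sample mean."""
    mean = sum(x for x, _ in sample) / len(sample)
    left = Counter(y for x, y in sample if x <= mean)
    right = Counter(y for x, y in sample if x > mean)
    left_lbl = left.most_common(1)[0][0]
    right_lbl = right.most_common(1)[0][0] if right else left_lbl
    return lambda x: left_lbl if x <= mean else right_lbl

def bagging_predict(classifiers, x):
    """Simple unweighted vote over the ensemble."""
    votes = Counter(c(x) for c in classifiers)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
ensemble = [train_stump(bootstrap_sample(data, rng)) for _ in range(11)]
```

Each classifier sees a slightly different bootstrap of the data, which is exactly what makes the vote useful.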
Basic ideas:
- From one data to multiple data.
These data sets serve as input to the classifiers.
- From one classifier to multiple classifiers.
2.1 Use different samples.
2.2 Use different views of the same samples.
- How to integrate (ensemble) different classifiers?
3.1 During training. Select and label unlabeled samples for each other.
3.2 After prediction. Simple voting or weighted voting.
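Strategy 3.2 can be sketched as one small function; uniform weights give simple voting, non-uniform weights give weighted voting. The labels and weight values below are made-up illustrations:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine classifier outputs after prediction.
    `predictions` holds one label per classifier; uniform weights
    reduce this to simple majority voting."""
    if weights is None:
        weights = [1.0] * len(predictions)
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        score[label] += w
    return max(score, key=score.get)

# Simple voting: two of three classifiers say "benign".
print(weighted_vote(["benign", "benign", "malignant"]))
# Weighted voting: one highly trusted classifier can outvote the rest.
print(weighted_vote(["benign", "benign", "malignant"], [0.2, 0.2, 0.9]))
```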
Sampling strategies:
- Random sampling with replacement. Repeated enough times, it yields good results.
- Cross validation. Partition the data into 10 parts and each time use 9 of them.
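Both sampling strategies can be sketched in a few lines; the round-robin fold assignment below is one possible partition scheme, chosen for brevity:

```python
import random

def bootstrap(data, rng):
    """Random sampling with replacement: same size as the original,
    but items may repeat (and some are left out)."""
    return [rng.choice(data) for _ in data]

def cv_folds(data, k=10):
    """Cross-validation partition: split the data into k parts and,
    for each fold, use the other k - 1 parts."""
    parts = [data[i::k] for i in range(k)]
    for i in range(k):
        used = [x for j, p in enumerate(parts) if j != i for x in p]
        yield used, parts[i]

rng = random.Random(42)
data = list(range(20))
sample = bootstrap(data, rng)          # size 20, possibly with repeats
folds = list(cv_folds(data, k=10))     # 10 splits of 18 vs. 2 items
```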
Confidence of an unlabeled instance:
- If the classifier is an SVM, the distance to the classification hyperplane.
- If the classifier is a decision tree, the purity of the leaf node.
- If the classifier is kNN, the purity of neighbors.
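The kNN case can be sketched on one-dimensional toy data (the data set and function name below are illustrative); for an SVM one would instead use the distance to the hyperplane, and for a tree the purity of the leaf node:

```python
from collections import Counter

def knn_confidence(x, labeled, k=5):
    """Confidence from neighbor purity: the fraction of the k nearest
    labeled points that agree with the majority label."""
    neighbors = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
    votes = Counter(y for _, y in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

labeled = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
# A point deep inside class 0 gets a pure neighborhood:
print(knn_confidence(0.15, labeled, k=3))   # high confidence
# A point between the classes gets a mixed, less pure neighborhood:
print(knn_confidence(0.55, labeled, k=3))   # lower confidence
```

High-purity instances are the ones a co-training partner would select and pseudo-label for the other classifier.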