Paper download
bib:
@INPROCEEDINGS{NigamGhani2000CoEM,
title = {Analyzing the Effectiveness and Applicability of Co-Training},
author = {Kamal Nigam and Rayid Ghani},
booktitle = {CIKM},
year = {2000},
pages = {86--93}
}
1. Abstract
Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks.
The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets.
We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not.
When a natural split does not exist, co-training algorithms that manufacture a feature split may out-perform algorithms not using a split.
These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.
2. Algorithm Description
Probably because this paper is quite old (2000), it took me a long time to find the description of the algorithm. There is very little of it; if you don't read carefully you simply won't find it, and I only located it by searching for keywords.
A quick aside on two terms: Incremental and Iterative. Incremental means that new data (unlabeled examples carrying pseudo-labels) is added to the training set in each round. Iterative means that the total amount of training data stays fixed across iterations (the count never changes). Note that Co-EM counts as an Iterative algorithm: its initialization already assigns pseudo-labels to all of the unlabeled data, so no new examples are added in later rounds.
This suggests that incremental algorithms may outperform iterative algorithms, so long as they are not led astray by a few mislabeled documents in the early rounds of using the unlabeled data.
In short, Co-EM simply merges Co-training and the EM algorithm. To control for the effect of the feature split on the experimental results, the paper presents two versions of the algorithm: one that uses a feature split and one that does not.
The first, co-EM, is an iterative algorithm that uses the feature split. It proceeds by initializing the A-feature-set naive Bayes classifier from the labeled data only. Then, A probabilistically labels all the unlabeled data. The B-feature-set classifier then trains using the labeled data and the unlabeled data with A’s labels. B then relabels the data for use by A, and this process iterates until the classifiers converge. A and B predictions are combined together as co-training embedded classifiers are. In practice, co-EM converges as quickly as EM does, and experimentally we run co-EM for 10 iterations.
Co-EM (a code sketch follows this list):
- Split the features into two subsets and train two classifiers A and B, one on each feature subset, using only the labeled data.
- A assigns pseudo-labels to all of the unlabeled data; B then trains on the labeled data plus the unlabeled data carrying A's pseudo-labels.
- B assigns pseudo-labels to all of the unlabeled data; A then trains on the labeled data plus the unlabeled data carrying B's pseudo-labels.
- Repeat steps 2 and 3 until convergence.
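Below is a minimal sketch of this loop, assuming scikit-learn's MultinomialNB as the embedded naive Bayes classifier and two hypothetical dense count matrices view_a and view_b for the feature subsets. It uses hard pseudo-labels for brevity, whereas the paper's co-EM propagates probabilistic labels.

```python
# A co-EM sketch (assumed names; not the authors' original code).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(view_a, view_b, y_labeled, labeled_idx, unlabeled_idx, n_iter=10):
    # Classifier A is initialized from the labeled data only, on feature set A.
    clf_a = MultinomialNB().fit(view_a[labeled_idx], y_labeled)
    clf_b = None
    for _ in range(n_iter):
        # A pseudo-labels all of the unlabeled data.
        pseudo_a = clf_a.predict(view_a[unlabeled_idx])
        # B trains on the labeled data plus the unlabeled data with A's labels.
        clf_b = MultinomialNB().fit(
            np.vstack([view_b[labeled_idx], view_b[unlabeled_idx]]),
            np.concatenate([y_labeled, pseudo_a]))
        # B relabels the unlabeled data for A, and the process repeats.
        pseudo_b = clf_b.predict(view_b[unlabeled_idx])
        clf_a = MultinomialNB().fit(
            np.vstack([view_a[labeled_idx], view_a[unlabeled_idx]]),
            np.concatenate([y_labeled, pseudo_b]))
    # At prediction time, A's and B's class probabilities are combined,
    # as with co-training's embedded classifiers (e.g. averaged).
    return clf_a, clf_b
```

The default of 10 rounds matches the quoted passage, which says co-EM is run for 10 iterations in the experiments.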
Self-training is an incremental algorithm that does not use the split of the features. Initially, self-training builds a single naive Bayes classifier using the labeled training data and all the features. Then it labels the unlabeled data and converts the most confidently predicted document of each class into a labeled training example. This iterates until all the unlabeled documents are given labels.
Since self-training does not split the feature set, a single classifier assigns pseudo-labels to itself. Rather than pseudo-labeling all of the unlabeled data at once, this version picks only the most confidently predicted unlabeled examples in each round (see the sketch below).
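A minimal sketch of that incremental loop, again assuming scikit-learn's MultinomialNB and hypothetical names (X is a single dense count matrix over all features). It also shows the contrast with the iterative co-EM loop above: here the labeled pool grows each round.

```python
# A self-training sketch (assumed names; not the authors' original code).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X, y_labeled, labeled_idx, unlabeled_idx):
    labeled_idx, unlabeled_idx = list(labeled_idx), list(unlabeled_idx)
    y = list(y_labeled)
    while unlabeled_idx:
        # A single naive Bayes classifier over all features of the labeled pool.
        clf = MultinomialNB().fit(X[labeled_idx], y)
        proba = clf.predict_proba(X[unlabeled_idx])
        # Convert the most confidently predicted document of each class
        # into a labeled training example.
        chosen = set()
        for c, label in enumerate(clf.classes_):
            best = int(np.argmax(proba[:, c]))
            if best not in chosen:
                chosen.add(best)
                labeled_idx.append(unlabeled_idx[best])
                y.append(label)
        # Remove the newly labeled documents from the unlabeled pool.
        unlabeled_idx = [u for i, u in enumerate(unlabeled_idx) if i not in chosen]
    # Final classifier, trained once all unlabeled documents have been labeled.
    return MultinomialNB().fit(X[labeled_idx], y)
```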
| Method | Uses feature split | No feature split |
|---|---|---|
| Incremental | co-training | self-training |
| Iterative | co-EM | EM |
3. Summary
The paper itself is not hard to understand. I should look into the details of the EM algorithm later.