Paper download
bib:
@INPROCEEDINGS{NigamGhani2000CoEM,
title = {Analyzing the Effectiveness and Applicability of Co-Training},
author = {Kamal Nigam and Rayid Ghani},
booktitle = {CIKM},
year = {2000},
pages = {86--93}
}
1. Abstract
Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks.
The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets.
We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not.
When a natural split does not exist, co-training algorithms that manufacture a feature split may out-perform algorithms not using a split.
These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.
2. Algorithm Description
Probably because this paper is quite old (2000), it took me a long time to find the description of the algorithm. There is very little of it; if you don't read carefully you simply won't find it, and I only located it by searching for keywords.
A quick aside on two terms: Incremental and Iterative. Incremental means that new data (unlabeled examples carrying pseudo-labels) is added to the training set in each round. Iterative means that the total amount of training data stays fixed across iterations (the count never changes). Note that Co-EM counts as an Iterative algorithm: its initialization already assigns pseudo-labels to all of the unlabeled data, so no new examples are added in later rounds.
This suggests that incremental algorithms may outperform iterative algorithms, so long as they are not led astray by a few mislabeled documents in the early rounds of using the unlabeled data.
In short, Co-EM simply merges Co-training and the EM algorithm. To control for the effect of the feature split on the experimental results, the paper presents two versions of the algorithm: one that uses a feature split and one that does not.
The first, co-EM, is an iterative algorithm that uses the feature split. It proceeds by initializing the A-feature-set naive Bayes classifier from the labeled data only. Then, A probabilistically labels all the unlabeled data. The B-feature-set classifier then trains using the labeled data and the unlabeled data with A’s labels. B then relabels the data for use by A, and this process iterates until the classifiers converge. A and B predictions are combined together as co-training embedded classifiers are. In practice, co-EM converges as quickly as EM does, and experimentally we run co-EM for 10 iterations.
Co-EM (a code sketch follows this list):
- Split the features into two subsets and train two classifiers A and B, one on each feature subset, using only the labeled data.
- A assigns pseudo-labels to all of the unlabeled data; B then trains on the labeled data plus the unlabeled data carrying A's pseudo-labels.
- B assigns pseudo-labels to all of the unlabeled data; A then trains on the labeled data plus the unlabeled data carrying B's pseudo-labels.
- Repeat steps 2 and 3 until convergence.
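Below is a minimal sketch of this loop, assuming scikit-learn's MultinomialNB as the embedded naive Bayes classifier and two hypothetical dense count matrices view_a and view_b for the feature subsets. It uses hard pseudo-labels for brevity, whereas the paper's co-EM propagates probabilistic labels.

```python
# A co-EM sketch (assumed names; not the authors' original code).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_em(view_a, view_b, y_labeled, labeled_idx, unlabeled_idx, n_iter=10):
    # Classifier A is initialized from the labeled data only, on feature set A.
    clf_a = MultinomialNB().fit(view_a[labeled_idx], y_labeled)
    clf_b = None
    for _ in range(n_iter):
        # A pseudo-labels all of the unlabeled data.
        pseudo_a = clf_a.predict(view_a[unlabeled_idx])
        # B trains on the labeled data plus the unlabeled data with A's labels.
        clf_b = MultinomialNB().fit(
            np.vstack([view_b[labeled_idx], view_b[unlabeled_idx]]),
            np.concatenate([y_labeled, pseudo_a]))
        # B relabels the unlabeled data for A, and the process repeats.
        pseudo_b = clf_b.predict(view_b[unlabeled_idx])
        clf_a = MultinomialNB().fit(
            np.vstack([view_a[labeled_idx], view_a[unlabeled_idx]]),
            np.concatenate([y_labeled, pseudo_b]))
    # At prediction time, A's and B's class probabilities are combined,
    # as with co-training's embedded classifiers (e.g. averaged).
    return clf_a, clf_b
```

The default of 10 rounds matches the quoted passage, which says co-EM is run for 10 iterations in the experiments.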
Self-training is an incremental algorithm that does not use the split of the features. Initially, self-training builds a single naive Bayes classifier using the labeled training data and all the features. Then it labels the unlabeled data and converts the most confidently predicted document of each class into a labeled training example. This iterates until all the unlabeled documents are given labels.
Since self-training does not split the feature set, a single classifier assigns pseudo-labels to itself. Rather than pseudo-labeling all of the unlabeled data at once, this version picks only the most confidently predicted unlabeled examples in each round (see the sketch below).
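A minimal sketch of that incremental loop, again assuming scikit-learn's MultinomialNB and hypothetical names (X is a single dense count matrix over all features). It also shows the contrast with the iterative co-EM loop above: here the labeled pool grows each round.

```python
# A self-training sketch (assumed names; not the authors' original code).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X, y_labeled, labeled_idx, unlabeled_idx):
    labeled_idx, unlabeled_idx = list(labeled_idx), list(unlabeled_idx)
    y = list(y_labeled)
    while unlabeled_idx:
        # A single naive Bayes classifier over all features of the labeled pool.
        clf = MultinomialNB().fit(X[labeled_idx], y)
        proba = clf.predict_proba(X[unlabeled_idx])
        # Convert the most confidently predicted document of each class
        # into a labeled training example.
        chosen = set()
        for c, label in enumerate(clf.classes_):
            best = int(np.argmax(proba[:, c]))
            if best not in chosen:
                chosen.add(best)
                labeled_idx.append(unlabeled_idx[best])
                y.append(label)
        # Remove the newly labeled documents from the unlabeled pool.
        unlabeled_idx = [u for i, u in enumerate(unlabeled_idx) if i not in chosen]
    # Final classifier, trained once all unlabeled documents have been labeled.
    return MultinomialNB().fit(X[labeled_idx], y)
```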
| Method | Uses feature split | No feature split |
|---|---|---|
| Incremental | co-training | self-training |
| Iterative | co-EM | EM |
3. Summary
The paper itself is not hard to understand. I should look into the details of the EM algorithm later.