Co-Training vs Self-Training

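This article introduces two important semi-supervised learning algorithms: self-training and co-training. Self-training trains an initial model on the labeled data, uses that model to predict labels for the unlabeled data, and iterates until all the data is labeled. Co-training is suited to multi-view learning: it trains multiple classifiers on different feature sets, and the classifiers complement one another to improve prediction accuracy.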


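In practical classification work, it is common to have only a small amount of labeled data while most of the data is unlabeled. Co-training and self-training are two algorithms designed for exactly this situation.

The two algorithms are described in turn below:

1. Self-training: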

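First build a classifier from the existing labeled data, then use it to estimate labels for the unlabeled data.

Next, train a new classifier on the original labeled data together with the newly estimated "pseudo-labeled" data.

Repeat the steps above until all of the unlabeled data has been classified.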
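The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original author's implementation. A tiny nearest-centroid classifier stands in for "the classifier". The names `fit_centroids`, `predict_proba`, `self_train` and the `per_round` parameter are all made up for this sketch. Each round adds only the most confident predictions, which is a common variant of the procedure.

```python
import math

def fit_centroids(X, y):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_proba(centroids, x):
    """Class scores: softmax over negative distances to the centroids."""
    weights = {c: math.exp(-math.dist(x, m)) for c, m in centroids.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def self_train(X_lab, y_lab, X_unlab, per_round=1):
    """Self-training loop (assumption: add the `per_round` most
    confident pseudo-labels per iteration, then retrain)."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    while pool:
        model = fit_centroids(X_lab, y_lab)          # train on current labels
        scored = []
        for x in pool:                               # estimate unlabeled data
            proba = predict_proba(model, x)
            label = max(proba, key=proba.get)
            scored.append((proba[label], label, x))
        scored.sort(key=lambda t: t[0], reverse=True)
        for _, label, x in scored[:per_round]:       # keep the most confident
            X_lab.append(x)
            y_lab.append(label)
            pool.remove(x)
    # Final classifier trained on labeled plus pseudo-labeled data.
    return fit_centroids(X_lab, y_lab), X_lab, y_lab
```

Calling `self_train(X_lab, y_lab, X_unlab)` returns the final model along with the enlarged training set. Note that a wrong pseudo-label is never revisited in this sketch, so early mistakes can propagate into later rounds.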

2. Co-training:

Co-training is used in special cases of the more general multi-view learning.

That is, it applies when the training data can be described from several different views. For example, in a web-page classification model the features come from two sources: the URL features of the websites, denoted A, and the text features of the websites, denoted B.
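As a rough illustration of what "two views" might look like in code, the sketch below turns one web page into an A view (URL tokens) and a B view (text tokens). The page layout (a dict with `url` and `text` keys), the tokenization rule, and the function names are assumptions made for this example. The original post does not specify how the features are extracted.

```python
import re
from collections import Counter

def url_view(url):
    """View A: bag of lowercase alphanumeric tokens taken from the URL."""
    return Counter(re.findall(r"[a-z0-9]+", url.lower()))

def text_view(text):
    """View B: bag of lowercase word tokens taken from the page text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def split_views(page):
    """Return (A, B) for a page given as {'url': ..., 'text': ...}."""
    return url_view(page["url"]), text_view(page["text"])
```

Each view on its own should ideally carry enough signal to classify the page, which is what lets the two classifiers teach each other.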

The co-training algorithm is:

• Inputs: An initial collection of labeled documents and one of unlabeled documents.

• Loop while there exist documents without class labels:

  • Build classifier A using the A portion of each document.

  • Build classifier B using the B portion of each document.

  • For each class C, pick the unlabeled document about which classifier A is most confident that its class label is C and add it to the collection of labeled documents.

  • For each class C, pick the unlabeled document about which classifier B is most confident that its class label is C and add it to the collection of labeled documents.

• Output: Two classifiers, A and B, that predict class labels for new documents. These predictions can be combined by multiplying together and then renormalizing their class probability scores.
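The steps listed above can be sketched in Python as follows. This is a minimal illustration under assumptions, not a reference implementation. Each document is represented as a pair `(view_a, view_b)` of numeric feature tuples, and a small nearest-centroid model stands in for classifiers A and B. The names `fit_centroids`, `predict_proba`, `fit_view` and `co_train` are hypothetical.

```python
import math

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict_proba(centroids, x):
    """Class scores: softmax over negative distances to the centroids."""
    weights = {c: math.exp(-math.dist(x, m)) for c, m in centroids.items()}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}

def fit_view(labeled, view):
    """Build a classifier from one view (0 = A portion, 1 = B portion)."""
    return fit_centroids([doc[view] for doc, _ in labeled],
                         [label for _, label in labeled])

def co_train(labeled, unlabeled):
    """labeled: list of ((view_a, view_b), label) pairs;
    unlabeled: list of (view_a, view_b) documents."""
    labeled, pool = list(labeled), list(unlabeled)
    while pool:                                   # documents without labels remain
        models = [fit_view(labeled, 0), fit_view(labeled, 1)]
        for view, model in enumerate(models):     # classifier A, then classifier B
            for c in model:                       # for each class C
                if not pool:
                    break
                # Document this classifier is most confident belongs to class c.
                best = max(pool,
                           key=lambda doc: predict_proba(model, doc[view])[c])
                pool.remove(best)
                labeled.append((best, c))         # shared labeled collection
    return fit_view(labeled, 0), fit_view(labeled, 1), labeled
```

Because the sketch follows the listed steps literally, every classifier picks one document for every class in each round, even when no remaining document fits that class well. A practical variant would probably add a confidence threshold before accepting a pseudo-label.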

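In other words, two classifiers are built, one from feature set A and one from feature set B. Each classifier iterates in a self-training fashion, pseudo-labeling in every round the unlabeled documents it is most confident about. Per the algorithm above, these documents go into the shared labeled collection, so each classifier is also retrained on documents labeled by the other. When the iteration ends, the two classifiers are used together for prediction.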
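Using the two classifiers together follows the rule in the algorithm's output step: multiply the class probability scores and renormalize. A small sketch is shown below. The function name `combine` and the dict-based score format are assumptions for illustration only.

```python
def combine(proba_a, proba_b):
    """Combine two classifiers' class probabilities by multiplying
    them class by class and renormalizing so the result sums to 1."""
    product = {c: proba_a[c] * proba_b[c] for c in proba_a}
    total = sum(product.values())
    if total == 0:
        raise ValueError("classifiers assign zero joint probability to every class")
    return {c: p / total for c, p in product.items()}

# Classifier A is mildly in favor of "pos", classifier B strongly so.
combined = combine({"pos": 0.6, "neg": 0.4}, {"pos": 0.9, "neg": 0.1})
# combined["pos"] = 0.54 / 0.58, roughly 0.931
```

The product rule rewards agreement between the views: a class needs a reasonable score from both classifiers to end up with a high combined score.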

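The main idea is that for data whose features split naturally into groups, a separate classifier is built from each group, and classifiers built from different features can complement one another.

To summarize:

The most intuitive difference between co-training and self-training is that during learning the former has two classifiers, while the latter has only one.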

Reposted from: https://www.cnblogs.com/xiaotu1617234/p/8273972.html