Today I read a paper on semi-supervised learning: Using Weighted Nearest Neighbor to Benefit from Unlabeled Data
Here is a summary of the paper:
一. Introduction
1.Why semi-supervised learning is necessary: "where often the unlabeled examples greatly outnumber the labeled examples". Labeled examples are usually far fewer than unlabeled ones, so we can extract useful information from the unlabeled data to improve the classifier's accuracy.
2.The general flow of semi-supervised learning: "The examples from the unlabeled set are 'pre-labeled' by an initial classifier that is built using the limited available training data. By choosing appropriate weights for this pre-labeled data, the nearest neighbor classifier consistently improves on the original classifier." First, train a classifier on the labeled examples; then use this initial classifier to predict labels for the unlabeled examples; finally, choose appropriate weights for the pre-labeled data and use a nearest neighbor classifier to improve on the initial classifier's accuracy.
3."the key to semi-supervised learning is the prior assumption of consistency, that allows for exploiting the geometric structure of the data distribution."
The key to semi-supervised learning is the prior assumption of consistency, which makes it possible to exploit the geometric structure of the data distribution.
4."Close data points should belong to the same class and decision boundaries should lie in regions of low data density; this is also called the 'cluster assumption'."
Points that lie close together should belong to the same class, and decision boundaries should fall in regions of low data density; this is the "cluster assumption".
5.The two-stage approach proposed in the paper: "In this paper, we introduce a very simple two-stage approach that uses the available unlabeled data to improve on the predictions made when learning only from the labeled examples. In a first stage, it uses an off-the-shelf classifier to build a model based on the small amount of available training data, and in the second stage it uses that model [...]" In the first stage, an off-the-shelf classifier is trained on the small labeled set; in the second stage, that model is used (as described in point 2) to pre-label the unlabeled data for the weighted nearest neighbor classifier.
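The two-stage procedure can be sketched in code. This is only a minimal illustration of the idea, not the authors' exact algorithm: the centroid classifier used as the initial (stage-1) model, the toy 1-D data, and the pre-label weight of 0.3 are all hypothetical choices made for the example.

```python
from collections import Counter

def train_initial(labeled):
    """Stage 1: a trivial centroid classifier built from the labeled data.
    `labeled` is a list of (x, y) pairs with a 1-D feature x and class y."""
    sums, counts = {}, Counter()
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    centroids = {y: sums[y] / counts[y] for y in sums}
    # Predict the class whose centroid is closest to x.
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

def weighted_knn(train, weights, x, k=3):
    """Weighted k-NN: each of the k nearest points votes with its weight."""
    nearest = sorted(zip(train, weights), key=lambda tw: abs(tw[0][0] - x))[:k]
    votes = Counter()
    for (xi, yi), w in nearest:
        votes[yi] += w
    return votes.most_common(1)[0][0]

# Tiny illustrative 1-D data set (hypothetical numbers): two clusters.
labeled = [(0.0, 'a'), (1.0, 'a'), (9.0, 'b'), (10.0, 'b')]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]

clf = train_initial(labeled)                   # stage 1: initial classifier
prelabeled = [(x, clf(x)) for x in unlabeled]  # "pre-label" the unlabeled data

# Stage 2: labeled points vote with weight 1.0, pre-labeled points with a
# smaller weight (0.3 here is an arbitrary illustrative value).
train = labeled + prelabeled
weights = [1.0] * len(labeled) + [0.3] * len(prelabeled)

print(weighted_knn(train, weights, 2.5))  # query near the 'a' cluster
print(weighted_knn(train, weights, 7.5))  # query near the 'b' cluster
```

With the pre-labeled points included, the nearest neighbor classifier can classify queries like 2.5 and 7.5 using neighbors that the small labeled set alone does not provide; the down-weighting keeps possibly mislabeled pre-labeled points from outvoting true labels.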