(2) Semi-supervised Learning (Hung-yi Lee)


ImangoCloud


Original source: https://speech.ee.ntu.edu.tw/~hylee/ml/2016-fall.html


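Introduction

The data for semi-supervised learning: $\{(x^r,\hat{y}^r)\}^R_{r=1},\{ x^u\}^{R+U}_{u=R+1}$, that is, $R$ labeled examples and $U$ unlabeled ones. The unlabeled set is usually much larger than the labeled set ($U \gg R$).

The features of the test set can also serve as unlabeled data for semi-supervised learning, as long as its labels are never used; this is called transductive learning.
If the unlabeled data does not come from the test set, it is called inductive learning.
Reference: Hung-yi Lee, Machine Learning 2016, Semi-supervised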

Outline

Semi-supervised learning for generative models
Low-density separation assumption
Smoothness assumption

Semi-supervised learning for generative models

[Figure: the general compute-and-update procedure for the generative model]
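For reference, the procedure covered in the cited lecture alternates two steps until convergence: compute the posterior of every unlabeled example under the current parameters, then re-estimate the parameters with each unlabeled example counted fractionally. The following is a minimal sketch of that idea, assuming 1-D inputs, two classes, Gaussian class-conditionals with a shared variance, and at least one labeled example per class. The function names `gaussian_pdf` and `semi_supervised_em` are hypothetical and do not come from the original post.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_em(labeled, unlabeled, n_iters=20):
    """labeled: list of (x, c) with c in {0, 1}; unlabeled: list of x."""
    # Initialise the parameters from the labeled data alone.
    xs = [[x for x, c in labeled if c == k] for k in (0, 1)]
    prior = [len(xs[k]) / len(labeled) for k in (0, 1)]
    mean = [sum(xs[k]) / len(xs[k]) for k in (0, 1)]
    var = sum((x - mean[c]) ** 2 for x, c in labeled) / len(labeled) or 1.0
    n_total = len(labeled) + len(unlabeled)

    for _ in range(n_iters):
        # Step 1: soft label = posterior P(C_k | x^u) under the current parameters.
        post = []
        for x in unlabeled:
            joint = [prior[k] * gaussian_pdf(x, mean[k], var) for k in (0, 1)]
            z = sum(joint)
            post.append([j / z for j in joint])

        # Step 2: update the parameters; unlabeled points contribute fractionally.
        for k in (0, 1):
            weight = len(xs[k]) + sum(p[k] for p in post)
            prior[k] = weight / n_total
            mean[k] = (sum(xs[k])
                       + sum(p[k] * x for p, x in zip(post, unlabeled))) / weight
        var = (sum((x - mean[c]) ** 2 for x, c in labeled)
               + sum(p[k] * (x - mean[k]) ** 2
                     for p, x in zip(post, unlabeled) for k in (0, 1))) / n_total
    return prior, mean, var
```

Each unlabeled point contributes to both classes in proportion to its posterior, which is what the later sections refer to as a soft label.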

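Low-density separation assumption

The assumption is that few data points lie near the decision boundary, so the boundary passes through a low-density region.
Self-training
Self-training and the generative approach differ in the kind of label they assign to unlabeled data.
Self-training uses hard labels (each example belongs to exactly one class and contributes only to that class);
the model in the previous section uses soft labels (each example contributes to several classes).
For a neural network, training on its own soft labels cannot improve anything: the target is identical to the current output, so there is nothing to optimize. The labels must be hard, and that choice rests on the low-density separation assumption.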
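The difference can be checked numerically. The sketch below uses the standard fact that the gradient of the cross-entropy with respect to the logits is `softmax(logits) - target`; with a soft label equal to the network's own output, that gradient is zero, while a hard label still pushes the prediction toward one class. The helper names are hypothetical and the logits are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_grad(logits, target):
    """Gradient of cross-entropy w.r.t. the logits, with the target held fixed."""
    return [p - t for p, t in zip(softmax(logits), target)]

def hard_label(probs):
    """One-hot vector for the most probable class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

logits = [0.85, 0.0]          # made-up network output for one unlabeled example
probs = softmax(logits)       # roughly 70% class 0, 30% class 1

soft_grad = cross_entropy_grad(logits, probs)              # all zeros: no update
hard_grad = cross_entropy_grad(logits, hard_label(probs))  # non-zero: sharpens class 0
print(soft_grad, hard_grad)
```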

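Entropy regularization
The model should be as accurate as possible on the labeled data while the entropy of its predictions on the unlabeled data is as small as possible.

In other words, the predicted probability over the classes should be concentrated on a single class; the lower the entropy, the better.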
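Written out, the entropy of a prediction $y^u$ over classes $m$ is $E(y^u) = -\sum_m y^u_m \ln y^u_m$, and the training objective becomes $\sum_{x^r} C(y^r,\hat{y}^r) + \lambda \sum_{x^u} E(y^u)$. A minimal sketch of this objective follows, assuming predictions are already probability lists and using cross-entropy for $C$; the function names and the default `lam=0.1` are illustrative assumptions, not values from the original post.

```python
import math

def entropy(probs):
    """Entropy of a predicted distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy between a prediction and an integer class label."""
    return -math.log(max(probs[label], eps))

def entropy_regularized_loss(labeled_preds, labels, unlabeled_preds, lam=0.1):
    """Supervised loss on labeled data plus lam times entropy on unlabeled data."""
    supervised = sum(cross_entropy(p, y) for p, y in zip(labeled_preds, labels))
    unsupervised = sum(entropy(p) for p in unlabeled_preds)
    return supervised + lam * unsupervised
```

A confident prediction such as `[1.0, 0.0]` has zero entropy, while a uniform one such as `[0.5, 0.5]` has the maximum for two classes, so the second term penalizes indecisive predictions on unlabeled data.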

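Smoothness assumption

The distribution of x is not uniform: it is dense in some regions and sparse in others.
If $x^1$ and $x^2$ are close in a high-density region, that is, connected by a high-density path, then $\hat{y}^1$ and $\hat{y}^2$ are similar.

Describing the paths with a graph
k-nearest neighbors: connect each point to its k nearest neighbors
ε-neighborhood: connect points whose distance is below a threshold
Class labels are transitive: they propagate along connected points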
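The two graph constructions above can be sketched as follows. This is a small illustration, assuming Euclidean distance and a Gaussian (RBF) edge weight $\exp(-\gamma\lVert x^i-x^j\rVert^2)$; the function names and the default `gamma` are hypothetical.

```python
import math

def rbf_weight(a, b, gamma=1.0):
    """Edge weight that decays quickly with squared Euclidean distance."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-gamma * sq_dist)

def knn_graph(points, k=2, gamma=1.0):
    """Connect every point to its k nearest neighbors (symmetrized)."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = w[j][i] = rbf_weight(points[i], points[j], gamma)
    return w

def epsilon_graph(points, eps, gamma=1.0):
    """Connect every pair of points whose distance is below eps."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < eps:
                w[i][j] = w[j][i] = rbf_weight(points[i], points[j], gamma)
    return w
```

Either function returns a symmetric weight matrix in which far-apart points get no edge, so labels can only spread through chains of nearby points.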

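Defining smoothness

$S=\frac{1}{2}\sum_{i,j}{w_{i,j}(y^i-y^j)^2}$
Here $w_{i,j}$ is the edge weight between $x^i$ and $x^j$, and the sum runs over all data, labeled or not. The smaller $S$ is, the smoother the labeling.
The same quantity can be computed in matrix form.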
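In matrix form, with $\mathbf{y}$ the vector of all outputs, $W$ the symmetric weight matrix and $D$ the diagonal matrix of row sums of $W$, the smoothness is $S=\mathbf{y}^{T}(D-W)\,\mathbf{y}$, where $D-W$ is the graph Laplacian. The sketch below computes $S$ both ways on a made-up four-node graph to show that they agree; the function names and the example weights are illustrative assumptions.

```python
def smoothness_pairwise(w, y):
    """S = 1/2 * sum over all ordered pairs (i, j) of w[i][j] * (y[i] - y[j])**2."""
    n = len(y)
    return 0.5 * sum(w[i][j] * (y[i] - y[j]) ** 2
                     for i in range(n) for j in range(n))

def smoothness_laplacian(w, y):
    """S = y^T (D - W) y, with D the diagonal matrix of row sums of W."""
    n = len(y)
    degree = [sum(row) for row in w]
    lap = [[(degree[i] if i == j else 0.0) - w[i][j] for j in range(n)]
           for i in range(n)]
    return sum(y[i] * lap[i][j] * y[j] for i in range(n) for j in range(n))

# Made-up symmetric weights: edges 0-1 (2), 0-2 (3), 1-2 (1), 2-3 (1).
W = [[0, 2, 3, 0],
     [2, 0, 1, 0],
     [3, 1, 0, 1],
     [0, 0, 1, 0]]

smooth = [1, 1, 1, 0]   # labels change across one weak edge only
rough = [0, 1, 1, 0]    # labels change across the strong edges as well
print(smoothness_pairwise(W, smooth), smoothness_pairwise(W, rough))
```

The labeling that only disagrees across the single weak edge gets the smaller $S$, which matches the intuition that labels should change only where the graph is weakly connected.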

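New loss function
$L=\sum_{x^r}{C(y^r,\hat{y}^r)} +\lambda S$
The second term also depends on the network parameters, so it acts as a regularizer that is minimized together with the supervised loss.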

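Semi-supervised learning code

The semi-supervised step sits at the very start of each epoch loop:

Run the current model on the unlabeled data to build a pseudo-labeled dataset;
Merge train_set and pseudo_set with ConcatDataset;
Load the merged dataset with DataLoader.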

from torch.utils.data import ConcatDataset, DataLoader

# Whether to do semi-supervised learning.
do_semi = False

for epoch in range(n_epochs):
    # ---------- TODO ----------
    # In each epoch, relabel the unlabeled dataset for semi-supervised learning.
    # Then you can combine the labeled dataset and pseudo-labeled dataset for the training.
    if do_semi:
        # Obtain pseudo-labels for unlabeled data using trained model.
        pseudo_set = get_pseudo_labels(unlabeled_set, model)

        # Construct a new dataset and a data loader for training.
        # This is used in semi-supervised learning only.
        concat_dataset = ConcatDataset([train_set, pseudo_set])
        train_loader = DataLoader(concat_dataset, batch_size=batch_size, shuffle=True, num_workers=8, pin_memory=True)

    # ---------- Training ----------
    # Make sure the model is in train mode before training.
    model.train()
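    # The usual training loop follows from here.

get_pseudo_labels is the function that uses the latest model to produce pseudo-labels (I have not fully understood it yet).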

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm import tqdm

def get_pseudo_labels(dataset, model, threshold=0.65):
    # This function generates pseudo-labels for a dataset using the given model.
    # It is meant to return an instance of DatasetFolder containing the images whose prediction confidence exceeds the given threshold.
    # You are NOT allowed to use any models trained on external data for pseudo-labeling.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Construct a data loader (batch_size is assumed to be a global defined by the training script).
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    # Make sure the model is in eval mode.
    model.eval()
    # Define softmax function.
    softmax = nn.Softmax(dim=-1)

    # Iterate over the dataset by batches.
    for batch in tqdm(data_loader):
        img, _ = batch

        # Forward the data
        # Using torch.no_grad() accelerates the forward process.
        with torch.no_grad():
            logits = model(img.to(device))

        # Obtain the probability distributions by applying softmax on logits.
        probs = softmax(logits)

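        # ---------- TODO ----------
        # Filter the data and construct a new dataset:
        # keep only the samples whose highest probability in probs exceeds threshold.

    # Switch the model back to train mode.
    model.train()
    # Until the TODO above is implemented, the input dataset is returned unfiltered.
    return dataset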
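One possible way to fill in the filtering TODO is sketched below. It is not the original assignment's solution: it assumes the images and softmax outputs from all batches have already been gathered into two tensors (for example with `torch.cat`), and it returns a `TensorDataset` rather than the `DatasetFolder` mentioned in the comments. The helper name `select_confident` is hypothetical.

```python
import torch
from torch.utils.data import TensorDataset

def select_confident(images, probs, threshold=0.65):
    """Keep the samples whose top class probability exceeds `threshold`.

    images: tensor of shape (N, ...); probs: tensor of shape (N, num_classes).
    Returns a TensorDataset of (image, hard pseudo-label) pairs.
    """
    # Highest probability per sample and the class index that achieves it.
    confidence, pseudo_labels = probs.max(dim=-1)
    keep = confidence > threshold
    return TensorDataset(images[keep], pseudo_labels[keep])

# Toy example with made-up probabilities for four images and two classes.
images = torch.randn(4, 3, 8, 8)
probs = torch.tensor([[0.90, 0.10],
                      [0.55, 0.45],
                      [0.30, 0.70],
                      [0.50, 0.50]])
pseudo_set = select_confident(images, probs)
print(len(pseudo_set))  # only the first and third samples pass the 0.65 threshold
```

This keeps every selected image in memory and assigns hard labels, in line with the self-training idea above. Depending on how train_set returns its labels, the pseudo-labels may need converting (for example to Python ints) so that both datasets collate together inside ConcatDataset.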
