Batch Data Processing

In data processing we typically use DataSet and DataLoader (their general usage is not covered here). Within each batch, data is usually loaded with shuffle=True, i.e. the order is randomized so the samples arrive with less pattern and more randomness. But when the samples have variable lengths, or the length distribution is uneven, and we need fixed-length input, the usual approach is to pad everything to one fixed length. What if the lengths differ wildly?

For example:

example1: I have a little scooter, and I am never in a hurry; riding my little scooter, I ride it from here on.
example2: I love Nanjing.

If example2 is padded to the same fixed length as example1, it ends up mostly padding. (You could bring in a mask, but then the mask has to be set at every layer of the model; masks are out of scope here.)
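As a concrete illustration, here is a minimal sketch using PyTorch's pad_sequence; the token ids below are made up purely for illustration:

import torch
from torch.nn.utils.rnn import pad_sequence

# hypothetical token-id sequences: one long sample, one short one
example1 = torch.tensor([3, 15, 27, 9, 42, 8, 11, 30, 5, 19, 7, 22, 4, 16, 28, 6, 13, 2])
example2 = torch.tensor([3, 33, 10, 2])

# pad every sample in the batch to the length of the longest member
batch = pad_sequence([example1, example2], batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([2, 18]); example2 carries 14 padding tokens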

So how can we make the length distribution within a single batch more uniform?

Clustering algorithms such as k-means can help here, using each sample's length as the clustering feature.



import torch


def kmeans(x, k, max_it=32):
    r"""
    KMeans algorithm for clustering the sentences by length.

    Args:
        x (list[int]):
            The list of sentence lengths.
        k (int):
            The number of clusters.
            This is an approximate value. The final number of clusters can be less than or equal to `k`.
        max_it (int):
            Maximum number of iterations.
            If the centroids do not converge after several iterations, the algorithm is stopped early.

    Returns:
        list[float], list[list[int]]:
            The first list contains the average lengths of sentences in each cluster.
            The second is the list of clusters holding the indices of data points.

    Examples:
        >>> x = torch.randint(10, 20, (10,)).tolist()
        >>> x
        [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
        >>> centroids, clusters = kmeans(x, 3)
        >>> centroids
        [10.5, 14.0, 17.799999237060547]
        >>> clusters
        [[1, 3], [0, 5, 9], [2, 4, 6, 7, 8]]
    """

    # the number of clusters must not be greater than the number of datapoints
    x, k = torch.tensor(x, dtype=torch.float), min(len(x), k)
    # collect unique datapoints
    d = x.unique()
    # initialize k centroids randomly
    c = d[torch.randperm(len(d))[:k]]
    # assign each datapoint to the cluster with the closest centroid
    dists, y = torch.abs_(x.unsqueeze(-1) - c).min(-1)

    for _ in range(max_it):
        # if an empty cluster is encountered,
        # choose the farthest datapoint from the biggest cluster and move it to the empty one
        mask = torch.arange(k).unsqueeze(-1).eq(y)
        none = torch.where(~mask.any(-1))[0].tolist()
        while len(none) > 0:
            for i in none:
                # the biggest cluster
                b = torch.where(mask[mask.sum(-1).argmax()])[0]
                # the datapoint farthest from the centroid of cluster b
                f = dists[b].argmax()
                # update the assigned cluster of f
                y[b[f]] = i
                # re-calculate the mask
                mask = torch.arange(k).unsqueeze(-1).eq(y)
            none = torch.where(~mask.any(-1))[0].tolist()
        # update the centroids
        c, old = (x * mask).sum(-1) / mask.sum(-1), c
        # re-assign all datapoints to clusters
        dists, y = torch.abs_(x.unsqueeze(-1) - c).min(-1)
        # stop iterating early if the centroids converge
        if c.equal(old):
            break
    # assign all datapoints to the newly generated clusters;
    # the empty ones are discarded
    assigned = y.unique().tolist()
    # get the centroids of the assigned clusters
    centroids = c[assigned].tolist()
    # map all values of datapoints to buckets
    clusters = [torch.where(y.eq(i))[0].tolist() for i in assigned]

    return centroids, clusters
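To show how the returned clusters might feed batching, here is a minimal sketch under an assumed token-budget heuristic (the budget value and the variable names are illustrative, not from the original code):

lengths = [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
centroids, clusters = kmeans(lengths, 3)

budget = 64  # assumed max number of tokens per batch (made-up value)
batches = []
for centroid, cluster in zip(centroids, clusters):
    # how many samples of this average length fit within the budget
    size = max(1, int(budget // centroid))
    batches.extend(cluster[i:i + size] for i in range(0, len(cluster), size))

# each element of `batches` is a list of dataset indices with similar lengths;
# note that the batch size now varies from bucket to bucket
print(batches)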

Used this way, though, batch_size loses much of its meaning: a batch may end up holding a single sample, or as many samples as the configured upper limit.

But on second thought, it need not be this elaborate: simply sorting by length also works, as the sketch below shows; a detailed explanation is omitted.
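A minimal sketch of that simpler route, assuming plain Python lists (the helper name below is hypothetical):

def length_sorted_batches(lengths, batch_size):
    # sort dataset indices by sample length, then cut into fixed-size batches;
    # neighbours in sorted order have similar lengths, so padding stays small
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

lengths = [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
print(length_sorted_batches(lengths, 4))
# [[1, 3, 5, 9], [0, 2, 6, 4], [8, 7]]

Note that this trades away some of the randomness that shuffle=True provides; shuffling the order of the batches, rather than of the individual samples, is a common compromise.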
