Batch Data Processing

In data processing we typically use DataSet and DataLoader (their general usage is not covered here). Within each batch, data is usually loaded with shuffle=True, i.e. the order is randomized so the samples arrive with less pattern and more randomness. But when the samples have variable lengths, or the length distribution is uneven, and we need fixed-length input, the usual approach is to pad everything to one fixed length. What if the lengths differ wildly?

For example:

example1: I have a little scooter, and I am never in a hurry; riding my little scooter, I ride it from here on.
example2: I love Nanjing.

If example2 is padded to the same fixed length as example1, it ends up mostly padding. (You could bring in a mask, but then the mask has to be set at every layer of the model; masks are out of scope here.)
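As a concrete illustration, here is a minimal sketch using PyTorch's pad_sequence; the token ids below are made up purely for illustration:

import torch
from torch.nn.utils.rnn import pad_sequence

# hypothetical token-id sequences: one long sample, one short one
example1 = torch.tensor([3, 15, 27, 9, 42, 8, 11, 30, 5, 19, 7, 22, 4, 16, 28, 6, 13, 2])
example2 = torch.tensor([3, 33, 10, 2])

# pad every sample in the batch to the length of the longest member
batch = pad_sequence([example1, example2], batch_first=True, padding_value=0)
print(batch.shape)  # torch.Size([2, 18]); example2 carries 14 padding tokens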

So how can we make the length distribution within a single batch more uniform?

Clustering algorithms such as k-means can help here, using each sample's length as the clustering feature.



import torch


def kmeans(x, k, max_it=32):
    r"""
    KMeans algorithm for clustering the sentences by length.

    Args:
        x (list[int]):
            The list of sentence lengths.
        k (int):
            The number of clusters.
            This is an approximate value. The final number of clusters can be less than or equal to `k`.
        max_it (int):
            Maximum number of iterations.
            If the centroids do not converge after several iterations, the algorithm is stopped early.

    Returns:
        list[float], list[list[int]]:
            The first list contains the average lengths of sentences in each cluster.
            The second is the list of clusters holding the indices of data points.

    Examples:
        >>> x = torch.randint(10, 20, (10,)).tolist()
        >>> x
        [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
        >>> centroids, clusters = kmeans(x, 3)
        >>> centroids
        [10.5, 14.0, 17.799999237060547]
        >>> clusters
        [[1, 3], [0, 5, 9], [2, 4, 6, 7, 8]]
    """

    # the number of clusters must not be greater than the number of datapoints
    x, k = torch.tensor(x, dtype=torch.float), min(len(x), k)
    # collect unique datapoints
    d = x.unique()
    # initialize k centroids randomly
    c = d[torch.randperm(len(d))[:k]]
    # assign each datapoint to the cluster with the closest centroid
    dists, y = torch.abs_(x.unsqueeze(-1) - c).min(-1)

    for _ in range(max_it):
        # if an empty cluster is encountered,
        # choose the farthest datapoint from the biggest cluster and move it to the empty one
        mask = torch.arange(k).unsqueeze(-1).eq(y)
        none = torch.where(~mask.any(-1))[0].tolist()
        while len(none) > 0:
            for i in none:
                # the biggest cluster
                b = torch.where(mask[mask.sum(-1).argmax()])[0]
                # the datapoint farthest from the centroid of cluster b
                f = dists[b].argmax()
                # update the assigned cluster of f
                y[b[f]] = i
                # re-calculate the mask
                mask = torch.arange(k).unsqueeze(-1).eq(y)
            none = torch.where(~mask.any(-1))[0].tolist()
        # update the centroids
        c, old = (x * mask).sum(-1) / mask.sum(-1), c
        # re-assign all datapoints to clusters
        dists, y = torch.abs_(x.unsqueeze(-1) - c).min(-1)
        # stop iterating early if the centroids converge
        if c.equal(old):
            break
    # assign all datapoints to the newly generated clusters;
    # the empty ones are discarded
    assigned = y.unique().tolist()
    # get the centroids of the assigned clusters
    centroids = c[assigned].tolist()
    # map all values of datapoints to buckets
    clusters = [torch.where(y.eq(i))[0].tolist() for i in assigned]

    return centroids, clusters
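To show how the returned clusters might feed batching, here is a minimal sketch under an assumed token-budget heuristic (the budget value and the variable names are illustrative, not from the original code):

lengths = [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
centroids, clusters = kmeans(lengths, 3)

budget = 64  # assumed max number of tokens per batch (made-up value)
batches = []
for centroid, cluster in zip(centroids, clusters):
    # how many samples of this average length fit within the budget
    size = max(1, int(budget // centroid))
    batches.extend(cluster[i:i + size] for i in range(0, len(cluster), size))

# each element of `batches` is a list of dataset indices with similar lengths;
# note that the batch size now varies from bucket to bucket
print(batches)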

Used this way, though, batch_size loses much of its meaning: a batch may end up holding a single sample, or as many samples as the configured upper limit.

But on second thought, it need not be this elaborate: simply sorting by length also works, as the sketch below shows; a detailed explanation is omitted.
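A minimal sketch of that simpler route, assuming plain Python lists (the helper name below is hypothetical):

def length_sorted_batches(lengths, batch_size):
    # sort dataset indices by sample length, then cut into fixed-size batches;
    # neighbours in sorted order have similar lengths, so padding stays small
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

lengths = [15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
print(length_sorted_batches(lengths, 4))
# [[1, 3, 5, 9], [0, 2, 6, 4], [8, 7]]

Note that this trades away some of the randomness that shuffle=True provides; shuffling the order of the batches, rather than of the individual samples, is a common compromise.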
