Syntax
padded_batch(batch_size, padded_shapes=None, padding_values=None, drop_remainder=False,name=None)
This function combines consecutive elements of the dataset into padded batches, i.e. it merges multiple consecutive elements of the input dataset into a single element. Like `tf.data.Dataset.batch`, the components of the resulting element have an additional outer dimension of size `batch_size`. If `batch_size` does not evenly divide the number of input elements `N` and `drop_remainder` is `False`, the last element has an outer dimension of `N % batch_size`. If your program depends on the batches having the same outer dimension, set the `drop_remainder` argument to `True` to prevent the smaller batch from being produced.
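The batch-size arithmetic above can be sketched in plain Python (a minimal illustration, not TensorFlow code; `batch_sizes` is a hypothetical helper, not part of the tf.data API):

```python
def batch_sizes(n, batch_size, drop_remainder=False):
    """Return the outer dimension of each batch produced from n input elements."""
    sizes = [batch_size] * (n // batch_size)
    remainder = n % batch_size
    if remainder and not drop_remainder:
        sizes.append(remainder)  # the smaller final batch of size N % batch_size
    return sizes

print(batch_sizes(10, 3))                       # [3, 3, 3, 1]
print(batch_sizes(10, 3, drop_remainder=True))  # [3, 3, 3]
```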
Unlike `tf.data.Dataset.batch`, the input elements to be batched may have different shapes, and this function pads each component to the respective shape in `padded_shapes`. The `padded_shapes` argument determines the resulting shape for each dimension of each component in an output element:
- If the dimension is set to a constant, the component is padded to that length in that dimension.
- If the dimension is unknown, the component is padded to the maximum length of all elements in that dimension.
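These two rules can be sketched in plain Python, assuming a hypothetical helper that pads 1-D lists with a fill value (TensorFlow's real implementation operates on tensors, but the rule is the same):

```python
def pad_batch(batch, padded_size=None, pad_value=0):
    """Pad each 1-D list in `batch` to `padded_size`, or, when
    `padded_size` is None (the 'unknown dimension' case), to the
    length of the longest list in the batch."""
    target = padded_size if padded_size is not None else max(len(x) for x in batch)
    return [x + [pad_value] * (target - len(x)) for x in batch]

print(pad_batch([[1], [2, 2]]))               # [[1, 0], [2, 2]]
print(pad_batch([[3, 3, 3]], padded_size=5))  # [[3, 3, 3, 0, 0]]
```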
Arguments
Argument | Meaning |
---|---|
batch_size | A `tf.int64` scalar `tf.Tensor`, the number of consecutive elements of this dataset to combine in a single batch. |
padded_shapes | (Optional.) A (nested) structure of `tf.TensorShape` or `tf.int64` vector tensor-like objects, the shapes to which the components of each input element should be padded before batching. Unknown dimensions are padded to the maximum size of that dimension in each batch; if unset, all dimensions of all components are padded to the maximum size in the batch. Must be set if any component has an unknown rank. |
padding_values | (Optional.) A (nested) structure of scalar-shaped `tf.Tensor`, the padding values to use for the respective components. `None` means pad with default values: `0` for numeric types and the empty string for string types. `padding_values` should have the same (nested) structure as the input dataset; if it is a single element and the input dataset has multiple components, the same `padding_values` is used to pad every component. If `padding_values` is a scalar, its value is broadcast to match the shape of each component. |
drop_remainder | (Optional.) A `tf.bool` scalar `tf.Tensor`, whether the last batch should be dropped if it has fewer than `batch_size` elements; defaults to `False`. |
name | (Optional.) A name for the tf.data operation. |
Return value
Return value | Meaning |
---|---|
Dataset | A `tf.data.Dataset`. |
Exceptions
Exception type | Meaning |
---|---|
ValueError | If a component has an unknown rank and the `padded_shapes` argument is not set. |
TypeError | If a component is of an unsupported type (the list of supported types is documented in the tf.data guide on dataset structure). |
Examples
A = (tf.data.Dataset
.range(1, 5, output_type=tf.int32)
.map(lambda x: tf.fill([x], x)))
for element in A.as_numpy_iterator():
print(element)
Output:
[1]
[2 2]
[3 3 3]
[4 4 4 4]
Pad each batch to the smallest size that fits all of its elements:
B = A.padded_batch(2)
for element in B.as_numpy_iterator():
print(element)
Output:
[[1 0]
[2 2]]
[[3 3 3 0]
[4 4 4 4]]
Pad to a fixed size:
C = A.padded_batch(2, padded_shapes=5)
for element in C.as_numpy_iterator():
print(element)
Output:
[[1 0 0 0 0]
[2 2 0 0 0]]
[[3 3 3 0 0]
[4 4 4 4 0]]
Pad with a custom value via padding_values:
D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
for element in D.as_numpy_iterator():
print(element)
Output:
[[ 1 -1 -1 -1 -1]
[ 2 2 -1 -1 -1]]
[[ 3 3 3 -1 -1]
[ 4 4 4 4 -1]]
Components of nested elements can be padded independently:
elements = [([1, 2, 3], [10]),
([4, 5], [11, 12])]
dataset = tf.data.Dataset.from_generator(
lambda: iter(elements), (tf.int32, tf.int32))
for element in dataset.as_numpy_iterator():
print(element)
Output:
(array([1, 2, 3]), array([10]))
(array([4, 5]), array([11, 12]))
Pad the first component of each element to length 4 with padding_values=-1, and pad the second component to the smallest size that fits with padding_values=100:
dataset = dataset.padded_batch(2,
padded_shapes=([4], [None]),
padding_values=(-1, 100))
list(dataset.as_numpy_iterator())
Output:
[(array([[ 1,  2,  3, -1],
        [ 4,  5, -1, -1]], dtype=int32),
  array([[ 10, 100],
        [ 11,  12]], dtype=int32))]
Pad multiple components with a single value:
E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
for element in E.as_numpy_iterator():
print(element)
Output:
(array([[ 1, -1],
[ 2, 2]]), array([[ 1, -1],
[ 2, 2]]))
(array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]), array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]))
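The last example, where one scalar padding value is reused for every component of a tuple structure, can be sketched in plain Python (an illustration of the broadcasting rule, not TensorFlow code; both helpers are hypothetical):

```python
def pad_component(batch, pad_value):
    """Pad each 1-D list to the longest length in the batch."""
    target = max(len(x) for x in batch)
    return [x + [pad_value] * (target - len(x)) for x in batch]

def padded_batch_tuple(batch, pad_value):
    """Pad each component of a batch of tuples independently,
    reusing the same scalar pad_value for every component."""
    components = list(zip(*batch))  # group values by component
    return tuple(pad_component(list(c), pad_value) for c in components)

batch = [([1], [1]), ([2, 2], [2, 2])]
print(padded_batch_tuple(batch, -1))
# ([[1, -1], [2, 2]], [[1, -1], [2, 2]])
```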
Implementation
def padded_batch(self,
batch_size,
padded_shapes=None,
padding_values=None,
drop_remainder=False,
name=None):
"""Combines consecutive elements of this dataset into padded batches.
This transformation combines multiple consecutive elements of the input
dataset into a single element.
Like `tf.data.Dataset.batch`, the components of the resulting element will
have an additional outer dimension, which will be `batch_size` (or
`N % batch_size` for the last element if `batch_size` does not divide the
number of input elements `N` evenly and `drop_remainder` is `False`). If
your program depends on the batches having the same outer dimension, you
should set the `drop_remainder` argument to `True` to prevent the smaller
batch from being produced.
Unlike `tf.data.Dataset.batch`, the input elements to be batched may have
different shapes, and this transformation will pad each component to the
respective shape in `padded_shapes`. The `padded_shapes` argument
determines the resulting shape for each dimension of each component in an
output element:
* If the dimension is a constant, the component will be padded out to that
length in that dimension.
* If the dimension is unknown, the component will be padded out to the
maximum length of all elements in that dimension.
>>> A = (tf.data.Dataset
... .range(1, 5, output_type=tf.int32)
... .map(lambda x: tf.fill([x], x)))
>>> # Pad to the smallest per-batch size that fits all elements.
>>> B = A.padded_batch(2)
>>> for element in B.as_numpy_iterator():
... print(element)
[[1 0]
[2 2]]
[[3 3 3 0]
[4 4 4 4]]
>>> # Pad to a fixed size.
>>> C = A.padded_batch(2, padded_shapes=5)
>>> for element in C.as_numpy_iterator():
... print(element)
[[1 0 0 0 0]
[2 2 0 0 0]]
[[3 3 3 0 0]
[4 4 4 4 0]]
>>> # Pad with a custom value.
>>> D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
>>> for element in D.as_numpy_iterator():
... print(element)
[[ 1 -1 -1 -1 -1]
[ 2 2 -1 -1 -1]]
[[ 3 3 3 -1 -1]
[ 4 4 4 4 -1]]
>>> # Components of nested elements can be padded independently.
>>> elements = [([1, 2, 3], [10]),
... ([4, 5], [11, 12])]
>>> dataset = tf.data.Dataset.from_generator(
... lambda: iter(elements), (tf.int32, tf.int32))
>>> # Pad the first component of the tuple to length 4, and the second
>>> # component to the smallest size that fits.
>>> dataset = dataset.padded_batch(2,
... padded_shapes=([4], [None]),
... padding_values=(-1, 100))
>>> list(dataset.as_numpy_iterator())
[(array([[ 1, 2, 3, -1], [ 4, 5, -1, -1]], dtype=int32),
array([[ 10, 100], [ 11, 12]], dtype=int32))]
>>> # Pad with a single value and multiple components.
>>> E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
>>> for element in E.as_numpy_iterator():
... print(element)
(array([[ 1, -1],
[ 2, 2]], dtype=int32), array([[ 1, -1],
[ 2, 2]], dtype=int32))
(array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]], dtype=int32), array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]], dtype=int32))
See also `tf.data.experimental.dense_to_sparse_batch`, which combines
elements that may have different shapes into a `tf.sparse.SparseTensor`.
Args:
batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
consecutive elements of this dataset to combine in a single batch.
padded_shapes: (Optional.) A (nested) structure of `tf.TensorShape` or
`tf.int64` vector tensor-like objects representing the shape to which
the respective component of each input element should be padded prior
to batching. Any unknown dimensions will be padded to the maximum size
of that dimension in each batch. If unset, all dimensions of all
components are padded to the maximum size in the batch. `padded_shapes`
must be set if any component has an unknown rank.
padding_values: (Optional.) A (nested) structure of scalar-shaped
`tf.Tensor`, representing the padding values to use for the respective
components. None represents that the (nested) structure should be padded
with default values. Defaults are `0` for numeric types and the empty
string for string types. The `padding_values` should have the same
(nested) structure as the input dataset. If `padding_values` is a single
element and the input dataset has multiple components, then the same
`padding_values` will be used to pad every component of the dataset.
If `padding_values` is a scalar, then its value will be broadcasted
to match the shape of each component.
drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
whether the last batch should be dropped in the case it has fewer than
`batch_size` elements; the default behavior is not to drop the smaller
batch.
name: (Optional.) A name for the tf.data operation.
Returns:
Dataset: A `Dataset`.
Raises:
ValueError: If a component has an unknown rank, and the `padded_shapes`
argument is not set.
TypeError: If a component is of an unsupported type. The list of supported
types is documented in
https://www.tensorflow.org/guide/data#dataset_structure.
"""
if padded_shapes is None:
padded_shapes = get_legacy_output_shapes(self)
for i, shape in enumerate(nest.flatten(padded_shapes)):
# A `tf.TensorShape` is only false if its *rank* is unknown.
if not shape:
raise ValueError(f"You must provide `padded_shapes` argument because "
f"component {i} has unknown rank.")
return PaddedBatchDataset(
self,
batch_size,
padded_shapes,
padding_values,
drop_remainder,
name=name)