Syntax
padded_batch(batch_size, padded_shapes=None, padding_values=None, drop_remainder=False,name=None)
This function combines consecutive elements of the dataset into padded batches, i.e. it merges multiple consecutive elements of the input dataset into a single element. Like `tf.data.Dataset.batch`, the components of the resulting element have an additional outer dimension of size `batch_size`. If `batch_size` does not evenly divide the number of input elements `N` and `drop_remainder` is `False`, the last element has an outer dimension of `N % batch_size`. If your program depends on the batches having the same outer dimension, set the `drop_remainder` argument to `True` to prevent the smaller batch from being produced.
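The batch-size arithmetic above can be sketched in plain Python (a minimal illustration, not TensorFlow code; `batch_sizes` is a hypothetical helper, not part of the tf.data API):

```python
def batch_sizes(n, batch_size, drop_remainder=False):
    """Return the outer dimension of each batch produced from n input elements."""
    sizes = [batch_size] * (n // batch_size)
    remainder = n % batch_size
    if remainder and not drop_remainder:
        sizes.append(remainder)  # the smaller final batch of size N % batch_size
    return sizes

print(batch_sizes(10, 3))                       # [3, 3, 3, 1]
print(batch_sizes(10, 3, drop_remainder=True))  # [3, 3, 3]
```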
Unlike `tf.data.Dataset.batch`, the input elements to be batched may have different shapes, and this function pads each component to the respective shape in `padded_shapes`. The `padded_shapes` argument determines the resulting shape for each dimension of each component in an output element:
- If the dimension is set to a constant, the component is padded to that length in that dimension.
- If the dimension is unknown, the component is padded to the maximum length of all elements in that dimension.
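These two rules can be sketched in plain Python, assuming a hypothetical helper that pads 1-D lists with a fill value (TensorFlow's real implementation operates on tensors, but the rule is the same):

```python
def pad_batch(batch, padded_size=None, pad_value=0):
    """Pad each 1-D list in `batch` to `padded_size`, or, when
    `padded_size` is None (the 'unknown dimension' case), to the
    length of the longest list in the batch."""
    target = padded_size if padded_size is not None else max(len(x) for x in batch)
    return [x + [pad_value] * (target - len(x)) for x in batch]

print(pad_batch([[1], [2, 2]]))               # [[1, 0], [2, 2]]
print(pad_batch([[3, 3, 3]], padded_size=5))  # [[3, 3, 3, 0, 0]]
```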
Arguments
Argument | Meaning |
---|---|
batch_size | A `tf.int64` scalar `tf.Tensor`, the number of consecutive elements of this dataset to combine in a single batch. |
padded_shapes | (Optional.) A (nested) structure of `tf.TensorShape` or `tf.int64` vector tensor-like objects, the shapes to which the components of each input element should be padded before batching. Unknown dimensions are padded to the maximum size of that dimension in each batch; if unset, all dimensions of all components are padded to the maximum size in the batch. Must be set if any component has an unknown rank. |
padding_values | (Optional.) A (nested) structure of scalar-shaped `tf.Tensor`, the padding values to use for the respective components. `None` means pad with default values: `0` for numeric types and the empty string for string types. `padding_values` should have the same (nested) structure as the input dataset; if it is a single element and the input dataset has multiple components, the same `padding_values` is used to pad every component. If `padding_values` is a scalar, its value is broadcast to match the shape of each component. |
drop_remainder | (Optional.) A `tf.bool` scalar `tf.Tensor`, whether the last batch should be dropped if it has fewer than `batch_size` elements; defaults to `False`. |
name | (Optional.) A name for the tf.data operation. |
Return value
Return value | Meaning |
---|---|
Dataset | A `tf.data.Dataset`. |
Exceptions
Exception type | Meaning |
---|---|
ValueError | If a component has an unknown rank and the `padded_shapes` argument is not set. |
TypeError | If a component is of an unsupported type (the list of supported types is documented in the tf.data guide on dataset structure). |
Examples
A = (tf.data.Dataset
.range(1, 5, output_type=tf.int32)
.map(lambda x: tf.fill([x], x)))
for element in A.as_numpy_iterator():
print(element)
Output:
[1]
[2 2]
[3 3 3]
[4 4 4 4]
Pad each batch to the smallest size that fits all of its elements:
B = A.padded_batch(2)
for element in B.as_numpy_iterator():
print(element)
Output:
[[1 0]
[2 2]]
[[3 3 3 0]
[4 4 4 4]]
Pad to a fixed size:
C = A.padded_batch(2, padded_shapes=5)
for element in C.as_numpy_iterator():
print(element)
Output:
[[1 0 0 0 0]
[2 2 0 0 0]]
[[3 3 3 0 0]
[4 4 4 4 0]]
Pad with a custom value via padding_values:
D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
for element in D.as_numpy_iterator():
print(element)
Output:
[[ 1 -1 -1 -1 -1]
[ 2 2 -1 -1 -1]]
[[ 3 3 3 -1 -1]
[ 4 4 4 4 -1]]
Components of nested elements can be padded independently:
elements = [([1, 2, 3], [10]),
([4, 5], [11, 12])]
dataset = tf.data.Dataset.from_generator(
lambda: iter(elements), (tf.int32, tf.int32))
for element in dataset.as_numpy_iterator():
print(element)
Output:
(array([1, 2, 3]), array([10]))
(array([4, 5]), array([11, 12]))
Pad the first component of each element to length 4 with padding_values=-1, and pad the second component to the smallest size that fits with padding_values=100:
dataset = dataset.padded_batch(2,
padded_shapes=([4], [None]),
padding_values=(-1, 100))
list(dataset.as_numpy_iterator())
Output:
[(array([[ 1,  2,  3, -1],
        [ 4,  5, -1, -1]], dtype=int32),
  array([[ 10, 100],
        [ 11,  12]], dtype=int32))]
Pad multiple components with a single value:
E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
for element in E.as_numpy_iterator():
print(element)
Output:
(array([[ 1, -1],
[ 2, 2]]), array([[ 1, -1],
[ 2, 2]]))
(array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]), array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]))
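The last example, where one scalar padding value is reused for every component of a tuple structure, can be sketched in plain Python (an illustration of the broadcasting rule, not TensorFlow code; both helpers are hypothetical):

```python
def pad_component(batch, pad_value):
    """Pad each 1-D list to the longest length in the batch."""
    target = max(len(x) for x in batch)
    return [x + [pad_value] * (target - len(x)) for x in batch]

def padded_batch_tuple(batch, pad_value):
    """Pad each component of a batch of tuples independently,
    reusing the same scalar pad_value for every component."""
    components = list(zip(*batch))  # group values by component
    return tuple(pad_component(list(c), pad_value) for c in components)

batch = [([1], [1]), ([2, 2], [2, 2])]
print(padded_batch_tuple(batch, -1))
# ([[1, -1], [2, 2]], [[1, -1], [2, 2]])
```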
Implementation
def padded_batch(self,
batch_size,
padded_shapes=None,
padding_values=None,
drop_remainder=False,
name=None):
"""Combines consecutive elements of this dataset into padded batches.
This transformation combines multiple consecutive elements of the input
dataset into a single element.
Like `tf.data.Dataset.batch`, the components of the resulting element will
have an additional outer dimension, which will be `batch_size` (or
`N % batch_size` for the last element if `batch_size` does not divide the
number of input elements `N` evenly and `drop_remainder` is `False`). If
your program depends on the batches having the same outer dimension, you
should set the `drop_remainder` argument to `True` to prevent the smaller
batch from being produced.
Unlike `tf.data.Dataset.batch`, the input elements to be batched may have
different shapes, and this transformation will pad each component to the
respective shape in `padded_shapes`. The `padded_shapes` argument
determines the resulting shape for each dimension of each component in an
output element:
* If the dimension is a constant, the component will be padded out to that
length in that dimension.
* If the dimension is unknown, the component will be padded out to the
maximum length of all elements in that dimension.
>>> A = (tf.data.Dataset
... .range(1, 5, output_type=tf.int32)
... .map(lambda x: tf.fill([x], x)))
>>> # Pad to the smallest per-batch size that fits all elements.
>>> B = A.padded_batch(2)
>>> for element in B.as_numpy_iterator():
... print(element)
[[1 0]
[2 2]]
[[3 3 3 0]
[4 4 4 4]]
>>> # Pad to a fixed size.
>>> C = A.padded_batch(2, padded_shapes=5)
>>> for element in C.as_numpy_iterator():
... print(element)
[[1 0 0 0 0]
[2 2 0 0 0]]
[[3 3 3 0 0]
[4 4 4 4 0]]
>>> # Pad with a custom value.
>>> D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
>>> for element in D.as_numpy_iterator():
... print(element)
[[ 1 -1 -1 -1 -1]
[ 2 2 -1 -1 -1]]
[[ 3 3 3 -1 -1]
[ 4 4 4 4 -1]]
>>> # Components of nested elements can be padded independently.
>>> elements = [([1, 2, 3], [10]),
... ([4, 5], [11, 12])]
>>> dataset = tf.data.Dataset.from_generator(
... lambda: iter(elements), (tf.int32, tf.int32))
>>> # Pad the first component of the tuple to length 4, and the second
>>> # component to the smallest size that fits.
>>> dataset = dataset.padded_batch(2,
... padded_shapes=([4], [None]),
... padding_values=(-1, 100))
>>> list(dataset.as_numpy_iterator())
[(array([[ 1, 2, 3, -1], [ 4, 5, -1, -1]], dtype=int32),
array([[ 10, 100], [ 11, 12]], dtype=int32))]
>>> # Pad with a single value and multiple components.
>>> E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
>>> for element in E.as_numpy_iterator():
... print(element)
(array([[ 1, -1],
[ 2, 2]], dtype=int32), array([[ 1, -1],
[ 2, 2]], dtype=int32))
(array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]], dtype=int32), array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]], dtype=int32))
See also `tf.data.experimental.dense_to_sparse_batch`, which combines
elements that may have different shapes into a `tf.sparse.SparseTensor`.
Args:
batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
consecutive elements of this dataset to combine in a single batch.
padded_shapes: (Optional.) A (nested) structure of `tf.TensorShape` or
`tf.int64` vector tensor-like objects representing the shape to which
the respective component of each input element should be padded prior
to batching. Any unknown dimensions will be padded to the maximum size
of that dimension in each batch. If unset, all dimensions of all
components are padded to the maximum size in the batch. `padded_shapes`
must be set if any component has an unknown rank.
padding_values: (Optional.) A (nested) structure of scalar-shaped
`tf.Tensor`, representing the padding values to use for the respective
components. None represents that the (nested) structure should be padded
with default values. Defaults are `0` for numeric types and the empty
string for string types. The `padding_values` should have the same
(nested) structure as the input dataset. If `padding_values` is a single
element and the input dataset has multiple components, then the same
`padding_values` will be used to pad every component of the dataset.
If `padding_values` is a scalar, then its value will be broadcasted
to match the shape of each component.
drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
whether the last batch should be dropped in the case it has fewer than
`batch_size` elements; the default behavior is not to drop the smaller
batch.
name: (Optional.) A name for the tf.data operation.
Returns:
Dataset: A `Dataset`.
Raises:
ValueError: If a component has an unknown rank, and the `padded_shapes`
argument is not set.
TypeError: If a component is of an unsupported type. The list of supported
types is documented in
https://www.tensorflow.org/guide/data#dataset_structure.
"""
if padded_shapes is None:
padded_shapes = get_legacy_output_shapes(self)
for i, shape in enumerate(nest.flatten(padded_shapes)):
# A `tf.TensorShape` is only false if its *rank* is unknown.
if not shape:
raise ValueError(f"You must provide `padded_shapes` argument because "
f"component {i} has unknown rank.")
return PaddedBatchDataset(
self,
batch_size,
padded_shapes,
padding_values,
drop_remainder,
name=name)