Demystifying TensorFlow 2 Functions: tf.data.Dataset.padded_batch

Category: master index of the Demystifying TensorFlow 2 Functions series


Syntax
padded_batch(batch_size, padded_shapes=None, padding_values=None, drop_remainder=False, name=None)

This function combines consecutive elements of the dataset into padded batches; that is, it merges multiple consecutive elements of the input dataset into a single element. Like tf.data.Dataset.batch, the components of the resulting element have an additional outer dimension, which is batch_size. If batch_size does not evenly divide the number of input elements N and drop_remainder is False, the outer dimension of the last element is N % batch_size. If your program depends on every batch having the same outer dimension, set the drop_remainder argument to True to prevent the smaller final batch from being produced.
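
For example, the following minimal sketch (the toy dataset and variable names are illustrative, not from this article) shows how drop_remainder changes the shape of the final batch:

import tensorflow as tf

# 7 elements of identical shape [2], batched 3 at a time.
ds = tf.data.Dataset.range(7).map(lambda x: tf.fill([2], x))

# drop_remainder=False (default): the last, smaller batch of 1 element is kept.
print([batch.shape.as_list() for batch in ds.padded_batch(3)])
# -> [[3, 2], [3, 2], [1, 2]]

# drop_remainder=True: the incomplete final batch is dropped; every batch has shape [3, 2].
print([batch.shape.as_list() for batch in ds.padded_batch(3, drop_remainder=True)])
# -> [[3, 2], [3, 2]]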

tf.data.Dataset.batch不同的是,被处理的输入元素可能具有不同的形状,该函数将每个向量填充到padded_shapes中的相应形状。padded_shapes参数确定输出元素中每个向量的每个维度的结果形状:

  • 如果向量的维数被设定为常数,则每个向量将被填充到该长度。
  • 如果向量的维数没有被设定,则每个向量将填充到所有元素的最大长度。
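
To make the per-dimension behavior concrete, here is a minimal sketch (the 2-D toy dataset is illustrative, not taken from this article) in which the first dimension is left unknown and the second is fixed:

import tensorflow as tf

# Elements are matrices of shape [x, 2] for x = 1, 2, 3.
ds = (tf.data.Dataset.range(1, 4, output_type=tf.int32)
      .map(lambda x: tf.fill([x, 2], x)))

# padded_shapes=[None, 3]:
#   first dimension None -> padded to the largest x in the batch (here 3);
#   second dimension 3   -> padded from 2 up to the fixed length 3.
# Every element in the resulting batch therefore has shape [3, 3], padded with 0.
for batch in ds.padded_batch(3, padded_shapes=[None, 3]):
    print(batch.numpy())
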
Arguments

  • batch_size: [tf.int64 scalar tf.Tensor] The number of consecutive elements of this dataset to combine in a single batch.
  • padded_shapes: [Optional; tf.TensorShape or tf.int64 vector tensor-like objects] The shape to which the respective component of each input element should be padded prior to batching. Any unknown dimension is padded to the maximum size of that dimension in each batch. If unset, all dimensions of all components are padded to the maximum size in the batch. padded_shapes must be set if any component has an unknown rank.
  • padding_values: [Optional; scalar-shaped tf.Tensor] The padding values to use for the respective components. None means the defaults are used: 0 for numeric types and the empty string for string types (see the sketch after this list). padding_values should have the same (nested) structure as the input dataset. If padding_values is a single element and the input dataset has multiple components, the same padding_values is used to pad every component. If padding_values is a scalar, its value is broadcast to match the shape of each component.
  • drop_remainder: [Optional; tf.bool scalar tf.Tensor] Whether the last batch should be dropped if it has fewer than batch_size elements. Defaults to False, i.e. the smaller final batch is kept.
  • name: [Optional] A name for the tf.data operation.
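
The defaults noted above for padding_values (0 for numeric components, the empty string for string components) can be checked with a small sketch; the word dataset below is illustrative and not part of the original article:

import tensorflow as tf

# Variable-length string sequences; output_shapes=[None] gives the component a known rank.
words = tf.data.Dataset.from_generator(
    lambda: iter([["a"], ["b", "c"], ["d", "e", "f"]]),
    output_types=tf.string, output_shapes=[None])

# No padding_values given: the shorter rows are padded with b'' (the empty string).
for batch in words.padded_batch(2):
    print(batch.numpy())
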
Returns

  • Dataset: A tf.data.Dataset.
Raises

  • ValueError: If a component has an unknown rank and the padded_shapes argument is not set (see the sketch after this list).
  • TypeError: If a component is of an unsupported type; the supported types are documented at https://www.tensorflow.org/guide/data#dataset_structure.
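
As a sketch of the ValueError case (the generator below is illustrative, not from this article): when a component has an unknown rank, padded_batch cannot infer padded_shapes and fails at construction time:

import tensorflow as tf

# output_shapes is omitted, so the component's rank is unknown.
ds = tf.data.Dataset.from_generator(
    lambda: iter([[1], [2, 3]]), output_types=tf.int32)

try:
    ds.padded_batch(2)  # padded_shapes is also omitted -> ValueError
except ValueError as e:
    # "You must provide `padded_shapes` argument because component 0 has unknown rank."
    print(e)
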
Examples
A = (tf.data.Dataset
     .range(1, 5, output_type=tf.int32)
     .map(lambda x: tf.fill([x], x)))
for element in A.as_numpy_iterator():
  print(element)

Output:

[1]
[2 2]
[3 3 3]
[4 4 4 4]

Pad to the smallest size that fits all elements within each batch:

B = A.padded_batch(2)
for element in B.as_numpy_iterator():
  print(element)

Output:

[[1 0]
[2 2]]
[[3 3 3 0]
[4 4 4 4]]

Pad to a fixed size:

C = A.padded_batch(2, padded_shapes=5)
for element in C.as_numpy_iterator():
  print(element)

Output:

[[1 0 0 0 0]
[2 2 0 0 0]]
[[3 3 3 0 0]
[4 4 4 4 0]]

Pad with a custom value via padding_values:

D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
for element in D.as_numpy_iterator():
  print(element)

Output:

[[ 1 -1 -1 -1 -1]
[ 2 2 -1 -1 -1]]
[[ 3 3 3 -1 -1]
[ 4 4 4 4 -1]]

Components of nested elements can be padded independently:

elements = [([1, 2, 3], [10]),
            ([4, 5], [11, 12])]
dataset = tf.data.Dataset.from_generator(
    lambda: iter(elements), (tf.int32, tf.int32))

for element in dataset.as_numpy_iterator():
  print(element)

Output:

(array([1, 2, 3]), array([10]))
(array([4, 5]), array([11, 12]))

Pad the first component of the tuple to length 4 with padding value -1, and the second component to the smallest size that fits, with padding value 100:

dataset = dataset.padded_batch(2,
    padded_shapes=([4], [None]),
    padding_values=(-1, 100))
list(dataset.as_numpy_iterator())
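
Output:

[(array([[ 1,  2,  3, -1], [ 4,  5, -1, -1]], dtype=int32),
  array([[ 10, 100], [ 11,  12]], dtype=int32))]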

Pad multiple components with a single padding value:

E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
for element in E.as_numpy_iterator():
  print(element)

Output:

(array([[ 1, -1],
[ 2, 2]]), array([[ 1, -1],
[ 2, 2]]))
(array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]), array([[ 3, 3, 3, -1],
[ 4, 4, 4, 4]]))

Implementation
  def padded_batch(self,
                   batch_size,
                   padded_shapes=None,
                   padding_values=None,
                   drop_remainder=False,
                   name=None):
    """Combines consecutive elements of this dataset into padded batches.
    This transformation combines multiple consecutive elements of the input
    dataset into a single element.
    Like `tf.data.Dataset.batch`, the components of the resulting element will
    have an additional outer dimension, which will be `batch_size` (or
    `N % batch_size` for the last element if `batch_size` does not divide the
    number of input elements `N` evenly and `drop_remainder` is `False`). If
    your program depends on the batches having the same outer dimension, you
    should set the `drop_remainder` argument to `True` to prevent the smaller
    batch from being produced.
    Unlike `tf.data.Dataset.batch`, the input elements to be batched may have
    different shapes, and this transformation will pad each component to the
    respective shape in `padded_shapes`. The `padded_shapes` argument
    determines the resulting shape for each dimension of each component in an
    output element:
    * If the dimension is a constant, the component will be padded out to that
      length in that dimension.
    * If the dimension is unknown, the component will be padded out to the
      maximum length of all elements in that dimension.
    >>> A = (tf.data.Dataset
    ...      .range(1, 5, output_type=tf.int32)
    ...      .map(lambda x: tf.fill([x], x)))
    >>> # Pad to the smallest per-batch size that fits all elements.
    >>> B = A.padded_batch(2)
    >>> for element in B.as_numpy_iterator():
    ...   print(element)
    [[1 0]
     [2 2]]
    [[3 3 3 0]
     [4 4 4 4]]
    >>> # Pad to a fixed size.
    >>> C = A.padded_batch(2, padded_shapes=5)
    >>> for element in C.as_numpy_iterator():
    ...   print(element)
    [[1 0 0 0 0]
     [2 2 0 0 0]]
    [[3 3 3 0 0]
     [4 4 4 4 0]]
    >>> # Pad with a custom value.
    >>> D = A.padded_batch(2, padded_shapes=5, padding_values=-1)
    >>> for element in D.as_numpy_iterator():
    ...   print(element)
    [[ 1 -1 -1 -1 -1]
     [ 2  2 -1 -1 -1]]
    [[ 3  3  3 -1 -1]
     [ 4  4  4  4 -1]]
    >>> # Components of nested elements can be padded independently.
    >>> elements = [([1, 2, 3], [10]),
    ...             ([4, 5], [11, 12])]
    >>> dataset = tf.data.Dataset.from_generator(
    ...     lambda: iter(elements), (tf.int32, tf.int32))
    >>> # Pad the first component of the tuple to length 4, and the second
    >>> # component to the smallest size that fits.
    >>> dataset = dataset.padded_batch(2,
    ...     padded_shapes=([4], [None]),
    ...     padding_values=(-1, 100))
    >>> list(dataset.as_numpy_iterator())
    [(array([[ 1,  2,  3, -1], [ 4,  5, -1, -1]], dtype=int32),
      array([[ 10, 100], [ 11,  12]], dtype=int32))]
    >>> # Pad with a single value and multiple components.
    >>> E = tf.data.Dataset.zip((A, A)).padded_batch(2, padding_values=-1)
    >>> for element in E.as_numpy_iterator():
    ...   print(element)
    (array([[ 1, -1],
           [ 2,  2]], dtype=int32), array([[ 1, -1],
           [ 2,  2]], dtype=int32))
    (array([[ 3,  3,  3, -1],
           [ 4,  4,  4,  4]], dtype=int32), array([[ 3,  3,  3, -1],
           [ 4,  4,  4,  4]], dtype=int32))
    See also `tf.data.experimental.dense_to_sparse_batch`, which combines
    elements that may have different shapes into a `tf.sparse.SparseTensor`.
    Args:
      batch_size: A `tf.int64` scalar `tf.Tensor`, representing the number of
        consecutive elements of this dataset to combine in a single batch.
      padded_shapes: (Optional.) A (nested) structure of `tf.TensorShape` or
        `tf.int64` vector tensor-like objects representing the shape to which
        the respective component of each input element should be padded prior
        to batching. Any unknown dimensions will be padded to the maximum size
        of that dimension in each batch. If unset, all dimensions of all
        components are padded to the maximum size in the batch. `padded_shapes`
        must be set if any component has an unknown rank.
      padding_values: (Optional.) A (nested) structure of scalar-shaped
        `tf.Tensor`, representing the padding values to use for the respective
        components. None represents that the (nested) structure should be padded
        with default values.  Defaults are `0` for numeric types and the empty
        string for string types. The `padding_values` should have the same
        (nested) structure as the input dataset. If `padding_values` is a single
        element and the input dataset has multiple components, then the same
        `padding_values` will be used to pad every component of the dataset.
        If `padding_values` is a scalar, then its value will be broadcasted
        to match the shape of each component.
      drop_remainder: (Optional.) A `tf.bool` scalar `tf.Tensor`, representing
        whether the last batch should be dropped in the case it has fewer than
        `batch_size` elements; the default behavior is not to drop the smaller
        batch.
      name: (Optional.) A name for the tf.data operation.
    Returns:
      Dataset: A `Dataset`.
    Raises:
      ValueError: If a component has an unknown rank, and the `padded_shapes`
        argument is not set.
      TypeError: If a component is of an unsupported type. The list of supported
        types is documented in
        https://www.tensorflow.org/guide/data#dataset_structure.
    """
    if padded_shapes is None:
      padded_shapes = get_legacy_output_shapes(self)
      for i, shape in enumerate(nest.flatten(padded_shapes)):
        # A `tf.TensorShape` is only false if its *rank* is unknown.
        if not shape:
          raise ValueError(f"You must provide `padded_shapes` argument because "
                           f"component {i} has unknown rank.")
    return PaddedBatchDataset(
        self,
        batch_size,
        padded_shapes,
        padding_values,
        drop_remainder,
        name=name)