tf.contrib.training.bucket_by_sequence_length(
    input_length,
    tensors,
    batch_size,
    bucket_boundaries,
    num_threads=1,
    capacity=32,
    bucket_capacities=None,
    shapes=None,
    dynamic_pad=False,
    allow_smaller_final_batch=False,
    keep_input=True,
    shared_name=None,
    name=None
)
Purpose: group sentences of similar length into the same batch.
Args:
input_length: int32 scalar Tensor, the sequence length of tensors.
tensors: The list or dictionary of tensors, representing a single element, to bucket. Nested lists are not supported.
batch_size: The new batch size pulled from the queue (all queues will have the same size). If a list is passed in then each bucket will have a different batch_size. (Python int, int32 scalar or iterable of integers of length num_buckets).
bucket_boundaries: int list, increasing non-negative numbers. The edges of the buckets to use when bucketing tensors.
Two extra buckets are created, one for input_length < bucket_boundaries[0] and one for input_length >= bucket_boundaries[-1].
num_threads: An integer. The number of threads enqueuing tensors.
capacity: An integer. The maximum number of minibatches in the top queue, and also the maximum number of elements within each bucket.
bucket_capacities: (Optional) None or a list of integers, the capacities of each bucket. If None, capacity is used (default). If specified, it must be a list of integers of length one larger than bucket_boundaries. Its i-th element is used as capacity for the i-th bucket queue.
shapes: (Optional) The shapes for each example. Defaults to the inferred shapes for tensors.
dynamic_pad: Boolean. Allow variable dimensions in input shapes. The given dimensions are padded upon dequeue so that tensors within a batch have the same shapes.
allow_smaller_final_batch: (Optional) Boolean. If True, allow the final batches to be smaller if there are insufficient items left in the queues.
keep_input: A bool scalar Tensor. If provided, this tensor controls whether the input is added to the queue or not. If it evaluates True, then tensors are added to the bucket; otherwise they are dropped. This tensor essentially acts as a filtering mechanism.
shared_name: (Optional). If set, the queues will be shared under the given name across multiple sessions.
name: (Optional) A name for the operations.
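The bucket_boundaries rule above (one extra bucket below the first boundary and one at or above the last) can be illustrated in plain Python. This is only a sketch of the assignment rule, not TensorFlow's implementation; the helper name bucket_index and the sample boundaries are made up for illustration.

```python
from bisect import bisect_right


def bucket_index(input_length, bucket_boundaries):
    """Return the index of the bucket a sequence of this length falls into.

    Hypothetical helper: bisect_right counts the boundaries that are
    <= input_length, so bucket i holds lengths with
    bucket_boundaries[i-1] <= input_length < bucket_boundaries[i].
    """
    return bisect_right(bucket_boundaries, input_length)


boundaries = [10, 20, 30]  # illustrative boundaries: 3 edges give 4 buckets
for length in (3, 10, 19, 30, 99):
    print(length, "-> bucket", bucket_index(length, boundaries))
```

With three boundaries there are four buckets, which is why bucket_capacities (and a list-valued batch_size) must be one element longer than bucket_boundaries.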
Returns:
A tuple (sequence_length, outputs) where sequence_length is a 1-D Tensor of size batch_size and outputs is a list or dictionary of batched, bucketed, outputs corresponding to elements of tensors.
Raises:
TypeError: if bucket_boundaries is not a list of Python integers.
ValueError: if bucket_boundaries is empty or contains non-increasing values, or if batch_size is a list and its length doesn't equal the number of buckets.
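To make the return value concrete, here is a rough pure-Python emulation of what bucketing with dynamic_pad=True produces: pairs of (sequence lengths, padded batch). It is a simplified sketch under stated assumptions, not the queue-based TensorFlow code: the function name bucket_and_pad and the pad_value argument are hypothetical, sequences are plain Python lists, and incomplete buckets are dropped at the end (roughly the allow_smaller_final_batch=False behaviour).

```python
from bisect import bisect_right
from collections import defaultdict


def bucket_and_pad(sequences, batch_size, bucket_boundaries, pad_value=0):
    """Group sequences by length bucket and emit padded batches.

    Returns a list of (sequence_lengths, padded_batch) pairs.
    Hypothetical helper for illustration only.
    """
    pending = defaultdict(list)  # bucket index -> sequences waiting for a full batch
    batches = []
    for seq in sequences:
        idx = bisect_right(bucket_boundaries, len(seq))
        pending[idx].append(seq)
        if len(pending[idx]) == batch_size:
            group = pending.pop(idx)
            lengths = [len(s) for s in group]
            width = max(lengths)  # pad only to the longest sequence in this batch
            padded = [s + [pad_value] * (width - len(s)) for s in group]
            batches.append((lengths, padded))
    return batches  # sequences left in incomplete buckets are dropped


data = [[1], [2, 2, 2, 2], [3, 3], [4, 4, 4, 4, 4], [5, 5, 5]]
for lengths, batch in bucket_and_pad(data, batch_size=2, bucket_boundaries=[3]):
    print(lengths, batch)
```

Because each batch only contains sequences from one bucket, the amount of padding per batch stays small compared with padding everything to the global maximum length.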
tf.data.experimental.bucket_by_sequence_length
https://runebook.dev/zh-CN/docs/tensorflow/data/experimental/bucket_by_sequence_length
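Since tf.contrib was removed in TensorFlow 2.x, the tf.data version is the one to use there. Below is a minimal sketch, assuming TensorFlow 2.4 or later with eager execution; the toy sequences, boundaries, and batch sizes are made up for illustration, and padded_shapes is left at its default so each batch is padded to its longest element.

```python
import tensorflow as tf

# Hypothetical toy data: variable-length integer sequences.
sequences = [[1], [2, 2], [3, 3, 3], [4, 4, 4, 4], [5, 5, 5, 5, 5], [6, 6]]


def gen():
    for seq in sequences:
        yield seq


dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=[None], dtype=tf.int32),
)

# Two boundaries give three buckets, so three batch sizes are required.
dataset = dataset.apply(
    tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda seq: tf.shape(seq)[0],
        bucket_boundaries=[3, 5],
        bucket_batch_sizes=[2, 2, 2],
    )
)

# Each batch is a 2-D tensor whose rows are zero-padded to a common length.
batches = [batch.numpy().tolist() for batch in dataset]
for batch in batches:
    print(batch)
```

Unlike the contrib function, this transformation takes a function that computes the length of each element rather than a separate input_length tensor, and it returns a dataset transformation to pass to Dataset.apply.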
A good worked example:
https://github.com/wcarvalho/jupyter_notebooks/blob/ebe762436e2eea1dff34bbd034898b64e4465fe4/tf.bucket_by_sequence_length/bucketing%20practice.ipynb