map:
import tensorflow as tf
dataset = tf.data.TextLineDataset(['G:/git_open/tfrs-learn/data/test1', 'G:/git_open/tfrs-learn/data/test2'], num_parallel_reads=3)
result = dataset.map(lambda x: tf.strings.split(x, sep=" "), num_parallel_calls=3)
for each in result:
    print(each)
Note: the file paths here must be absolute paths.
When num_parallel_calls is set greater than 1, the rows of the result may come back out of order.
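If the original order matters, map also accepts a deterministic argument that keeps the output order even with parallel calls (at some cost in throughput). A small sketch using an in-memory dataset instead of the files above:

```python
import tensorflow as tf

# Small in-memory stand-in for the text files, so the sketch runs anywhere.
dataset = tf.data.Dataset.from_tensor_slices(["a b", "c d", "e f"])
# deterministic=True preserves the element order despite parallel execution;
# deterministic=False explicitly allows reordering for speed.
result = dataset.map(lambda x: tf.strings.split(x, sep=" "),
                     num_parallel_calls=tf.data.AUTOTUNE,
                     deterministic=True)
for each in result:
    print(each)
```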
reduce: sum all elements
import numpy as np
import tensorflow as tf
tf.data.Dataset.range(5).reduce(np.int64(0), lambda x, y: x + y).numpy()  # 10
filter:
import tensorflow as tf
dataset = tf.data.Dataset.range(100)
dataset = dataset.filter(lambda x: x < 20)
for each in dataset:
    print(each)
group_by_window: compute the sum of the odd elements and the sum of the even elements, per window
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(11)
window_size = 30
key_func = lambda x: x % 2
reduce_func = lambda key, dataset: tf.data.Dataset.from_tensors(dataset.reduce(tf.constant(0, dtype=tf.int64), lambda x, y: x + y))
dataset = dataset.group_by_window(
    key_func=key_func,
    reduce_func=reduce_func,
    window_size=window_size)
for elem in dataset.as_numpy_iterator():
    print(elem)
Here,
1. key_func maps each element to a key, grouping the elements into {key, new dataset} pairs.
2. window_size is the number of consecutive elements (the batch size) collected for each key produced by key_func; each window of data acts as a small dataset of its own.
3. Each of these small window datasets is passed to reduce_func; here reduce_func runs a summing reduce over each dataset and wraps the result back into a dataset to return.
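Since window_size=30 above is larger than the whole range, each key fills only one window. A sketch with a smaller window_size shows a key producing multiple windows, each reduced separately (the emission order of finished windows can interleave, so the sums are sorted here):

```python
import tensorflow as tf

# window_size=3 lets each key fill more than one window:
# evens {0, 2, 4}, {6, 8, 10} and odds {1, 3, 5}, {7, 9} (last one flushed short).
dataset = tf.data.Dataset.range(11)
dataset = dataset.group_by_window(
    key_func=lambda x: x % 2,
    reduce_func=lambda key, ds: tf.data.Dataset.from_tensors(
        ds.reduce(tf.constant(0, dtype=tf.int64), lambda x, y: x + y)),
    window_size=3)
print(sorted(dataset.as_numpy_iterator()))  # [6, 9, 16, 24]
```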
shuffle:
import tensorflow as tf
dataset = tf.data.Dataset.range(3)
dataset = dataset.shuffle(buffer_size=3, reshuffle_each_iteration=True)
dataset = dataset.repeat(2)
# e.g. [1, 0, 2, 1, 2, 0] -- each pass through the data is reshuffled
dataset = tf.data.Dataset.range(3)
dataset = dataset.shuffle(buffer_size=3, reshuffle_each_iteration=False)
dataset = dataset.repeat(2)
# e.g. [1, 0, 2, 1, 0, 2] -- the same shuffled order repeats
Here, buffer_size is the number of elements cached sequentially from the input stream; shuffling happens within this buffer, and once an element in the buffer is drawn, the next element from the stream takes its place. The original documentation explains: if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.
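A consequence of this buffered design is that a buffer_size of 1 shuffles nothing at all, since the buffer only ever holds the single next element:

```python
import tensorflow as tf

# With buffer_size=1 the buffer always contains exactly the next element,
# so "shuffling" degenerates to a no-op and the original order is kept.
ds = tf.data.Dataset.range(10).shuffle(buffer_size=1)
print(list(ds.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```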
skip: skip the first n elements
import tensorflow as tf
dataset = tf.data.Dataset.range(11).skip(3)
for elem in dataset.as_numpy_iterator():
    print(elem)
take: take the first n elements
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(11).take(3)
for elem in dataset.as_numpy_iterator():
    print(elem)
repeat: repeat the dataset the given number of times:
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(11).repeat(3)
for elem in dataset.as_numpy_iterator():
    print(elem)
unique: remove duplicate elements:
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(11).repeat(3).unique()
for elem in dataset.as_numpy_iterator():
    print(elem)
padded_batch: automatically pad the elements of each batch to a common shape
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = (tf.data.Dataset.range(1, 5, output_type=tf.int32).map(lambda x: tf.fill([x], x)))
#[1]
#[2 2]
#[3 3 3]
#[4 4 4 4]
B = dataset.padded_batch(2, padded_shapes=4)  # padded_shapes must cover the longest element (here [4 4 4 4], length 4)
for elem in B.as_numpy_iterator():
    print(elem)
Here, padded_shapes gives the dimension every element is padded up to; it must be at least as large as the longest element, or an error is raised. If omitted, each batch is padded to the largest shape found within that batch.
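That per-batch default can be seen by calling padded_batch without padded_shapes on the same dataset:

```python
import tensorflow as tf

# Without padded_shapes, each batch is padded to its own longest element:
# batch 1 -> length 2 ([1] vs [2 2]), batch 2 -> length 4 ([3 3 3] vs [4 4 4 4]).
dataset = tf.data.Dataset.range(1, 5, output_type=tf.int32).map(
    lambda x: tf.fill([x], x))
for elem in dataset.padded_batch(2).as_numpy_iterator():
    print(elem)
# [[1 0]
#  [2 2]]
# [[3 3 3 0]
#  [4 4 4 4]]
```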
apply, with a custom function:
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(1, 6, output_type=tf.int32)
def custom_function(ds: tf.data.Dataset):
    return tf.data.Dataset.from_tensors([tf.reduce_sum(each) for each in ds])
fill_data_set = dataset.batch(2).apply(custom_function)
for elem in fill_data_set.as_numpy_iterator():
    print(elem)
Here,
1. The argument of the custom function is the dataset itself.
2. After batch, each batch element is a tensor, not a dataset.
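apply is only needed for dataset-to-dataset transforms; a per-batch tensor reduction like the sum above can also be written with map. Note the difference in shape: the apply version packs all the sums into one element, while map yields one sum per batch:

```python
import tensorflow as tf

# map applies tf.reduce_sum to each batch tensor individually,
# producing one scalar element per batch: [1,2] -> 3, [3,4] -> 7, [5] -> 5.
dataset = tf.data.Dataset.range(1, 6, output_type=tf.int32)
sums = dataset.batch(2).map(tf.reduce_sum)
print(list(sums.as_numpy_iterator()))  # [3, 7, 5]
```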
prefetch: prefetching prepares data ahead of time (a background thread runs in parallel, hiding the time lost to reading from disk or to intermediate processing of batches)
#-*- coding: utf-8 -*-
import tensorflow as tf
dataset = tf.data.Dataset.range(1, 6, output_type=tf.int32)
fill_data_set = dataset.batch(2).prefetch(2)
for elem in fill_data_set.as_numpy_iterator():
    print(elem)
When prefetch comes after batch, the unit of prefetching is one batch.
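Instead of a fixed depth, tf.data.AUTOTUNE can be passed so the runtime picks the prefetch depth dynamically; prefetch only changes when elements are produced, never their values:

```python
import tensorflow as tf

# AUTOTUNE lets tf.data tune the buffer depth; the element values and
# order are identical to the un-prefetched pipeline.
dataset = tf.data.Dataset.range(1, 6, output_type=tf.int32)
batches = dataset.batch(2).prefetch(tf.data.AUTOTUNE)
print([b.tolist() for b in batches.as_numpy_iterator()])  # [[1, 2], [3, 4], [5]]
```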