Writing a TensorFlow data pipeline (Python version)

Latest recommended article published 2022-02-03 12:16:46

东城地瓜

Tags: tensorflow, python, deep learning

Article link: https://blog.csdn.net/weixin_39422563/article/details/103579654

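This article describes one way to build a data pipeline that serves as the input for training or testing, modeled on the official BERT code.
Most of the data-processing logic is omitted; only the code skeleton is shown.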

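# TF 1.x API (tf.python_io), as used in the official BERT code.
import collections

import tensorflow as tf

# Read the data to be processed from a text file (reading code omitted).
writer = tf.python_io.TFRecordWriter(output_file)
# `examples` is the content read from the raw file; basic filtering and
# splitting can be done on it beforehand.
for example in examples:
    # Collect the features for this record.
    features = collections.OrderedDict()
    # convert_sentence_to_id holds the actual processing logic: it has to
    # tokenize the text and pad the result.
    input_ids = convert_sentence_to_id(example["sentence"])
    # input_ids is what gets fed into the model; check its length to catch
    # obvious mistakes early.
    assert len(input_ids) == max_seq_len
    features["input_ids"] = tf.train.Feature(
        int64_list=tf.train.Int64List(value=input_ids))
    # features["label"] = ...  (there should be at least an input and a label feature)
    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    writer.write(tf_example.SerializeToString())
writer.close()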
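
The `convert_sentence_to_id` helper is left undefined in the skeleton. As a minimal sketch only, assuming whitespace tokenization and a toy vocabulary (BERT itself uses WordPiece tokenization with a real vocab file), it might tokenize, map tokens to ids, then truncate and pad to a fixed length. The names `VOCAB`, `PAD_ID`, `UNK_ID` and the value of `MAX_SEQ_LEN` are hypothetical and chosen for illustration.

```python
# Hypothetical toy vocabulary; a real pipeline would load the model's vocab file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
PAD_ID = VOCAB["[PAD]"]
UNK_ID = VOCAB["[UNK]"]
MAX_SEQ_LEN = 8  # assumed value, for illustration only


def convert_sentence_to_id(sentence, max_seq_len=MAX_SEQ_LEN):
    """Tokenize on whitespace, map tokens to ids, truncate and pad to max_seq_len."""
    tokens = sentence.lower().split()
    ids = [VOCAB.get(tok, UNK_ID) for tok in tokens]
    ids = ids[:max_seq_len]                      # truncate long inputs
    ids += [PAD_ID] * (max_seq_len - len(ids))   # pad short inputs
    return ids


input_ids = convert_sentence_to_id("Hello world foo")
print(input_ids)  # [2, 3, 1, 0, 0, 0, 0, 0]
```

Because every returned list has exactly `max_seq_len` entries, the length assertion in the writer loop holds and the records can later be parsed with a fixed-length feature.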

Once the TFRecord file has been written, the data pipeline can be built.
Specify the number of epochs and the batch size.

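name_to_features = {
    # seq_len must equal the max_seq_len used when the records were written.
    "input_ids": tf.FixedLenFeature([seq_len], tf.int64),
    "label": tf.FixedLenFeature([], tf.int64),
}
# Read the file that was just generated.
d = tf.data.TFRecordDataset(input_file)
d = d.repeat(count=epoch)
if is_training:
    # shuffle() requires buffer_size; 100 is an illustrative value, tune it for your data.
    d = d.shuffle(buffer_size=100)

# When running on GPU, do not use py_func, even though py_func would let you
# skip this tf feature machinery.
d = d.apply(tf.contrib.data.map_and_batch(
    lambda record: tf.parse_single_example(record, name_to_features),
    batch_size=batch_size,
    drop_remainder=True))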
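
To make the effect of `repeat` followed by batching with `drop_remainder=True` concrete, here is a plain-Python model of those semantics. It is not TensorFlow code, it leaves shuffling out, and the function name `repeat_then_batch` is hypothetical. Since repetition happens before batching, a batch can straddle an epoch boundary, and only the final short batch of the whole stream is dropped.

```python
import itertools


def repeat_then_batch(records, epochs, batch_size, drop_remainder=True):
    """Plain-Python model of repeat(count=epochs) followed by batching."""
    # repeat: the dataset is concatenated with itself `epochs` times.
    stream = itertools.chain.from_iterable(
        itertools.repeat(list(records), epochs))
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:
            return
        if len(batch) < batch_size and drop_remainder:
            return  # the final short batch is discarded
        yield batch


batches = list(repeat_then_batch(range(5), epochs=2, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 0], [1, 2, 3]]
```

With 5 records, 2 epochs and a batch size of 3, the stream holds 10 records, so 3 full batches are produced and the last record is dropped. Every batch therefore has exactly `batch_size` rows, which gives the batched tensors a static first dimension.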

Using the resulting pipeline:

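iterator = d.make_initializable_iterator()
input_data = iterator.get_next()
input_ids = input_data["input_ids"]
label = input_data["label"]  # the key must match the one in name_to_features
'''
From here on, these can be used as two input tensors; the shape is batch_size * single_shape.
The iterator also has to be initialized: sess.run(iterator.initializer)
'''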