Storing higher-rank data (rank, dimension) in TFRecord and building a Dataset from it

ssssossss

Article link: https://blog.csdn.net/ssssossss/article/details/104046743


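I first came across TensorFlow's Dataset and Estimator APIs while reading BERT's task-specific code. What used to take a lot of code with the low-level API becomes much shorter in Estimator mode.
In the original code, every feature of the input data (that is, every Feature contained in each Example of the Dataset) has rank 1. For example, the vector v=[1,2,3] has rank 1 and shape (3,). Later, to bring in new features such as a charCNN or charRNN that captures the morphology of words, each time step needs one more dimension that holds the characters of that time step. For ['Are', 'you', 'OK'] the input becomes [['A','r','e'],['y','o','u'],['O','K']]; this Feature has rank 2 and shape (3,3) (here 'OK' is simply padded to a sequence of length 3).
So how should such a multi-dimensional Feature with rank >= 2 be stored, and how should it be read back into a Dataset and parsed afterwards?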
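To make the rank-2 input concrete, here is a minimal plain-Python sketch of the padding step. The helper name pad_chars and the "<pad>" token are assumptions made for illustration; they do not come from the original code.

```python
def pad_chars(words, max_char_length, pad="<pad>"):
    """Split each word into characters and pad or truncate to a fixed length."""
    rows = []
    for word in words:
        chars = list(word)[:max_char_length]             # truncate long words
        chars += [pad] * (max_char_length - len(chars))  # pad short words
        rows.append(chars)
    return rows


words = ["Are", "you", "OK"]
char_matrix = pad_chars(words, max_char_length=3)

# Every row now has the same length, so the feature is a proper (3, 3) matrix.
shape = (len(char_matrix), len(char_matrix[0]))
print(char_matrix)
print(shape)
```

In a real pipeline the characters would be mapped to integer ids before being written, since a TFRecord Feature stores int64, float or bytes values.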

Keep the Feature's shape information, then flatten the Feature

I will borrow an example from YJango first, and then show my own.

YJango's example


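There are three examples here. Each example has four kinds of feature: a scalar, a vector, a matrix and a tensor, with shapes (), (3,), (2,3) and (806,806,3) respectively.

Writing to the tfrecord

So how do we write features with such different shapes? There are two methods.

1. Flatten the feature into a list, that is, a rank-1 vector, and write it as a list, e.g. int64_list = tf.train.Int64List(value=input) or float_list = tf.train.FloatList(value=input).
2. Convert it to a string: turn the array into bytes with .tostring() (called .tobytes() in current NumPy) and store it with tf.train.Feature(bytes_list=tf.train.BytesList(value=[input.tostring()])).

Both methods lose the shape of the data, so the shape has to be stored alongside it for later use, or fixed in advance as a known parameter.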
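The two methods can be mimicked without TensorFlow. The sketch below is only an analogy: it uses Python lists in place of NumPy arrays and struct.pack in place of .tostring(), and it assumes float32 little-endian encoding. It shows why the shape must travel with the data in both cases.

```python
import struct

matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3)

# Method 1: flatten to a rank-1 list and keep the shape separately.
matrix_shape = [len(matrix), len(matrix[0])]
flat = [x for row in matrix for x in row]

# Method 2: serialize the flat values to raw bytes (float32, little-endian).
raw = struct.pack("<%df" % len(flat), *flat)

# Reading back: decode the bytes, then rebuild the rows from the stored shape.
decoded = list(struct.unpack("<%df" % (len(raw) // 4), raw))
rows, cols = matrix_shape
restored = [decoded[r * cols:(r + 1) * cols] for r in range(rows)]

print(flat)
print(len(raw))
print(restored == matrix)
```

Neither flat nor raw says whether the data was (2,3), (3,2) or (6,); only matrix_shape does.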

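import tensorflow as tf  # TensorFlow 1.x API (tf.python_io)

# scalars, vectors, matrices and tensors are assumed to be NumPy arrays
# holding the three examples described above.

# Open a tfrecord file for writing
writer = tf.python_io.TFRecordWriter('%s.tfrecord' % 'test')
# We write 3 samples, each with 4 features: scalar, vector, matrix, tensor
for i in range(3):
    # Create the feature dictionary
    features = {}
    # Write the scalar as Int64; it is a scalar, so wrap it in a list: "value=[scalars[i]]"
    features['scalar'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[scalars[i]]))

    # Write the vector as float; it is already a list, so "value=vectors[i]" needs no brackets
    features['vector'] = tf.train.Feature(float_list=tf.train.FloatList(value=vectors[i]))

    # Write the matrix as float; method 1 is to flatten the matrix into a list
    features['matrix'] = tf.train.Feature(float_list=tf.train.FloatList(value=matrices[i].reshape(-1)))
    # The shape (2,3) is lost by flattening, so store it to restore the matrix later
    features['matrix_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=matrices[i].shape))

    # Write the 3-D tensor; method 2 is to store its raw bytes and decode them later.
    # The parser below decodes with tf.uint8, so the array is assumed to be uint8.
    # .tobytes() is the current NumPy name for .tostring().
    features['tensor'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensors[i].tobytes()]))
    # Store the lost shape (806,806,3)
    features['tensor_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tensors[i].shape))

    # Pass the dictionary holding all features to tf.train.Features
    tf_features = tf.train.Features(feature=features)
    # Turn it into one sample (Example)
    tf_example = tf.train.Example(features=tf_features)
    # Serialize the sample
    tf_serialized = tf_example.SerializeToString()
    # Write the serialized sample
    writer.write(tf_serialized)
    # The loop runs 3 times, so 3 samples have been written by the end
# Close the file
writer.close()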

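Building the Dataset

The samples loaded from the tfrecord file are the serialized tf_serialized samples written above, so each one has to be parsed.
Here we use dataset.map(parse_function) to apply the same parsing to every sample in the dataset. The parsing done in parse_function is almost exactly the inverse of the writing process above. We can also do many other things inside parse_function, such as converting the dtype or adding noise to each sample. In short, inside parse_function the object we work on is a single serialized example, serialized_example; we decode it to obtain the example and return that example.
The parse function is written as follows: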

def parse_function(example_proto):
    # Takes a single input, example_proto: the serialized sample tf_serialized
    dics = {# default_value is left as None here; the entries below also default to None
            'scalar': tf.FixedLenFeature(shape=(), dtype=tf.int64, default_value=None),

            # The vector's shape is deliberately given as (1,3) instead of the original (3,)
            'vector': tf.FixedLenFeature(shape=(1,3), dtype=tf.float32),

            # The matrix's shape is not known yet at this point, so parse it with VarLenFeature
            'matrix': tf.VarLenFeature(dtype=tf.float32),
            'matrix_shape': tf.FixedLenFeature(shape=(2,), dtype=tf.int64),

            # The tensor was written as raw bytes, so its stored shape is ()
            # and its dtype here is tf.string, not the original dtype;
            # it is converted back to tf.uint8 further down
            'tensor': tf.FixedLenFeature(shape=(), dtype=tf.string),
            'tensor_shape': tf.FixedLenFeature(shape=(3,), dtype=tf.int64)
            }

    # Parse the serialized sample with the feature dictionary
    parsed_example = tf.parse_single_example(example_proto, dics)
    # Decode the raw bytes
    parsed_example['tensor'] = tf.decode_raw(parsed_example['tensor'], tf.uint8)
    # Convert the sparse representation to a dense one
    parsed_example['matrix'] = tf.sparse_tensor_to_dense(parsed_example['matrix'])
    # Restore the matrix shape
    parsed_example['matrix'] = tf.reshape(parsed_example['matrix'], parsed_example['matrix_shape'])
    # Restore the tensor shape
    parsed_example['tensor'] = tf.reshape(parsed_example['tensor'], parsed_example['tensor_shape'])
    # Return all features
    return parsed_example

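If the matrix's shape is known in advance, VarLenFeature is not needed: multiplying the entries of the shape gives the length of the flattened list, i.e. 'matrix': tf.FixedLenFeature(shape=[matrix.shape[0]*matrix.shape[1]], dtype=tf.float32).
Once the parse function is written, pass it to the dataset's map method.
The remaining operations such as batch and shuffle are not covered here. As for creating iterators, there is already a blog post that explains it well.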
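The flattened length used for FixedLenFeature above is just the product of the shape's entries, and restoring the shape is the reverse slicing. A small plain-Python sketch of both follows; the helper names flat_length and reshape are assumptions for illustration and stand in for what tf.reshape does on tensors.

```python
import math


def flat_length(shape):
    """Number of values a tensor of this shape has once flattened."""
    return math.prod(shape)


def reshape(flat, shape):
    """Rebuild nested lists of the given shape from a flat list."""
    if len(flat) != math.prod(shape):
        raise ValueError("cannot reshape %d values into %r" % (len(flat), shape))
    if not shape:                    # rank 0: a single scalar
        return flat[0]
    step = math.prod(shape[1:])      # values per slice along the first axis
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


print(flat_length((2, 3)))
print(reshape([1, 2, 3, 4, 5, 6], (2, 3)))
```

The length check mirrors the failure mode of a wrong FixedLenFeature shape: if the declared length does not match what was written, the record cannot be parsed into that shape.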

My own humble example

import collections
import tensorflow as tf  # TensorFlow 1.x API


def filed_based_convert_examples_to_features(
        examples, tokenizer, output_file):
    """
    :param examples: training examples to convert
    :param tokenizer: tokenizer passed on to convert_single_example
    :param output_file: path of the TFRecord file to write
    :return: number of small examples written
    """
    def create_int_feature(values):
        f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
        return f

    def flatten(tensor):
        # Concatenate a list of lists into one flat list
        return sum(tensor, [])

    num_examples = 0
    writer = tf.python_io.TFRecordWriter(output_file)
    # Iterate over the training data
    for (ex_index, example) in enumerate(examples):
        # Each training example may yield several features
        feature_list = convert_single_example(example, tokenizer)
        num_examples += len(feature_list)

        for f in feature_list:
            if num_examples % 5000 == 0:
                tf.logging.info("Writing example %d of %d" % (num_examples, len(examples)))
            features = collections.OrderedDict()
            # f.input_ids, f.input_mask, f.segment_ids and f.tag_ids are vectors such as [1,2,3,4...]
            # with shape (max_seq_length,)
            features["input_ids"] = create_int_feature(f.input_ids)
            features["input_mask"] = create_int_feature(f.input_mask)
            features["segment_ids"] = create_int_feature(f.segment_ids)
            # f.char_ids is a matrix such as [[1,2,3],[4,5,6],[7,8,9]]
            # with shape (max_seq_length, max_char_length),
            # so flatten it to a vector of shape (max_seq_length*max_char_length,) before making a feature
            features["char_ids"] = create_int_feature(flatten(f.char_ids))
            features["tag_ids"] = create_int_feature(f.tag_ids)  # a vector
            # Finally put the dictionary into tf.train.Example
            tf_example = tf.train.Example(features=tf.train.Features(feature=features))
            writer.write(tf_example.SerializeToString())
    writer.close()
    return num_examples


def file_based_input_fn_builder(input_file, seq_length, char_length, is_training, drop_remainder):
    # The decoding process, the reverse of the writing above
    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "char_ids": tf.FixedLenFeature([seq_length * char_length], tf.int64),
        "tag_ids": tf.FixedLenFeature([seq_length], tf.int64),
    }

    def _decode_record(record, name_to_features):
        example = tf.parse_single_example(record, name_to_features)
        for name in list(example.keys()):
            t = example[name]
            # Cast int64 to int32: the record only offers tf.train.Int64List, there is no int32 list
            if t.dtype == tf.int64:
                t = tf.cast(t, tf.int32)
            example[name] = t
        # char_ids only needs to be reshaped back to a matrix
        example["char_ids"] = tf.reshape(example["char_ids"],
                                         shape=(seq_length, char_length))
        return example

    def input_fn(params):
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder
        ))
        return d

    return input_fn

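The second function returns a closure, which is mainly used to feed data in Estimator mode. This is the charCNN-BERT-CRF model I built as an improvement on BERT for NER; if you are interested, have a look at my GitHub.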
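The closure pattern itself can be shown without TensorFlow. In this sketch the builder name, the returned dictionary and the values 128, 16 and 32 are all hypothetical; the point is only that the caller passes nothing but params, while the file name and lengths are remembered from the builder call.

```python
def input_fn_builder(input_file, seq_length, char_length, is_training):
    """Return an input_fn that remembers the arguments given here."""

    def input_fn(params):
        # Only `params` is supplied by the caller; everything else
        # comes from the enclosing scope.
        return {
            "input_file": input_file,
            "record_shape": (seq_length, char_length),
            "batch_size": params["batch_size"],
            "shuffle": is_training,
        }

    return input_fn


train_input_fn = input_fn_builder("train.tf_record", seq_length=128,
                                  char_length=16, is_training=True)
config = train_input_fn({"batch_size": 32})
print(config)
```

This is why the builder can be configured once and its result handed over as a function that only expects params.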

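Why I wrote this post

Why write this post? Because I took a detour while solving this problem: I used FeatureList. That is, I split each word into its characters as a Feature and added those as the elements of a FeatureList. However, decoding a FeatureList is relatively complicated and hard to write. Although the program raised no error, at run time the number of samples read was 0: no sample could be read and none entered the network. Now that the method described above works, how useful is FeatureList really, how widely is it used, and is there anything it can do that a plain Feature cannot (I seem to remember seeing it used as the data pipeline in object detection)? If I have time in the next few days I will try it again following the method described in this post and update here. Corrections from readers are welcome!
Tonight I implemented the FeatureList approach and found that it achieves the same result. The code is as follows: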

import collections
import tensorflow as tf  # TensorFlow 1.x API


def filed_based_convert_examples_to_features(
        examples, tokenizer, output_file):
    """
    :param examples: training examples to convert
    :param tokenizer: tokenizer passed on to convert_single_example
    :param output_file: path of the TFRecord file to write
    :return: number of small examples written
    """
    def create_int_feature(values):
        f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
        return f

    # New helper that builds a FeatureList: one Feature per inner list
    def create_feature_list(values_list):
        fl = tf.train.FeatureList(
            feature=[tf.train.Feature(int64_list=tf.train.Int64List(value=values)) for values in values_list])
        return fl

    def flatten(tensor):
        return sum(tensor, [])

    num_examples = 0
    writer = tf.python_io.TFRecordWriter(output_file)
    # Iterate over the training data
    for (ex_index, example) in enumerate(examples):
        # Each training example may yield several features
        example_list = convert_single_example(example, tokenizer)
        num_examples += len(example_list)

        for f in example_list:
            if num_examples % 5000 == 0:
                tf.logging.info("Writing example %d of %d" % (num_examples, len(examples)))
            features = collections.OrderedDict()
            # A separate dictionary for the FeatureLists
            features_list = collections.OrderedDict()
            features["input_ids"] = create_int_feature(f.input_ids)
            features["input_mask"] = create_int_feature(f.input_mask)
            features["segment_ids"] = create_int_feature(f.segment_ids)
            features["tag_ids"] = create_int_feature(f.tag_ids)
            # Convert char_ids to a FeatureList here. Written this way it does not feel much more convenient.
            # I suspect FeatureList is not meant to be used only like this;
            # otherwise it gets you to two dimensions at most, so what is the point?
            # Corrections are welcome.
            features_list["char_ids"] = create_feature_list(f.char_ids)
            # A SequenceExample is needed here: features go into context, FeatureLists into feature_lists
            tf_example = tf.train.SequenceExample(context=tf.train.Features(feature=features),
                                                  feature_lists=tf.train.FeatureLists(feature_list=features_list))
            writer.write(tf_example.SerializeToString())
    writer.close()
    return num_examples


def file_based_input_fn_builder(input_file, seq_length, char_length, is_training, drop_remainder):
    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "tag_ids": tf.FixedLenFeature([seq_length], tf.int64),
    }
    # Decoding spec for the FeatureList: each step holds char_length values
    name_to_features_list = {
        "char_ids": tf.FixedLenSequenceFeature([char_length], tf.int64),
    }

    def _decode_record(record, name_to_features, name_to_features_list):
        # Two return values: the context (the plain Features)
        # and the sequence (the FeatureLists)
        context_example, sequence_example = tf.parse_single_sequence_example(record,
                                                   context_features=name_to_features,
                                                   sequence_features=name_to_features_list)
        for name in list(context_example.keys()):
            t = context_example[name]
            if t.dtype == tf.int64:
                t = tf.cast(t, tf.int32)
            context_example[name] = t

        for name in list(sequence_example.keys()):
            tl = sequence_example[name]
            if tl.dtype == tf.int64:
                tl = tf.cast(tl, tf.int32)
            sequence_example[name] = tl

        return context_example, sequence_example

    def input_fn(params):
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features, name_to_features_list),
            batch_size=batch_size,
            drop_remainder=drop_remainder
        ))
        return d

    return input_fn

def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    train_data_dir = ['training-PHI-Gold-Set2']
    wordpiece_vocab = tokenization_ner.build_wordpiece_vocab(root_path, bert_path, 'vocab.txt')
    wptokenizer = tokenization_ner.WPTokenizer(wordpiece_vocab, FLAGS.max_seq_length, FLAGS.max_char_length)
    train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
    if not os.path.exists(os.path.join(FLAGS.output_dir, "train.tf_record")):
        train_examples = load_examples(train_data_dir)
        num_train_examples = filed_based_convert_examples_to_features(train_examples, wptokenizer, train_file)
    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        seq_length=FLAGS.max_seq_length,
        char_length=FLAGS.max_char_length,
        is_training=True,
        drop_remainder=True)

    params = {}
    params["batch_size"] = FLAGS.train_batch_size
    dataset = train_input_fn(params)

    iterator = dataset.make_one_shot_iterator()

    with tf.Session() as sess:
        for _ in range(1):
            try:
                context, sequence = sess.run(iterator.get_next())
                print(sequence['char_ids'])
            except tf.errors.OutOfRangeError:
                break
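
To compare the two storage layouts used in this post side by side, here is a plain-Python sketch with made-up character ids. It is an analogy only: layout A corresponds to flattening into a single Feature, layout B to one Feature per time step inside a FeatureList. It also uses itertools.chain as an alternative to sum(tensor, []) for flattening.

```python
from itertools import chain

# Hypothetical char ids with shape (seq_length=3, char_length=3); 0 is padding.
char_ids = [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
seq_length, char_length = len(char_ids), len(char_ids[0])

# Layout A: one flat list of seq_length * char_length values.
# The reader must know both lengths to slice it back into rows.
flat = list(chain.from_iterable(char_ids))
restored_a = [flat[i * char_length:(i + 1) * char_length]
              for i in range(seq_length)]

# Layout B: one list per time step.
# The reader only needs char_length; the number of steps is implicit.
per_step = [list(step) for step in char_ids]
restored_b = per_step

print(flat)
print(restored_a == restored_b == char_ids)
```

Both layouts recover the same matrix; they differ in which lengths the reader has to know up front.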