How to Parse TFRecord Features and Extract Data from a SparseTensor

weixin_45672316

Last modified 2023-06-13 16:53:45

Tags: python, programming languages

First published 2023-06-13 16:49:41

Article link: https://blog.csdn.net/weixin_45672316/article/details/131191269

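Many of you have probably run into data that refuses to come out when parsing TensorFlow data, especially when reading with tf.data.TFRecordDataset and the Features of a TFRecord will not parse. Records in this format can be read with tf.train.Example():
Code: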

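import tensorflow as tf

# TFRecord file used throughout this article
filenames = ["movielens-train.tfrecord-00000-of-00001"]
raw_dataset = tf.data.TFRecordDataset(filenames)
for raw_record in raw_dataset.take(3):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Print or extract the record here
    print(example)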

Part of the printed output:

features {
  feature {
    key: "bucketized_user_age"
    value {
      float_list {
        value: 45.0
      }
    }
  }
  feature {
    key: "movie_genres"
    value {
      int64_list {
        value: 7
      }
    }
  }

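But how do you actually get hold of this data? It prints fine, yet it will not go into an array. It is like an itch on your back: you know exactly where it is, you can even see it, but you cannot scratch it.
I first tried tf.io.parse_example, which converts the Features structure into a dictionary of Tensors (see the API documentation for usage details):
Code: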
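One direct route is to read the proto fields with the ordinary protobuf accessors. The sketch below is illustrative only: it builds a small Example in memory instead of reading the TFRecord file, the feature values are copied from the printed output above, and the helper name feature_to_list is hypothetical.

```python
import tensorflow as tf

# Build a small Example in memory so the sketch needs no TFRecord file.
# The values are illustrative, taken from the printed output above.
example = tf.train.Example(features=tf.train.Features(feature={
    "bucketized_user_age": tf.train.Feature(
        float_list=tf.train.FloatList(value=[45.0])),
    "movie_genres": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[7])),
}))


def feature_to_list(feature):
    """Return the values of a tf.train.Feature as a plain Python list."""
    # A Feature holds exactly one of bytes_list, float_list or int64_list.
    kind = feature.WhichOneof("kind")
    if kind is None:  # empty feature
        return []
    return list(getattr(feature, kind).value)


# example.features.feature behaves like a dict keyed by feature name.
record = {name: feature_to_list(feat)
          for name, feat in example.features.feature.items()}
print(record)
```

This stays in plain Python and is handy for inspecting a few records; for a training pipeline, parsing inside tf.data is usually the better fit.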

raw_dataset = tf.data.TFRecordDataset(filenames)
# The actual reading code starts here.
# raw_dataset is iterable, so take(N) yields the first N records.
features = {
    'movie_id': tf.io.FixedLenFeature([], tf.string),
    'movie_genres': tf.io.VarLenFeature(tf.int64),
}
for raw_record in raw_dataset.take(3):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    # Print or extract the record here
    # print(example)
    # Argument 1 should be the serialized data; as a test, the already-parsed example is passed here.
    # Argument 2 is the set of features to extract.
    example1 = tf.io.parse_example(example, features)
    print("example1 =", example1)
    print("type(example) =", type(example))

As expected, this raises an error:

ValueError: Attempt to convert a value (features {...}) with an unsupported
 type (<class 'tensorflow.core.example.example_pb2.Example'>) to a Tensor.

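Articles about tf.train.Example() are scarce online and I could not find an answer, so I switched to the following approach, which reads the data successfully: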

import tensorflow as tf

# Template describing which features to extract and their types.
features = {
    'movie_id': tf.io.FixedLenFeature([], tf.string),
    'movie_genres': tf.io.VarLenFeature(tf.int64),
}
# Parse features, using the above template.
def parse_record(record):
    return tf.io.parse_single_example(record, features=features)
def read_tf_records():
    filenames = ["movielens-train.tfrecord-00000-of-00001"]
    data = tf.data.TFRecordDataset(filenames)
    data = data.map(parse_record)
    data = data.repeat()
    # Shuffle data.
    # data = data.shuffle(buffer_size=10)
    # Batch data (aggregate records together).
    data = data.batch(batch_size=10)
    # Prefetch batch (pre-load batch for faster consumption).
    data = data.prefetch(buffer_size=1)
    print("type(data) =",type(data))

    for record in data.take(1):
        print("**********************************************************************************")
        print("type(record) is: ", type(record))
        print("Building dataset ===========================================")
        dataset = tf.data.Dataset.from_tensor_slices(record)
        print("Dataset built ==============================================")
        print("dataset=",dataset)
        print("==================================================================================")
        print("record[movie_id] is: ",record['movie_id'])
        print("type(record['movie_id'])=",type(record['movie_id']))
        print("record[movie_genres] is: ", record['movie_genres'])
        print("type(record['movie_genres'])=", type(record['movie_genres']))
        print("==================================================================================")
        print(record['movie_id'].numpy())
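Why does this work where the earlier attempt failed? Both tf.io.parse_single_example and tf.io.parse_example expect serialized bytes, which is what TFRecordDataset yields, whereas the failed call was given an already-parsed tf.train.Example object. The sketch below illustrates the difference with a hypothetical in-memory record (the movie_id and genre values are made up) rather than the real file.

```python
import tensorflow as tf

# Hypothetical record standing in for one entry of the TFRecord file.
example = tf.train.Example(features=tf.train.Features(feature={
    "movie_id": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"357"])),
    "movie_genres": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[4, 14])),
}))
# Same kind of bytes that raw_record.numpy() returns.
serialized = example.SerializeToString()

features = {
    "movie_id": tf.io.FixedLenFeature([], tf.string),
    "movie_genres": tf.io.VarLenFeature(tf.int64),
}

# parse_single_example takes one serialized record (a scalar string).
single = tf.io.parse_single_example(serialized, features)

# parse_example takes a batch of serialized records (a 1-D string tensor).
batch = tf.io.parse_example(tf.constant([serialized]), features)

print(single["movie_id"].numpy())
print(tf.sparse.to_dense(batch["movie_genres"]).numpy())
```

In other words, the earlier loop would likely have worked had raw_record itself been passed to the parser instead of the parsed example.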

The SparseTensor is not a problem either: tf.sparse.to_dense pads it into a dense tensor, which numpy() then converts to an array:

        # Convert the SparseTensor into a dense tensor
        sparse_movie_genres = record['movie_genres']
        # Missing entries are filled with 0
        dense_movie_genres = tf.sparse.to_dense(sparse_movie_genres, default_value=0)
        print(dense_movie_genres)
        # Convert the tensor to a NumPy array
        array_dense_movie_genres = dense_movie_genres.numpy()
        print("tensor to array")
        print(array_dense_movie_genres)

# Call the function defined above to run the pipeline
read_tf_records()

Final result:

tf.Tensor(
[[ 7  0  0  0]
 [ 4 14  0  0]
 [ 4  0  0  0]
 [ 5  7  0  0]
 [10 16  0  0]
 [ 7 16  0  0]
 [ 2  3  4 12]
 [ 0  5 14  0]
 [ 4  0  0  0]
 [ 0  1 15 18]], shape=(10, 4), dtype=int64)
tensor to array
[[ 7  0  0  0]
 [ 4 14  0  0]
 [ 4  0  0  0]
 [ 5  7  0  0]
 [10 16  0  0]
 [ 7 16  0  0]
 [ 2  3  4 12]
 [ 0  5 14  0]
 [ 4  0  0  0]
 [ 0  1 15 18]]
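One caveat about the padding value: some rows in the output above appear to start with a genuine genre id 0, which cannot be told apart from the 0 used as padding. A minimal sketch of one workaround follows, assuming -1 never occurs as a real genre id; the small SparseTensor is made up for illustration and is not taken from the dataset.

```python
import tensorflow as tf

# Illustrative batch of three rows holding 1, 2 and 2 genre ids.
# Genre id 0 is a real value in the last row.
sparse = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1], [2, 0], [2, 1]],
    values=tf.constant([7, 4, 14, 0, 5], dtype=tf.int64),
    dense_shape=[3, 2],
)

# Pad with -1 (assumed not to be a valid genre id) so that padding
# stays distinguishable from genre id 0.
dense = tf.sparse.to_dense(sparse, default_value=-1).numpy()
print(dense)

# Recover the variable-length rows by dropping the padding value.
rows = [[int(v) for v in row if v != -1] for row in dense]
print(rows)
```

Whether this matters depends on how the genre ids are encoded in the dataset; if 0 is never a valid id, padding with 0 as in the article is fine.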