13_Loading and Preprocessing Data from multiple CSV with TensorFlow_custom training loop_TFRecord
https://blog.csdn.net/Linli522362242/article/details/107704824
13_Loading and Preprocessing Data from multiple CSV with TensorFlow 2_Feature Columns_TF eXtended
https://blog.csdn.net/Linli522362242/article/details/107933572
The TensorFlow Datasets (TFDS) Project
The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list includes image datasets, text datasets
(including translation datasets), and audio and video datasets. You can visit https://homl.info/tfds to view the full list, along with a description of each dataset.
TFDS is not bundled with TensorFlow, so you need to install the tensorflow-datasets library (e.g., using pip; see https://blog.csdn.net/Linli522362242/article/details/108037567). Then call the tfds.load() function: it will download the data you want (unless it was already downloaded earlier) and return the data as a dictionary of datasets (typically one for training and one for testing, but this depends on the dataset you choose).
For example, let’s download MNIST:
import tensorflow_datasets as tfds
datasets = tfds.load(name="mnist")
datasets
mnist_train, mnist_test = datasets["train"], datasets["test"]
Find available datasets
All dataset builders are subclasses of tfds.core.DatasetBuilder. To get the list of available builders, use tfds.list_builders() or look at the TFDS catalog.
print(tfds.list_builders())
You can then apply any transformation you want (typically shuffling, batching, and prefetching), and you’re ready to train your model. Here is a simple example:
import matplotlib.pyplot as plt

mnist_train, mnist_test = datasets["train"], datasets["test"]

plt.figure(figsize=(6, 3))
# mnist_train = mnist_train.repeat(5).shuffle(10000).batch(32).prefetch(1)
mnist_train = mnist_train.repeat(5).batch(32).prefetch(1)  # no shuffle here, so we always get the same data
for item in mnist_train:
    images = item["image"]
    labels = item["label"]
    for index in range(5):
        # len(images) == 32, since batch(32)
        plt.subplot(1, 5, index + 1)
        image = images[index, ..., 0]  # images[index] has shape (28, 28, 1); index 0 drops the channel axis
        label = labels[index].numpy()
        plt.imshow(image, cmap="binary")
        plt.title(label)
        plt.axis("off")
    break  # just show part of the first batch
The load() function shuffles each data shard it downloads (only for the training set). This may not be sufficient, so it’s best to shuffle the training data some more.
Note that each item in the dataset is a dictionary containing both the features and the labels. But Keras expects each item to be a tuple containing two elements (again, the features and the labels). You could transform the dataset using the map() method, like this:
Figure 13-2. Loading and preprocessing data from multiple CSV files https://blog.csdn.net/Linli522362242/article/details/107704824
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
# mnist_train = mnist_train.repeat(5).shuffle(10000).batch(32)
mnist_train = mnist_train.repeat(5).batch(32)                                  # no shuffle, for getting the same data
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))  # dict -> (features, labels) tuple
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
    print(images.shape)
    print(labels.numpy())
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
# mnist_train = mnist_train.repeat(5).shuffle(10000).batch(32)
mnist_train = mnist_train.repeat(5)                                            # no shuffle, for getting the same data
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))  # dict -> (features, labels) tuple
mnist_train = mnist_train.batch(32)
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
    print(images.shape)
    print(labels.numpy())
datasets = tfds.load(name="mnist")
mnist_train, mnist_test = datasets["train"], datasets["test"]
# mnist_train = mnist_train.repeat(5).shuffle(10000).batch(32)
mnist_train = mnist_train.repeat(5)                                            # no shuffle, for getting the same data
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.batch(32)
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
    n_cols = 10
    n_rows = len(labels) // n_cols + 1
    plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))  # the extra 0.2 leaves some space between subplots
    for row in range(n_rows):
        for col in range(min(n_cols, len(labels) - row * n_cols)):  # the last row may be incomplete
            index = n_cols * row + col
            plt.subplot(n_rows, n_cols, index + 1)  # the subplot indices start at 1
            plt.imshow(images[index, ..., 0], cmap="binary", interpolation="nearest")
            plt.axis("off")                         # remove the axes
            plt.title(labels.numpy()[index], fontsize=12)
    plt.subplots_adjust(wspace=0.2, hspace=0.5)
    plt.show()
But it’s simpler to ask the load() function to do this for you by setting as_supervised=True (obviously this works only for labeled datasets). You can also specify the batch size if you want. Then you can pass the dataset directly to your tf.keras model:
https://blog.csdn.net/Linli522362242/article/details/104124771
Equation 4-22. Cross entropy cost function
$$J(\boldsymbol{\Theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$$
where $y_k^{(i)}$ is equal to 1 if the target class for the ith instance is k; otherwise, it is equal to 0.
Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression's cost function (log loss; see Equation 4-17), which is also the cost function minimized in L2-regularized logistic regression without the penalty term: it is divided by m (the number of instances) so that the partial derivatives used in the gradient descent weight update are averaged over the batch. https://blog.csdn.net/Linli522362242/article/details/96480059
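For reference, here is the log loss from Equation 4-17 (my transcription of the formula from that earlier post):
$$J(\boldsymbol{\theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$$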
https://blog.csdn.net/Linli522362242/article/details/106433059
There are two ways to handle labels in multiclass classification:
– Encoding the labels via categorical encoding (also known as one-hot encoding) and using categorical_crossentropy as a loss function
– Encoding the labels as integers and using the sparse_categorical_crossentropy loss function
https://blog.csdn.net/Linli522362242/article/details/106562190
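Here is a minimal sketch of my own (not from the book) contrasting the two options above, assuming the labels are the digits 0~9:
import tensorflow as tf

y_int = tf.constant([3, 0, 7])           # integer labels
y_onehot = tf.one_hot(y_int, depth=10)   # one-hot encoded labels, shape (3, 10)

# Option 1: one-hot labels + categorical_crossentropy
# model.compile(loss="categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
# Option 2: integer labels + sparse_categorical_crossentropy
# model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])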
import numpy as np
import tensorflow as tf
from tensorflow import keras

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

datasets = tfds.load(name="mnist", batch_size=32, as_supervised=True)
mnist_train = datasets["train"].repeat().prefetch(1)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),                    # flatten each image to 1D
    keras.layers.Lambda(lambda images: tf.cast(images, tf.float32)),  # the dataset's types are {image: tf.uint8, label: tf.int64}
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy",                 # integer class labels 0~9
              optimizer=keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])
model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)
https://www.tensorflow.org/datasets/catalog/mnist
This was quite a technical chapter, and you may feel that it is a bit far from the abstract beauty of neural networks, but the fact is that Deep Learning often involves large amounts of data, and knowing how to load, parse, and preprocess it efficiently is a crucial skill to have. In the next chapter, we will look at convolutional neural networks, which are among the most successful neural net architectures for image processing and many other applications.
TensorFlow Hub
TF2 SavedModel
This is a SavedModel in TensorFlow 2 format. Using it requires TensorFlow 2 (or 1.15) and TensorFlow Hub 0.5.0 or newer.
Overview
This module is in the SavedModel 2.0 format (https://www.tensorflow.org/hub/tf2_saved_model) and was created to help preview TF2.0 functionalities. It is based on https://tfhub.dev/google/nnlm-en-dim50/1.
Text embedding based on feed-forward Neural-Net Language Models[1] with pre-built OOV. Maps from text to 50-dimensional embedding vectors.
Example use
The saved model can be loaded directly:
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1")
embeddings = embed(["cat is on the mat", "dog is in the fog"])
It can also be used within Keras:
The tensorflow_hub library provides the class hub.KerasLayer, which gets initialized with the URL (or filesystem path) of a SavedModel and then provides the computation from the SavedModel, including its pre-trained weights.
hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                           output_shape=[50], input_shape=[], dtype=tf.string)
model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.summary()
Details
Based on NNLM(Neural-Net Language Models) with two hidden layers.
Input
The module takes a batch of sentences in a 1-D tensor of strings as input.
Preprocessing
The module preprocesses its input by splitting on spaces.
Out of vocabulary tokens
A small fraction of the least frequent tokens and embeddings (~2.5%) is replaced by hash buckets. Each hash bucket is initialized using the remaining embedding vectors that hash to the same bucket.
Sentence embeddings
Word embeddings are combined into a sentence embedding using the sqrtn combiner (see tf.nn.embedding_lookup_sparse).
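To illustrate what the sqrtn combiner means, here is a rough sketch of my own (not from the module's documentation): with unit weights, the sentence embedding is the sum of the word embeddings divided by the square root of the number of words.
import tensorflow as tf

word_embeddings = tf.random.normal([4, 50])   # pretend embeddings for a 4-word sentence (made-up values)
sentence_embedding = tf.reduce_sum(word_embeddings, axis=0) / tf.sqrt(4.0)
print(sentence_embedding.shape)               # (50,)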
sentences = tf.constant(["It was a great movie", "The actors were amazing"])
embeddings = hub_layer(sentences)
embeddings
Each embedding vector printed above contains 3×16 + 2 = 50 values, i.e., it is 50-dimensional.
print(embeddings.shape, embeddings.dtype)
https://www.tensorflow.org/hub/tf2_saved_model
Exercises
1. Why would you want to use the Data API?
Ingesting a large dataset and preprocessing it efficiently can be a complex engineering challenge. The Data API makes it fairly simple. It offers many features, including loading data from various sources (such as text or binary files), reading
data in parallel from multiple sources, transforming it, interleaving the records, shuffling the data, batching it, and prefetching it.
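For example, here is a tiny pipeline sketch (my own toy data, not from the book) chaining several of these transformations:
import tensorflow as tf

dataset = tf.data.Dataset.range(100)   # toy data source
dataset = (dataset
           .shuffle(buffer_size=10)    # shuffle with a small buffer
           .batch(8)                   # group items into batches of 8
           .prefetch(1))               # prefetch the next batch while the current one is being used
for batch in dataset.take(2):
    print(batch.numpy())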
2. What are the benefits of splitting a large dataset into multiple files?
Splitting a large dataset into multiple files makes it possible to shuffle it at a coarse level(dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)) before shuffling it at a finer level using a shuffling buffer(dataset = dataset.shuffle(shuffle_buffer_size)). It also makes it possible to handle huge datasets that do not fit on a single machine. It’s also simpler to manipulate thousands of small files rather than one huge file; for example, it’s easier to split the data into multiple subsets. Lastly, if the data is split across multiple files spread across multiple servers, it is possible to download several files from different servers simultaneously, which improves the bandwidth usage.
https://blog.csdn.net/Linli522362242/article/details/107704824
import os
import numpy as np

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    # file_idx: index of each group (one file per group); row_indices: element indices within that group
    # np.split       ==> [array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.])] (requires equal parts)
    # np.array_split ==> [array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7.])] (allows unequal parts)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):  # iterate over the files
        part_csv = path_format.format(name_prefix, file_idx)  # e.g. datasets/housing/my_train_00.csv
        filepaths.append(part_csv)
        # "t" refers to text mode
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:  # iterate over the rows
                # str() aims at readability for users, while repr() produces an unambiguous representation
                # that the interpreter can read back, which is what we want when writing raw values to CSV;
                # a class can override __repr__ for a uniform display and __str__ for a friendlier user display.
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
# y_target
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)
train_filepaths = save_to_multiple_csv_files( train_data, "train", header, n_parts=20 )
valid_filepaths = save_to_multiple_csv_files( valid_data, "valid", header, n_parts=10 )
test_filepaths = save_to_multiple_csv_files( test_data, "test", header, n_parts=10 )
# scaler = StandardScaler()
# scaler.fit(X_train)
# X_mean = scaler.mean_
# X_std = scaler.scale_
n_inputs = 8 # X_train.shape[-1] # X_train.shape=(11610, 8)
@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]  # record_defaults
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y
def csv_reader_dataset(filepaths, repeat=1,
                       n_readers=5,              # number of files read at a time
                       n_read_threads=None,
                       shuffle_buffer_size=10000,
                       n_parse_threads=5,
                       batch_size=32):
    ######### pick multiple files randomly and read them simultaneously, interleaving their records #########
    # list_files() returns a dataset that "shuffles" the file paths, then repeat it 'repeat' times
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    # interleave() reads from cycle_length (= n_readers) files at a time and "interleaves their lines":
    # it reads one line at a time from each file until all of these datasets are out of items.
    # Then it gets the next n_readers file paths from 'dataset' and interleaves them the same way,
    # and so on until it runs out of file paths.
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),  # skip the header row via map_func
        cycle_length=n_readers,            # interleave pulls cycle_length (= n_readers) file paths, one by one, from 'dataset'
        num_parallel_calls=n_read_threads
    )  # for each file path it calls the lambda to create a new dataset (TextLineDataset);
       # the interleave dataset thus cycles through cycle_length (= n_readers) datasets
    ############## then, on top of that, add a shuffling buffer using the shuffle() method ##############
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess,      # split each line into x and y, then scale x
                          num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)    # group the items of the previous dataset into batches of 'batch_size' items
    return dataset.prefetch(1)
tf.random.set_seed(42)
train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
    print('X =', X_batch)
    print('y =', y_batch)
    print()
3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?
https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras
The Profile tab opens the Overview page, which shows you a high-level summary of your model's performance. Looking at the Step-time Graph on the right, you can see that the model is highly input bound (i.e., it spends a lot of time in the data input pipeline). The Overview page also gives you recommendations on potential next steps you can follow to optimize your model's performance.
You can use TensorBoard to visualize profiling data: if the GPU is not fully utilized then your input pipeline is likely to be the bottleneck. You can fix it by making sure it reads and preprocesses the data in multiple threads in parallel, and ensuring it prefetches a few batches. If this is insufficient to get your GPU to 100% usage during training, make sure your preprocessing code is optimized. You can also try saving the dataset into multiple TFRecord files, and if necessary perform some of the preprocessing ahead of time so that it does not need to be done on the fly during training (TF Transform can help with this). If necessary, use a machine with more CPU and RAM, and ensure that the GPU bandwidth is large enough.
https://blog.csdn.net/Linli522362242/article/details/107704824
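As a rough sketch of how you might collect profiling data with the TensorBoard callback (my own example; the log directory and the profiled batch range are arbitrary choices, and profile_batch as a "start,stop" string assumes a recent TF 2.x):
import tensorflow as tf

tb_cb = tf.keras.callbacks.TensorBoard(log_dir="./logs", profile_batch="100,120")  # profile batches 100~120
# model.fit(train_set, epochs=5, callbacks=[tb_cb])
# then run: tensorboard --logdir ./logs   and open the "Profile" tab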
# train_filepaths = save_to_multiple_csv_files( train_data, "train", header, n_parts=20 )
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])
optimizer = keras.optimizers.Nadam(lr=0.01)
loss_fn = keras.losses.mean_squared_error  # returns one loss per instance

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths,
                                   repeat=n_epochs,                          # list_files() "shuffles" the file paths, then repeat
                                   n_readers=n_readers,                      # interleave() reads from cycle_length files at a time
                                   n_read_threads=n_read_threads,            # num_parallel_calls
                                   shuffle_buffer_size=shuffle_buffer_size,  # shuffle
                                   n_parse_threads=n_parse_threads,          # map(preprocess, num_parallel_calls=n_parse_threads)
                                   batch_size=batch_size)                    # batch(), followed by prefetch(1)
    ##############################################
    n_steps_per_epoch = len(X_train) // batch_size  # 11610 // 32 = 362 steps per epoch
    total_steps = n_epochs * n_steps_per_epoch      # if n_epochs=5, then total_steps=1810
    global_step = 0
    ##############################################
    for X_batch, y_batch in train_set.take(total_steps):  # 11610//32 * 5 = 1810 steps; each step takes 32 instances
        # tracking
        global_step += 1
        if tf.equal(global_step % 100, 0):
            # '\r' moves the cursor back to the start of the current row
            tf.print("\rGlobal step", global_step, "/", total_steps)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)                               # prediction
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))  # mean loss per batch
            loss = tf.add_n([main_loss] + model.losses)           # model.losses: one "regularization loss" per layer
        # compute the gradient of the loss with regard to each trainable variable
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        ######################## apply weight constraints here, if any ########################

train(model, 5)
4. Can you save any binary data to a TFRecord file, or only serialized protocol buffers?
A TFRecord file is composed of a sequence of arbitrary binary records: you can store absolutely any binary data you want in each record. However, in practice most TFRecord files contain sequences of serialized protocol buffers. This makes it possible to benefit from the advantages of protocol buffers, such as the fact that they can be read easily across multiple platforms and languages and their definition can be updated later in a backward-compatible way.
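For instance, here is a minimal sketch (my own) showing that arbitrary bytes, not just serialized protobufs, can be written to and read back from a TFRecord file:
import tensorflow as tf

with tf.io.TFRecordWriter("my_raw.tfrecord") as f:
    f.write(b"any binary payload")   # not a protobuf, just raw bytes
    f.write(b"\x00\x01\x02\x03")

for record in tf.data.TFRecordDataset(["my_raw.tfrecord"]):
    print(record.numpy())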
5. Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?
The Example protobuf format has the advantage that TensorFlow provides some operations to parse it (the tf.io.parse*example() functions) without you having to define your own format. It is sufficiently flexible to represent instances in
most datasets. However, if it does not cover your use case, you can define your own protocol buffer, compile it using protoc (setting the --descriptor_set_out and --include_imports arguments to export the protobuf descriptor), and use the tf.io.decode_proto() function to parse the serialized protobufs (see the “Custom protobuf ” section of the notebook for an example). It’s more complicated, and it requires deploying the descriptor along with the model, but it can be done.
########################################################################
https://blog.csdn.net/Linli522362242/article/details/107704824
First let's write a simple protobuf definition:
%%writefile person.proto
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}
Once you have a definition in a .proto file, you can compile it. This requires protoc, the protobuf compiler, to generate access classes in Python (or some other language).
And let's compile it (the --descriptor_set_out and --include_imports options are only required for the tf.io.decode_proto() example below). Run protoc in the directory containing person.proto; it generates the Python access code (person_pb2.py):
!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports
!dir person* # !ls person* #note 'ls' is for linux system
from person_pb2 import Person
person = Person(name='Al', id=123, email=["a@b.com"])  # create a Person
print(person)                                          # display the Person
person.email.append('c@d.com')                         # add an email address
s = person.SerializePartialToString()                  # serialize the object to a byte string
s
person_tf = tf.io.decode_proto(
    bytes=s,
    message_type="Person",
    field_names=["name", "id", "email"],
    output_types=[tf.string, tf.int32, tf.string],
    descriptor_source="person.desc")

person_tf
person_tf.values
Here is the definition of the tf.train.Example protobuf:
The numbers 1, 2, and 3 are the field identifiers: they will be used in each record’s binary representation
syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    // The numbers 1, 2, and 3 are the field identifiers:
    // they will be used in each record's binary representation
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
Here is how you could create a tf.train.Example representing the same person as earlier and write it to a TFRecord file:
import tensorflow as tf

BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example

person_example = Example(
    # message Example { Features features = 1; };
    features=Features(
        # message Features { map<string, Feature> feature = 1; };
        feature={
            # map<string, Feature>
            "name": Feature(bytes_list=BytesList(value=[b"Alice"])),
            "id": Feature(int64_list=Int64List(value=[123])),
            "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
                                                          b"c@d.com"])),
        }
    )
)
Now that we have an Example protobuf, we can serialize it by calling its SerializeToString() method, then write the resulting data to a TFRecord file:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())
The following code defines a description dictionary, then it iterates over the TFRecordDataset and parses (based on the description) the serialized Example protobufs this dataset contains:
Instead of parsing examples one by one using tf.io.parse_single_example(), you may want to parse them batch by batch using tf.io.parse_example():
feature_description = {
    # feature's shape, type, and default value
    "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    # only the type (use VarLenFeature when the length of the feature's list may vary)
    "emails": tf.io.VarLenFeature(tf.string),
}

# for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
#     parsed_example = tf.io.parse_single_example(serialized_example,
#                                                 feature_description)
# parsed_example

dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples, feature_description)
parsed_examples
Handling Lists of Lists Using the SequenceExample Protobuf
Here is the definition of the SequenceExample protobuf:
syntax = "proto3";

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
};
# from tensorflow.train import FeatureList, FeatureLists, SequenceExample
BytesList = tf.train.BytesList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample

# Features context = 1;
context = Features(
    # message Features { map<string, Feature> feature = 1; };
    feature={
        # map<string, Feature>
        "author_id": Feature(int64_list=Int64List(value=[123])),
        "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
        "pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25])),
    }
)

content = [["When", "shall", "we", "three", "meet", "again", "?"],
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

# message Feature {
#     oneof kind {
#         BytesList 'bytes_list' = 1;
#         FloatList 'float_list' = 2;
#         Int64List 'int64_list' = 3;
#     }
# };
def words_to_feature(words):
    return Feature(bytes_list=BytesList(value=[word.encode("utf-8") for word in words]))

# repeated Feature feature => [feature, feature]: each Feature represents a sentence or a comment
content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]

sequence_example = SequenceExample(
    # Features 'context' = 1;
    context=context,
    # FeatureLists 'feature_lists' = 2;
    feature_lists=FeatureLists(
        # message FeatureLists { map<string, FeatureList> feature_list = 1; };
        feature_list={
            # map<string, FeatureList>
            # message FeatureList { repeated Feature feature = 1; };
            # each FeatureList contains a list of Feature objects
            "content": FeatureList(feature=content_features),
            "comments": FeatureList(feature=comments_features),
        }
    )
)
sequence_example
serialized_sequence_example = sequence_example.SerializePartialToString()
context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "title": tf.io.VarLenFeature(tf.string),
    "pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
    "comments": tf.io.VarLenFeature(tf.string),
}

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example,
    context_feature_descriptions,
    sequence_feature_descriptions)
parsed_context

print(tf.RaggedTensor.from_sparse(parsed_feature_lists["content"]))
Putting Images in TFRecords
from sklearn.datasets import load_sample_images
import matplotlib.pyplot as plt

img = load_sample_images()["images"][0]
# encode the image using the JPEG format and put this binary data in a BytesList
data = tf.io.encode_jpeg(img)

# message Example { Features features = 1; };
example_with_image = Example(features=Features(
    # message Features { map<string, Feature> feature = 1; };
    feature={
        # map<string, Feature>  # BytesList bytes_list = 1;
        "image": Feature(bytes_list=BytesList(value=[data.numpy()]))
    }))
# this is the binary data that is ready to be saved to a TFRecord file or transmitted over the network
serialized_example = example_with_image.SerializeToString()

feature_description = {"image": tf.io.VarLenFeature(tf.string)}
example_with_image = tf.io.parse_single_example(serialized_example, feature_description)
decoded_img = tf.io.decode_jpeg(example_with_image["image"].values[0])
# OR: decoded_img = tf.io.decode_image(example_with_image["image"].values[0])

plt.imshow(decoded_img)
plt.title("Decoded Image")
plt.axis("off")
plt.show()
########################################################################
6. When using TFRecords, when would you want to activate compression? Why not do it systematically?
When using TFRecords, you will generally want to activate compression if the TFRecord files will need to be downloaded by the training script, as compression will make files smaller and thus reduce download time. But if the files are located on the same machine as the training script, it’s usually preferable to leave compression off, to avoid wasting CPU for decompression.
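A minimal sketch (my own) of writing and reading a GZIP-compressed TFRecord file:
import tensorflow as tf

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"some record")

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"], compression_type="GZIP")
for record in dataset:
    print(record.numpy())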
7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?
https://blog.csdn.net/Linli522362242/article/details/108108665
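As a quick illustration of the third option (a preprocessing layer inside the model), here is a sketch of my own, assuming X_mean and X_std were computed on the training set beforehand:
from tensorflow import keras

standardization_layer = keras.layers.Lambda(lambda inputs: (inputs - X_mean) / X_std)  # X_mean, X_std assumed precomputed
model = keras.models.Sequential([
    standardization_layer,                    # preprocessing happens inside the model itself
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(1),
])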