13_Loading & Preprocessing Data from multiple CSV with TF 2_Feature Columns_TF eXtended_num_oov_buckets

13_Loading and Preprocessing Data from multiple CSV with TensorFlow_custom training loop_TFRecord
https://blog.csdn.net/Linli522362242/article/details/107704824

Preprocessing the Input Features

     Preparing your data for a neural network requires converting all features into numerical features, generally normalizing them, and more. In particular, if your data contains categorical features or text features, they need to be converted to numbers. This can be done ahead of time when preparing your data files, using any tool you like (e.g., NumPy, pandas, or Scikit-Learn). Alternatively, you can preprocess your data on the fly when loading it with the Data API (e.g., using the dataset’s map() method, as we saw earlier), or you can include a preprocessing layer directly in your model. Let’s look at this last option now.

     For example, here is how you can implement a standardization layer using a Lambda layer. For each feature, it subtracts the mean and divides by its standard deviation (plus a tiny smoothing term to avoid division by zero):

means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...] # other layers
])

     That’s not too hard! However, you may prefer to use a nice self-contained custom layer (much like Scikit-Learn’s StandardScaler), rather than having global variables like means and stds dangling around: 

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

     Before you can use this standardization layer, you will need to adapt it to your dataset by calling the adapt() method and passing it a data sample. This will allow it to use the appropriate mean and standard deviation for each feature:

std_layer = Standardization()
std_layer.adapt(data_sample)

      This sample must be large enough to be representative of your dataset, but it does not have to be the full training set: in general, a few hundred randomly selected instances will suffice (however, this depends on your task). Next, you can use this preprocessing layer like a normal layer:

model = keras.Sequential()
model.add(std_layer)
[...] # create the rest of the model # https://blog.csdn.net/Linli522362242/article/details/106562190
model.compile([...])
model.fit([...])

     If you are thinking that Keras should contain a standardization layer like this one, here’s some good news for you: by the time you read this, the keras.layers.Normalization layer will probably be available. It will work very much like our custom Standardization layer: first, create the layer, then adapt it to your dataset by passing a data sample to the adapt() method, and finally use the layer normally.
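Here is a hedged sketch of that adapt-then-use pattern, reusing the data_sample from above (in released TensorFlow versions the layer first appeared as keras.layers.experimental.preprocessing.Normalization before moving to keras.layers.Normalization):

norm_layer = keras.layers.Normalization() # or keras.layers.experimental.preprocessing.Normalization
norm_layer.adapt(data_sample)             # computes the per-feature mean and variance
model = keras.models.Sequential([
    norm_layer,
    keras.layers.Dense(1),                # ... rest of the model
])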

The Features API

     Let's use the variant of the California housing dataset that we used in cp2_End-to-End Machine Learning Project_StratifiedShuffleSplit_RMSE_MAE_Geographical Data_CaliforniaHousing 
https://blog.csdn.net/Linli522362242/article/details/103387527, since it contains categorical features and missing values:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42
)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_
print( housing.feature_names)

['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

import os
import tarfile
import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

#On your computer
        #or # os.path.abspath( os.path.join(os.getcwd(),'..') )
upperLevelDir=os.path.abspath( os.path.dirname(os.getcwd()) )
HOUSING_PATH = os.path.join(upperLevelDir, "datasets", "housing")
HOUSING_PATH

def fetch_housing_data( housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
#     if not os.path.isdir(housing_path):
#         os.makedirs(housing_path) # creates a datasets/housing directory
#   OR
    os.makedirs( housing_path, exist_ok=True )
    tgz_path = os.path.join(housing_path, "housing.tgz") #storage path
    urllib.request.urlretrieve(housing_url, tgz_path)
    
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)#extracts the housing.csv then save it to 'path'
    housing_tgz.close()

fetch_housing_data()

import pandas as pd

def load_housing_data(housing_path = HOUSING_PATH):
    csv_path = os.path.join(housing_path, 'housing.csv')
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head()

 

1. Background

     tf.estimator is a high-level TensorFlow API. Its biggest strength is that it supports both distributed and single-machine scenarios: with the same code structure, an engineer can train on a single machine or in a distributed setting. Because of this, many companies, including Alibaba, use this interface to build their deep learning models.

Feature preprocessing is a step required by almost every machine learning model. Common preprocessing methods include bucketizing continuous variables, one-hot encoding discrete variables, embedding discrete features, and so on. TensorFlow provides a powerful feature-processing module, tf.feature_column, which transforms features so the data can be fed into the network and trained by an Estimator. This post uses actual data output to illustrate the basic usage of tf.feature_column.

2. Data Processing

Feature data falls mainly into two classes, categorical and dense, and both are handled by defining them through TensorFlow's feature_column interface. As the figure below shows, there are nine different functions in total: five categorical functions, three numerical (dense) functions, plus bucketized_column, which can belong to either class. categorical_column_with_identity (in the categorical group) and indicator_column (in the dense group) both produce a one-hot representation of a categorical feature, but they belong to different column classes, the former categorical and the latter dense. Networks written with different Estimators accept different one-hot column types, so in practice you need to watch out for this conversion.

Feature Columns

     This tutorial details feature columns. Think of feature columns as the intermediaries between raw data and Estimators. Feature columns are very rich, enabling you to transform a diverse range of raw data into formats that Estimators can use, allowing easy experimentation.

In simple words, feature columns are the bridge between the raw data and the estimator or model.

Input to a Deep Neural Network

     What kind of data can a deep neural network operate on? The answer is, of course, numbers (for example, tf.float32). After all, every neuron in a neural network performs multiplication and addition operations on weights and input data. Real-life input data, however, often contains non-numerical (categorical) data. For example, consider a product_class feature that can contain the following three non-numerical values:

  • kitchenware
  • electronics
  • sports

     ML models generally represent categorical values as simple vectors in which a 1 represents the presence of a value and a 0 represents the absence of a value. For example, when product_class is set to sports, an ML model would usually represent product_class as [0, 0, 1], meaning:

  • 0: kitchenware is absent
  • 0: electronics is absent
  • 1: sports is present

So, although raw data can be numerical or categorical, an ML model represents all features as numbers.
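As a small illustration of how that one-hot representation can be produced with feature columns (the data values here are made up; the feature-column functions themselves are introduced in detail below):

import tensorflow as tf

product_data = {"product_class": ["sports", "kitchenware", "electronics"]}
product_class = tf.feature_column.categorical_column_with_vocabulary_list(
    "product_class", ["kitchenware", "electronics", "sports"])
product_one_hot = tf.feature_column.indicator_column(product_class)
print(tf.keras.layers.DenseFeatures(product_one_hot)(product_data).numpy())
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]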

Feature Columns

     As the following figure suggests, you specify the input to a model through the feature_columns argument of an Estimator (DNNClassifier for Iris). Feature Columns bridge input data (as returned by input_fn) with your model.

Feature columns bridge raw data with the data your model needs.

     To create feature columns, call functions from the tf.feature_column module. This tutorial explains nine of the functions in that module. As the following figure shows, all nine functions return either a Categorical-Column or a Dense-Column object, except  bucketized_column, which inherits from both classes:

Let’s look at these functions in more detail.

Import TensorFlow and other libraries

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.keras import layers

Create Demo data

data = {'marks': [55,21,63,88,74,54,95,41,84,52],
        'grade': ['average','poor','average','good','good','average','good','average','good','average'],
        'point': ['c','f','c+','b+','b','c','a','d+','b+','c']}

df=pd.DataFrame(data)
df


Demonstrate several types of feature column

# A utility method to show transformation from feature column
def demo( feature_column ):
    feature_layer = layers.DenseFeatures( feature_column )
                        ####
    print(feature_layer(data).numpy())

Numeric column

     The output of a feature column becomes the input to the model (using the demo function defined above, we will be able to see exactly how each column from the dataframe is transformed). A numeric column is the simplest type of column. It is used to represent real valued features. When using this column, your model will receive the column value from the dataframe unchanged.

marks = feature_column.numeric_column("marks")
demo(marks)

 

Convert a DataFrame to a dict{'key': [...]} by using to_dict(orient='list')

housing_dict=housing.to_dict(orient='list') #{'key':[...value...], 'key':[...],..., 'key':[...]}
housing_dict.keys()

from tensorflow.keras import layers

def demo_housing( feature_column ):
    feature_layer = layers.DenseFeatures( feature_column )
                        ############ 
    print(feature_layer(housing_dict).numpy()[:5]) # 5 : housing.head(n=5)
housing_median_age = tf.feature_column.numeric_column("housing_median_age")
housing_median_age

demo_housing(housing_median_age) 

[[41.]
 [21.]
 [52.]
 [52.]
 [52.]]

age_mean, age_std = X_mean[1], X_std[1] # HouseAge (the median age) is at column index 1
housing_median_age = tf.feature_column.numeric_column(
    "housing_median_age", normalizer_fn=lambda x: (x-age_mean)/age_std )
housing_median_age

demo_housing(housing_median_age)


Bucketized column

     Often, you don’t want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. Consider raw data that represents a person’s age. Instead of representing age as a numeric column, we could split the age into several buckets using a bucketized column. Notice the one-hot values below describe which age range each row matches. Buckets include the left boundary, and exclude the right boundary. For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split the year into the following four buckets:

The model will represent the buckets as follows:

     Why would you want to split a number (a perfectly valid input to your model) into a categorical value? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one; four weights create a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn more complex relationships.

The following code demonstrates how to create a bucketized feature:

marks_buckets = feature_column.bucketized_column(marks, boundaries=[30,40,50,60,70,80,90])
demo(marks_buckets)


median_income = tf.feature_column.numeric_column("median_income")
bucketized_income = tf.feature_column.bucketized_column(
                              #<1.5,<3.,<4.5,<6.,>=6  
    median_income, boundaries=[1.5, 3., 4.5, 6.]
)
bucketized_income

demo_housing(bucketized_income)

 [ [0. 0. 0. 0. 1.]   <== 8.3252
   [0. 0. 0. 0. 1.]   <== 8.3014
   [0. 0. 0. 0. 1.]   <== 7.2574
   [0. 0. 0. 1. 0.]   <== 5.6431
   [0. 0. 1. 0. 0.] ] <== 3.8462

# housing_median_age = tf.feature_column.numeric_column(
#     "housing_median_age", normalizer_fn=lambda x: (x-age_mean)/age_std )

bucketized_age = tf.feature_column.bucketized_column(
    housing_median_age, boundaries=[-1., -0.5, 0., 0.5, 1]
)   # age was scaled
bucketized_age

 

demo_housing(bucketized_age)


###############################################

Encoding Categorical Features Using One-Hot Vectors

     Consider the ocean_proximity feature in the California housing dataset we explored in cp2_End-to-End Machine Learning Project_StratifiedShuffleSplit_RMSE_MAE_Geographical Data_CaliforniaHousing 
https://blog.csdn.net/Linli522362242/article/details/103387527: it is a categorical feature with five possible values: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", and "ISLAND". We need to encode this feature before we feed it to a neural network. Since there are very few categories, we can use one-hot encoding. For this, we first need to map each category to its index (0 to 4), which can be done using a lookup table:

pd.unique(housing['ocean_proximity'])

 

  1. We first define the vocabulary: this is the list of all possible categories.
  2. Then we create a tensor with the corresponding indices (0 to 4).
    vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
    indices = tf.range(len(vocab), dtype=tf.int64)
    indices
  3. Next, we create an initializer for the lookup table, passing it the list of categories and their corresponding indices. In this example, we already have this data, so we use a KeyValueTensorInitializer; but if the categories were listed in a text file (with one category per line), we would use a TextFileInitializer instead.
    table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
    table_init

  4. In the last two lines we create the lookup table, giving it the initializer and specifying the number of out-of-vocabulary (oov, reserved for tokens that do not exist in the set) buckets. If we look up a category (or token) that does not exist in the vocabulary(set), the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets. Their indices start after the known categories, so in this example the indices of the two oov buckets are 5 and 6.

    num_oov_buckets = 2
    table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)
    table

    input_tensor = tf.constant(["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"])
    out = table.lookup(input_tensor)
    out

    Why use oov(out-of-vocabulary) buckets? Well, if the number of categories is large (e.g., zip codes, cities, words, products, or users) and the dataset is large as well, or it keeps changing, then getting the full list of categories may not be convenient. One solution is to define the vocabulary based on a data sample (rather than the whole training set) and add some oov buckets for the other categories that were not in the data sample. The more unknown categories you expect to find during training, the more oov buckets you should use. Indeed, if there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural network will not be able to distinguish them (at least not based on this feature).

Now let’s use the lookup table to encode a small batch of categorical features to one-hot vectors:

categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1], dtype=int64)>

cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

The output has shape (4, 7): four category indices, each one-hot encoded with depth = len(vocab) + num_oov_buckets = 7:

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([ [0., 0., 0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 0., 1., 0.],
           [0., 1., 0., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0., 0., 0.] ], dtype=float32)>

      As you can see, "NEAR BAY" was mapped to index 3, the unknown category "DESERT" was mapped to one of the two oov buckets (at index 5), and "INLAND" was mapped to index 1, twice. Then we used tf.one_hot() to one-hot encode these indices. Notice that we have to tell this function the total number of indices, which is equal to the vocabulary size plus the number of oov buckets. Now you know how to encode categorical features to one-hot vectors using TensorFlow!

     Just like earlier, it wouldn’t be too difficult to bundle all of this logic into a nice self-contained class. Its adapt() method would take a data sample and extract all the distinct categories it contains. It would create a lookup table to map each category to its index (including unknown categories using oov buckets). Then its call() method would use the lookup table to map the input categories to their indices. Well, here’s more good news: by the time you read this, Keras will probably include a layer called keras.layers.TextVectorization, which will be capable of doing exactly that: its adapt() method will extract the vocabulary from a data sample, and its call() method will convert each category to its index in the vocabulary. You could add this layer at the beginning of your model, followed by a Lambda layer that would apply the tf.one_hot() function, if you want to convert these indices to one-hot vectors. A minimal sketch of such a self-contained class is shown below.
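Here is one way such a class could look, reusing the lookup-table code above (this is an assumption about the design, not the book's implementation):

import numpy as np
import tensorflow as tf
from tensorflow import keras

class CategoryEncoder(keras.layers.Layer):
    def __init__(self, num_oov_buckets=2, **kwargs):
        super().__init__(**kwargs)
        self.num_oov_buckets = num_oov_buckets

    def adapt(self, data_sample):
        # extract the distinct categories, keeping their first-seen order
        self.vocab_ = list(dict.fromkeys(np.ravel(data_sample).tolist()))
        indices = tf.range(len(self.vocab_), dtype=tf.int64)
        table_init = tf.lookup.KeyValueTensorInitializer(self.vocab_, indices)
        self.table_ = tf.lookup.StaticVocabularyTable(table_init, self.num_oov_buckets)

    def call(self, inputs):
        return self.table_.lookup(inputs)   # category strings -> integer indices

cat_encoder = CategoryEncoder()
cat_encoder.adapt(["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"])
cat_encoder(tf.constant(["NEAR BAY", "DESERT"]))   # indices such as [3, 5]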
###############################################

Categorical Columns

Indicator and embedding columns

Indicator columns and embedding columns never work on features directly, but instead take categorical columns as input.

 Categorical vocabulary column

Indicator columns

     In this dataset(Demo data), grade is represented as a string (e.g. ‘poor’, ‘average’, or ‘good’). We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like you have seen above with age buckets). The vocabulary can be passed as a list using categorical_column_with_vocabulary_list, or loaded from a file using categorical_column_with_vocabulary_file.

     We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent strings as a one-hot vector. For example:

grade = feature_column.categorical_column_with_vocabulary_list(
    'grade', ['poor', 'average', 'good']
)
grade_one_hot = feature_column.indicator_column(grade)
demo(grade_one_hot)

 

ocean_prox_vocab = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
ocean_proximity = tf.feature_column.categorical_column_with_vocabulary_list(
    "ocean_proximity", ocean_prox_vocab)
ocean_proximity

demo_housing(tf.feature_column.indicator_column(ocean_proximity))

Since the first five instances are all "NEAR BAY" (index 3), each one-hot row printed above is [0. 0. 0. 1. 0.].

# ocean_prox_vocab = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
# ocean_proximity = tf.feature_column.categorical_column_with_vocabulary_list(
#    "ocean_proximity", ocean_prox_vocab)
# ocean_proximity

# Represent the categorical column as an indicator column
ocean_proximity_one_hot = tf.feature_column.indicator_column(ocean_proximity)
ocean_proximity_one_hot

demo_housing(ocean_proximity_one_hot)


     The function above is very simple, but it has one obvious drawback: when the vocabulary is long, there is a lot to type in. In that case you can call tf.feature_column.categorical_column_with_vocabulary_file instead, which lets you keep the vocabulary in a separate file.
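For example, a hedged sketch of the file-based variant (the file name and its contents here are hypothetical, with one category per line):

# ocean_proximity_vocab.txt would contain one category per line:
#   <1H OCEAN
#   INLAND
#   ISLAND
#   NEAR BAY
#   NEAR OCEAN
ocean_proximity_from_file = tf.feature_column.categorical_column_with_vocabulary_file(
    key="ocean_proximity",
    vocabulary_file="ocean_proximity_vocab.txt",
    vocabulary_size=5,
    num_oov_buckets=1)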

Embedding columns

     Suppose instead of having just a few possible strings, we have thousands (or more) values per category. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.

Key point: using an embedding column is best when a categorical column has many possible values. We are using one here for demonstration purposes, so you have a complete example you can modify for a different dataset in the future.

ocean_proximity_embed = tf.feature_column.embedding_column(ocean_proximity, dimension=2)
demo_housing(ocean_proximity_embed)

  # output shape: (5, 2), i.e. 5 instances from housing.head(n=5) and 2 embedding dimensions (embedding_dim)

Point column as indicator_column

point = feature_column.categorical_column_with_vocabulary_list(
             ###################
    'point', df['point'].unique() # array(['c', 'f', 'c+', 'b+', 'b', 'a', 'd+'], dtype=object)
)
point_one_hot = feature_column.indicator_column(point)
demo(point_one_hot)

  

Point column as embedding_column

# Notice the input to the embedding column is the categorical column
# we previously created
point_embedding = feature_column.embedding_column(point, dimension=4)
demo(point_embedding)

 

     When using an indicator column, we’re telling TensorFlow to do exactly what we’ve seen in our categorical product_class example. That is, an indicator column treats each category as an element in a one-hot vector, where the matching category has value 1 and the rest have 0s:

     Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.

     We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.

     Let’s look at an example comparing indicator and embedding columns. Suppose our input examples consist of different words from a limited palette of only 81 words. Further suppose that the data set provides the following input words in 4 separate examples:

  • “dog”
  • “spoon”
  • “scissors”
  • “guitar”

In that case, the following figure illustrates the processing path for embedding columns or indicator columns.

     An embedding column stores categorical data in a lower-dimensional vector than an indicator column. (We just placed random numbers into the embedding vectors; training determines the actual numbers.)

     When an example is processed, one of the categorical_column_with… functions maps the example string to a numerical categorical value. For example, a function maps “spoon” to [32]. (The 32 comes from our imagination — the actual values depend on the mapping function.) You may then represent these numerical categorical values in either of the following two ways:

  • As an indicator column. A function converts each numeric categorical value into an 81-element vector (because our palette consists of 81 words), placing a 1 in the index of the categorical value (0, 32, 79, 80) and a 0 in all the other positions.
  • As an embedding column. A function uses the numerical categorical values (0, 32, 79, 80) as indices to a lookup table. Each slot in that lookup table contains a 3-element vector.

     How do the values in the embeddings vectors magically get assigned? Actually, the assignments happen during training. That is, the model learns the best way to map your input numeric categorical values(categories' indices) to the embeddings vector value in order to solve your problem. Embedding columns increase your model’s capabilities, since an embeddings vector learns new relationships between categories from the training data.
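A small sketch of the two representations for this 81-word example (the word-to-index mapping and the embedding values are made up for illustration; in practice they come from a categorical_column_with... function and from training):

import tensorflow as tf

vocab_size, embedding_dim = 81, 3
word_ids = tf.constant([0, 32, 79, 80])   # "dog", "spoon", "scissors", "guitar"

# indicator-style: one 81-element one-hot vector per word
one_hot = tf.one_hot(word_ids, depth=vocab_size)             # shape (4, 81)

# embedding-style: rows of a trainable (81, 3) lookup table
embedding_table = tf.Variable(tf.random.uniform([vocab_size, embedding_dim]))
embedded = tf.nn.embedding_lookup(embedding_table, word_ids) # shape (4, 3)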

##########################################################

Encoding Categorical Features Using Embeddings

     An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly, so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791]. In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter you can tweak. Since these embeddings are trainable, they will gradually improve during training; and as they represent fairly similar categories, Gradient Descent will certainly end up pushing them closer together, while it will tend to move them away from the "INLAND" category’s embedding (see Figure 13-4). Indeed, the better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called representation learning (we will see other types of representation learning in Chapter 17).

Figure 13-4. Embeddings will gradually improve during training

Word Embeddings
     Not only will embeddings generally be useful representations for the task at hand, but quite often these same embeddings can be reused successfully for other tasks. The most common example of this is word embeddings (i.e., embeddings of individual words): when you are working on a natural language processing task, you are often better off reusing pretrained word embeddings than training your own.

     The idea of using vectors to represent words dates back to the 1960s, and many sophisticated techniques have been used to generate useful vectors, including using neural networks. But things really took off in 2013, when Tomáš Mikolov and other Google researchers published a paper describing an efficient technique to learn word embeddings using neural networks, significantly outperforming previous attempts. This allowed them to learn embeddings on a very large corpus of text: they trained a neural network to predict the words near any given word, and obtained astounding word embeddings. For example, synonyms had very close embeddings, and semantically related words such as France, Spain, and Italy ended up clustered together.
     It’s not just about proximity, though: word embeddings were also organized along meaningful axes in the embedding space. Here is a famous example: if you compute King – Man + Woman (adding and subtracting the embedding vectors of these words), then the result will be very close to the embedding of the word Queen (see Figure 13-5). In other words, the word embeddings encode the concept of gender! Similarly, you can compute Madrid – Spain + France, and the result is close to Paris, which seems to show that the notion of capital city was also encoded in the embeddings.
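As a toy illustration of this arithmetic (the 2D vectors below are invented for the example; real word embeddings are learned from a large corpus and have many more dimensions):

import numpy as np

emb = {"King":  np.array([0.90, 0.80]),
       "Man":   np.array([0.50, 0.20]),
       "Woman": np.array([0.45, 0.75]),
       "Queen": np.array([0.85, 1.35])}

result = emb["King"] - emb["Man"] + emb["Woman"]   # -> [0.85, 1.35]
# the nearest embedding to `result` is the one for "Queen"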

Figure 13-5. Word embeddings of similar words tend to be close, and some axes seem to encode meaningful concepts

     Unfortunately, word embeddings sometimes capture our worst biases. For example, although they correctly learn that Man is to King as Woman is to Queen, they also seem to learn that Man is to Doctor as Woman is to Nurse: quite a sexist bias! To be fair, this particular example is probably exaggerated, as was pointed out in a 2019 paper by Malvina Nissim et al. Nevertheless, ensuring fairness in Deep Learning algorithms is an important and active research topic.

  1. implement embeddings manually

     Let’s look at how we could implement embeddings manually, to understand how they work (then we will use a simple Keras layer instead). First, we need to create an embedding matrix containing each category’s embedding, initialized randomly; it will have one row per category and per oov bucket, and one column per embedding dimension:

len(vocab) + num_oov_buckets : 5 + 2 = 7 rows, one per known category plus one per oov bucket

embedding_dim : the embedding size (the number of columns)

# vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
embedding_dim = 2                  # 5     +       2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
# embed_init
# <tf.Tensor: shape=(7, 2), dtype=float32, numpy=
# array([[0.803156  , 0.49777734],
#        [0.37054038, 0.9118674 ],
#        [0.637642  , 0.18209696],
#        [0.63791955, 0.27701473],
#        [0.04227114, 0.84219384],
#        [0.90637195, 0.222556  ],
#        [0.9198462 , 0.68789077]], dtype=float32)>
embedding_matrix = tf.Variable(embed_init)
embedding_matrix



     In this example we are using 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task and the vocabulary size (you will have to tune this hyperparameter).

     This embedding matrix is a random 7 × 2 matrix, stored in a variable (so it can be tweaked by Gradient Descent during training).

     Now let’s encode the same batch of categorical features as earlier, but this time using these embeddings:

categories = tf.constant( ["NEAR BAY", "DESERT", "INLAND", "INLAND"] )
cat_indices = table.lookup(categories)
cat_indices

tf.nn.embedding_lookup(embedding_matrix, cat_indices)


     The tf.nn.embedding_lookup() function looks up the rows in the embedding matrix, at the given indices—that’s all it does. For example, the lookup table says that the "INLAND" category is at index 1, so the tf.nn.embedding_lookup() function returns the embedding at row 1 in the embedding matrix (twice): [0.37054038, 0.9118674 ]

2. keras.layers.Embedding

     Keras provides a keras.layers.Embedding layer that handles the embedding matrix (trainable by default); when the layer is created it initializes the embedding matrix randomly, and then when it is called with some category indices it returns the rows at those indices in the embedding matrix, which has shape (input_dim, output_dim):

here, output_dim = embedding_dim = 2
and input_dim = len(vocab) + num_oov_buckets = 5 + 2 = 7

embedding = keras.layers.Embedding( input_dim=len(vocab)+num_oov_buckets, output_dim=embedding_dim )
embedding(cat_indices) # [3, 5, 1, 1]

     Putting everything together, we can now create a Keras model that can process categorical features (along with regular numerical features) and learn an embedding for each category (as well as for each oov bucket):

Figure 10-15. Handling multiple inputs

https://blog.csdn.net/Linli522362242/article/details/106582512

regular_inputs = keras.layers.Input(shape=[8]) # a regular input containing 8 numerical features per instance

categories = keras.layers.Input(shape=[], dtype=tf.string) # a categorical input (one categorical feature per instance)
# uses a Lambda layer to look up each category’s index
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats)) (categories)
# looks up the embeddings for these indices
# for example: input_dim=len(vocab)+num_oov_buckets = 5+2 = 7
cat_embed = keras.layers.Embedding(input_dim=7, output_dim=2)(cat_indices)
# concatenates the embeddings and the regular inputs in order to give the encoded inputs, 
# which are ready to be fed to a neural network.
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])

#We could add any kind of neural network at this point, but we just add a dense output layer
outputs = keras.layers.Dense(1)(encoded_inputs)

#create the Keras model
model = keras.models.Model( inputs=[regular_inputs, categories], outputs=[outputs] )

     This model takes two inputs: a regular input containing eight numerical features per instance, plus a categorical input (containing one categorical feature per instance). It uses a Lambda layer to look up each category’s index, then it looks up the embeddings for these indices. Next, it concatenates the embeddings and the regular inputs in order to give the encoded inputs, which are ready to be fed to a neural network. We could add any kind of neural network at this point, but we just add a dense output layer, and we create the Keras model.

     When the keras.layers.TextVectorization layer is available, you can call its adapt() method to make it extract the vocabulary from a data sample, and its call() method will convert each category to its index in the vocabulary (it will take care of creating the lookup table for you). You can then add it to your model so that it performs the index lookup, replacing the Lambda layer in the previous code example, and optionally follow it with a Lambda layer that applies the tf.one_hot() function if you want to convert these indices to one-hot vectors.

     One-hot encoding followed by a Dense layer (with no activation function and no biases) is equivalent to an Embedding layer. However, the Embedding layer uses way fewer computations (the performance difference becomes clear when the size of the embedding matrix grows). The Dense layer’s weight matrix plays the role of the embedding matrix. For example, using one-hot vectors of size 20 and a Dense layer with 10 units is equivalent to using an Embedding layer with input_dim=20 and output_dim=10: the (1 instance, 20 features) one-hot row dotted with the (20, 10) weight matrix simply selects one (10,) row. As a result, it would be wasteful to use more embedding dimensions than the number of units in the layer that follows the Embedding layer; this is also why embedding columns scale better than one-hot indicator columns when the vocabulary is large.
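Here is a minimal sketch verifying this equivalence (a standalone illustration, not code from the book): if we copy the Dense layer's kernel into the Embedding layer, the two produce the same outputs.

import numpy as np
import tensorflow as tf
from tensorflow import keras

vocab_size, embed_dim = 20, 10
indices = tf.constant([3, 7, 3])
one_hot = tf.one_hot(indices, depth=vocab_size)             # shape (3, 20)

dense = keras.layers.Dense(embed_dim, use_bias=False)       # no bias, no activation
embedding = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

dense_out = dense(one_hot)                                  # builds the (20, 10) kernel
embedding(indices)                                          # builds the (20, 10) embedding matrix
embedding.set_weights(dense.get_weights())                  # reuse the Dense kernel as embeddings

print(np.allclose(dense_out.numpy(), embedding(indices).numpy()))  # True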

Now let’s look a bit more closely at the Keras preprocessing layers.
##########################################################

Hashed feature columns

     Another way to represent a categorical column with a large number of values is to use a categorical_column_with_hash_bucket. This feature column calculates a hash value of the input, then selects one of the hash_bucket_size buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash_buckets significantly smaller than the number of actual categories to save space.

Key point: An important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.

     So far, the examples we have worked through contain only a few categories. But when the number of categories is very large, it is impossible to set up a separate category for every word or integer, because that would consume an enormous amount of memory. For such cases we can instead ask: "How many categories am I willing to allocate for my input?" The tf.feature_column.categorical_column_with_hash_bucket function lets you specify exactly that number. For this kind of feature column, the model computes a hash of the input value and then uses the modulo operator to place it into one of hash_bucket_size categories, as in the following pseudocode:

# pseudocode:
# feature_id = hash(raw_feature) % hash_bucket_size

point_hashed = feature_column.categorical_column_with_hash_bucket(
    'point', hash_bucket_size=4
)
demo(feature_column.indicator_column(point_hashed))


# Just an example, it's not used later on
city_hash = tf.feature_column.categorical_column_with_hash_bucket(
    "city", hash_bucket_size=1000
)
city_hash # HashedCategoricalColumn(key='city', hash_bucket_size=1000, dtype=tf.string)

# Note: the housing data has no "city" column, so this column is illustrative only and is not demoed here
# demo_housing( tf.feature_column.indicator_column(city_hash) )

     At this point, you might rightfully think: “This is crazy!” After all, we are forcing the different input values to a smaller set of categories. This means that two probably unrelated inputs will be mapped to the same category, and consequently mean the same thing to the neural network. The following figure illustrates this dilemma, showing that kitchenware and sports both get assigned to category (hash bucket) 12.

     As with many counter-intuitive phenomena in machine learning, it turns out that hashing often works well in practice. That’s because hash categories provide the model with some separation. The model can use additional features to further separate kitchenware from sports.

 

Crossed feature columns

     Combining features into a single feature, better known as feature crosses, enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of marks and grade. Note that crossed_column does not build the full table of all possible combinations (which could be very large). Instead, it is backed by a hashed_column, so you can choose how large the table is.

     More concretely, suppose we want our model to calculate real estate prices in Atlanta, GA. Real-estate prices within this city vary greatly depending on location. Representing latitude and longitude as separate features isn’t very useful in identifying real-estate location dependencies; however, crossing latitude and longitude into a single feature can pinpoint locations. Suppose we represent Atlanta as a grid of 100x100 rectangular sections, identifying each of the 10,000 sections by a feature cross of latitude and longitude. This feature cross enables the model to train on pricing conditions related to each individual section, which is a much stronger signal than latitude and longitude alone.

The following figure shows our plan, with the latitude & longitude values for the corners of the city in red text:

crossed_feature = feature_column.crossed_column([marks_buckets, grade], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

 

 TF 2.0 crossed_column on Windows fails with SystemError: <built-in function TFE_Py_FastPathExecute> returned a result with an error set #28846: https://github.com/tensorflow/tensorflow/issues/28846

https://medium.com/ml-book/demonstration-of-tensorflow-feature-columns-tf-feature-column-3bfcca4ca5c4

Solutions: 

Original code:

import tensorflow as tf             #tf.version: '2.1.0'

from tensorflow import feature_column 
from tensorflow.keras import layers #tf.version: '2.1.0'

data = {'marks': [55,21,63,88,74,54,95,41,84,52],
        'grade': ['average','poor','average','good','good','average','good','average','good','average'],
        'point': ['c','f','c+','b+','b','c','a','d+','b+','c']}
        
# A utility method to show transformation from feature column
def demo(feature_column):
  feature_layer = layers.DenseFeatures(feature_column)
  print(feature_layer(data).numpy())        

marks = feature_column.numeric_column("marks")  
marks_buckets = feature_column.bucketized_column(marks, boundaries=[30,40,50,60,70,80,90])

grade = feature_column.categorical_column_with_vocabulary_list(
    'grade', ['poor', 'average', 'good']
)

crossed_feature = feature_column.crossed_column([marks_buckets, grade], hash_bucket_size=10)
demo(feature_column.indicator_column(crossed_feature))

I just changed the code to run on TensorFlow 1.13.1 (from https://www.katacoda.com/courses/tensorflow/playground):

import tensorflow as tf
from tensorflow import feature_column

sess=tf.Session()#appended

data = {'marks': [55,21,63,88,74,54,95,41,84,52],
        'grade': ['average','poor','average','good','good','average','good','average','good','average'],
        'point': ['c','f','c+','b+','b','c','a','d+','b+','c']}
        
      
marks = feature_column.numeric_column("marks")  
marks_buckets = feature_column.bucketized_column(marks, boundaries=[30,40,50,60,70,80,90])

grade = feature_column.categorical_column_with_vocabulary_list(
    'grade', ['poor', 'average', 'good']
)

crossed_feature = feature_column.crossed_column([marks_buckets, grade], hash_bucket_size=10)

# # A utility method to show transromation from feature column
# def demo(feature_column):
#   feature_layer = layers.DenseFeatures(feature_column)
#   print(feature_layer(data).numpy())
# demo(feature_column.indicator_column(crossed_feature))

inputs = tf.feature_column.input_layer(data, [feature_column.indicator_column(crossed_feature)])
init = tf.global_variables_initializer()

sess.run(tf.tables_initializer())
sess.run(init)
outputs=sess.run(inputs)
print(outputs)

You may create a feature cross from either of the following:

  • Feature names; that is, names from the dict returned from input_fn.
  • Any categorical column, except categorical_column_with_hash_bucket (since crossed_column hashes the input).

     However, a full grid would only be tractable for inputs with limited vocabularies. Instead of building this potentially huge table of inputs, crossed_column only builds the number of buckets requested by the hash_bucket_size argument. The feature column assigns an example to an index by running a hash function on the tuple of inputs, followed by a modulo operation with hash_bucket_size.

     As discussed earlier, performing the hash and modulo function limits the number of categories, but can cause category collisions; that is, multiple (latitude, longitude) feature crosses will end up in the same hash bucket. In practice though, performing feature crosses still adds significant value to the learning capability of your models.

      Somewhat counterintuitively, when creating feature crosses, you typically still should include the original (uncrossed) features in your model (see the sketch after the latitude/longitude cross below). The independent latitude and longitude features help the model distinguish between examples where a hash collision has occurred in the crossed feature.

age_and_ocean_proximity = tf.feature_column.crossed_column(
    [bucketized_age, ocean_proximity], hash_bucket_size=100
)
age_and_ocean_proximity

latitude = tf.feature_column.numeric_column('latitude')
longitude = tf.feature_column.numeric_column('longitude')

bucketized_latitude = tf.feature_column.bucketized_column(
    latitude, boundaries=list( np.linspace(32., 42., 20-1) )
)
bucketized_longitude = tf.feature_column.bucketized_column(
    longitude, boundaries=list( np.linspace(-125., -114., 20-1) )
)

location = tf.feature_column.crossed_column(
    [bucketized_latitude, bucketized_longitude], hash_bucket_size=1000
)
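Following the advice above about keeping the uncrossed features, here is a minimal sketch (an assumption about how you might wire it up, not code from the original post) that feeds both the cross and the original bucketized columns to a DenseFeatures layer:

location_columns = [
    tf.feature_column.indicator_column(location),  # crossed latitude x longitude (categorical, so wrap it)
    bucketized_latitude,                           # keep the original, uncrossed features too
    bucketized_longitude,
]
location_layer = tf.keras.layers.DenseFeatures(location_columns)
# location_layer({"latitude": [[37.88]], "longitude": [[-122.23]]})  # note the Windows crossed_column issue above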

https://www.tensorflow.org/tutorials/structured_data/feature_columns#crossed_feature_columns

3. Passing Feature Columns to an Estimator

As shown below, not all Estimators support every type of feature_columns argument:
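For instance, a hedged sketch (hypothetical, not run in this post): a LinearRegressor accepts categorical columns directly, whereas a DNNRegressor only accepts dense columns (numeric, bucketized, indicator, embedding), so categorical columns must be wrapped first.

def input_fn():
    features = housing.drop(columns=["median_house_value"])
    labels = housing["median_house_value"]
    return tf.data.Dataset.from_tensor_slices((dict(features), labels)).batch(32)

linear_est = tf.estimator.LinearRegressor(
    feature_columns=[bucketized_income, ocean_proximity])        # a categorical column is fine here
dnn_est = tf.estimator.DNNRegressor(
    hidden_units=[32, 32],
    feature_columns=[bucketized_income,
                     tf.feature_column.indicator_column(ocean_proximity)])  # must be wrapped
# linear_est.train(input_fn, max_steps=100)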

Using Feature Columns for Parsing

median_house_value = tf.feature_column.numeric_column("median_house_value")
median_house_value


make_parse_example_spec: Creates parsing spec dictionary from input feature_columns.

# make_parse_example_spec: Creates parsing spec dictionary from input feature_columns.
# similar to tf.io.parse_example(serialized_examples, feature_description)
# https://blog.csdn.net/Linli522362242/article/details/107704824
columns = [housing_median_age, median_house_value]
feature_descriptions = tf.feature_column.make_parse_example_spec(columns)
feature_descriptions

from tensorflow.train import Example, Features, Feature, FloatList # protobuf wrappers used below

with tf.io.TFRecordWriter("my_data_with_features.tfrecords") as f:
                  # selects housing_median_age, median_house_value
    for x,y in zip(X_train[:, 1:2].reshape(-1), y_train.reshape(-1)):
        # https://blog.csdn.net/Linli522362242/article/details/107704824
        # message 'Example' { Features 'features' = 1; };
        example = Example( features=Features(
            # message 'Features' { map<string, Feature> 'feature' = 1; };
            feature={
                "housing_median_age" : Feature( float_list=FloatList(value=[x]) ),
                "median_house_value" : Feature( float_list=FloatList(value=[y]) )
            }
        ))
        f.write(example.SerializeToString())

keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

def parse_example(serialized_examples):
    examples = tf.io.parse_example(serialized_examples, feature_descriptions) # a dict
    targets = examples.pop("median_house_value") # separate the targets
    return examples, targets

 Figure 13-2. Loading and preprocessing data from multiple CSV files

Here, just one .tfrecords file

batch_size = 32
#use a tf.data.TFRecordDataset to read one or more TFRecord files
dataset = tf.data.TFRecordDataset(["my_data_with_features.tfrecords"])#<TFRecordDatasetV2 shapes: (), types: tf.string>
dataset = dataset.repeat().shuffle(buffer_size=10000).batch(batch_size).map(parse_example)
# <MapDataset shapes: ({housing_median_age: (None, 1)}, (None, 1)), types: ({housing_median_age: tf.float32}, tf.float32)>
# columns = [housing_median_age, median_house_value]
columns_without_target = columns[:-1]
model = keras.models.Sequential([
    keras.layers.DenseFeatures(feature_columns=columns_without_target),
    keras.layers.Dense(1)
])

model.compile(loss="mse",
              optimizer=keras.optimizers.SGD(lr=1e-3),
              metrics=["accuracy"])
model.fit(dataset, steps_per_epoch=len(X_train)//batch_size, epochs=5)


Demo

# ocean_proximity_embed = tf.feature_column.embedding_column(ocean_proximity, dimension=2)
# median_income = tf.feature_column.numeric_column("median_income")
# bucketized_income = tf.feature_column.bucketized_column(
#                               #<1.5,<3.,<4.5,<6.,>=6  
#     median_income, boundaries=[1.5, 3., 4.5, 6.]
# )
some_columns = [ocean_proximity_embed, bucketized_income]
dense_features = keras.layers.DenseFeatures(some_columns)
dense_features(
    { "ocean_proximity": [["NEAR OCEAN"], ["INLAND"], ["INLAND"]],
      "median_income": [[3.], [7.2], [1.]]
    }
)

Keras Preprocessing Layers

     The TensorFlow team is working on providing a set of standard Keras preprocessing layers. They will probably be available by the time you read this; however, the API may change slightly by then, so please refer to the notebook for this chapter if anything behaves unexpectedly. This new API will likely supersede the existing Feature Columns API, which is harder to use and less intuitive (if you want to learn more about the Feature Columns API anyway, please check out the notebook for this chapter).

     We already discussed two of these layers: the keras.layers.Normalization layer that will perform feature standardization (it will be equivalent to the Standardization layer we defined earlier), and the TextVectorization layer that will be capable of encoding each word in the inputs into its index in the vocabulary. In both cases, you create the layer, you call its adapt() method with a data sample, and then you use the layer normally in your model. The other preprocessing layers will follow the same pattern.
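A hedged sketch of that adapt-then-use pattern with TextVectorization (in released TensorFlow versions the layer first appeared under keras.layers.experimental.preprocessing.TextVectorization, and it tokenizes whole strings rather than single categories; assumes tf and keras are imported as above):

train_texts = tf.data.Dataset.from_tensor_slices(
    ["more and more basketball", "a few sample sentences"]).batch(2)
text_vec = keras.layers.TextVectorization()  # earlier: keras.layers.experimental.preprocessing.TextVectorization
text_vec.adapt(train_texts)                  # extracts the vocabulary from the data sample
text_vec(tf.constant(["more and more basketball"]))  # -> word indices into the learned vocabulary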

     The API will also include a keras.layers.Discretization layer that will chop continuous data into different bins and encode each bin as a one-hot vector. For example, you could use it to discretize prices into three categories, (low, medium, high), which would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. Of course this loses a lot of information, but in some cases it can help the model detect patterns that would otherwise not be obvious when just looking at the continuous values.
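A hedged sketch of that layer (the argument name changed across TF versions, e.g. bins in the early experimental releases and bin_boundaries later, and the released layer outputs bin indices rather than one-hot vectors, so we one-hot encode afterwards; assumes tf and keras are imported as above):

prices = tf.constant([[8.], [25.], [130.]])
discretize = keras.layers.Discretization(bin_boundaries=[10., 100.])  # low / medium / high
bin_ids = discretize(prices)                                          # [[0], [1], [2]]
price_one_hot = tf.one_hot(tf.squeeze(bin_ids, axis=-1), depth=3)     # [[1,0,0],[0,1,0],[0,0,1]]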

Note

Figure 10-15. Handling multiple inputs

https://blog.csdn.net/Linli522362242/article/details/106582512

regular_inputs = keras.layers.Input(shape=[8]) # a regular input containing 8 numerical features per instance

categories = keras.layers.Input(shape=[], dtype=tf.string) # a categorical input (one categorical feature per instance)
# uses a Lambda layer to look up each category’s index
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats)) (categories)
# looks up the embeddings for these indices
# for example: input_dim=len(vocab)+num_oov_buckets = 5+2 = 7
cat_embed = keras.layers.Embedding(input_dim=7, output_dim=2)(cat_indices)
# concatenates the embeddings and the regular inputs in order to give the encoded inputs, 
# which are ready to be fed to a neural network.
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])

#We could add any kind of neural network at this point, but we just add a dense output layer
outputs = keras.layers.Dense(1)(encoded_inputs)

#create the Keras model
model = keras.models.Model( inputs=[regular_inputs, categories], outputs=[outputs] )

     The Discretization layer will not be differentiable, and it should only be used at the start of your model. Indeed, the model’s preprocessing layers will be frozen during training, so their parameters will not be affected by Gradient Descent, and thus they do not need to be differentiable. This also means that you should not use an Embedding layer directly in a custom preprocessing layer, if you want it to be trainable: instead, it should be added separately to your model, as in the previous code example.

Figure 13-1. Chaining dataset transformations

     It will also be possible to chain multiple preprocessing layers using the PreprocessingStage class. For example, the following code will create a preprocessing pipeline that will first normalize the inputs, then discretize them (this may remind you of Scikit-Learn pipelines https://blog.csdn.net/Linli522362242/article/details/103587172). After you adapt this pipeline to a data sample, you can use it like a regular layer in your models (but again, only at the start of the model, since it contains a nondifferentiable preprocessing layer):

normalization = keras.layers.Normalization()
discretization = keras.layers.Discretization([...])
pipeline = keras.layers.PreprocessingStage([normalization, discretization])
pipeline.adapt(data_sample)

     The TextVectorization layer will also have an option to output word-count vectors instead of word indices. For example, if the vocabulary contains three words, say ["and", "basketball", "more"], then the text "more and more" will be mapped to
the vector [1, 0, 2]: the word "and" appears once, the word "basketball" does not appear at all, and the word "more" appears twice. This text representation is called a bag of words, since it completely loses the order of the words. Common words like
"and" will have a large value in most texts, even though they are usually the least interesting (e.g., in the text "more and more basketball" the word "basketball" is clearly the most important, precisely because it is not a very frequent word). So, the
word counts should be normalized in a way that reduces the importance of frequent words. A common way to do this is to divide each word count by the log of the total number of training instances in which the word appears. This technique is called Term-Frequency × Inverse-Document-Frequency (TF-IDF). For example, let’s imagine that the words "and", "basketball", and "more" appear respectively in 200, 10, and 100 text instances in the training set: in this case, the final vector will be [1/log(200), 0/log(10), 2/log(100)], which is approximately equal to [0.19, 0.,0.43]. The TextVectorization layer will (likely) have an option to perform TF-IDF.
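The arithmetic from this example, as a small sketch (illustration only; the exact weighting the TextVectorization layer will use may differ):

import numpy as np

vocab = ["and", "basketball", "more"]
word_counts = np.array([1., 0., 2.])      # counts in the text "more and more"
doc_counts = np.array([200., 10., 100.])  # training texts containing each word
tf_idf = word_counts / np.log(doc_counts)
print(np.round(tf_idf, 2))                # [0.19 0.   0.43]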

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())
std_layer = Standardization()
std_layer.adapt(data_sample)
model = keras.Sequential()
model.add(std_layer)
[...] # create the rest of the model # https://blog.csdn.net/Linli522362242/article/details/106562190
model.compile([...])
model.fit([...])

     If the standard preprocessing layers are insufficient for your task, you will still have the option to create your own custom preprocessing layer, much like we did earlier with the Standardization class. Create a subclass of the keras.layers.PreprocessingLayer class with an adapt() method, which should take a data_sample argument and optionally an extra reset_state argument: if True, then the adapt() method should reset any existing state before computing the new state; if False, it should try to update the existing state.
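For example, here is a minimal sketch of such a layer (subclassing keras.layers.Layer as a stand-in, since the PreprocessingLayer base class was still experimental at the time; the min-max scaling logic is just an illustration, and numpy/keras are assumed to be imported as above):

class MinMaxScaling(keras.layers.Layer):
    def adapt(self, data_sample, reset_state=True):
        mins = np.min(data_sample, axis=0, keepdims=True)
        maxs = np.max(data_sample, axis=0, keepdims=True)
        if reset_state or not hasattr(self, "mins_"):
            self.mins_, self.maxs_ = mins, maxs     # start from scratch
        else:                                       # update the existing state
            self.mins_ = np.minimum(self.mins_, mins)
            self.maxs_ = np.maximum(self.maxs_, maxs)

    def call(self, inputs):
        return (inputs - self.mins_) / (self.maxs_ - self.mins_ + keras.backend.epsilon())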

     As you can see, these Keras preprocessing layers will make preprocessing much easier! Now, whether you choose to write your own preprocessing layers or use Keras’s (or even use the Feature Columns API), all the preprocessing will be done on the fly. During training, however, it may be preferable to perform preprocessing ahead of time. Let’s see why we’d want to do that and how we’d go about it.

TF Transform

     If preprocessing is computationally expensive, then handling it before training rather than on the fly may give you a significant speedup: the data will be preprocessed just once per instance before training, rather than once per instance and per epoch during training. As mentioned earlier, if the dataset is small enough to fit in RAM, you can use its cache() method. But if it is too large, then tools like Apache Beam or Spark will help. They let you run efficient data processing pipelines over large amounts of data, even distributed across multiple servers, so you can use them to preprocess all the training data before training.

     This works great and indeed can speed up training, but there is one problem: once your model is trained, suppose you want to deploy it to a mobile app. In that case you will need to write some code in your app to take care of preprocessing the data before it is fed to the model. And suppose you also want to deploy the model to TensorFlow.js so that it runs in a web browser? Once again, you will need to write some preprocessing code. This can become a maintenance nightmare: whenever you want to change the preprocessing logic, you will need to update your Apache Beam code, your mobile app code, and your JavaScript code. This is not only time-consuming, but also error-prone: you may end up with subtle differences between the preprocessing operations performed before training and the ones performed in your app or in the browser. This training/serving skew will lead to bugs or degraded performance.

     One improvement would be to take the trained model (trained on data that was preprocessed by your Apache Beam or Spark code) and, before deploying it to your app or the browser, add extra preprocessing layers to take care of preprocessing on the fly. That’s definitely better, since now you just have two versions of your preprocessing code: the Apache Beam or Spark code, and the preprocessing layers’ code.

     But what if you could define your preprocessing operations just once? This is what TF Transform was designed for. It is part of TensorFlow Extended (TFX), an end-to-end platform for productionizing TensorFlow models. First, to use a TFX component such as TF Transform, you must install it; it does not come bundled with TensorFlow. You then define your preprocessing function just once (in Python), by using TF Transform functions for scaling, bucketizing, and more. You can also use any TensorFlow operation you need. Here is what this preprocessing function might look like if we just had two features:

#############################################

TensorFlow Transform

TensorFlow Transform is a library for preprocessing data with TensorFlow. tf.Transform is useful for data that requires a full-pass, such as:

  • Normalize an input value by mean and standard deviation.
  • Convert strings to integers by generating a vocabulary over all input values.
  • Convert floats to integers by assigning them to buckets based on the observed data distribution.

TensorFlow has built-in support for manipulations on a single example or a batch of examples. tf.Transform extends these capabilities to support full-passes over the example data.

The output of tf.Transform is exported as a TensorFlow graph to use for training and serving. Using the same graph for both training and serving can prevent skew since the same transformations are applied in both stages.

For an introduction to tf.Transform, see the tf.Transform section of the TFX Dev Summit talk on TFX (link).

tft.scale_to_z_score
     Returns a standardized column with mean 0 and variance 1.

tft.scale_to_z_score(
    x, elementwise=False, name=None, output_dtype=None
)

tft.compute_and_apply_vocabulary
     Generates a vocabulary for x and maps it to an integer with this vocab.
     Returns: A Tensor or SparseTensor where each string value is mapped to an integer.

tft.compute_and_apply_vocabulary(
    x, default_value=-1, top_k=None, frequency_threshold=None, num_oov_buckets=0,
    vocab_filename=None, weights=None, labels=None, use_adjusted_mutual_info=False,
    min_diff_from_avg=0.0, coverage_top_k=None, coverage_frequency_threshold=None,
    key_fn=None, fingerprint_shuffle=False, name=None
)

https://cloud.google.com/solutions/machine-learning/data-preprocessing-for-ml-with-tf-transform-pt2
#############################################

try:
    import tensorflow_transform as tft
    
    def preprocess(inputs): #inputs = a batch of input features
        median_age = inputs["housing_median_age"]
        ocean_proximity = inputs["ocean_proximity"]
        #tft.scale_to_z_score: Returns a standardized column with mean 0 and variance 1.
        standardized_age = tft.scale_to_z_score( median_age )
        ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
        return {
            "standardized_median_age": standardized_age,
            "ocean_proximity_id": ocean_proximity_id
        }
except ImportError:
    print("TF Transform is not installed. Try running: pip3 install -U tensorflow-transform")

     Next, TF Transform lets you apply this preprocess() function to the whole training set using Apache Beam (it provides an AnalyzeAndTransformDataset class that you can use for this purpose in your Apache Beam pipeline). In the process, it will also compute all the necessary statistics over the whole training set: in this example, the mean and standard deviation of the housing_median_age feature, and the vocabulary for the ocean_proximity feature. The components that compute these statistics are called analyzers.

     Importantly, TF Transform will also generate an equivalent TensorFlow Function that you can plug into the model you deploy. This TF Function includes some constants that correspond to all the necessary statistics computed by Apache Beam (the mean, standard deviation, and vocabulary).
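A heavily hedged sketch of what that Beam step can look like (the metadata setup, temp directory and in-memory data here are assumptions, not code from this post; see the TF Transform documentation for the full pattern):

import apache_beam as beam
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

raw_metadata = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec({
        "housing_median_age": tf.io.FixedLenFeature([], tf.float32),
        "ocean_proximity": tf.io.FixedLenFeature([], tf.string),
    }))

with beam.Pipeline() as pipeline, tft_beam.Context(temp_dir="tft_tmp"):
    raw_data = pipeline | beam.Create([
        {"housing_median_age": 41.0, "ocean_proximity": "NEAR BAY"},
        {"housing_median_age": 21.0, "ocean_proximity": "INLAND"},
    ])
    (transformed_data, transformed_metadata), transform_fn = (
        (raw_data, raw_metadata) | tft_beam.AnalyzeAndTransformDataset(preprocess))
    # transform_fn is the equivalent TensorFlow graph mentioned above; it can be
    # written out with tft_beam.WriteTransformFn and reused for serving.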

     With the Data API, TFRecords, the Keras preprocessing layers, and TF Transform, you can build highly scalable input pipelines for training and benefit from fast and portable data preprocessing in production.

But what if you just wanted to use a standard dataset? Well in that case, things are much simpler: just use TFDS!

The TensorFlow Datasets (TFDS) Project

https://blog.csdn.net/Linli522362242/article/details/108039793
