2.2 TensorFlow 2 Basics Tutorial: Loading and Preprocessing Data

Latest recommended article published on 2020-10-20 10:12:51

HJZ11

Category column: # Deep Learning 3 - TensorFlow

Article link: https://blog.csdn.net/HJZ11/article/details/108712130

1.CSV

import functools

import numpy as np
import tensorflow as tf

TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)


CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

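# If the first line of the CSV file does not contain column names,
# pass them explicitly with column_names.
dataset = tf.data.experimental.make_csv_dataset(
     ...,
     column_names=CSV_COLUMNS,
     ...)

# To use only a subset of the columns, list them in select_columns.
dataset = tf.data.experimental.make_csv_dataset(
  ...,
  select_columns = columns_to_use, 
  ...)

# The column holding the value that the model should predict must be specified explicitly.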

LABEL_COLUMN = 'survived'
LABELS = [0, 1]

Now read the CSV data from the file and create a dataset.

(For the full documentation, see tf.data.experimental.make_csv_dataset.)

def get_dataset(file_path):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=12, # Kept small by hand so the examples are easier to display
      label_name=LABEL_COLUMN,
      na_value="?",
      num_epochs=1,
      ignore_errors=True)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

# Data preprocessing
# Categorical data
CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}
categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))
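
To make the effect of the categorical columns concrete, the plain-Python sketch below mimics what a vocabulary lookup followed by an indicator (one-hot) encoding produces. The helper `one_hot` is hypothetical and is not part of TensorFlow; it assumes that a value missing from the vocabulary becomes an all-zero vector, which is believed to match the default behaviour of these feature columns.

```python
def one_hot(value, vocabulary):
    """Return a one-hot list for `value`; all zeros if it is not in `vocabulary`."""
    return [1.0 if value == entry else 0.0 for entry in vocabulary]

CLASS_VOCAB = ['First', 'Second', 'Third']  # same vocabulary as CATEGORIES['class']

print(one_hot('Second', CLASS_VOCAB))   # the second position is set
print(one_hot('unknown', CLASS_VOCAB))  # out-of-vocabulary value: all zeros
```

A misspelled vocabulary entry therefore fails silently: every row carrying the real value would be encoded as all zeros.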

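Continuous data
Continuous data needs to be normalized.
Write a function that normalizes these values and reshapes them into a 2-D tensor.

def process_continuous_data(mean, data):
  # Normalize the data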
  data = tf.cast(data, tf.float32) * 1/(2*mean)
  return tf.reshape(data, [-1, 1])
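
The scaling used here is worth checking by hand: each value is divided by twice the column mean, so a value equal to the mean maps to 0.5. The sketch below is a plain-Python analogue of `process_continuous_data` (the helper name `scale_by_twice_mean` is made up for illustration), using the mean listed for 'age' in MEANS.

```python
def scale_by_twice_mean(mean, values):
    """Plain-Python analogue of process_continuous_data: x / (2 * mean)."""
    return [float(v) / (2 * mean) for v in values]

AGE_MEAN = 29.631308  # mean of the 'age' column, as listed in MEANS

scaled = scale_by_twice_mean(AGE_MEAN, [0.0, AGE_MEAN, 2 * AGE_MEAN])
print(scaled)  # zero stays zero, the mean maps to one half, twice the mean to one
```

Note that this is a simple rescaling rather than a z-score: values larger than twice the mean still end up above 1.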

# Now create a collection of numeric columns.
# The tf.feature_column.numeric_column API accepts a normalizer_fn argument.
# Bind the mean of each column with functools.partial so that every column is normalized with its own mean.

MEANS = {
    'age' : 29.631308,
    'n_siblings_spouses' : 0.545455,
    'parch' : 0.379585,
    'fare' : 34.385399
}

numerical_columns = []

for feature in MEANS.keys():
  num_col = tf.feature_column.numeric_column(feature, normalizer_fn=functools.partial(process_continuous_data, MEANS[feature]))
  numerical_columns.append(num_col)

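# Create a preprocessing layer
# Concatenate the two collections of feature columns and pass them to tf.keras.layers.DenseFeatures to create an input layer that performs the preprocessing.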

preprocessing_layer = tf.keras.layers.DenseFeatures(categorical_columns+numerical_columns)

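# Training, evaluation, and prediction
# Now the model can be instantiated and trained.
# The model definition is missing from the post; this small binary classifier is an assumption.
model = tf.keras.Sequential([
  preprocessing_layer,
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

train_data = raw_train_data.shuffle(500)
test_data = raw_test_data

model.fit(train_data, epochs=20)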

2.Numpy

Loading from a .npz file
DATA_URL = 'https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz'

path = tf.keras.utils.get_file('mnist.npz', DATA_URL)
with np.load(path) as data:
  train_examples = data['x_train']
  train_labels = data['y_train']
  test_examples = data['x_test']
  test_labels = data['y_test']

Loading NumPy arrays with tf.data.Dataset
Assuming you have an array of examples and a corresponding array of labels, pass the two arrays as a tuple to tf.data.Dataset.from_tensor_slices to create a tf.data.Dataset.

train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_examples, test_labels))

Using the dataset
Shuffle and batch the dataset
BATCH_SIZE = 64
SHUFFLE_BUFFER_SIZE = 100

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
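
The role of SHUFFLE_BUFFER_SIZE is easier to see with a small model of a buffer-based shuffle. The generator below is a simplified plain-Python sketch of the idea (fill a fixed-size buffer, emit a random element, refill from the input); it is not TensorFlow's implementation, and the name `buffered_shuffle` is made up for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)           # fill phase: nothing is emitted yet
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]               # emit a random buffered element
        buffer[index] = item              # and replace it with the new one
    rng.shuffle(buffer)                   # input exhausted: drain the rest
    yield from buffer

result = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
print(result)
```

With a buffer of 3, an element can only move a limited distance towards the front, so a buffer much smaller than the dataset gives only a local shuffle. A buffer as large as the dataset gives a full shuffle.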

Build and train the model
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(),
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

model.fit(train_dataset, epochs=10)
model.evaluate(test_dataset)

3.pandas.DataFrame

import pandas as pd
import tensorflow as tf

# Download the csv file containing the heart dataset.
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
df = pd.read_csv(csv_file)
df.head()
df.dtypes

Convert the thal column (an object column in the dataframe) to discrete numeric values.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes
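
As a rough picture of what the category codes look like, the sketch below reproduces the mapping in plain Python: the distinct values are sorted and each value is replaced by the index of its category. The function `categorical_codes` and the sample values are hypothetical; pandas does the real work in the snippet above.

```python
def categorical_codes(values):
    """Map each value to the index of its category in sorted order."""
    categories = sorted(set(values))
    index = {category: code for code, category in enumerate(categories)}
    return categories, [index[value] for value in values]

# Hypothetical sample of string values for illustration.
categories, codes = categorical_codes(['fixed', 'normal', 'reversible', 'normal'])
print(categories)
print(codes)
```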

# Read the data with tf.data.Dataset
# Use tf.data.Dataset.from_tensor_slices to read the values from the pandas dataframe.

# One advantage of tf.data.Dataset is that it lets you write simple, highly efficient data pipelines.

target = df.pop('target')

dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))

# A pd.Series can be passed directly wherever a tensor is expected.
tf.constant(df['thal'])

# Shuffle and batch the dataset.
train_dataset = dataset.shuffle(len(df)).batch(1)

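An alternative to feature columns
Passing a dictionary as input to a model is as simple as creating a matching dictionary of tf.keras.layers.Input layers, applying any preprocessing, and stacking them with the functional API. You can use this as an alternative to feature columns.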
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}
x = tf.stack(list(inputs.values()), axis=-1)

x = tf.keras.layers.Dense(10, activation='relu')(x)
output = tf.keras.layers.Dense(1, activation='sigmoid')(x)

model_func = tf.keras.Model(inputs=inputs, outputs=output)

model_func.compile(optimizer='adam',
                   loss='binary_crossentropy',
                   metrics=['accuracy'])

When used with tf.data, the easiest way to preserve the column structure of a pd.DataFrame is to convert it to a dict and slice that dictionary.

dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)

for dict_slice in dict_slices.take(1):
  print (dict_slice)
model_func.fit(dict_slices, epochs=15)
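
Slicing a dictionary of columns means that each element of the dataset is itself a small dictionary with one scalar per column, paired with its target. The plain-Python sketch below shows that structure; `slice_columns` and the sample values are made up for illustration and do not use TensorFlow.

```python
def slice_columns(columns, targets):
    """Yield (row_dict, target) pairs from a dict of equal-length column lists."""
    keys = list(columns)
    for i, target in enumerate(targets):
        yield {key: columns[key][i] for key in keys}, target

columns = {'age': [63, 67], 'thal': [2, 3]}  # hypothetical sample values
rows = list(slice_columns(columns, targets=[0, 1]))
print(rows[0])
print(rows[1])
```

Because the keys survive the slicing, a model whose inputs are named after the columns can pick each feature by name.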

4.Images

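Loading images with tf.data

Download and inspect the dataset
Retrieve the images
Before you start any training, you need a set of images to teach the network the new classes you want it to recognize.
An archive of Creative Commons licensed flower photos has been prepared for initial use.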

import pathlib
data_root_orig = tf.keras.utils.get_file(origin='https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
                                         fname='flower_photos', untar=True)
data_root = pathlib.Path(data_root_orig)
print(data_root)

for item in data_root.iterdir():
  print(item)

import random
all_image_paths = list(data_root.glob('*/*'))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

image_count = len(all_image_paths)
image_count
all_image_paths[:10]

Inspect the images
Now take a quick look at a few images so you know what you are working with:

import os
attributions = (data_root/"LICENSE.txt").open(encoding='utf-8').readlines()[4:]
attributions = [line.split(' CC-BY') for line in attributions]
attributions = dict(attributions)

import IPython.display as display

def caption_image(image_path):
    image_rel = pathlib.Path(image_path).relative_to(data_root)
    return "Image (CC BY 2.0) " + ' - '.join(attributions[str(image_rel)].split(' - ')[:-1])


for n in range(3):
  image_path = random.choice(all_image_paths)
  display.display(display.Image(image_path))
  print(caption_image(image_path))
  print()

Determine the label for each image
List the available labels:

label_names = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
label_names

Assign an index to each label:

label_to_index = dict((name, index) for index, name in enumerate(label_names))
label_to_index

Create a list containing the label index of every file:

all_image_labels = [label_to_index[pathlib.Path(path).parent.name]
                    for path in all_image_paths]

print("First 10 labels indices: ", all_image_labels[:10])

Load and format the images
TensorFlow includes all the tools you need to load and process images:

img_path = all_image_paths[0]
img_path

Here is the raw data:

img_raw = tf.io.read_file(img_path)
print(repr(img_raw)[:100]+"...")

Decode it into an image tensor:

img_tensor = tf.image.decode_image(img_raw)

print(img_tensor.shape)
print(img_tensor.dtype)

img_final = tf.image.resize(img_tensor, [192, 192])
img_final = img_final/255.0
print(img_final.shape)
print(img_final.numpy().min())
print(img_final.numpy().max())

Wrap these steps in simple functions for later use.

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [192, 192])
  image /= 255.0  # normalize to [0,1] range

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)

import matplotlib.pyplot as plt

image_path = all_image_paths[0]
label = all_image_labels[0]

plt.imshow(load_and_preprocess_image(image_path))
plt.grid(False)
plt.xlabel(caption_image(image_path))
plt.title(label_names[label].title())
print()

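Build a tf.data.Dataset
A dataset of images
The easiest way to build a tf.data.Dataset is with the from_tensor_slices method.

Slicing the array of strings gives a dataset of strings:

path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)

Now create a new dataset that loads and formats the images on the fly by mapping load_and_preprocess_image over the dataset of paths.

AUTOTUNE = tf.data.experimental.AUTOTUNE

image_ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)

import matplotlib.pyplot as plt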

plt.figure(figsize=(8,8))
for n, image in enumerate(image_ds.take(4)):
  plt.subplot(2,2,n+1)
  plt.imshow(image)
  plt.grid(False)
  plt.xticks([])
  plt.yticks([])
  plt.xlabel(caption_image(all_image_paths[n]))
plt.show()

A dataset of (image, label) pairs
Using the same from_tensor_slices method, you can build a dataset of labels:

label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(all_image_labels, tf.int64))

for label in label_ds.take(10):
  print(label_names[label.numpy()])

Since the datasets are in the same order, you can zip them together to get a dataset of (image, label) pairs:

image_label_ds = tf.data.Dataset.zip((image_ds, label_ds))

Note: when you have arrays like all_image_labels and all_image_paths, an alternative to tf.data.Dataset.zip is to slice the pair of arrays.

ds = tf.data.Dataset.from_tensor_slices((all_image_paths, all_image_labels))

# The tuples are unpacked into the positional arguments of the mapped function
def load_and_preprocess_from_path_label(path, label):
  return load_and_preprocess_image(path), label

image_label_ds = ds.map(load_and_preprocess_from_path_label)
image_label_ds

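Basic methods for training
To train a model with this dataset, you will want the data:

To be well shuffled.
To be batched.
To repeat forever.
To have batches available as soon as possible.

These features can be added easily with the tf.data API.
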
BATCH_SIZE = 32

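# Setting the shuffle buffer size as large as the dataset ensures that the data
# is completely shuffled.
ds = image_label_ds.shuffle(buffer_size=image_count)
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# `prefetch` lets the dataset fetch batches in the background while the model is training.
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

# Alternative: the fused shuffle_and_repeat transformation
# (deprecated in later TensorFlow releases in favor of shuffle followed by repeat).
ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))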
ds = ds.batch(BATCH_SIZE)
ds = ds.prefetch(buffer_size=AUTOTUNE)
ds

Pass the dataset to the model
Set the MobileNet weights to be non-trainable:

mobile_net = tf.keras.applications.MobileNetV2(input_shape=(192, 192, 3), include_top=False)
mobile_net.trainable=False

Before passing the inputs to the MobileNet model, you need to convert their range from [0,1] to [-1,1]:

def change_range(image,label):
  return 2*image-1, label

keras_ds = ds.map(change_range)
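
The two rescaling steps (dividing by 255 in preprocess_image, then `2*image-1` in change_range) combine into one linear map from raw byte values to [-1, 1]. A minimal plain-Python check of the endpoints, with a made-up helper name `byte_to_signed_unit`:

```python
def byte_to_signed_unit(byte_value):
    """Map a pixel value in [0, 255] to [-1, 1] via [0, 1]."""
    unit = byte_value / 255.0   # same scaling as preprocess_image
    return 2 * unit - 1         # same shift as change_range

for raw in (0, 255):
    print(raw, '->', byte_to_signed_unit(raw))
```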

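MobileNet returns a 6x6 spatial grid of features for each image.

Pass it a batch of images to see the result:

# The dataset may take a few seconds to start, as it fills its shuffle buffer.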
image_batch, label_batch = next(iter(keras_ds))

feature_map_batch = mobile_net(image_batch)
print(feature_map_batch.shape)

(32, 6, 6, 1280)

model = tf.keras.Sequential([
  mobile_net,
  tf.keras.layers.GlobalAveragePooling2D(),
  tf.keras.layers.Dense(len(label_names), activation = 'softmax')])
logit_batch = model(image_batch).numpy()

print("min logit:", logit_batch.min())
print("max logit:", logit_batch.max())
print()

print("Shape:", logit_batch.shape)

Compile the model to describe the training procedure:

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='sparse_categorical_crossentropy',
              metrics=["accuracy"])
model.summary()

Note that for demonstration purposes only 3 steps are run per epoch here; normally you would compute the real number of steps, as shown below, before passing it to model.fit():
steps_per_epoch=tf.math.ceil(len(all_image_paths)/BATCH_SIZE).numpy()
steps_per_epoch
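
The same arithmetic can be checked without TensorFlow. The sketch below assumes a hypothetical dataset of 3670 images and the batch size of 32 used above; the ceiling accounts for the final, partially filled batch.

```python
import math

def steps_per_epoch_for(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch_for(3670, 32))  # 3670 is an assumed image count
```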

model.fit(ds, epochs=1, steps_per_epoch=3)
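
The timing snippets in this section call `timeit(ds)`, which the post never defines. The helper below is one possible sketch, not the author's version: it works on any iterable of batches, discards one warm-up batch, and reports throughput. The defaults `steps=100` and `batch_size=32` are assumptions.

```python
import itertools
import time

def timeit(ds, steps=100, batch_size=32):
    """Time up to `steps` batches from an iterable and return images per second."""
    iterator = iter(ds)
    next(iterator)  # warm-up batch, e.g. while a shuffle buffer fills
    start = time.perf_counter()
    consumed = sum(1 for _ in itertools.islice(iterator, steps))
    duration = time.perf_counter() - start
    rate = batch_size * consumed / duration if duration > 0 else float('inf')
    print('{} batches: {:.4f} s, {:.1f} images/s'.format(consumed, duration, rate))
    return rate

# Stand-in for a dataset: any iterable works for a quick check.
rate = timeit(range(10), steps=5)
```

Bounding the number of steps matters because the datasets here repeat forever.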

The performance of the current dataset is:

ds = image_label_ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
ds
timeit(ds)

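Caching
Use tf.data.Dataset.cache to easily cache computations across epochs. This is especially efficient when the data fits in memory.

Here the images are cached after being preprocessed (decoded and resized):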

ds = image_label_ds.cache()
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(buffer_size=AUTOTUNE)
ds
timeit(ds)

If the data does not fit in memory, use a cache file:

ds = image_label_ds.cache(filename='./cache.tf-data')
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds = ds.batch(BATCH_SIZE).prefetch(1)
ds
timeit(ds)

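TFRecord files
Raw image data
TFRecord files are a simple format for storing a sequence of binary blobs. By packing multiple examples into the same file, TensorFlow can read multiple examples at once, which is especially important for performance when using a remote storage service such as GCS.

First, build a TFRecord file from the raw image data: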

image_ds = tf.data.Dataset.from_tensor_slices(all_image_paths).map(tf.io.read_file)
tfrec = tf.data.experimental.TFRecordWriter('images.tfrec')
tfrec.write(image_ds)

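# Read the TFRecord file back and decode/resize each image with preprocess_image.
image_ds = tf.data.TFRecordDataset('images.tfrec').map(preprocess_image)

Zip that dataset with the label dataset you defined earlier to get the expected (image, label) pairs: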

ds = tf.data.Dataset.zip((image_ds, label_ds))
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds=ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
ds

Serialized tensors
To save some preprocessing to the TFRecord file, first make a dataset of the processed images, as before:

paths_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
image_ds = paths_ds.map(load_and_preprocess_image)
image_ds

To serialize this to a TFRecord file, first convert the dataset of tensors to a dataset of strings:

ds = image_ds.map(tf.io.serialize_tensor)
ds

Then write the serialized dataset to the TFRecord file:

tfrec = tf.data.experimental.TFRecordWriter('images.tfrec')
tfrec.write(ds)

With the preprocessing cached, data can be loaded from the TFRecord file efficiently; just remember to deserialize the tensors before using them:

ds = tf.data.TFRecordDataset('images.tfrec')

def parse(x):
  result = tf.io.parse_tensor(x, out_type=tf.float32)
  result = tf.reshape(result, [192, 192, 3])
  return result

ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
ds

Now add the labels and apply the same standard operations as before:

ds = tf.data.Dataset.zip((ds, label_ds))
ds = ds.apply(
  tf.data.experimental.shuffle_and_repeat(buffer_size=image_count))
ds=ds.batch(BATCH_SIZE).prefetch(AUTOTUNE)
ds
timeit(ds)

5.Text

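Loading text data with tf.data
Use tf.data.TextLineDataset to load text files

The text files used in this tutorial have already undergone some typical preprocessing, mainly removing document headers and footers, line numbers, and chapter titles. Download these lightly modified files.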

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)
  
parent_dir = os.path.dirname(text_dir)

parent_dir

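Load the text into datasets
Iterate through the files, loading each one into its own dataset.

Each example needs to be labeled individually, so use tf.data.Dataset.map to assign a label to each one. This iterates over every example in the dataset and returns (example, label) pairs.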

def labeler(example, index):
  return example, tf.cast(index, tf.int64)  

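labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
  labeled_data_sets.append(labeled_dataset)

# Combine the labeled datasets into one and shuffle it (used below as all_labeled_data).
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
all_labeled_data = all_labeled_data.shuffle(50000, reshuffle_each_iteration=False)  # assumed buffer size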

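Encode the text as numbers
Machine learning models work on numbers, not words, so the strings need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

Build the vocabulary
First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are several ways to do this in both TensorFlow and Python. In this tutorial:

Iterate over each example's numpy value.
Use tfds.features.text.Tokenizer to split it into tokens.
Collect these tokens into a Python set to remove duplicates.
Get the size of the vocabulary for later use.

import tensorflow_datasets as tfds

tokenizer = tfds.features.text.Tokenizer()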

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
  some_tokens = tokenizer.tokenize(text_tensor.numpy())
  vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

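Encode the examples
Create an encoder by passing vocabulary_set to tfds.features.text.TokenTextEncoder. The encoder's encode method takes a line of text and returns a list of integers.

encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

# The encoding step and the constants below are missing from the post and are assumed here.
BUFFER_SIZE, BATCH_SIZE, TAKE_SIZE = 50000, 64, 5000
def encode_map_fn(text, label):
  encoded = tf.py_function(lambda t: encoder.encode(t.numpy()), inp=[text], Tout=tf.int64)
  encoded.set_shape([None])
  return encoded, label
all_encoded_data = all_labeled_data.map(encode_map_fn)

train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)
# Padding introduces the extra id 0, so the vocabulary grows by one.
vocab_size += 1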
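
The tokenize, encode, and pad steps can be illustrated end to end in plain Python. The sketch below is a stand-in for the tfds tokenizer and encoder, not their implementation: the regular-expression tokenizer, the sample lines, and the helper names are all assumptions. Token ids start at 1 so that 0 stays free for padding.

```python
import re

def tokenize(line):
    """Split a line into alphanumeric tokens (a rough stand-in for a real tokenizer)."""
    return re.findall(r'\w+', line)

lines = ['sing O goddess', 'the anger of Achilles']  # hypothetical sample lines
vocabulary = sorted({token for line in lines for token in tokenize(line)})
token_to_id = {token: i + 1 for i, token in enumerate(vocabulary)}  # 0 is reserved for padding

def encode(line):
    """Turn a line of text into a list of integer token ids."""
    return [token_to_id[token] for token in tokenize(line)]

def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]

batch = pad_batch([encode(line) for line in lines])
print(batch)
```

Because the ids run from 0 (padding) up to the vocabulary size, an embedding layer needs one more row than there are distinct tokens.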

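# Build the model
model = tf.keras.Sequential()

# The first layer converts integer representations to dense vector embeddings.
model.add(tf.keras.layers.Embedding(vocab_size, 64))
# The next layer is an LSTM, which lets the model understand words in their context.
# A bidirectional wrapper on the LSTM helps it learn how each data point relates to
# the data points that came before and after it.

model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
# Finally there is a series of one or more densely connected layers, the last of which
# is the output layer. The output layer produces a probability for each label; the
# label with the highest probability is the model's prediction.

# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3, activation='softmax'))
# Finally, compile the model. For a softmax classification model, sparse_categorical_crossentropy
# is typically used as the loss function. You can try other optimizers, but adam is very common.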

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_data, epochs=3, validation_data=test_data)
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))

6.Unicode

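Introduction
Models that process natural language often handle different languages with different character sets. Unicode is a standard encoding system that is used to represent characters from almost all languages. Each character is encoded using a unique integer code point between 0 and 0x10FFFF. A Unicode string is a sequence of zero or more code points.

This tutorial shows how to represent Unicode strings in TensorFlow and manipulate them using Unicode equivalents of standard string ops. It separates Unicode strings into tokens based on script detection.

The tf.string data type
The basic TensorFlow tf.string dtype allows you to build tensors of byte strings. Unicode strings are utf-8 encoded by default.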

tf.constant(u"Thanks 😊")

tf.constant([u"You're", u"welcome!"]).shape

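Representing Unicode
There are two standard ways to represent a Unicode string in TensorFlow:

string scalar: the sequence of code points is encoded using a known character encoding.
int32 vector: each position contains a single code point.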

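Converting between representations
TensorFlow provides operations to convert between these different representations:

tf.strings.unicode_decode: converts an encoded string scalar to a vector of code points.
tf.strings.unicode_encode: converts a vector of code points to an encoded string scalar.
tf.strings.unicode_transcode: converts an encoded string scalar to a different encoding.

# Example inputs (not defined in the original post): the same string as a
# UTF-8 encoded scalar and as a vector of code points.
text_utf8 = tf.constant(u"语言处理")
text_chars = tf.constant([ord(char) for char in u"语言处理"])

tf.strings.unicode_decode(text_utf8,
                          input_encoding='UTF-8')
tf.strings.unicode_encode(text_chars,
                          output_encoding='UTF-8')
tf.strings.unicode_transcode(text_utf8,
                             input_encoding='UTF8',
                             output_encoding='UTF-16-BE')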

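Unicode operations
Character length
The tf.strings.length operation has a unit parameter that indicates how lengths should be computed. unit defaults to "BYTE", but it can be set to other values, such as "UTF8_CHAR" or "UTF16_CHAR", to determine the number of Unicode code points in each encoded string.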
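
The difference between counting bytes and counting code points can be seen with Python's built-in string handling alone. This sketch reuses the "Thanks 😊" string from earlier and does not call TensorFlow.

```python
text = 'Thanks 😊'

num_bytes = len(text.encode('utf-8'))  # length of the UTF-8 encoded byte string
num_code_points = len(text)            # number of Unicode code points

print('{} bytes; {} code points'.format(num_bytes, num_code_points))
```

The seven ASCII characters take one byte each, while the emoji takes four bytes but counts as a single code point.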

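Character substrings
Similarly, the tf.strings.substr operation accepts the unit parameter and uses it to determine what kind of offsets the pos and len parameters contain.

Splitting Unicode strings
The tf.strings.unicode_split operation splits Unicode strings into substrings of individual characters.

Byte offsets for characters
To align the character tensor generated by tf.strings.unicode_decode with the original string, it is useful to know the offset at which each character begins. The method tf.strings.unicode_decode_with_offsets is similar to unicode_decode, except that it returns a second tensor containing the start offset of each character.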
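
To see what such offsets look like, the plain-Python sketch below walks a string and records the byte position at which each character starts in its UTF-8 encoding. The function `utf8_offsets` is a made-up helper for illustration, not a TensorFlow API.

```python
def utf8_offsets(text):
    """Return (code_point, byte_offset) pairs for each character of `text`."""
    pairs, offset = [], 0
    for char in text:
        pairs.append((ord(char), offset))
        offset += len(char.encode('utf-8'))  # advance by this character's byte width
    return pairs

for code_point, offset in utf8_offsets('🎈🎉🎊'):
    print('At byte offset {}: codepoint {}'.format(offset, code_point))
```

Each of these emoji occupies four bytes in UTF-8, so consecutive characters start four bytes apart.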

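Unicode scripts
Each Unicode code point belongs to a single collection of code points known as a script. A character's script helps determine which language the character might be in. For example, knowing that 'Б' is in the Cyrillic script indicates that modern text containing that character is likely from a Slavic language such as Russian or Ukrainian.

TensorFlow provides the tf.strings.unicode_script operation to determine which script a given code point uses. The script codes are int32 values corresponding to International Components for Unicode (ICU) UScriptCode values.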

uscript = tf.strings.unicode_script([33464, 1041])  # ['芸', 'Б']

print(uscript.numpy())  # [17, 8] == [USCRIPT_HAN, USCRIPT_CYRILLIC]

7.TF.Text

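Introduction
TensorFlow Text provides a collection of text-related classes and ops ready to use with TensorFlow 2.0. The library can perform the preprocessing regularly required by text-based models, and it includes other features useful for sequence modeling that core TensorFlow does not provide.

The benefit of using these ops in your text preprocessing is that they run inside the TensorFlow graph. You do not need to worry about tokenization in training differing from tokenization at inference, or about managing preprocessing scripts.
TensorFlow Text requires TensorFlow 2.0 and is fully compatible with eager mode and graph mode.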

!pip install -q tensorflow-text

8.TFRecord and tf.Example

The TFRecord format is a simple format for storing a sequence of binary records.

Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data.

The tf.train.Example message (or protobuf) is a flexible message type that represents a {"string": value} mapping. It is designed for use with TensorFlow and is used throughout the higher-level APIs such as TFX.
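
The idea of "a sequence of binary records" can be sketched with a simple length-prefixed layout. This is a deliberate simplification and not the actual TFRecord format: real TFRecord files additionally store CRC checksums for the length and the payload, which are omitted here. The function names are made up for this illustration.

```python
import io
import struct

def write_records(stream, records):
    """Write each bytes record preceded by its length as a little-endian uint64."""
    for record in records:
        stream.write(struct.pack('<Q', len(record)))
        stream.write(record)

def read_records(stream):
    """Yield the records written by write_records, in order."""
    while True:
        header = stream.read(8)
        if len(header) < 8:       # end of stream
            return
        (length,) = struct.unpack('<Q', header)
        yield stream.read(length)

buffer = io.BytesIO()
write_records(buffer, [b'first', b'', b'third record'])
buffer.seek(0)
print(list(read_records(buffer)))
```

Because every record carries its own length, a reader can stream through the file without knowing anything about the payload, which in TensorFlow is typically a serialized tf.train.Example.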
