Common operations in TFT (tensorflow_transform)

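This article shows how to preprocess data with the TensorFlow Transform library: creating metadata, defining a preprocessing function, running the transform, and saving the preprocessing operations. It also covers how to reuse these preprocessing steps later and how to combine the preprocessing with a Keras model.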

import pathlib
import pprint
import tempfile

import tensorflow as tf
import tensorflow_transform as tft

import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

Prepare the data and metadata

raw_data = [
      {'x': 1.0, 'y': 1.0, 's': 'hello'},
      {'x': 2.0, 'y': 2.0, 's': 'world'},
      {'x': 3.0, 'y': 3.0, 's': 'hello'}
  ]

#Method 1: infer the schema with TFDV
import tensorflow_data_validation as tfdv
import pandas as pd
data = pd.DataFrame(raw_data)
stat = tfdv.generate_statistics_from_dataframe(data)
def clearDim(schema,stat):
    for field in data.columns:
        tfdv.get_feature(schema,field).shape.ClearField('dim')
    return schema
#The Dim field in each Shape of the inferred schema must be cleared before the schema can be used with
#tft_beam.AnalyzeAndTransformDataset(preprocessing_fn); otherwise an error is raised
schema = tfdv.infer_schema(stat,max_string_domain_size=0,schema_transformations=[clearDim])
raw_data_metadata = dataset_metadata.DatasetMetadata(
    schema
)

#Method 2: create the metadata manually
# raw_data_metadata = dataset_metadata.DatasetMetadata(
#     schema_utils.schema_from_feature_spec({
#         'y': tf.io.FixedLenFeature([], tf.float32),
#         'x': tf.io.FixedLenFeature([], tf.float32),
#         's': tf.io.FixedLenFeature([], tf.string),
#     }))

Define the preprocessing function

def preprocessing_fn(inputs):
    x = inputs['x']
    y = inputs['y']
    s = inputs['s']
    x_centered = x - tft.mean(x)
    y_normalized = tft.scale_to_0_1(y)
    s_integerized = tft.compute_and_apply_vocabulary(s)
    x_centered_times_y_normalized = (x_centered * y_normalized)
    return {
        'x_centered':x_centered,
        'y_normalized':y_normalized,
        's_integerized':s_integerized,
        'x_centered_times_y_normalized':x_centered_times_y_normalized
    }
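
To make the result of preprocessing_fn easier to check by eye, here is a minimal plain-Python sketch of the same arithmetic on the three rows of raw_data. It is only an analogue, not how TFT implements these operations: the helper name preprocess_by_hand is hypothetical, the sketch assumes the maximum of y is greater than its minimum, and it assumes the vocabulary is ordered by descending frequency without modelling tie-breaking.

```python
from collections import Counter

# Same three rows as raw_data in this article.
rows = [
    {'x': 1.0, 'y': 1.0, 's': 'hello'},
    {'x': 2.0, 'y': 2.0, 's': 'world'},
    {'x': 3.0, 'y': 3.0, 's': 'hello'},
]


def preprocess_by_hand(rows):
    """Plain-Python analogue of preprocessing_fn (illustration only)."""
    xs = [r['x'] for r in rows]
    ys = [r['y'] for r in rows]
    ss = [r['s'] for r in rows]

    # Analysis phase: statistics that need a full pass over the dataset.
    x_mean = sum(xs) / len(xs)
    y_min, y_max = min(ys), max(ys)  # assumes y_max > y_min
    # Assumption: the most frequent string gets index 0; ties are not modelled.
    vocab = {tok: i for i, (tok, _) in enumerate(Counter(ss).most_common())}

    # Transform phase: apply those statistics to each row.
    out = []
    for x, y, s in zip(xs, ys, ss):
        x_centered = x - x_mean
        y_normalized = (y - y_min) / (y_max - y_min)
        out.append({
            'x_centered': x_centered,
            'y_normalized': y_normalized,
            's_integerized': vocab[s],
            'x_centered_times_y_normalized': x_centered * y_normalized,
        })
    return out


result = preprocess_by_hand(rows)
for row in result:
    print(row)
```

The split into an analysis phase and a transform phase mirrors why the transform has to be run over the whole dataset first: the mean, the min/max and the vocabulary are dataset-level values, while the final expressions are per-row.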

Run the transform and write the preprocessing operations to disk

def main(output_dir):
  # Ignore the warnings
  with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
    transformed_dataset, transform_fn = (  # pylint: disable=unused-variable
        (raw_data, raw_data_metadata) | tft_beam.AnalyzeAndTransformDataset(
            preprocessing_fn))

  transformed_data, transformed_metadata = transformed_dataset  # pylint: disable=unused-variable

  # Save the transform_fn to the output_dir
  _ = (
      transform_fn
      | 'WriteTransformFn' >> tft_beam.WriteTransformFn(output_dir))

  return transformed_data, transformed_metadata
#The output directory must be empty, or at least must not already contain files with the same names
output_dir = pathlib.Path('./transform_output')
transformed_data, transformed_metadata = main(str(output_dir))
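
Because the output directory must not contain files left over from an earlier run, one option is to remove it before calling main. The following is a minimal sketch using only the standard library; reset_output_dir is a hypothetical helper name, and it deletes the directory recursively, so it should only be pointed at a directory that holds nothing but transform output.

```python
import pathlib
import shutil
import tempfile


def reset_output_dir(path):
    """Delete `path` recursively if it exists and return it as a Path."""
    path = pathlib.Path(path)
    if path.exists():
        shutil.rmtree(path)
    return path


# Demonstration in a throwaway temp directory, not in ./transform_output.
demo_root = pathlib.Path(tempfile.mkdtemp())
stale = demo_root / 'transform_output'
(stale / 'transform_fn').mkdir(parents=True)  # simulate an earlier run

clean = reset_output_dir(stale)
print(clean.exists())
```

In the notebook this would be used as output_dir = reset_output_dir('./transform_output') before the call to main.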

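Note

#During (manual) experimentation, both the input and the output data are List[Dict]
#After the saved transform is reloaded, however, the input and the output are Dict[List]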
raw_data
[{'x': 1.0, 'y': 1.0, 's': 'hello'},
 {'x': 2.0, 'y': 2.0, 's': 'world'},
 {'x': 3.0, 'y': 3.0, 's': 'hello'}]
transformed_data
[{'s_integerized': 0,
  'x_centered': -1.0,
  'x_centered_times_y_normalized': -0.0,
  'y_normalized': 0.0},
 {'s_integerized': 1,
  'x_centered': 0.0,
  'x_centered_times_y_normalized': 0.0,
  'y_normalized': 0.5},
 {'s_integerized': 0,
  'x_centered': 1.0,
  'x_centered_times_y_normalized': 1.0,
  'y_normalized': 1.0}]
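
Since the two stages use different layouts, it can help to have a small conversion between them. The sketch below is plain Python with hypothetical helper names (rows_to_columns, columns_to_rows), and it assumes every row has the same keys and every column has the same length.

```python
def rows_to_columns(rows):
    """List[Dict] -> Dict[List]; assumes every row has the same keys."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def columns_to_rows(columns):
    """Dict[List] -> List[Dict]; assumes all columns have equal length."""
    keys = list(columns)
    return [dict(zip(keys, values))
            for values in zip(*(columns[k] for k in keys))]


rows = [
    {'x': 1.0, 'y': 1.0, 's': 'hello'},
    {'x': 2.0, 'y': 2.0, 's': 'world'},
    {'x': 3.0, 'y': 3.0, 's': 'hello'},
]
columns = rows_to_columns(rows)
print(columns)
print(columns_to_rows(columns) == rows)
```

The column layout is the one that a batch of tensors naturally takes: each list can be wrapped in tf.constant to obtain one tensor per feature.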
!ls {output_dir}
transform_fn  transformed_metadata
!ls {output_dir}/transformed_metadata
asset_map  schema.pbtxt

Reuse the preprocessing operations

#Loading method 1: I have not worked out how to call the loaded object
loaded = tf.saved_model.load(str(output_dir/'transform_fn'))
loaded.signatures['serving_default']
<ConcreteFunction signature_wrapper(*, inputs_1, inputs, inputs_2) at 0x7F0525A17BB0>
#Loading method 2: used as shown below
tf_transform_output = tft.TFTransformOutput(output_dir)
tft_layer = tf_transform_output.transform_features_layer()
tft_layer
<tensorflow_transform.output_wrapper.TransformFeaturesLayer at 0x7f0525a251c0>
raw_data_batch = {
    's': tf.constant([ex['s'] for ex in raw_data]),
    'x': tf.constant([ex['x'] for ex in raw_data], dtype=tf.float32),
    'y': tf.constant([ex['y'] for ex in raw_data], dtype=tf.float32),
}
transformed_batch = tft_layer(raw_data_batch)
transformed_batch
{'x_centered': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-1.,  0.,  1.], dtype=float32)>,
 's_integerized': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 0])>,
 'y_normalized': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0. , 0.5, 1. ], dtype=float32)>,
 'x_centered_times_y_normalized': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([-0.,  0.,  1.], dtype=float32)>}

Connecting the preprocessing to a model

class StackDict(tf.keras.layers.Layer):
    def call(self,inputs):
        values = [
          tf.cast(v,tf.float32)  for k, v in sorted(inputs.items(),key=lambda kv:kv[0])
        ]
        #Stack the 1-D tensors side by side, so that each field becomes one column
        return tf.stack(values,axis=1)
class TrainedModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.pre = tft_layer
        self.concat = StackDict()
        self.body = tf.keras.Sequential([
            tf.keras.layers.Dense(64,activation='relu'),
            tf.keras.layers.Dense(64,activation='relu'),
            tf.keras.layers.Dense(10)
        ])
    def call(self,inputs,training=None):
        x = self.pre(inputs)
        x = self.concat(x)
        return self.body(x,training=training)
trained_model = TrainedModel()
trained_model_output = trained_model(raw_data_batch)
trained_model_output.shape
TensorShape([3, 10])
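
The shape [3, 10] follows from StackDict turning the four transformed features into a 3 x 4 matrix, which the final Dense(10) layer then maps to 10 outputs per row. As a rough illustration of the stacking step only, here is a plain-Python analogue; stack_columns is a hypothetical name, and the real layer works on tensors with tf.cast and tf.stack rather than on lists.

```python
def stack_columns(features):
    """Plain-Python analogue of StackDict for equal-length 1-D columns."""
    # Sort by feature name so the column order is deterministic,
    # and cast every value to float, like tf.cast(v, tf.float32).
    columns = [[float(v) for v in features[name]] for name in sorted(features)]
    # zip(*columns) pairs up the i-th value of every column: one row per example,
    # which corresponds to stacking along axis=1.
    return [list(row) for row in zip(*columns)]


# Values taken from transformed_batch in this article.
transformed = {
    'x_centered': [-1.0, 0.0, 1.0],
    's_integerized': [0, 1, 0],
    'y_normalized': [0.0, 0.5, 1.0],
    'x_centered_times_y_normalized': [-0.0, 0.0, 1.0],
}
matrix = stack_columns(transformed)
print(sorted(transformed))
for row in matrix:
    print(row)
```

Sorting the keys matters because the model sees only column positions: a different ordering at serving time would silently feed features into the wrong inputs.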
#trained_model.compile(...)
#trained_model.fit(...)
trained_model.save(tempfile.mkdtemp(),save_format='tf')
INFO:tensorflow:Assets written to: /tmp/tmp5919sllk/assets

