Deep Learning in Production with TFX: Part 1


The rise of artificial intelligence has become omnipresent in recent years: state-of-the-art models are open-sourced on a daily basis, and companies are fighting for the best data scientists and machine learning engineers, all with one goal in mind: creating tremendous value by leveraging the power of AI. Sounds great, but reality is harsh. Models, in general, don’t make it to production, and even if they do, inconsistencies along the way prevent your model from doing the one thing it was supposed to do: generate value. Let’s solve that. Once and for all. Unleash the power of Tensorflow Extended by building a production-ready sentiment analysis model! (Want to move to the coding part straight away? → link_to_colab)


So what is this thing called Tensorflow Extended? According to Google: “A TFX pipeline is a sequence of components that implement an ML pipeline which is specifically designed for scalable, high-performance machine learning tasks. That includes modeling, training, serving inference, and managing deployments to online, native mobile, and JavaScript targets.” Wait a minute. You’re telling me that I can use Google’s open-sourced, end-to-end machine learning platform, capable of building highly scalable machine learning pipelines that can be deployed to nearly every environment out there? Indeed! Now, before we dive in, it is important to know that this ecosystem is pretty big, so it can be quite overwhelming at first. But let’s not rush it and take it one step at a time, focusing on the topics of this story: data ingestion and data validation.


Data Ingestion

Every machine learning pipeline starts off from the same place: it needs to get data into the pipeline. Tensorflow Extended (TFX) provides several options for ingesting data into the pipeline (e.g. CsvExampleGen and ImportExampleGen). The one we will be working with is ImportExampleGen. ImportExampleGen works with TFRecords, which is just a file type for storing… data (mind blown, right?!). The amazing thing is that you can point to a directory containing only TFRecords, and ImportExampleGen will combine all files and generate a training and validation dataset for you when the full pipeline is run.
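If TFRecords are new to you, here is a purely illustrative plain-Python sketch of the core idea: a flat file of length-prefixed serialized records. Note that this is not the actual format specification (the real TFRecord format also stores CRC checksums and the payloads are protobuf-serialized tf.train.Example messages); the function names are made up for this sketch.

```python
# Illustrative sketch of a length-prefixed record file, the idea
# underlying TFRecords. Not a real TFRecord reader/writer.
import struct

def write_records(path, records):
    """Write raw byte payloads as length-prefixed records."""
    with open(path, "wb") as f:
        for payload in records:
            f.write(struct.pack("<Q", len(payload)))  # 8-byte little-endian length
            f.write(payload)

def read_records(path):
    """Read length-prefixed records back into a list of byte payloads."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            out.append(f.read(length))
    return out
```

Because each record is self-delimiting, a reader can stream through arbitrarily large files record by record, which is exactly what makes the format attractive for sharded training data.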


# imports needed for data ingestion
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.utils.dsl_utils import external_input

# location of tfrecords (can be on Google Cloud Platform (GCP) or locally)
# create a pointer to the files
examples = external_input(tfrecord_dir)

# create tfx data ingestion component
example_gen = ImportExampleGen(input=examples)

That was easy. Instead of installing Scikit-Learn and using its famous train_test_split function, we used TFX’s powerful data ingestion component to create a training and evaluation set. Not only was it extremely straightforward, but it also creates these datasets from a span of files that could potentially contain vast amounts of data scattered across many files.
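To make the splitting idea concrete, here is a hedged, plain-Python sketch (not the actual TFX implementation) of how an ExampleGen-style component can split records into train and eval sets deterministically: hash a stable record key into buckets, e.g. two buckets for train and one for eval. The function name and keys are invented for illustration.

```python
# Illustrative sketch of deterministic hash-bucket splitting,
# the idea behind ExampleGen's train/eval split. Not TFX code.
import hashlib

def assign_split(record_key: str, train_buckets: int = 2, total_buckets: int = 3) -> str:
    """Deterministically map a record key to 'train' or 'eval' (2:1 by default)."""
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total_buckets
    return "train" if bucket < train_buckets else "eval"

# the same key always lands in the same split, so re-running the
# pipeline never leaks evaluation examples into the training set
splits = {key: assign_split(key) for key in ["review_001", "review_002", "review_003"]}
```

The benefit over a random shuffle is reproducibility: adding new files later does not reshuffle previously assigned records.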


Data Validation

Data validation is perhaps the most powerful component of the entire TFX ecosystem. Why, you might wonder? Well, the beauty of TFX is that every module can be used on its own. Of course, the integration of all components is what truly makes TFX an end-to-end machine learning platform, but even if you are not using Tensorflow to create your machine learning model, it can be tremendously valuable to understand your data, generate a schema describing which features to expect, and find anomalies over time. To do so, we’ll be relying on Tensorflow Data Validation (TFDV).


StatisticsGen

The first step for every data scientist in building a machine learning model revolves around Exploratory Data Analysis (EDA). Tensorflow Data Validation takes away some of the burden for us by providing a component called StatisticsGen. This component consumes the output of any of the provided ExampleGen components (remember the first part of this story, where we introduced the ImportExampleGen). These components are linked by connecting the output of the ExampleGen to the input of the StatisticsGen, creating, in essence, a small pipeline. You might wonder, how? Let us see:


# Import our beloved StatisticsGen component
from tfx.components.statistics_gen.component import StatisticsGen

# create the StatisticsGen instance, linking the output of our
# example_gen to the input of the StatisticsGen
examples = example_gen.outputs['examples']
statistics_gen = StatisticsGen(examples=examples)

The results of this component are stored in a temporary directory containing the statistics in TFRecord format for both the training and evaluation datasets. Before visualizing the statistics, let us inspect the directory containing the results.


# In case you are running this in a notebook:
statistics_dir = statistics_gen.outputs['statistics'].get()[0].uri

# inspect directory
!tree {statistics_dir}

>>
├── eval
│   └── stats_tfrecord
└── train
    └── stats_tfrecord

Great. The output contains two directories with TFRecords holding our beloved statistics. Now it is time to visualize these stats and gain some understanding of the dataset. Throughout this example, I have used the IMDB sentiment dataset, and for the sake of brevity, I’ll focus on showing the statistics for the label class.


from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# useful during development in a notebook
context = InteractiveContext()

# show statistics
context.show(statistics_gen.outputs['statistics'])
Image: The visualized output from StatisticsGen for one “feature”.

The “zeros” column has been marked red. In our case, this is expected, as negative sentiment is encoded as 0. In many scenarios, however, it might be troublesome to have a feature where a significant share of its values equals zero. I’d highly recommend trying it out for yourself and playing with the results from StatisticsGen to fully capture its value.
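To make that check concrete, here is a small, purely illustrative Python sketch (not TFDV code) of the kind of statistic behind the “zeros” column: compute the fraction of zero values per feature and flag features above a threshold. Function names, feature names, and the threshold are all invented for this example.

```python
# Illustrative sketch of the "zeros" statistic StatisticsGen reports.
def zero_fraction(values):
    """Return the fraction of entries that are exactly zero."""
    if not values:
        return 0.0
    return sum(1 for v in values if v == 0) / len(values)

def flag_mostly_zero(features, threshold=0.5):
    """Return names of features whose zero-fraction exceeds the threshold."""
    return [name for name, vals in features.items()
            if zero_fraction(vals) > threshold]

features = {
    "label": [0, 1, 0, 0, 1, 0],            # 0 = negative sentiment: expected
    "review_length": [120, 87, 240, 95],    # no zeros: nothing to flag
}
suspicious = flag_mostly_zero(features)     # only "label" crosses the threshold
```

Whether a flagged feature is actually a problem depends on its semantics, which is exactly why TFDV shows you the statistic rather than deciding for you.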


As mentioned before, TFX is an end-to-end ML platform. As such, even though the components can be used on a stand-alone basis, the true value comes from their integration. Therefore, let’s move on to the following part and use StatisticsGen’s output as the input for the next component: SchemaGen. SchemaGen might not seem that important (why all the fuss about generating a schema, right?), but it provides the foundation for parts further down the pipeline. So how to use it? Simple:


# import SchemaGen component
from tfx.components import SchemaGen

# use output of StatisticsGen as input for SchemaGen.
# We will not infer the shape of the features as we are working with text data.
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False
)

# run the component
context.run(schema_gen)

The output of this component is a .pbtxt file containing the schema. Further down in the pipeline, we can use this schema to validate our input data and detect anomalies. In my opinion, this is the point where things become really interesting, so let’s do a deep dive here. First, we’ll use TFX’s ExampleValidator component to chain the outputs of the SchemaGen and StatisticsGen together and check if there are any anomalies. Secondly, we’ll look at a real business example and see how we can use Tensorflow Data Validation (TFDV) to find anomalies in our “newly” obtained data from customers interacting with the IMDB website.
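Before running the real components, here is a hedged, plain-Python sketch of the core idea behind schema-based validation: compare observed feature types against expected types and report mismatches. The function name, schema shape, and records are invented for illustration; the actual ExampleValidator operates on protobuf schemas and dataset statistics, not raw dictionaries.

```python
# Illustrative sketch of schema-based type validation. Not TFDV/TFX code.
def validate_types(schema: dict, records: list) -> list:
    """Return anomaly messages for values whose type differs from the schema."""
    anomalies = []
    for record in records:
        for feature, expected_type in schema.items():
            value = record.get(feature)
            if value is not None and not isinstance(value, expected_type):
                anomalies.append(
                    f"'{feature}': expected {expected_type.__name__} "
                    f"but got {type(value).__name__}")
    return anomalies

# a schema expecting an integer label catches a stray string label
schema = {"class": int, "text": str}
records = [{"class": 1, "text": "great movie"},
           {"class": "pos", "text": "loved it"}]   # wrong type for "class"
issues = validate_types(schema, records)
```

The real payoff, as we will see below, is running exactly this kind of check automatically against data that arrives after the schema was frozen.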


Case 1 is, again, really straightforward. As TFX is such an integrated platform, we can do the following:


# import the ExampleValidator component
from tfx.components import ExampleValidator

# instantiate ExampleValidator and pass the outputs of StatisticsGen
# and SchemaGen to its inputs
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema']
)

# run and show the results
context.run(example_validator)
context.show(example_validator.outputs['anomalies'])

>>
No anomalies found.

That’s great! We didn’t find any anomalies. Now let’s look at a more intriguing case. Suppose, for now, that we are not interested in ML (we are, but we are not ;)). The objective is to see whether there are any anomalies between two datasets collected at different times. We have already generated a schema in a .pbtxt file and stored our data in two different folders:


  1. ./data/jan/
  2. ./data/feb/

So now our objectives are as follows:


  • we want to load in the schema from the schema.pbtxt file
  • we want to generate statistics for both collected datasets
  • we want to use the schema to validate whether there are any anomalies

This involves a bit more than chaining two TFX components together, but the value will become clear rather quickly.


# import third-party packages
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
import tensorflow_data_validation as tfdv

# get path to the predefined schema
# (remember: if we used SchemaGen, the path can be obtained via
# os.path.join(schema_gen.outputs['schema'].get()[0].uri, 'schema.pbtxt'))
path_to_schema = "schema_dir/schema.pbtxt"

# init schema class and read schema content
schema = schema_pb2.Schema()
contents = file_io.read_file_to_string(path_to_schema)

# parse schema
schema = text_format.Parse(contents, schema)

Great! We have loaded our protobuf schema! Objective 1: done. The second part is about generating statistics for both datasets. Sure, no issue:


import os

# specify tfrecord directories
# (these paths can also point to a GCS bucket, e.g. gs://my_bucket/sentiment/tfrecords/day1)
monthly_business_data_dir = ["./data/jan", "./data/feb"]

# generate statistics from tfrecords
jan_tfrecord = os.path.join(monthly_business_data_dir[0], "*.tfrecord*")
feb_tfrecord = os.path.join(monthly_business_data_dir[1], "*.tfrecord*")

statistics_month_jan = tfdv.generate_statistics_from_tfrecord(jan_tfrecord)
statistics_month_feb = tfdv.generate_statistics_from_tfrecord(feb_tfrecord)

Perfect! Objective 2: completed. At this point, we have loaded the schema and generated statistics captured from different operating months (Jan and Feb). Let’s validate whether any anomalies are present. We used the base_line_dataset to generate the schema, so we’ll not be using those results directly for comparison. However, we do use the schema generated from the base_line_stats to check for anomalies.


# check anomalies (note: we validate the generated statistics, not the raw tfrecords)
anomalies = tfdv.validate_statistics(
    statistics=statistics_month_feb,
    schema=schema
)

# display results
tfdv.display_anomalies(anomalies)

>>
No anomalies found.

Great. We didn’t find any anomalies. Can we make it a bit more interesting and see what would happen in case there are anomalies? Sure. Suppose that we had used base_line_stats to generate a schema in which the feature called “class” was of type INT (in our case it is a bytes-string, but for now assume it is of type INT). Let us make a change to our schema to reflect this:


# 1 = Bytes-String, 2 = INT
# update schema so the feature 'class' is expected to be of type INT
tfdv.get_feature(schema, 'class').type = 2

# inspect whether the schema has correctly been updated
tfdv.get_feature(schema, 'class')

At this point, the schema generated from the “base_line_dataset” expects the feature “class” to be of type INT. In the new_dataset_stats, this feature happens to be of type Bytes-String. So what would happen if we now check for anomalies?


# check anomalies
anomalies = tfdv.validate_statistics(
    statistics=statistics_month_feb,
    schema=schema
)

# display results
tfdv.display_anomalies(anomalies)

>>
Anomaly short description    Anomaly long description
Feature 'class'              Expected data of type: INT but got STRING

How cool! Using our schema, we validated that our new dataset has some anomalies which are not in line with our expectations.


Conclusion

If you have come this far, I salute you. I hope you have learnt a great deal already, and I highly recommend exploring the TFX platform yourself. This was only the beginning of the TFX pipeline, and I hope you can see the value of building a production-ready machine learning pipeline! Stay tuned for my next story, in which I’ll focus on Tensorflow Transform: one of the most powerful components, allowing us to include preprocessing steps directly in the TF graph!


About ML6

We are a team of AI experts and the fastest-growing AI company in Belgium. With offices in Ghent, Amsterdam, Berlin, and London, we build and implement self-learning systems across different sectors to help our clients operate more efficiently. We do this by staying on top of research, innovation, and applying our expertise in practice. To find out more, please visit www.ml6.eu


Translated from: https://blog.ml6.eu/deep-learning-in-production-with-tfx-part-1-7066aee34b53
