Deep Learning Notes: TensorFlow Framework (Part 8) [Logging and Monitoring Basics with tf.contrib.learn]

Logging and Monitoring Basics with tf.contrib.learn

When training a model, it's often valuable to track and evaluate progress in real time. In this tutorial, you'll learn how to use TensorFlow's logging capabilities and the Monitor API to audit the in-progress training of a neural network classifier for categorizing irises. This tutorial builds on the code developed in the tf.contrib.learn Quickstart (https://www.tensorflow.org/versions/r0.12/tutorials/tflearn/index.html), so if you haven't yet completed that tutorial, you may want to explore it first, especially if you're looking for an intro/refresher on tf.contrib.learn basics.
Here is the code from the tf.contrib.learn Quickstart:

# coding=utf-8
'''
@author: smile
'''
import tensorflow as tf
import numpy as np

IRIS_TRAINING = "iris_training.csv"
IRIS_TEST = "iris_test.csv"

# Load datasets
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING, target_dtype=np.int, features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TEST, target_dtype=np.int, features_dtype=np.float32)

# Specify that all features have real-valued data
feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]

# Build a 3-layer DNN with 10, 20, 10 hidden units respectively
classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model")

# Fit model
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000)

# Evaluate accuracy
accuracy_score = classifier.evaluate(x=test_set.data,
                                     y=test_set.target)["accuracy"]
print("Accuracy: {0:f}".format(accuracy_score))

# Classify two new flower samples
new_samples = np.array(
    [[6.4, 3.2, 4.5, 1.5], [5.8, 3.1, 5.0, 1.7]], dtype=float)
y = classifier.predict(new_samples)
print("Predictions: {}".format(str(y)))

Copy the above code into a file, and download the corresponding training (http://download.tensorflow.org/data/iris_training.csv) and test (http://download.tensorflow.org/data/iris_test.csv) data sets to the same directory.

In the following sections, you'll progressively make updates to the above code to add logging and monitoring capabilities. Final code incorporating all updates is available for download here (https://github.com/tensorflow/tensorflow/blob/r0.12/tensorflow/examples/tutorials/monitors/iris_monitors.py).
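If you'd rather fetch the data files programmatically than download them by hand, here is a minimal sketch (an addition of this write-up, not part of the tutorial code; it assumes Python 3, where urlretrieve lives in urllib.request):

# Minimal sketch for downloading the data sets if they aren't already present.
# Assumes Python 3; on Python 2, use urllib.urlretrieve instead.
import os
import urllib.request

IRIS_TRAINING = "iris_training.csv"
IRIS_TRAINING_URL = "http://download.tensorflow.org/data/iris_training.csv"
IRIS_TEST = "iris_test.csv"
IRIS_TEST_URL = "http://download.tensorflow.org/data/iris_test.csv"

for filename, url in [(IRIS_TRAINING, IRIS_TRAINING_URL),
                      (IRIS_TEST, IRIS_TEST_URL)]:
    if not os.path.exists(filename):
        # Download the CSV to the current working directory
        urllib.request.urlretrieve(url, filename)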

Overview

The tf.contrib.learn Quickstart tutorial walked through how to implement a neural net classifier to categorize Iris examples into one of three species.
But when the code from this tutorial is run, the output contains no logging that tracks how model training is progressing; you see only the results of the print statements that were included:

Accuracy: 0.933333
Predictions: [1 2]

Without any logging, model training feels like a bit of a black box; you can't see what's happening as TensorFlow steps through gradient descent, get a sense of whether the model is converging appropriately, or audit to determine whether early stopping (https://en.wikipedia.org/wiki/Early_stopping) might be appropriate.

One way to address this problem would be to split model training into multiple fit calls with smaller numbers of steps in order to evaluate accuracy more progressively. However, this is not recommended practice, as it greatly slows down model training. Fortunately, tf.contrib.learn offers another solution: a Monitor API (https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.learn.monitors.html) designed to help you log metrics and evaluate your model while training is in progress. In the following sections, you'll learn how to enable logging in TensorFlow, set up a ValidationMonitor to do streaming evaluations, and visualize your metrics using TensorBoard.
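For contrast, a sketch of that discouraged approach (an illustration of this write-up, not code from the tutorial) would look something like the following; each fit call has to restore and re-save model state from the checkpoint directory, which is where the slowdown comes from:

# Discouraged: progressive evaluation via many small fit calls.
# Every call reloads the model from its checkpoint and saves it again,
# which greatly slows down training overall.
for _ in range(20):
    classifier.fit(x=training_set.data, y=training_set.target, steps=100)
    accuracy = classifier.evaluate(x=test_set.data,
                                   y=test_set.target)["accuracy"]
    print("Accuracy: {0:f}".format(accuracy))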

Enabling Logging with TensorFlow

TensorFlow uses five different levels for log messages. In order of ascending severity, they are DEBUG, INFO, WARN, ERROR, and FATAL. When you configure logging at any of these levels, TensorFlow will output all log messages corresponding to that level and all levels of higher severity. For example, if you set a logging level of ERROR, you'll get log output containing ERROR and FATAL messages, and if you set a level of DEBUG, you'll get log messages from all five levels.
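As a quick illustration of this filtering (a sketch, using the tf.logging helpers that ship with this release):

# With the threshold at ERROR, only ERROR and FATAL messages are emitted.
tf.logging.set_verbosity(tf.logging.ERROR)
tf.logging.info("This INFO message is suppressed.")
tf.logging.error("This ERROR message is printed.")

# Lowering the threshold to DEBUG lets messages from all five levels through.
tf.logging.set_verbosity(tf.logging.DEBUG)
tf.logging.info("Now INFO messages are printed too.")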

By default, TensorFlow is configured at a logging level of WARN, but when tracking model training, you'll want to adjust the level to INFO, which will provide additional feedback as fit operations are in progress.

Add the following line to the beginning of your code (right after your imports):

tf.logging.set_verbosity(tf.logging.INFO)

Now when you run the code, you'll see additional log output like the following:

INFO:tensorflow:step = 855017, loss = 2.87982e-05
INFO:tensorflow:global_step/sec: 503.671
INFO:tensorflow:step = 855117, loss = 2.87923e-05 (0.197 sec)
INFO:tensorflow:global_step/sec: 477.156
INFO:tensorflow:step = 855217, loss = 2.87863e-05 (0.209 sec)

With INFO-level logging, tf.contrib.learn automatically outputs training-loss metrics to stderr after every 100 steps.

Configuring a ValidationMonitor for Streaming Evaluation

Logging training loss is helpful to get a sense whether your model is converging, but what if you want further insight into what's happening during training? tf.contrib.learn provides several high-level Monitors you can attach to your fit operations to further track metrics and/or debug lower-level TensorFlow operations during model training, including:

CaptureVariable: saves a specified variable's values into a collection over the first N steps of training
PrintTensor: logs a specified tensor's values every N steps of training
SummarySaver: saves Summary protocol buffers for a given tensor using a SummaryWriter every N steps of training
ValidationMonitor: logs a specified set of evaluation metrics every N steps of training and, if desired, implements early stopping under certain conditions
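For example, a sketch of attaching one of these monitors (this assumes the tf.contrib.learn.monitors.PrintTensor constructor in this release takes a list of tensor names plus an every_n interval; the tensor name used here is purely illustrative and must match a tensor that actually exists in your model's graph):

# Sketch: log a named tensor's values every 50 steps of training.
print_monitor = tf.contrib.learn.monitors.PrintTensor(
    tensor_names=["centered_bias"],  # illustrative name only
    every_n=50)

classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000,
               monitors=[print_monitor])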

Evaluating Every N Steps

For the Iris neural network classifier, while logging training loss, you might also want to simultaneously evaluate against test data to see how well the model is generalizing. You can accomplish this by configuring a ValidationMonitor with the test data (test_set.data and test_set.target), and setting how often to evaluate with every_n_steps. The default value of every_n_steps is 100; here, set every_n_steps to 50 to evaluate after every 50 steps of model training:

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50)

Place this code right before the line instantiating the classifier.

ValidationMonitors rely on saved checkpoints to perform evaluation operations, so you'll want to modify instantiation of the classifier to add a RunConfig that includes save_checkpoints_secs, which specifies how many seconds should elapse between checkpoint saves during training. Because the Iris data set is quite small, and thus trains quickly, it makes sense to set save_checkpoints_secs to 1 (saving a checkpoint every second) to ensure a sufficient number of checkpoints:

classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                            hidden_units=[10, 20, 10],
                                            n_classes=3,
                                            model_dir="/tmp/iris_model",
                                            config=tf.contrib.learn.RunConfig(
                                                save_checkpoints_secs=1))

NOTE: The model_dir parameter specifies an explicit directory (/tmp/iris_model) for model data to be stored; this directory path will be easier to reference later on than an autogenerated one. Each time you run the code, any existing data in /tmp/iris_model will be loaded, and model training will continue where it left off in the last run (e.g., running the script twice in succession will execute 4000 steps during training, 2000 during each fit operation). To start over model training from scratch, delete /tmp/iris_model before running the code.
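A minimal sketch for doing that cleanup at the top of the script (an addition of this write-up, not part of the tutorial code):

# Delete any existing model data so each run trains from scratch.
# ignore_errors=True avoids an exception when the directory doesn't exist yet.
import shutil
shutil.rmtree("/tmp/iris_model", ignore_errors=True)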

Finally, to attach your validation_monitor, update the fit call to include a monitors param, which takes a list of all monitors to run during model training:

classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=2000,
               monitors=[validation_monitor])

Now, when you rerun the code, you should see validation metrics in your log output, e.g.:

INFO:tensorflow:Validation (step 2909035): loss = 0.679562, accuracy = 0.966667, global_step = 2909031

Customizing the Evaluation Metrics

By default, if no evaluation metrics are specified, ValidationMonitor will log both loss (https://en.wikipedia.org/wiki/Loss_function) and accuracy, but you can customize the list of metrics that will be run every 50 steps. The tf.contrib.metrics module (https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.metrics.html) provides a variety of additional metric functions for classification models that you can use out of the box with ValidationMonitor, including streaming_precision and streaming_recall. To specify the exact metrics you'd like to run in each evaluation pass, add a metrics param to the ValidationMonitor constructor. metrics takes a dict of key/value pairs, where each key is the name you'd like logged for the metric, and the corresponding value is the function that calculates it.

Revise the ValidationMonitor constructor as follows to add logging for precision and recall, in addition to accuracy (loss is always logged, and doesn't need to be explicitly specified):

# Define custom validation metrics.
# The MetricSpec constructor accepts four parameters:
# metric_fn: the function that calculates and returns the value of a metric.
#   This can be a predefined function available in the tf.contrib.metrics
#   module, such as tf.contrib.metrics.streaming_precision or
#   tf.contrib.metrics.streaming_recall. Alternatively, you can define your
#   own custom metric function, which must take predictions and labels tensors
#   as arguments (a weights argument can optionally also be supplied). The
#   function must return the value of the metric in one of two formats:
#     - a single tensor, or
#     - a pair of ops (value_op, update_op), where value_op returns the metric
#       value and update_op performs the corresponding operation to update
#       internal model state.
# prediction_key: the key of the tensor containing the predictions returned by
#   the model. This argument may be omitted if the model returns either a
#   single tensor or a dict with a single entry. For a DNNClassifier model,
#   class predictions are returned in a tensor with the key
#   tf.contrib.learn.PredictionKey.CLASSES.
# label_key: the key of the tensor containing the labels returned by the
#   model, as specified by the model's input_fn. As with prediction_key, this
#   argument may be omitted if the input_fn returns either a single tensor or
#   a dict with a single entry. In the Iris example in this tutorial, the
#   DNNClassifier does not have an input_fn (x, y data is passed directly to
#   fit), so it's not necessary to provide a label_key.
# weights_key: optional; the key of the tensor (returned by the input_fn)
#   containing weights input for the metric_fn.

validation_metrics = {
    "accuracy":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_accuracy,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
    "precision":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_precision,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES),
    "recall":
        tf.contrib.learn.MetricSpec(
            metric_fn=tf.contrib.metrics.streaming_recall,
            prediction_key=tf.contrib.learn.PredictionKey.CLASSES)
}

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics)
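As the comments above note, metric_fn can also be a function you write yourself. Here is a minimal, optional sketch (the name streaming_error_rate is hypothetical, layered on tf.contrib.metrics.streaming_accuracy; add the dict entry before the ValidationMonitor is constructed so the monitor picks it up):

# Hypothetical custom metric: error rate, derived from streaming accuracy.
# Returns the (value_op, update_op) pair that MetricSpec accepts.
def streaming_error_rate(predictions, labels, weights=None):
    value_op, update_op = tf.contrib.metrics.streaming_accuracy(
        predictions, labels, weights=weights)
    return 1.0 - value_op, update_op

validation_metrics["error_rate"] = tf.contrib.learn.MetricSpec(
    metric_fn=streaming_error_rate,
    prediction_key=tf.contrib.learn.PredictionKey.CLASSES)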

Rerun the code, and you should see precision and recall included in your log output, e.g.:

INFO:tensorflow:Saving dict for global step 8601646: accuracy = 0.966667, global_step = 8601646, loss = 0.70374, precision = 1.0, recall = 1.0

Early Stopping with ValidationMonitor

Note that in the above log output, the model has already achieved precision and recall rates of 1.0 quite early in training. This raises the question as to whether model training could benefit from early stopping.

In addition to logging eval metrics, ValidationMonitors make it easy to implement early stopping when specified conditions are met, via three params:

early_stopping_metric: the metric that triggers early stopping (e.g., loss or accuracy) under the conditions specified in early_stopping_rounds and early_stopping_metric_minimize; default is "loss"
early_stopping_metric_minimize: True if desired model behavior is to minimize the value of early_stopping_metric; False if desired model behavior is to maximize it; default is True
early_stopping_rounds: a number of steps during which, if early_stopping_metric does not decrease (if early_stopping_metric_minimize is True) or increase (if False), training will be stopped; default is None, meaning early stopping never occurs

The following revision to the ValidationMonitor constructor specifies that if loss (early_stopping_metric="loss") does not decrease (early_stopping_metric_minimize=True) over a period of 200 steps (early_stopping_rounds=200), model training will stop immediately at that point, and not complete the full 2000 steps specified in fit:

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    test_set.data,
    test_set.target,
    every_n_steps=50,
    metrics=validation_metrics,
    early_stopping_metric="loss",
    early_stopping_metric_minimize=True,
    early_stopping_rounds=200)

Rerun the code, and you should see output like the following:

INFO:tensorflow:Validation (step 1450): recall = 1.0, accuracy = 0.966667, global_step = 1431, precision = 1.0, loss = 0.0550445
INFO:tensorflow:Stopping. Best step: 1150 with loss = 0.0506100878119.

Indeed, here training stops at step 1450, indicating that for the past 200 steps, loss did not decrease, and that overall, step 1150 produced the smallest loss value against the test data set. This suggests that additional calibration of hyperparameters by decreasing the step count might further improve the model.
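For example, one might simply cap the step count near the best step reported above (a sketch; the value 1150 comes from this particular run's log and will differ across runs):

# Sketch: after clearing /tmp/iris_model, retrain with the step count
# reduced to the best step found by early stopping in the previous run.
classifier.fit(x=training_set.data,
               y=training_set.target,
               steps=1150,
               monitors=[validation_monitor])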

Visualizing Log Data with TensorBoard

Reading through the log produced by ValidationMonitor provides plenty of raw data on model performance during training, but it may also be helpful to see visualizations of this data to get further insight into trends, for example, how accuracy is changing over step count. You can use TensorBoard (a separate program packaged with TensorFlow) to plot graphs like this by setting the logdir command-line argument to the directory where you saved your model training data (here, /tmp/iris_model). Run the following on your command line:

tensorboard --logdir=/tmp/iris_model/

[TensorBoard startup output, reporting the URL it is serving on (by default, port 6006)]
Then load the provided URL (here, http://0.0.0.0:6006) in your browser. If you click on the accuracy field, you’ll see an image like the following, which shows accuracy plotted against step count:
[Figure: TensorBoard plot of accuracy against step count]

For more on using TensorBoard, see TensorBoard: Visualizing Learning(https://www.tensorflow.org/versions/r0.12/how_tos/summaries_and_tensorboard/index.html) and TensorBoard: Graph Visualization(https://www.tensorflow.org/versions/r0.12/how_tos/graph_viz/index.html).
