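Once a model has been trained, or while it is still training, we want to assess how well it performs in practice. Model evaluation has two parts:
- Defining the evaluation criteria, i.e. the metrics that measure model performance (e.g. Accuracy, Recall_5)
- Writing the evaluation code that loads the data, runs inference, scores the results against the ground truth (GT), and saves the scores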
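I. Metrics
(1) Metrics: the criteria used to measure model performance, such as the F1 score or IoU.
TF-slim provides a set of metric operations that make model evaluation very convenient. TF-slim computes a metric value in three steps:
- Initialization: initialize the variables used to compute the metric
- Aggregation: run the operations used to compute the metric (sums, etc.)
- Finalization: (optional) run any final operation needed to produce the metric value, for example computing a mean, min, or max.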
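For example, to compute mean_absolute_error, TF-slim proceeds as follows:
- Initialization: set the variables count=0 and total=0
- Aggregation: given some predictions and labels (e.g. one batch), compute the absolute errors, add them to total, and increase count by the number of values observed
- Finalization: divide total by count to obtain mean_absolute_error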
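The three steps can be sketched in plain Python, without TensorFlow. This is only an illustration of the idea, not TF-slim's implementation; the class name `StreamingMeanAbsoluteError` and its methods are hypothetical. Note that count is advanced by the number of values seen, so the final division yields a per-example mean.

```python
class StreamingMeanAbsoluteError:
    """Hypothetical plain-Python stand-in for a streaming MAE metric."""

    def __init__(self):
        # Step 1, initialization: both accumulators start at zero.
        self.total = 0.0
        self.count = 0

    def update(self, predictions, labels):
        # Step 2, aggregation: add this batch's absolute errors to the totals.
        errors = [abs(p - l) for p, l in zip(predictions, labels)]
        self.total += sum(errors)
        self.count += len(errors)

    def result(self):
        # Step 3, finalization: divide the running total by the running count.
        return self.total / self.count if self.count else 0.0


metric = StreamingMeanAbsoluteError()
metric.update([1.0, 2.0], [1.5, 2.0])  # absolute errors: 0.5 and 0.0
metric.update([3.0], [1.0])            # absolute error: 2.0
mae = metric.result()                  # (0.5 + 0.0 + 2.0) / 3
```

Because the state persists between calls to `update`, the metric can be fed one batch at a time and read at any point.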
(2) An example of defining metrics
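```python
import tensorflow as tf  # TensorFlow 1.x; tf.contrib is not available in 2.x

slim = tf.contrib.slim

images, labels = LoadTestData(...)
predictions = MyModel(images)

mae_value_op, mae_update_op = slim.metrics.streaming_mean_absolute_error(predictions, labels)
# streaming_mean_relative_error requires a normalizer as its third argument.
# Assumption: labels is used as the normalizer here.
mre_value_op, mre_update_op = slim.metrics.streaming_mean_relative_error(predictions, labels, labels)
# Assumption: mean_relative_errors is a tensor of per-example relative errors
# defined elsewhere.
pl_value_op, pl_update_op = slim.metrics.streaming_percentage_less(mean_relative_errors, 0.3)
```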
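Creating a metric returns two values: value_op and update_op
- value_op: an idempotent operation that returns the current value of the metric
- update_op: performs the aggregation step described above and then returns the value of the metric (for example, to accumulate the metric inside a step loop)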
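The difference between the two ops can be mimicked with a pair of plain Python closures. This is a rough sketch under the assumption that the metric is a simple streaming mean; `make_streaming_mean`, `value_fn`, and `update_fn` are hypothetical names, not part of TF-slim.

```python
def make_streaming_mean():
    """Return a (value_fn, update_fn) pair sharing one hidden state."""
    state = {"total": 0.0, "count": 0}

    def value_fn():
        # Reading never changes the state, so repeated calls agree.
        return state["total"] / state["count"] if state["count"] else 0.0

    def update_fn(batch):
        # Updating folds a new batch into the state, then reports the value.
        state["total"] += sum(batch)
        state["count"] += len(batch)
        return value_fn()

    return value_fn, update_fn


value_fn, update_fn = make_streaming_mean()
after_first = update_fn([1.0, 3.0])  # state now holds total=4.0, count=2
read_once = value_fn()
read_twice = value_fn()              # same as read_once: reading is idempotent
after_second = update_fn([5.0])      # state now holds total=9.0, count=3
```

Calling the reader any number of times leaves the result unchanged, while each call to the updater moves the metric forward by one batch.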
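(3) Two convenience functions for managing metrics
Once several metrics are defined, keeping track of each metric's value_op and update_op becomes tedious. TF-slim provides two functions for managing them; they simply collect the value_ops and update_ops of multiple metrics into two lists or two dictionaries: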
```python
# Aggregates the value and update ops in two lists:
value_ops, update_ops = slim.metrics.aggregate_metrics(
    slim.metrics.streaming_mean_absolute_error(predictions, labels),
    slim.metrics.streaming_mean_squared_error(predictions, labels))

# Aggregates the value and update ops in two dictionaries:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "eval/mean_absolute_error": slim.metrics.streaming_mean_absolute_error(predictions, labels),
    "eval/mean_squared_error": slim.metrics.streaming_mean_squared_error(predictions, labels),
})
```
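To make the "two lists or two dictionaries" idea concrete, here is a small plain-Python sketch of the regrouping. It is not the TF-slim source; `split_pairs` and `split_pair_map` are hypothetical helpers, and strings stand in for the real ops. Both assume at least one metric is passed in.

```python
def split_pairs(*value_update_pairs):
    """Turn (value, update) pairs into a list of values and a list of updates."""
    values, updates = zip(*value_update_pairs)
    return list(values), list(updates)


def split_pair_map(names_to_pairs):
    """Turn {name: (value, update)} into {name: value} and {name: update}."""
    names_to_values = {name: pair[0] for name, pair in names_to_pairs.items()}
    names_to_updates = {name: pair[1] for name, pair in names_to_pairs.items()}
    return names_to_values, names_to_updates


# Strings stand in for the ops a real metric would return.
mae_pair = ("mae_value", "mae_update")
mse_pair = ("mse_value", "mse_update")

values, updates = split_pairs(mae_pair, mse_pair)
by_name_values, by_name_updates = split_pair_map({
    "eval/mean_absolute_error": mae_pair,
    "eval/mean_squared_error": mse_pair,
})
```

The dictionary form is usually the more convenient one, since the names can be reused later when printing or summarizing the results.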
(4) An example with multiple metrics
```python
import tensorflow as tf  # TensorFlow 1.x; tf.contrib is not available in 2.x
import tensorflow.contrib.slim.nets as nets

slim = tf.contrib.slim
vgg = nets.vgg

# Load the data
images, labels = load_data(...)

# Define the network (vgg_16 returns the logits and a dict of end points)
predictions, _ = vgg.vgg_16(images)

# Choose the metrics to compute:
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    "eval/mean_absolute_error": slim.metrics.streaming_mean_absolute_error(predictions, labels),
    "eval/mean_squared_error": slim.metrics.streaming_mean_squared_error(predictions, labels),
})

# Evaluate the model using 1000 batches of data:
num_batches = 1000

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Metric accumulators are local variables, so they need their own initializer.
    sess.run(tf.local_variables_initializer())

    # Each run of the update ops aggregates one more batch.
    for batch_id in range(num_batches):
        sess.run(list(names_to_updates.values()))

    # Read the final metric values once all batches have been aggregated.
    metric_values = sess.run(list(names_to_values.values()))
    for metric, value in zip(names_to_values.keys(), metric_values):
        print('Metric %s has value: %f' % (metric, value))
```
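II. Evaluation loop
To simplify evaluation, TF-slim provides an evaluation module (evaluation.py) with helper functions for writing evaluation code that uses the metrics defined in the metric_ops.py module. One of these functions runs the evaluation periodically: it computes the metrics over batches of data, prints the metric values to standard output, and saves them as summaries: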
```python
import math

import tensorflow as tf  # TensorFlow 1.x; tf.contrib is not available in 2.x

slim = tf.contrib.slim

# Load the data
images, labels = load_data(...)

# Define the network
predictions = MyModel(images)

# Choose the metrics to compute.
# The streaming_* metrics each return a (value_op, update_op) pair.
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map({
    'accuracy': slim.metrics.streaming_accuracy(predictions, labels),
    'precision': slim.metrics.streaming_precision(predictions, labels),
    'recall': slim.metrics.streaming_recall(predictions, labels),
})

# Create the summary ops such that they also print out to std output:
summary_ops = []
for metric_name, metric_value in names_to_values.items():
    op = tf.summary.scalar(metric_name, metric_value)
    op = tf.Print(op, [metric_value], metric_name)
    summary_ops.append(op)

num_examples = 10000
batch_size = 32
num_batches = int(math.ceil(num_examples / float(batch_size)))

# Setup the global step.
slim.get_or_create_global_step()

checkpoint_dir = ...  # Where the model checkpoints are read from.
log_dir = ...  # Where the summaries are stored.
eval_interval_secs = ...  # How often to run the evaluation.
slim.evaluation.evaluation_loop(
    'local',
    checkpoint_dir,
    log_dir,
    num_evals=num_batches,
    eval_op=list(names_to_updates.values()),
    summary_op=tf.summary.merge(summary_ops),
    eval_interval_secs=eval_interval_secs)
```
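The batch count in the example is rounded up so that the final, partially filled batch is still evaluated. A quick check of that arithmetic in plain Python, using the same numbers (10000 examples, batch size 32); the helper name `num_eval_batches` is hypothetical:

```python
import math


def num_eval_batches(num_examples, batch_size):
    """Number of batches needed to cover every example at least once."""
    return int(math.ceil(num_examples / float(batch_size)))


num_batches = num_eval_batches(10000, 32)
# Examples left over for the final batch once the full batches are consumed.
last_batch_size = 10000 - (num_batches - 1) * 32
```

With these numbers, 312 full batches cover 9984 examples, so a 313th batch is needed for the remaining 16.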