Detectron2 ships with several DatasetEvaluator implementations that compute metrics using standard dataset-specific APIs (e.g. COCO, LVIS). COCOEvaluator can evaluate AP (Average Precision) for box detection, instance segmentation, and keypoint detection on any custom dataset. SemSegEvaluator can evaluate semantic segmentation metrics on any custom dataset.
How do we implement a custom evaluator?
The `DatasetEvaluator` and `DatasetEvaluators` classes:
```python
class DatasetEvaluator:
    def reset(self):
        # Called before inference starts; clear any accumulated state.
        pass

    def process(self, inputs, outputs):
        # Called on each batch of inputs and the corresponding model outputs.
        pass

    def evaluate(self):
        # Called after all batches are processed; return the final metrics.
        pass
```
(The docstrings are omitted here.)
Implementation of `process`:
```python
# inputs (list): the inputs that's used to call the model.
# outputs (list): the return value of `model(inputs)`
for input_, output in zip(inputs, outputs):
    # do evaluation on single input/output pair
    ...
```
Implementation of `evaluate`: it returns a dict mapping metric names to values.
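As a concrete illustration, the Detectron2 evaluation tutorial contains a minimal evaluator that just counts how many instances were detected; it assumes detection-style outputs that carry an `"instances"` field:

```python
from detectron2.evaluation import DatasetEvaluator

class Counter(DatasetEvaluator):
    def reset(self):
        self.count = 0

    def process(self, inputs, outputs):
        for output in outputs:
            self.count += len(output["instances"])

    def evaluate(self):
        # save self.count somewhere, or print it, or return it.
        return {"count": self.count}
```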
```python
class DatasetEvaluators(DatasetEvaluator):
    def __init__(self, evaluators):
        super().__init__()
        self._evaluators = evaluators

    def reset(self):
        for evaluator in self._evaluators:
            evaluator.reset()

    def process(self, inputs, outputs):
        for evaluator in self._evaluators:
            evaluator.process(inputs, outputs)

    def evaluate(self):
        results = OrderedDict()
        for evaluator in self._evaluators:
            result = evaluator.evaluate()
            if is_main_process() and result is not None:
                for k, v in result.items():
                    assert (
                        k not in results
                    ), "Different evaluators produce results with the same key {}".format(k)
                    results[k] = v
        return results
```
`DatasetEvaluator` is an abstract base class that defines the interface of a dataset evaluator. It declares three methods: `reset()`, `process(inputs, outputs)`, and `evaluate()`. Users subclass it and implement these methods to define their own evaluation logic.
`DatasetEvaluators` is a `DatasetEvaluator` implementation that combines multiple evaluators. It takes an `evaluators` argument (a list of evaluators), delegates `reset()` and `process()` to each of them, and in `evaluate()` merges all their results into a single dict, asserting that no two evaluators produce the same key.
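You rarely need to build `DatasetEvaluators` by hand, since `inference_on_dataset` (below) wraps a list of evaluators automatically, but a manual sketch looks like this (the dataset name is a placeholder, and `Counter` is the toy evaluator from above):

```python
from detectron2.evaluation import COCOEvaluator, DatasetEvaluators

# Compute COCO AP metrics and the toy instance count in a single inference pass.
evaluator = DatasetEvaluators([
    COCOEvaluator("my_val_dataset", output_dir="./output"),  # placeholder dataset name
    Counter(),
])
```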
inference_on_dataset
The `inference_on_dataset` function is the main entry point for running model inference on a given dataset and evaluating the results. Here is the function, followed by a detailed explanation:
```python
def inference_on_dataset(
    model,
    data_loader,
    evaluator: Union[DatasetEvaluator, List[DatasetEvaluator], None],
    callbacks=None,
):
    """
    Run model on the data_loader and evaluate the metrics with evaluator.
    Also benchmark the inference speed of `model.__call__` accurately.
    The model will be used in eval mode.

    Args:
        model (callable): a callable which takes an object from
            `data_loader` and returns some outputs.

            If it's an nn.Module, it will be temporarily set to `eval` mode.
            If you wish to evaluate a model in `training` mode instead, you can
            wrap the given model and override its behavior of `.eval()` and `.train()`.
        data_loader: an iterable object with a length.
            The elements it generates will be the inputs to the model.
        evaluator: the evaluator(s) to run. Use `None` if you only want to benchmark,
            but don't want to do any evaluation.
        callbacks (dict of callables): a dictionary of callback functions which can be
            called at each stage of inference.

    Returns:
        The return value of `evaluator.evaluate()`
    """
    num_devices = get_world_size()
    logger = logging.getLogger(__name__)
    logger.info("Start inference on {} batches".format(len(data_loader)))

    total = len(data_loader)  # inference data loader must have a fixed length
    if evaluator is None:
        # create a no-op evaluator
        evaluator = DatasetEvaluators([])
    if isinstance(evaluator, abc.MutableSequence):
        evaluator = DatasetEvaluators(evaluator)
    evaluator.reset()

    num_warmup = min(5, total - 1)
    start_time = time.perf_counter()
    total_data_time = 0
    total_compute_time = 0
    total_eval_time = 0
    with ExitStack() as stack:
        if isinstance(model, nn.Module):
            stack.enter_context(inference_context(model))
        stack.enter_context(torch.no_grad())

        start_data_time = time.perf_counter()
        dict.get(callbacks or {}, "on_start", lambda: None)()
        for idx, inputs in enumerate(data_loader):
            total_data_time += time.perf_counter() - start_data_time
            if idx == num_warmup:
                start_time = time.perf_counter()
                total_data_time = 0
                total_compute_time = 0
                total_eval_time = 0

            start_compute_time = time.perf_counter()
            dict.get(callbacks or {}, "before_inference", lambda: None)()
            outputs = model(inputs)
            dict.get(callbacks or {}, "after_inference", lambda: None)()
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            total_compute_time += time.perf_counter() - start_compute_time

            start_eval_time = time.perf_counter()
            evaluator.process(inputs, outputs)
            total_eval_time += time.perf_counter() - start_eval_time

            iters_after_start = idx + 1 - num_warmup * int(idx >= num_warmup)
            data_seconds_per_iter = total_data_time / iters_after_start
            compute_seconds_per_iter = total_compute_time / iters_after_start
            eval_seconds_per_iter = total_eval_time / iters_after_start
            total_seconds_per_iter = (time.perf_counter() - start_time) / iters_after_start
            if idx >= num_warmup * 2 or compute_seconds_per_iter > 5:
                eta = datetime.timedelta(seconds=int(total_seconds_per_iter * (total - idx - 1)))
                log_every_n_seconds(
                    logging.INFO,
                    (
                        f"Inference done {idx + 1}/{total}. "
                        f"Dataloading: {data_seconds_per_iter:.4f} s/iter. "
                        f"Inference: {compute_seconds_per_iter:.4f} s/iter. "
                        f"Eval: {eval_seconds_per_iter:.4f} s/iter. "
                        f"Total: {total_seconds_per_iter:.4f} s/iter. "
                        f"ETA={eta}"
                    ),
                    n=5,
                )
            start_data_time = time.perf_counter()
        dict.get(callbacks or {}, "on_end", lambda: None)()

    # Measure the time only for this worker (before the synchronization barrier)
    total_time = time.perf_counter() - start_time
    total_time_str = str(datetime.timedelta(seconds=total_time))
    # NOTE this format is parsed by grep
    logger.info(
        "Total inference time: {} ({:.6f} s / iter per device, on {} devices)".format(
            total_time_str, total_time / (total - num_warmup), num_devices
        )
    )
    total_compute_time_str = str(datetime.timedelta(seconds=int(total_compute_time)))
    logger.info(
        "Total inference pure compute time: {} ({:.6f} s / iter per device, on {} devices)".format(
            total_compute_time_str, total_compute_time / (total - num_warmup), num_devices
        )
    )

    results = evaluator.evaluate()
    # An evaluator may return None when not in main process.
    # Replace it by an empty dict instead to make it easier for downstream code to handle
    if results is None:
        results = {}
    return results
```
- Arguments:
    - `model`: the model to run inference with. It can be any callable that takes inputs from `data_loader` and returns outputs; typically it is a neural network (`torch.nn.Module`).
    - `data_loader`: the data loader that yields the samples. It must be an iterable with a fixed length.
    - `evaluator`: the evaluator(s) used to evaluate the inference results. It can be a single evaluator or a list of evaluators; pass `None` to run inference only, without any evaluation.
    - `callbacks`: a dict of callback functions invoked at the different stages of inference, as sketched below.
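The callback keys recognized by the code above are `on_start`, `before_inference`, `after_inference`, and `on_end`, each a zero-argument callable; `before_inference`/`after_inference` run around every batch, while `on_start`/`on_end` run once. A minimal sketch:

```python
callbacks = {
    "on_start": lambda: print("inference starting"),
    "after_inference": lambda: print("finished a batch"),  # runs once per batch
    "on_end": lambda: print("inference finished"),
}
results = inference_on_dataset(model, data_loader, evaluator, callbacks=callbacks)
```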
- Behavior:
    - The function first obtains the number of devices via `get_world_size()` and initializes the logger and several timers.
    - If `evaluator` is `None`, a no-op evaluator is created; if it is a list, the evaluators are wrapped into a single `DatasetEvaluators` object.
    - Inference then starts. For each batch from the data loader, the function:
        - accumulates the data-loading time;
        - restarts all timers once the warmup iterations (at most 5) are done, so warmup is excluded from the benchmark;
        - runs model inference and records the compute time;
        - synchronizes if a GPU is available, so asynchronous CUDA kernels are included in the timing;
        - lets the evaluator process the model outputs;
        - records the evaluation time;
        - computes and periodically logs the per-iteration data-loading, compute, eval, and total times.
    - Finally, the function returns the evaluator's results.
- Summary:
    - `inference_on_dataset` provides a convenient interface for running inference on a given dataset and evaluating the results.
    - It handles data loading, model inference, evaluator processing, and timing statistics during inference.
    - By passing in different models, data loaders, and evaluators, you can easily evaluate and compare models on different datasets.
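A typical call, following the usage shown in the Detectron2 tutorials (`"my_val_dataset"` is a placeholder for a dataset you have registered):

```python
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

evaluator = COCOEvaluator("my_val_dataset", output_dir="./output")
val_loader = build_detection_test_loader(cfg, "my_val_dataset")
print(inference_on_dataset(model, val_loader, evaluator))
```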
Evaluator
So how do we put a custom Evaluator to use?
```python
def test_and_save_results():
    self._last_eval_results = self.test(self.cfg, self.model)
    return self._last_eval_results

# Do evaluation after checkpointer, because then if it fails,
# we can use the saved checkpoint to debug.
ret.append(hooks.EvalHook(cfg.TEST.EVAL_PERIOD, test_and_save_results))
```
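(`hooks.EvalHook(eval_period, eval_function)` calls `eval_function` every `eval_period` iterations and once more after the last iteration.) If you build your own trainer instead of relying on `DefaultTrainer.build_hooks`, a minimal sketch of registering the hook manually, assuming `my_eval_fn` is a zero-argument function returning a metrics dict:

```python
from detectron2.engine import DefaultTrainer, hooks

trainer = DefaultTrainer(cfg)
# `my_eval_fn` is a hypothetical zero-argument callable returning a dict of metrics.
trainer.register_hooks([hooks.EvalHook(cfg.TEST.EVAL_PERIOD, my_eval_fn)])
```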
How do we evaluate the model during, or at the end of, training?
The example Detectron2 gives is to obtain the eval results by calling `self.test` when the hooks are built (`build_hooks`), as shown above. So how is `self.test` written?
```python
@classmethod
def test(cls, cfg, model, evaluators=None):
    """
    Evaluate the given model. The given model is expected to already contain
    weights to evaluate.

    Args:
        cfg (CfgNode):
        model (nn.Module):
        evaluators (list[DatasetEvaluator] or None): if None, will call
            :meth:`build_evaluator`. Otherwise, must have the same length as
            ``cfg.DATASETS.TEST``.

    Returns:
        dict: a dict of result metrics
    """
    logger = logging.getLogger(__name__)
    if isinstance(evaluators, DatasetEvaluator):
        evaluators = [evaluators]
    if evaluators is not None:
        assert len(cfg.DATASETS.TEST) == len(evaluators), "{} != {}".format(
            len(cfg.DATASETS.TEST), len(evaluators)
        )

    results = OrderedDict()
    for idx, dataset_name in enumerate(cfg.DATASETS.TEST):
        data_loader = cls.build_test_loader(cfg, dataset_name)
        # When evaluators are passed in as arguments,
        # implicitly assume that evaluators can be created before data_loader.
        if evaluators is not None:
            evaluator = evaluators[idx]
        else:
            try:
                evaluator = cls.build_evaluator(cfg, dataset_name)
            except NotImplementedError:
                logger.warn(
                    "No evaluator found. Use `DefaultTrainer.test(evaluators=)`, "
                    "or implement its `build_evaluator` method."
                )
                results[dataset_name] = {}
                continue
        results_i = inference_on_dataset(model, data_loader, evaluator)
        results[dataset_name] = results_i
        if comm.is_main_process():
            assert isinstance(
                results_i, dict
            ), "Evaluator must return a dict on the main process. Got {} instead.".format(
                results_i
            )
            logger.info("Evaluation results for {} in csv format:".format(dataset_name))
            print_csv_format(results_i)

    if len(results) == 1:
        results = list(results.values())[0]
    return results
```
`self.test` takes `cfg, model, evaluators=None` as arguments; let's ignore `cfg` for now.
- First, `evaluators` must be `list[DatasetEvaluator]` or `None`; if a single `DatasetEvaluator` is passed, it is wrapped into a list.
- It then asserts `len(cfg.DATASETS.TEST) == len(evaluators)` (`cfg.DATASETS.TEST` is a tuple of registered dataset names, e.g. `("coco_2017_val",)`, with one evaluator per dataset).
- The returned `results` is an `OrderedDict()`. The function iterates over `cfg.DATASETS.TEST` and builds a `data_loader` with the `build_test_loader` method. If `evaluators` is `None`, it calls `cls.build_evaluator` to build the evaluator.
- The per-dataset result is obtained via `results_i = inference_on_dataset(model, data_loader, evaluator)`.
- If in the main process, the results are printed.
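`DefaultTrainer` itself does not implement `build_evaluator` (it raises `NotImplementedError`), so for `self.test` to work on your dataset you normally subclass the trainer, following the pattern in Detectron2's `train_net.py`. A sketch, assuming COCO-style annotations:

```python
import os

from detectron2.engine import DefaultTrainer
from detectron2.evaluation import COCOEvaluator

class MyTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        # Assumes `dataset_name` has COCO-style annotations; swap in the
        # evaluator that matches your dataset's annotation type.
        return COCOEvaluator(dataset_name, output_dir=output_folder)
```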
I fell into a pitfall here earlier because I hadn't looked closely at the difference between the train dataloader and the test dataloader.
```python
@configurable(from_config=_train_loader_from_config)
def build_detection_train_loader(
    dataset,
    *,
    mapper,
    sampler=None,
    total_batch_size,
    aspect_ratio_grouping=True,
    num_workers=0,
    collate_fn=None,
    **kwargs
):
    ...


@configurable(from_config=_test_loader_from_config)
def build_detection_test_loader(
    dataset: Union[List[Any], torchdata.Dataset],
    *,
    mapper: Callable[[Dict[str, Any]], Any],
    sampler: Optional[torchdata.Sampler] = None,
    batch_size: int = 1,
    num_workers: int = 0,
    collate_fn: Optional[Callable[[List[Any]], Any]] = None,
) -> torchdata.DataLoader:
    ...
```
They use different samplers: the train loader uses a `TrainingSampler`, while the test loader uses an `InferenceSampler`. A `TrainingSampler` produces an infinite stream of (by default shuffled) indices, whereas an `InferenceSampler` shards the dataset across processes and visits every sample exactly once. Note also that the test loader's `batch_size` defaults to 1.
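A quick sketch of what the two samplers do when constructed directly (the size is illustrative):

```python
from detectron2.data.samplers import InferenceSampler, TrainingSampler

dataset_size = 5000  # illustrative

# Infinite stream of shuffled indices; training consumes it until max_iter.
train_sampler = TrainingSampler(dataset_size)

# Each index produced exactly once, sharded across distributed processes.
test_sampler = InferenceSampler(dataset_size)
```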