论文阅读：Evaluating explanation without ground truth in interpretable machine learning

最新推荐文章于 2024-07-11 03:29:05 发布

LET IT BE

最新推荐文章于 2024-07-11 03:29:05 发布

阅读量231

点赞数

文章标签：人工智能机器学习

本文链接：https://blog.csdn.net/weixin_45251621/article/details/125046416

版权

https://arxiv.org/abs/1907.06831

Yang F, Du M, Hu X. Evaluating explanation without ground truth in interpretable machine learning[J]. arXiv preprint arXiv:1907.06831, 2019.

0 Abstract

可解释机器学习（IML）在许多现实世界的应用中变得越来越重要，例如自动驾驶汽车和医疗诊断，在这些应用中，解释可以帮助人们更好地理解机器学习系统的工作方式，并进一步增强他们对系统的信任。然而，由于场景的多样性和解释的主观性，我们在IML中很少有关于所生成解释质量的基准评估的基本事实（ground truth）。评估解释的质量不仅对评估系统边界很重要，而且有助于在实际环境中实现对人类用户的真正好处。为了对IML中的评估进行基准测试，在本文中，我们严格定义了评估解释的问题，并系统地回顾了现有的最新研究成果。具体而言，我们总结了解释的三个一般属性（即概括性generalizability、忠实性fidelity和说服性persuasibility），并给出了正式定义，然后分别回顾了它们在不同任务下的代表性方法。此外，根据开发人员和最终用户的分层需求，设计了一个统一的评估框架，可以方便地应用于实践中的不同场景。最后，讨论了尚未解决的问题，并提出了当前评估技术的一些局限性，以供今后的探索。

1 Introduction

复杂机器学习模型（比如深度学习模型）虽然有着较高的精度，但通常难以解释，这限制了它们在高风险投资场景下的应用机会。因此，可解释机器学习（IML）的概念被提出，旨在帮助人类更好地理解机器学习的决策过程。

IML技术的核心思想：

IML技术的研究趋势：

尽管已经对IML技术进行了全面的讨论，包括方法和应用，但IML评估视角的见解仍然相当有限，这严重阻碍了IML进入严谨科学领域的道路。为了准确反映IML系统的边界并衡量IML给人类用户带来的好处，有效的评估是关键和不可或缺的。与单纯依赖模型性能的传统评估不同，IML评估还需要关注生成的解释的质量，这使得其难以处理和基准化。

IML评估的挑战：设计实验时需要权衡主观和客观的问题

一方面，不同的用户可能会对不同场景下的良好解释有不同的偏好，因此，用一组共同的基本事实对IML评估进行客观评估是不切实际的。
另一方面，用户对IML的主观满意度也不能单独作为对IML的评估，有研究表明用户的主观满意度往往与用户的反应时间有关，而与系统的预测性能没有太大的关系，
而且，完全依靠用户的主观评估还会导致道德问题，因为操纵解释以更好地迎合人类用户是不道德的。过分追求人类的满意度可能会导致解释说服用户，而不是实际解释系统。

本文工作：

概述ML相关方法，从两个维度进行分类（interpretation scope和interpretation manner）
概括三个一般属性（概括性generalizability、忠实性fidelity和说服性persuasibility），对其进行正式地定义，并且严格定义IML场景下的评估问题
依据三个一般属性，系统回顾现有关于IML评估的工作，聚焦于不同应用中的不同解释技术
回顾一些其他一些特殊场景下IML评估的特殊属性
依据以上的这些属性，设计一个统一的评估框架，该框架同时考虑了模型开发人员和终端用户的层次需求
最后，我们提出了当前评估技术存在的几个问题，并讨论了未来探索的一些潜在局限性。

2 Explanation and Evaluation

2.1 Explanation Overview

A two-dimensional categorization

Scope dimension

- global: indicates the overall working mechanism of models by interpreting structures or parameters
- local: reflects the particular model behavior for individual instance by analyzing specific decisions

Manner dimension

- intrinsic: achieved by self-interpretable models
- post-hoc (also written as posthoc): requires another independent interpretation model or technique to provide explanations for the target model

2.2 General Properties of Explanation

generalizability
fidelity
persuasibility

2.3 Explanation Evaluation Problem

IML评估问题的定义：

model evaluation：与传统机器学习系统评估方式相同，采用accuracy、precision、recall、F1-score等指标进行评估
explanation evaluation：explanation evaluation在目标和方法上都不同于mode evaluation。由于解释通常包含多个视角，并且在不同场景中没有共同的基本事实，因此传统的模型评估技术无法完美应用。

Defination of explanation evaluation: The explanation evaluation problem within IML context is to assess the quality of the generated explanations from systems, where high-quality explanation corresponds to large values of generalizability, fidelity and persuasibility with relevant measurement. In general, good explanation ought to be well generalized, highly faithful and easily understood.

3 Evaluation Review

3.1 Evaluation on Generalizability

intrinsic-global explanations: 本身就是可解释的，将这些explanations应用于测试集，看看它们的泛化能力（概括推广能力），与model evaluation有些相似

- metrics： accuracy， F1-score，AUC
- examples：generalized linear model (with informative coefficients) , decision tree (with structured branches) , K-nearest neighbors (with significant instances) and rule-based systems (with discriminative patterns)

posthoc-global explanations：与内在全局解释类似，只不过这里的模型不是目标模型，而是代理模型

- metrics： accuracy， F1-score，AUC
- examples：knowledge distillation， mimic learning

3.2 Evaluation on Fidelity

intrinsic explanations: full fidelity，因为它本身就完全反映模型的真实运作机制
posthoc-global explanations：代理模型是不具有full fidelity的，因为它建立在一个不同的模型基础上。通过衡量目标模型和代理模型性能上的差异来反映fidelity。

- examples： teacher-student models

posthoc-local explanations：一般的方式是通过消融分析或者扰动分析（ablation and perturbation）来衡量fidelity，核心思想是依据生成的解释在进行对抗性改变之后检查预测的波动（check the prediction variation after the adversarial changes made according to the generated explanation）。

- philosophy：如果生成的解释是faithful的，那么依据该解释，对输入样例的特征进行相应地修改，那么会对预测结果造成显著的影响（modifications made in accordance with the generated explanation to the input instances bring about significant differences to model prediction ,if the explanation is faithful to the target system). 预测结果改变越大，说明explanation越faithful。
- examples：mask the attributing regions in images for image classification tasks；

3.3 Evaluation on Persuasibility

uncontentious tasks(eg. object detection)：如目标检测等无争议的任务（typically keep consistent across different groups of user and one particular task），可以使用与人工注解（human annotation）比较的方法来评估

- example 1：bounding box：利用交并比Intersection over Union(IoU)或者Jaccard index等指标量化说服力
- example 2：semantic segmentation：利用像素级别的差异作为指标衡量解释的说服力
- example 3：类似human annotation， NLP中有rationale，它是被标注员认为对预测产生重要影响的输入特征的一个子集。

complicated tasks：复杂任务下，基于人工标注的方法不再适用，因为相关的标注在不同用户组中会不一致。这种情况下，常见的说服力评估方法是employing users for human studies

- metrics for human evaluation：mental mode， human-machineperformance, user satisfaction user trust, response time, decision accuracy

3.4 Evaluation on Other Properties

The reason to introduce thers properties seperately:

these properies are not representative and general for explanation eavluation among IML systems, and are simply considered under specific architectures or applications
related to both prediction model and generated explanation, and need novel and special design to evaluate

Other properties:

robustness：how similar the explanations are for similar examples

- metrics: sensitivity

capability：the exend that corresponding explanations can be generated

- used on those explanations generated from search based mathdologies , instead of those obtained from gradient base or perturbation based methdologies
- example: recomender system, use explainability precision and explainability recall as metrics

certainty：whether the explanations reflct the uncertainty of the target IML system

- discrepancy in prediction confidence of the IML system between one category and the others

4 Discussion and Exploration

4.1 Unified Framework for Evaluation

Generalizability -> reflect the true knowledge for particular tasks? -> the precondition for human users to make accurate decisions with the generated explanations
Fidelity -> explanation relevance -> trust the IML system or not?
Persuasibility -> directly bridge the gap between human users and machine explanations -> tangible impacts

Each tier should have a consistent pipeline with a fixed set of data, user, and metrics correspondingly.

The overall evaluation results can be further derived through an ensemble way, e.g., weighted sum.

LET IT BE

关注

0
点赞
踩
3

收藏

觉得还不错? 一键收藏
0
评论
论文阅读：Evaluating explanation without ground truth in interpretable machine learning

https://arxiv.org/abs/1907.06831Yang F, Du M, Hu X. Evaluating explanation without ground truth in interpretable machine learning[J]. arXiv preprint arXiv:1907.06831, 2019.0 Abstract可解释机器学习（IML）在许多现实世界的应用中变得越来越重要，例如自动驾驶汽车和医疗诊断，在这些应用中，解释可以帮助人们更好地理解机.
复制链接

扫一扫