Paper notes on interpretability (PFI): "All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously" - Translation and Commentary



Table of Contents

"All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously" - Translation and Commentary

Abstract

1. Introduction

2. Notation & Technical Summary

2.1 Summary of Rashomon Sets & Model Class Reliance

3. Model Reliance

3.1 Estimating Model Reliance with U-statistics, and Connections to Permutation-based Variable Importance

3.2 Limitations of Existing Variable Importance Methods

4. Model Class Reliance

4.1 Motivating Empirical Estimators of MCR by Deriving Finite-sample Bounds

5. Extensions of Rashomon Sets Beyond Variable Importance

5.1 Finite-sample Confidence Intervals from Rashomon Sets

5.2 Related Literature on the Rashomon Effect

6. Calculating Empirical Estimates of Model Class Reliance

7. MR & MCR for Linear Models, Additive Models, and Regression Models in a Reproducing Kernel Hilbert Space

7.1 Interpreting and Computing MR for Linear or Additive Models

10. Data Analysis: Reliance of Criminal Recidivism Prediction Models on Race and Sex

10.1 Results

10.2 Discussion & Limitations

11. Conclusion


"All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously" - Translation and Commentary

Source

https://arxiv.org/pdf/1801.01489.pdf

http://www.jmlr.org/papers/v20/18-760.html

Authors

Aaron Fisher afishe27@alumni.jh.edu

arXiv:1801.01489v5 [stat.ME] 23 Dec 2019

Takeda Pharmaceuticals Cambridge, MA 02139, USA

Cynthia Rudin cynthia@cs.duke.edu

Departments of Computer Science and Electrical and Computer Engineering Duke University

Durham, NC 27708, USA

Francesca Dominici fdominic@hsph.harvard.edu

Department of Biostatistics

Harvard T.H. Chan School of Public Health Boston, MA 02115, USA

(Authors are listed in order of contribution, with highest contribution listed first.)

Publication date

December 23, 2019

Abstract

Variable importance (VI) tools describe how much covariates contribute to a prediction model's accuracy. However, important variables for one well-performing model (for example, a linear model f(x) = xᵀβ with a fixed coefficient vector β) may be unimportant for another model. In this paper, we propose model class reliance (MCR) as the range of VI values across all well-performing models in a prespecified class. Thus, MCR gives a more comprehensive description of importance by accounting for the fact that many prediction models, possibly of different parametric forms, may fit the data well. In the process of deriving MCR, we show several informative results for permutation-based VI estimates, based on the VI measures used in Random Forests. Specifically, we derive connections between permutation importance estimates for a single prediction model, U-statistics, conditional variable importance, conditional causal effects, and linear model coefficients. We then give probabilistic bounds for MCR, using a novel, generalizable technique. We apply MCR to a public data set of Broward County criminal records to study the reliance of recidivism prediction models on sex and race. In this application, MCR can be used to help inform VI for unknown, proprietary models.


Keywords: Rashomon, permutation importance, conditional variable importance, U-statistics, transparency, interpretable models


1. Introduction

Variable importance (VI) tools describe how much a prediction model's accuracy depends on the information in each covariate. For example, in Random Forests, VI is measured by the decrease in prediction accuracy when a covariate is permuted (Breiman, 2001; Breiman et al., 2001; see also Strobl et al., 2008; Altmann et al., 2010; Zhu et al., 2015; Gregorutti et al., 2015; Datta et al., 2016; Gregorutti et al., 2017). A similar "Perturb" VI measure has been used for neural networks, where noise is added to covariates (Recknagel et al., 1997; Yao et al., 1998; Scardi and Harding, 1999; Gevrey et al., 2003). Such tools can be useful for identifying covariates that must be measured with high precision, for improving the transparency of a "black box" prediction model (see also Rudin, 2019), or for determining what scenarios may cause the model to fail.

However, existing VI measures do not generally account for the fact that many prediction models may fit the data almost equally well. In such cases, the model used by one analyst may rely on entirely different covariate information than the model used by another analyst. This common scenario has been called the "Rashomon" effect of statistics (Breiman et al., 2001; see also Lecué, 2011; Statnikov et al., 2013; Tulabandhula and Rudin, 2014; Nevo and Ritov, 2017; Letham et al., 2016). The term is inspired by the 1950 Kurosawa film of the same name, in which four witnesses offer different descriptions and explanations for the same encounter. Under the Rashomon effect, how should analysts give comprehensive descriptions of the importance of each covariate? How well can one analyst recover the conclusions of another? Will the model that gives the best predictions necessarily give the most accurate interpretation?

To address these concerns, we analyze the set of prediction models that provide near- optimal accuracy, which we refer to as a Rashomon set. This approach stands in contrast to training to select a single prediction model, among a prespecified class of candidate models. Our motivation is that Rashomon sets (defined formally below) summarize the range of effective prediction strategies that an analyst might choose. Additionally, even if the candidate models do not contain the true data generating process, we may hope that some of these models function in similar ways to the data generating process. In particular, we may hope there exist well performing candidate models that place the same importance on a variable of interest as the underlying data generating process does. If so, then studying sets of well-performing models will allow us to deduce information about the data generating process.


Applying this approach to study variable importance, we define model class reliance (MCR) as the highest and lowest degree to which any well-performing model within a given class may rely on a variable of interest for prediction accuracy. Roughly speaking, MCR captures the range of explanations, or mechanisms, associated with well-performing models. Because the resulting range summarizes many prediction models simultaneously, rather than a single model, we expect this range to be less affected by the choices that an individual analyst makes during the model-fitting process. Instead of reflecting these choices, MCR aims to reflect the nature of the prediction problem itself.

We make several specific technical contributions in deriving MCR. First, we review a core measure of how much an individual prediction model relies on covariates of interest for its accuracy, which we call model reliance (MR). This measure is based on permutation importance measures for Random Forests (Breiman et al., 2001; Breiman, 2001), and can be expanded to describe conditional importance (see Section 8, as well as Strobl et al. 2008). We draw a connection between permutation-based importance estimates (MR) and U-statistics, which facilitates later theoretical results. Additionally, we derive connections between MR, conditional causal effects, and coefficients for additive models. Expanding on MR, we propose MCR, which generalizes the definition of MR for a class of models. We derive finite-sample bounds for MCR, which motivate an intuitive estimator of MCR. Finally, we propose computational procedures for this estimator.

The tools we develop to study Rashomon sets are quite general, and can be used to make finite-sample inferences for arbitrary characteristics of well-performing models. For example, beyond describing variable importance, these tools can describe the range of risk predictions that well-fitting models assign to a particular covariate profile, or the variance of predictions made by well-fitting models. In some cases, these novel techniques may provide finite-sample confidence intervals (CIs) where none have previously existed (see Section 5).

MCR and the Rashomon effect become especially relevant in the context of criminal recidivism prediction. Proprietary recidivism risk models trained from criminal records data are increasingly being used in U.S. courtrooms. One concern is that these models may be relying on information that would otherwise be considered unacceptable (for example, race, sex, or proxies for these variables), in order to estimate recidivism risk. The relevant models are often proprietary, and cannot be studied directly. Still, in cases where the predictions made by these models are publicly available, it may be possible to identify alternative prediction models that are sufficiently similar to the proprietary model of interest.


In this paper, we specifically consider the proprietary model COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), developed by the company Northpointe Inc. (subsequently, in 2017, Northpointe Inc., Courtview Justice Solutions Inc., and Constellation Justice Systems Inc. joined together under the name Equivant). Our goal is to estimate how much COMPAS relies on either race, sex, or proxies for these variables not measured in our data set. To this end, we apply a broad class of flexible, kernel-based prediction models to predict COMPAS score. In this setting, the MCR interval reflects the highest and lowest degree to which any prediction model in our class can rely on race and sex while still predicting COMPAS score relatively accurately. Equipped with MCR, we can relax the common assumption of being able to correctly specify the unknown model of interest (here, COMPAS) up to a parametric form. Instead, rather than assuming that the COMPAS model itself is contained in our class, we assume that our class contains at least one well-performing alternative model that relies on sensitive covariates to the same degree that COMPAS does. Under this assumption, the MCR interval will contain the VI value for COMPAS. Applying our approach, we find that race, sex, and their potential proxy variables are likely not the dominant predictive factors in the COMPAS score (see analysis and discussion in Section 10).

The remainder of this paper is organized as follows. In Section 2 we introduce notation, and give a high level summary of our approach, illustrated with visualizations. In Sections 3 and 4 we formally present MR and MCR respectively, and derive theoretical properties of each. We also review related variable importance practices in the literature, such as retraining a model after removing one of the covariates. In Section 5, we discuss general applicability of our approach for determining finite-sample CIs for other problems. In Section 6, we present a general procedure for computing MCR. In Section 7, we give specific implementations of this procedure for (regularized) linear models, and linear models in a reproducing kernel Hilbert space. We also show that, for additive models, MR can be expressed in terms of the model’s coefficients. In Section 8 we outline connections between MR, causal inference, and conditional variable importance. In Section 9, we illustrate MR and MCR with a simulated toy example, to aid intuition. We also present simulation studies for the task of estimating MR for an unknown, underlying conditional expectation function, under misspecification. We analyze a well-known public data set on recidivism in Section 10, described above. All proofs are presented in the appendices.


2. Notation & Technical Summary

The label of "variable importance" measure has been broadly used to describe approaches for either inference (van der Laan, 2006; Díaz et al., 2015; Williamson et al., 2017) or prediction. While these two goals are highly related, we primarily focus on how much prediction models rely on covariates to achieve accuracy. We use terms such as "model reliance" rather than "importance" to clarify this context.

In order to evaluate how much prediction models rely on variables, we now introduce notation for random variables, data, classes of prediction models, and loss functions for evaluating predictions. Let Z = (Y, X1, X2) ∈ 𝒵 be a random variable with outcome Y ∈ 𝒴 and covariates X = (X1, X2) ∈ 𝒳, where the covariate subsets X1 ∈ 𝒳1 and X2 ∈ 𝒳2 may each be multivariate. We assume that observations of Z are iid, that n ≥ 2, and that solutions to arg min and arg max operations exist whenever optimizing over sets mentioned in this paper (for example, in Theorem 4, below). Our goal is to study how much different prediction models rely on X1 to predict Y.


We refer to our data set as Z = [y X], a matrix composed of an n-length outcome vector y in the first column and an n × p covariate matrix X = [X1 X2] in the remaining columns. In general, for a given vector v, let v[j] denote its jth element(s). For a given matrix A, let A′, A[i,·], A[·,j], and A[i,j] respectively denote the transpose of A, the ith row(s) of A, the jth column(s) of A, and the element(s) in the ith row(s) and jth column(s) of A.

We use the term model class to refer to a prespecified subset F ⊂ {f | f : 𝒳 → 𝒴} of the measurable functions from 𝒳 to 𝒴. We refer to member functions f ∈ F as prediction models, or simply as models. Given a model f, we evaluate its performance using a nonnegative loss function L : (F × 𝒵) → R≥0. For example, L may be the squared error loss Lse(f, (y, x1, x2)) = (y − f(x1, x2))² for regression, or the hinge loss Lh(f, (y, x1, x2)) = (1 − yf(x1, x2))+ for classification. We use the term algorithm to refer to any procedure 𝒜 : 𝒵ⁿ → F that takes a data set as input and returns a model f ∈ F as output.


2.1 Summary of Rashomon Sets & Model Class Reliance

Many traditional statistical estimates come from descriptions of a single, fitted prediction model. In contrast, in this section, we summarize our approach for studying a set of near-optimal models. To define this set, we require a prespecified "reference" model, denoted by fref, to serve as a benchmark for predictive performance. For example, fref may come from a flowchart used to predict injury severity in a hospital's emergency room, or from another quantitative decision rule that is currently implemented in practice. Given a reference model fref, we define a population ε-Rashomon set as the subset of models with expected loss no more than ε above that of fref. We denote this set as R(ε) := {f ∈ F : E L(f, Z) ≤ E L(fref, Z) + ε}, where E denotes expectation with respect to the population distribution. This set can be thought of as representing models that might be arrived at due to differences in data measurement, processing, filtering, model parameterization, covariate selection, or other analysis choices (see Section 4).

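As a toy illustration of this idea, the sketch below builds the in-sample analogue of R(ε): it keeps every candidate model whose average loss is within eps of the reference model's. The function names, the finite candidate grid, and the simulated distribution are my own assumptions for illustration, not the paper's construction.

```python
import numpy as np

def empirical_rashomon_set(models, f_ref, loss, y, x1, x2, eps):
    """Return the models whose in-sample average loss is at most
    eps above that of the reference model f_ref."""
    ref_loss = np.mean(loss(f_ref, y, x1, x2))
    return [f for f in models
            if np.mean(loss(f, y, x1, x2)) <= ref_loss + eps]

# Toy setup: y = x1 exactly; candidates mix x1 and x2 with weight b.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = x1
sq = lambda f, y, x1, x2: (y - f(x1, x2)) ** 2
candidates = [lambda x1, x2, b=b: b * x1 + (1 - b) * x2
              for b in np.linspace(0, 1, 11)]
f_ref = candidates[-1]  # the b = 1 model, which fits y perfectly

good = empirical_rashomon_set(candidates, f_ref, sq, y, x1, x2, eps=0.05)
print(len(good))  # only candidates with b close to 1 survive here
```

With independent covariates, as here, the Rashomon set is small; the interesting case, developed below, is when distinct models predict comparably well.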

Figure 1: Rashomon sets and model class reliance – Panel (A) illustrates a hypothetical Rashomon set R(ε) within a model class F. The y-axis shows the expected loss of each model f ∈ F, and the x-axis shows how much each model relies on X1 (defined formally in Section 3). Along the x-axis, the population-level MCR range is highlighted in blue, showing the values of MR corresponding to well-performing models (see Section 4). Panel (B) shows the in-sample analogue of Panel (A). Here, the y-axis denotes the in-sample loss, ÊL(f, Z) := (1/n) Σ_{i=1}^n L(f, Z[i,·]); the x-axis shows the empirical model reliance of each model f ∈ F on X1 (see Section 3); and the highlighted portion of the x-axis shows empirical MCR (see Section 4).


Figure 1-A illustrates a hypothetical example of a population ε-Rashomon set. Here, the y-axis shows the expected loss of each model f ∈ F, and the x-axis shows how much each model relies on X1 for its predictive accuracy. More specifically, given a prediction model f, the x-axis shows the percent increase in f's expected loss when noise is added to X1. We refer to this measure as the model reliance (MR) of f on X1, written informally as

MR(f) = (expected loss of f when noise is added to X1) / (expected loss of f).    (2.1)

The added noise must satisfy certain properties: namely, it must render X1 completely uninformative of the outcome Y, without altering the marginal distribution of X1 (for details, see Section 3, as well as Breiman, 2001; Breiman et al., 2001).


Our central goal is to understand how much, or how little, models may rely on covariates of interest (X1) while still predicting well. In Figure 1-A, this range of possible MR values is shown by the highlighted interval along the x-axis. We refer to an interval of this type as a population-level model class reliance (MCR) range (see Section 4), formally defined as

[MCR−(ε), MCR+(ε)] := [ min_{f ∈ R(ε)} MR(f),  max_{f ∈ R(ε)} MR(f) ].    (2.2)

To estimate this range, we use empirical analogues of the population ε-Rashomon set and of MR, based on observed data (Figure 1-B). We define an empirical ε-Rashomon set as the set of models with in-sample loss no more than ε above that of fref, and denote this set by R̂(ε). Informally, we define the empirical MR of a model f on X1 as

M̂R(f) = (in-sample loss of f when X1 is perturbed) / (in-sample loss of f),    (2.3)

that is, the extent to which f appears to rely on X1 in a given sample (see Section 3 for details). Finally, we define the empirical model class reliance as the range of empirical MR values corresponding to models with strong in-sample performance (see Section 4), formally written as

[M̂CR−(ε), M̂CR+(ε)] := [ min_{f ∈ R̂(ε)} M̂R(f),  max_{f ∈ R̂(ε)} M̂R(f) ].    (2.4)

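When the model class is a small finite grid, the min and max in Eq 2.4 can be approximated by brute force. The sketch below is my own toy construction: the grid of models, the simulated data, and the divide-in-half MR shortcut are illustrative assumptions, not the paper's actual procedure, which uses the optimization machinery of Sections 6 and 7.

```python
import numpy as np

def empirical_mr(f, loss, y, x1, x2):
    """A divide-in-half approximation to empirical model reliance:
    switch the x1 columns of the two sample halves, then compare losses."""
    h = len(y) // 2
    e_orig = np.mean(loss(f, y, x1, x2))
    x1_switched = np.concatenate([x1[h:2*h], x1[:h]])
    e_switch = np.mean(loss(f, y[:2*h], x1_switched, x2[:2*h]))
    return e_switch / e_orig

def empirical_mcr(models, f_ref, loss, y, x1, x2, eps):
    """Approximate [MCR-, MCR+] by scanning a finite grid of candidate
    models for membership in the empirical Rashomon set."""
    ref_loss = np.mean(loss(f_ref, y, x1, x2))
    mr_values = [empirical_mr(f, loss, y, x1, x2) for f in models
                 if np.mean(loss(f, y, x1, x2)) <= ref_loss + eps]
    return min(mr_values), max(mr_values)

# Toy data in which x2 is a near-proxy for x1, so well-performing
# models can rely on either covariate.
rng = np.random.default_rng(1)
x1 = rng.normal(size=400)
x2 = x1 + 0.1 * rng.normal(size=400)
y = x1 + 0.1 * rng.normal(size=400)
sq = lambda f, y, x1, x2: (y - f(x1, x2)) ** 2
models = [lambda x1, x2, b=b: b * x1 + (1 - b) * x2
          for b in np.linspace(0, 1, 21)]

lo, hi = empirical_mcr(models, models[-1], sq, y, x1, x2, eps=0.02)
print(lo, hi)  # a wide range: some accurate models barely rely on x1
```

Because x2 is nearly a copy of x1, models at either end of the grid predict almost equally well, so the empirical MCR range stretches from about 1 (no reliance on x1) up to a very large value.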

We make several technical contributions in the process of developing MCR.

Estimation of MR, and population-level MCR: Given f, we show desirable properties of M̂R(f) as an estimator of MR(f), using results for U-statistics (Section 3.1 and Theorem 5). We also derive finite-sample bounds for population-level MCR, some of which require a limit on the complexity of F in the form of a covering number. These bounds demonstrate that, under fairly weak conditions, empirical MCR provides a sensible estimate of population-level MCR (see Section 4 for details).

Computation of empirical MCR: Although empirical MCR is fully determined given a sample, the minimization and maximization in Eq 2.4 require nontrivial computations. To address this, we outline a general optimization procedure for MCR (Section 6). We give detailed implementations of this procedure for cases when the model class F is a set of (regularized) linear regression models, or a set of regression models in a reproducing kernel Hilbert space (Section 7). The output of our proposed procedure is a closed-form, convex envelope containing F, which can be used to approximate empirical MCR for any performance level ε (see Figure 2 for an illustration). Still, for complex model classes where standard empirical loss minimization is an open problem (for example, neural networks), computing empirical MCR remains an open problem as well.

Interpretation of MR in terms of model coefficients, and causal effects: We show that MR for an additive model can be written as a function of the model's coefficients (Proposition 15), and that MR for a binary covariate X1 can be written as a function of the conditional causal effects of X1 on Y (Proposition 19).

Extensions to conditional importance: We provide an extension of MR that is analogous to the notion of conditional importance (Strobl et al., 2008). This extension describes how much a model relies on the specific information in X1 that cannot otherwise be gleaned from X2 (Section 8.2).

Generalizations for Rashomon sets: Beyond notions of variable importance, we also generalize our finite-sample results for MCR to describe arbitrary characterizations of models in a population ε-Rashomon set. As we discuss in concurrent work (Coker et al., 2018), this generalization is analogous to the profile likelihood interval, and can, for example, be used to bound the range of risk predictions that well-performing prediction models may assign to a particular set of covariates (Section 5).


We begin in the next section by formally reviewing model reliance.

Figure 2: Illustration of output from our empirical MCR computational procedure – Our computational procedure produces a closed-form, convex envelope that contains F (shown above as the solid, purple line), which bounds empirical MCR for any value of ε (see Eq 2.4). The procedure works sequentially, tightening these bounds as much as possible near the ε value of interest (Section 6). The results from our data analysis (Figure 8) are presented in the same format as the above purple envelope.


3. Model Reliance

To formally describe how much the expected accuracy of a fixed prediction model f relies on the random variable X1, we use the notion of a "switched" loss in which X1 is rendered uninformative. Throughout this section, we will treat f as a pre-specified prediction model of interest (as in Hooker, 2007). Let Z(a) = (Y(a), X1(a), X2(a)) and Z(b) = (Y(b), X1(b), X2(b)) be independent random variables, each following the same distribution as Z = (Y, X1, X2). We define

eswitch(f) := E[ L(f, (Y(b), X1(a), X2(b))) ]


as representing the expected loss of model f across pairs of observations (Z(a), Z(b)) in which the values of X1(a) and X1(b) have been switched. To see this interpretation of the above equation, note that we have used the variables (Y(b), X2(b)) from Z(b), but we have used the variable X1(a) from an independent copy, Z(a). This is why we say that X1(a) and X1(b) have been switched: the values of (Y(b), X1(a), X2(b)) do not relate to each other as they would if they had come from the same observation. An alternative interpretation of eswitch(f) is as the expected loss of f when noise is added to X1 in such a way that X1 becomes completely uninformative of Y, but the marginal distribution of X1 is unchanged.

As a reference point, we compare eswitch(f) against the standard expected loss when none of the variables are switched, eorig(f) := E[L(f, (Y, X1, X2))]. From these two quantities, we formally define model reliance (MR) as the ratio

MR(f) := eswitch(f) / eorig(f),    (3.1)

as we alluded to in Eq 2.1. Higher values of MR(f) signify greater reliance of f on X1. For example, an MR(f) value of 2 means that the model relies heavily on X1, in the sense that its loss doubles when X1 is scrambled. An MR(f) value of 1 signifies no reliance on X1, in the sense that the model's loss does not change when X1 is scrambled. Models with reliance values strictly less than 1 are more difficult to interpret, as they rely less on the variable of interest than a random guess. Interestingly, it is possible to have models with reliance less than 1. For instance, a model f′ may satisfy MR(f′) < 1 if it treats X1 and Y as positively correlated when they are in fact negatively correlated. However, in many cases, the existence of a model f′ ∈ F satisfying MR(f′) < 1 implies the existence of another, better-performing model f′′ ∈ F satisfying MR(f′′) = 1 and eorig(f′′) ≤ eorig(f′). That is, although models may exist with MR values less than 1, they will typically be suboptimal (see Appendix A.2).

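The ratio in Eq 3.1 is easy to check by Monte Carlo when the data-generating process is known. In the sketch below, a hypothetical setup of my own in which all names and the distribution are illustrative assumptions, the model genuinely uses X1, so feeding it X1 from an independent copy inflates the loss and MR(f) lands well above 1.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100_000

def draw(n):
    """Hypothetical data-generating process, chosen for illustration:
    y = x1 + x2 + noise, with independent standard normal covariates."""
    x1 = rng.normal(size=n); x2 = rng.normal(size=n)
    y = x1 + x2 + 0.5 * rng.normal(size=n)
    return y, x1, x2

f = lambda x1, x2: x1 + x2          # the fixed model under study
sq = lambda y, pred: (y - pred) ** 2

# e_orig: ordinary expected loss, E L(f, (Y, X1, X2)).
y, x1, x2 = draw(N)
e_orig = np.mean(sq(y, f(x1, x2)))

# e_switch: loss over independent pairs (Z(a), Z(b)) with X1 switched,
# i.e. f is fed X1 from an independent copy.
ya, x1a, x2a = draw(N)
yb, x1b, x2b = draw(N)
e_switch = np.mean(sq(yb, f(x1a, x2b)))

print(e_orig)             # about 0.25: the irreducible noise variance
print(e_switch)           # about 2.25: x1's contribution is wasted
print(e_switch / e_orig)  # MR(f) of about 9: f relies heavily on X1
```

The values line up with the interpretation in the text: switching X1 multiplies the loss of this model roughly ninefold, so MR(f) ≫ 1.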

Model reliance could alternatively be defined as a difference rather than a ratio, that is, as MRdifference(f) := eswitch(f) − eorig(f). In Appendix A.5, we discuss how many of our results remain similar under either definition.


3.1 Estimating Model Reliance with U-statistics, and Connections to Permutation-based Variable Importance

Given a model f and data set Z = [y X], we estimate MR(f) by separately estimating the numerator and denominator of Eq 3.1. We estimate eorig(f) with the standard empirical loss,

êorig(f) := (1/n) Σ_{i=1}^n L(f, (y[i], X1[i,·], X2[i,·])).    (3.2)

We estimate eswitch(f) by performing a "switch" operation across all observed pairs, as in

êswitch(f) := (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} L(f, (y[j], X1[i,·], X2[j,·])).    (3.3)

Above, we have aggregated over all possible combinations of the observed values for (Y, X2) and for X1, excluding pairings that are actually observed in the original sample. If the summation over all possible pairs (Eq 3.3) is computationally prohibitive due to sample size, another estimator of eswitch(f) is, for even n,

êdivide(f) := (1/n) [ Σ_{i=1}^{n/2} L(f, (y[i], X1[i+n/2,·], X2[i,·]))    (3.4)
                    + Σ_{i=n/2+1}^{n} L(f, (y[i], X1[i−n/2,·], X2[i,·])) ].    (3.5)

Here, rather than summing over all pairs, we divide the sample in half. We then match the first half's values for (Y, X2) with the second half's values for X1 (Line 3.4), and vice versa (Line 3.5). All three of the above estimators (Eqs 3.2, 3.3 & 3.5) are unbiased for their respective estimands, as we discuss in more detail shortly.

Finally, we can estimate MR(f) with the plug-in estimator

M̂R(f) := êswitch(f) / êorig(f),    (3.6)

给定模型 f 和数据集 Z = [y X1 X2],我们通过分别估计公式 3.1 的分子和分母来估计 MR(f)。我们用标准经验损失估计 eorig(f):

êorig(f) := (1/n) Σ_{i=1}^{n} L(f, (y_i, x1_i, x2_i))。    (3.2)

我们通过对所有观测对执行"切换"操作来估计 eswitch(f):

êswitch(f) := [1/(n(n−1))] Σ_{i=1}^{n} Σ_{j≠i} L(f, (y_i, x1_j, x2_i))。    (3.3)

在上式中,我们汇总了 (Y, X2) 与 X1 观测值的所有可能组合,但不包括原始样本中实际观测到的配对。如果由于样本量的原因,对所有可能配对求和(公式 3.3)在计算上不可行,eswitch(f) 的另一个估计量是 êdivide(f):我们不对所有配对求和,而是把样本分成两半,将前一半的 (Y, X2) 取值与后一半的 X1 取值配对(第 3.4 行),反之亦然(第 3.5 行)。正如我们稍后将更详细讨论的,上述三个估计量(公式 3.2、3.3 和 3.5)对各自的估计对象都是无偏的。

最后,我们可以用插入式估计量来估计 MR(f):

M̂R(f) := êswitch(f) / êorig(f),    (3.6)

which we define as the empirical model reliance of f on X1. In this way, we formalize the empirical MR definition in Eq 2.3.

Again, our definition of empirical MR is very similar to the permutation-based variable importance approach of Breiman (2001), where Breiman uses a single random permutation and we consider all possible pairs. To compare these two approaches more precisely, let {π1, . . . , πn!} be a set of n-length vectors, each containing a different permutation of the set {1, . . . , n}. The approach of Breiman (2001) is analogous to computing the loss for a randomly chosen permutation vector from {π1, . . . , πn!}. Similarly, our calculation in Eq 3.3 is proportional to the sum of losses over all possible (n!) permutations, excluding the n unique combinations of the rows of X1 and the rows of [X2 y] that appear in the original sample (see Appendix A.3). Excluding these observations is necessary to preserve the (finite-sample) unbiasedness of êswitch(f).

The estimators êorig(f), êswitch(f) and êdivide(f) all belong to the well-studied class of U-statistics. Thus, under fairly minor conditions, these estimators are unbiased, asymptotically normal, and have finite-sample probabilistic bounds (Hoeffding, 1948, 1963; Serfling, 1980; see also DeLong et al., 1988 for an early use of U-statistics in machine learning, as well as caveats in Demler et al., 2012). To our knowledge, connections between permutation-based importance and U-statistics have not been previously established.

我们将其定义为 f 对 X1 的经验模型依赖。通过这种方式,我们将公式 2.3 中的经验 MR 定义形式化。

同样,我们对经验 MR 的定义与 Breiman (2001) 的基于置换的变量重要性方法非常相似:Breiman 使用单个随机置换,而我们考虑所有可能的配对。为了更准确地比较这两种方法,令 {π1, . . . , πn!} 是一组长度为 n 的向量,每个向量包含集合 {1, . . . , n} 的一个不同排列。Breiman (2001) 的方法类似于针对从 {π1, . . . , πn!} 中随机选择的一个置换向量计算损失。类似地,我们在公式 3.3 中的计算与所有 n! 种可能排列的损失总和成正比,但不包括原始样本中出现的 X1 行与 [X2 y] 行的 n 个组合(见附录 A.3)。排除这些观测对于保持 êswitch(f) 的(有限样本)无偏性是必要的。

估计量 êorig(f)、êswitch(f) 和 êdivide(f) 都属于研究充分的 U 统计量类。因此,在相当弱的条件下,这些估计量是无偏的、渐近正态的,并且具有有限样本概率界(Hoeffding, 1948, 1963; Serfling, 1980;关于 U 统计量在机器学习中的早期应用另见 DeLong et al., 1988,相关注意事项见 Demler et al., 2012)。据我们所知,基于置换的重要性与 U 统计量之间的联系此前尚未被建立。
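作为示意,下面给出 êswitch 的向量化全配对实现(与逐对双重循环结果一致,可直接对照验证),以及正文描述的对半拆分估计量 êdivide 的一个草图。这里假设 n 为偶数、模型 f 支持 NumPy 广播,均为本文之外的示意性设定:

```python
import numpy as np

def e_switch_allpairs(f, y, x1, x2):
    """全配对 êswitch:损失矩阵 M[i, j] = L(f, (y_i, x1_j, x2_i)),
    去掉对角线(原始样本中真实出现的配对)后取平均,对应 Eq 3.3。"""
    n = len(y)
    M = (y[:, None] - f(x1[None, :], x2[:, None])) ** 2
    return (M.sum() - np.trace(M)) / (n * (n - 1))

def e_divide(f, y, x1, x2):
    """对半拆分估计量:前一半的 (y, x2) 配后一半的 x1,反之亦然。"""
    n = len(y)
    m = n // 2
    i = np.arange(m)
    first = np.mean((y[i] - f(x1[i + m], x2[i])) ** 2)
    second = np.mean((y[i + m] - f(x1[i], x2[i + m])) ** 2)
    return 0.5 * (first + second)
```

全配对版本 O(n²) 的内存与计算量在 n 较大时会成为瓶颈,这正是正文引入 êdivide 的动机。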

While the above results from U-statistics depend on the model being fixed a priori, we can also leverage these results to create uniform bounds on the MR estimation error for all models in a sufficiently regularized class F . We formally present this bound in Section 4 (Theorem 5), after introducing required conditions on model class complexity.

The existence of this uniform bound implies that it is feasible to train a model and to evaluate its importance using the same data. This differs from the classical VI approach of Random Forests (Breiman, 2001), which avoids in-sample importance estimation. There, each tree in the ensemble is fit on a random subset of data, and VI for the tree is estimated using the held-out data. The tree-specific VI estimates are then aggregated to obtain a VI estimate for the overall ensemble. Although sample-splitting approaches such as this are helpful in many cases, the uniform bound for MR suggests that they are not strictly necessary, depending on the sample size and the complexity of F.

虽然上述 U 统计量的结果依赖于模型 f 被先验固定,但我们也可以利用这些结果,为一个充分正则化的类 F 中的所有模型建立 MR 估计误差的一致界。在引入对模型类复杂度的必要条件之后,我们在第 4 节(定理 5)中正式给出这个界。

这种一致界的存在意味着,用同一批数据训练模型并评估其重要性是可行的。这不同于随机森林的经典 VI 方法(Breiman, 2001),后者避免样本内重要性估计:集成中的每棵树都拟合在一个随机数据子集上,该树的 VI 用留出数据来估计,然后聚合各棵树的 VI 估计,得到整个集成的 VI 估计。尽管此类样本拆分方法在许多情况下很有用,但 MR 的一致界表明,它们并非绝对必要,这取决于样本量和 F 的复杂度。

3.2 Limitations of Existing Variable Importance Methods

Several common approaches for variable selection, or for describing relationships between variables, do not necessarily capture a variable’s importance. Null hypothesis testing methods may identify a relationship, but do not describe the relationship’s strength. Similarly, checking whether a variable is included by a sparse model-fitting algorithm, such as the Lasso (Hastie et al., 2009), does not describe the extent to which the variable is relied on. Partial dependence plots (Breiman et al., 2001; Hastie et al., 2009) can be difficult to interpret if multiple variables are of interest, or if the prediction model contains interaction effects.

Another common VI procedure is to run a model-fitting algorithm twice, first on all of the data, and then again after removing X1 from the data set. The losses for the two resulting models are then compared to determine the importance, or “necessity,” of X1 (Gevrey et al., 2003). Because this measure is a function of two prediction models rather than one, it does not measure how much either individual model relies on X1. We refer to this approach as measuring empirical Algorithm Reliance (AR) on X1, as the model-fitting algorithm is the common attribute between the two models. Related procedures, which measure the sufficiency of X1, were proposed by Breiman et al. (2001) and Breiman (2001).

‌‌用于变量选择或描述变量之间关系的几种常见方法不一定能捕捉到变量的重要性。零假设检验方法可以识别关系,但不能描述关系的强度。同样,检查变量是否包含在稀疏模型拟合算法中,例如 Lasso (Hastie et al., 2009), 并不能描述变量的依赖程度。如果对多个变量感兴趣,或者如果预测模型包含交互效应,则部分依赖图 (Breiman et al., 2001; Hastie et al., 2009) 可能难以解释。

另一个常见的 VI 程序是把模型拟合算法运行两次:先在全部数据上运行,再在从数据集中删除 X1 后运行。然后比较两个所得模型的损失,以确定 X1 的重要性或"必要性"(Gevrey et al., 2003)。由于该度量是两个预测模型而非一个预测模型的函数,它并不衡量其中任何一个模型对 X1 的依赖程度。我们把这种方法称为测量对 X1 的经验算法依赖(Algorithm Reliance, AR),因为模型拟合算法是两个模型之间的共同属性。Breiman et al. (2001) 和 Breiman (2001) 提出了相关的程序,用于测量 X1 的充分性。
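算法依赖(AR)可以用下面的极简草图说明。这里的拟合算法假定为带截距的普通最小二乘,AR 取"删列重训"模型与全量模型在测试集上的损失之比;函数名与流程均为示意,并非论文的正式定义:

```python
import numpy as np

def ols(X, y):
    """普通最小二乘(带截距),返回预测函数。"""
    Xb = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ beta

def algorithm_reliance(fit, X_tr, y_tr, X_te, y_te, drop_col):
    """AR:同一拟合算法分别在含/不含目标列的数据上各训练一次,
    比较两个所得模型的测试损失(比值 > 1 表示删掉该列后算法变差)。"""
    keep = [j for j in range(X_tr.shape[1]) if j != drop_col]
    f_full = fit(X_tr, y_tr)
    f_drop = fit(X_tr[:, keep], y_tr)
    loss_full = np.mean((y_te - f_full(X_te)) ** 2)
    loss_drop = np.mean((y_te - f_drop(X_te[:, keep])) ** 2)
    return loss_drop / loss_full
```

注意 AR 比较的是两个不同模型,因此它刻画的是算法对某列的依赖,而不是任何单个模型的 MR,这正是正文所指出的区别。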

As we discuss in Section 3.1, the permutation-based VI measure from RFs (Breiman, 2001; Breiman et al., 2001) forms the inspiration for our definition of MR. This RF VI measure has been the topic of empirical studies (Archer and Kimes, 2008; Calle and Urrea, 2010; Wang et al., 2016), and several variations of the measure have been proposed (Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014). Mentch and Hooker (2016) use U-statistics to study predictions of ensemble models fit to subsamples, similar to the bootstrap aggregation used in RFs. Procedures related to “Mean Difference Impurity,” another VI measure derived for RFs, have been studied theoretically by Louppe et al. (2013); Kazemitabar et al. (2017). All of this literature focuses on VI measures for RFs, for ensembles, or for individual trees. Our estimator for model reliance differs from the traditional RF VI measure (Breiman, 2001) in that we permute inputs to the overall model, rather than permuting the inputs to each individual ensemble member. Thus, our approach can be used generally, and is not limited to trees or ensemble models.

Outside of the context of RF VI, Zhu et al. (2015) propose an estimand similar to our definition of model reliance, and Gregorutti et al. (2015, 2017) propose an estimand analogous to eswitch(f) − eorig(f). These recent works focus on the model reliance of f on X1 specifically when f is equal to the conditional expectation function of Y (that is, f(x1, x2) = E[Y|X1 = x1, X2 = x2]). In contrast, we consider model reliance for arbitrary prediction models f. Datta et al. (2016) study the extent to which a model’s predictions are expected to change when a subset of variables is permuted, regardless of whether the permutation affects a loss function L. These VI approaches are specific to a single prediction model, as is MR. In the next section, we consider a more general conception of importance: how much any model in a particular set may rely on the variable of interest.

正如我们在第 3.1 节中讨论的,来自 RF 的基于置换的 VI 度量(Breiman, 2001; Breiman et al., 2001)为我们定义 MR 提供了灵感。这种 RF VI 度量一直是实证研究的主题(Archer 和 Kimes, 2008; Calle 和 Urrea, 2010; Wang et al., 2016),并且已有人提出该度量的若干变体(Strobl et al., 2007, 2008; Altmann et al., 2010; Hapfelmeier et al., 2014)。Mentch 和 Hooker (2016) 使用 U 统计量研究拟合于子样本的集成模型的预测,类似于 RF 中使用的 bootstrap 聚合。与"Mean Difference Impurity"(另一种针对 RF 推导的 VI 度量)相关的程序已由 Louppe et al. (2013) 和 Kazemitabar et al. (2017) 从理论上加以研究。所有这些文献都聚焦于针对 RF、集成或单棵树的 VI 度量。我们的模型依赖估计量与传统的 RF VI 度量(Breiman, 2001)的不同之处在于:我们置换的是整个模型的输入,而不是每个集成成员各自的输入。因此,我们的方法可以普遍使用,不限于树或集成模型。

在 RF VI 的背景之外,Zhu et al. (2015) 提出了一个与我们的模型依赖定义类似的估计对象,Gregorutti et al. (2015, 2017) 提出了一个类似于 eswitch(f) − eorig(f) 的估计对象。这些近期工作专门研究当 f 等于 Y 的条件期望函数(即 f(x1, x2) = E[Y|X1 = x1, X2 = x2])时 f 对 X1 的模型依赖。相反,我们考虑任意预测模型 f 的模型依赖。Datta et al. (2016) 研究当一个变量子集被置换时,模型的预测预计会发生多大变化,而不管置换是否影响损失函数 L。这些 VI 方法都是针对单个预测模型的,MR 也是如此。在下一节中,我们将考虑一个更一般的重要性概念:特定集合中的任何模型可能在多大程度上依赖感兴趣的变量。

4. Model Class Reliance

Like many statistical procedures, our MR measure (Section 3) produces a description of a single predictive model. Given a model with high predictive accuracy, MR describes how much the model’s performance hinges on covariates of interest (X1). However, there will often be many other models that perform similarly well, and that rely on X1 to different degrees. With this notion in mind, we now study how much any well-performing model from a prespecified class F may rely on covariates of interest.

像许多统计程序一样,我们的 MR 测量(第 3 节)产生了对单个预测模型的描述。给定一个具有高预测精度的模型,MR 描述了模型的性能在多大程度上取决于感兴趣的协变量 (X1)。但是,通常会有许多其他模型表现相似,并且在不同程度上依赖 X1。考虑到这个概念,我们现在研究来自预先指定的 F 类的任何表现良好的模型在多大程度上可能依赖于感兴趣的协变量。

4.1 Motivating Empirical Estimators of MCR by Deriving Finite-sample Bounds

In this section we derive finite-sample, probabilistic bounds for MCR+(E) and MCR−(E). Our results imply that, under minimal assumptions, M̂CR+(E) and M̂CR−(E) are respectively within a neighborhood of MCR+(E) and MCR−(E) with high probability. However, the weakness of our assumptions (which are typical for statistical-learning-theoretic analysis) renders the width of our resulting CIs to be impractically large, and so we use these results only to show conditions under which M̂CR+(E) and M̂CR−(E) form sensible point estimates. In Sections 9.1 & 10, below, we apply a bootstrap procedure to account for sampling variability.

To derive these results we introduce three bounded loss assumptions, each of which can be assessed empirically. Let borig, Bind, Bref, Bswitch ∈ R be known constants.

在本节中,我们推导 MCR+(E) 和 MCR−(E) 的有限样本概率界。我们的结果表明,在极弱的假设下,M̂CR+(E) 和 M̂CR−(E) 分别以高概率位于 MCR+(E) 和 MCR−(E) 的邻域内。然而,我们假设的弱(这对统计学习理论分析来说是典型的)使所得 CI 的宽度大到不切实际,因此我们仅用这些结果来给出 M̂CR+(E) 和 M̂CR−(E) 构成合理点估计的条件。在下面的第 9.1 节和第 10 节中,我们应用 bootstrap 程序来处理抽样变异性。

为了推导这些结果,我们引入三个有界损失假设,每一个都可以用经验数据来检验。令 borig、Bind、Bref、Bswitch ∈ R 为已知常数。

5. Extensions of Rashomon Sets Beyond Variable Importance‌

In this section we generalize the Rashomon set approach beyond the study of MR. In Section 5.1, we create finite-sample CIs for other summary characterizations of near-optimal, or best-in-class models. The generalization also helps to illustrate a core aspect of the argument underlying Theorem 4: models with near-optimal performance in the population tend to have relatively good performance in random samples.

In Section 5.2, we review existing literature on near-optimal models.

在本节中,我们将Rashomon集方法推广到 MR 研究之外。在第 5.1 节中,我们为接近最优或同类最佳模型的其他摘要特征创建有限样本 CI。概括还有助于说明定理 4 的核心论点:在总体中具有接近最优性能的模型往往在随机样本中具有相对较好的性能。

在第 5.2 节中,我们回顾了关于接近最优模型的现有文献。

5.1 Finite-sample Confidence Intervals from Rashomon Sets

Rather than describing how much a model relies on X1, here we assume the analyst is interested in an arbitrary characteristic of a model. We denote this characteristic of interest as φ : F → R. For example, if fβ is the linear model fβ(x) = x′β, then φ may be defined as the norm of the associated coefficient vector (that is, φ(fβ) = ‖β‖2), or the prediction fβ would assign given a specific covariate profile xnew (that is, φ(fβ) = fβ(xnew)).

在这里,我们不再描述模型对 X1 的依赖程度,而是假设分析者对模型的任意一个特征感兴趣。我们把这个感兴趣的特征记作 φ : F → R。例如,如果 fβ 是线性模型 fβ(x) = x′β,则 φ 可以定义为相应系数向量的范数(即 φ(fβ) = ‖β‖2),或者 fβ 对给定协变量组合 xnew 给出的预测(即 φ(fβ) = fβ(xnew))。

Still, it is worth emphasizing the generality of Proposition 7. Through this result, Rashomon sets allow us to reframe a wide set of finite-sample inference problems as in-sample optimization problems. The implied CIs are not necessarily in closed form, but the approach still opens an exciting pathway for deriving non-asymptotic results. For example, they imply that existing methods for profile likelihood intervals might be able to be reapplied to achieve finite-sample results. For highly complex model classes where profile likelihoods are difficult to compute, such as neural networks or random forests, approximate inference is sometimes achieved via approximate optimization procedures (for example, Markov chain Monte Carlo for Bayesian additive regression trees, in Chipman et al., 2010). Proposition 7 shows that similar approximate optimization methods could be repurposed to establish approximate, finite-sample inferences for the same model classes.

尽管如此,还是值得强调命题 7 的普遍性。通过这个结果,Rashomon集允许我们将广泛的有限样本推理问题重新构建为样本内优化问题。隐含的 CI 不一定是封闭形式,但该方法仍然为得出非渐近结果开辟了一条令人兴奋的途径。例如,它们暗示现有的轮廓似然区间方法可能能够重新应用以实现有限样本结果。对于轮廓似然性难以计算的高度复杂的模型类,例如神经网络或随机森林,有时通过近似优化程序来实现近似推断(例如,Chipman 等人的用于贝叶斯加性回归树的马尔可夫链蒙特卡罗, 2010)。命题 7 表明,类似的近似优化方法可以重新用于为相同的模型类建立近似的、有限样本的推断。
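一个玩具草图可以说明这一思路:把"对特征 φ 的区间推断"变成"在经验 Rashomon 集上求 φ 的最小值/最大值"。这里模型类取一维线性模型 f_b(x) = b·x,特征取 φ(f_b) = b,性能松弛量 eps 为手工设定,全部是示意性设定而非论文的正式构造:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

betas = np.linspace(0.0, 4.0, 801)                  # 候选模型 f_b(x) = b*x
losses = np.array([np.mean((y - b * x) ** 2) for b in betas])
eps = 0.05                                          # 性能松弛量(示意取值)
rashomon = betas[losses <= losses.min() + eps]      # 经验 Rashomon 集
phi_lo, phi_hi = rashomon.min(), rashomon.max()     # φ(f_b) = b 的区间端点
```

随着 eps 增大,区间 [phi_lo, phi_hi] 单调变宽;当 eps 趋于 0 时,它收缩到经验最优模型的 φ 值,这正体现了"推断问题变为样本内优化问题"的含义。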

5.2 Related Literature on the Rashomon Effect‌

Breiman et al. (2001) introduced the “Rashomon effect” of statistics as a problem of ambiguity: if many models fit the data well, it is unclear which model we should try to interpret. Breiman suggests that ensembling many well-performing models together can resolve this ambiguity, as the new ensemble model may perform better than any of its individual members. However, this approach may only push the problem from the member level to the ensemble level, as there may also be many different ensemble models that fit the data well.

The Rashomon effect has also been considered in several subject areas outside of VI, including those in non-statistical academic disciplines (Heider, 1988; Roth and Mehta, 2002). Tulabandhula and Rudin (2014) optimize a decision rule to perform well under the predicted range of outcomes from any well-performing model. Statnikov et al. (2013) propose an algorithm to discover multiple Markov boundaries, that is, minimal sets of covariates such that conditioning on any one set induces independence between the outcome and the remaining covariates. Nevo and Ritov (2017) report interpretations corresponding to a set of well-fitting, sparse linear models. Meinshausen and Bühlmann (2010) estimate structural aspects of an underlying model (such as the variables included in that model) based on how stable those aspects are across a set of well-fitting models. This set of well-fitting models is identified by repeating an estimation procedure in a series of perturbed samples, using varying levels of regularization (see also Azen et al., 2001). Letham et al. (2016) search for a pair of well-fitting dynamical systems models that give maximally different predictions.

Breiman et al. (2001) 把统计学中的"Rashomon 效应"作为一个歧义性问题提出:如果许多模型都很好地拟合数据,那么我们应该尝试解释哪个模型并不清楚。Breiman 建议把许多性能良好的模型集成在一起来消解这种歧义,因为新的集成模型可能比其任何单个成员表现更好。但是,这种做法可能只是把问题从成员层面推到了集成层面,因为同样可能有许多不同的集成模型都能很好地拟合数据。

在 VI 之外的多个学科领域(包括非统计学科,Heider, 1988; Roth 和 Mehta, 2002)也考虑过 Rashomon 效应。Tulabandhula 和 Rudin (2014) 优化决策规则,使其在任何表现良好的模型所预测的结果范围内都表现良好。Statnikov et al. (2013) 提出一种发现多个马尔可夫边界的算法,即一些最小的协变量集合,使得以其中任何一个集合为条件,结果都与其余协变量独立。Nevo 和 Ritov (2017) 报告了对应于一组拟合良好的稀疏线性模型的解释。Meinshausen 和 Bühlmann (2010) 根据基础模型的结构方面(例如该模型包含哪些变量)在一组拟合良好的模型中的稳定程度来估计这些方面;这组拟合良好的模型是通过在一系列扰动样本中、用不同程度的正则化重复一个估计过程来识别的(另见 Azen et al., 2001)。Letham et al. (2016) 搜索一对给出最大程度不同预测的拟合良好的动力系统模型。

6. Calculating Empirical Estimates of Model Class Reliance

In this section, we propose a binary search procedure to bound the values of M̂CR−(E) and M̂CR+(E) (see Eq 2.4), which respectively serve as estimates of MCR−(E) and MCR+(E) (see Section 4.1). Each step of this search consists of minimizing a linear combination of êorig(f) and êswitch(f) across f ∈ F. Our approach is related to the fractional programming approach of Dinkelbach (1967), but accounts for the fact that the problem is constrained by the value of the denominator, êorig(f). We additionally show that, for many model classes,

在本节中,我们提出一个二分搜索程序来界定 M̂CR−(E) 和 M̂CR+(E) 的值(见公式 2.4),它们分别作为 MCR−(E) 和 MCR+(E) 的估计(见第 4.1 节)。该搜索的每一步都是在 f ∈ F 上最小化 êorig(f) 与 êswitch(f) 的一个线性组合。我们的方法与 Dinkelbach (1967) 的分式规划方法相关,但考虑到了问题受分母 êorig(f) 取值约束这一事实。我们还表明,对于许多模型类,
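下面的玩具草图在一个有限的候选线性模型网格上演示这一思路:MCR+ 是"表现良好"模型中 êswitch/êorig 的最大值;判断 MCR+ ≥ γ 等价于检查线性组合 êswitch − γ·êorig 在该集合上的最大值是否非负,于是可以对 γ 做二分搜索。论文针对的是连续模型类并通过优化求解,这里仅以离散网格与合成数据示意:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
y = x1 + x2 + 0.3 * rng.normal(size=n)

def loss_pair(b1, b2):
    """返回 (ê_orig, ê_switch),模型为 f(x1, x2) = b1*x1 + b2*x2。"""
    e_o = np.mean((y - b1 * x1 - b2 * x2) ** 2)
    M = (y[:, None] - b1 * x1[None, :] - b2 * x2[:, None]) ** 2
    e_s = (M.sum() - np.trace(M)) / (n * (n - 1))
    return e_o, e_s

grid = [(b1, b2) for b1 in np.linspace(0, 2, 21) for b2 in np.linspace(0, 2, 21)]
pairs = np.array([loss_pair(b1, b2) for b1, b2 in grid])
eps_abs = pairs[:, 0].min() + 0.1                 # 绝对性能阈值
good = pairs[pairs[:, 0] <= eps_abs]              # "表现良好"的模型集合
mcr_plus = (good[:, 1] / good[:, 0]).max()        # 直接求最大比值作为对照

lo, hi = 1.0, 1000.0                              # 对 γ 做二分搜索
for _ in range(80):
    mid = 0.5 * (lo + hi)
    if (good[:, 1] - mid * good[:, 0]).max() >= 0:
        lo = mid                                  # 仍存在比值 >= mid 的好模型
    else:
        hi = mid
```

对有限候选集,二分搜索的极限与直接求最大比值一致;它的价值在于,当 F 是连续类时,内层的"最大化线性组合"可以交给(凸)优化器完成。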

7. MR & MCR for Linear Models, Additive Models, and Regression Models in a Reproducing Kernel Hilbert Space

For linear or additive models, many simplifications can be made to our approaches for MR and MCR. To simplify the interpretation of MR, we show below that population-level MR for a linear model can be expressed in terms of the model’s coefficients (Section 7.1). To simplify computation, we show that the cost of computing empirical MR for a linear model grows only linearly in n (Section 7.1), even though the number of terms in the definition of empirical MR grows quadratically (see Eqs 3.3 & 3.6).

Moving on from MR, we show how empirical MCR can be computed for the class of linear models (Section 7.2), for regularized linear models (Section 7.3), and for regression models in a reproducing kernel Hilbert space (RKHS, Section 7.4). To do this, we build on the approach in Section 6 by giving approaches for minimizing arbitrary combinations of êswitch(f) and êorig(f) across f ∈ F. Even when the associated objective functions are non-convex, we can tractably obtain global minima for these model classes. We also discuss procedures to determine an upper bound Bind on the loss for any observation when using these model classes (see Assumption 1).

Throughout this section, we assume that X ⊂ Rp for p ∈ Z+, that Y ⊂ R1, and that L is the squared error loss function L(f, (y, x1, x2)) = (y − f(x1, x2))2. As in Section 6, we also assume that 0 < min_{f∈F} êorig(f), to ensure that empirical MR is finite.

对于线性或可加模型,我们的 MR 和 MCR 方法可以有许多简化。为了简化 MR 的解释,我们在下面证明线性模型的总体水平 MR 可以用模型系数来表示(第 7.1 节)。为了简化计算,我们证明线性模型经验 MR 的计算代价仅随 n 线性增长(第 7.1 节),尽管经验 MR 定义中的项数是二次增长的(参见公式 3.3 和 3.6)。

在 MR 之后,我们展示如何为线性模型类(第 7.2 节)、正则化线性模型(第 7.3 节)以及再生核希尔伯特空间(RKHS,第 7.4 节)中的回归模型计算经验 MCR。为此,我们在第 6 节方法的基础上,给出在 f ∈ F 上最小化 êswitch(f) 与 êorig(f) 任意组合的方法。即使相关的目标函数非凸,我们也能对这些模型类可行地求得全局最小值。我们还讨论了使用这些模型类时,确定任一观测损失的上界 Bind 的程序(参见假设 1)。

在本节中,我们假设 X ⊂ Rp,其中 p ∈ Z+,Y ⊂ R1,并且 L 是平方误差损失函数 L(f, (y, x1, x2)) = (y − f(x1, x2))2。与第 6 节一样,我们还假设 0 < min_{f∈F} êorig(f),以确保经验 MR 是有限的。

7.1 Interpreting and Computing MR for Linear or Additive Models

We begin by considering MR for linear models evaluated with the squared error loss. For this setting, we can show both an interpretable definition of MR, as well as a computationally efficient formula for êswitch(f).

我们首先考虑用平方误差损失评估的线性模型的 MR。在这一设定下,我们既能给出 MR 的可解释定义,也能给出 êswitch(f) 的高效计算公式。
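对线性模型加平方损失,êswitch 的逐对双重求和可以分解为若干样本矩,计算量从 O(n²) 降到 O(n)。下面的草图是一个示意性推导的实现(记 a_i = y_i − b2·x2_i 为去掉 x1 贡献后的部分,b_j = b1·x1_j),并与全配对基准对照验证,它并非论文中给出的正式公式:

```python
import numpy as np

def e_switch_pairs(b1, b2, y, x1, x2):
    """O(n^2) 基准:对所有 i != j 的切换对直接求平均损失。"""
    n = len(y)
    M = (y[:, None] - b2 * x2[:, None] - b1 * x1[None, :]) ** 2
    return (M.sum() - np.trace(M)) / (n * (n - 1))

def e_switch_linear(b1, b2, y, x1, x2):
    """O(n):利用恒等式
    sum_{i!=j} (a_i - b_j)^2
      = n*sum(a^2) - 2*sum(a)*sum(b) + n*sum(b^2) - sum((a_i - b_i)^2)。"""
    n = len(y)
    a = y - b2 * x2            # 模型中不含 x1 的残差部分
    b = b1 * x1                # 模型中 x1 的贡献
    full = n * np.sum(a ** 2) - 2 * np.sum(a) * np.sum(b) + n * np.sum(b ** 2)
    diag = np.sum((a - b) ** 2)
    return (full - diag) / (n * (n - 1))
```

两个函数在任意数据上给出相同的数值,但后者只需对样本扫描一遍,这正是"线性增长"论断背后的机制。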

10. Data Analysis: Reliance of Criminal Recidivism Prediction Models on Race and Sex

Evidence suggests that bias exists among judges and prosecutors in the criminal justice system (Spohn, 2000; Blair et al., 2004; Paternoster and Brame, 2008). In an aim to counter this bias, machine learning models trained to predict recidivism are increasingly being used to inform judges’ decisions on pretrial release, sentencing, and parole (Monahan and Skeem, 2016; Picard-Fritsche et al., 2017). Ideally, prediction models can avoid human bias and provide judges with empirically tested tools. But prediction models can also mirror the biases of the society that generates their training data, and perpetuate the same bias at scale. In the case of recidivism, if arrest rates across demographic groups are not representative of underlying crime rates (Beckett et al., 2006; Ramchand et al., 2006; U.S. Department of Justice - Civil Rights Division, 2016), then bias can be created in both (1) the outcome variable, future crime, which is measured imperfectly via arrests or convictions, and (2) the covariates, which include the number of prior convictions on a defendant’s record (Corbett-Davies et al., 2016; Lum and Isaac, 2016). Further, when a prediction model’s behavior and mechanisms are an opaque black box, the model can evade scrutiny, and fail to offer recourse or explanations to individuals rated as “high risk.”

We focus here on the issue of transparency, which takes an important role in the recent debate about the proprietary recidivism prediction tool COMPAS (Larson et al., 2016; Corbett-Davies et al., 2016). While COMPAS is known to not rely explicitly on race, there is concern that it may rely implicitly on race via proxies–variables statistically dependent with race (see further discussion in Section 11).

Our goal is to identify bounds for how much COMPAS relies on different covariate subsets, either implicitly or explicitly, under certain assumptions (defined below). We analyze a public data set of defendants from Broward County, Florida, in which COMPAS scores have been recorded (Larson et al., 2016). Within this data set, we only included defendants measured as African-American or Caucasian (3,373 in total) due to sparseness in the remaining categories. The outcome of interest (Y) is the COMPAS violent recidivism score. Of the available covariates, we consider three variables which we refer to as “admissible”: an individual’s age, their number of priors, and an indicator of whether the current charge is a felony. We also consider two variables which we refer to as “inadmissible”: an individual’s race and sex. Our labels of “admissible” and “inadmissible” are not intended to be legally precise–indeed, the boundary between these types of labels is not always clear (see Section 10.2). We compute empirical MCR and AR for each variable group, as well as bootstrap CIs for MCR (see Section 9.2).

证据表明,刑事司法系统中的法官和检察官存在偏见(Spohn, 2000; Blair et al., 2004; Paternoster 和 Brame, 2008)。为了对抗这种偏见,用于预测累犯的机器学习模型正越来越多地为法官关于审前释放、量刑和假释的决定提供参考(Monahan 和 Skeem, 2016; Picard-Fritsche et al., 2017)。理想情况下,预测模型可以避免人为偏见,为法官提供经过实证检验的工具。但预测模型也可能映射出生成其训练数据的社会所带有的偏见,并把同样的偏见规模化地延续下去。就累犯而言,如果各人口群体的逮捕率不能代表潜在的犯罪率(Beckett et al., 2006; Ramchand et al., 2006; U.S. Department of Justice - Civil Rights Division, 2016),那么偏见可能同时产生于:(1) 结果变量,即未来犯罪,它通过逮捕或定罪被不完美地测量;(2) 协变量,其中包括被告记录中的先前定罪次数(Corbett-Davies et al., 2016; Lum 和 Isaac, 2016)。此外,当预测模型的行为和机制是一个不透明的黑箱时,该模型可以逃避审查,也无法向被评为"高风险"的个人提供追索或解释。

我们在这里关注透明度问题,这在最近关于专有累犯预测工具 COMPAS 的辩论中发挥了重要作用(Larson 等人, 2016;Corbett-Davies 等人,2016)。虽然众所周知 COMPAS 不明确依赖种族,但有人担心它可能通过代理隐含地依赖种族——统计上依赖于种族的变量(参见第 11 节中的进一步讨论)。

我们的目标是确定在某些假设(定义如下)下,COMPAS 在多大程度上依赖于不同的协变量子集,无论是隐式的还是显式的。我们分析了佛罗里达州布劳沃德县的被告公共数据集,其中记录了 COMPAS 分数(Larson 等人,2016 年)。在该数据集中,由于其余类别的稀疏性,我们仅包括被测量为非裔美国人或高加索人(总共 3,373 名)的被告。感兴趣的结果 (Y) 是 COMPAS 暴力累犯分数。在可用的协变量中,我们考虑了三个我们称之为“可接受”的变量:个人的年龄、他们的先验数量以及当前指控是否为重罪的指标。我们还考虑了两个我们称之为“不可接受”的变量:个人的种族和性别。我们的“可接受”和“不可接受”标签并非旨在在法律上精确——事实上,这些类型的标签之间的界限并不总是清晰的(参见第 10.2 节)。我们计算每个变量组的经验 MCR 和 AR,以及 MCR 的引导 CI(参见第 9.2 节)。

To compute empirical MCR and AR, we consider a flexible class of linear models in a RKHS to predict the COMPAS score (described in more detail below). Given this class, the MCR range (see Eq 2.2) captures the highest and lowest degree to which any model in the class may rely on each covariate subset. We assume that our class contains at least one model that relies on “inadmissible variables” to the same extent that COMPAS relies either on “inadmissible variables” or on proxies that are unmeasured in our sample (analogous to Condition 20). We make the same assumption for “admissible variables.” These assumptions can be interpreted as saying that the reliance values of COMPAS are relatively “well supported” by our chosen model class, and allow us to identify bounds on the MR values for COMPAS. We also consider the more conventional, but less robust approach of AR (Section 3.2), that is, how much would the accuracy suffer for a model-fitting algorithm trained on COMPAS score if a variable subset was removed?

为了计算经验 MCR 和 AR,我们考虑用 RKHS 中一个灵活的线性模型类来预测 COMPAS 分数(下文详细描述)。给定这个类,MCR 范围(见公式 2.2)刻画了类中任何模型对每个协变量子集可能依赖的最高与最低程度。我们假设:我们的模型类中至少包含一个模型,它对"不可接受变量"的依赖程度,与 COMPAS 对"不可接受变量"或对我们样本中未测量的代理变量的依赖程度相同(类似于条件 20)。我们对"可接受变量"作同样的假设。这些假设可以理解为:COMPAS 的依赖值被我们选择的模型类相对"良好地支撑",并使我们能够确定 COMPAS 的 MR 值的界限。我们还考虑更传统但稳健性较差的 AR 方法(第 3.2 节),即:如果删除某个变量子集,以 COMPAS 分数为目标训练的模型拟合算法的精度会受多大影响?

These computations require that we predefine our loss function, model class, and performance threshold. We define MR, MCR, and AR in terms of the squared error loss L(f, (y, x1, x2)) = {y − f(x1, x2)}2. We define our model class FD,rk in the form of Eq 7.6, where we determine D, µ, k, and rk based on a subset S of 500 training observations. We set D equal to the matrix of covariates from S; we set µ equal to the mean of Y in S; we set k equal to the radial basis function kσS, where we choose σS to minimize the cross-validated loss of a Nadaraya-Watson kernel regression (Hastie et al., 2009) fit to S; and we select the parameters rk by cross-validation on S. We set E equal to 0.1 times the cross-validated loss on S. Also using S, we train a reference model fref ∈ FD,rk. Using the held-out 2,873 observations, we then estimate MR(fref) and MCR for FD,rk. To calculate AR, we train models from FD,rk using S, and evaluate their performance in the held-out observations.

这些计算要求我们预先定义损失函数、模型类和性能阈值。我们用平方误差损失 L(f, (y, x1, x2)) = {y − f(x1, x2)}2 来定义 MR、MCR 和 AR。我们以公式 7.6 的形式定义模型类 FD,rk,其中 D、µ、k 和 rk 由 500 个训练观测构成的子集 S 确定:D 取为 S 的协变量矩阵;µ 取为 S 中 Y 的均值;k 取为径向基函数 kσS,其中 σS 的选取使拟合于 S 的 Nadaraya-Watson 核回归(Hastie et al., 2009)的交叉验证损失最小;参数 rk 通过在 S 上交叉验证选出。我们把 E 取为 S 上交叉验证损失的 0.1 倍。同样利用 S,我们训练一个参考模型 fref ∈ FD,rk。然后用保留的 2,873 个观测,我们估计 MR(fref) 以及 FD,rk 的 MCR。为了计算 AR,我们用 S 训练来自 FD,rk 的模型,并在保留观测上评估其性能。

10.1 Results‌

Our results imply that race and sex play somewhere between a null role and a modest role in determining COMPAS score, but that they are less important than “admissible” factors (Figure 8). As a benchmark for comparison, the empirical MR of fref is equal to 1.09 for “inadmissible variables,” and 2.78 for “admissible variables.” The AR is equal to 0.94 and 1.87 for “inadmissible” and “admissible” variables respectively, roughly in agreement with MR. The MCR range for “inadmissible variables” is equal to [1.00,1.56], indicating that for any model in FD,rk with empirical loss no more than E above that of fref, the model’s loss can increase by no more than 56% if race and sex are permuted. Such a statement cannot be made solely based on AR or MR methods, as these methods do not upper bound the reliance values of well-performing models. The bootstrap 95% CI for MCR on “inadmissible variables” is [1.00, 1.73]. Thus, under our assumptions, if COMPAS relied on sex, race, or their unmeasured proxies by a factor greater than 1.73, then intervals as low as what we observe would occur with probability < 0.05.

我们的结果表明,种族和性别在决定 COMPAS 分数上所起的作用介于零与适度之间,但其重要性低于"可接受"因素(图 8)。作为比较基准,fref 的经验 MR 在"不可接受变量"上等于 1.09,在"可接受变量"上等于 2.78。AR 在"不可接受变量"和"可接受变量"上分别等于 0.94 和 1.87,与 MR 大致一致。"不可接受变量"的 MCR 范围等于 [1.00, 1.56],这表明:对于 FD,rk 中经验损失不超过 fref 的损失加 E 的任何模型,如果置换种族和性别,该模型的损失增加不会超过 56%。仅凭 AR 或 MR 方法无法做出这样的论断,因为这些方法不对表现良好的模型的依赖值给出上界。"不可接受变量"上 MCR 的 bootstrap 95% CI 为 [1.00, 1.73]。因此,在我们的假设下,如果 COMPAS 对性别、种族或其未测量代理的依赖系数大于 1.73,那么观测到像我们所观测这样低的区间的概率将小于 0.05。

For “admissible variables” the MCR range is equal to [1.77,3.61], with a 95% bootstrap CI of [1.62, 3.96]. Under our assumptions, this implies if COMPAS relied on age, number of priors, felony indication, or their unmeasured proxies by a factor lower than 1.77, then intervals as high as what we observe would occur with probability < 0.05. This result is consistent with Rudin et al. (2019), who find age to be highly predictive of COMPAS score.

It is worth noting that the upper limit of 3.61 maximizes empirical MR on “admissible variables” not only among well-performing models, but globally across all models in the class (see Figure 8, and Eq 6.5). In other words, it is not possible to find models in FD,rk that perform arbitrarily poorly on perturbed data, but still perform well on unperturbed data, and so the ratio of êswitch(f) to êorig(f) has a finite upper bound. Because the regularization constraints of FD,rk preclude MR values higher than 3.61, the MR of COMPAS on “admissible variables” may be underestimated by empirical MCR. Note also that both MCR intervals are left-truncated at 1, as it is often sufficiently precise to conclude that there exists a well-performing model with no reliance on the variables of interest (that is, MR equal to 1; see Appendix A.2).

对于"可接受变量",MCR 范围等于 [1.77, 3.61],95% bootstrap CI 为 [1.62, 3.96]。在我们的假设下,这意味着:如果 COMPAS 对年龄、先前定罪次数、重罪指标或其未测量代理的依赖系数低于 1.77,那么观测到像我们所观测这样高的区间的概率将小于 0.05。这一结果与 Rudin et al. (2019) 的发现一致,他们发现年龄对 COMPAS 分数具有很高的预测力。

值得注意的是,上限 3.61 不仅在表现良好的模型中、而且在该类全部模型中使"可接受变量"上的经验 MR 达到最大(参见图 8 和公式 6.5)。换句话说,在 FD,rk 中不可能找到在扰动数据上表现任意差、却仍在未扰动数据上表现良好的模型,因此 êswitch(f) 与 êorig(f) 的比率有一个有限上界。由于 FD,rk 的正则化约束排除了高于 3.61 的 MR 值,COMPAS 在"可接受变量"上的 MR 可能被经验 MCR 低估。还要注意,两个 MCR 区间都在 1 处被左截断,因为通常可以足够精确地得出:存在一个不依赖感兴趣变量(即 MR 等于 1)的表现良好的模型(参见附录 A.2)。

10.2 Discussion & Limitations

Asking whether a proprietary model relies on sex and race, after adjusting for other covariates, is related to the fairness metric known as conditional statistical parity (CSP). A decision rule satisfies CSP if its decisions are independent of a sensitive variable, conditional on a set of admissible covariates.

在调整其他协变量之后,询问一个专有模型是否依赖性别和种族,与被称为条件统计均等(conditional statistical parity, CSP)的公平性度量相关。如果决策规则的决策在以一组可接受协变量为条件时独立于敏感变量,则该决策规则满足 CSP。

11. Conclusion

In this article, we propose MCR as the upper and lower limit on how important a set of variables can be to any well-performing model in a class. In this way, MCR provides a more comprehensive and robust measure of importance than traditional importance measures for a single model. We derive bounds on MCR, which motivate our choice of point estimates. We also derive connections between permutation importance, U-statistics, conditional variable importance, and conditional causal effects. We apply MCR in a data set of criminal recidivism, in order to help inform the characteristics of the proprietary model COMPAS.

Several exciting areas remain open for future research. One research direction closely related to our current work is the development of exact or approximate MCR computation procedures for other model classes and loss functions. We have shown that, for model classes where minimizing the empirical loss is a convex optimization problem, MCR can be conservatively computed via a series of convex optimization problems. Further, we have shown that computing MCR is often no more challenging than minimizing the empirical loss over a reweighted sample. General computation procedures for MCR are still an open research area.

在本文中,我们建议将 MCR 作为一组变量对类中任何表现良好的模型的重要性的上限和下限。通过这种方式,MCR 为单一模型提供了比传统重要性度量更全面、更稳健的重要性度量。我们推导出 MCR 的界限,这促使我们选择点估计。我们还推导出排列重要性、U-统计量、条件变量重要性和条件因果效应之间的联系。我们将 MCR 应用于犯罪再犯数据集,以帮助了解专有模型 COMPAS 的特征。

几个令人兴奋的领域仍有待未来研究。与我们当前工作密切相关的一个研究方向,是为其他模型类和损失函数开发精确或近似的 MCR 计算程序。我们已经证明,对于"最小化经验损失是一个凸优化问题"的模型类,可以通过一系列凸优化问题保守地计算 MCR。此外,我们已经表明,计算 MCR 通常并不比在重加权样本上最小化经验损失更困难。MCR 的通用计算程序仍是一个开放的研究领域。

Another direction is to consider MCR for variable selection. If MCR+ is small for a variable, then no well-performing predictive model can heavily depend on that variable, indicating that it can be eliminated.

Our theoretical analysis of Rashomon sets depends on F and fref being prespecified. Above, we have actualized this by splitting our sample into subsets of size n1 and n2, using the first subset to determine F and fref, and conditioning on F and fref when estimating MCR in the second subset. As a result, the boundedness constants in our assumptions (Bind, Bref, Bswitch and borig) depend on F, and hence on n1. However, because our results are non-asymptotic, we have not explored how Rashomon sets behave when n1 and n2 grow at different rates. An exciting future extension of this work is to study sequences of triples {En1, fref,n1, Fn1} that change as n1 increases, and the corresponding Rashomon sets R(En1, fref,n1, Fn1), as this may more thoroughly capture how model classes are determined by analysts.

While we develop Rashomon sets with the goal of studying MR, Rashomon sets can also be useful for finite sample inferences about a wide variety of other attributes of best-in-class models (for example, Section 5). Characterizations of a Rashomon set itself may also be of interest. For example, in ongoing work, we are studying the size of a Rashomon set, and its connection to generalization of models and model classes (Semenova and Rudin, 2019). We are additionally developing methods for visualizing Rashomon sets (Dong and Rudin, 2019).

另一个方向是考虑把 MCR 用于变量选择。如果某个变量的 MCR+ 很小,那么没有任何表现良好的预测模型会严重依赖该变量,这表明它可以被剔除。

我们对 Rashomon 集的理论分析依赖于 F 和 fref 被预先指定。在上文中,我们通过把样本拆成大小为 n1 和 n2 的两个子集来实现这一点:用第一个子集确定 F 和 fref,并在第二个子集中估计 MCR 时以 F 和 fref 为条件。因此,我们假设中的有界常数(Bind、Bref、Bswitch 和 borig)依赖于 F,从而依赖于 n1。然而,由于我们的结果是非渐近的,我们没有探讨当 n1 和 n2 以不同速率增长时 Rashomon 集的行为。这项工作一个令人兴奋的未来扩展,是研究随 n1 增大而变化的三元组序列 {En1, fref,n1, Fn1} 以及相应的 Rashomon 集 R(En1, fref,n1, Fn1),因为这可能更全面地刻画分析者如何确定模型类。

虽然我们开发 Rashomon 集的目标是研究 MR,但 Rashomon 集同样可用于对同类最佳模型的各种其他属性做有限样本推断(例如第 5 节)。Rashomon 集本身的特征刻画也可能令人感兴趣。例如,在正在进行的工作中,我们正在研究 Rashomon 集的大小及其与模型和模型类泛化能力的联系(Semenova 和 Rudin, 2019)。我们还在开发可视化 Rashomon 集的方法(Dong 和 Rudin, 2019)。
