@INPROCEEDINGS{Tagasovska2019QinputLoss,
title = {Single-Model Uncertainties for Deep Learning},
author = {Natasa Tagasovska and David Lopez-Paz},
booktitle = {NIPS},
year = {2019},
pages = {6417--6428}
}
1. Abstract
We provide single-model estimates of aleatoric and epistemic uncertainty for deep neural networks.
To estimate aleatoric uncertainty, we propose Simultaneous Quantile Regression (SQR), a loss function to learn all the conditional quantiles of a given target variable.
These quantiles can be used to compute well-calibrated prediction intervals.
To estimate epistemic uncertainty, we propose Orthonormal Certificates (OCs), a collection of diverse non-constant functions that map all training samples to zero.
These certificates map out-of-distribution examples to non-zero values, signaling epistemic uncertainty.
Our uncertainty estimators are computationally attractive, as they do not require ensembling or retraining deep models, and achieve competitive performance.
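Note:
- This paper studies how to quantify the uncertainty of deep neural networks. Uncertainty can be roughly divided into two kinds: aleatoric uncertainty and epistemic uncertainty.
- The paper proposes one quantification method for each kind: Simultaneous Quantile Regression (SQR) for aleatoric uncertainty and Orthonormal Certificates (OCs) for epistemic uncertainty.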
2. Background
2.1. approximation uncertainty
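Approximation error. It describes the error incurred when a simple model cannot fit complex data (for example, the error of a linear model fitted to a sine curve).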
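The sine example can be checked numerically. The sketch below is only an illustration under stated assumptions: noise-free samples of sin(2πx) on [0, 1], an ordinary least-squares line, and hypothetical names (`fit_line`, `xs`, `ys`). Even though the data contains no noise at all, the best line still leaves a sizeable residual, and that residual is the approximation error.

```python
import math


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Noise-free targets: any remaining error comes from the model class alone.
xs = [i / 199 for i in range(200)]
ys = [math.sin(2 * math.pi * x) for x in xs]

a, b = fit_line(xs, ys)
mse = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"slope={a:.3f}, intercept={b:.3f}, mse={mse:.3f}")
```

More data does not shrink this error; only a richer model class does.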
2.2. aleatoric uncertainty
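Uncertainty due to inherent randomness. As I understand it, this is the error already present in the labeled data itself, which essentially comes from measurement.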
2.3. epistemic uncertainty
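Uncertainty due to lack of knowledge. It is the model's uncertainty about its predictions on test data, and it reflects generalization error.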
3. Algorithm
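Figure: training data with non-Gaussian noise (blue dots), predicted median (solid line), 65% and 80% quantiles (dashed lines), aleatoric uncertainty or 95% prediction interval (gray shading, estimated by SQR), and epistemic uncertainty (pink shading, estimated by OCs).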
3.1. SQR
$$
\ell_{\tau}(\hat{y}, y)=\begin{cases} \tau|y-\hat{y}|, &\text{if } y-\hat{y} \geq 0;\\ (1- \tau)|y-\hat{y}|,& \text{else}. \end{cases}
$$
Equivalently, without the absolute values:
$$
\ell_{\tau}(\hat{y}, y)=\begin{cases} \tau (y-\hat{y}), &\text{if } y-\hat{y} \geq 0;\\ (1- \tau)(\hat{y}-y),& \text{else}. \end{cases}
$$
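Note:
Mind the signs: the first branch uses $y-\hat{y}$ and the second uses $\hat{y}-y$, so the loss is never negative. In essence this is an asymmetrically weighted MAE loss whose two weights, $\tau$ and $1-\tau$, sum to 1.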
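import torch


class QuantileLoss(torch.nn.Module):
    def __init__(self):
        super(QuantileLoss, self).__init__()

    def forward(self, yhat, y, tau):
        diff = yhat - y
        # Weight is (1 - tau) where yhat >= y and -tau otherwise, so that
        # mask * diff equals the pinball loss in both branches.
        mask = (diff.ge(0).float() - tau).detach()
        return (mask * diff).mean()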
Ordinary quantile loss:
$$
\hat{f}_{\tau} \in \argmin_f{\frac{1}{n}\sum_{i=1}^n\ell_{\tau}(f(x_i), y_i)} \tag{1}
$$
The ordinary quantile loss fits only the single quantile level $\tau$ it is trained for. Note that the fitted function $f$ takes only the training input $x_i$ as its argument.
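To see why a fixed $\tau$ yields exactly one quantile, consider the simplest possible model, a constant prediction. The sketch below is plain Python rather than PyTorch and is not the authors' code; the exponential noise, the sample size and the helper names (`pinball_loss`, `mean_pinball`) are illustrative assumptions. It searches by brute force for the constant with the lowest average pinball loss and compares it with the empirical quantile of the sample.

```python
import random


def pinball_loss(y_hat, y, tau):
    """Pinball loss for one prediction/target pair at quantile level tau."""
    diff = y - y_hat
    return tau * diff if diff >= 0 else (1 - tau) * (-diff)


def mean_pinball(y_hat, ys, tau):
    """Average pinball loss of a constant prediction y_hat over all targets."""
    return sum(pinball_loss(y_hat, y, tau) for y in ys) / len(ys)


random.seed(0)
# Skewed, non-Gaussian targets (an assumption made for this illustration).
ys = sorted(random.expovariate(3.0) for _ in range(1001))

tau = 0.9
# Best constant prediction among the observed values.
best = min(ys, key=lambda c: mean_pinball(c, ys, tau))
# Order statistic at the 0.9 level of the sorted sample.
empirical_q = ys[int(tau * (len(ys) - 1))]
print(best, empirical_q)
```

Changing `tau` moves the minimizer to a different quantile, which is why equation (1) needs one trained model per quantile level.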
SQR Loss:
$$
\hat{f} \in \argmin_f{\frac{1}{n}\sum_{i=1}^n\mathbb{E}_{\tau \sim U[0, 1]}[\ell_{\tau}(f(x_i, \tau), y_i)]} \tag{2}
$$
The SQR loss aims to fit all quantile levels $\tau \sim U[0, 1]$ at once. Note that the fitted function $f$ now takes both the training input $x_i$ and the quantile level $\tau$ as arguments.
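The expectation over $\tau$ in equation (2) can be estimated by drawing one fresh $\tau$ per sample. The sketch below does this in plain Python for a toy problem whose quantiles are known in closed form. The data model $y = x + \text{Exp}(1)$ and the two candidate functions (`true_quantile`, `median_only`) are assumptions chosen for illustration; no network is trained here. A function that uses $\tau$ correctly should score a lower SQR loss than one that ignores $\tau$ and always returns the median.

```python
import math
import random


def pinball_loss(y_hat, y, tau):
    """Pinball loss for one prediction/target pair at quantile level tau."""
    diff = y - y_hat
    return tau * diff if diff >= 0 else (1 - tau) * (-diff)


def sqr_loss(f, xs, ys, rng):
    """Monte Carlo estimate of the SQR objective: one tau ~ U[0, 1] per sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        tau = rng.random()
        total += pinball_loss(f(x, tau), y, tau)
    return total / len(xs)


def true_quantile(x, tau):
    """Exact conditional quantile of y = x + Exp(1) noise."""
    return x - math.log(1.0 - tau)


def median_only(x, tau):
    """Ignores tau and always predicts the conditional median."""
    return x + math.log(2.0)


rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(5000)]
ys = [x + rng.expovariate(1.0) for x in xs]

# Same seed for both candidates so they are scored on identical tau draws.
loss_true = sqr_loss(true_quantile, xs, ys, random.Random(1))
loss_median = sqr_loss(median_only, xs, ys, random.Random(1))
print(loss_true, loss_median)
```

In a training loop the same idea appears as sampling a new batch of quantile levels at every step and passing them to the model as an extra input.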
Prediction interval:
$$
u_\alpha(x^*) := \hat{f}(x^*, 1 - \alpha/2) - \hat{f}(x^*, \alpha/2) \tag{3}
$$
Here $\alpha$ is the significance level, and $u_\alpha(x^*)$ is the width of the prediction interval at $x^*$ for confidence level $1-\alpha$. For example, with $\alpha=0.05$, i.e. at the 95% level, $u_{0.05}(x^*) := \hat{f}(x^*, 0.975) - \hat{f}(x^*, 0.025)$.
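Equation (3) can be exercised without a trained network by substituting a quantile function that is known exactly. In the sketch below, `f_hat` is a stand-in (an assumption, not a trained SQR model) that returns the exact quantiles of $y = x + \text{Exp}(1)$. The code computes the interval width at one test point and then checks how often fresh samples fall inside the interval.

```python
import math
import random


def f_hat(x, tau):
    """Stand-in for a trained SQR model: exact quantiles of y = x + Exp(1)."""
    return x - math.log(1.0 - tau)


def interval(x, alpha):
    """Lower and upper ends of the (1 - alpha) prediction interval at x."""
    return f_hat(x, alpha / 2), f_hat(x, 1 - alpha / 2)


alpha = 0.05
lo, hi = interval(0.3, alpha)
width = hi - lo  # u_alpha(x*) at x* = 0.3

# Empirical coverage on fresh samples from the same toy distribution.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(5000)]
ys = [x + rng.expovariate(1.0) for x in xs]
covered = 0
for x, y in zip(xs, ys):
    low, high = interval(x, alpha)
    if low <= y <= high:
        covered += 1
coverage = covered / len(xs)
print(width, coverage)
```

With a learned model the coverage will generally deviate from $1-\alpha$, and that deviation is one way to judge how well calibrated the intervals are.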
3.2. OCs
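OCs are a technique for quantifying epistemic uncertainty. Epistemic uncertainty describes errors tied to the model's lack of experience in certain regions of the feature space. Put plainly, data the model has never seen carries epistemic uncertainty. "Never seen" can refer to classes (the training set contains only cats and dogs, never a pig) or to data transformations (the training set contains cats and dogs, but not cats and dogs after image transformations such as rotation, cropping, or added noise).
To handle this, the paper wants to separate data into seen and unseen; as I understand it, this turns into a binary classification problem. What is unusual is that only the positive class (seen data) is available: there are no unseen examples, because anything in the training set becomes seen once it has been trained on. Quite curious 😑.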
$$
\hat{C} \in \argmin_{C \in \mathbb{R}^{h \times k}}\frac{1}{n}\sum_{i=1}^n\ell_c(C^{\mathsf{T}}\phi(x_i), 0) + \lambda \cdot \|C^{\mathsf{T}}C - I_k\| \tag{4}
$$
Here $\phi(x_i)$ acts as a feature-extraction layer whose output has dimension $h$. $\hat{C}$ amounts to $k$ classifiers that measure whether the current input has been seen in the training set: seen inputs map to 0, and the training set contains no samples labeled 1. The regularizer $\|C^{\mathsf{T}}C - I_k\|$ keeps the $k$ classifiers different from one another (diverse); it simply pushes the columns of $C$ toward orthonormality, which guarantees that they are linearly independent.
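The effect of the regularizer is easiest to see on two small matrices. The sketch below assumes the Frobenius norm for $\|\cdot\|$ (the formula above does not say which norm is meant) and uses a hypothetical helper `gram_penalty`. A matrix with orthonormal columns has zero penalty, while a matrix whose two certificates are identical is penalized.

```python
def gram_penalty(C):
    """Frobenius norm of C^T C - I_k for an h x k matrix C (list of rows)."""
    h, k = len(C), len(C[0])
    total = 0.0
    for i in range(k):
        for j in range(k):
            # Entry (i, j) of the Gram matrix C^T C.
            g = sum(C[r][i] * C[r][j] for r in range(h))
            target = 1.0 if i == j else 0.0
            total += (g - target) ** 2
    return total ** 0.5


# h = 3 features, k = 2 certificates (one certificate per column).
orthonormal = [[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]]
duplicated = [[1.0, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]]

print(gram_penalty(orthonormal))  # columns are orthonormal
print(gram_penalty(duplicated))   # both columns are the same certificate
```

Without this term, all $k$ certificates could collapse to the same function, or to the zero map, and still send every training sample to zero.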
Finally, the epistemic uncertainty is computed as:
$$
u_e(x^*):=\|C^{\mathsf{T}}\phi(x^*)\| = \|\mathbf{a}^{\mathsf{T}}\phi(x^*)+\mathbf{b}\|
$$
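A hand-built example shows how this score separates seen from unseen inputs. In the sketch below the training features are assumed to lie on the line $(t, t, 0)$ in $\mathbb{R}^3$, and the two certificates are picked by hand to be orthonormal and orthogonal to that line. In the paper they are learned by gradient descent, so this is only an illustration, and `epistemic_score` is a hypothetical helper with no bias term.

```python
import math

# h x k = 3 x 2 certificate matrix; each column is one certificate.
# Both columns are orthogonal to every training feature of the form (t, t, 0).
C = [
    [1 / math.sqrt(2), 0.0],
    [-1 / math.sqrt(2), 0.0],
    [0.0, 1.0],
]


def epistemic_score(phi):
    """Euclidean norm of C^T phi: zero on the training line, positive off it."""
    h, k = len(C), len(C[0])
    outputs = [sum(C[r][j] * phi[r] for r in range(h)) for j in range(k)]
    return math.sqrt(sum(o * o for o in outputs))


in_distribution = [3.0, 3.0, 0.0]       # lies on the training line
out_of_distribution = [1.0, -1.0, 2.0]  # leaves the training line

print(epistemic_score(in_distribution))
print(epistemic_score(out_of_distribution))
```

The score grows with the distance of the feature vector from the region spanned by the training features, which is the behavior the certificates are trained to produce.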
4. Code
# Copyright 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
from matplotlib import pyplot as plt
import torch
from torch.utils.data import DataLoader, TensorDataset


class QuantileLoss(torch.nn.Module):
    def __init__(self):
        super(QuantileLoss, self).__init__()

    def forward(self, yhat, y, tau):
        diff = yhat - y
        mask = (diff.ge(0).float() - tau).detach()
        return (mask * diff).mean()
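

def augment(x, tau=None):
    # Append the quantile level tau (centered and rescaled) as an extra
    # input column; defaults to the median, tau = 0.5.
    if tau is None:
        tau = torch.zeros(x.size(0), 1).fill_(0.5)

    return torch.cat((x, (tau - 0.5) * 12), 1)
    # return x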


def build_certificates(x, k=100, epochs=500):
    # k linear certificates on top of the features x
    c = torch.nn.Linear(x.size(1), k)

    loader = DataLoader(TensorDataset(x),
                        shuffle=True,
                        batch_size=128)

    opt = torch.optim.Adam(c.parameters())

    for epoch in range(epochs):
        for xi in loader:
            opt.zero_grad()
            # certificates should output zero on the training features ...
            error = c(xi[0]).pow(2).mean()
            # ... while staying close to orthonormal, which keeps them diverse
            penalty = (c.weight @ c.weight.t() - torch.eye(k)).pow(2).mean()
            (error + penalty).backward()
            opt.step()

    return c


def simple_network(n_inputs=1, n_outputs=1, n_hiddens=100):
    return torch.nn.Sequential(
        torch.nn.Linear(n_inputs, n_hiddens),
        torch.nn.ReLU(),
        torch.nn.Linear(n_hiddens, n_hiddens),
        torch.nn.ReLU(),
        # use n_outputs here (the original hard-coded 1, the value used in this script)
        torch.nn.Linear(n_hiddens, n_outputs))
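

def generate_data(n=1024):
    sep = 1
    # inputs come from two intervals separated by a gap of width sep
    x = torch.zeros(n // 2, 1).uniform_(0, 0.5)
    x = torch.cat((x, torch.zeros(n // 2, 1).uniform_(0.5 + sep, 1 + sep)), 0)
    # skewed, non-Gaussian (exponential) noise
    m = torch.distributions.Exponential(torch.tensor([3.0]))
    noise = m.rsample((n,))
    y = (2 * 3.1416 * x).sin() + noise
    # the test grid also covers the gap and the regions outside the training range
    x_test = torch.linspace(-0.5, 2.5, 100).view(-1, 1)
    return x, y, x_test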


def train_network(x, y, epochs=500):
    # one extra input for the quantile level tau
    net = simple_network(x.size(1) + 1, y.size(1))
    # net = simple_network(x.size(1), y.size(1))
    optimizer = torch.optim.Adam(net.parameters())
    loss = QuantileLoss()
    loader = DataLoader(TensorDataset(x, y), shuffle=True, batch_size=128)

    for _ in range(epochs):
        for xi, yi in loader:
            optimizer.zero_grad()
            # sample a fresh quantile level for every example in the batch
            taus = torch.rand(xi.size(0), 1)
            loss(net(augment(xi, taus)), yi, taus).backward()
            optimizer.step()

    return net


# train main network ########################################################
torch.manual_seed(0)
x, y, test_x = generate_data(1000)
test_x = (test_x - x.mean(0)) / x.std(0)
x = (x - x.mean(0)) / x.std(0)
net = train_network(x, y)

# SQR predictions at the 2.5%, 50% and 97.5% quantile levels
taus = torch.zeros(test_x.size(0), 1)
pred_low = net(augment(test_x, taus + 0.025)).detach().numpy().ravel()
pred_med = net(augment(test_x, taus + 0.500)).detach().numpy().ravel()
pred_hig = net(augment(test_x, taus + 0.975)).detach().numpy().ravel()

# features from the network without its last ReLU and Linear layer (tau = 0.5)
f = net[:-2](augment(x)).detach()
test_f = net[:-2](augment(test_x)).detach()
test_y = net(augment(test_x)).detach().numpy().ravel()

# train certificates on the training features, score the test features,
# then rescale the scores to [0, 3] for plotting
cert = build_certificates(f)
scores = cert(test_f).pow(2).mean(1).detach().numpy()
scores = (scores - scores.min()) / (scores.max() - scores.min()) * 3

plt.figure(figsize=(5, 3))
plt.plot(x.numpy(), y.numpy(), '.', alpha=0.15)
plt.plot(test_x.view(-1).numpy(),
         pred_med,
         color="gray",
         alpha=0.5,
         lw=2)
plt.fill_between(test_x.view(-1).numpy(),
                 pred_low,
                 pred_hig,
                 color="gray",
                 alpha=0.25,
                 label='aleatoric')
plt.fill_between(test_x.view(-1).numpy(),
                 pred_med - scores,
                 pred_med + scores,
                 color="pink",
                 alpha=0.25,
                 label='epistemic')
plt.ylim(-2, 2.75)
plt.legend(loc=3)
# plt.tight_layout(0, 0, 0)
plt.savefig("toy_example.pdf")
plt.show()
5. Summary
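The biggest difference from prior work is that this paper quantifies uncertainty with a simple network structure (the certificates are just a linear layer), whereas most work on uncertainty estimation relies on Bayesian neural networks. For aleatoric uncertainty, it extends the ordinary quantile loss and alleviates the crossing-quantiles problem. For epistemic uncertainty, it uses a set of Orthonormal Certificates, i.e. a set of binary classifiers that see only the positive class during training. On test samples unlike anything in the training set, their outputs deviate from the positive-class prediction, and the size of that deviation across a diverse set of classifiers is used to quantify epistemic uncertainty.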