Introduction
- The authors propose hyperspherical prototype networks, which use prototypes to perform both classification and regression within a single unified framework.
Hyperspherical prototypes
Classification
- Positioning hyperspherical prototypes. Before training the model, the authors fix the positions of the hyperspherical prototypes in advance so that they are spread uniformly over the hypersphere. Let the optimal set of prototypes be $\mathbf P^*$; then $\mathbf P^*$ minimizes the largest cosine similarity between any pair of prototypes, i.e.
$$\mathbf P^* = \arg\min_{\mathbf P'} \max_{(k, l, k \neq l) \in C} \cos \theta_{\left(\mathbf{p}_k^{\prime}, \mathbf{p}_l^{\prime}\right)}$$
Taking $\max_{(k, l, k \neq l) \in C} \cos \theta_{\left(\mathbf{p}_k^{\prime}, \mathbf{p}_l^{\prime}\right)}$ as the loss and running gradient descent would already yield hyperspherical prototypes, but the authors consider this inefficient: every step computes all pairwise cosine similarities yet only updates the single most similar pair. They therefore propose the loss below, which for each prototype pushes away its most similar neighbour, so that $K$ pairs are updated per step (a code sketch follows this list):
$$\mathcal L_{\mathrm{HP}} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \in C} \left( \hat{\mathbf P} \hat{\mathbf P}^{\mathsf T} - 2\mathbf I \right)_{ij}$$
Here $K$ is the number of classes, $C$ the set of classes, and $\hat{\mathbf P} \hat{\mathbf P}^{\mathsf T}$ the matrix of pairwise prototype similarities; $2\mathbf I$ is subtracted to avoid self-selection. Using this loss, the prototypes are optimized with gradient descent (SGD with a learning rate of 0.01 and momentum of 0.9) and projected back onto the hypersphere after every step; iterating yields the desired hyperspherical prototypes.
- Prototypes with privileged information. To further incorporate class semantics, so that semantically related classes obtain closer prototypes than semantically unrelated ones, the authors use word embeddings of the class names $\mathbf W = \{\mathbf w_1, ..., \mathbf w_K\}$ and introduce the following ranking-based loss function (also sketched after this list),
$$\mathcal L_{\mathrm{PI}} = \sum_{(i, j, k) \in T} \left[ -\bar S_{ijk} \log S_{ijk} - \left(1 - \bar S_{ijk}\right) \log\left(1 - S_{ijk}\right) \right]$$
Here $T$ is the set of all class triplets, the ground truth is $\bar S_{ijk} = \llbracket \cos \theta_{\mathbf{w}_i, \mathbf{w}_j} \geq \cos \theta_{\mathbf{w}_i, \mathbf{w}_k} \rrbracket$, and the output is $S_{ijk} \equiv \frac{e^{o_{ijk}}}{1 + e^{o_{ijk}}}$ with $o_{ijk} = \cos\theta_{\mathbf p_i, \mathbf p_j} - \cos\theta_{\mathbf p_i, \mathbf p_k}$. The sum of the two losses above is the final pre-training loss for the hyperspherical prototypes.
- Classification. The training loss maximizes the cosine similarity between a sample's features $\mathbf z_i = f_\phi(\mathbf x_i)$ and its class prototype; the prototypes themselves are not updated during this stage (sketched below),
$$\mathcal L = \frac{1}{N} \sum_{i=1}^{N} \left(1 - \cos \theta_{\left(\mathbf z_i, \mathbf p_{y_i}\right)}\right)^2$$
At inference time, the model predicts the class whose prototype is most aligned with the sample's features,
$$\hat y = \arg\max_{c \in C} \cos \theta_{\left(\mathbf z, \mathbf p_c\right)}$$
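Below is a minimal sketch of the prototype pre-positioning step, assuming a PyTorch implementation; the function name `position_prototypes` and its arguments are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def position_prototypes(num_classes: int, dims: int, steps: int = 1000) -> torch.Tensor:
    """Spread `num_classes` prototypes (near-)uniformly over the unit hypersphere."""
    prototypes = F.normalize(torch.randn(num_classes, dims), dim=1)
    prototypes.requires_grad_(True)
    optimizer = torch.optim.SGD([prototypes], lr=0.01, momentum=0.9)

    for _ in range(steps):
        # Pairwise cosine similarities; subtracting 2I keeps a prototype
        # from selecting itself as its own nearest neighbour.
        similarities = prototypes @ prototypes.t() - 2.0 * torch.eye(num_classes)
        # For every prototype, penalize only its most similar neighbour,
        # so K pairs are updated per step.
        loss = similarities.max(dim=1).values.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Project the prototypes back onto the hypersphere after each update.
        with torch.no_grad():
            prototypes.div_(prototypes.norm(dim=1, keepdim=True))
    return prototypes.detach()
```

Re-projecting after every step keeps the optimization on the hypersphere, so the dot products in the loss remain valid cosine similarities.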
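The ranking-based privileged-information loss can be sketched in the same spirit. The naive loop over all ordered class triplets and the helper name `privileged_info_loss` are for illustration only, and the binary cross-entropy form is an assumption consistent with the $\bar S_{ijk}$ / $S_{ijk}$ definitions above.

```python
import itertools
import torch
import torch.nn.functional as F

def privileged_info_loss(prototypes: torch.Tensor, word_embeddings: torch.Tensor) -> torch.Tensor:
    """Ranking loss aligning prototype similarities with class-name word-embedding similarities."""
    protos = F.normalize(prototypes, dim=1)
    words = F.normalize(word_embeddings, dim=1)
    proto_sim = protos @ protos.t()   # cos(theta_{p_i, p_j})
    word_sim = words @ words.t()      # cos(theta_{w_i, w_j})

    losses = []
    # Naive enumeration of all ordered class triplets (i, j, k).
    for i, j, k in itertools.permutations(range(prototypes.size(0)), 3):
        target = (word_sim[i, j] >= word_sim[i, k]).float()        # ground truth \bar S_ijk
        score = torch.sigmoid(proto_sim[i, j] - proto_sim[i, k])   # output S_ijk from o_ijk
        losses.append(F.binary_cross_entropy(score, target))
    return torch.stack(losses).mean()
```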
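Finally, a sketch of the classification training loss and the nearest-prototype prediction rule, assuming the squared cosine-distance formulation above; the prototypes are unit-normalized and kept frozen, and `features` stands for the output of an arbitrary encoder.

```python
import torch
import torch.nn.functional as F

def classification_loss(features: torch.Tensor, labels: torch.Tensor,
                        prototypes: torch.Tensor) -> torch.Tensor:
    """Squared cosine-distance loss between sample features and their (fixed) class prototypes."""
    z = F.normalize(features, dim=1)
    cosine = z @ prototypes.t()                                   # (N, K) cosine similarities
    target_sim = cosine.gather(1, labels.unsqueeze(1)).squeeze(1) # similarity to the true class
    return ((1.0 - target_sim) ** 2).mean()

def predict(features: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """Assign each sample to the class whose prototype it is most aligned with."""
    z = F.normalize(features, dim=1)
    return (z @ prototypes.t()).argmax(dim=1)
```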
Regression
- For regression, let the upper and lower bounds of the target value be $v_u$ and $v_l$. The authors assign one prototype to each bound, $\mathbf p_u$ and $\mathbf p_l$, and constrain them to point in opposite directions, i.e. $\cos\theta_{\mathbf p_u, \mathbf p_l} = -1$. The training loss is
$$\mathcal L_r = \frac{1}{N} \sum_{i=1}^{N} \left(\cos \theta_{\left(\mathbf z_i, \mathbf p_u\right)} - r_i\right)^2$$
where $r_i \in [-1, 1]$ is the regression target normalized with the bounds $v_l$ and $v_u$.
The cosine similarity between a sample's features and $\mathbf p_u$ is then the normalized prediction (see the sketch after this list).
- Our approach to regression differs from standard regression, which backpropagates losses on one-dimensional outputs. In the context of our work, this corresponds to an optimization on the line from $\mathbf p_u$ to $\mathbf p_l$. Our approach generalizes regression to higher-dimensional output spaces. While we still interpolate between two points, the ability to project to higher-dimensional outputs provides additional degrees of freedom that help the regression optimization. As we will show in the experiments, this generalization results in better and more robust performance than mean squared error.
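A sketch of the regression loss and prediction, under the assumption that targets are mapped linearly from $[v_l, v_u]$ to $[-1, 1]$; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def regression_loss(features: torch.Tensor, targets: torch.Tensor,
                    p_u: torch.Tensor, v_l: float, v_u: float) -> torch.Tensor:
    """Squared error between the cosine similarity to p_u and the normalized target."""
    z = F.normalize(features, dim=1)
    cosine = z @ F.normalize(p_u, dim=0)            # (N,) similarity to the upper-bound prototype
    r = 2.0 * (targets - v_l) / (v_u - v_l) - 1.0   # map targets from [v_l, v_u] to [-1, 1]
    return ((cosine - r) ** 2).mean()

def predict_value(features: torch.Tensor, p_u: torch.Tensor, v_l: float, v_u: float) -> torch.Tensor:
    """Map the cosine similarity back to the original target range."""
    cosine = F.normalize(features, dim=1) @ F.normalize(p_u, dim=0)
    return (cosine + 1.0) / 2.0 * (v_u - v_l) + v_l
```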
Joint regression and classification
- Hyperspherical prototype networks can perform classification and regression on the same hypersphere: the prototypes for the regression bounds are placed along one axis of the Euclidean output space, while the remaining axes are used for classification (a construction sketch follows below).
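One possible way to build such a joint output space, assuming the classification prototypes were pre-positioned in the subspace orthogonal to the regression axis (names are illustrative):

```python
import torch

def joint_prototypes(class_protos: torch.Tensor):
    """Embed (K, D-1) class prototypes into D dimensions; the last axis carries regression."""
    num_classes, sub_dims = class_protos.shape
    # Classification prototypes live in the subspace orthogonal to the regression axis.
    class_full = torch.cat([class_protos, torch.zeros(num_classes, 1)], dim=1)
    # The regression bounds get opposite prototypes along the remaining axis.
    p_u = torch.zeros(sub_dims + 1)
    p_u[-1] = 1.0
    p_l = -p_u
    return class_full, p_u, p_l
```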
Experiments
Classification
- Evaluating hyperspherical prototypes
- Prototypes with privileged information.
- Comparison to other prototype networks.
- Comparison to softmax cross-entropy. We conclude that we are comparable to softmax cross-entropy for sufficient examples and preferred when examples per class are unevenly distributed or scarce.
Regression
Joint regression and classification
- Rotated MNIST. We classify the digits and regress on their rotation. We employ $\mathbb S^2$ as output, where the classes are separated along the $(x, y)$-plane and the rotations are projected along the $z$-axis.
- Predicting creation year and art style.