[Paper Close Reading] Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification

Original paper: [2110.14012] Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification (arxiv.org)

The English here is typed entirely by hand! It is my own summarizing and paraphrasing of the original paper, so unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is written as reading notes, so take it with a grain of salt.

Table of Contents

1. TL;DR

1.1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.4. Uncertainty Quantification for Node Classification

2.4.1. Axioms

2.4.2. Graph Posterior Network

2.4.3. Uncertainty Estimation Guarantees

2.4.4. Limitations & Impact

2.5. Experiments

2.5.1. Set-up

2.5.2. Results

2.6. Conclusion

3. Supplementary Knowledge

3.1. Label Propagation (LP)

3.2. Pseudo-counts

3.3. Radial normalizing flows

4. Reference List


1. TL;DR

1.1. Takeaways

(1) On the statement in 2.2 that "the aggregation step commonly assumes network homophily": indeed, aggregating reinforcing neighbors and opposing neighbors in the same positive way is problematic, although mere correlation does not suffer from this issue. As noted there, "this breaks the i.i.d. assumption."

2. Section-by-Section Close Reading

2.1. Abstract

        ①GNNs and Label Propagation (LP) can both make node-level predictions

        ②Uncertainty quantification for non-independent node-level predictions is still lacking

 axiom  n. (math) axiom; postulate; principle

2.2. Introduction

        ①Uncertainty can be categorized into aleatoric and epistemic uncertainty: aleatoric uncertainty (AU) is the irreducible uncertainty inherent in the data, while epistemic uncertainty (EU) comes from a lack of (accurate) data and can in principle be reduced

        ②Uncertainty estimation is applied to out-of-distribution (OOD) or shift detection, active learning, continual learning, and reinforcement learning

        ③⭐"The aggregation step commonly assumes network homophily"

        ④They derive three axioms for node-level uncertainty, propose the Graph Posterior Network (GPN), and carry out a broad evaluation of uncertainty estimation

aleatoric  adj. random; accidental    epistemic  adj. of or relating to knowledge

account for: to make up (a proportion); to explain (a fact or situation); to be the cause of; to be responsible for (an action, policy, etc.); to budget (money) for

2.3. Related Work

(1)Uncertainty for i.i.d. inputs

(2)Uncertainty for graphs

2.4. Uncertainty Quantification for Node Classification

        ①They consider a graph G=(\mathbf{A},\mathbf{X}) with adjacency matrix \mathbf{A}\in\{0,1\}^{N\times N} and node attribute matrix \mathbf{X}\in\mathbb{R}^{N\times D}

        ②The node set is \mathcal{V}=\mathcal{T}\cup\mathcal{U}, where nodes u\in\mathcal{U} are labelled and nodes v\in\mathcal{T} are unlabelled

        ③Task: infer the node label y^{(v)}\in\{1,...,C\} of each unlabelled node v

2.4.1. Axioms

        ①Overview of the axioms (a figure in the paper illustrates the desired uncertainty estimates with and without network effects):

        ②This section is quite difficult and abstract, so I summarize it in simplified terms below

(1)Axiom 3.1. (network effects)

        ①A model based only on a node's own attributes (without network effects) should assign higher uncertainty to nodes whose features differ strongly from those of the training nodes

        ②If a node has anomalous features, it lies far from the i.i.d. training data, so without network effects its prediction should fall back on the prior (a convoluted statement; I am not sure there is a simpler way to put it)

(2)Axiom 3.2. (spreading to neighbors)

        ①Epistemic certainty about a node w/o network effects \Rightarrow epistemic certainty about its neighbors w/ network effects, all other conditions kept the same

        ②In a homophilic graph, high epistemic certainty in a node's prediction \Rightarrow it also passes its high confidence on to its neighbors. But if the node has anomalous features \Rightarrow its neighbors are affected too and become harder to be certain about

        ③For non-attributed/plain graphs (graphs which do not contain node attributes), they still expect confidence to propagate to neighbors

(3)Axiom 3.3. (neighborhood aggregation)

        ①High aleatoric uncertainty on a node's neighbors w/o network effects \Rightarrow high aleatoric uncertainty on that node w/ network effects, all other conditions kept the same

        ②If the classification of a node's neighbors is already ambiguous \Rightarrow one should not classify the node itself with confidence either

        ③If the neighbors' classifications conflict (my guess: they belong to different classes, so the node probably sits on a class boundary) \Rightarrow classifying the node is hard as well

        ④A node can end up with high aleatoric uncertainty when its neighbors simultaneously have ②low aleatoric uncertainty and ③different classes, which leaves the node stuck in the middle in a very awkward position

anomalous  adj. abnormal; irregular; inappropriate

2.4.2. Graph Posterior Network

        ①The Bayesian approach focuses on modeling how uncertain each prediction is

        ②Bayesian inference directly updates a single categorical distribution y\sim\mathrm{Cat}(\boldsymbol{p}). The natural choice of prior over \boldsymbol{p} is its conjugate prior, the Dirichlet distribution \mathbb{P}(\boldsymbol{p})=\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{prior}}) with \boldsymbol{\alpha}^{\mathrm{prior}}\in\mathbb{R}_{+}^{C}

        ③Bayesian update given observations y^{(1)},...,y^{(N)}:

\mathbb{P}\left(\boldsymbol{p}\mid\{y^{(j)}\}_{j=1}^N\right)\propto\mathbb{P}\left(\{y^{(j)}\}_{j=1}^N\mid\boldsymbol{p}\right)\times\mathbb{P}(\boldsymbol{p})

then the posterior distribution is \mathbb{P}(\boldsymbol{p}\mid\{y^{(j)}\}_{j=1}^{N})=\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{post}}) with posterior parameters \boldsymbol{\alpha}^{\mathrm{post}}=\boldsymbol{\alpha}^{\mathrm{prior}}+\boldsymbol{\beta} and class counts \beta_{c}=\sum_{j}\mathbb{I}_{y^{(j)}=c}

        ④AU and EU are then derived from the Dirichlet mean \bar{\boldsymbol{p}}=\frac{\boldsymbol{\alpha}}{\alpha_{0}} and the total evidence count \alpha_{0}=\sum_{c}\alpha_{c}

        ⑤"The aleatoric uncertainty is commonly measured by the entropy of the categorical distribution, such as u_{\mathrm{alea}}=\mathbb{H}\left[\mathrm{Cat}(\bar{\boldsymbol{p}})\right]"

        ⑥The epistemic uncertainty can be measured by the negative total evidence count u_{\mathrm{epist}}=-\alpha_{0} or by the Dirichlet differential entropy

        ⑦For classification, the concrete class label of node v in \hat{y}^{(v)}\sim\mathrm{Cat}(\boldsymbol{p}^{(v)}) also matters. One usually sets the Dirichlet prior to \alpha_{c}^{\mathrm{prior}}=1 and then predicts/updates \boldsymbol{\beta}^{(v)} to obtain the posterior \boldsymbol{p}^{(v)}\sim\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{post},(v)}) with posterior parameters \boldsymbol{\alpha}^{\mathrm{post},(v)}=\boldsymbol{\alpha}^{\mathrm{prior}}+\boldsymbol{\beta}^{(v)}, where \boldsymbol{\beta}^{(v)} can be regarded as class pseudo-counts (see the sketch below)
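
A minimal numpy sketch of the above update and uncertainty measures, using made-up pseudo-counts for a single node (my own toy values, not the authors' implementation):

```python
import numpy as np
from scipy.stats import dirichlet

# Hypothetical class pseudo-counts beta^(v) for one node with C = 3 classes
alpha_prior = np.ones(3)           # flat Dirichlet prior, alpha_c^prior = 1
beta = np.array([7.0, 2.0, 1.0])   # predicted evidence per class

alpha_post = alpha_prior + beta    # Bayesian update: alpha^post = alpha^prior + beta
alpha_0 = alpha_post.sum()         # total evidence count
p_bar = alpha_post / alpha_0       # Dirichlet mean

u_alea = -np.sum(p_bar * np.log(p_bar))         # AU: entropy of Cat(p_bar)
u_epist = -alpha_0                              # EU: negative total evidence
u_epist_diff = dirichlet(alpha_post).entropy()  # EU alternative: Dirichlet differential entropy

print(p_bar, u_alea, u_epist, u_epist_diff)
```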

(1)Bayesian Update for Interdependent Inputs

        ①Their improvement: they diffuse the \beta^{\mathrm{ft},(v)} predicted by an independent per-node model into \beta^{\mathrm{agg},(v)} based on the features of the neighbors (in effect: without network structure use \beta^{\mathrm{ft},(v)}, with network structure use \beta^{\mathrm{agg},(v)})

        ②Schematic of GPN:

left: the total feature evidence \alpha_{0}^{\mathrm{ft},(v)}=\sum_{c}\beta_{c}^{\mathrm{ft},(v)} and \bar{\boldsymbol{p}}^{\mathrm{ft},(v)}=\boldsymbol{\beta}^{\mathrm{ft},(v)}/\alpha_{0}^{\mathrm{ft},(v)} give EU and AU based on node features only;

middle: Personalized Page Rank (PPR) message passing \beta_{c}^{\mathrm{agg},(v)}=\sum_{u\in\mathcal{V}}\Pi_{v,u}^{ppr}\beta_{c}^{\mathrm{ft},(u)} yields aggregated class pseudo-counts, where \mathbb{P}\left ( v\mid u \right )=\Pi_{v,u}^{ppr} with \sum_{u}\Pi_{v,u}^{ppr}=1 are the dense PPR scores implicitly reflecting the importance of node u for v (the authors say they approximate this with power iteration instead; they do not say why, presumably because computing dense PPR is too expensive). Furthermore, PPR uses only the edge connections, while \mathbb{P}\left ( u\mid c \right )=\mathbb{P}(\boldsymbol{z}^{(u)}\mid{c};\boldsymbol{\phi}) uses only the node features. The authors then combine the two (see the diffusion sketch after this subsection):

\beta_{c}^{\mathrm{agg},(v)}\propto\bar{\mathbb{P}}(v\mid c)=\sum_{u\in\mathcal{V}}\mathbb{P}(v\mid u)\mathbb{P}(u\mid c)

right: Bayesian updating.

        ③The model is trained with the Bayesian loss (a sketch of its closed form also follows below):

\mathcal{L}^{(v)}=- \mathbb{E}_{p^{(v)}\sim\mathbb{Q}^{post,(v)}}\left[\log\mathbb{P}(y^{(v)}\mid\boldsymbol{p}^{(v)})\right]-\lambda \mathbb{H}\left[\mathbb{Q}^{post,(v)}\right]

with regularization factor \lambda; the first term is the expected cross-entropy under the posterior, and the entropy term regularizes towards smooth Dirichlet distributions
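
Below is a minimal sketch of the pseudo-count diffusion from the middle panel, approximating the dense PPR matrix with APPNP-style power iteration; the toy graph, function name, and hyper-parameters are my own assumptions, not the authors' code:

```python
import numpy as np

def ppr_diffuse_pseudocounts(A, beta_ft, teleport=0.1, n_iter=10):
    """Approximate beta_agg = Pi_ppr @ beta_ft by power iteration,
    in the style of APPNP's personalized-PageRank propagation."""
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    beta = beta_ft.copy()
    for _ in range(n_iter):
        # One power-iteration step: mix neighbor evidence with own feature evidence
        beta = (1 - teleport) * A_norm @ beta + teleport * beta_ft
    return beta  # aggregated class pseudo-counts beta^agg

# Toy graph with 3 nodes and per-node feature pseudo-counts beta^ft (C = 2)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
beta_ft = np.array([[5.0, 0.5],
                    [0.5, 5.0],
                    [1.0, 1.0]])
print(ppr_diffuse_pseudocounts(A, beta_ft))
```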
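And a sketch of the Bayesian loss above. Under \mathbb{Q}^{\mathrm{post},(v)}=\mathrm{Dir}(\boldsymbol{\alpha}), the expected log-likelihood has the closed form \mathbb{E}[\log p_{y}]=\psi(\alpha_{y})-\psi(\alpha_{0}) with digamma function \psi, so no sampling is needed (a PyTorch sketch with my own naming; the actual GPN code may differ):

```python
import torch
from torch.distributions import Dirichlet

def bayesian_loss(alpha_post, y, lam=1e-3):
    """Expected cross-entropy under Dir(alpha_post), minus lam * Dirichlet entropy."""
    alpha_0 = alpha_post.sum(dim=-1)
    alpha_y = alpha_post.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    # E_{p ~ Dir(alpha)}[log p_y] = digamma(alpha_y) - digamma(alpha_0)
    expected_log_lik = torch.digamma(alpha_y) - torch.digamma(alpha_0)
    entropy = Dirichlet(alpha_post).entropy()
    return (-expected_log_lik - lam * entropy).mean()

# Toy posterior parameters for two nodes (C = 3) and their labels
alpha_post = torch.tensor([[8.0, 3.0, 2.0], [1.2, 1.1, 1.0]])
y = torch.tensor([0, 2])
print(bayesian_loss(alpha_post, y))
```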

2.4.3. Uncertainty Estimation Guarantees

        ①Their guarantees hold for a parameterized GPN model with "a (feature) encoder f_{\phi} with piecewise ReLU activations, a PPR diffusion, and a density estimator \mathbb{P}(\boldsymbol{z}^{\mathrm{ft},(v)}\mid\omega)"

2.4.4. Limitations & Impact

(1)OOD data close to ID data

        ①GPN comes with guaranteed uncertainty estimates for extreme OOD inputs, but there is no guarantee for OOD data close to the ID data

(2)Non-homophilic uncertainty

        ①They did not consider heterophilic graphs

(3)Task-specific OOD

        ①In some feature spaces, density estimation cannot detect task-specific OOD data

tabular  adj. flat; arranged in tables

(4)Broader Impact

        ①......data breach......privacy......

2.5. Experiments

2.5.1. Set-up

(1)Ablation

        ①Ablation study on modules:

        ②Ablation study on misclassification:

(2)Baselines

        They compare against a large number of baselines, but the full list is in the appendix and too long to reproduce here

(3)Datasets

        ①CoraML, CiteSeer, PubMed, CoauthorPhysics, CoauthorCS, AmazonPhotos, AmazonComputers, OGBN Arxiv

2.5.2. Results

(1)OOD Detection

(2)Attributed Graph Shifts

(3)Qualitative Evaluation

(4)Inference & training time

2.6. Conclusion

3. Supplementary Knowledge

3.1. Label Propagation (LP)

Further reading: 半监督学习之labelPropagation原理与实现 - 知乎 (zhihu.com). In short, LP iteratively spreads the known labels along the graph's edges until the soft label assignments converge.
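
A minimal sketch of label propagation with clamping on a toy path graph (my own example, assuming the common F \leftarrow \alpha S F + (1-\alpha)Y iteration; not taken from the referenced article):

```python
import numpy as np

def label_propagation(A, Y, labeled_mask, alpha=0.9, n_iter=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y, clamping labeled nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1).clip(min=1))
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # normalized similarity
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
        F[labeled_mask] = Y[labeled_mask]  # clamp the known labels
    return F.argmax(axis=1)

# 4 nodes on a path; nodes 0 and 3 are labeled with classes 0 and 1
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(label_propagation(A, Y, labeled_mask=np.array([True, False, False, True])))
```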

3.2. Pseudo-counts

(1) Definition:

In machine learning and statistical modeling, pseudo-counts are a technique for handling data sparsity and for smoothing probability distributions. When working with discrete data (such as text or categorical data), some events or categories may occur rarely or never in the training set, which can lead to unreasonable probabilities or estimates at prediction time.

To address this, pseudo-counts can be used to "smooth" these probabilities. Concretely, a pseudo-count is a small, fixed value added to the observed counts, which raises the probability of events or categories that occur rarely or never. This keeps the model from making overly extreme predictions about events or categories unseen during training.

For example, in a naive Bayes classifier one may use Laplace smoothing (also called add-one smoothing), where the pseudo-count is set to 1. If a category never appears in the training set, Laplace smoothing still guarantees it a non-zero probability at prediction time, as in the sketch after this subsection.

In summary, pseudo-counts handle data sparsity and smooth probability distributions by adding a small value to the observed counts, increasing the probability of rare or unseen events or categories.

(2) My impression: it is a bit like injecting a simulated noise term
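
A tiny sketch of add-one (Laplace) smoothing with made-up counts:

```python
import numpy as np

def laplace_smoothed_probs(counts, pseudo_count=1.0):
    """Add-one (Laplace) smoothing: unseen categories keep a non-zero probability."""
    smoothed = counts + pseudo_count
    return smoothed / smoothed.sum()

counts = np.array([3, 0, 7])           # category 1 was never observed
print(counts / counts.sum())           # raw MLE: [0.3, 0.0, 0.7]
print(laplace_smoothed_probs(counts))  # smoothed: [4/13, 1/13, 8/13]
```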

3.3. Radial normalizing flows
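
For context: GPN estimates the per-class feature densities \mathbb{P}(\boldsymbol{z}\mid c;\boldsymbol{\phi}) with normalizing flows, and a radial flow (Rezende & Mohamed, 2015) applies the invertible map f(\boldsymbol{z})=\boldsymbol{z}+\beta h(\alpha,r)(\boldsymbol{z}-\boldsymbol{z}_{0}) with r=\lVert\boldsymbol{z}-\boldsymbol{z}_{0}\rVert and h(\alpha,r)=1/(\alpha+r). A minimal numpy sketch of one such transform and its Jacobian log-determinant (toy parameters of my own; note \alpha,\beta here are flow parameters, unrelated to the Dirichlet symbols above):

```python
import numpy as np

def radial_flow(z, z0, alpha, beta):
    """One radial-flow transform f(z) = z + beta * h(alpha, r) * (z - z0)
    and the log-determinant of its Jacobian (Rezende & Mohamed, 2015)."""
    d = z.shape[-1]
    diff = z - z0
    r = np.linalg.norm(diff)
    h = 1.0 / (alpha + r)
    f_z = z + beta * h * diff
    h_prime = -1.0 / (alpha + r) ** 2
    # log|det J| = (d-1)*log(1 + beta*h) + log(1 + beta*h + beta*h'*r)
    log_det = (d - 1) * np.log(1 + beta * h) + np.log(1 + beta * h + beta * h_prime * r)
    return f_z, log_det

z = np.array([0.5, -1.0])
z0 = np.zeros(2)
print(radial_flow(z, z0, alpha=1.0, beta=0.5))
```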

4. Reference List

Stadler, M. et al. (2021) 'Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification', Neural Information Processing Systems. doi: https://doi.org/10.48550/arXiv.2110.14012
