[Paper Close Reading] Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification

Original paper: [2110.14012] Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification (arxiv.org)

The English here is typed entirely by hand! It is my own summarizing and paraphrasing of the original paper, so unavoidable spelling and grammar mistakes may appear; if you spot any, feel free to point them out in the comments. This post is written as reading notes, so take it with a grain of salt.

Table of Contents

1. TL;DR

1.1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.4. Uncertainty Quantification for Node Classification

2.4.1. Axioms

2.4.2. Graph Posterior Network

2.4.3. Uncertainty Estimation Guarantees

2.4.4. Limitations & Impact

2.5. Experiments

2.5.1. Set-up

2.5.2. Results

2.6. Conclusion

3. Supplementary Knowledge

3.1. Label Propagation (LP)

3.2. Pseudo-counts

3.3. Radial normalizing flows

4. Reference List


1. TL;DR

1.1. Takeaways

(1) On the statement in 2.2 that "the aggregation step commonly assumes network homophily": indeed, aggregating reinforcing neighbors and opposing neighbors in the same positive way is problematic, although mere correlation does not suffer from this issue. As noted there, "this breaks the i.i.d. assumption."

2. Section-by-Section Close Reading

2.1. Abstract

        ①GNNs and Label Propagation (LP) can both make node-level predictions

        ②Uncertainty quantification for non-independent node-level predictions is still lacking

 axiom  n. (math) axiom; postulate; principle

2.2. Introduction

        ①Uncertainty can be categorized into aleatoric and epistemic uncertainty: aleatoric uncertainty (AU) is the irreducible uncertainty inherent in the data, while epistemic uncertainty (EU) comes from a lack of (accurate) data and can in principle be reduced

        ②Uncertainty estimation is applied to out-of-distribution (OOD) or shift detection, active learning, continual learning, and reinforcement learning

        ③⭐"The aggregation step commonly assumes network homophily"

        ④They derive three axioms for node-level uncertainty, propose the Graph Posterior Network (GPN), and carry out a broad evaluation of uncertainty estimation

aleatoric  adj. random; accidental    epistemic  adj. of or relating to knowledge

account for: to make up (a proportion); to explain (a fact or situation); to be the cause of; to be responsible for (an action, policy, etc.); to budget (money) for

2.3. Related Work

(1)Uncertainty for i.i.d. inputs

(2)Uncertainty for graphs

2.4. Uncertainty Quantification for Node Classification

        ①They consider a graph G=(\mathbf{A},\mathbf{X}) with adjacency matrix \mathbf{A}\in\{0,1\}^{N\times N} and node attribute matrix \mathbf{X}\in\mathbb{R}^{N\times D}

        ②The node set is \mathcal{V}=\mathcal{T}\cup\mathcal{U}, where nodes u\in\mathcal{U} are labelled and nodes v\in\mathcal{T} are unlabelled

        ③Task: infer the node label y^{(v)}\in\{1,...,C\} of each unlabelled node v

2.4.1. Axioms

        ①Overview of the axioms (a figure in the paper illustrates the desired uncertainty estimates with and without network effects):

        ②This section is quite difficult and abstract, so I summarize it in simplified terms below

(1)Axiom 3.1. (network effects)

        ①A model based only on a node's own attributes (without network effects) should assign higher uncertainty to nodes whose features differ strongly from those of the training nodes

        ②If a node has anomalous features, it lies far from the i.i.d. training data, so without network effects its prediction should fall back on the prior (a convoluted statement; I am not sure there is a simpler way to put it)

(2)Axiom 3.2. (spreading to neighbors)

        ①Epistemic certainty about a node w/o network effects \Rightarrow epistemic certainty about its neighbors w/ network effects, all other conditions kept the same

        ②In a homophilic graph, high epistemic certainty in a node's prediction \Rightarrow it also passes its high confidence on to its neighbors. But if the node has anomalous features \Rightarrow its neighbors are affected too and become harder to be certain about

        ③For non-attributed/plain graphs (graphs which do not contain node attributes), they still expect confidence to propagate to neighbors

(3)Axiom 3.3. (neighborhood aggregation)

        ①High aleatoric uncertainty on a node's neighbors w/o network effects \Rightarrow high aleatoric uncertainty on that node w/ network effects, all other conditions kept the same

        ②If the classification of a node's neighbors is already ambiguous \Rightarrow one should not classify the node itself with confidence either

        ③If the neighbors' classifications conflict (my guess: they belong to different classes, so the node probably sits on a class boundary) \Rightarrow classifying the node is hard as well

        ④A node can end up with high aleatoric uncertainty when its neighbors simultaneously have ②low aleatoric uncertainty and ③different classes, which leaves the node stuck in the middle in a very awkward position

anomalous  adj. abnormal; irregular; inappropriate

2.4.2. Graph Posterior Network

        ①The Bayesian approach focuses on modeling how uncertain each prediction is

        ②Bayesian inference directly updates a single categorical distribution y\sim\mathrm{Cat}(\boldsymbol{p}). The natural choice of prior over \boldsymbol{p} is its conjugate prior, the Dirichlet distribution \mathbb{P}(\boldsymbol{p})=\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{prior}}) with \boldsymbol{\alpha}^{\mathrm{prior}}\in\mathbb{R}_{+}^{C}

        ③Bayesian update given observations y^{(1)},...,y^{(N)}:

\mathbb{P}\left(\boldsymbol{p}\mid\{y^{(j)}\}_{j=1}^N\right)\propto\mathbb{P}\left(\{y^{(j)}\}_{j=1}^N\mid\boldsymbol{p}\right)\times\mathbb{P}(\boldsymbol{p})

then the posterior distribution is \mathbb{P}(\boldsymbol{p}\mid\{y^{(j)}\}_{j=1}^{N})=\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{post}}) with posterior parameters \boldsymbol{\alpha}^{\mathrm{post}}=\boldsymbol{\alpha}^{\mathrm{prior}}+\boldsymbol{\beta} and class counts \beta_{c}=\sum_{j}\mathbb{I}_{y^{(j)}=c}

        ④AU and EU are then derived from the Dirichlet mean \bar{\boldsymbol{p}}=\frac{\boldsymbol{\alpha}}{\alpha_{0}} and the total evidence count \alpha_{0}=\sum_{c}\alpha_{c}

        ⑤"The aleatoric uncertainty is commonly measured by the entropy of the categorical distribution, such as u_{\mathrm{alea}}=\mathbb{H}\left[\mathrm{Cat}(\bar{\boldsymbol{p}})\right]"

        ⑥The epistemic uncertainty can be measured by the negative total evidence count u_{\mathrm{epist}}=-\alpha_{0} or by the Dirichlet differential entropy

        ⑦For classification, the concrete class label of node v in \hat{y}^{(v)}\sim\mathrm{Cat}(\boldsymbol{p}^{(v)}) also matters. One usually sets the Dirichlet prior to \alpha_{c}^{\mathrm{prior}}=1 and then predicts/updates \boldsymbol{\beta}^{(v)} to obtain the posterior \boldsymbol{p}^{(v)}\sim\mathrm{Dir}(\boldsymbol{\alpha}^{\mathrm{post},(v)}) with posterior parameters \boldsymbol{\alpha}^{\mathrm{post},(v)}=\boldsymbol{\alpha}^{\mathrm{prior}}+\boldsymbol{\beta}^{(v)}, where \boldsymbol{\beta}^{(v)} can be regarded as class pseudo-counts (see the sketch below)
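
A minimal numpy sketch of the above update and uncertainty measures, using made-up pseudo-counts for a single node (my own toy values, not the authors' implementation):

```python
import numpy as np
from scipy.stats import dirichlet

# Hypothetical class pseudo-counts beta^(v) for one node with C = 3 classes
alpha_prior = np.ones(3)           # flat Dirichlet prior, alpha_c^prior = 1
beta = np.array([7.0, 2.0, 1.0])   # predicted evidence per class

alpha_post = alpha_prior + beta    # Bayesian update: alpha^post = alpha^prior + beta
alpha_0 = alpha_post.sum()         # total evidence count
p_bar = alpha_post / alpha_0       # Dirichlet mean

u_alea = -np.sum(p_bar * np.log(p_bar))         # AU: entropy of Cat(p_bar)
u_epist = -alpha_0                              # EU: negative total evidence
u_epist_diff = dirichlet(alpha_post).entropy()  # EU alternative: Dirichlet differential entropy

print(p_bar, u_alea, u_epist, u_epist_diff)
```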

(1)Bayesian Update for Interdependent Inputs

        ①Their improvement: they diffuse the \beta^{\mathrm{ft},(v)} predicted by an independent per-node model into \beta^{\mathrm{agg},(v)} based on the features of the neighbors (in effect: without network structure use \beta^{\mathrm{ft},(v)}, with network structure use \beta^{\mathrm{agg},(v)})

        ②Schematic of GPN:

left: the total feature evidence \alpha_{0}^{\mathrm{ft},(v)}=\sum_{c}\beta_{c}^{\mathrm{ft},(v)} and \bar{\boldsymbol{p}}^{\mathrm{ft},(v)}=\boldsymbol{\beta}^{\mathrm{ft},(v)}/\alpha_{0}^{\mathrm{ft},(v)} give EU and AU based on node features only;

middle: Personalized Page Rank (PPR) message passing \beta_{c}^{\mathrm{agg},(v)}=\sum_{u\in\mathcal{V}}\Pi_{v,u}^{ppr}\beta_{c}^{\mathrm{ft},(u)} yields aggregated class pseudo-counts, where \mathbb{P}\left ( v\mid u \right )=\Pi_{v,u}^{ppr} with \sum_{u}\Pi_{v,u}^{ppr}=1 are the dense PPR scores implicitly reflecting the importance of node u for v (the authors say they approximate this with power iteration instead; they do not say why, presumably because computing dense PPR is too expensive). Furthermore, PPR uses only the edge connections, while \mathbb{P}\left ( u\mid c \right )=\mathbb{P}(\boldsymbol{z}^{(u)}\mid{c};\boldsymbol{\phi}) uses only the node features. The authors then combine the two (see the diffusion sketch after this subsection):

\beta_{c}^{\mathrm{agg},(v)}\propto\bar{\mathbb{P}}(v\mid c)=\sum_{u\in\mathcal{V}}\mathbb{P}(v\mid u)\mathbb{P}(u\mid c)

right: Bayesian updating.

        ③The model is trained with the Bayesian loss (a sketch of its closed form also follows below):

\mathcal{L}^{(v)}=- \mathbb{E}_{p^{(v)}\sim\mathbb{Q}^{post,(v)}}\left[\log\mathbb{P}(y^{(v)}\mid\boldsymbol{p}^{(v)})\right]-\lambda \mathbb{H}\left[\mathbb{Q}^{post,(v)}\right]

with regularization factor \lambda; the first term is the expected cross-entropy under the posterior, and the entropy term regularizes towards smooth Dirichlet distributions
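
Below is a minimal sketch of the pseudo-count diffusion from the middle panel, approximating the dense PPR matrix with APPNP-style power iteration; the toy graph, function name, and hyper-parameters are my own assumptions, not the authors' code:

```python
import numpy as np

def ppr_diffuse_pseudocounts(A, beta_ft, teleport=0.1, n_iter=10):
    """Approximate beta_agg = Pi_ppr @ beta_ft by power iteration,
    in the style of APPNP's personalized-PageRank propagation."""
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    beta = beta_ft.copy()
    for _ in range(n_iter):
        # One power-iteration step: mix neighbor evidence with own feature evidence
        beta = (1 - teleport) * A_norm @ beta + teleport * beta_ft
    return beta  # aggregated class pseudo-counts beta^agg

# Toy graph with 3 nodes and per-node feature pseudo-counts beta^ft (C = 2)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
beta_ft = np.array([[5.0, 0.5],
                    [0.5, 5.0],
                    [1.0, 1.0]])
print(ppr_diffuse_pseudocounts(A, beta_ft))
```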
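And a sketch of the Bayesian loss above. Under \mathbb{Q}^{\mathrm{post},(v)}=\mathrm{Dir}(\boldsymbol{\alpha}), the expected log-likelihood has the closed form \mathbb{E}[\log p_{y}]=\psi(\alpha_{y})-\psi(\alpha_{0}) with digamma function \psi, so no sampling is needed (a PyTorch sketch with my own naming; the actual GPN code may differ):

```python
import torch
from torch.distributions import Dirichlet

def bayesian_loss(alpha_post, y, lam=1e-3):
    """Expected cross-entropy under Dir(alpha_post), minus lam * Dirichlet entropy."""
    alpha_0 = alpha_post.sum(dim=-1)
    alpha_y = alpha_post.gather(-1, y.unsqueeze(-1)).squeeze(-1)
    # E_{p ~ Dir(alpha)}[log p_y] = digamma(alpha_y) - digamma(alpha_0)
    expected_log_lik = torch.digamma(alpha_y) - torch.digamma(alpha_0)
    entropy = Dirichlet(alpha_post).entropy()
    return (-expected_log_lik - lam * entropy).mean()

# Toy posterior parameters for two nodes (C = 3) and their labels
alpha_post = torch.tensor([[8.0, 3.0, 2.0], [1.2, 1.1, 1.0]])
y = torch.tensor([0, 2])
print(bayesian_loss(alpha_post, y))
```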

2.4.3. Uncertainty Estimation Guarantees

        ①Their guarantees hold for a parameterized GPN model with "a (feature) encoder f_{\phi} with piecewise ReLU activations, a PPR diffusion, and a density estimator \mathbb{P}(\boldsymbol{z}^{\mathrm{ft},(v)}\mid\omega)"

2.4.4. Limitations & Impact

(1)OOD data close to ID data

        ①GPN comes with guaranteed uncertainty estimates for extreme OOD inputs, but there is no guarantee for OOD data close to the ID data

(2)Non-homophilic uncertainty

        ①They did not consider heterophilic graphs

(3)Task-specific OOD

        ①In some feature spaces, density estimation cannot detect task-specific OOD data

tabular  adj. flat; arranged in tables

(4)Broader Impact

        ①......data breach......privacy......

2.5. Experiments

2.5.1. Set-up

(1)Ablation

        ①Ablation study on modules:

        ②Ablation study on misclassification:

(2)Baselines

        They compare against a large number of baselines, but the full list is in the appendix and too long to reproduce here

(3)Datasets

        ①CoraML, CiteSeer, PubMed, CoauthorPhysics, CoauthorCS, AmazonPhotos, AmazonComputers, OGBN Arxiv

2.5.2. Results

(1)OOD Detection

(2)Attributed Graph Shifts

(3)Qualitative Evaluation

(4)Inference & training time

2.6. Conclusion

3. Supplementary Knowledge

3.1. Label Propagation (LP)

Further reading: 半监督学习之labelPropagation原理与实现 - 知乎 (zhihu.com). In short, LP iteratively spreads the known labels along the graph's edges until the soft label assignments converge.
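
A minimal sketch of label propagation with clamping on a toy path graph (my own example, assuming the common F \leftarrow \alpha S F + (1-\alpha)Y iteration; not taken from the referenced article):

```python
import numpy as np

def label_propagation(A, Y, labeled_mask, alpha=0.9, n_iter=50):
    """Iterate F <- alpha * S @ F + (1 - alpha) * Y, clamping labeled nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1).clip(min=1))
    S = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # normalized similarity
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y
        F[labeled_mask] = Y[labeled_mask]  # clamp the known labels
    return F.argmax(axis=1)

# 4 nodes on a path; nodes 0 and 3 are labeled with classes 0 and 1
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
print(label_propagation(A, Y, labeled_mask=np.array([True, False, False, True])))
```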

3.2. Pseudo-counts

(1) Definition:

In machine learning and statistical modeling, pseudo-counts are a technique for handling data sparsity and for smoothing probability distributions. When working with discrete data (such as text or categorical data), some events or categories may occur rarely or never in the training set, which can lead to unreasonable probabilities or estimates at prediction time.

To address this, pseudo-counts can be used to "smooth" these probabilities. Concretely, a pseudo-count is a small, fixed value added to the observed counts, which raises the probability of events or categories that occur rarely or never. This keeps the model from making overly extreme predictions about events or categories unseen during training.

For example, in a naive Bayes classifier one may use Laplace smoothing (also called add-one smoothing), where the pseudo-count is set to 1. If a category never appears in the training set, Laplace smoothing still guarantees it a non-zero probability at prediction time, as in the sketch after this subsection.

In summary, pseudo-counts handle data sparsity and smooth probability distributions by adding a small value to the observed counts, increasing the probability of rare or unseen events or categories.

(2) My impression: it is a bit like injecting a simulated noise term
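
A tiny sketch of add-one (Laplace) smoothing with made-up counts:

```python
import numpy as np

def laplace_smoothed_probs(counts, pseudo_count=1.0):
    """Add-one (Laplace) smoothing: unseen categories keep a non-zero probability."""
    smoothed = counts + pseudo_count
    return smoothed / smoothed.sum()

counts = np.array([3, 0, 7])           # category 1 was never observed
print(counts / counts.sum())           # raw MLE: [0.3, 0.0, 0.7]
print(laplace_smoothed_probs(counts))  # smoothed: [4/13, 1/13, 8/13]
```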

3.3. Radial normalizing flows
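
For context: GPN estimates the per-class feature densities \mathbb{P}(\boldsymbol{z}\mid c;\boldsymbol{\phi}) with normalizing flows, and a radial flow (Rezende & Mohamed, 2015) applies the invertible map f(\boldsymbol{z})=\boldsymbol{z}+\beta h(\alpha,r)(\boldsymbol{z}-\boldsymbol{z}_{0}) with r=\lVert\boldsymbol{z}-\boldsymbol{z}_{0}\rVert and h(\alpha,r)=1/(\alpha+r). A minimal numpy sketch of one such transform and its Jacobian log-determinant (toy parameters of my own; note \alpha,\beta here are flow parameters, unrelated to the Dirichlet symbols above):

```python
import numpy as np

def radial_flow(z, z0, alpha, beta):
    """One radial-flow transform f(z) = z + beta * h(alpha, r) * (z - z0)
    and the log-determinant of its Jacobian (Rezende & Mohamed, 2015)."""
    d = z.shape[-1]
    diff = z - z0
    r = np.linalg.norm(diff)
    h = 1.0 / (alpha + r)
    f_z = z + beta * h * diff
    h_prime = -1.0 / (alpha + r) ** 2
    # log|det J| = (d-1)*log(1 + beta*h) + log(1 + beta*h + beta*h'*r)
    log_det = (d - 1) * np.log(1 + beta * h) + np.log(1 + beta * h + beta * h_prime * r)
    return f_z, log_det

z = np.array([0.5, -1.0])
z0 = np.zeros(2)
print(radial_flow(z, z0, alpha=1.0, beta=0.5))
```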

4. Reference List

Stadler, M. et al. (2021) 'Graph Posterior Network: Bayesian Predictive Uncertainty for Node Classification', Neural Information Processing Systems. doi: https://doi.org/10.48550/arXiv.2110.14012
