Paper Notes: Graph Attention Networks

Graph Attention Networks
  • LINK: https://arxiv.org/abs/1710.10903

  • CLASSIFICATION: SPATIAL-BASED GCN

  • YEAR: 2017 (arXiv v1 submitted 30 Oct 2017; last revised 4 Feb 2018, v3)

  • FROM: ICLR 2018

  • WHAT PROBLEM TO SOLVE: To overcome the limitations of spectral GCN approaches: the learned filters depend on the graph structure (the Laplacian eigenbasis), so a trained model cannot be directly applied to a graph with a different structure, and all neighbors of a node are weighted in a fixed, structure-determined way rather than by a learned importance.

  • SOLUTION: By stacking layers in which nodes are able to attend over their neighborhoods’ features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront.

  • CORE POINT:

    • Several interesting properties of the attention architecture

      1. The operation is efficient, since it is parallelizable across node-neighbor pairs.
      2. It can be applied to graph nodes having different degrees by specifying arbitrary weights to the neighbors.
      3. The model is directly applicable to inductive learning problems, including tasks where the model has to generalize to completely unseen graphs.
    • Graph Attentional layer

      1. A shared linear transformation, parametrized by a weight matrix $W \in \mathbb{R}^{F' \times F}$, is applied to every node. We then perform self-attention on the nodes: a shared attentional mechanism $a: \mathbb{R}^{F'} \times \mathbb{R}^{F'} \rightarrow \mathbb{R}$ computes attention coefficients that indicate the importance of node $j$'s features to node $i$. The attention mechanism $a$ is a single-layer feedforward neural network.

         $$e_{ij} = a(W\vec{h}_i, W\vec{h}_j)$$
      2. We inject the graph structure into the mechanism by performing masked attention: we only compute $e_{ij}$ for nodes $j \in \mathcal{N}_i$, where $\mathcal{N}_i$ is some neighborhood of node $i$ in the graph. In all our experiments, these will be exactly the first-order neighbors of $i$ (including $i$). To make coefficients easily comparable across different nodes, we normalize them across all choices of $j$ using the softmax function:

         $$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
      3. The attention mechanism is parametrized by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$ and applies the LeakyReLU nonlinearity (with negative input slope $\alpha = 0.2$), so the coefficients fully expand to:

         $$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T}\left[W\vec{h}_i \,\Vert\, W\vec{h}_j\right]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^{\,T}\left[W\vec{h}_i \,\Vert\, W\vec{h}_k\right]\right)\right)}$$

         where $\Vert$ denotes concatenation.
      4. The normalized coefficients are then used to compute a linear combination of the transformed neighbor features, $\vec{h}'_i = \sigma\big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W \vec{h}_j\big)$. To stabilize the learning process of self-attention, we have found extending our mechanism to employ multi-head attention to be beneficial: $K$ independent attention heads execute this transformation, and their output features are concatenated:

         $$\vec{h}'_i = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha^k_{ij} W^k \vec{h}_j\Big)$$
      5. Specifically, if we perform multi-head attention on the final (prediction) layer of the network, concatenation is no longer sensible. Instead, we employ averaging, and delay applying the final nonlinearity (usually a softmax or logistic sigmoid for classification problems) until then:

         $$\vec{h}'_i = \sigma\Big(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha^k_{ij} W^k \vec{h}_j\Big)$$
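
      Putting steps 1-5 together, here is a minimal PyTorch sketch of a single graph attentional layer. The class name, the dense N×N adjacency mask, and the per-head parameter layout are my own illustrative choices, not the authors' reference implementation (which uses sparse operations):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionLayer(nn.Module):
    """One multi-head graph attentional layer (dense-adjacency sketch)."""

    def __init__(self, in_features, out_features, num_heads=8, concat=True):
        super().__init__()
        self.out_features = out_features
        self.concat = concat
        # Step 1: shared linear transformation W (one copy per head).
        self.W = nn.Parameter(torch.empty(num_heads, in_features, out_features))
        # Step 3: attention weight vector a of size 2F' (one copy per head).
        self.a = nn.Parameter(torch.empty(num_heads, 2 * out_features))
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, F) node features; adj: (N, N) adjacency with self-loops (N_i includes i).
        N = h.size(0)
        Wh = torch.einsum('nf,hfo->hno', h, self.W)           # (heads, N, F')
        # Split a into its "source" and "target" halves so that
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) without materializing all concatenations.
        a_src = self.a[:, :self.out_features]                 # (heads, F')
        a_dst = self.a[:, self.out_features:]                 # (heads, F')
        e = (Wh * a_src.unsqueeze(1)).sum(-1).unsqueeze(-1) \
            + (Wh * a_dst.unsqueeze(1)).sum(-1).unsqueeze(1)  # (heads, N, N)
        e = F.leaky_relu(e, negative_slope=0.2)
        # Step 2: masked attention: keep only first-order neighbors, then softmax over j.
        e = e.masked_fill(adj.unsqueeze(0) == 0, float('-inf'))
        alpha = torch.softmax(e, dim=-1)                      # attention coefficients α_ij
        out = torch.matmul(alpha, Wh)                         # Σ_j α_ij W h_j, (heads, N, F')
        if self.concat:
            # Step 4: concatenate the K heads (ELU is elementwise, so applying it after
            # concatenation equals applying it per head first).
            return F.elu(out.permute(1, 0, 2).reshape(N, -1))
        # Step 5: average the heads; the final softmax/sigmoid is applied outside the layer.
        return out.mean(dim=0)
```

      Stacking two such layers, the second with num_heads=1 and concat=False followed by a softmax, gives the transductive architecture described in the experimental setup below.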

    • Comparisons to related work

      • Computationally, it is highly efficient: the operation of the self-attentional layer can be parallelized across all edges, and the computation of output features can be parallelized across all nodes.
      • As opposed to GCNs, our model allows for (implicitly) assigning different importances to nodes of a same neighborhood, enabling a leap in model capacity. Furthermore, analyzing the learned attentional weights may lead to benefits in interpretability.
      • The attention mechanism is applied in a shared manner to all edges in the graph, and therefore it does not depend on upfront access to the global graph structure or (features of) all of its nodes (a limitation of many prior techniques).
        • The graph is not required to be undirected.
        • It makes our technique directly applicable to inductive learning.
      • GraphSAGE samples a fixed-size neighborhood of each node, in order to keep its computational footprint consistent; this does not allow it access to the entirety of the neighborhood while performing inference.
      • In comparison to previously considered MoNet instances, our model uses node features for similarity computations, rather than the node's structural properties (which would assume knowing the graph structure upfront).
    • Datasets

      (Table: summary statistics of the datasets used: the citation networks Cora, Citeseer and Pubmed for the transductive tasks, and the protein-protein interaction (PPI) dataset for the inductive task.)

    • Experimental setup

      • Transductive learning
        • Two-layer GAT model.
        • The first layer consists of K = 8 attention heads computing F′ = 8 features each (for a total of 64 features), followed by an exponential linear unit (ELU) nonlinearity.
        • The second layer is used for classification: a single attention head that computes C features (where C is the number of classes), followed by a softmax activation.
        • L2 regularization with λ = 0.0005.
        • Dropout with p = 0.6 is applied to both layers’ inputs, as well as to the normalized attention coefficients (critically, this means that at each training iteration, each node is exposed to a stochastically sampled neighborhood).
      • Inductive learning
        • Three-layer GAT model.
        • Both of the first two layers consist of K = 4 attention heads computing F′ = 256 features each (for a total of 1024 features), followed by an ELU nonlinearity.
        • The final layer is used for (multi-label) classification: K = 6 attention heads computing 121 features each, that are averaged and followed by a logistic sigmoid activation.
        • No need to apply L2 regularization or dropout.
        • Employed skip connections across the intermediate attentional layer.

      Both models are initialized using Glorot initialization and trained to minimize cross-entropy on the training nodes using the Adam SGD optimizer with an initial learning rate of 0.01 for Pubmed, and 0.005 for all other datasets. In both cases we use an early stopping strategy on both the cross-entropy loss and accuracy (transductive) or micro-F1 (inductive) score on the validation nodes, with a patience of 100 epochs.
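
      As a rough illustration, the transductive hyperparameters above could be wired together as follows, reusing the hypothetical GraphAttentionLayer sketch from the layer section. The training loop, variable names, and the use of Adam's weight_decay to approximate the L2 penalty are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransductiveGAT(nn.Module):
    """Two-layer GAT for Cora-style node classification (sketch)."""

    def __init__(self, in_features, num_classes, hidden=8, heads=8, dropout=0.6):
        super().__init__()
        self.dropout = dropout
        # Layer 1: K = 8 heads, F' = 8 features each (64 after concatenation), ELU inside the layer.
        self.gat1 = GraphAttentionLayer(in_features, hidden, num_heads=heads, concat=True)
        # Layer 2: a single head computing C class scores, averaged; softmax is applied by the loss.
        self.gat2 = GraphAttentionLayer(hidden * heads, num_classes, num_heads=1, concat=False)

    def forward(self, x, adj):
        # Dropout p = 0.6 on both layers' inputs (the paper also drops the normalized
        # attention coefficients, which the simplified layer sketch above omits).
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gat1(x, adj)
        x = F.dropout(x, self.dropout, training=self.training)
        return self.gat2(x, adj)                      # class logits


# Adam with lr = 0.005 (0.01 for Pubmed); weight_decay stands in for the L2 penalty λ = 0.0005.
model = TransductiveGAT(in_features=1433, num_classes=7)   # Cora-sized numbers, illustrative only
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)


def train_step(features, adj, labels, train_mask):
    """One optimization step on the training nodes (early stopping on validation not shown)."""
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(features, adj)[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()
```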

    • Results

      (Results: tables of node-classification accuracy on the transductive datasets and micro-averaged F1 on the inductive dataset, compared against baseline methods, plus a visualization of the learned feature representations.)

    • Interesting research direction

      Taking advantage of the attention mechanism to perform a thorough analysis on the model interpretability.

  • EXISTING PROBLEMS: Overcoming the practical problems described in subsection 2.2 of the paper (the sparse framework used only supports sparse matrix multiplication for rank-2 tensors, which limits the batching capabilities of the current implementation) to be able to handle larger batch sizes.

  • IMPROVEMENT IDEAS:

    1. Sampling neighbor nodes based on the sorted attention weights might work better.
    2. Extending the method to perform graph classification instead of node classification would also be relevant from the application perspective.
    3. Extending the model to incorporate edge features (possibly indicating relationships among nodes) would allow us to tackle a larger variety of problems (a rough sketch follows below).
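
    For idea 3, one hypothetical way to let edge features enter the attention score is to concatenate a transformed edge vector into the input of the attention mechanism a. The extra weight matrix W_e and all shapes below are my own assumptions, not part of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeAwareAttention(nn.Module):
    """Hypothetical attention mechanism a(W h_i, W h_j, e_ij) that also sees edge features."""

    def __init__(self, out_features, edge_features):
        super().__init__()
        # Project edge features to the same dimensionality as the transformed node features.
        self.W_e = nn.Linear(edge_features, out_features, bias=False)
        # The attention vector now scores [W h_i || W h_j || W_e e_ij].
        self.a = nn.Linear(3 * out_features, 1, bias=False)

    def forward(self, Wh_i, Wh_j, edge_feat):
        # Wh_i, Wh_j: (E, F') transformed endpoint features; edge_feat: (E, D_e), one row per edge.
        score = self.a(torch.cat([Wh_i, Wh_j, self.W_e(edge_feat)], dim=-1))
        return F.leaky_relu(score, negative_slope=0.2).squeeze(-1)  # unnormalized e_ij per edge
```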
