A PyTorch-Based GAT Tutorial

GawainTky

Last modified: 2023-12-08 13:06:47

Tags: pytorch, artificial intelligence, python, neural networks

First published: 2023-12-08 13:04:57

Original link: https://github.com/gordicaleksa/pytorch-GAT/blob/main/The%20Annotated%20GAT%20(Cora).ipynb

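GAT Tutorial

This post is a translation of the pytorch-GAT notebook by gordicaleksa. Thanks to the author.

The idea of this post is to make it easier for people who are not researchers to understand Graph Attention Networks (and GNNs in general)!

In this post you'll get answers to these questions:

✅ What exactly is GAT?

✅ How do you load and visualize the Cora citation network?

✅ How do we train GAT (Cora classification example)?

✅ How do you visualize different GAT properties?

After you complete this notebook you'll have a much better understanding of graph neural networks!

Note: this notebook focuses on Cora (a transductive example). If you need it, check out the other notebook for PPI, the protein-protein interaction dataset (an inductive example).

Nice, let's start!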

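What exactly are graph attention networks?

A Graph Attention Network, GAT for short, is a graph neural network (GNN) that was introduced back in 2017 in the paper "Graph Attention Networks" by Veličković et al.

Combining the idea of attention with the already existing graph convolutional networks (GCN) turned out to be a good move: GAT is the second most cited paper in the GNN literature (as of the time of writing).

Since GCN + attention = GAT, to understand GAT you basically need to understand GCN.

The whole idea came from CNNs. Convolutional neural networks worked really well, solved all sorts of computer vision tasks and created a huge hype around deep learning, so some people decided to transfer the idea to graphs.

The fundamental problem is that images lie on a regular grid (you can also think of an image as a graph) and therefore have a precise notion of order (e.g. my top-left neighbor, where a "neighbor" is what the CV world calls a pixel). Graphs don't have that nice property: both the number of neighbors and their order may vary.

So how do you define a kernel for a graph? The kernel size can't be something like 3x3, because sometimes a node has 2 neighbors and sometimes 233240 (yikes).

Two main ideas emerged:

Spectral methods (they all somehow leverage the eigenbasis of the graph Laplacian; I'm ignoring them completely here)
Spatial methods

Although spatial methods may be vaguely inspired by spectral methods, it's much better to think about them directly from the spatial perspective. OK, here we go.

High-level explanation of spatial (message passing) methods:

You have the feature vectors of your neighbors at your disposal. You do the following:

You somehow transform them (maybe with a linear projection)
You somehow aggregate them (maybe weighting them with attention coefficients, and voilà, we get GAT (you see what I did there))
You update the feature vector of the current node by (somehow) combining its (transformed) feature vector with the aggregated neighborhood representation.
That's pretty much it; you can fit many different GNNs into this framework.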
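The three steps above can be sketched in a few lines of PyTorch. This is only a minimal illustration under my own assumptions, not the GAT layer from the notebook: the function name `message_passing_step`, the mean aggregation and the ReLU update are all hypothetical choices made to keep the example short.

```python
import torch


def message_passing_step(x, edge_index, weight):
    """One simplified spatial GNN step: transform, aggregate (mean), update."""
    src, trg = edge_index  # each has shape (E,)

    # 1) transform every node feature vector, shape (N, F_out)
    h = x @ weight
    # one message per edge: the transformed features of the edge's source node
    messages = h.index_select(0, src)

    # 2) aggregate: sum incoming messages per target node, then divide by the in-degree
    agg = torch.zeros_like(h).index_add_(0, trg, messages)
    deg = torch.zeros(x.shape[0]).index_add_(0, trg, torch.ones(trg.shape[0]))
    agg = agg / deg.clamp(min=1).unsqueeze(-1)

    # 3) update: combine the node's own transformed features with the aggregate
    return torch.relu(h + agg)


x = torch.eye(3)  # 3 nodes with one-hot features
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])  # edges 0->1, 1->2, 2->0
weight = torch.ones(3, 2)
out = message_passing_step(x, edge_index, weight)
print(out.shape)  # torch.Size([3, 2])
```

GAT replaces the plain mean in step 2 with learned attention coefficients; the rest of the skeleton stays the same.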

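A schematic of GAT (edges of different colors represent different attention heads):
[Figure: GAT schematic]

Fun fact: transformers can be thought of as a special case of GAT, the case where the input graph is fully connected. See this blog for more details.

That's all you need to know for now!

If you need further help understanding all the details, I created this in-depth overview of the GAT paper:

Important: the code in this notebook is a subset of the code available in this repo. I'll focus on a single GAT implementation here (conceptually the hardest one to understand, but at the same time the most efficient one). Note that the repo actually contains 3 GAT implementations.

With that out of the way, let's dive in! Let's start with the imports related to data loading and visualization.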

# I always like to structure my imports into Python's native libs,
# stuff I installed via conda/pip and local file imports (but we don't have those here)

import pickle

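# Visualization related imports
import matplotlib.pyplot as plt
import networkx as nx
import igraph as ig
plt.rcParams['font.sans-serif'] = ['SimHei']  # CJK-capable font, only needed if plot labels contain Chinese text
# Main computation libraries
import scipy.sparse as sp
import numpy as np

# Deep learning related imports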
import torch

import os
import enum

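# Supported datasets - only Cora is used in this notebook
class DatasetType(enum.Enum):
    CORA = 0


# Networkx is not really made for plotting, but I experimented with it a bit
class GraphVisualizationTool(enum.Enum):
    NETWORKX = 0  # no trailing comma here, otherwise the value silently becomes the tuple (0,)
    IGRAPH = 1


# We'll be dumping and reading the data from this directory
DATA_DIR_PATH = os.path.join(os.getcwd(), 'data')
CORA_PATH = os.path.join(DATA_DIR_PATH, 'cora')  # this one is already checked in, no need to create the directory

#
# Cora specific constants
#

# Thomas Kipf et al. first used this split in the GCN paper, and later Petar Veličković et al. used it in the GAT paper
CORA_TRAIN_RANGE = [0, 140]  # we use the first 140 nodes as the training nodes
CORA_VAL_RANGE = [140, 140+500]
CORA_TEST_RANGE = [1708, 1708+1000]
CORA_NUM_INPUT_FEATURES = 1433
CORA_NUM_CLASSES = 7

# Used whenever we need to visualize points from different classes (t-SNE, Cora visualization)
cora_label_to_color_map = {0: "red", 1: "blue", 2: "green", 3: "orange", 4: "yellow", 5: "pink", 6: "gray"}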

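With that we've unlocked level 1 (data 📜). Let's continue!

Part 1: Understanding your data (become one with the data 📜❤️)

I'll be using the Cora citation network as the running example in this notebook.

At this point you may be wondering: what's the difference between transductive and inductive learning? If you're not familiar with GNNs this may seem like a weird concept, but it's actually really simple.

Transductive - you have a single graph (like Cora) and you split some of its nodes (not graphs) into train/val/test sets. While training you only use the labels of the training nodes. BUT, during the forward pass, by the very nature of how spatial GNNs work, you aggregate feature vectors from neighbors, and some of those may belong to the val or even the test set! The main point is that you are NOT using their label information, only the structural information and their features.

Inductive - you're probably much more familiar with this one if you come from a computer vision or NLP background. You have a set of training graphs, a separate set of validation graphs and, of course, a separate set of test graphs. For example, in a graph classification task the training data may contain labeled graphs and the test data unlabeled ones. The model can only use the graphs in the training data to learn how to classify graphs.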
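To make the transductive idea concrete, here is a small hedged sketch. The random `logits` tensor is a stand-in I made up for the output of a model that ran on the whole graph; the split sizes are hypothetical and much smaller than Cora's. The point is only that the forward pass covers every node, while each loss looks at the labels of one split.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_nodes, num_classes = 10, 3

# Stand-in for the model output on ALL nodes of the single graph (hypothetical data)
logits = torch.randn(num_nodes, num_classes)
labels = torch.randint(0, num_classes, (num_nodes,))

# Node-level splits: same graph, different node indices
train_idx = torch.arange(0, 4)
val_idx = torch.arange(4, 7)

loss_fn = nn.CrossEntropyLoss()
# Only the training labels contribute to the loss we would backpropagate
train_loss = loss_fn(logits.index_select(0, train_idx), labels.index_select(0, train_idx))
# Validation labels are used for evaluation only
val_loss = loss_fn(logits.index_select(0, val_idx), labels.index_select(0, val_idx))
print(train_loss.item(), val_loss.item())
```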

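With that explained, let's get into the code and load and visualize Cora.

# First let's define these simple functions for loading/saving pickle files - we need them for Cora

# All Cora data is stored as pickle files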
def pickle_read(path):
    with open(path, 'rb') as file:
        data = pickle.load(file)

    return data

def pickle_save(path, data):
    with open(path, 'wb') as file:
        pickle.dump(data, file, protocol=pickle.HIGHEST_PROTOCOL)

Now let's see how we can load Cora!

# We'll pass in the training config dictionary a bit later
def load_graph_data(training_config, device):
    dataset_name = training_config['dataset_name'].lower()
    should_visualize = training_config['should_visualize']

    if dataset_name == DatasetType.CORA.name.lower():

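        # shape = (N, FIN), where N is the number of nodes and FIN is the number of input features
        node_features_csr = pickle_read(os.path.join(CORA_PATH, 'node_features.csr'))
        # shape = (N,) - matches the torch.Size([2708]) printed further down
        node_labels_npy = pickle_read(os.path.join(CORA_PATH, 'node_labels.npy'))
        # shape = (N, number of neighboring nodes) <- this is a dictionary, not a matrix!
        adjacency_list_dict = pickle_read(os.path.join(CORA_PATH, 'adjacency_list.dict'))

        # Normalize the features (helps with training)
        node_features_csr = normalize_features_sparse(node_features_csr)
        num_of_nodes = len(node_labels_npy)

        # shape = (2, E), where E is the number of edges and 2 stands for source and target nodes. Basically the edge index
        # contains tuples of the format S->T, e.g. 0->3 means that the node with id 0 points to the node with id 3.
        topology = build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True)

        # Note: topology is just a fancy way of naming the graph structure data
        # (aside from the edge index it could also be in the form of an adjacency matrix)

        if should_visualize:  # network analysis and graph drawing
            plot_in_out_degree_distributions(topology, num_of_nodes, dataset_name)  # we'll define these in the second part
            visualize_graph(topology, node_labels_npy, dataset_name)

        # Convert to dense PyTorch tensors

        # Needs to be of long int type because later functions like PyTorch's index_select expect it
        topology = torch.tensor(topology, dtype=torch.long, device=device)
        node_labels = torch.tensor(node_labels_npy, dtype=torch.long, device=device)  # cross entropy expects a long int
        node_features = torch.tensor(node_features_csr.todense(), device=device)

        # Indices that help us extract the nodes that belong to the train/val and test splits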
        train_indices = torch.arange(CORA_TRAIN_RANGE[0], CORA_TRAIN_RANGE[1], dtype=torch.long, device=device)
        val_indices = torch.arange(CORA_VAL_RANGE[0], CORA_VAL_RANGE[1], dtype=torch.long, device=device)
        test_indices = torch.arange(CORA_TEST_RANGE[0], CORA_TEST_RANGE[1], dtype=torch.long, device=device)

        return node_features, node_labels, topology, train_indices, val_indices, test_indices
    else:
        raise Exception(f'{dataset_name} not yet supported.')

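Nice, I also used 2 more functions that I haven't defined yet. First let's see how feature normalization is done on Cora:

def normalize_features_sparse(node_features_sparse):
    assert sp.issparse(node_features_sparse), f'Expected a sparse matrix, got {node_features_sparse}.'

    # Instead of dividing (like in normalize_features_dense()) we multiply by the inverse sum of features.
    # Modern hardware (GPUs, TPUs, ASICs) is optimized for fast matrix multiplications! ^^ (* >> /)
    # shape = (N, FIN) -> (N, 1), where N is the number of nodes and FIN is the number of input features
    node_features_sum = np.array(node_features_sparse.sum(-1))  # sum the features of every node feature vector

    # Make an inverse (remember: * by 1/x is better (faster) than / by x)
    # shape = (N, 1) -> (N)
    node_features_inv_sum = np.power(node_features_sum, -1).squeeze()

    # Again, some sums will be 0, so 1/0 gives us inf; we replace those with 1, which is the neutral element for mul
    node_features_inv_sum[np.isinf(node_features_inv_sum)] = 1.

    # Create a diagonal matrix whose diagonal values come from node_features_inv_sum
    diagonal_inv_features_sum_matrix = sp.diags(node_features_inv_sum)

    # We return the normalized features.
    return diagonal_inv_features_sum_matrix.dot(node_features_sparse)

It basically makes Cora's binary node feature vectors sum to 1. For example, if we had [1, 0, 1, 0, 1] (Cora's feature vectors are longer, as we'll see shortly, but let's go with this for now), it would be transformed into [0.33, 0, 0.33, 0, 0.33]. It's as simple as that. Understanding the actual implementation is always harder, but conceptually it's a piece of cake.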
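As a quick sanity check of that description, here is a small self-contained sketch that applies the same "multiply by the inverse row sum" trick to a toy matrix. The 2x5 matrix is made up for illustration; the second, all-zero row is included to show what the inf-to-1 replacement does.

```python
import numpy as np
import scipy.sparse as sp

# Toy binary features: one regular row and one all-zero row (hypothetical data)
features = sp.csr_matrix(np.array([[1, 0, 1, 0, 1],
                                   [0, 0, 0, 0, 0]], dtype=np.float32))

row_sums = np.asarray(features.sum(axis=1)).squeeze()  # shape (N,)
with np.errstate(divide='ignore'):  # 1/0 -> inf is expected for the all-zero row
    inv_row_sums = np.power(row_sums, -1)
inv_row_sums[np.isinf(inv_row_sums)] = 1.0  # all-zero rows simply stay all-zero

# Left-multiplying by a diagonal matrix scales every row by its own factor
normalized = sp.diags(inv_row_sums).dot(features)
print(normalized.toarray())
```

The first row comes out as roughly [0.33, 0, 0.33, 0, 0.33] and the all-zero row is left untouched.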

Now let's build the edge index:

def build_edge_index(adjacency_list_dict, num_of_nodes, add_self_edges=True):
    source_nodes_ids, target_nodes_ids = [], []
    seen_edges = set()

    for src_node, neighboring_nodes in adjacency_list_dict.items():
        for trg_node in neighboring_nodes:
            # if this edge hasn't been seen so far we add it to the edge index (coalescing - removing duplicates)
            if (src_node, trg_node) not in seen_edges:  # it'd be easy to explicitly remove self-edges (Cora has none..)
                source_nodes_ids.append(src_node)
                target_nodes_ids.append(trg_node)

                seen_edges.add((src_node, trg_node))

    if add_self_edges:
        source_nodes_ids.extend(np.arange(num_of_nodes))
        target_nodes_ids.extend(np.arange(num_of_nodes))

    # shape = (2, E), where E is the number of edges in the graph
    # (np.vstack is used here because np.row_stack is deprecated as of NumPy 2.0)
    edge_index = np.vstack((source_nodes_ids, target_nodes_ids))

    return edge_index

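This one should be fairly simple - we just accumulate the edges in this format:
[[0, 1], [2, 2], …] where an [s, t] tuple basically defines an edge in which node s (source) points to node t (target).

The other popular format (used in another implementation in the source repo) is the adjacency matrix, but it takes up much more memory (O(N^2) to be precise, compared to O(E) for the edge index structure).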
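To see the two formats side by side, here is a minimal sketch on a made-up 3-node graph (the `adjacency_list` below is hypothetical, not Cora data). For Cora itself the difference is roughly 2708 x 2708 = 7,333,264 adjacency entries versus 2 x 13264 = 26,528 edge index entries.

```python
import numpy as np

adjacency_list = {0: [1, 2], 1: [0], 2: [0]}  # toy graph, hypothetical
num_nodes = 3

# Edge index: one column per edge, first row = sources, second row = targets
sources = [s for s, nbrs in adjacency_list.items() for _ in nbrs]
targets = [t for nbrs in adjacency_list.values() for t in nbrs]
edge_index = np.vstack((sources, targets))  # shape (2, E)

# Adjacency matrix: one cell per (source, target) pair, whether an edge exists or not
adjacency_matrix = np.zeros((num_nodes, num_nodes), dtype=np.int64)
adjacency_matrix[edge_index[0], edge_index[1]] = 1  # shape (N, N)

print(edge_index.shape, adjacency_matrix.shape)  # (2, 4) (3, 3)
```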

Nice, finally let's try to load it. We should also analyze the shapes - that's always a good idea.

# Let's just define dummy visualization functions for now - just to stop Python interpreter from complaining!

def plot_in_out_degree_distributions():
    pass

def visualize_graph():
    pass

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # checking whether you have a GPU

config = {
    'dataset_name': DatasetType.CORA.name,
    'should_visualize': False
}

node_features, node_labels, edge_index, train_indices, val_indices, test_indices = load_graph_data(config, device)

print(node_features.shape, node_features.dtype)
print(node_labels.shape, node_labels.dtype)
print(edge_index.shape, edge_index.dtype)
print(train_indices.shape, train_indices.dtype)
print(val_indices.shape, val_indices.dtype)
print(test_indices.shape, test_indices.dtype)

torch.Size([2708, 1433]) torch.float32
torch.Size([2708]) torch.int64
torch.Size([2, 13264]) torch.int64
torch.Size([140]) torch.int64
torch.Size([500]) torch.int64
torch.Size([1000]) torch.int64



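Nice! Analyzing the shapes we can see the following:

Cora has 2708 nodes
Every node has 1433 features (check out data_loading.py for more details)
We have 13264 edges! (including the self edges)
We have 140 training nodes
We have 500 validation nodes
We have 1000 test nodes
Additionally, almost all of the data is of type int64. Why? It's a constraint that PyTorch imposes on us: the loss function nn.CrossEntropyLoss and the index_select function require torch.long (i.e. 64-bit integers), so that's that.

node_labels is int64 because of nn.CrossEntropyLoss
the other variables are int64 because of index_select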
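A tiny sketch of both call sites with torch.long inputs, using made-up tensors. It only demonstrates the dtype that is known to work; I'm deliberately not claiming what happens with other integer types, since that can depend on the PyTorch version.

```python
import torch
import torch.nn as nn

# Class-index targets for cross entropy are passed as torch.long
logits = torch.randn(4, 7)  # 4 nodes, 7 classes (hypothetical values)
labels = torch.tensor([0, 3, 6, 1], dtype=torch.long)
loss = nn.CrossEntropyLoss()(logits, labels)

# index_select picks rows using a torch.long index tensor
features = torch.arange(12.0).view(4, 3)
index = torch.tensor([2, 0], dtype=torch.long)
picked = features.index_select(0, index)  # rows 2 and 0, in that order

print(labels.dtype, picked)
```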
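As a side note, it's always a good idea to test your code as you make progress.

Data loading is completely orthogonal to the rest of this notebook, so we can test it in isolation and make sure the shapes and data types make sense. I use this strategy when developing projects like this one (and in general).

I start with the data, add the loading functionality, add some visualizations, and only then do I usually start developing the deep learning model itself.

Visualizations are a huge bonus, so let's develop them.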

Visualizing the data 🔮👓

Let's start by understanding the degree distribution of the nodes in Cora - i.e. how many input/output edges the nodes have, which is a certain measure of graph connectivity.

Run the following code:

def plot_in_out_degree_distributions(edge_index, num_of_nodes, dataset_name):
    """
        Note: it would be easy to do all kinds of powerful network analysis using igraph/networkx, etc.
        I chose to explicitly calculate only the node degree statistics here, but you can go much further
        and calculate the graph diameter, number of triangles and many other concepts from the network analysis field.

    """
    if isinstance(edge_index, torch.Tensor):
        edge_index = edge_index.cpu().numpy()
        
    assert isinstance(edge_index, np.ndarray), f'Expected NumPy array got {type(edge_index)}.'

    # Store each node's input and output degree (they're the same for undirected graphs such as Cora)
    in_degrees = np.zeros(num_of_nodes, dtype=int)
    out_degrees = np.zeros(num_of_nodes, dtype=int)

    # Edge index shape = (2, E), the first row contains the source nodes, the second one the target/sink nodes
    # Note on terminology: source nodes point to target/sink nodes
    num_of_edges = edge_index.shape[1]
    for cnt in range(num_of_edges):
        source_node_id = edge_index[0, cnt]
        target_node_id = edge_index[1, cnt]

        out_degrees[source_node_id] += 1  # source node points towards some other node -> increment its out degree
        in_degrees[target_node_id] += 1  # similarly here

    hist = np.zeros(np.max(out_degrees) + 1)
    for out_degree in out_degrees:
        hist[out_degree] += 1

    fig = plt.figure(figsize=(12,8), dpi=100)  # otherwise the plots are really small in Jupyter Notebook
    fig.subplots_adjust(hspace=0.6)

    plt.subplot(311)
    plt.plot(in_degrees, color='red')
    plt.xlabel('node id'); plt.ylabel('in-degree count'); plt.title('Input degree for different node ids')

    plt.subplot(312)
    plt.plot(out_degrees, color='green')
    plt.xlabel('node id'); plt.ylabel('out-degree count'); plt.title('Out degree for different node ids')

    plt.subplot(313)
    plt.plot(hist, color='blue')
    plt.xlabel('node degree')
    plt.ylabel('# nodes for a given out-degree')
    plt.title(f'Node out-degree distribution for {dataset_name} dataset')
    plt.xticks(np.arange(0, len(hist), 5.0))

    plt.grid(True)
    plt.show()

Brilliant, now let's visualize Cora's degree distributions!

num_of_nodes = len(node_labels)
plot_in_out_degree_distributions(edge_index, num_of_nodes, config['dataset_name'])

[Figure: in-degree, out-degree and out-degree distribution plots for Cora]

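You can immediately notice a couple of things:

The first 2 plots are the same because we treat Cora as an undirected graph (even though it should naturally be modeled as a directed graph)
Certain nodes have a huge number of edges (the peaks in the middle), but most nodes have far fewer edges
The third plot nicely visualizes this in the form of a histogram - most nodes have only 2-5 edges (hence the peak on the far left side)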
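If the explicit Python loop in the plotting function feels slow on larger graphs, the same degree statistics can be computed in a vectorized way. This is a sketch of an alternative, not what the notebook does; the toy edge index is made up for illustration.

```python
import numpy as np

edge_index = np.array([[0, 0, 1, 2],
                       [1, 2, 0, 0]])  # toy (2, E) edge index, hypothetical
num_of_nodes = 3

# np.bincount counts how often each node id occurs as a source / as a target
out_degrees = np.bincount(edge_index[0], minlength=num_of_nodes)
in_degrees = np.bincount(edge_index[1], minlength=num_of_nodes)

# hist[d] = number of nodes whose out-degree equals d
hist = np.bincount(out_degrees)

print(out_degrees, in_degrees, hist)  # [2 1 1] [2 1 1] [0 2 1]
```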
OK, we're starting to get some valuable insights into Cora. Let's go further and literally visualize/see Cora.

The following cell will plot Cora, run it.

"""
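Check out this blog for the available graph visualization tools:
  https://towardsdatascience.com/large-graph-visualization-tools-and-approaches-2b8758a1cd59

Basically, depending on how big your graph is, there may be better drawing tools than igraph.

Note: unfortunately I had to flatten this function, since igraph has some problems in Jupyter Notebook.
We'll only call it here, so that's fine!

"""

dataset_name = config['dataset_name']
visualization_tool=GraphVisualizationTool.IGRAPH

# Fall back to the input itself if it is already a NumPy array, so the *_np names are always defined
edge_index_np = edge_index.cpu().numpy() if isinstance(edge_index, torch.Tensor) else edge_index

node_labels_np = node_labels.cpu().numpy() if isinstance(node_labels, torch.Tensor) else node_labels

num_of_nodes = len(node_labels_np)
edge_index_tuples = list(zip(edge_index_np[0, :], edge_index_np[1, :]))  # igraph requires this format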

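# Construct the igraph graph
ig_graph = ig.Graph()
ig_graph.add_vertices(num_of_nodes)
ig_graph.add_edges(edge_index_tuples)

# Prepare the visualization settings dictionary
visual_style = {
    "bbox": (700, 700),
    "margin": 5,
}

# I've chosen the edge thickness to be proportional to the number of shortest paths (geodesics)
# that go through a certain edge in our graph (edge_betweenness function, a simple ad hoc heuristic)

# line1: I use log, otherwise some edges will be too thick and others not visible at all
# edge_betweenness returns < 1 for certain edges, that's why I use clip, as log would be negative for those edges
# line2: normalize so that the thickest edge is 1, otherwise the edges appear too thick on the chart
# line3: the idea here is to keep the strongest edges stronger than the others; 6 just worked, don't dwell on it

edge_weights_raw = np.clip(np.log(np.asarray(ig_graph.edge_betweenness())+1e-16), a_min=0, a_max=None)
edge_weights_raw_normalized = edge_weights_raw / np.max(edge_weights_raw)
edge_weights = [w**6 for w in edge_weights_raw_normalized]
visual_style["edge_width"] = edge_weights

# A simple heuristic for vertex size. Size ~ (degree / 4) (I tried log and sqrt as well and got good results)
visual_style["vertex_size"] = [deg / 4 for deg in ig_graph.degree()]

# This is the only part that's Cora specific, as Cora has 7 labels
if dataset_name.lower() == DatasetType.CORA.name.lower():
    visual_style["vertex_color"] = [cora_label_to_color_map[label] for label in node_labels_np]
else:
    print('Feel free to add a custom color scheme for your specific dataset. Using igraph default coloring.')

# Set the layout - the way the graph is presented on a 2D chart. Graph drawing is a subfield of its own!
# I used the "Kamada Kawai" force-directed method; this family of methods is based on physical system simulation.
# (layout_drl also gave nice results for Cora)
visual_style["layout"] = ig_graph.layout_kamada_kawai()

print('Plotting results ... (it may take a couple of seconds).')
ig.plot(ig_graph, **visual_style)

# This website has some awesome visualizations, check it out:
# http://networkrepository.com/graphvis.php?d=./data/gsm50/labeled/cora.edges

Plotting results ... (it may take a couple of seconds).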

[Figure: igraph plot of the Cora graph, nodes colored by class]

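Try setting visual_style["bbox"] to (3000, 3000) and using / 2 for vertex_size, then run it again: you'll get a huuuge, awesome-looking plot (the drawing behind igraph is handled in C, so it's fairly fast, at least on my machine, with only some slight lag when you scroll it).

OK, we're done with visualizing and understanding our data. This is a huge milestone, so pat yourself on the back. 🏆🎂🎵

We've unlocked level 2 (GAT model 🦄). 😍

Now, let's understand this model!

Part 2: Understanding GAT's inner workings

First, let's create a high-level GAT class that we'll build up from GATLayer objects.

It basically just stacks the layers into an nn.Sequential object. Additionally, since nn.Sequential expects a single input (and has a single output), I just pack the data (features, edge index) into a tuple - pure syntactic sugar.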
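The tuple-packing trick can be illustrated in isolation. In the sketch below `ToyLayer` is a hypothetical stand-in for a GAT layer (it ignores the graph structure entirely); the only thing it demonstrates is how a layer that needs two inputs can still be chained inside nn.Sequential.

```python
import torch
import torch.nn as nn


class ToyLayer(nn.Module):  # hypothetical stand-in for a GAT layer
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, data):
        x, edge_index = data  # unpack the single tuple argument
        # Re-pack the output so the next layer receives the same (features, edge_index) format
        return (self.linear(x), edge_index)


net = nn.Sequential(ToyLayer(4, 8), ToyLayer(8, 2))

x = torch.randn(5, 4)  # 5 nodes, 4 input features
edge_index = torch.tensor([[0, 1], [1, 2]])
out, same_edge_index = net((x, edge_index))  # note the extra parentheses: one tuple argument
print(out.shape)  # torch.Size([5, 2])
```

The edge index is passed through unchanged, which is why every layer returns it alongside the new features.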

import torch.nn as nn
from torch.optim import Adam


class GAT(torch.nn.Module):
    """
    The most interesting and hardest implementation is implementation #3.
    Imp1 and imp2 differ in subtle details but are basically the same thing.

    So I'll focus on imp #3 in this notebook.

    """

    def __init__(self, num_of_layers, num_heads_per_layer, num_features_per_layer, add_skip_connection=True, bias=True,
                 dropout=0.6, log_attention_weights=False):
        super().__init__()
        assert num_of_layers == len(num_heads_per_layer) == len(num_features_per_layer) - 1, 'Enter valid arch params.'

        num_heads_per_layer = [1] + num_heads_per_layer  # trick - so that I can nicely create the GAT layers below

        gat_layers = []  # collect the GAT layers
        for i in range(num_of_layers):
            layer = GATLayer(
                num_in_features=num_features_per_layer[i] * num_heads_per_layer[i],  # consequence of concatenation
                num_out_features=num_features_per_layer[i+1],
                num_of_heads=num_heads_per_layer[i+1],
                concat=True if i < num_of_layers - 1 else False,  # the last GAT layer does mean avg, the others do concat
                activation=nn.ELU() if i < num_of_layers - 1 else None,  # the last layer just outputs raw scores
                dropout_prob=dropout,
                add_skip_connection=add_skip_connection,
                bias=bias,
                log_attention_weights=log_attention_weights
            )
            gat_layers.append(layer)

        self.gat_net = nn.Sequential(
            *gat_layers,
        )

    # data is just a (in_nodes_features, edge_index) tuple, I had to do it like this because of nn.Sequential:
    # https://discuss.pytorch.org/t/forward-takes-2-positional-arguments-but-3-were-given-for-nn-sqeuential-with-linear-layers/65698
    def forward(self, data):
        return self.gat_net(data)

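Now for the fun part: let's define the layer.

I don't think explaining it in words would be any better than you taking the time to digest the code and the comments.

Before you start wasting your time trying to figure it out "from scratch", you may want to watch the author's video on GAT first. It's always good to have some theoretical background at hand.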

class GATLayer(torch.nn.Module):
    """
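    Implementation #3 was inspired by PyTorch Geometric: https://github.com/rusty1s/pytorch_geometric

    But the implementation here should be much easier to understand! (and of similar performance)

    """
    
    # We'll use these constants in many functions so they are extracted here as member fields
    src_nodes_dim = 0  # position of source nodes in the edge index
    trg_nodes_dim = 1  # position of target nodes in the edge index

    # These may change in the inductive setting - leaving it like this for now (not future proof)
    nodes_dim = 0      # node dimension (axis is maybe a more familiar term for tensors; nodes_dim is the position of "N")
    head_dim = 1       # attention head dimension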

    def __init__(self, num_in_features, num_out_features, num_of_heads, concat=True, activation=nn.ELU(),
                 dropout_prob=0.6, add_skip_connection=True, bias=True, log_attention_weights=False):

        super().__init__()

        self.num_of_heads = num_of_heads
        self.num_out_features = num_out_features
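        self.concat = concat  # whether we should concatenate or average the attention heads
        self.add_skip_connection = add_skip_connection

        #
        # Trainable weights: linear projection matrix (denoted as "W" in the paper), attention target/source
        # (denoted as "a" in the paper) and bias (not mentioned in the paper but present in the official GAT repo)
        #

        # You can treat this one matrix as num_of_heads independent W matrices
        self.linear_proj = nn.Linear(num_in_features, num_of_heads * num_out_features, bias=False)

        # After we concatenate the target node (node i) and the source node (node j) we apply the "additive" scoring
        # function which gives us the un-normalized score "e". Here we split the "a" vector - but the semantics remain the same.
        # Basically, instead of doing [x, y] (concatenation, x/y are node feature vectors) and a dot product with "a",
        # we do a dot product between x and "a_left" and between y and "a_right", and then sum them up
        self.scoring_fn_target = nn.Parameter(torch.Tensor(1, num_of_heads, num_out_features))
        self.scoring_fn_source = nn.Parameter(torch.Tensor(1, num_of_heads, num_out_features))

        # Bias is definitely not crucial to GAT - feel free to experiment (I pinged the main author, Petar, on this one)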
        if bias and concat:
            self.bias = nn.Parameter(torch.Tensor(num_of_heads * num_out_features))
        elif bias and not concat:
            self.bias = nn.Parameter(torch.Tensor(num_out_features))
        else:
            self.register_parameter('bias', None)

        if add_skip_connection:
            self.skip_proj = nn.Linear(num_in_features, num_of_heads * num_out_features, bias=False)
        else:
            self.register_parameter('skip_proj', None)

        #
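        # End of trainable weights
        #

        self.leakyReLU = nn.LeakyReLU(0.2)  # using 0.2 as in the paper, no need to expose every setting
        self.activation = activation
        # Probably not the nicest design, but I use the same module in 3 locations: before/after the feature projection
        # and for the attention coefficients. Functionality-wise it's the same as using independent modules.
        self.dropout = nn.Dropout(p=dropout_prob)

        self.log_attention_weights = log_attention_weights  # whether we should log the attention weights
        self.attention_weights = None  # for later visualization purposes, I cache the weights here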

        self.init_params()
        
    def forward(self, data):
        #
        # Step 1: Linear Projection + regularization
        #

        in_nodes_features, edge_index = data  # unpack data
        num_of_nodes = in_nodes_features.shape[self.nodes_dim]
        assert edge_index.shape[0] == 2, f'Expected edge index with shape=(2,E) got {edge_index.shape}'

        # shape = (N, FIN), where N is the number of nodes in the graph and FIN the number of input features per node
        # We apply dropout to all of the input node features (as mentioned in the paper)
        # Note: for Cora the features are already super sparse, so it may actually not help much
        in_nodes_features = self.dropout(in_nodes_features)

        # shape = (N, FIN) * (FIN, NH*FOUT) -> (N, NH, FOUT), where NH is the number of heads and FOUT the number of output features
        # We project the input node features into NH independent output features (one for each attention head)
        nodes_features_proj = self.linear_proj(in_nodes_features).view(-1, self.num_of_heads, self.num_out_features)

        nodes_features_proj = self.dropout(nodes_features_proj)  # in the official GAT imp they used dropout here as well

        #
        # Step 2: Edge attention calculation
        #

        # Apply the scoring function (* represents element-wise (a.k.a. Hadamard) product)
        # shape = (N, NH, FOUT) * (1, NH, FOUT) -> (N, NH, 1) -> (N, NH) because sum squeezes the last dimension
        # Optimization note: torch.sum() is as performant as .sum() in my experiments
        scores_source = (nodes_features_proj * self.scoring_fn_source).sum(dim=-1)
        scores_target = (nodes_features_proj * self.scoring_fn_target).sum(dim=-1)

        # We simply copy (lift) the scores for source/target nodes based on the edge index. Instead of preparing all
        # the possible combinations of scores, we only prepare those that will actually be used, as defined by the edge index.
        # scores shape = (E, NH), nodes_features_proj_lifted shape = (E, NH, FOUT), E is the number of edges in the graph
        scores_source_lifted, scores_target_lifted, nodes_features_proj_lifted = self.lift(scores_source, scores_target, nodes_features_proj, edge_index)
        scores_per_edge = self.leakyReLU(scores_source_lifted + scores_target_lifted)

        # shape = (E, NH, 1)
        attentions_per_edge = self.neighborhood_aware_softmax(scores_per_edge, edge_index[self.trg_nodes_dim], num_of_nodes)
        # Add stochasticity to neighborhood aggregation
        attentions_per_edge = self.dropout(attentions_per_edge)

        #
        # Step 3: Neighborhood aggregation
        #

        # Element-wise (a.k.a. Hadamard) product. Operator * does the same thing as torch.mul
        # shape = (E, NH, FOUT) * (E, NH, 1) -> (E, NH, FOUT), 1 gets broadcast into FOUT
        nodes_features_proj_lifted_weighted = nodes_features_proj_lifted * attentions_per_edge

        # This part sums up the weighted and projected neighborhood feature vectors for every target node
        # shape = (N, NH, FOUT)
        out_nodes_features = self.aggregate_neighbors(nodes_features_proj_lifted_weighted, edge_index, in_nodes_features, num_of_nodes)

        #
        # Step 4: Residual/skip connections, concat and bias
        #

        out_nodes_features = self.skip_concat_bias(attentions_per_edge, in_nodes_features, out_nodes_features)
        return (out_nodes_features, edge_index)

    #
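    # Helper functions (without the comments there is very little code, so don't be scared!)
    #

    def neighborhood_aware_softmax(self, scores_per_edge, trg_index, num_of_nodes):
        """
        As the function name suggests, it does softmax over the neighborhoods. Example: say we have 5 nodes in a graph.
        Two of them, 1 and 2, are connected to node 3. If we want to calculate the representation for node 3 we should
        take into account the feature vectors of nodes 1, 2 and 3 itself. Since we have scores for edges 1-3, 2-3 and 3-3,
        this function will calculate attention scores like this: 1-3 / (1-3 + 2-3 + 3-3) (where 1-3 is overloaded
        notation, it stands for the edge 1-3 and its (exp) score), and similarly for 2-3 and 3-3, i.e. for this
        neighborhood we don't care about the scores of other edges that involve nodes 4 and 5.

        Note:
        Subtracting the max value from the logits doesn't change the end result but it improves numerical stability,
        and it's a fairly common "trick" used in pretty much every deep learning framework.
        Check out this link for more details:

        https://stats.stackexchange.com/questions/338285/how-does-the-subtraction-of-the-logit-maximum-improve-learning

        """
        # Calculate the numerator. Make logits <= 0 so that e^logit <= 1 (this improves numerical stability)
        scores_per_edge = scores_per_edge - scores_per_edge.max()
        exp_scores_per_edge = scores_per_edge.exp()  # softmax

        # Calculate the denominator. shape = (E, NH)
        neigborhood_aware_denominator = self.sum_edge_scores_neighborhood_aware(exp_scores_per_edge, trg_index, num_of_nodes)

        # 1e-16 is theoretically not needed; it's only there for numerical stability (avoid div by 0), since
        # the computer may round a very small number all the way down to 0
        attentions_per_edge = exp_scores_per_edge / (neigborhood_aware_denominator + 1e-16)

        # shape = (E, NH) -> (E, NH, 1) so that we can do element-wise multiplication with projected node features
        return attentions_per_edge.unsqueeze(-1)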

    def sum_edge_scores_neighborhood_aware(self, exp_scores_per_edge, trg_index, num_of_nodes):
        # The shape must be the same as in exp_scores_per_edge (required by scatter_add_), i.e. from E -> (E, NH)
        trg_index_broadcasted = self.explicit_broadcast(trg_index, exp_scores_per_edge)

        # shape = (N, NH), where N is the number of nodes and NH the number of attention heads
        size = list(exp_scores_per_edge.shape)  # convert to list, otherwise assignment is not possible
        size[self.nodes_dim] = num_of_nodes
        neighborhood_sums = torch.zeros(size, dtype=exp_scores_per_edge.dtype, device=exp_scores_per_edge.device)

        # Position i will contain the sum of exp scores of all the nodes that point to node i (as dictated by the target index)
        neighborhood_sums.scatter_add_(self.nodes_dim, trg_index_broadcasted, exp_scores_per_edge)

        # Expand again so that we can use it as a softmax denominator. E.g. node i's sum will be copied to
        # all the locations where the source nodes pointed to i (as dictated by the target index)
        # shape = (N, NH) -> (E, NH)
        return neighborhood_sums.index_select(self.nodes_dim, trg_index)

    def aggregate_neighbors(self, nodes_features_proj_lifted_weighted, edge_index, in_nodes_features, num_of_nodes):
        size = list(nodes_features_proj_lifted_weighted.shape)  # convert to list, otherwise assignment is not possible
        size[self.nodes_dim] = num_of_nodes  # shape = (N, NH, FOUT)
        out_nodes_features = torch.zeros(size, dtype=in_nodes_features.dtype, device=in_nodes_features.device)

        # shape = (E) -> (E, NH, FOUT)
        trg_index_broadcasted = self.explicit_broadcast(edge_index[self.trg_nodes_dim], nodes_features_proj_lifted_weighted)
        # Aggregation step - we accumulate the projected, weighted node features for all the attention heads
        # shape = (E, NH, FOUT) -> (N, NH, FOUT)
        out_nodes_features.scatter_add_(self.nodes_dim, trg_index_broadcasted, nodes_features_proj_lifted_weighted)

        return out_nodes_features

    def lift(self, scores_source, scores_target, nodes_features_matrix_proj, edge_index):
        """
        Lifts, i.e. duplicates, certain vectors depending on the edge index.
        One of the tensor dims goes from N -> E (that's where the "lift" comes from).

        """
        src_nodes_index = edge_index[self.src_nodes_dim]
        trg_nodes_index = edge_index[self.trg_nodes_dim]

        # Using index_select is faster than "normal" indexing (scores_source[src_nodes_index]) in PyTorch!
        scores_source = scores_source.index_select(self.nodes_dim, src_nodes_index)
        scores_target = scores_target.index_select(self.nodes_dim, trg_nodes_index)
        nodes_features_matrix_proj_lifted = nodes_features_matrix_proj.index_select(self.nodes_dim, src_nodes_index)

        return scores_source, scores_target, nodes_features_matrix_proj_lifted

    def explicit_broadcast(self, this, other):
        # Append singleton dimensions until this.dim() == other.dim()
        for _ in range(this.dim(), other.dim()):
            this = this.unsqueeze(-1)

        # Explicitly expand so that the shapes are the same
        return this.expand_as(other)

    def init_params(self):
        """
        The reason we're using Glorot (a.k.a. Xavier uniform) initialization is that it's TF's default initializer:
            https://stackoverflow.com/questions/37350131/what-is-the-default-variable-initializer-in-tensorflow

        The original repo was developed in TensorFlow (TF) and they used the default initialization.
        Feel free to experiment - there may be better initializations depending on your problem.

        """
        nn.init.xavier_uniform_(self.linear_proj.weight)
        nn.init.xavier_uniform_(self.scoring_fn_target)
        nn.init.xavier_uniform_(self.scoring_fn_source)

        if self.bias is not None:
            torch.nn.init.zeros_(self.bias)

    def skip_concat_bias(self, attention_coefficients, in_nodes_features, out_nodes_features):
        if self.log_attention_weights:  # potentially log for later visualization in playground.py
            self.attention_weights = attention_coefficients

        if self.add_skip_connection:  # add skip or residual connection
            if out_nodes_features.shape[-1] == in_nodes_features.shape[-1]:  # if FIN == FOUT
                # unsqueeze does this: (N, FIN) -> (N, 1, FIN), out features are (N, NH, FOUT) so 1 gets broadcast to NH
                # Thus we're basically copying the input vectors NH times and adding them to the processed vectors
                out_nodes_features += in_nodes_features.unsqueeze(1)
            else:
                # FIN != FOUT so we need to project the input feature vectors into a dimension that can be added
                # to the output feature vectors. skip_proj adds lots of additional capacity which may cause overfitting.
                out_nodes_features += self.skip_proj(in_nodes_features).view(-1, self.num_of_heads, self.num_out_features)

        if self.concat:
            # shape = (N, NH, FOUT) -> (N, NH*FOUT)
            out_nodes_features = out_nodes_features.view(-1, self.num_of_heads * self.num_out_features)
        else:
            # shape = (N, NH, FOUT) -> (N, FOUT)
            out_nodes_features = out_nodes_features.mean(dim=self.head_dim)

        if self.bias is not None:
            out_nodes_features += self.bias

        return out_nodes_features if self.activation is None else self.activation(out_nodes_features)

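The idea behind the huge savings is to compute scores only for the nodes that will actually be used, not for every imaginable combination (which would only make sense in a fully connected graph).

Once we have computed the "left" scores and the "right" scores, we "lift" them using the edge index. That way, if an edge 1->2 doesn't exist in the graph, we won't have that score pair in our data structures.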
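The reason separate "left" and "right" scores are enough is that a dot product with a concatenated vector splits into two dot products. The sketch below checks this numerically for a single edge and a single head; the names `h_i`, `h_j`, `a_left` and `a_right` are mine, chosen to mirror the comments in the layer code, and the values are random.

```python
import torch

torch.manual_seed(0)
f_out = 4

h_i = torch.randn(f_out)  # projected features of the target node i
h_j = torch.randn(f_out)  # projected features of the source node j
a = torch.randn(2 * f_out)  # the attention vector "a" from the paper
a_left, a_right = a[:f_out], a[f_out:]  # split "a" into two halves

# Paper formulation: concatenate first, then take the dot product with "a"
score_concat = torch.dot(a, torch.cat([h_i, h_j]))
# Split formulation: two smaller dot products that are summed afterwards
score_split = torch.dot(a_left, h_i) + torch.dot(a_right, h_j)

print(torch.allclose(score_concat, score_split, atol=1e-5))  # True
```

The practical benefit is that each per-node score is computed once for all N nodes and only then copied onto the E edges, instead of concatenating feature vectors per edge.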

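After adding the lifted "left" and "right" (or, better named, source and target) scores, we do a smart neighborhood-aware softmax, so that the semantics of GAT are respected. After the scatter add (you should take the time to understand it and read the docs), we can combine the projected feature vectors and, voilà, we have a fully fledged GAT layer.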
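Here is a stripped-down sketch of that neighborhood-aware softmax and the scatter-add aggregation, for a single attention head and a made-up 4-node graph. It follows the same idea as the layer code but is not a copy of it; the scores and the one-hot "features" are hypothetical values picked so the result is easy to inspect.

```python
import torch

edge_index = torch.tensor([[1, 2, 3, 0],
                           [3, 3, 3, 0]])  # edges 1->3, 2->3, 3->3 and the self edge 0->0
scores = torch.tensor([1.0, 2.0, 3.0, 5.0])  # one un-normalized score per edge
num_nodes = 4
src, trg = edge_index

# Numerator: exponentiate after subtracting the max for numerical stability
exp_scores = (scores - scores.max()).exp()

# Denominator: sum the exp scores of all edges that share the same target node, shape (N,)
neighborhood_sums = torch.zeros(num_nodes).scatter_add_(0, trg, exp_scores)
# Copy each target's sum back onto its edges, shape (E,)
attention = exp_scores / (neighborhood_sums.index_select(0, trg) + 1e-16)

# Aggregation: weight the lifted source features and scatter-add them into their target nodes
features = torch.eye(num_nodes)  # toy projected features, shape (N, F)
weighted = features.index_select(0, src) * attention.unsqueeze(-1)  # shape (E, F)
out = torch.zeros(num_nodes, features.shape[1]).scatter_add_(
    0, trg.unsqueeze(-1).expand_as(weighted), weighted)  # shape (N, F)

print(attention)
print(out)
```

The three edges pointing at node 3 get coefficients that sum to 1 among themselves, independently of the score on the 0->0 edge, which is exactly the "per neighborhood" behavior a plain softmax over all edges would not give you.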

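Take your time and be patient! Especially if you're new to GNNs.

I didn't learn all of this in a single day either; it takes time to digest knowledge.

Having said that, we've unlocked level 3 (model training 💪). 😍
We have the data 📜 ready, we have the GAT model 🦄 ready, so let's start training this beast! 💪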

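Part 3: Training GAT 💪 (classification on Cora!)

Phew, OK, the hardest part is behind us. Let's create a simple training loop whose goal is to learn to classify Cora nodes.

But first let's define some relevant constants.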

from torch.utils.tensorboard import SummaryWriter

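# 3 different model training/eval phases used in train.py
class LoopPhase(enum.Enum):
    # no trailing commas here, otherwise TRAIN and VAL silently become the tuples (0,) and (1,)
    TRAIN = 0  # training phase
    VAL = 1    # validation phase
    TEST = 2   # test phase

writer = SummaryWriter()  # (tensorboard) writer will output to the ./runs/ directory by default

# Global vars used for early stopping. After some number of epochs (as defined by the patience_period var)
# without any improvement on the validation dataset (measured via the accuracy metric), we'll break out of the training loop.
BEST_VAL_ACC = 0      # best validation accuracy
BEST_VAL_LOSS = 0     # best validation loss
PATIENCE_CNT = 0      # patience counter

BINARIES_PATH = os.path.join(os.getcwd(), 'models', 'binaries')  # where the model binaries are stored
CHECKPOINTS_PATH = os.path.join(os.getcwd(), 'models', 'checkpoints')  # where the checkpoints are stored

# Make sure these exist as the rest of the code assumes it
os.makedirs(BINARIES_PATH, exist_ok=True)
os.makedirs(CHECKPOINTS_PATH, exist_ok=True)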

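Also, let's define a couple of functions that will be useful while training the model.

The training state contains a lot of useful metadata which we can use later on. As you can imagine, saving the test accuracy of your model is important, especially when you're training your models in the cloud - it keeps things much better organized.

import git  # provided by the GitPython package
import re  # regular expressions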


def get_training_state(training_config, model):
    training_state = {
        "commit_hash": git.Repo(search_parent_directories=True).head.object.hexsha,

        # Training details
        "dataset_name": training_config['dataset_name'],  # dataset name
        "num_of_epochs": training_config['num_of_epochs'],  # number of training epochs
        "test_acc": training_config['test_acc'],  # test accuracy

        # Model structure
        "num_of_layers": training_config['num_of_layers'],  # number of layers
        "num_heads_per_layer": training_config['num_heads_per_layer'],  # number of attention heads per layer
        "num_features_per_layer": training_config['num_features_per_layer'],  # number of features per layer
        "add_skip_connection": training_config['add_skip_connection'],  # whether to add skip connections
        "bias": training_config['bias'],  # whether to use bias
        "dropout": training_config['dropout'],  # dropout rate

        # Model state
        "state_dict": model.state_dict()  # the model's state dictionary
    }

    return training_state


def print_model_metadata(training_state):
    header = f'\n{"*"*5} Model training metadata: {"*"*5}'
    print(header)

    for key, value in training_state.items():
        if key != 'state_dict':  # don't print state_dict, it's just a bunch of numbers...
            print(f'{key}: {value}')
    print(f'{"*" * len(header)}\n')


# This one makes sure we don't overwrite valuable model binaries (feel free to ignore - not crucial to the GAT method)
def get_available_binary_name():
    prefix = 'gat'

    def valid_binary_name(binary_name):
        # First time you see a raw f-string? Don't worry, the only trick is to double the braces.
        pattern = re.compile(rf'{prefix}_[0-9]{{6}}\.pth')
        return re.fullmatch(pattern, binary_name) is not None

    # Just list the existing binaries so that we don't overwrite them but write to a new one instead
    valid_binary_names = list(filter(valid_binary_name, os.listdir(BINARIES_PATH)))
    if len(valid_binary_names) > 0:
        last_binary_name = sorted(valid_binary_names)[-1]
        new_suffix = int(last_binary_name.split('.')[0][-6:]) + 1  # increment by 1
        return f'{prefix}_{str(new_suffix).zfill(6)}.pth'
    else:
        return f'{prefix}_000000.pth'

Nice, now argparse is just a nice way to organize your program settings:

import argparse

def get_training_args():
    parser = argparse.ArgumentParser()

    # Training related
    parser.add_argument("--num_of_epochs", type=int, help="number of training epochs", default=10000)
    parser.add_argument("--patience_period", type=int, help="number of epochs with no improvement on val before terminating", default=1000)
    parser.add_argument("--lr", type=float, help="model learning rate", default=5e-3)
    parser.add_argument("--weight_decay", type=float, help="L2 regularization on model weights", default=5e-4)
    parser.add_argument("--should_test", type=bool, help='should test the model on the test dataset?', default=True)

    # Dataset related
    parser.add_argument("--dataset_name", choices=[el.name for el in DatasetType], help='dataset to use for training', default=DatasetType.CORA.name)
    parser.add_argument("--should_visualize", type=bool, help='should visualize the dataset?', default=False)

    # Logging/debugging/checkpoint related (helps a lot with experimentation)
    parser.add_argument("--enable_tensorboard", type=bool, help="enable tensorboard logging", default=False)
    parser.add_argument("--console_log_freq", type=int, help="log to output console (epoch) freq (None for no logging)", default=100)
    parser.add_argument("--checkpoint_freq", type=int, help="checkpoint model saving (epoch) freq (None for no logging)", default=1000)
    args = parser.parse_args("")

    # Model architecture related - this is the architecture as defined in the official paper (for Cora classification)
    gat_config = {
        "num_of_layers": 2,  # GNNs, contrary to CNNs, are often shallow (it ultimately depends on the graph properties)
        "num_heads_per_layer": [8, 1],
        "num_features_per_layer": [CORA_NUM_INPUT_FEATURES, 8, CORA_NUM_CLASSES],
        "add_skip_connection": False,  # hurts perf on Cora
        "bias": True,  # result is not so sensitive to bias
        "dropout": 0.6,  # result is sensitive to dropout
    }

    # Wrap the training configuration into a dictionary
    training_config = dict()
    for arg in vars(args):
        training_config[arg] = getattr(args, arg)

    # Add additional config information
    training_config.update(gat_config)

    return training_config
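One thing to be aware of if you ever pass these flags on a real command line: `type=bool` simply calls `bool()` on the raw string, and any non-empty string (including "False") is truthy. In the notebook this never bites because `parse_args("")` only uses the defaults. The sketch below shows the effect next to a small converter; `str2bool`, `--naive_flag` and `--safe_flag` are hypothetical names of mine, not part of the original code.

```python
import argparse


def str2bool(value):  # hypothetical helper, not part of the original notebook
    """Parse common textual spellings of a boolean."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Expected a boolean, got {value!r}.")


parser = argparse.ArgumentParser()
parser.add_argument("--naive_flag", type=bool, default=True)  # bool("False") is True!
parser.add_argument("--safe_flag", type=str2bool, default=True)

args = parser.parse_args(["--naive_flag", "False", "--safe_flag", "False"])
print(args.naive_flag, args.safe_flag)  # True False
```

Another option, if your Python version supports it, is a flag-style argument such as `action="store_true"`, which avoids parsing strings altogether.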

在这里，我们组织了高级别的 GAT 训练所需的一切。只需结合我们已经学过的部分即可。

import time

def train_gat(config):
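    global BEST_VAL_ACC, BEST_VAL_LOSS, PATIENCE_CNT  # PATIENCE_CNT is reset below, so it has to be declared global as well

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # checking whether you have a GPU, I hope so!

    # Step 1: load the graph data
    node_features, node_labels, edge_index, train_indices, val_indices, test_indices = load_graph_data(config, device)

    # Step 2: prepare the model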
    gat = GAT(
        num_of_layers=config['num_of_layers'],
        num_heads_per_layer=config['num_heads_per_layer'],
        num_features_per_layer=config['num_features_per_layer'],
        add_skip_connection=config['add_skip_connection'],
        bias=config['bias'],
        dropout=config['dropout'],
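        log_attention_weights=False  # no need to store attention, it is only used in playground.py for visualizations
    ).to(device)

    # Step 3: prepare the other training related utilities (loss, optimizer and the main loop wrapper function)
    loss_fn = nn.CrossEntropyLoss(reduction='mean')
    optimizer = Adam(gat.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])

    # This is the core part of the training (we'll define it in a minute)
    # The wrapper function makes things cleaner since there is a lot of redundancy between the training and validation loops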
    main_loop = get_main_loop(
        config,
        gat,
        loss_fn,
        optimizer,
        node_features,
        node_labels,
        edge_index,
        train_indices,
        val_indices,
        test_indices,
        config['patience_period'],
        time.time())

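    # Reset the variables used for early stopping. The best loss starts at infinity:
    # starting it at 0 would mean the "loss < BEST_VAL_LOSS" check in the main loop could never fire.
    BEST_VAL_ACC, BEST_VAL_LOSS, PATIENCE_CNT = 0, float('inf'), 0

    # Step 4: start the training procedure
    for epoch in range(config['num_of_epochs']):
        # Training loop
        main_loop(phase=LoopPhase.TRAIN, epoch=epoch)

        # Validation loop
        with torch.no_grad():
            try:
                main_loop(phase=LoopPhase.VAL, epoch=epoch)
            except Exception as e:  # the "patience has run out" exception :O
                print(str(e))
                break  # break out of the training loop

    # Step 5: potentially test your model
    # Don't overfit to the test dataset - only once you've fine-tuned your model on the validation dataset
    # should you report your final loss and accuracy on the test dataset.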
    if config['should_test']:
        test_acc = main_loop(phase=LoopPhase.TEST)
        config['test_acc'] = test_acc
        print(f'Test accuracy = {test_acc}')
    else:
        config['test_acc'] = -1

    # Save the latest GAT in the binaries directory
    torch.save(get_training_state(config, gat), os.path.join(BINARIES_PATH, get_available_binary_name()))
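
The early-stopping behavior that `train_gat` relies on (stop once validation has not improved for `patience_period` epochs) can be looked at in isolation. The following is a minimal sketch, not the notebook's code: `PatienceTracker` is a hypothetical helper, the validation history is made up, and starting the best loss at infinity is an assumption of this sketch.

```python
class PatienceTracker:
    """Hypothetical helper: counts consecutive epochs without validation improvement."""

    def __init__(self, patience_period):
        self.patience_period = patience_period
        self.best_acc = 0.0
        self.best_loss = float("inf")  # assumption: any finite first loss counts as an improvement
        self.counter = 0

    def should_stop(self, val_acc, val_loss):
        improved = val_acc > self.best_acc or val_loss < self.best_loss
        if improved:
            self.best_acc = max(val_acc, self.best_acc)
            self.best_loss = min(val_loss, self.best_loss)
            self.counter = 0  # reset on every improvement
        else:
            self.counter += 1
        return self.counter >= self.patience_period


tracker = PatienceTracker(patience_period=3)
# Made-up (validation accuracy, validation loss) pairs, one per epoch
history = [(0.70, 1.2), (0.78, 0.9), (0.77, 0.95), (0.78, 0.92), (0.76, 0.99), (0.80, 0.8)]

stopped_at = None
for epoch, (acc, loss) in enumerate(history):
    if tracker.should_stop(acc, loss):
        stopped_at = epoch
        break

print(f"stopped at epoch index {stopped_at}")
```

Returning a boolean instead of raising an exception is a design choice of this sketch; the notebook signals the same condition with an exception that the training loop catches.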

🎉🎉🎉

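Now for the core part of the training - the main loop, as I've dubbed it.

I organized it this way so that I don't have to copy/paste a bunch of the same code for the training/validation/test loops.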
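
The idea of binding everything that stays constant across epochs once, and getting back a function that only takes what varies, can be sketched with a plain Python closure. The names below (`make_step_fn`, `toy_gat`) are made up for illustration and are not part of the notebook.

```python
def make_step_fn(model_name, learning_rate):
    """Bind arguments that never change; return a function that takes only what varies."""
    call_count = 0  # state shared across calls, kept alive by the closure

    def step(phase, epoch=0):
        nonlocal call_count
        call_count += 1
        return f"{model_name} | lr={learning_rate} | phase={phase} | epoch={epoch} | call #{call_count}"

    return step


step = make_step_fn("toy_gat", 5e-3)
first = step("TRAIN", epoch=0)
second = step("VAL", epoch=0)
print(first)
print(second)
```

The same function object is then reused for every phase, which is what keeps the training, validation and test code paths from diverging.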

# Simple wrapper function so that I don't have to pass the arguments that don't change from epoch to epoch
def get_main_loop(config, gat, cross_entropy_loss, optimizer, node_features, node_labels, edge_index, train_indices, val_indices, test_indices, patience_period, time_start):

    node_dim = 0  # this will likely change once an inductive example is added (Cora is transductive)

    train_labels = node_labels.index_select(node_dim, train_indices)
    val_labels = node_labels.index_select(node_dim, val_indices)
    test_labels = node_labels.index_select(node_dim, test_indices)

    # node_features shape = (N, FIN), edge_index shape = (2, E)
    graph_data = (node_features, edge_index)  # I pack the data into a tuple because GAT uses nn.Sequential, which requires it

    def get_node_indices(phase):
        if phase == LoopPhase.TRAIN:
            return train_indices
        elif phase == LoopPhase.VAL:
            return val_indices
        else:
            return test_indices

    def get_node_labels(phase):
        if phase == LoopPhase.TRAIN:
            return train_labels
        elif phase == LoopPhase.VAL:
            return val_labels
        else:
            return test_labels

    def main_loop(phase, epoch=0):
        global BEST_VAL_ACC, BEST_VAL_LOSS, PATIENCE_CNT, writer

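        # Certain modules behave differently depending on whether we're training the model or not.
        # e.g. nn.Dropout - we only want to randomly drop activations during training.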
        if phase == LoopPhase.TRAIN:
            gat.train()
        else:
            gat.eval()

        node_indices = get_node_indices(phase)
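        gt_node_labels = get_node_labels(phase)  # gt stands for ground truth

        # Do a forward pass and extract only the relevant node scores (train/val or test ones)
        # Note: [0] just extracts the node_features part of the data (index 1 contains the edge_index)
        # shape = (N, C) where N is the number of nodes in the split (train/val/test) and C is the number of classes
        nodes_unnormalized_scores = gat(graph_data)[0].index_select(node_dim, node_indices)

        # Example: let's take the output for a single node on Cora - it's a vector of size 7 and it contains unnormalized scores like:
        # V = [-1.393,  3.0765, -2.4445,  9.6219,  2.1658, -5.5243, -4.6247]
        # What PyTorch's cross entropy loss does is, for every such vector, it first applies a softmax, so V becomes:
        # [1.6421e-05, 1.4338e-03, 5.7378e-06, 0.99797, 5.7673e-04, 2.6376e-07, 6.4848e-07]
        # Secondly, whatever the correct class is (say it's 3), it takes the element at position 3, 0.99797 in this case,
        # and the loss will be -log(0.99797). It does this for every node and then takes the mean.
        # You can see that as the probability of the correct class gets close to 1 for most nodes, the loss goes to 0! <3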
        loss = cross_entropy_loss(nodes_unnormalized_scores, gt_node_labels)

        if phase == LoopPhase.TRAIN:
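            optimizer.zero_grad()  # clean the trainable weights' gradients in the computational graph (.grad fields)
            loss.backward()  # compute the gradients for every trainable weight in the computational graph
            optimizer.step()  # apply the gradients to the weights

        # Find the index of the largest (unnormalized) score for every node; that is the class prediction for that node.
        # Compare those to the ground truth labels and compute the fraction of correct predictions -> the accuracy metric.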
        class_predictions = torch.argmax(nodes_unnormalized_scores, dim=-1)
        accuracy = torch.sum(torch.eq(class_predictions, gt_node_labels).long()).item() / len(gt_node_labels)

        if phase == LoopPhase.TRAIN:
            # Log metrics
            if config['enable_tensorboard']:
                writer.add_scalar('training_loss', loss.item(), epoch)
                writer.add_scalar('training_acc', accuracy, epoch)

            # Save a model checkpoint
            if config['checkpoint_freq'] is not None and (epoch + 1) % config['checkpoint_freq'] == 0:
                ckpt_model_name = f"gat_ckpt_epoch_{epoch + 1}.pth"
                config['test_acc'] = -1
                torch.save(get_training_state(config, gat), os.path.join(CHECKPOINTS_PATH, ckpt_model_name))

        elif phase == LoopPhase.VAL:
            # Log metrics
            if config['enable_tensorboard']:
                writer.add_scalar('val_loss', loss.item(), epoch)
                writer.add_scalar('val_acc', accuracy, epoch)

            # Log to console
            if config['console_log_freq'] is not None and epoch % config['console_log_freq'] == 0:
                print(f'GAT training: time elapsed= {(time.time() - time_start):.2f} [s] | epoch={epoch + 1} | val acc={accuracy}')

            # The "patience" logic - should we break out from the training loop? As long as the validation accuracy
            # keeps improving or the validation loss keeps going down, we won't stop
            if accuracy > BEST_VAL_ACC or loss.item() < BEST_VAL_LOSS:
                BEST_VAL_ACC = max(accuracy, BEST_VAL_ACC)  # keep track of the best validation accuracy so far
                BEST_VAL_LOSS = min(loss.item(), BEST_VAL_LOSS)  # and of the best validation loss so far
                PATIENCE_CNT = 0  # reset the counter every time we see an improvement
            else:
                PATIENCE_CNT += 1  # otherwise keep counting

            if PATIENCE_CNT >= patience_period:
                # Message: "Stopping the training, the universe has no more patience for this training."
                raise Exception('停止训练，宇宙对这次训练没有更多的耐心了。')

        else:
            return accuracy  # in the test phase we just return the test accuracy

    return main_loop  # return the wrapped main loop function
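
The cross-entropy walk-through in the comments above can be checked by hand. The sketch below uses only the standard library (not PyTorch) and reuses the example vector `V` from the comments, with class 3 assumed to be the ground truth as in that example; it reproduces the softmax, the per-node loss and the argmax prediction that the accuracy metric is based on.

```python
import math


def softmax(scores):
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [-1.393, 3.0765, -2.4445, 9.6219, 2.1658, -5.5243, -4.6247]
true_class = 3  # assumed ground truth, as in the example in the comments

probs = softmax(scores)
loss = -math.log(probs[true_class])  # cross-entropy for this single node
predicted_class = max(range(len(scores)), key=scores.__getitem__)  # argmax of the raw scores

print(f"p(correct class) = {probs[true_class]:.5f}, loss = {loss:.5f}, prediction = {predicted_class}")
```

Since softmax is monotonic, taking the argmax over the unnormalized scores gives the same prediction as taking it over the probabilities, which is why the main loop never needs to apply softmax explicitly for the accuracy computation.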

Let's start the training!

# Train the graph attention network (GAT)
train_gat(get_training_args())


GAT training: time elapsed= 0.22 [s] | epoch=1 | val acc=0.124
GAT training: time elapsed= 5.56 [s] | epoch=101 | val acc=0.79
GAT training: time elapsed= 10.72 [s] | epoch=201 | val acc=0.8
GAT training: time elapsed= 15.92 [s] | epoch=301 | val acc=0.8
GAT training: time elapsed= 21.13 [s] | epoch=401 | val acc=0.782
GAT training: time elapsed= 26.34 [s] | epoch=501 | val acc=0.784
GAT training: time elapsed= 31.68 [s] | epoch=601 | val acc=0.816
GAT training: time elapsed= 36.79 [s] | epoch=701 | val acc=0.804
GAT training: time elapsed= 41.91 [s] | epoch=801 | val acc=0.806
GAT training: time elapsed= 47.01 [s] | epoch=901 | val acc=0.814
GAT training: time elapsed= 52.20 [s] | epoch=1001 | val acc=0.81
GAT training: time elapsed= 57.41 [s] | epoch=1101 | val acc=0.798
GAT training: time elapsed= 62.52 [s] | epoch=1201 | val acc=0.816
GAT training: time elapsed= 67.61 [s] | epoch=1301 | val acc=0.796
GAT training: time elapsed= 72.71 [s] | epoch=1401 | val acc=0.79
GAT training: time elapsed= 77.82 [s] | epoch=1501 | val acc=0.812
GAT training: time elapsed= 82.94 [s] | epoch=1601 | val acc=0.796
GAT training: time elapsed= 88.08 [s] | epoch=1701 | val acc=0.798
GAT training: time elapsed= 93.26 [s] | epoch=1801 | val acc=0.794
GAT training: time elapsed= 98.39 [s] | epoch=1901 | val acc=0.8
GAT training: time elapsed= 103.64 [s] | epoch=2001 | val acc=0.806
停止训练，宇宙对这次训练没有更多的耐心了。
Test accuracy = 0.817

Nice!!! 🎉🎉🎉 Level 4 unlocked (GAT visualization 🔮).

We just reached 81.7% accuracy on Cora's test nodes, which is close to the 83.0 ± 0.7% reported in the original GAT paper!

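So now we have everything in place:

Data loading and visualization 📜 -> check
GAT model definition 🦄 -> check
Training loop setup and the trained model binaries 💪 -> check

Now let's put the GAT model under a microscope 🔬 and understand the weights we got - there are several ways to do that.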

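Part 4: Visualizing GAT 🔮

Let's start by defining some functions we'll need.

The code in the following cell gets called multiple times, so we extract it into a function - a nice modular design.

Note: the main reason is actually that igraph has problems in Jupyter, so I'm working around that; check out the original code if you're curious 😂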

def gat_forward_pass(model_name, dataset_name):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # checking whether you have a GPU, I hope so!

    config = {
        'dataset_name': dataset_name,
        'should_visualize': False  # don't visualize the dataset
    }

    # Step 1: prepare the data
    node_features, node_labels, edge_index, _, _, _ = load_graph_data(config, device)

    # Step 2: prepare the model
    model_path = os.path.join(BINARIES_PATH, model_name)
    model_state = torch.load(model_path, map_location=torch.device('cpu'))

    gat = GAT(
        num_of_layers=model_state['num_of_layers'],
        num_heads_per_layer=model_state['num_heads_per_layer'],
        num_features_per_layer=model_state['num_features_per_layer'],
        add_skip_connection=model_state['add_skip_connection'],
        bias=model_state['bias'],
        dropout=model_state['dropout'],
        log_attention_weights=True
    ).to(device)

    print_model_metadata(model_state)
    gat.load_state_dict(model_state["state_dict"], strict=True)
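    gat.eval()  # some layers, like nn.Dropout, behave differently in train vs eval mode, so this part is important

    # Step 3: calculate everything we'll need for the different visualization types (attention, scores, edge_index)

    # This context manager is important (and you'll often see it); otherwise PyTorch would eat much more memory,
    # because it would save the activations needed for the backward pass, and we're only doing inference here.
    with torch.no_grad():
        # Run predictions and collect the high-dimensional data
        all_nodes_unnormalized_scores, _ = gat((node_features, edge_index))  # shape = (N, num of classes)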
        all_nodes_unnormalized_scores = all_nodes_unnormalized_scores.cpu().numpy()

    return all_nodes_unnormalized_scores, edge_index, node_labels, gat

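Nice, that function just produces the data that will be consumed by the downstream visualizations defined in the following cells.

We also need one more helper function, and here it is!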

# Draws (but doesn't show yet) entropy histograms. If you're confused as to why entropy suddenly pops up, bear with me, you'll see soon enough.
# Basically, it helps us quantify how useful the attention patterns learned by GAT are.
def draw_entropy_histogram(entropy_array, title, color='blue', uniform_distribution=False, num_bins=30):
    max_value = np.max(entropy_array)
    bar_width = (max_value / num_bins) * (1.0 if uniform_distribution else 0.75)
    histogram_values, histogram_bins = np.histogram(entropy_array, bins=num_bins, range=(0.0, max_value))

    plt.bar(histogram_bins[:num_bins], histogram_values[:num_bins], width=bar_width, color=color)
    plt.xlabel('entropy bins')
    plt.ylabel('# of node neighborhoods')
    plt.title(title)

Great, next up is the main function we'll use to visualize GAT's embeddings (via t-SNE) and the entropy histograms:

from sklearn.manifold import TSNE
from scipy.stats import entropy


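# Let's define an enum as a clean way to pick between the different visualization options
# (no trailing commas: with them every member value would silently become a 1-tuple)
class VisualizationType(enum.Enum):
    ATTENTION = 0
    EMBEDDINGS = 1
    ENTROPY = 2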


def visualize_gat_properties(model_name=r'gat_000000.pth', dataset_name=DatasetType.CORA.name, visualization_type=VisualizationType.ATTENTION):
    """
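    Pick between the visualization options: t-SNE or entropy histograms.

    Notes on t-SNE:
    Check out this link for more intuition on how to tune t-SNE: https://distill.pub/2016/misread-tsne/

    If you think it would be useful for me to implement t-SNE as well and explain every single detail, feel free to open an issue or DM me on social media! <3

    Note: I also tried using UMAP, but it didn't provide any more insight than t-SNE.
    (con: it has a lot of dependencies if you want to use its plotting functionality)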
    
    """
    
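    # Fetch the data we'll need to create the visualizations
    all_nodes_unnormalized_scores, edge_index, node_labels, gat = gat_forward_pass(model_name, dataset_name)
    
    # Perform the specific visualization (t-SNE or entropy histograms)
    if visualization_type == VisualizationType.EMBEDDINGS:  # visualize embeddings (using t-SNE)
        node_labels = node_labels.cpu().numpy()
        num_classes = len(set(node_labels))

        # Feel free to experiment with perplexity, it's arguably the most important t-SNE parameter. It basically controls
        # the standard deviation of the Gaussians in the high-dimensional (original) space, i.e. the size of the neighborhoods there.
        # Simply put, the goal of t-SNE is to minimize the KL divergence between the joint Gaussian distribution fit over the
        # high-dim points and the joint Student's t distribution fit over the low-dim points (the embeddings).
        # Intuitively, by doing this we preserve the similarities (relationships) between the high- and low-dim points.
        # This probably won't make much sense if you're not already familiar with t-SNE, but hey, I tried. :P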
        t_sne_embeddings = TSNE(n_components=2, perplexity=30, method='barnes_hut').fit_transform(all_nodes_unnormalized_scores)

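        fig = plt.figure(figsize=(12,8), dpi=80)  # otherwise plots are really small in Jupyter Notebook
        for class_id in range(num_classes):
            # We extract the points whose true label equals class_id and color them in the same way; hopefully
            # they'll be clustered together on the 2D chart - that would mean that GAT has learned good representations!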
            plt.scatter(t_sne_embeddings[node_labels == class_id, 0], t_sne_embeddings[node_labels == class_id, 1], s=20, color=cora_label_to_color_map[class_id], edgecolors='black', linewidths=0.2)
        plt.show()

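    # We want our local probability distributions (attention weights over the neighborhoods) to be
    # non-uniform because that means that GAT learned a useful pattern. Entropy histograms help us visualize
    # how different those neighborhood distributions are from the uniform distribution (constant attention).
    # If the GAT is learning constant attention we could just as well use GCN or some even simpler model.
    elif visualization_type == VisualizationType.ENTROPY:
        num_heads_per_layer = [layer.num_of_heads for layer in gat.gat_net]
        num_layers = len(num_heads_per_layer)

        num_of_nodes = len(node_labels)  # one label per node; node_features is not returned by gat_forward_pass
        target_node_ids = edge_index[1].cpu().numpy()

        # For every GAT layer and for every GAT attention head, plot the entropy histogram
        for layer_id in range(num_layers):
            # Fetch the attention weights for the edges (attention is logged during the GAT forward pass above)
            # attention shape = (E, NH, 1) -> (E, NH), one row per edge - we just squeeze the last dim, it's superfluous
            all_attention_weights = gat.gat_net[layer_id].attention_weights.squeeze(dim=-1).cpu().numpy()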

            for head_id in range(num_heads_per_layer[layer_id]):
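                uniform_dist_entropy_list = []  # save the ideal uniform histogram as a reference
                neighborhood_entropy_list = []

                # This could also be done more efficiently via scatter_add_ (no for loops)
                # pseudo: out.scatter_add_(node_dim, -all_attention_weights * log(all_attention_weights), target_index)
                for target_node_id in range(num_of_nodes):  # find the neighbors of every node in the graph
                    # These attention weights sum up to 1 by GAT design, so we can treat them as a probability distribution.
                    # Select only the current head; without the head index all heads would be mixed into one distribution.
                    neighborhood_attention = all_attention_weights[target_node_ids == target_node_id, head_id].flatten()
                    # Reference uniform distribution of the same length
                    ideal_uniform_attention = np.ones(len(neighborhood_attention))/len(neighborhood_attention)

                    # Calculate the entropy; check out this video if you're not familiar with the concept:
                    # https://www.youtube.com/watch?v=ErfnhcEV1O8 (Aurélien Géron)
                    neighborhood_entropy_list.append(entropy(neighborhood_attention, base=2))
                    uniform_dist_entropy_list.append(entropy(ideal_uniform_attention, base=2))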

                title = f'Cora entropy histogram layer={layer_id}, attention head={head_id}'
                draw_entropy_histogram(uniform_dist_entropy_list, title, color='orange', uniform_distribution=True)
                draw_entropy_histogram(neighborhood_entropy_list, title, color='dodgerblue')

                fig = plt.gcf()  # get a handle to the current figure
                plt.show()
                fig.savefig(os.path.join(DATA_DIR_PATH, f'layer_{layer_id}_head_{head_id}.jpg'))
                plt.close()
    else:
        raise Exception(f'Visualization type {visualization_type} not supported.')
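
To make the entropy comparison concrete, here is a minimal standard-library sketch (it uses neither SciPy nor the notebook's data). It accumulates `-p * log2(p)` per target node in a single pass over the edges, which is the same idea as the `scatter_add_` hint in the comments. The toy edge list and attention values are made up, and `neighborhood_entropies` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def neighborhood_entropies(target_node_ids, attention_weights):
    """Entropy (in bits) of the attention distribution over each node's incoming edges."""
    entropy_per_node = defaultdict(float)
    for target, weight in zip(target_node_ids, attention_weights):
        if weight > 0:  # the 0 * log(0) term is treated as 0
            entropy_per_node[target] -= weight * math.log2(weight)
    return dict(entropy_per_node)


# Toy graph: node 0 attends uniformly to its 4 neighbors,
# node 1 puts most of its attention on a single neighbor.
targets = [0, 0, 0, 0, 1, 1, 1, 1]
attention = [0.25, 0.25, 0.25, 0.25, 0.85, 0.05, 0.05, 0.05]

entropies = neighborhood_entropies(targets, attention)
print(entropies)
```

A uniform distribution over n neighbors has the maximum possible entropy, log2(n) bits, so the further a head's histogram sits to the left of the uniform reference, the more selective its attention is.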

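Nice! Finally, let's put it to use! First up: t-SNE.

Visualizing GAT's embeddings using t-SNE 📈

t-SNE belongs to a broad family of dimensionality reduction methods.

It gained massive traction in the community because it is simple to use and works well (and probably also because it was co-authored by Geoffrey Hinton, khm).

There are other, newer methods such as UMAP, but they haven't gained as much traction yet (as far as I know).

But enough theory, let's see some charts!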

model_name=r'gat_000000.pth'  # This model is checked-in, feel free to use the one you trained
dataset_name=DatasetType.CORA.name


visualize_gat_properties(
        model_name,
        dataset_name,
        visualization_type=VisualizationType.EMBEDDINGS  # pick between attention, t-SNE embeddings and entropy
)



***** 模型训练元数据: *****
commit_hash: 91fb864b8f9ddefd401bf5399cea779bd3c0a63b
dataset_name: CORA
num_of_epochs: 10000
test_acc: 0.822
num_of_layers: 2
num_heads_per_layer: [8, 1]
num_features_per_layer: [1433, 8, 7]
add_skip_connection: False
bias: True
dropout: 0.6
layer_type: IMP3
*********************



(Figure: 2D t-SNE scatter plot of the GAT output embeddings for Cora, one color per class.)

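Beautiful!

We can see the following - once we do a forward pass through the GAT, it transforms the input feature vectors of shape (number of nodes, number of features per feature vector) = (2708, 1433) into shape (2708, 7), because Cora has 7 classes.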
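
The shape change can be traced layer by layer from the configuration printed in the model metadata. This is a rough sketch under one stated assumption, namely that hidden layers concatenate their attention heads (as in the Cora setup of the GAT paper) while the final layer has a single head; it does not call the notebook's `GAT` class.

```python
num_nodes = 2708
num_heads_per_layer = [8, 1]
num_features_per_layer = [1433, 8, 7]

shapes = [(num_nodes, num_features_per_layer[0])]  # input node features
last_layer_id = len(num_heads_per_layer) - 1

for layer_id, (heads, out_features) in enumerate(zip(num_heads_per_layer, num_features_per_layer[1:])):
    # Assumption: hidden layers concatenate heads; the last layer has one head, so its width is just out_features
    width = out_features if layer_id == last_layer_id else heads * out_features
    shapes.append((num_nodes, width))

print(shapes)
```

Under that assumption the hidden representation is 8 heads x 8 features = 64 values per node, and the last layer maps those 64 values to the 7 class scores.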

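Those classes are Genetic Algorithms, Reinforcement Learning, and so on, just to make this less abstract, but ultimately it doesn't matter; the same applies to any set of 7 classes.

Now, once we have those 7-dimensional vectors, we use t-SNE to map them into 2D vectors (because, you know, it's hard to plot 7D vectors). The trick with t-SNE is that it preserves the relationships between the vectors, so, roughly speaking, if they were close in the 7D space (however we define "closeness") they will be close in the 2D space as well.

Now you can see that the points of the same class (they have the same color) are roughly clustered together! That's a desirable property, because it is now much easier to train a classifier that correctly predicts the class.

Awesome, now let's shift our attention to attention, since we are, after all, dealing with a Graph Attention Network.

Visualizing neighborhood attention 📣

So, by now you hopefully understand roughly how GAT works, and you know that during the aggregation stage every node assigns an attention coefficient to each of its neighbors (including itself, since we added self-loops).

Any ideas on what we could visualize? Well, let's pick some nodes and see which attention patterns they have learned!

The first idea that may pop into your mind is to draw edges thicker if the attention is larger and vice versa (well, that's also the last idea that pops into my mind).

Let's do it!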

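# Fetch the data we'll need to create the visualizations
all_nodes_unnormalized_scores, edge_index, node_labels, gat = gat_forward_pass(model_name, dataset_name)

# The number of nodes for which we want to visualize their attention over neighboring nodes
num_nodes_of_interest = 4  # 4 is an arbitrary number, feel free to play with it
head_to_visualize = 0  # plot attention from this attention head (the last layer only has a single head)
gat_layer_id = 1  # plot attention from this GAT layer (since our GAT only has 2 layers, this is the last layer)

# Build up the complete graph
# One label per node, so this is N; node_features is not returned by gat_forward_pass
total_num_of_nodes = len(node_labels)
complete_graph = ig.Graph()
complete_graph.add_vertices(total_num_of_nodes)  # igraph creates nodes with ids [0, total_num_of_nodes - 1]
edge_index_tuples = list(zip(edge_index[0].tolist(), edge_index[1].tolist()))  # igraph expects (source, target) tuples of Python ints
complete_graph.add_edges(edge_index_tuples)

# Pick the target nodes to plot (nodes with the highest degree + random nodes)
# Note: there could be an overlap between the random nodes and the highest degree nodes - but that's highly unlikely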
highest_degree_node_ids = np.argpartition(complete_graph.degree(), -num_nodes_of_interest)[-num_nodes_of_interest:]
random_node_ids = np.random.randint(low=0, high=total_num_of_nodes, size=num_nodes_of_interest)

print(f'Highest degree nodes = {highest_degree_node_ids}')

target_node_ids = edge_index[1]
source_nodes = edge_index[0]

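#
# Pick the node id you want to visualize the attention for!
#

# Since for loops don't work with igraph in Jupyter, just set some number here
target_node_id = 306  # 306 is the node with the 2nd highest degree

# Step 1: find the neighboring nodes of the target node
# Note: self-loops are included for CORA, so the target node is its own neighbor (Alexandro, yo soy tu madre)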
src_nodes_indices = torch.eq(target_node_ids, target_node_id)
source_node_ids = source_nodes[src_nodes_indices].cpu().numpy()
size_of_neighborhood = len(source_node_ids)

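# Step 2: fetch their labels
labels = node_labels[source_node_ids].cpu().numpy()

# Step 3: fetch the attention weights for the edges (attention is logged during the GAT forward pass above)
# attention shape = (E, NH, 1) -> (E, NH), one row per edge - we just squeeze the last dim, it's superfluous
all_attention_weights = gat.gat_net[gat_layer_id].attention_weights.squeeze(dim=-1)
attention_weights = all_attention_weights[src_nodes_indices, head_to_visualize].cpu().numpy()
# This part shows that for CORA the attention weights GAT learned are almost constant! Just like in GCN!
print(f'Max attention weight = {np.max(attention_weights)} and min = {np.min(attention_weights)}')
attention_weights /= np.max(attention_weights)  # rescale so that the largest weight is 1, for nicer plotting

# Build up the neighborhood graph whose attention we want to visualize
# igraph constraint - it works with a contiguous range of ids, so we map e.g. node 497 to 0, 12 to 1, etc.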
id_to_igraph_id = dict(zip(source_node_ids, range(len(source_node_ids))))
ig_graph = ig.Graph()
ig_graph.add_vertices(size_of_neighborhood)
ig_graph.add_edges([(id_to_igraph_id[neighbor], id_to_igraph_id[target_node_id]) for neighbor in source_node_ids])

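# Prepare the visualization settings dictionary and plot
visual_style = {
    "edge_width": attention_weights,  # make edges as thick as the corresponding attention weight
    "layout": ig_graph.layout_reingold_tilford_circular()  # layout for tree-like graphs
}
# This is the only part that's Cora specific, as Cora has 7 labels
if dataset_name.lower() == DatasetType.CORA.name.lower():
    visual_style["vertex_color"] = [cora_label_to_color_map[label] for label in labels]
else:
    print('Feel free to add a custom color scheme for your specific dataset. Using igraph default coloring.')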

ig.plot(ig_graph, **visual_style)

***** 模型训练元数据: *****
commit_hash: 91fb864b8f9ddefd401bf5399cea779bd3c0a63b
dataset_name: CORA
num_of_epochs: 10000
test_acc: 0.822
num_of_layers: 2
num_heads_per_layer: [8, 1]
num_features_per_layer: [1433, 8, 7]
add_skip_connection: False
bias: True
dropout: 0.6
layer_type: IMP3
*********************

Highest degree nodes = [1986 1701  306 1358]
Max attention weight = 0.012915871106088161 and min = 0.012394174002110958
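
A quick back-of-the-envelope check on those two printed numbers shows how close to uniform this attention is. The arithmetic below assumes only that the weights over a neighborhood sum to 1 (as stated in the entropy code comments); the variable names are illustrative.

```python
max_weight = 0.012915871106088161
min_weight = 0.012394174002110958

# Ratio between the strongest and the weakest edge in the neighborhood
spread = max_weight / min_weight

# The weights sum to 1 over n edges, so min <= 1/n <= max, which bounds n from both sides
smallest_neighborhood = 1.0 / max_weight
largest_neighborhood = 1.0 / min_weight

print(f"max/min ratio = {spread:.3f}")
print(f"neighborhood size between {smallest_neighborhood:.1f} and {largest_neighborhood:.1f}")
```

The strongest edge gets only about 4% more attention than the weakest one, so after dividing by the maximum all edge widths fall in a narrow band just below 1 and the plotted edges look nearly identical.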


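(Figure: attention visualization for the neighborhood of node 306; edge thickness is proportional to the attention weight.)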
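
The neighborhood lookup and the remapping to contiguous igraph ids used in the cell above can be sketched without PyTorch or igraph. The tiny edge list below is made up for illustration (only the node ids 306, 497 and 12 echo the comments); it is not Cora data.

```python
# Toy edge list in (source, target) form, including a self-loop on node 306
edges = [(497, 306), (12, 306), (306, 306), (12, 497), (306, 12)]
target_node_id = 306

# Step 1: the neighborhood is every source node with an edge pointing at the target
source_node_ids = [src for src, trg in edges if trg == target_node_id]

# Step 2: map the original ids to a contiguous range 0..k-1
id_to_local_id = {node_id: local_id for local_id, node_id in enumerate(source_node_ids)}

# Step 3: rebuild the edges of the neighborhood graph using the local ids
local_edges = [(id_to_local_id[src], id_to_local_id[target_node_id]) for src in source_node_ids]

print(source_node_ids, id_to_local_id, local_edges)
```

Note that looking up `target_node_id` in the mapping only works because the self-loop makes the target one of its own source nodes; on a graph without self-loops the target would have to be added to the mapping explicitly.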