ASTGCN: Mathematical Principles and Code Walkthrough (Network Model Part)

Akatsuky.

Last modified: 2024-10-02 11:17:46

First published: 2024-09-29 00:57:37

Article link: https://blog.csdn.net/m0_62483049/article/details/142578747

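This is an introductory tutorial for understanding the ASTGCN model. It explains, from the code's point of view, how the spatial and temporal attention mechanisms are implemented, and walks through the model architecture and the convolution operations. The mathematical parts are not guaranteed to be fully rigorous; if you spot a mistake, please point it out in the comments.

To use the model in practice you also need to understand its data preprocessing. Another author on this site has already explained that part in detail, so you can study it there:

[ASTGCN] Model debugging study notes: data generation explained (very detailed): https://blog.csdn.net/Yukee_/article/details/140325814?utm_medium=distribute.pc_relevant.none-task-blog-2~default~baidujs_baidulandingword~default-8-140325814-blog-124980394.235%5Ev43%5Epc_blog_bottom_relevance_base7&spm=1001.2101.3001.4242.5&utm_relevant_index=11

To better follow the convolution process and the matrix background used in the model, see also:

An introduction to GCN and its mathematical principles: https://blog.csdn.net/m0_62483049/article/details/142467296?spm=1001.2014.3001.5502

That post is a basic guide to Graph Convolutional Networks (GCN). It covers the basic notion of convolution, eigenvalues and eigenvectors, proofs of the key properties of the Laplacian matrix, and how the GCN convolution is carried out. It contains no PyTorch code and explains "graph convolution" purely from the mathematical side.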

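1 Model Overview and Code Summary

ASTGCN stands for Attention Based Spatial-Temporal Graph Convolutional Networks. The model comes from the 2019 paper "Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting" (Proceedings of the AAAI Conference on Artificial Intelligence, https://ojs.aaai.org/index.php/AAAI/article/view/3881). ASTGCN is one of the classic models in the field of graph convolutional networks.

The highlight of the algorithm is its spatial-temporal attention mechanism: put simply, it captures the hidden temporal and spatial information in graph-structured data more effectively. This post tries to explain the code of the ASTGCN network model and the underlying mathematics in plain language.

The complete network model consists of the following functions and classes: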

def scaled_Laplacian(W):
    
def cheb_polynomial(L_tilde, K):
    
class Spatial_Attention_layer(nn.Module):

class cheb_conv_withSAt(nn.Module):

class Temporal_Attention_layer(nn.Module):
   
class cheb_conv(nn.Module):   

class ASTGCN_block(nn.Module):
    
class ASTGCN_submodule(nn.Module):

def ASTGCN(DEVICE, nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, adj_mx, num_for_predict, len_input, num_of_vertices):

2 Functions and Classes in Detail

2.1 scaled_Laplacian

import numpy as np
from scipy.sparse.linalg import eigs

def scaled_Laplacian(W):

    assert W.shape[0] == W.shape[1]

    D = np.diag(np.sum(W, axis=1))

    L = D - W

    lambda_max = eigs(L, k=1, which='LR')[0].real

    return (2 * L) / lambda_max - np.identity(W.shape[0])

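The function takes the adjacency matrix $W$ of a graph and returns the scaled Laplacian $\hat{L}=\frac{2L}{\lambda_{\max}}-I$, where $L=D-W$ is the graph Laplacian, $D$ is the degree matrix, and $\lambda_{\max}$ is the largest eigenvalue of $L$. Take the following 6-node graph as an example.

Adjacency matrix $W$: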

$W=\begin{bmatrix}0&1&1&1&1&1\\1&0&1&1&0&1\\1&1&0&0&0&1\\1&1&0&0&0&0\\1&0&0&0&0&1\\1&1&1&0&1&0\end{bmatrix}$

Degree matrix $D$:

$D=\begin{bmatrix}5&0&0&0&0&0\\0&4&0&0&0&0\\0&0&3&0&0&0\\0&0&0&2&0&0\\0&0&0&0&2&0\\0&0&0&0&0&4\end{bmatrix}$

Laplacian matrix $L$:

$L=D-W=\begin{bmatrix}5&-1&-1&-1&-1&-1\\-1&4&-1&-1&0&-1\\-1&-1&3&0&0&-1\\-1&-1&0&2&0&0\\-1&0&0&0&2&-1\\-1&-1&-1&0&-1&4\end{bmatrix}$

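As proved in the earlier post, the Laplacian matrix is positive semidefinite, so its eigenvalues lie in $[0,\lambda_{\max}]$. By the definition $\hat{L}=\frac{2L}{\lambda_{\mathrm{max}}}-I$, the scaled Laplacian is an affine transformation of $L$: every eigenvalue $\lambda$ of $L$ is mapped to $\frac{2\lambda}{\lambda_{\max}}-1$. The eigenvalues of the scaled Laplacian therefore lie in $[0-1,\frac{2\lambda_{\max}}{\lambda_{\max}}-1]$, i.e. $[-1,1]$.

Why insist on the range $[-1,1]$? Because the Chebyshev polynomials introduced next are used for approximation on exactly this interval.

We now have the scaled Laplacian of the graph, scaled_Laplacian(W).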
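
The eigenvalue range can be checked numerically on the 6-node example. The following is a minimal sketch, not the repository code: the helper name scaled_laplacian_dense is hypothetical, and it assumes a symmetric adjacency matrix so that np.linalg.eigvalsh can replace the sparse eigs call.

```python
import numpy as np

def scaled_laplacian_dense(W):
    """Return 2L/lambda_max - I for a symmetric adjacency matrix W (hypothetical helper)."""
    W = np.asarray(W, dtype=float)
    assert W.shape[0] == W.shape[1]
    D = np.diag(W.sum(axis=1))                 # degree matrix
    L = D - W                                  # graph Laplacian
    lambda_max = np.linalg.eigvalsh(L).max()   # eigvalsh assumes L is symmetric
    return 2.0 * L / lambda_max - np.identity(W.shape[0])

# Adjacency matrix of the 6-node example above
W = np.array([
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 1, 0],
])

L_hat = scaled_laplacian_dense(W)
eigenvalues = np.linalg.eigvalsh(L_hat)
print(np.round(eigenvalues, 4))  # every value lies in [-1, 1]
```

Because the example graph is connected, $L$ has the eigenvalue 0, so the smallest eigenvalue of $\hat{L}$ is exactly $-1$ and the largest is exactly $1$ (up to floating-point error).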

2.2 cheb_polynomial

def cheb_polynomial(L_tilde, K):

    N = L_tilde.shape[0]  # size N of the (N, N) matrix

    cheb_polynomials = [np.identity(N), L_tilde.copy()]  # initialize T_0 and T_1

    for i in range(2, K):
        # compute T_i with the recurrence relation
        cheb_polynomials.append(2 * L_tilde * cheb_polynomials[i - 1] - cheb_polynomials[i - 2])

    return cheb_polynomials

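This function computes the Chebyshev polynomials $T_0$ through $T_{K-1}$, so $K$ is the number of terms and the highest order is $K-1$. Approximating the spectral filter with a polynomial is a key idea in graph convolution: it avoids the cost of an explicit eigendecomposition of the Laplacian and makes the computation more efficient.

The function takes L_tilde and K. L_tilde is the scaled Laplacian returned by scaled_Laplacian(W), and $K$ is the number of polynomial terms.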

N = L_tilde.shape[0]

reads the size $N$ of the matrix returned by scaled_Laplacian(W).

cheb_polynomials = [np.identity(N), L_tilde.copy()]

initializes the list of Chebyshev polynomials, which are defined by the recurrence:

$T_0(x)=1,T_1(x)=x,T_n(x)=2x T_{n-1}(x)-T_{n-2}(x)$

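np.identity(N) creates an identity matrix of the same size as scaled_Laplacian(W). It corresponds to the zeroth term $T_{0}(x)=1$, which becomes $T_{0}(\hat{L})=I$ in matrix form.

L_tilde.copy() creates a copy of scaled_Laplacian(W). It corresponds to the first term $T_{1}(x)=x$, i.e. $T_{1}(\hat{L})=\hat{L}$.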

for i in range(2, K):
        # compute T_i with the recurrence relation
        cheb_polynomials.append(2 * L_tilde * cheb_polynomials[i - 1] - cheb_polynomials[i - 2])

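The for loop applies the recurrence starting from $i=2$:

$T_i=2\hat{L}T_{i-1}-T_{i-2}$

In this recurrence $\hat{L}T_{i-1}$ is a matrix product. Note that L_tilde and the list entries are plain NumPy arrays, and the * operator on NumPy arrays multiplies element by element, so the code as written does not compute a matrix product. Writing the product as L_tilde @ cheb_polynomials[i - 1] (or with np.matmul) would match the formula. The difference only matters for $K\ge 3$, because $T_0$ and $T_1$ do not go through the loop.

cheb_polynomial returns a Python list containing the Chebyshev polynomial matrices $T_0$ through $T_{K-1}$; each one is an $N \times N$ NumPy array.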
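
To make the recurrence concrete, here is a small sketch that builds $T_2$ with the matrix product and compares it against a spectral reference (applying $T_2(x)=2x^2-1$ to the eigenvalues). The function name cheb_polynomial_matmul and the 3-node path graph are assumptions made for illustration only; the element-wise variant is included to show that it gives a different matrix.

```python
import numpy as np

def cheb_polynomial_matmul(L_tilde, K):
    """Chebyshev recurrence using the matrix product (@); illustrative variant."""
    N = L_tilde.shape[0]
    polys = [np.identity(N), L_tilde.copy()]
    for i in range(2, K):
        polys.append(2 * L_tilde @ polys[i - 1] - polys[i - 2])
    return polys

# Scaled Laplacian of a 3-node path graph (the eigenvalues of L are 0, 1, 3)
L = np.array([[1., -1., 0.], [-1., 2., -1.], [0., -1., 1.]])
L_tilde = 2 * L / 3 - np.identity(3)

T = cheb_polynomial_matmul(L_tilde, K=3)
T2_elementwise = 2 * L_tilde * L_tilde - np.identity(3)  # what `*` computes

# Reference: apply T_2(x) = 2x^2 - 1 to the eigenvalues of L_tilde
lam, U = np.linalg.eigh(L_tilde)
T2_spectral = U @ np.diag(2 * lam**2 - 1) @ U.T

print(np.allclose(T[2], T2_spectral))            # True
print(np.allclose(T2_elementwise, T2_spectral))  # False
```

The element-wise version keeps a zero wherever $\hat{L}$ has a zero, so it cannot reach two-hop neighbours, while the matrix-product version does.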

2.3 Spatial_Attention_layer

import torch
import torch.nn as nn
import torch.nn.functional as F

class Spatial_Attention_layer(nn.Module):
    '''
    compute spatial attention scores
    '''
    def __init__(self, DEVICE, in_channels, num_of_vertices, num_of_timesteps):
        super(Spatial_Attention_layer, self).__init__()
        self.W1 = nn.Parameter(torch.FloatTensor(num_of_timesteps).to(DEVICE))
        self.W2 = nn.Parameter(torch.FloatTensor(in_channels, num_of_timesteps).to(DEVICE))
        self.W3 = nn.Parameter(torch.FloatTensor(in_channels).to(DEVICE))
        self.bs = nn.Parameter(torch.FloatTensor(1, num_of_vertices, num_of_vertices).to(DEVICE))
        self.Vs = nn.Parameter(torch.FloatTensor(num_of_vertices, num_of_vertices).to(DEVICE))


    def forward(self, x):
        '''
        :param x: (batch_size, N, F_in, T)
        :return: (B,N,N)
        '''

        lhs = torch.matmul(torch.matmul(x, self.W1), self.W2)  # (b,N,F,T)(T)->(b,N,F)(F,T)->(b,N,T)

        rhs = torch.matmul(self.W3, x).transpose(-1, -2)  # (F)(b,N,F,T)->(b,N,T)->(b,T,N)

        product = torch.matmul(lhs, rhs)  # (b,N,T)(b,T,N) -> (B, N, N)

        S = torch.matmul(self.Vs, torch.sigmoid(product + self.bs))  # (N,N)(B, N, N)->(B,N,N)

        S_normalized = F.softmax(S, dim=1)

        return S_normalized

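This code implements a spatial attention layer, which computes spatial attention scores for the input.

$W_1,W_2,W_3$ and $b_s,V_s$ are learnable parameters: $W_1$ and $W_3$ are vectors, $W_2$ and $V_s$ are matrices, and $b_s$ is a 3-D tensor whose leading dimension is 1. The input has shape (batch_size, N, F_in, T): batch size, number of nodes, number of input features, and number of time steps.

Let us walk through an example with the following sizes: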

$\begin{aligned}&N=3\\&F_{in}=2\\&T=4\end{aligned}$

We assume a single batch: a simple graph with 3 nodes, 2 features per node, and 4 time steps.

 self.W1 = nn.Parameter(torch.FloatTensor(num_of_timesteps).to(DEVICE))

$W_1=[w_1,w_2,w_3,w_4]^T$

The shape of $W_{1}$ depends on num_of_timesteps, i.e. the sequence length. The code defines $W_{1}$ as a 1-D tensor (W1.shape = torch.Size([4])), with one entry per time step.

self.W2 = nn.Parameter(torch.FloatTensor(in_channels, num_of_timesteps).to(DEVICE))

$W_{2}=\begin{bmatrix}v_{11}&v_{12}&v_{13}&v_{14}\\v_{21}&v_{22}&v_{23}&v_{24}\end{bmatrix}$

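The shape of $W_{2}$ depends on in_channels and num_of_timesteps, i.e. the number of input features and the sequence length. The code defines $W_{2}$ as a 2-D tensor (W2.shape = torch.Size([2, 4])). The four columns correspond to the four time steps, and the two entries in each column correspond to the two features.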

self.W3 = nn.Parameter(torch.FloatTensor(in_channels).to(DEVICE))

$W_3=\begin{bmatrix}w_{31},w_{32}\end{bmatrix}^T$

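The shape of $W_{3}$ depends on in_channels, i.e. the number of input features. The code defines $W_{3}$ as a 1-D tensor (W3.shape = torch.Size([2])), with one entry per feature.

Now for the computation itself, to see what the spatial attention mechanism does. For data with $N=3,F_{in}=2,T=4$ the input looks roughly as follows: each node holds two features recorded at four time steps.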

$\mathbf{x}=\begin{bmatrix}\begin{bmatrix}x_{1,1,1,1}&x_{1,1,1,2}&x_{1,1,1,3}&x_{1,1,1,4}\\x_{1,1,2,1}&x_{1,1,2,2}&x_{1,1,2,3}&x_{1,1,2,4}\end{bmatrix}\\\begin{bmatrix}x_{1,2,1,1}&x_{1,2,1,2}&x_{1,2,1,3}&x_{1,2,1,4}\\x_{1,2,2,1}&x_{1,2,2,2}&x_{1,2,2,3}&x_{1,2,2,4}\end{bmatrix}\\\begin{bmatrix}x_{1,3,1,1}&x_{1,3,1,2}&x_{1,3,1,3}&x_{1,3,1,4}\\x_{1,3,2,1}&x_{1,3,2,2}&x_{1,3,2,3}&x_{1,3,2,4}\end{bmatrix}\end{bmatrix}$

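The code first multiplies the input by the weights $W_1$ and $W_2$. Multiplying by $W_1$ sums out the time dimension, so the intermediate result summarizes each node's features over time.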

lhs = torch.matmul(torch.matmul(x, self.W1), self.W2)

$lhs=\mathbf{x}\cdot W_{1}\cdot W_{2}$

$\mathbf{x}\cdot W_1=lhs_{temp},\quad lhs_{temp}[i,j]=\sum_{t=1}^Tx_{1,i,j,t}\cdot w_t\quad(i=1,2,3;\ j=1,2)$

After this step $\mathbf{x}\cdot W_1$ has shape (B,N,F). Written out as a matrix:

$\mathbf{x}\cdot W_{1}=lhs_{temp}=\begin{bmatrix}\sum^{T=4}_{t=1}(x_{1,1,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,1,2,t}\times w_{t})\\\sum^{T=4}_{t=1}(x_{1,2,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,2,2,t}\times w_{t})\\\sum^{T=4}_{t=1}(x_{1,3,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,3,2,t}\times w_{t})\end{bmatrix}$

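This step aggregates the temporal information: the values at the four time steps are combined in a weighted sum, with $W_1$ acting as the weight vector. During training, important time steps may receive larger weights.

Since $lhs=lhs_{temp}\cdot W_{2}$, we have:

$lhs[i,t]=\sum_{f=1}^Flhs_{temp}[i,f]\cdot v_{f,t}\quad(i=1,2,3;\ t=1,2,3,4)$

The resulting $lhs$ has shape (B,N,T). Written out as a matrix: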

$lhs=\begin{bmatrix}\sum^{T=4}_{t=1}(x_{1,1,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,1,2,t}\times w_{t})\\\sum^{T=4}_{t=1}(x_{1,2,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,2,2,t}\times w_{t})\\\sum^{T=4}_{t=1}(x_{1,3,1,t}\times w_{t})&\sum^{T=4}_{t=1}(x_{1,3,2,t}\times w_{t})\end{bmatrix}\begin{bmatrix}v_{11}&v_{12}&v_{13}&v_{14}\\v_{21}&v_{22}&v_{23}&v_{24}\end{bmatrix}$

$lhs=\begin{bmatrix} X_{1,1} &X_{1,2} &X_{1,3} &X_{1,4} \\ X_{2,1} &X_{2,2} &X_{2,3} &X_{2,4} \\ X_{3,1} &X_{3,2} &X_{3,3} &X_{3,4} \end{bmatrix}$

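$X_{m,n}=\sum^{F}_{f=1}\left(\sum^{T=4}_{t=1}(x_{1,m,f,t}\times w_{t})\times v_{f,n}\right)$

These expressions look complicated, but they are just ordinary matrix multiplications with many terms. Ignoring the batch dimension $B$, the resulting $lhs$ is a $(3\times 4)$ matrix: 3 is still the number of nodes and 4 is still the number of time steps.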

rhs = torch.matmul(self.W3, x).transpose(-1, -2)

$rhs=(W_{3}\cdot \mathbf{x})^T$

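The resulting $rhs$ has shape (B,T,N): row $t$ corresponds to a time step and column $n$ to a node. Written out as a matrix:

$rhs=\begin{bmatrix} \sum^{2}_{i=1}x_{1,1,i,1}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,2,i,1}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,3,i,1}\cdot w_{3,i} \\ \sum^{2}_{i=1}x_{1,1,i,2}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,2,i,2}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,3,i,2}\cdot w_{3,i} \\ \sum^{2}_{i=1}x_{1,1,i,3}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,2,i,3}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,3,i,3}\cdot w_{3,i} \\ \sum^{2}_{i=1}x_{1,1,i,4}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,2,i,4}\cdot w_{3,i}&\sum^{2}_{i=1}x_{1,3,i,4}\cdot w_{3,i} \end{bmatrix}$

This step aggregates the feature information: the two features are combined in a weighted sum, with $W_3$ acting as the weight vector. During training, important features may receive larger weights.

Attention mechanisms are built on the idea of "focusing": the model assigns different weights according to the importance of its inputs. By computing the similarity (or correlation) between inputs, the model can "attend" to the more important ones.

Left-hand side (lhs): multiplies the input by $W_1$ and $W_2$, summing out first the time dimension and then the feature dimension, which gives every node a length-$T$ representation.

Right-hand side (rhs): takes a weighted sum over the input features and transposes the result so that its shape matches the left-hand side. This sets up the comparison between every node and every other node.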

product = torch.matmul(lhs, rhs)

product = torch.matmul(lhs, rhs) computes the raw (unnormalized) attention score for every pair of nodes.

$Product={lhs}(N,T)\cdot{rhs}(T,N)$

The result is an $(N\times N)$ matrix.

self.bs = nn.Parameter(torch.FloatTensor(1, num_of_vertices, num_of_vertices).to(DEVICE))
self.Vs = nn.Parameter(torch.FloatTensor(num_of_vertices, num_of_vertices).to(DEVICE))

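By definition, the shape of $V_{s}$ depends on num_of_vertices, i.e. the number of nodes. The code defines $V_{s}$ as a 2-D tensor (Vs.shape = torch.Size([3, 3])).

The shape of $b_s$ also depends on num_of_vertices. The code defines $b_s$ as a 3-D tensor (bs.shape = torch.Size([1, 3, 3])).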

S = torch.matmul(self.Vs, torch.sigmoid(product + self.bs))
S_normalized = F.softmax(S, dim=1)

$S={V_s}\cdot\sigma(\mathrm{product}+{b_s})$

$S_{\text{normalized}}=\text{softmax}(S,\text{dim}=1)$

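$V_{s}$: weight matrix that further mixes the relations between nodes.
$b_s$: bias tensor that shifts the baseline of the scores.
$\sigma$: the sigmoid function, which squashes the values into (0, 1).
softmax: normalizes along dim=1 (a node dimension), so that for each sample every column of $S$ sums to 1.

The attention mechanism dynamically adjusts what the model focuses on by computing correlations between nodes. The matrix products combine the node representations into pairwise interaction scores. The learnable parameters $(W_{1},W_{2},W_{3},V_{s},b_{s})$ are fitted during training. $V_s$ captures interactions between nodes and $b_{s}$ shifts the attention baseline, while sigmoid and softmax supply the non-linearity and the normalization.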
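
The shape bookkeeping above can be replayed without PyTorch. The sketch below is a NumPy re-implementation of the forward pass for illustration only (random stand-in parameters, hypothetical helper names sigmoid and softmax), using the example sizes $N=3,F_{in}=2,T=4$. It also writes the raw score as a single index expression.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_in, T = 1, 3, 2, 4

x = rng.normal(size=(B, N, F_in, T))
W1 = rng.normal(size=T)            # stand-ins for the learnable parameters
W2 = rng.normal(size=(F_in, T))
W3 = rng.normal(size=F_in)
bs = rng.normal(size=(1, N, N))
Vs = rng.normal(size=(N, N))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

lhs = np.matmul(np.matmul(x, W1), W2)         # (B,N,F,T)(T)->(B,N,F)(F,T)->(B,N,T)
rhs = np.swapaxes(np.matmul(W3, x), -1, -2)   # (F)(B,N,F,T)->(B,N,T)->(B,T,N)
product = np.matmul(lhs, rhs)                 # (B,N,N)
S = np.matmul(Vs, sigmoid(product + bs))      # (N,N)(B,N,N)->(B,N,N)
S_normalized = softmax(S, axis=1)

# The same raw score written index by index
product_einsum = np.einsum('bift,t,fs,g,bjgs->bij', x, W1, W2, W3, x)
print(S_normalized.shape, S_normalized.sum(axis=1))
```

Summing S_normalized over axis 1 gives ones, which confirms that the normalization runs over the first node index (each column sums to 1).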

2.4 cheb_conv_withSAt

class cheb_conv_withSAt(nn.Module):
    '''
    K-order chebyshev graph convolution
    '''

    def __init__(self, K, cheb_polynomials, in_channels, out_channels):
        '''
        :param K: int
        :param in_channles: int, num of channels in the input sequence
        :param out_channels: int, num of channels in the output sequence
        '''
        super(cheb_conv_withSAt, self).__init__()
        self.K = K
        self.cheb_polynomials = cheb_polynomials
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.DEVICE = cheb_polynomials[0].device
        self.Theta = nn.ParameterList([nn.Parameter(torch.FloatTensor(in_channels, out_channels).to(self.DEVICE)) for _ in range(K)])

    def forward(self, x, spatial_attention):
        '''
        Chebyshev graph convolution operation
        :param x: (batch_size, N, F_in, T)
        :return: (batch_size, N, F_out, T)
        '''

        batch_size, num_of_vertices, in_channels, num_of_timesteps = x.shape

        outputs = []

        for time_step in range(num_of_timesteps):

            graph_signal = x[:, :, :, time_step]  # (b, N, F_in)

            output = torch.zeros(batch_size, num_of_vertices, self.out_channels).to(self.DEVICE)  # (b, N, F_out)

            for k in range(self.K):

                T_k = self.cheb_polynomials[k]  # (N,N)

                T_k_with_at = T_k.mul(spatial_attention)   # (N,N)*(b,N,N) = (b,N,N); spatial_attention is softmax-normalized along dim=1, so each of its columns sums to 1

                theta_k = self.Theta[k]  # (in_channel, out_channel)

                rhs = T_k_with_at.permute(0, 2, 1).matmul(graph_signal)  # (b,N,N)(b,N,F_in) = (b,N,F_in); the transpose moves the normalization from columns to rows before the left multiplication

                output = output + rhs.matmul(theta_k)  # (b, N, F_in)(F_in, F_out) = (b, N, F_out)

            outputs.append(output.unsqueeze(-1))  # (b, N, F_out, 1)

        return F.relu(torch.cat(outputs, dim=-1))  # (b, N, F_out, T)

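This code implements an order-K Chebyshev graph convolution with spatial attention, applied to time-series data on a graph.

The constructor takes four arguments:

K: the number of Chebyshev polynomial terms.
cheb_polynomials: the precomputed list of Chebyshev polynomials (the list returned by cheb_polynomial).
in_channels: the number of input channels (input features).
out_channels: the number of output channels (output features).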

 self.Theta = nn.ParameterList([nn.Parameter(torch.FloatTensor(in_channels, out_channels).to(self.DEVICE)) for _ in range(K)])

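self.Theta is a parameter list: every Chebyshev polynomial has its own weight matrix of shape (in_channels, out_channels). in_channels is the number of input channels/features and out_channels is the number of output channels/features you want; both are chosen to suit the data and the task. For example, with (in_channels=1, out_channels=8) each weight matrix is a $1\times 8$ row:

$\Theta_k=\begin{bmatrix}\theta_1&\theta_2&\dots&\theta_8\end{bmatrix}$

In the forward pass, the inputs are the data of shape (batch_size, N, F_in, T) and the attention score matrix of shape (batch_size, N, N) computed by Spatial_Attention_layer.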

for time_step in range(num_of_timesteps):
    graph_signal = x[:, :, :, time_step]  # (b, N, F_in)
    output = torch.zeros(batch_size, num_of_vertices, self.out_channels).to(self.DEVICE)  # (b, N, F_out)

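The code contains two nested loops. In the outer loop, graph_signal slices the data by time step. If, say, T=12, the input is split into 12 graphs and the computation is carried out on each graph separately.
output creates a 3-D tensor of shape (batch_size, N, F_out), i.e. one $(N\times F_{out})$ matrix per sample.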

for k in range(self.K):
    T_k = self.cheb_polynomials[k]  # (N,N)
    T_k_with_at = T_k.mul(spatial_attention)  # (N,N)*(b,N,N) = (b,N,N); spatial_attention is softmax-normalized along dim=1, so each of its columns sums to 1
    theta_k = self.Theta[k]  # (in_channel, out_channel)
    rhs = T_k_with_at.permute(0, 2, 1).matmul(graph_signal)  # (b,N,N)(b,N,F_in) = (b,N,F_in); the transpose moves the normalization from columns to rows before the left multiplication
    output = output + rhs.matmul(theta_k)  # (b, N, F_in)(F_in, F_out) = (b, N, F_out)

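The inner loop weights every Chebyshev polynomial with the spatial attention. Because the data has already been split into 12 graphs, the operation is carried out on each graph separately, so there is no time dimension in this part. Again, a concrete example makes the process clearer.

Suppose the graph has three nodes and two input features, and we want eight output features. For each sample: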

$B=1,N=3,F_{in}=2,F_{out}=8$

The input feature matrix is:

$\text{graph\_signal}=H^{(0)}=\begin{bmatrix}\begin{bmatrix}x_{11}&x_{12}\\x_{21}&x_{22}\\x_{31}&x_{32}\end{bmatrix}\end{bmatrix}$

The spatial attention score matrix is:

$\text{spatial attention}=A_{spatial}=\begin{bmatrix}s_{11}&s_{12}&s_{13}\\s_{21}&s_{22}&s_{23}\\s_{31}&s_{32}&s_{33}\end{bmatrix}$

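The Chebyshev polynomial matrix is shown below for a single term $T_k$. In practice $K$ determines the number of loop iterations: for $K=n$ the computation runs over all $n$ terms $T_0,\dots,T_{n-1}$, each of which is an $(N\times N)$ matrix: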

$T_k=\begin{bmatrix}t_{11}&t_{12}&t_{13}\\t_{21}&t_{22}&t_{23}\\t_{31}&t_{32}&t_{33}\end{bmatrix}$

The weight matrix is:

$\Theta_k=\begin{bmatrix}\theta_{11}&\theta_{12}&\dots&\theta_{18}\\\theta_{21}&\theta_{22}&\dots&\theta_{28}\end{bmatrix}$

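The computation proceeds as follows:

 T_k_with_at = T_k.mul(spatial_attention)  # (N,N)*(b,N,N) = (b,N,N); spatial_attention is softmax-normalized along dim=1, so each of its columns sums to 1

This is the Hadamard product. Unlike ordinary matrix multiplication, the entries at the same position are multiplied, producing a new tensor in which each entry is the product of the corresponding entries: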

$T_k^\text{att}=T_k\odot A_{spatial}=\begin{bmatrix}t_{11}s_{11}&t_{12}s_{12}&t_{13}s_{13}\\t_{21}s_{21}&t_{22}s_{22}&t_{23}s_{23}\\t_{31}s_{31}&t_{32}s_{32}&t_{33}s_{33}\end{bmatrix}$

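Because $A_{spatial}$ was normalized with softmax along a node dimension, the attention weights in each of its columns sum to 1. This step rescales every entry of $T_k$ by the corresponding attention score.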

 rhs = T_k_with_at.permute(0, 2, 1).matmul(graph_signal)

$rhs=(T_k^\text{att})^T\cdot H^{(0)}=\begin{bmatrix}t_{11}s_{11}&t_{21}s_{21}&t_{31}s_{31}\\t_{12}s_{12}&t_{22}s_{22}&t_{32}s_{32}\\t_{13}s_{13}&t_{23}s_{23}&t_{33}s_{33}\end{bmatrix}\begin{bmatrix}x_{11}&x_{12}\\x_{21}&x_{22}\\x_{31}&x_{32}\end{bmatrix}$

$\text{rhs}=\begin{bmatrix}R_{11}&R_{12}\\R_{21}&R_{22}\\R_{31}&R_{32}\end{bmatrix},\quad R_{ij}=\sum_{m=1}^{3}(t_{mi}s_{mi})x_{mj}$

The earlier post gave a rough picture of how GCN convolution works, i.e. how the relation between a node and its neighbours is computed. The matrix $rhs$ performs exactly this step: it is the "convolution" after the spatial attention has been applied. (If this is hard to follow, see the second post linked at the beginning.)

 output = output + rhs.matmul(theta_k)

$\mathrm{rhs}\cdot\Theta_k=\begin{bmatrix}R_{11}&R_{12}\\R_{21}&R_{22}\\R_{31}&R_{32}\end{bmatrix}\begin{bmatrix}\theta_{11}&\theta_{12}&\dots&\theta_{18}\\\theta_{21}&\theta_{22}&\dots&\theta_{28}\end{bmatrix}$

The result is a $(3 \times 8)$ matrix, so this step maps:

$H^{(0)}=\begin{bmatrix}\begin{bmatrix}x_{11}&x_{12}\\x_{21}&x_{22}\\x_{31}&x_{32}\end{bmatrix}\end{bmatrix} \to H^{(1)}=\begin{bmatrix}\begin{bmatrix}x'_{11}&x'_{12}&\cdots&x'_{18}\\x'_{21}&x'_{22}&\cdots&x'_{28}\\x'_{31}&x'_{32}&\cdots&x'_{38}\end{bmatrix}\end{bmatrix}$

Putting all the steps together, the class implements a complete graph convolution (the transpose and the final ReLU follow the code):

$H^{(1)}=\mathrm{ReLU}\left(\sum\limits_{k=0}^{K-1}\left(T_k\odot A_{spatial}\right)^{T}H^{(0)}\Theta_k\right)$

outputs.append(output.unsqueeze(-1))

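Finally, back in the outer loop, the T=12 per-step results are concatenated along the last dimension into a tensor of shape (b, N, F_out, T), to which ReLU is applied.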
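
The per-time-step computation can be replayed in NumPy and compared with its index form. This is a sketch under stated assumptions: the Chebyshev matrices, the attention tensor and the weights are random stand-ins rather than values from a trained model, and $K=3$ is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_in, F_out, K = 1, 3, 2, 8, 3

graph_signal = rng.normal(size=(B, N, F_in))         # H^(0) for one time step
attention = rng.random(size=(B, N, N))               # stand-in for spatial_attention
cheb = [rng.normal(size=(N, N)) for _ in range(K)]   # stand-ins for T_0..T_{K-1}
Theta = [rng.normal(size=(F_in, F_out)) for _ in range(K)]

output = np.zeros((B, N, F_out))
for k in range(K):
    T_k_with_at = cheb[k] * attention                                # Hadamard product, (B,N,N)
    rhs = np.matmul(T_k_with_at.transpose(0, 2, 1), graph_signal)    # (B,N,F_in)
    output = output + np.matmul(rhs, Theta[k])                       # (B,N,F_out)
output = np.maximum(output, 0.0)                                     # ReLU

# Index form: out[b,i,o] = ReLU(sum_k sum_j sum_f T_k[j,i] A[b,j,i] x[b,j,f] Theta_k[f,o])
reference = np.maximum(
    sum(np.einsum('ji,bji,bjf,fo->bio', cheb[k], attention, graph_signal, Theta[k])
        for k in range(K)),
    0.0,
)
print(output.shape, np.allclose(output, reference))
```

The einsum subscripts make the transpose explicit: node $i$ of the output collects contributions from node $j$ through the entry at position $(j,i)$ of $T_k\odot A_{spatial}$.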

2.5 Temporal_Attention_layer

class Temporal_Attention_layer(nn.Module):
    def __init__(self, DEVICE, in_channels, num_of_vertices, num_of_timesteps):
        super(Temporal_Attention_layer, self).__init__()
        self.U1 = nn.Parameter(torch.FloatTensor(num_of_vertices).to(DEVICE))
        self.U2 = nn.Parameter(torch.FloatTensor(in_channels, num_of_vertices).to(DEVICE))
        self.U3 = nn.Parameter(torch.FloatTensor(in_channels).to(DEVICE))
        self.be = nn.Parameter(torch.FloatTensor(1, num_of_timesteps, num_of_timesteps).to(DEVICE))
        self.Ve = nn.Parameter(torch.FloatTensor(num_of_timesteps, num_of_timesteps).to(DEVICE))

    def forward(self, x):
        '''
        :param x: (batch_size, N, F_in, T)
        :return: (B, T, T)
        '''
        _, num_of_vertices, num_of_features, num_of_timesteps = x.shape

        lhs = torch.matmul(torch.matmul(x.permute(0, 3, 2, 1), self.U1), self.U2)
        # x:(B, N, F_in, T) -> (B, T, F_in, N)
        # (B, T, F_in, N)(N) -> (B,T,F_in)
        # (B,T,F_in)(F_in,N)->(B,T,N)

        rhs = torch.matmul(self.U3, x)  # (F)(B,N,F,T)->(B, N, T)

        product = torch.matmul(lhs, rhs)  # (B,T,N)(B,N,T)->(B,T,T)

        E = torch.matmul(self.Ve, torch.sigmoid(product + self.be))  # (B, T, T)

        E_normalized = F.softmax(E, dim=1)

        return E_normalized

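If you have made it this far, congratulations: the most demanding part is almost over. Looking at this class, it is easy to see that the temporal attention mechanism, i.e. the computation of the temporal score matrix, follows the same pattern as the spatial scores described earlier, with the roles of nodes and time steps swapped.

The code implements a temporal attention layer, which computes temporal attention scores for the input.

$U_1,U_2,U_3$ and $b_e,V_e$ are learnable parameters: $U_1$ and $U_3$ are vectors, $U_2$ and $V_e$ are matrices, and $b_e$ is a 3-D tensor whose leading dimension is 1. The input has shape (batch_size, N, F_in, T).

Again we use an example with the following sizes: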

$\begin{aligned}&N=3\\&F_{in}=2\\&T=4\end{aligned}$

We assume a single batch: a simple graph with 3 nodes, 2 features per node, and 4 time steps.

self.U1 = nn.Parameter(torch.FloatTensor(num_of_vertices).to(DEVICE))

$U_1=[u_1,u_2,u_3]^T$

The shape of $U_{1}$ depends on num_of_vertices, i.e. the number of nodes.

self.U2 = nn.Parameter(torch.FloatTensor(in_channels, num_of_vertices).to(DEVICE))

$U_{2}=\begin{bmatrix}v_{11}&v_{12}&v_{13}\\v_{21}&v_{22}&v_{23}\end{bmatrix}$

The shape of $U_{2}$ depends on in_channels and num_of_vertices, i.e. the number of input features and the number of nodes.

self.U3 = nn.Parameter(torch.FloatTensor(in_channels).to(DEVICE))

$U_3=\begin{bmatrix}u_{31},u_{32}\end{bmatrix}^T$

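The shape of $U_{3}$ depends on in_channels, i.e. the number of input features. The code defines $U_{3}$ as a 1-D tensor (U3.shape = torch.Size([2])).

We reuse the $N=3,F_{in}=2,T=4$ data defined earlier.

The code first multiplies the permuted input by the weights $U_1$ and $U_2$. Multiplying by $U_1$ sums out the node dimension, so the intermediate result summarizes each time step over all nodes.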

lhs = torch.matmul(torch.matmul(x.permute(0, 3, 2, 1), self.U1), self.U2)

After x.permute(0, 3, 2, 1) the tensor has shape (B,T,F,N), i.e. one $(F\times N)$ block $X'_t$ per time step $t$:

$\mathbf{x'}=\begin{bmatrix}X'_1\\X'_2\\X'_3\\X'_4\end{bmatrix},\quad X'_t=\begin{bmatrix}x_{1,1,1,t}&x_{1,2,1,t}&x_{1,3,1,t}\\x_{1,1,2,t}&x_{1,2,2,t}&x_{1,3,2,t}\end{bmatrix}$

$lhs=\mathbf{x'}\cdot U_{1}\cdot U_{2}$

After this step $\mathbf{x'}\cdot U_1$ has shape (B,T,F):

$\mathbf{x'}\cdot U_1=lhs_{temp},\quad lhs_{temp}[t,f]=\sum_{n=1}^3x_{1,n,f,t}\,u_{n}\quad(t=1,2,3,4;\ f=1,2)$

This step aggregates the node information: the values at the three nodes are combined in a weighted sum, with $U_1$ acting as the weight vector. During training, important nodes may receive larger weights.

Since $lhs=lhs_{temp}\cdot U_{2}$, we have:

$lhs[t,n]=\sum_{f=1}^2lhs_{temp}[t,f]\cdot v_{f,n}\quad(t=1,2,3,4;\ n=1,2,3)$

The resulting $lhs$ has shape (B,T,N).

rhs = torch.matmul(self.U3, x)

$rhs=\sum_{j=1}^2u_{3,j}\cdot x_{1,:,j,:}$

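The resulting $rhs$ has shape (B,N,T); its entry for node $n$ and time step $t$ is $\sum_{j=1}^2u_{3,j}\cdot x_{1,n,j,t}$.

This step aggregates the feature information: the two features are combined in a weighted sum, with $U_3$ acting as the weight vector. During training, important features may receive larger weights.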

product = torch.matmul(lhs, rhs)

product = torch.matmul(lhs, rhs) computes the raw (unnormalized) attention score for every pair of time steps.

$Product={lhs}(T,N)\cdot{rhs}(N,T)$

The result is a $(T\times T)$ matrix.

self.be = nn.Parameter(torch.FloatTensor(1, num_of_timesteps, num_of_timesteps).to(DEVICE))
self.Ve = nn.Parameter(torch.FloatTensor(num_of_timesteps, num_of_timesteps).to(DEVICE))

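By definition, the shape of $V_{e}$ depends on num_of_timesteps, i.e. the sequence length. The code defines $V_{e}$ as a 2-D tensor (Ve.shape = torch.Size([4, 4])).

The shape of $b_e$ also depends on num_of_timesteps. The code defines $b_e$ as a 3-D tensor (be.shape = torch.Size([1, 4, 4])).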

E = torch.matmul(self.Ve, torch.sigmoid(product + self.be))  # (B, T, T)
E_normalized = F.softmax(E, dim=1)

$E={V_e}\cdot\sigma(\mathrm{product}+{b_e})$

$E_{\text{normalized}}=\text{softmax}(E,\text{dim}=1)$

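$V_{e}$: weight matrix that further mixes the relations between time steps.
$b_e$: bias tensor that shifts the baseline of the scores.
$\sigma$: the sigmoid function, which squashes the values into (0, 1).
softmax: normalizes along dim=1 (a time dimension), so that for each sample every column of $E$ sums to 1.

$\mathrm{softmax}(E)_{ij}=\dfrac{e^{E_{ij}}}{\sum_{k=1}^Te^{E_{kj}}}$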
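
As with the spatial layer, the temporal layer can be traced in NumPy. The sketch below uses random stand-in parameters and the example sizes $N=3,F_{in}=2,T=4$; it is meant only to show the permutation, the resulting shapes, and which axis the softmax with dim=1 normalizes.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_in, T = 1, 3, 2, 4

x = rng.normal(size=(B, N, F_in, T))
U1 = rng.normal(size=N)            # stand-ins for the learnable parameters
U2 = rng.normal(size=(F_in, N))
U3 = rng.normal(size=F_in)
be = rng.normal(size=(1, T, T))
Ve = rng.normal(size=(T, T))

x_perm = x.transpose(0, 3, 2, 1)                 # (B,N,F,T) -> (B,T,F,N)
lhs = np.matmul(np.matmul(x_perm, U1), U2)       # (B,T,F) -> (B,T,N)
rhs = np.matmul(U3, x)                           # (F)(B,N,F,T) -> (B,N,T)
product = np.matmul(lhs, rhs)                    # (B,T,T)
E = np.matmul(Ve, 1.0 / (1.0 + np.exp(-(product + be))))   # Ve . sigmoid(product + be)

E_exp = np.exp(E - E.max(axis=1, keepdims=True))
E_normalized = E_exp / E_exp.sum(axis=1, keepdims=True)    # softmax along axis 1

column_sums = E_normalized.sum(axis=1)   # sum over the first matrix index
print(E_normalized.shape, column_sums)
```

The column sums are all 1, matching the softmax formula above in which the denominator runs over the first index.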

2.6 cheb_conv

class cheb_conv(nn.Module):
    '''
    K-order chebyshev graph convolution
    '''

    def __init__(self, K, cheb_polynomials, in_channels, out_channels):
        '''
        :param K: int
        :param in_channles: int, num of channels in the input sequence
        :param out_channels: int, num of channels in the output sequence
        '''
        super(cheb_conv, self).__init__()
        self.K = K
        self.cheb_polynomials = cheb_polynomials
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.DEVICE = cheb_polynomials[0].device
        self.Theta = nn.ParameterList([nn.Parameter(torch.FloatTensor(in_channels, out_channels).to(self.DEVICE)) for _ in range(K)])

    def forward(self, x):
        '''
        Chebyshev graph convolution operation
        :param x: (batch_size, N, F_in, T)
        :return: (batch_size, N, F_out, T)
        '''

        batch_size, num_of_vertices, in_channels, num_of_timesteps = x.shape

        outputs = []

        for time_step in range(num_of_timesteps):

            graph_signal = x[:, :, :, time_step]  # (b, N, F_in)

            output = torch.zeros(batch_size, num_of_vertices, self.out_channels).to(self.DEVICE)  # (b, N, F_out)

            for k in range(self.K):

                T_k = self.cheb_polynomials[k]  # (N,N)

                theta_k = self.Theta[k]  # (in_channel, out_channel)

                rhs = graph_signal.permute(0, 2, 1).matmul(T_k).permute(0, 2, 1)

                output = output + rhs.matmul(theta_k)

            outputs.append(output.unsqueeze(-1))

        return F.relu(torch.cat(outputs, dim=-1))

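This class has the same structure as cheb_conv_withSAt and can be seen as its simplified version: it is a plain Chebyshev graph convolution without the attention mechanism. Here $T_k(\hat{L})$ denotes the $k$-th precomputed Chebyshev matrix, and the transpose follows the code:

$\begin{aligned}H^{(1)}=\mathrm{ReLU}\left(\sum_{k=0}^{K-1}T_k(\hat{L})^{T}H^{(0)}\Theta_k\right)\end{aligned}$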
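
The double permute in cheb_conv can look opaque. A short NumPy check, with a random stand-in for one Chebyshev matrix, suggests that it is simply a left multiplication by the transpose of $T_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_in = 2, 3, 2
graph_signal = rng.normal(size=(B, N, F_in))
T_k = rng.normal(size=(N, N))   # stand-in for one Chebyshev matrix

# Form used in cheb_conv: transpose, right-multiply, transpose back
rhs_code = np.matmul(graph_signal.transpose(0, 2, 1), T_k).transpose(0, 2, 1)

# Equivalent form: left-multiply by the transpose of T_k
rhs_left = np.matmul(T_k.T, graph_signal)

print(np.allclose(rhs_code, rhs_left))  # True
```

When $T_k$ is symmetric, the transpose makes no difference and the expression reduces to $T_kH^{(0)}$.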

2.7 ASTGCN_block

class ASTGCN_block(nn.Module):

    def __init__(self, DEVICE, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, cheb_polynomials, num_of_vertices, num_of_timesteps):
        super(ASTGCN_block, self).__init__()
        self.TAt = Temporal_Attention_layer(DEVICE, in_channels, num_of_vertices, num_of_timesteps)
        self.SAt = Spatial_Attention_layer(DEVICE, in_channels, num_of_vertices, num_of_timesteps)
        self.cheb_conv_SAt = cheb_conv_withSAt(K, cheb_polynomials, in_channels, nb_chev_filter)
        self.time_conv = nn.Conv2d(nb_chev_filter, nb_time_filter, kernel_size=(1, 3), stride=(1, time_strides), padding=(0, 1))
        self.residual_conv = nn.Conv2d(in_channels, nb_time_filter, kernel_size=(1, 1), stride=(1, time_strides))
        self.ln = nn.LayerNorm(nb_time_filter)  # the channel dimension must be moved to the last axis

    def forward(self, x):
        '''
        :param x: (batch_size, N, F_in, T)
        :return: (batch_size, N, nb_time_filter, T)
        '''
        batch_size, num_of_vertices, num_of_features, num_of_timesteps = x.shape

        # TAt
        temporal_At = self.TAt(x)  # (b, T, T)

        x_TAt = torch.matmul(x.reshape(batch_size, -1, num_of_timesteps), temporal_At).reshape(batch_size, num_of_vertices, num_of_features, num_of_timesteps)

        # SAt
        spatial_At = self.SAt(x_TAt)

        # cheb gcn
        spatial_gcn = self.cheb_conv_SAt(x, spatial_At)  # (b,N,F,T)
        # spatial_gcn = self.cheb_conv(x)

        # convolution along the time axis
        time_conv_output = self.time_conv(spatial_gcn.permute(0, 2, 1, 3))   # (b,N,F,T)->(b,F,N,T), then a (1,3) kernel ->(b,F,N,T)

        # residual shortcut
        x_residual = self.residual_conv(x.permute(0, 2, 1, 3))  # (b,N,F,T)->(b,F,N,T), then a (1,1) kernel ->(b,F,N,T)

        x_residual = self.ln(F.relu(x_residual + time_conv_output).permute(0, 3, 2, 1)).permute(0, 2, 3, 1)
        # (b,F,N,T)->(b,T,N,F) -ln-> (b,T,N,F)->(b,N,F,T)

        return x_residual

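ASTGCN_block is a building block of the network. It combines the attention mechanisms and convolutions described above and adds a few operations that improve training. Its components are:

Temporal attention layer (TAt): computes the attention weights along the time dimension.
Spatial attention layer (SAt): computes the attention weights along the spatial dimension.
Chebyshev convolution layer (cheb_conv_SAt): Chebyshev graph convolution combined with spatial attention, used to extract spatial features.
Time convolution (time_conv): convolves along the time dimension to extract temporal features.
Residual convolution (residual_conv): implements the residual connection.
Layer normalization (ln): normalizes the output to stabilize training.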

self.time_conv = nn.Conv2d(nb_chev_filter, nb_time_filter, kernel_size=(1, 3), stride=(1, time_strides), padding=(0, 1))
self.residual_conv = nn.Conv2d(in_channels, nb_time_filter, kernel_size=(1, 1), stride=(1, time_strides))

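Apart from the parts already covered, these two layers use PyTorch's built-in 2D convolution.

[Figure: illustration of a 2D convolution]

Note that the kernel here is not the square window of a typical image convolution. The time-convolution kernel has shape $(1,3)$: it is one cell high along the node axis and three cells wide along the time axis, so it slides only along time and mixes neighbouring time steps of the same node, much like a learned weighted average. The residual convolution uses a $(1,1)$ kernel.

Time convolution (with $F_{chev}$ = nb_chev_filter and $s$ = time_strides; out-of-range time indices are zero padding, and the bias is omitted):

$Y[b,j,i,t]=\sum\limits_{k=0}^{F_{chev}-1}\sum\limits_{l=0}^2X[b,k,i,s\cdot t+l-1]\cdot W[k,j,l]$

Residual convolution (same index order: batch, channel, node, time; bias omitted):

$Y[b,j,i,t]=\sum_{k=0}^{F_{in}-1}X[b,k,i,s\cdot t]\cdot W[k,j]$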
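
The time-convolution formula can be turned into a few lines of NumPy. This is an illustrative sketch, not how nn.Conv2d is implemented: the function name time_conv_1x3 is hypothetical, the weight is stored as (C_in, C_out, 3) to follow the $W[k,j,l]$ indexing of the formula (PyTorch itself stores Conv2d weights as (out, in, kH, kW)), and the bias is left out.

```python
import numpy as np

def time_conv_1x3(X, W, stride):
    """(1,3) convolution along the time axis with zero padding 1 (hypothetical helper).

    X: (B, C_in, N, T), W: (C_in, C_out, 3) indexed as W[k, j, l]. Bias is omitted.
    """
    B, C_in, N, T = X.shape
    X_pad = np.pad(X, ((0, 0), (0, 0), (0, 0), (1, 1)))   # zero-pad the time axis only
    T_out = (T + 2 - 3) // stride + 1
    Y = np.zeros((B, W.shape[1], N, T_out))
    for t in range(T_out):
        window = X_pad[:, :, :, t * stride : t * stride + 3]   # (B, C_in, N, 3)
        Y[:, :, :, t] = np.einsum('bknl,kjl->bjn', window, W)
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 4, 3, 12))    # (B, nb_chev_filter, N, T)
W = rng.normal(size=(4, 5, 3))        # (nb_chev_filter, nb_time_filter, 3)

Y1 = time_conv_1x3(X, W, stride=1)
Y2 = time_conv_1x3(X, W, stride=2)
print(Y1.shape)   # (1, 5, 3, 12)
print(Y2.shape)   # (1, 5, 3, 6)
```

With stride 1 the time length is preserved; with stride 2 it shrinks from 12 to 6, in line with $T'=\lfloor (T-1)/s\rfloor+1$.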

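nb_chev_filter is the number of output channels of the Chebyshev graph convolution, which extracts spatial dependencies from the graph structure. nb_time_filter is the number of output channels of the time convolution, which captures temporal dynamics. Stacking the two lets the model account for spatial dependencies and temporal variation together, which is intended to improve prediction accuracy.

Why introduce a residual convolution?

In deep networks the gradient can gradually vanish during backpropagation, so the parameters of the early layers are not updated effectively. The residual branch adds a shortcut connection that lets the gradient reach shallower layers more easily, which alleviates this problem. The residual connection also lets the network learn the residual between input and output features: it only has to learn the change rather than fit the full mapping directly, which simplifies learning. In practice this tends to make deep models easier to train.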

temporal_At = self.TAt(x)  # (b, T, T)
x_TAt = torch.matmul(x.reshape(batch_size, -1, num_of_timesteps), temporal_At).reshape(batch_size, num_of_vertices, num_of_features, num_of_timesteps)

The forward pass first computes the temporal attention matrix temporal_At and applies it to the input. With the same example as before:

$\mathbf{x}\cdot A_{temporal}=\begin{bmatrix}\begin{bmatrix}x_{1,1,1,1}&x_{1,1,1,2}&x_{1,1,1,3}&x_{1,1,1,4}\\x_{1,1,2,1}&x_{1,1,2,2}&x_{1,1,2,3}&x_{1,1,2,4}\end{bmatrix}\\\begin{bmatrix}x_{1,2,1,1}&x_{1,2,1,2}&x_{1,2,1,3}&x_{1,2,1,4}\\x_{1,2,2,1}&x_{1,2,2,2}&x_{1,2,2,3}&x_{1,2,2,4}\end{bmatrix}\\\begin{bmatrix}x_{1,3,1,1}&x_{1,3,1,2}&x_{1,3,1,3}&x_{1,3,1,4}\\x_{1,3,2,1}&x_{1,3,2,2}&x_{1,3,2,3}&x_{1,3,2,4}\end{bmatrix}\end{bmatrix}\begin{bmatrix}t_{11}&t_{12}&t_{13}&t_{14}\\t_{21}&t_{22}&t_{23}&t_{24}\\t_{31}&t_{32}&t_{33}&t_{34}\\t_{41}&t_{42}&t_{43}&t_{44}\end{bmatrix}$

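This step applies the temporal score matrix to the input data, which is how the temporal attention enters the block. The result still has shape (batch_size, N, F_in, T).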
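
The reshape, matmul, reshape sequence is equivalent to mixing every (node, feature) series along the time axis. A minimal NumPy sketch, with a random stand-in for temporal_At, compares it with the index form:

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_in, T = 2, 3, 2, 4
x = rng.normal(size=(B, N, F_in, T))
temporal_At = rng.random(size=(B, T, T))   # stand-in for the output of TAt

# Same reshape -> matmul -> reshape as in ASTGCN_block.forward
x_TAt = np.matmul(x.reshape(B, -1, T), temporal_At).reshape(B, N, F_in, T)

# Index form: x_TAt[b,n,f,t] = sum_s x[b,n,f,s] * A[b,s,t]
reference = np.einsum('bnfs,bst->bnft', x, temporal_At)

print(x_TAt.shape, np.allclose(x_TAt, reference))
```

Flattening the node and feature axes into one axis of length $N\cdot F_{in}$ is only a convenience so that a single batched matrix product can be used.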

spatial_At = self.SAt(x_TAt)
spatial_gcn = self.cheb_conv_SAt(x, spatial_At)  # (b,N,F,T)

Next the spatial attention matrix spatial_At is computed from x_TAt. It is then used in the cheb_conv_withSAt convolution on the original input x, which gives a tensor of shape (batch_size, N, nb_chev_filter, T).

 time_conv_output = self.time_conv(spatial_gcn.permute(0, 2, 1, 3))
 x_residual = self.residual_conv(x.permute(0, 2, 1, 3))
 x_residual = self.ln(F.relu(x_residual + time_conv_output).permute(0, 3, 2, 1)).permute(0, 2, 3, 1)

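Next, the time convolution and the residual convolution described above are applied. Both produce tensors of shape (batch_size, nb_time_filter, N, T′). The output length T′ depends on the kernel size, the padding and the stride; with the settings above both layers give T′ = ⌊(T−1)/time_strides⌋ + 1, so the two outputs can be added.

x_residual = self.ln(F.relu(x_residual + time_conv_output).permute(0, 3, 2, 1)).permute(0, 2, 3, 1)

After the addition, the ReLU activation and the layer normalization, the block outputs a tensor of shape (batch_size, N, nb_time_filter, T').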
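
The two permutes around the layer normalization are easy to get wrong, so here is a NumPy sketch of that last line. It assumes a freshly initialized LayerNorm (scale 1, shift 0), which is why the learnable affine parameters are left out; layer_norm_last_axis is a hypothetical helper, and the inputs are random stand-ins.

```python
import numpy as np

def layer_norm_last_axis(z, eps=1e-5):
    """Normalize over the last axis; the learnable scale and shift are left out."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)     # biased variance, as in layer normalization
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
B, F_time, N, T_out = 1, 5, 3, 12
x_residual = rng.normal(size=(B, F_time, N, T_out))
time_conv_output = rng.normal(size=(B, F_time, N, T_out))

summed = np.maximum(x_residual + time_conv_output, 0.0)   # ReLU, (B,F,N,T')
channels_last = summed.transpose(0, 3, 2, 1)              # (B,T',N,F)
normalized = layer_norm_last_axis(channels_last)          # (B,T',N,F)
out = normalized.transpose(0, 2, 3, 1)                    # (B,N,F,T')

print(out.shape)   # (1, 3, 5, 12)
```

The normalization runs over the nb_time_filter channels of each (node, time step) pair, so the mean over the channel axis of out is zero.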

2.8 ASTGCN_submodule

class ASTGCN_submodule(nn.Module):

    def __init__(self, DEVICE, nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, cheb_polynomials, num_for_predict, len_input, num_of_vertices):

        super(ASTGCN_submodule, self).__init__()

        self.BlockList = nn.ModuleList([ASTGCN_block(DEVICE, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, cheb_polynomials, num_of_vertices, len_input)])

        self.BlockList.extend([ASTGCN_block(DEVICE, nb_time_filter, K, nb_chev_filter, nb_time_filter, 1, cheb_polynomials, num_of_vertices, len_input//time_strides) for _ in range(nb_block-1)])

        self.final_conv = nn.Conv2d(int(len_input/time_strides), num_for_predict, kernel_size=(1, nb_time_filter))

        self.DEVICE = DEVICE

        self.to(DEVICE)

    def forward(self, x):
        '''
        :param x: (B, N_nodes, F_in, T_in)
        :return: (B, N_nodes, T_out)
        '''
        for block in self.BlockList:
            x = block(x)

        output = self.final_conv(x.permute(0, 3, 1, 2))[:, :, :, -1].permute(0, 2, 1)
        # (b,N,F,T)->(b,T,N,F)-conv<1,F>->(b,c_out*T,N,1)->(b,c_out*T,N)->(b,N,T)

        return output

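Most of the parameters of this class have already been explained, so they are not repeated here.

ASTGCN_submodule stacks several ASTGCN blocks into a deep model. Each block extracts spatial and temporal features in turn, and a final convolution layer produces the output used for prediction.

[Figure: stacked ASTGCN blocks followed by the final convolution]

[Figure: overview of everything described so far, up to ASTGCN_submodule]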

 output = self.final_conv(x.permute(0, 3, 1, 2))[:, :, :, -1].permute(0, 2, 1)

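Before the convolution, $\mathbf{x}$ is permuted to shape (batch_size, T, N, F), so that the time steps act as the input channels.
The kernel (1, nb_time_filter) spans the whole feature axis, so the convolution output has shape (batch_size, num_for_predict, N, 1). Indexing with [:, :, :, -1] drops that last axis of size 1, and the final permute swaps the prediction and node axes.
The output has shape (batch_size, N, T_out), where T_out is the number of predicted time steps.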
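
The shape changes in the final layer can be traced with a NumPy stand-in for the convolution. The weights below are random and the sizes are chosen arbitrarily for illustration; P plays the role of num_for_predict.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, F_time, T, P = 2, 3, 5, 12, 6   # P plays the role of num_for_predict

x = rng.normal(size=(B, N, F_time, T))
weight = rng.normal(size=(P, T, 1, F_time))   # Conv2d weight layout: (out, in, kH, kW)
bias = rng.normal(size=P)

x_perm = x.transpose(0, 3, 1, 2)              # (B,N,F,T) -> (B,T,N,F)
# A (1, F) kernel covers the whole last axis, so the output width is 1
conv = np.einsum('btnf,ptf->bpn', x_perm, weight[:, :, 0, :]) + bias[None, :, None]
conv = conv[..., None]                        # (B,P,N,1)
output = conv[:, :, :, -1].transpose(0, 2, 1) # (B,P,N) -> (B,N,P)

print(output.shape)   # (2, 3, 6)
```

Each predicted step is therefore a learned linear combination of all input time steps and all feature channels of the same node.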

3 Main Function

def ASTGCN(DEVICE, nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, adj_mx, num_for_predict, len_input, num_of_vertices):

    L_tilde = scaled_Laplacian(adj_mx)
    cheb_polynomials = [torch.from_numpy(i).type(torch.FloatTensor).to(DEVICE) for i in cheb_polynomial(L_tilde, K)]
    model = ASTGCN_submodule(DEVICE, nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, cheb_polynomials, num_for_predict, len_input, num_of_vertices)

    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
        else:
            nn.init.uniform_(p)

    return model

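Parameters

DEVICE: the device the model runs on (e.g. a GPU).
nb_block: the number of ASTGCN blocks.
in_channels: the number of input feature channels.
K: the number of Chebyshev polynomial terms in the graph convolution.
nb_chev_filter: the number of output channels of the Chebyshev convolution layer.
nb_time_filter: the number of output channels of the time convolution layer.
time_strides: the stride of the time convolution.
adj_mx: the adjacency matrix that defines the graph structure.
num_for_predict: the number of time steps to predict.
len_input: the length of the input sequence.
num_of_vertices: the number of nodes.

L_tilde = scaled_Laplacian(adj_mx)

computes the scaled Laplacian of the graph with scaled_Laplacian.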

cheb_polynomials = [torch.from_numpy(i).type(torch.FloatTensor).to(DEVICE) for i in cheb_polynomial(L_tilde, K)]

builds the Chebyshev polynomials from the scaled Laplacian and converts them to float tensors on DEVICE.

model = ASTGCN_submodule(DEVICE, nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, cheb_polynomials, num_for_predict, len_input, num_of_vertices)

creates an ASTGCN_submodule instance, which is the main body of the ASTGCN model.

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)
    else:
        nn.init.uniform_(p)

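iterates over the model's parameters and picks an initialization method according to the number of dimensions:

If the parameter has more than one dimension (i.e. it is a weight matrix or a higher-order tensor), Xavier uniform initialization is used, which helps the model converge faster.
If the parameter is one-dimensional (usually a bias, but also vectors such as $W_1$, $W_3$, $U_1$ and $U_3$), it is initialized from a uniform distribution, which for nn.init.uniform_ defaults to the interval [0, 1].

These are all the details of the ASTGCN network model. The most important parts are Spatial_Attention_layer, Temporal_Attention_layer, cheb_conv_withSAt and ASTGCN_block. ASTGCN performs well on spatial-temporal forecasting tasks such as traffic flow prediction and is an important baseline for prediction problems on graph data, so it is well worth studying for anyone getting started with deep learning.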
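
For reference, Xavier uniform initialization draws from $U(-a,a)$ with $a=\sqrt{6/(\text{fan\_in}+\text{fan\_out})}$ when the gain is 1. The NumPy sketch below imitates the two branches for a 2-D and a 1-D parameter; xavier_uniform_2d is a hypothetical helper written for illustration, not a PyTorch function.

```python
import math
import numpy as np

def xavier_uniform_2d(shape, rng):
    """Sample a 2-D weight from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    fan_out, fan_in = shape            # for a 2-D tensor: number of rows and columns
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=shape), bound

rng = np.random.default_rng(0)
W2, bound = xavier_uniform_2d((2, 4), rng)   # same shape as W2 in the spatial attention example
W1 = rng.uniform(0.0, 1.0, size=4)           # 1-D parameter: U(0, 1)

print(bound)               # 1.0, since 6 / (4 + 2) = 1
print(np.abs(W2).max() <= bound)
```

Because the bound depends only on the sum of the two sizes, a $(2,4)$ and a $(4,2)$ matrix get the same range.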