PyG:PyTorch Geometric Library

studyeboy

Last modified: 2022-02-11 15:57:56

First published: 2022-02-10 13:39:23

Article link: https://blog.csdn.net/studyeboy/article/details/122807327

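PyG (PyTorch Geometric) is a PyTorch-based library for working with irregularly structured data such as graphs. It is a framework for quickly implementing representation learning on graphs and similar data, and it is one of the most popular and widely used GNN (Graph Neural Network) libraries.

Graph Neural Networks (GNNs) have attracted a lot of attention in deep learning in recent years. A GNN extracts features by passing, transforming, and aggregating information, much like a conventional CNN. The difference is that a CNN only handles regular inputs, such as images whose height, width, and number of channels are fixed, whereas a GNN can handle irregular inputs such as point clouds.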
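To make "passing, transforming, and aggregating" concrete, here is a minimal sketch in plain PyTorch. It is not PyG's implementation: the toy graph, the feature values, and the fixed `weight` matrix are assumptions chosen for illustration, and a real GNN layer would learn the transform.

```python
import torch

# Toy graph: 3 nodes, undirected edges 0-1 and 1-2 stored as 4 directed edges.
edge_index = torch.tensor([[0, 1, 1, 2],   # source nodes
                           [1, 0, 2, 1]])  # target nodes
x = torch.tensor([[1.0], [2.0], [3.0]])    # one feature per node

src, dst = edge_index
messages = x[src]                          # pass: each edge carries its source node's feature
aggregated = torch.zeros_like(x)
aggregated.index_add_(0, dst, messages)    # aggregate: sum the messages arriving at each target

weight = torch.tensor([[0.5]])             # hypothetical transform, fixed here instead of learned
updated = (x + aggregated) @ weight        # transform: combine own and neighbour information

print(aggregated.squeeze(1).tolist())      # [2.0, 4.0, 2.0]
print(updated.squeeze(1).tolist())         # [1.5, 3.0, 2.5]
```

Node 1 has two neighbours, so it receives 1 + 3 = 4, while nodes 0 and 2 each receive only the feature of node 1.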

Installation

pip install torch-geometric
pip install torch-sparse
pip install torch-scatter
pip install pytorch-fid

torch_geometric.data.Data

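A graph is made up of nodes and the edges between them, so building a graph in PyG requires two ingredients: nodes and edges. PyG provides torch_geometric.data.Data (Data for short) to describe a graph. It has five main attributes, none of which is required; any of them may be left empty.

x: the features of every node, with shape [num_nodes, num_node_features].
edge_index: the edges between nodes, with shape [2, num_edges].
pos: the node coordinates, with shape [num_nodes, num_dimensions].
y: the labels. With one label per node the shape is [num_nodes, *]; with a single label for the whole graph the shape is [1, *].
edge_attr: the edge features, with shape [num_edges, num_edge_features].

A Data object is not limited to these attributes. It can be extended with further ones, for example data.face, a tensor that stores the triangle connectivity of a 3D mesh.

This differs slightly from plain PyTorch in that Data carries the sample's label. In PyTorch you override __getitem__() of Dataset and return the sample and its label for a given index. In PyG, get() returns an object of type torch_geometric.data.Data for a given index, and that object contains both the data and the label.

Example: an unweighted undirected graph (unweighted means the edges carry no weights) with 3 nodes and 4 edges, (0->1), (1->0), (1->2), (2->1). Each undirected connection is stored as two directed edges, and every node has a one-dimensional feature.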

import torch
from torch_geometric.data import Data

# The graph is undirected, so it is stored as four directed edges:
# (0->1), (1->0), (1->2), (2->1).
# Option 1 (the usual one): edge_index holds two lists. The first list
# contains the source node of each edge, the second its target node.
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)

# Node features
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)

# Option 2: list the edges as (source, target) pairs, then transpose
# and call contiguous().
edge_index = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)
data = Data(x=x, edge_index=edge_index.t().contiguous())

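contiguous in PyTorch
contiguous is an adjective meaning "stored without gaps". PyTorch provides two methods, is_contiguous and contiguous (the adjective used as a verb), to check whether a Tensor is contiguous and to make sure that it is.
Intuitively, is_contiguous tells you whether the order of the elements in the Tensor's underlying one-dimensional array matches the order obtained by flattening the Tensor in row-major order.
Under the hood a multi-dimensional Tensor is a one-dimensional array in a single block of memory, stored in row-major order. The Tensor's metadata records the shape of the multi-dimensional array, and an element is found by converting its multi-dimensional index into an offset from the start of that one-dimensional array. Some Tensor operations (such as transpose, permute, narrow, expand) share memory with the original Tensor and do not change the underlying storage. As a result, elements that used to be adjacent both logically and in memory are still adjacent logically after such an operation but no longer adjacent in memory, i.e. the Tensor is no longer contiguous.
To make a Tensor contiguous again, call contiguous. If the Tensor is not contiguous, a new block of memory is allocated so that the data is laid out contiguously; if it is already contiguous, contiguous does nothing.
Row-major order
C/C++ use row-major order, while Matlab and Fortran use column-major order. PyTorch's Tensor is implemented in C underneath and also uses row-major order.
t = torch.arange(12).reshape(3, 4)

The array t is actually stored in memory as a one-dimensional array. flatten shows the one-dimensional form of t, and the actual storage order matches it.
t.flatten()

Logical layout under column-major storage

With column-major storage, the elements of the one-dimensional array would appear in the order 0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11.

In Figures 1 to 4, values with the same color belong to the same row. Whether the layout is row-major or column-major, moving to the next element of the matrix is done with an offset, called the stride. With row-major storage, adjacent elements of a row are 1 position apart in memory; with column-major storage they are 3 positions apart (the number of rows of t).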
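The behaviour described above can be checked directly with stride(), is_contiguous() and contiguous(). This is a small sketch using the same 3x4 tensor t; data_ptr() is used only to show whether two tensors share memory.

```python
import torch

t = torch.arange(12).reshape(3, 4)
print(t.is_contiguous(), t.stride())      # True (4, 1)

tt = t.t()                                # transpose shares storage with t
print(tt.is_contiguous(), tt.stride())    # False (1, 4)
print(tt.data_ptr() == t.data_ptr())      # True: same memory

c = tt.contiguous()                       # copies into a new row-major block
print(c.is_contiguous(), c.stride())      # True (3, 1)
print(c.data_ptr() == t.data_ptr())       # False: new memory

# Reading c in storage order gives the column-major order of t.
print(c.flatten().tolist())
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

This is also why the second way of building edge_index calls .t().contiguous(): the transpose alone only changes the strides, not the memory layout.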

Example: a directed graph with 4 nodes. Each node has two features and its own class label.

import torch
from torch_geometric.data import Data

x = torch.tensor([[2, 1], [5, 6], [3, 7], [12, 0]], dtype=torch.float)
y = torch.tensor([0, 1, 0, 1], dtype=torch.float)
# The order of the edges is independent of the node order; any order works.
edge_index = torch.tensor([[0, 1, 2, 0, 3], [1, 0, 1, 3, 2]], dtype=torch.long)
data = Data(x=x, y=y, edge_index=edge_index)

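Dataset and DataLoader

With Data in hand, you can create your own Dataset that reads and returns Data objects.

Custom Dataset

Although PyG ships with many useful datasets, you can also use your own data by subclassing the dataset classes in torch_geometric.data. Two kinds of Dataset are provided:

InMemoryDataset: loads all of the data into memory at once.
Dataset: loads one sample into memory at a time; this is the more commonly used one.

A custom Dataset takes the path root where the data is stored in its initializer. PyG then splits this path into two folders:

raw_dir: holds the raw data, typically in formats such as csv or mat.
processed_dir: holds the processed data, typically in pt format; it is produced by overriding process().

Besides root, __init__ accepts three callable arguments, transform, pre_transform and pre_filter, all of which default to None. transform converts a data object on the fly on every access. pre_transform is applied once, before the data is saved to disk. pre_filter filters out certain data objects.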

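In-memory datasets

To create an InMemoryDataset, implement the following four methods:

raw_file_names(): the file names it returns must be found in raw_dir for the download step to be skipped.
processed_file_names(): the file names it returns must be found in processed_dir for the processing step to be skipped.
download(): downloads the files into raw_dir.
process(): processes the raw data and saves the result in processed_dir.

Inside process(), the raw data is read, turned into a list of Data objects, and saved in processed_dir. Because saving a large Python list is slow, the list is first collated into one large Data object with collate(). collate() also returns a slices dictionary that is used to reconstruct the individual examples from that object. The constructor of the dataset therefore has to load self.data and self.slices from disk.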

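Creating larger datasets

Some datasets are too large to load into memory at once. In that case, implement torch_geometric.data.Dataset yourself; only two additional methods are needed:

len(): returns the length of the dataset.
get(): defines how a single graph is loaded.

Plain PyTorch has no raw and processed folders. The following explains what these two folders mean in PyG and how they are handled.

torch_geometric.data.Dataset inherits from torch.utils.data.Dataset. Its __init__() calls _download() and _process(), provided the subclass defines download() and process() respectively.

_download() first checks whether the files in self.raw_paths exist. If they do, it returns; otherwise it calls self.download() to download them.

_process() first checks whether self.processed_dir contains a saved record of pre_transform (pre_transform.pt). If so, it compares that record with the pre_transform passed in, and if they differ it warns the user to delete self.processed_dir first. pre_filter is handled the same way.

It then checks whether the files in self.processed_paths exist. If they do, it returns; otherwise it calls self.process() to generate them.

You usually do not need to implement download().

If you put ready-made pt files directly into self.processed_dir, you do not need to implement process() either.

In a PyTorch dataset you implement __getitem__() to return the sample and label for an index. torch_geometric.data.Dataset overrides __getitem__() and calls get() inside it to fetch the data.

So what you need to implement is get(), which returns an object of type torch_geometric.data.Data for a given index.

process() exists because the raw data may be in csv or mat format. Inside process() it can be converted to pt files, so that get() can simply read them with torch.load() and return torch_geometric.data.Data objects, without having to convert other formats into torch_geometric.data.Data inside get(). Alternatively, you can convert the data to torch_geometric.data.Data ahead of time and save it in pt format in self.processed_dir.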

#torch_geometric/data/dataset.py
from typing import List, Optional, Callable, Union, Any, Tuple

import sys
import re
import copy
import warnings
import numpy as np
import os.path as osp
from collections.abc import Sequence

import torch.utils.data
from torch import Tensor

from torch_geometric.data import Data
from torch_geometric.data.makedirs import makedirs

IndexType = Union[slice, Tensor, np.ndarray, Sequence]


class Dataset(torch.utils.data.Dataset):
    r"""Dataset base class for creating graph datasets.
    See `here <https://pytorch-geometric.readthedocs.io/en/latest/notes/
    create_dataset.html>`__ for the accompanying tutorial.

    Args:
        root (string, optional): Root directory where the dataset should be
            saved. (optional: :obj:`None`)
        transform (callable, optional): A function/transform that takes in an
            :obj:`torch_geometric.data.Data` object and returns a transformed
            version. The data object will be transformed before every access.
            (default: :obj:`None`)
        pre_transform (callable, optional): A function/transform that takes in
            an :obj:`torch_geometric.data.Data` object and returns a
            transformed version. The data object will be transformed before
            being saved to disk. (default: :obj:`None`)
        pre_filter (callable, optional): A function that takes in an
            :obj:`torch_geometric.data.Data` object and returns a boolean
            value, indicating whether the data object should be included in the
            final dataset. (default: :obj:`None`)
    """
    @property
    def raw_file_names(self) -> Union[str, List[str], Tuple]:
        r"""The name of the files in the :obj:`self.raw_dir` folder that must
        be present in order to skip downloading."""
        raise NotImplementedError

    @property
    def processed_file_names(self) -> Union[str, List[str], Tuple]:
        r"""The name of the files in the :obj:`self.processed_dir` folder that
        must be present in order to skip processing."""
        raise NotImplementedError

    def download(self):
        r"""Downloads the dataset to the :obj:`self.raw_dir` folder."""
        raise NotImplementedError

    def process(self):
        r"""Processes the dataset to the :obj:`self.processed_dir` folder."""
        raise NotImplementedError

    def len(self) -> int:
        r"""Returns the number of graphs stored in the dataset."""
        raise NotImplementedError

    def get(self, idx: int) -> Data:
        r"""Gets the data object at index :obj:`idx`."""
        raise NotImplementedError

    def __init__(self, root: Optional[str] = None,
                 transform: Optional[Callable] = None,
                 pre_transform: Optional[Callable] = None,
                 pre_filter: Optional[Callable] = None):
        super().__init__()

        if isinstance(root, str):
            root = osp.expanduser(osp.normpath(root))

        self.root = root
        self.transform = transform
        self.pre_transform = pre_transform
        self.pre_filter = pre_filter
        self._indices: Optional[Sequence] = None

        if 'download' in self.__class__.__dict__:
            self._download()

        if 'process' in self.__class__.__dict__:
            self._process()

    def indices(self) -> Sequence:
        return range(self.len()) if self._indices is None else self._indices

    @property
    def raw_dir(self) -> str:
        return osp.join(self.root, 'raw')

    @property
    def processed_dir(self) -> str:
        return osp.join(self.root, 'processed')

    @property
    def num_node_features(self) -> int:
        r"""Returns the number of features per node in the dataset."""
        data = self[0]
        data = data[0] if isinstance(data, tuple) else data
        if hasattr(data, 'num_node_features'):
            return data.num_node_features
        raise AttributeError(f"'{data.__class__.__name__}' object has no "
                             f"attribute 'num_node_features'")

    @property
    def num_features(self) -> int:
        r"""Returns the number of features per node in the dataset.
        Alias for :py:attr:`~num_node_features`."""
        return self.num_node_features

    @property
    def num_edge_features(self) -> int:
        r"""Returns the number of features per edge in the dataset."""
        data = self[0]
        data = data[0] if isinstance(data, tuple) else data
        if hasattr(data, 'num_edge_features'):
            return data.num_edge_features
        raise AttributeError(f"'{data.__class__.__name__}' object has no "
                             f"attribute 'num_edge_features'")

    @property
    def raw_paths(self) -> List[str]:
        r"""The absolute filepaths that must be present in order to skip
        downloading."""
        files = to_list(self.raw_file_names)
        return [osp.join(self.raw_dir, f) for f in files]

    @property
    def processed_paths(self) -> List[str]:
        r"""The absolute filepaths that must be present in order to skip
        processing."""
        files = to_list(self.processed_file_names)
        return [osp.join(self.processed_dir, f) for f in files]

    def _download(self):
        if files_exist(self.raw_paths):  # pragma: no cover
            return

        makedirs(self.raw_dir)
        self.download()

    def _process(self):
        f = osp.join(self.processed_dir, 'pre_transform.pt')
        if osp.exists(f) and torch.load(f) != _repr(self.pre_transform):
            warnings.warn(
                f"The `pre_transform` argument differs from the one used in "
                f"the pre-processed version of this dataset. If you want to "
                f"make use of another pre-processing technique, make sure to "
                f"delete '{self.processed_dir}' first")

        f = osp.join(self.processed_dir, 'pre_filter.pt')
        if osp.exists(f) and torch.load(f) != _repr(self.pre_filter):
            warnings.warn(
                f"The `pre_filter` argument differs from the one used in the "
                f"pre-processed version of this dataset. If you want to make "
                f"use of another pre-filtering technique, make sure to delete "
                f"'{self.processed_dir}' first")

        if files_exist(self.processed_paths):  # pragma: no cover
            return

        print('Processing...', file=sys.stderr)

        makedirs(self.processed_dir)
        self.process()

        path = osp.join(self.processed_dir, 'pre_transform.pt')
        torch.save(_repr(self.pre_transform), path)
        path = osp.join(self.processed_dir, 'pre_filter.pt')
        torch.save(_repr(self.pre_filter), path)

        print('Done!', file=sys.stderr)

    def __len__(self) -> int:
        r"""The number of examples in the dataset."""
        return len(self.indices())

    def __getitem__(
        self,
        idx: Union[int, np.integer, IndexType],
    ) -> Union['Dataset', Data]:
        r"""In case :obj:`idx` is of type integer, will return the data object
        at index :obj:`idx` (and transforms it in case :obj:`transform` is
        present).
        In case :obj:`idx` is a slicing object, *e.g.*, :obj:`[2:5]`, a list, a
        tuple, or a :obj:`torch.Tensor` or :obj:`np.ndarray` of type long or
        bool, will return a subset of the dataset at the specified indices."""
        if (isinstance(idx, (int, np.integer))
                or (isinstance(idx, Tensor) and idx.dim() == 0)
                or (isinstance(idx, np.ndarray) and np.isscalar(idx))):

            data = self.get(self.indices()[idx])
            data = data if self.transform is None else self.transform(data)
            return data

        else:
            return self.index_select(idx)

    def index_select(self, idx: IndexType) -> 'Dataset':
        r"""Creates a subset of the dataset from specified indices :obj:`idx`.
        Indices :obj:`idx` can be a slicing object, *e.g.*, :obj:`[2:5]`, a
        list, a tuple, or a :obj:`torch.Tensor` or :obj:`np.ndarray` of type
        long or bool."""
        indices = self.indices()

        if isinstance(idx, slice):
            indices = indices[idx]

        elif isinstance(idx, Tensor) and idx.dtype == torch.long:
            return self.index_select(idx.flatten().tolist())

        elif isinstance(idx, Tensor) and idx.dtype == torch.bool:
            idx = idx.flatten().nonzero(as_tuple=False)
            return self.index_select(idx.flatten().tolist())

        elif isinstance(idx, np.ndarray) and idx.dtype == np.int64:
            return self.index_select(idx.flatten().tolist())

        elif isinstance(idx, np.ndarray) and idx.dtype == bool:
            idx = idx.flatten().nonzero()[0]
            return self.index_select(idx.flatten().tolist())

        elif isinstance(idx, Sequence) and not isinstance(idx, str):
            indices = [indices[i] for i in idx]

        else:
            raise IndexError(
                f"Only slices (':'), list, tuples, torch.tensor and "
                f"np.ndarray of dtype long or bool are valid indices (got "
                f"'{type(idx).__name__}')")

        dataset = copy.copy(self)
        dataset._indices = indices
        return dataset

    def shuffle(
        self,
        return_perm: bool = False,
    ) -> Union['Dataset', Tuple['Dataset', Tensor]]:
        r"""Randomly shuffles the examples in the dataset.

        Args:
            return_perm (bool, optional): If set to :obj:`True`, will also
                return the random permutation used to shuffle the dataset.
                (default: :obj:`False`)
        """
        perm = torch.randperm(len(self))
        dataset = self.index_select(perm)
        return (dataset, perm) if return_perm is True else dataset

    def __repr__(self) -> str:
        arg_repr = str(len(self)) if len(self) > 1 else ''
        return f'{self.__class__.__name__}({arg_repr})'


def to_list(value: Any) -> Sequence:
    if isinstance(value, Sequence) and not isinstance(value, str):
        return value
    else:
        return [value]


def files_exist(files: List[str]) -> bool:
    # NOTE: We return `False` in case `files` is empty, leading to a
    # re-processing of files on every instantiation.
    return len(files) != 0 and all([osp.exists(f) for f in files])


def _repr(obj: Any) -> str:
    if obj is None:
        return 'None'
    return re.sub('(<.*?)\\s.*(>)', r'\1\2', obj.__repr__())

DataLoader

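torch_geometric.data.DataLoader makes it easy to work with mini-batches. In PyG 2.0 and later the class is provided as torch_geometric.loader.DataLoader.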

    dataset = get_dataset(train_args['dataset'], 'test')
    dataloader = DataLoader(dataset,
                            batch_size=args.batch_size,
                            num_workers=1,
                            pin_memory=True,
                            shuffle=False)

Implementing LayoutGAN++ with PyG

Parameter settings

Data processing

Model definition

Generator(
  (fc_z): Linear(in_features=4, out_features=128, bias=True)
  (emb_label): Embedding(13, 128)
  (fc_in): Linear(in_features=256, out_features=256, bias=True)
  (transformer): TransformerEncoder(
    (layers): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (2): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (3): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (4): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (5): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (6): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
      (7): TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
        )
        (linear1): Linear(in_features=256, out_features=128, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=128, out_features=256, bias=True)
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
)

TransformerEncoder(
  (layers): ModuleList(
    (0): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (1): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (2): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (3): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (4): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (5): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (6): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
    (7): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
      )
      (linear1): Linear(in_features=256, out_features=128, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=128, out_features=256, bias=True)
      (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
)

TransformerEncoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
  )
  (linear1): Linear(in_features=256, out_features=128, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=128, out_features=256, bias=True)
  (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
)

Prediction results

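Sequence-to-sequence model (Seq2Seq)
Seq2Seq is the model to use when the output length is not known in advance. It is typically used in machine translation, human-machine dialogue, chatbots and similar generation tasks. For example, the Chinese input has length 4 and the English output has length 2.

In the network, a Chinese sequence is fed in and its English translation is produced. The part of the output generated so far is used to predict what follows: in the example above, the model first outputs "machine", feeds "machine" back in as the next input, and then outputs "learning". In this way a sequence of arbitrary length can be produced.
Seq2Seq is one kind of encoder-decoder architecture. The basic idea is to use two RNNs, one as the encoder and the other as the decoder. The encoder compresses the input sequence into a vector of fixed length, which can be regarded as the semantics of the sequence; this process is called encoding, as shown in the figure below. The simplest way to obtain the semantic vector is to take the hidden state of the last input as the semantic vector C. Alternatively, a transformation can be applied to the last hidden state, or to all hidden states of the input sequence, to obtain the semantic vector.

The decoder generates the target sequence from the semantic vector; this process is called decoding, as shown in the figure below. The simplest approach is to feed the semantic vector produced by the encoder into the decoder RNN as its initial state and read off the output sequence. The output of the previous time step becomes the input of the current time step, and the semantic vector C takes part in the computation only as the initial state; later steps do not use C.

There is another way for the decoder to work, in which the semantic vector C takes part in the computation at every time step, as shown in the figure below. The output of the previous time step is still the input of the current time step, but C is used at all time steps.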
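A minimal sketch of the first decoding scheme (C used only as the decoder's initial state) might look like this. Everything here is an assumption for illustration: the GRU layers, the sizes, the SOS token id and the greedy choice of the next token; the weights are random, so the generated ids carry no meaning.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, hidden = 10, 16            # hypothetical sizes
SOS = 0                                # hypothetical start-of-sequence token id

embed = nn.Embedding(vocab_size, hidden)
encoder = nn.GRU(hidden, hidden, batch_first=True)
decoder = nn.GRU(hidden, hidden, batch_first=True)
project = nn.Linear(hidden, vocab_size)

src = torch.tensor([[1, 2, 3, 4]])     # input sequence of length 4

with torch.no_grad():
    # Encoding: the last hidden state serves as the semantic vector C.
    _, C = encoder(embed(src))         # C: [1, 1, hidden]

    # Decoding: C is only the initial state; each output is fed back as the next input.
    state, token, generated = C, torch.tensor([[SOS]]), []
    for _ in range(2):                 # produce an output of length 2
        out, state = decoder(embed(token), state)
        probs = torch.softmax(project(out), dim=-1)   # distribution over the vocabulary
        token = probs.argmax(dim=-1)                  # greedy choice, shape [1, 1]
        generated.append(token.item())

print(C.shape)       # torch.Size([1, 1, 16])
print(generated)     # two token ids; arbitrary, since nothing is trained
```

For the second scheme, C would additionally be concatenated to the decoder input at every step instead of being used once.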

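An RNN can learn a probability distribution and then make predictions, for example predicting the data at time t+1 after seeing the data at time t. Common uses are character prediction and time-series prediction. To obtain a probability distribution, a softmax activation is usually applied at the RNN's output layer, which yields the probability of each class.

The basic Seq2Seq model has several drawbacks. First, encoding the input into a fixed-size state vector is in effect a lossy compression: the more information the input carries, the more is lost in the conversion. Second, as the sequence length grows, the sequence becomes long along the time dimension and the RNN suffers from vanishing gradients. Finally, the only component connecting the Encoder and the Decoder is a single fixed-size state vector, so the Decoder cannot attend directly to the finer details of the input.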
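Attention Mechanism
The attention mechanism originates from research on human vision. In cognitive science, because of bottlenecks in information processing, humans selectively focus on part of the available information while ignoring the rest. This is usually called the attention mechanism. Different parts of the human retina have different information-processing capability, i.e. acuity, and only the fovea at the centre of the retina has the highest acuity. To make good use of limited visual processing resources, humans select a specific part of the visual field and concentrate on it. In short, the attention mechanism has two main aspects:

Deciding which part of the input to attend to.
Allocating the limited information-processing resources to the important parts.

In computer vision, attention mechanisms were introduced for processing visual information. Attention is a mechanism, or a methodology, without a strict mathematical definition. For example, traditional local image feature extraction, saliency detection and sliding-window methods can all be regarded as attention mechanisms. In neural networks, an attention module is usually an additional neural network that either hard-selects certain parts of the input or assigns different weights to different parts of the input.
Most work that combines deep learning with visual attention uses masks to form the attention. The idea of a mask is to use an additional layer of weights to mark the key features in the image data; through training, the network learns which regions of each new image need attention, and attention is thus formed.
Attention mechanisms can be divided into four categories: item-wise soft attention, item-wise hard attention, location-wise soft attention and location-wise hard attention.
Item-wise attention requires the input to contain an explicit sequence of items, or an extra preprocessing step that produces one (an item can be a vector, a matrix, or a feature map). Location-wise attention is designed for inputs that are a single feature map, where every target can be specified by its location.
Overall, there are two kinds: soft attention and hard attention.
The key point of soft attention is that it focuses on regions or channels. Soft attention is deterministic: once learned, it can be produced directly by the network. Most importantly, soft attention is differentiable, so gradients can be computed through the network and the attention weights can be learned by forward propagation and backpropagation.

Hard attention focuses on points: every point in the image may give rise to attention. Hard attention is also a stochastic prediction process that emphasizes dynamic change. Most importantly, hard attention is not differentiable, and training is usually done with reinforcement learning.

From the perspective of attention domains, there are three main domains: the spatial domain, the channel domain and the mixed domain. There is also the time domain, which is realized with hard attention; because hard attention is implemented with reinforcement learning, it is trained differently.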

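According to the universal approximation theorem, feedforward and recurrent networks are both highly expressive, but they are still limited by computing power and by optimization algorithms:

Limited computing power: remembering a lot of "information" requires a more complex model, and computing power is still a bottleneck for the development of neural networks.
Limited optimization algorithms: local connectivity, weight sharing and pooling make neural networks simpler and ease the tension between model complexity and expressive power. However, for problems such as long-range dependencies in recurrent networks, the ability to "remember" information is still weak.
Borrowing the way the human brain copes with information overload, for instance through an attention mechanism, can improve a neural network's ability to process information.

When a neural network has to process a large amount of input, it can borrow the brain's attention mechanism and process only a few key pieces of the input, which improves its efficiency. Following cognitive neuroscience, attention falls into two broad categories:

Focus attention: top-down, conscious attention. This active attention has a predetermined purpose, depends on the task, and deliberately focuses on a particular object.
Saliency-based attention: bottom-up, unconscious attention. This passive attention is driven by external stimuli, needs no active intervention and is unrelated to the task. Max-pooling and gating mechanisms can be viewed as approximations of bottom-up, saliency-based attention.

In artificial neural networks, "attention mechanism" usually refers to focus attention.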

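Most attention models today are attached to the Encoder-Decoder framework, although the attention model is really a general idea that does not depend on any particular framework. The figure below shows the most abstract form of the Encoder-Decoder framework commonly used in text processing.

The Encoder-Decoder framework in text processing can be seen as a general model for generating one sentence (or passage) from another. For a sentence pair <Source, Target>, the goal is to generate the target sentence Target from the given input sentence Source through the Encoder-Decoder framework. Source and Target may be in the same language or in two different languages. Each consists of its own word sequence, $Source = \langle x_1, x_2, \dots, x_m \rangle$ and $Target = \langle y_1, y_2, \dots, y_n \rangle$.

The Encoder encodes the input sentence Source, turning it through a nonlinear transformation into the intermediate semantic representation C. Writing this transformation as $F$, $C = F(x_1, x_2, \dots, x_m)$.

The decoder generates the word $y_i$ at step $i$ from the intermediate semantic representation C of Source and the history $y_1, y_2, \dots, y_{i-1}$ generated so far. Writing the decoder as $G$, $y_i = G(C, y_1, y_2, \dots, y_{i-1})$.

Each $y_i$ is produced in turn, so the system as a whole generates the target sentence Target from the input sentence Source.
The Encoder-Decoder framework in the figure above does not reflect an "attention model"; it can be seen as a distracted model whose attention is not focused. Whichever target word is being generated, the semantic encoding C of the input sentence Source that is used is the same, with no difference at all.

The Encoder-Decoder framework with an attention model added is shown in the figure below:

The generation of the target words then takes the following form, in which the fixed C is replaced by a context $C_i$ that changes with the word being generated: $y_i = G(C_i, y_1, y_2, \dots, y_{i-1})$.

Stripping attention out of the Encoder-Decoder framework and abstracting further reveals the essential idea of the attention mechanism.

Think of the elements of Source as a series of <Key, Value> pairs. Given an element Query of Target, the similarity or relevance between the Query and each Key gives the weight of the Value belonging to that Key, and the weighted sum of the Values is the final Attention value. In essence, the attention mechanism is a weighted sum over the Values of the elements of Source, with Query and Key used to compute the weight of each Value. The idea can be written as the following formula:

$Attention(Query, Source) = \sum_{i=1}^{L_x} Similarity(Query, Key_i) \cdot Value_i$

Here $L_x=||Source||$ is the length of Source. In the attention computation described earlier, the Key and Value in Source coincide and refer to the same thing, namely the semantic encoding of each word of the input sentence, so this structure may not be easy to see there. Conceptually, though, attention can still be understood as selectively filtering a small amount of important information out of a large amount and focusing on it while ignoring most of the unimportant information. The focusing shows up in the computation of the weights: the larger the weight, the stronger the focus on the corresponding Value. The weight represents the importance of the information, and the Value is the information itself.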
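The attention computation proceeds in three stages:

Stage 1: compute the similarity or relevance between the Query and each Key_i. The most common choices are the dot product of the two vectors, their cosine similarity, or an additional neural network that produces the score.

Stage 2: convert the stage-1 scores with a SoftMax-like computation. On the one hand this normalizes the raw scores into a probability distribution whose weights sum to 1; on the other hand the nature of SoftMax emphasizes the weights of the important elements.

Stage 3: take the weighted sum of the Values. The stage-2 result $a_i$ is the weight of $value_i$, and the weighted sum gives the Attention value.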
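The three stages can be sketched in a few lines of PyTorch. The dot product is used as the similarity function here, which is only one of the options listed above, and the sizes and random tensors are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
L_x, d = 4, 8                      # hypothetical source length and vector size
keys = torch.randn(L_x, d)         # Key_i
values = torch.randn(L_x, d)       # Value_i
query = torch.randn(d)             # Query

# Stage 1: similarity between the Query and every Key_i (dot product here).
scores = keys @ query                               # shape [L_x]

# Stage 2: softmax turns the scores into weights a_i that sum to 1.
a = torch.softmax(scores, dim=0)                    # shape [L_x]

# Stage 3: weighted sum of the Values.
attention = (a.unsqueeze(1) * values).sum(dim=0)    # shape [d]

print(a.sum().item())        # 1.0 up to rounding
print(attention.shape)       # torch.Size([8])
```

Setting values = keys gives the plain mode in which Key and Value are the same vectors; keeping them separate gives the key-value mode.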

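Attention can also be understood as soft addressing. Source is the content of a memory whose elements consist of an address Key and a value Value. A query with Key = Query arrives, and the goal is to retrieve the corresponding Value from the memory, i.e. the Attention value. Addressing is done by comparing the Query with the Key address of each element. It is called soft because content may be taken from every Key address, with its importance determined by the similarity between Query and Key; the Values are then summed with these weights, which yields the final Value, i.e. the Attention value.

This scheme is the soft attention mechanism. It comes in two modes: the plain mode (Key = Value = X) and the key-value mode (Key != Value).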

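Variants of the attention mechanism

The workflow of a basic model that combines attention with an RNN:

The encoder encodes the input sequence, producing the state $c$ of the last time step and the output $h$ of every time step; $c$ also serves as the decoder's initial state $z_0$.
The output $h$ of every time step is matched against $z_0$, giving a matching score for each time step (for example $\alpha_0^1$ for the first time step), as shown in the figure below:

The matching scores between all time-step outputs $h$ and $z_0$ are normalized with softmax, giving each time step's normalized score with respect to $z_0$.
The weighted sum of the time-step outputs $h$ with these scores gives $c^0$, which is the decoder's input at the next time step, as shown in the figure below:

The matching scores between the time-step outputs $h$ and $z_1$ are then computed to obtain $c^1$ as the decoder's next input, and so on step by step, as shown in the figure below:

Illustrated Seq2Seq Attention model: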

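Self-attention model
When a neural network processes a variable-length sequence of vectors, a convolutional or recurrent network can be used to encode it into an output sequence of the same length, as shown in the figure below:

As the figure shows, both convolutional and recurrent networks are really a kind of "local encoding" of the variable-length sequence: a convolutional network is clearly a local encoding based on N-grams, and a recurrent network can only establish short-range dependencies because of problems such as vanishing gradients.
To establish long-range dependencies within the input sequence, there are two options: one is to add layers and obtain long-range interactions through a deep network; the other is to use a fully connected network. The figure below shows the fully connected model and the self-attention model; solid lines are learnable weights and dashed lines are dynamically generated weights.

As the figure above shows, a fully connected network models long-range dependencies very directly but cannot handle variable-length input, because different input lengths require connection weights of different sizes. Attention can be used to generate the connection weights "dynamically", which is the self-attention model. Because its weights are generated dynamically, self-attention can handle variable-length sequences.
Computation flow of the self-attention model:

Self-attention usually uses the scaled dot product as its scoring function, so with Q, K and V stored row-wise the output vector sequence can be written as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the keys.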

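Where do Q, K and V come from? Suppose the input sequence is "我是谁" ("Who am I") and has already been represented in some way as a matrix of shape 3x4. Q, K and V are then obtained as follows.

Computing Q, K and V: when the Encoder and the Decoder encode their respective inputs with self-attention, Q, K and V are simply the input X multiplied by three different matrices.

Computing the attention weights: Q, K and V can be understood as three different linear transformations of the same input, representing three different states of it. Once Q, K and V are available, the weight matrix can be computed as shown below; scale and softmax have already been applied. In the first row of the weight matrix, 0.7 is the attention value between "我" and "我", 0.2 is the attention value between "我" and "是", and 0.1 is the attention value between "我" and "谁". In other words, when encoding "我", 0.7 of the attention should go to "我", 0.2 to "是" and 0.1 to "谁". In the third row, when encoding "谁", 0.2 of the attention should go to "我", 0.1 to "是" and 0.7 to "谁". This weight matrix therefore tells the model how to distribute attention over the different positions when encoding the vector at each position.

Weights and encoded output: once the weight matrix is available, it is applied to V to obtain the final encoded output.

Viewed from another angle, the final encoded vector for "是" is just the weighted sum of the three vectors of "我是谁", which shows the whole process of distributing attention weights when encoding "是".

The procedure above can be summarized as follows.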
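The whole procedure for a three-token input with a 3x4 representation fits in a few lines. In this sketch X and the three projection matrices are random stand-ins (nothing is trained), and the key dimension d_k = 4 is an assumption, so the weights will not match the 0.7 / 0.2 / 0.1 values quoted above.

```python
import math
import torch

torch.manual_seed(0)
X = torch.randn(3, 4)            # stand-in for the 3x4 representation of the 3 tokens
d_k = 4                          # assumed size of Q, K and V
W_q, W_k, W_v = (torch.randn(4, d_k) for _ in range(3))   # random, not trained

Q, K, V = X @ W_q, X @ W_k, X @ W_v                         # three linear maps of the same input
weights = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # [3, 3], scaled and normalized
output = weights @ V                                        # [3, d_k]

print(weights.sum(dim=-1))       # every row sums to 1
# The encoding of token 1 is the weighted sum of the three rows of V.
print(torch.allclose(output[1], (weights[1].unsqueeze(1) * V).sum(dim=0)))   # True
```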

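The drawback of self-attention is that, when encoding the information at the current position, the model tends to focus its attention excessively on its own position. To address this, multi-head attention lets the output of the attention layer contain encoded information from different subspaces, which strengthens the model's expressive power.

As the figure above shows, multi-head attention simply runs several groups of self-attention over the original input sequence, concatenates the result of each group, and applies one more linear transformation to obtain the final output. The multi-head attention used in the paper in effect splits one large high-dimensional single head into h heads.

With the model dimension fixed, there is no difference between the single-head and the multi-head approach up to the point where the attention weight matrices are computed. When the attention weight matrices are computed, the larger h is, the smaller the slice each head receives, and the more ways of distributing the attention weights are obtained.

As the figure shows, with h = 1 the result may be an attention weight matrix in which every position attends only to itself; with h = 2 one may additionally obtain another weight matrix whose attention is distributed somewhat more sensibly; and so on for larger h. Multiple heads are therefore exactly what the authors of the paper proposed to overcome the problem of the model focusing excessively on its own position when encoding the information at the current position.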
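The splitting into heads can be sketched as follows. The sizes (d_model = 8, h = 2) and the random projection matrices, including the output matrix W_o, are assumptions for illustration; torch.nn.MultiheadAttention provides a full implementation.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model, h = 3, 8, 2          # hypothetical sizes; d_model must be divisible by h
d_head = d_model // h
X = torch.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = (torch.randn(d_model, d_model) for _ in range(4))

# Identical to the single-head case up to this point.
Q, K, V = X @ W_q, X @ W_k, X @ W_v    # each [seq_len, d_model]


def split_heads(t):
    # [seq_len, d_model] -> [h, seq_len, d_head]
    return t.view(seq_len, h, d_head).transpose(0, 1)


Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)
scores = Qh @ Kh.transpose(-2, -1) / math.sqrt(d_head)
weights = torch.softmax(scores, dim=-1)            # [h, seq_len, seq_len]
heads = weights @ Vh                               # [h, seq_len, d_head]

# Concatenate the heads and apply the final linear map.
output = heads.transpose(0, 1).reshape(seq_len, d_model) @ W_o

print(weights.shape)   # torch.Size([2, 3, 3]): one weight matrix per head
print(output.shape)    # torch.Size([3, 8])
```

With h = 1 there would be a single 3x3 weight matrix; with h = 2 each head works on a 4-dimensional slice and produces its own weight matrix.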

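How the NLP Transformer works
Model structure: like most seq2seq models, the transformer consists of an encoder and a decoder.

Encoder

Decoder

The principle of masked attention is shown in the figure (multi-head attention is also shown):

Positional Encoding

Input and output details of the transformer in PyTorch

Positional encoding and the decoding process
According to the principle of self-attention, attention in practice is no more than a few matrix multiplications that perform linear transformations. Consequently, even if the order of the words is shuffled, the final result is essentially unchanged; in other words, self-attention alone loses the original order information of the text.

In the figure above, the word-embedded sequence "我在看书" ("I am reading a book") undergoes a linear transformation. Now change the sequence to "书在看我" ("the book is reading me") and apply the same weight matrix in the middle.

The results show that the outputs before and after swapping positions are essentially the same; only the corresponding positions are swapped. For this reason, after Token Embedding of the raw input text, the Transformer additionally adds a Positional Embedding to describe the order of the data in the sequence.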
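The claim that swapping tokens only swaps the corresponding outputs can be verified numerically. The sketch below uses random embeddings and weights as assumptions, swaps the first and last of four tokens (as in the "我在看书" / "书在看我" example), and compares the two outputs; float64 is used to keep rounding noise negligible.

```python
import math
import torch


def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
    return weights @ V


torch.manual_seed(0)
X = torch.randn(4, 8, dtype=torch.float64)          # 4 token embeddings
W_q, W_k, W_v = (torch.randn(8, 8, dtype=torch.float64) for _ in range(3))

perm = torch.tensor([3, 1, 2, 0])                   # swap the first and last token
out = self_attention(X, W_q, W_k, W_v)
out_swapped = self_attention(X[perm], W_q, W_k, W_v)

# The output of the swapped sequence is the original output with rows swapped.
print(torch.allclose(out[perm], out_swapped))       # True
```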

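Constant positional embedding
After Token Embedding, a Positional Embedding carrying constant position information is added to the raw input. One more linear transformation then gives the result shown in the figure below.

After swapping the positions in the sequence, applying the same Positional Embedding and the same linear transformation gives the result below, which is essentially unchanged from the result above. So if the position information in the Positional Embedding is a constant, the Positional Embedding has no effect.

Non-constant positional embedding
With a non-constant Positional Embedding, the comparison looks as follows. Before and after swapping positions, the results of the linear transformation with the same weight matrix are completely different. This shows that a Positional Embedding can make up for self-attention's inability to capture the order information of the sequence.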
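The difference between the two kinds of positional embedding can be reproduced with a small experiment. The two embeddings below are simple stand-ins chosen for illustration (a constant vector, and a value proportional to the position index), not the sinusoidal encoding of the Transformer paper.

```python
import math
import torch


def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1)
    return weights @ V


torch.manual_seed(0)
n, d = 4, 8
X = torch.randn(n, d, dtype=torch.float64)
W_q, W_k, W_v = (torch.randn(d, d, dtype=torch.float64) for _ in range(3))
perm = torch.tensor([3, 1, 2, 0])                    # swap the first and last token

pe_const = torch.full((n, d), 0.5, dtype=torch.float64)   # same vector at every position
pe_pos = (torch.arange(n, dtype=torch.float64) / n).unsqueeze(1).expand(n, d)  # depends on position

results = {}
for name, pe in [("constant", pe_const), ("position-dependent", pe_pos)]:
    out = self_attention(X + pe, W_q, W_k, W_v)
    out_swapped = self_attention(X[perm] + pe, W_q, W_k, W_v)
    # True means swapping the tokens merely swapped the output rows.
    results[name] = torch.allclose(out[perm], out_swapped)

print(results)   # {'constant': True, 'position-dependent': False}
```

With the constant embedding the swapped sequence still produces the same outputs in a different order, so the order information is lost; with the position-dependent embedding the outputs genuinely change.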

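Masks in the Transformer

Attention Mask
During training, every sample needs a square mask matrix of this kind (triangular, not symmetric) to hide the information at all positions after the current time step.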
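A causal mask of this kind can be built with torch.triu. In the sketch below the attention scores are all zero (an assumption that makes the result easy to read), so each row spreads its attention evenly over the positions it is allowed to see.

```python
import torch

seq_len = 4
scores = torch.zeros(seq_len, seq_len)        # placeholder attention scores

# True above the diagonal marks the "future" positions that must be hidden.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
masked = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights)
# tensor([[1.0000, 0.0000, 0.0000, 0.0000],
#         [0.5000, 0.5000, 0.0000, 0.0000],
#         [0.3333, 0.3333, 0.3333, 0.0000],
#         [0.2500, 0.2500, 0.2500, 0.2500]])
```

Row i is the query at time step i; everything to the right of the diagonal receives weight 0.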

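Padding Mask
During training, one batch contains several text sequences of different lengths. When the dataset is built, the sequences of a batch therefore have to be padded to the same length. As a consequence, the attention computation would also take the information at the padding positions into account.

In the figure, P marks the padding positions and the matrix on the right is the computed attention weight matrix. At this point the attention weights still take the padding positions into account. In the Transformer, the authors therefore record the actual padding positions of every sample while generating the training set, and then replace the entries at those positions in the attention score matrix with negative infinity. After softmax, the weights at the padding positions become 0, so the information at the padding positions is ignored.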
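The padding mask can be sketched in the same way. The padding id PAD = 0, the token ids and the random scores are assumptions for illustration; the point is only that the key columns belonging to padding are set to negative infinity before the softmax.

```python
import torch

torch.manual_seed(0)
PAD = 0                                         # hypothetical id of the padding token
tokens = torch.tensor([[5, 7, 9, PAD, PAD]])    # one sequence padded to length 5
scores = torch.randn(1, 5, 5)                   # stand-in scores, [batch, query, key]

key_is_pad = tokens.eq(PAD)                     # [1, 5], True at padded positions
masked = scores.masked_fill(key_is_pad.unsqueeze(1), float("-inf"))  # mask the key columns
weights = torch.softmax(masked, dim=-1)

print(weights[0, :, 3:].sum().item())   # 0.0: no attention lands on padding
print(weights[0].sum(dim=-1))           # every row still sums to 1
```

torch.nn.MultiheadAttention accepts a mask of this shape through its key_padding_mask argument.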