graph4nlp (Chinese Edition)

1. Preface


  In the GNN and NLP chapters we gave a broad overview of graph neural networks and natural language processing. In this post we look at the engineering practice of applying graph deep learning to NLP, drawing on the graph4nlp website. The purpose is twofold: first, to promote graph4nlp as a practical tool; second, to explain how graph neural network algorithms can be applied to the NLP domain.

  Installing graph4nlp: follow the official installation instructions; reference: the graph4nlp website.

2. Graph Datasets


  Graph4NLP is an easy-to-use library for research on graph deep learning and natural language processing, offering researchers and developers ready-made model components. Built on top of DGL, Graph4NLP combines efficient execution with good extensibility. Specifically, the library features ease of use, flexibility, rich examples, high runtime efficiency and excellent scalability. The structure of Graph4NLP is as follows:
Figure 1. The layered architecture of graph4nlp

  Back to the topic: let us first look at how graph data is built. Graph4NLP's GraphData class provides the methods for constructing and manipulating graph data.

2.1 Building Graph Data

import graph4nlp
from graph4nlp import data

g = data.GraphData() # Construct an empty graph
g.add_nodes(10) # Add 10 nodes to this graph sequentially.
g.get_node_num()
>>> 10
g.add_nodes(9)  # Add another 9 nodes. This operation will append the new nodes to existing ones.
g.get_node_num()
>>> 19
g.add_edges([0, 1, 2], [1, 2, 3])   # Add 3 edges, connecting nodes 0~3 into a line.
g.get_all_edges()
>>> [(0, 1), (1, 2), (2, 3)]

The GraphData class provides two accessors, GraphData.nodes() and GraphData.edges(), for manipulating nodes and edges respectively. There are also other ways to populate a graph, such as from_dgl(), from_dense_adj(), from_scipy_sparse_matrix() and from_graphdata().

import graph4nlp
from graph4nlp import data
import torch

g = data.GraphData()
g.add_nodes(10)
for i in range(10):
    g.add_edge(src=i, tgt=(i + 1) % 10)
g.node_features['node_feat'] = torch.randn((10, 10))
g.node_features['zero'] = torch.zeros(10)
g.node_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)
g.edge_features['edge_feat'] = torch.randn((10, 10))
g.edge_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)


# Test to_dgl
dgl_g = g.to_dgl()
for node_feat_name in g.node_feature_names():
    if g.node_features[node_feat_name] is None:
        assert node_feat_name not in dgl_g.ndata.keys()
    else:
        assert torch.all(torch.eq(dgl_g.ndata[node_feat_name], g.node_features[node_feat_name]))
        
for edge_feat_name in g.get_edge_feature_names():
    if g.edge_features[edge_feat_name] is None:
        assert edge_feat_name not in dgl_g.edata.keys()
    else:
        assert torch.all(torch.eq(dgl_g.edata[edge_feat_name], g.edge_features[edge_feat_name]))
print(g.get_node_num(),dgl_g.number_of_nodes())

src, tgt = dgl_g.all_edges()
dgl_g_edges = []
for i in range(src.shape[0]):
    dgl_g_edges.append((int(src[i]), int(tgt[i])))
print(g.get_all_edges(), dgl_g_edges)
# Test from_dgl

g1 = data.GraphData()
g1.from_dgl(dgl_g)   # Populate a fresh GraphData from the DGLGraph, keeping g intact for the comparisons below
for node_feat_name in g.node_feature_names():
    try:
        assert torch.all(torch.eq(g1.node_features[node_feat_name], g.node_features[node_feat_name]))
    except TypeError:
        assert g1.node_features[node_feat_name] == g.node_features[node_feat_name]
        
for edge_feat_name in g.get_edge_feature_names():
    try:
        assert torch.all(torch.eq(g1.edge_features[edge_feat_name], g.edge_features[edge_feat_name]))
    except TypeError:
        assert g1.edge_features[edge_feat_name] == g.edge_features[edge_feat_name]
print(g1.get_node_num(),g.get_node_num())
print(g1.get_all_edges(), g.get_all_edges())

Figure 2. Output of the conversion example

  Graph data can be converted to other formats; currently, to_dgl() converts a GraphData object into a dgl.DGLGraph.
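
  As a complement to the constructors listed earlier, below is a minimal sketch of building a graph from a dense adjacency matrix. The exact call form of from_dense_adj (a function taking a 2-D tensor whose non-zero entries become edges) is an assumption here; consult the API reference before relying on it.

import torch
from graph4nlp import data

adj = torch.tensor([[0., 1., 0.],
                    [0., 0., 1.],
                    [1., 0., 0.]])
g2 = data.from_dense_adj(adj)   # assumed call form: each non-zero entry (i, j) becomes an edge
print(g2.get_all_edges())       # expected: [(0, 1), (1, 2), (2, 0)]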

2.2 Working with Graph Data


  In the GraphData class, information is attached either to nodes or to edges, and each kind can further be divided into features and attributes.

2.2.1 Features

import torch
import graph4nlp
from graph4nlp import data

g = data.GraphData()
g.add_nodes(10)

# Note that the first dimension of the features represents the number of instances (nodes/edges).
# Any manipulation of the features must keep this dimension consistent with the number of instances.
# An invalid example:
g.node_features['node_feat'] = torch.randn((9, 10))

Figure 3. Feature size does not match the number of nodes

g.node_features['node_feat'] = torch.randn((10, 10))
g.node_features['zero'] = torch.zeros(10)
g.node_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)
g.node_features

Figure 4. Output

  To work with individual nodes and edges, we can access their features via GraphData.nodes[node_index].features or GraphData.edges[edge_index].features. Correspondingly, GraphData.node_features and GraphData.edge_features access the features of all nodes and edges in the graph. When nodes are added, their features default to zero padding.

g.add_nodes(1)
g.node_features     # Zero padding is performed

Figure 5. Zero padding
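
  As a quick illustration of the per-node accessor mentioned above, the sketch below reads the features of a few nodes; the exact indexing form and return shape are assumptions based on the description, so check the API reference before relying on them.

# A sketch of the per-node accessor described above; the return shape is an assumption.
sub_feat = g.nodes[[0, 1, 2]].features['node_feat']
print(sub_feat.shape)                      # expected: torch.Size([3, 10])

# Whole-graph access, as before (now 11 rows after the extra node was added):
print(g.node_features['node_feat'].shape)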

2.2.2 Attributes

g = data.GraphData()
g.add_nodes(2)  # Add 2 nodes to an empty graph
print(g.node_attributes)

g.node_attributes[1]['node_attr'] = 'hello'
print(g.node_attributes)

Figure 6. Node attributes

Differences between features and attributes:

  1. Storage type: features hold only numeric data, typically of type torch.Tensor. Regarding shape, if a graph has 10 nodes and 20 edges, then every node feature tensor has 10 as its first dimension and every edge feature tensor has 20 as its first dimension. Attributes, in contrast, can store data of any type and impose no shape constraint;
  2. Access pattern: although both features and attributes are indexed by two levels of keys (name and indices), features are stored as a dict mapping feature_name (string) to feature_value (torch.Tensor), so a feature_value is looked up by its feature_name, whereas attributes are stored as a list of dicts (see Figure 6) and are accessed by list index, as sketched below.
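
  The following minimal sketch contrasts the two access patterns (the feature and attribute names used here are illustrative):

import torch
from graph4nlp import data

g = data.GraphData()
g.add_nodes(3)

# Features: a dict keyed by feature name; each value is a tensor whose first
# dimension equals the number of nodes (or edges, for edge features).
g.node_features['h'] = torch.randn((3, 4))
print(g.node_features['h'].shape)        # torch.Size([3, 4])

# Attributes: a list with one dict per node, which can hold data of any type.
g.node_attributes[0]['token'] = 'hello'
g.node_attributes[1]['pos_tag'] = 'NN'
print(g.node_attributes[0])              # a dict containing 'token': 'hello'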

2.3 Batching Graph Data

import torch
import graph4nlp
from graph4nlp import data

g_list = []
batched_edges = []
graph_edges_list = []
# Build a number of graphs
for i in range(5):
    g = data.GraphData()
    g.add_nodes(10)
    for j in range(10):
        g.add_edge(src=j, tgt=(j + 1) % 10)
        batched_edges.append((i * 10 + j, i * 10 + ((j + 1) % 10)))
    g.node_features['idx'] = torch.ones(10) * i
    g.edge_features['idx'] = torch.ones(10) * i
    graph_edges_list.append(g.get_all_edges())
    g_list.append(g)

print(g_list)
# Test to_batch
batch = data.data.to_batch(g_list)

target_batch_idx = []
for i in range(5):
    for j in range(10):
        target_batch_idx.append(i)

# Expected behaviors
assert batch.batch == target_batch_idx
assert batch.get_node_num() == 50
assert batch.get_all_edges() == batched_edges

# Un-batching
graph_list = data.data.from_batch(batch)

for i in range(len(graph_list)):
    g = graph_list[i]
    # Expected behaviors
    print(g.get_all_edges(),graph_edges_list[i])
    print(g.get_node_num() == 10)
    print(torch.all(torch.eq(g.node_features['idx'], torch.ones(10) * i)))
    print(torch.all(torch.eq(g.edge_features['idx'], torch.ones(10) * i)))
    print("-"*115)

Figure 7. Batching results

  In deep learning, batching is a standard technique: because the dataset is too large, we feed the data in batches during training. Graph deep learning also uses batching, i.e. the GraphData instances in a list are grouped into small batches rather than feeding one huge graph all at once. The graph4nlp.data.data.to_batch function packs a list of GraphData instances into a batch, and graph4nlp.data.data.from_batch unpacks a batch back into a list.
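
  Since to_batch accepts a plain list of GraphData objects, it can also be used directly as the collate function of a standard PyTorch DataLoader; the wiring below is a minimal illustrative sketch rather than part of graph4nlp itself:

from torch.utils.data import DataLoader

# g_list is the list of GraphData instances built above.
loader = DataLoader(g_list, batch_size=2, shuffle=False,
                    collate_fn=data.data.to_batch)
for batched_graph in loader:
    # A full batch packs two 10-node graphs into one 20-node GraphData
    # (the last, smaller batch contains the remaining graph).
    print(batched_graph.get_node_num())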

2.4 Dataset and Local Custom Datasets

Figure 8. Workflow for building a Dataset

  Building a dataset involves three steps: downloading/collecting the raw data, preprocessing it, and finally specifying how the data is iterated. This mirrors how we build a torch.utils.data.DataLoader pipeline in deep learning. The second step itself consists of building the graph topology, building the vocabulary and vectorizing the data.

class Dataset:
    @property
    def raw_dir(self) -> str:
        """The directory where the raw data is stored."""
        return os.path.join(self.root, 'raw')

    @property
    def processed_dir(self) -> str:
        """The directory where the processed data is stored."""
        return os.path.join(self.root, 'processed', self.topology_subdir)
        


  To build a dataset from our own data, we can customize the modelling of raw text sentences, token lists and graph data.

import graph4nlp
from graph4nlp import data

class Text2TextDataItem(data.DataItem):
    def __init__(self, input_text, output_text, tokenizer, share_vocab=True):
        super(Text2TextDataItem, self).__init__(input_text, tokenizer)
        self.output_text = output_text
        self.share_vocab = share_vocab

    def extract(self):
        g: data.GraphData = self.graph
        input_tokens = []
        for i in range(g.get_node_num()):
            if self.tokenizer is None:
                tokenized_token = g.node_attributes[i]['token'].strip().split(' ')
            else:
                tokenized_token = self.tokenizer(g.node_attributes[i]['token'])

            input_tokens.extend(tokenized_token)

        if self.tokenizer is None:
            output_tokens = self.output_text.strip().split(' ')
        else:
            output_tokens = self.tokenizer(self.output_text)

        if self.share_vocab:
            return input_tokens + output_tokens
        else:
            return input_tokens, output_tokens


class JobsDataset(data.Text2TextDataset):
    def __init__(self, root_dir,
                 topology_builder, topology_subdir,
                 # pretrained_word_emb_file=None,
                 pretrained_word_emb_name="6B",
                 pretrained_word_emb_url=None,
                 pretrained_word_emb_cache_dir=None,
                 graph_type='static',
                 merge_strategy="tailhead", edge_strategy=None,
                 seed=None,
                 word_emb_size=300, share_vocab=True, lower_case=True,
                 thread_number=1, port=9000,
                 dynamic_graph_type=None,
                 dynamic_init_topology_builder=None,
                 dynamic_init_topology_aux_args=None,
                 for_inference=None,
                 reused_vocab_model=None):
        # Initialize the dataset. If the preprocessed files are not found, do the preprocessing and save them.
        super(JobsDataset, self).__init__(root_dir=root_dir, topology_builder=topology_builder,
                                          topology_subdir=topology_subdir, graph_type=graph_type,
                                          edge_strategy=edge_strategy, merge_strategy=merge_strategy,
                                          share_vocab=share_vocab, lower_case=lower_case,
                                          pretrained_word_emb_name=pretrained_word_emb_name,
                                          pretrained_word_emb_url=pretrained_word_emb_url,
                                          pretrained_word_emb_cache_dir=pretrained_word_emb_cache_dir,
                                          seed=seed, word_emb_size=word_emb_size,
                                          thread_number=thread_number, port=port,
                                          dynamic_graph_type=dynamic_graph_type,
                                          dynamic_init_topology_builder=dynamic_init_topology_builder,
                                          dynamic_init_topology_aux_args=dynamic_init_topology_aux_args,
                                          for_inference=for_inference,
                                          reused_vocab_model=reused_vocab_model)


  From the code above we can see that Text2TextDataItem inherits from data.DataItem. The DataItem base class defines an extract() method that returns the input and output tokens, and users can override this method. The parent class of JobsDataset is data.Text2TextDataset, whose extract() method returns the token list of the text graph.
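
  To make extract() concrete, here is a toy sketch that attaches a hand-built graph to a data item and extracts its tokens. The manual graph construction, the sample sentences and the direct assignment to item.graph are illustrative assumptions; in practice the graph is produced by the topology builder during preprocessing.

# Hypothetical toy example: attach a hand-built graph to a DataItem and extract its tokens.
item = Text2TextDataItem(input_text="list jobs in boston",
                         output_text="answer ( A , job ( A , boston ) )",
                         tokenizer=None, share_vocab=True)

g = data.GraphData()
tokens = "list jobs in boston".split()
g.add_nodes(len(tokens))
for i, tok in enumerate(tokens):
    g.node_attributes[i]['token'] = tok   # extract() reads the 'token' attribute of each node

item.graph = g
print(item.extract())   # input tokens followed by output tokens (shared vocabulary)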

  The customized download method is shown below: it first checks whether the files already exist and only downloads the missing ones. In the base Dataset class, download() is an abstract method.


# In the concrete dataset subclass, declare the raw files of each split:
@property
def raw_file_names(self):
    """3 reserved keys: 'train', 'val' (optional) and 'test', representing the dataset splits."""
    return {'train': 'train.txt', 'test': 'test.txt'}

class Dataset:
    def _download(self):
        if all([os.path.exists(raw_path) for raw_path in self.raw_file_paths.values()]):
            return

        os.makedirs(self.raw_dir, exist_ok=True)
        self.download()

    @abc.abstractmethod
    def download(self):
        """Download the raw data from the Internet."""
        raise NotImplementedError
        


  Next comes data preprocessing. As mentioned earlier, preprocessing consists of three steps: build_topology, build_vocab and vectorization.

def _process(self):
    if all([os.path.exists(processed_path) for processed_path in self.processed_file_paths.values()]):
        if 'val_split_ratio' in self.__dict__:
            UserWarning(
                "Loading existing processed files on disk. Your `val_split_ratio` might not work since the data have"
                "already been split.")
        return
    if self.for_inference and \
            all([(os.path.exists(processed_path) or self.processed_file_names['data'] not in processed_path) for
                 processed_path in self.processed_file_paths.values()]):
        return

    os.makedirs(self.processed_dir, exist_ok=True)

    self.read_raw_data()

    if self.for_inference:
        self.test = self.build_topology(self.test)
        self.vectorization(self.test)
        data_to_save = {'test': self.test}
        torch.save(data_to_save, self.processed_file_paths['data'])
    else:
        self.train = self.build_topology(self.train)
        self.test = self.build_topology(self.test)
        if 'val' in self.__dict__:
            self.val = self.build_topology(self.val)

        self.build_vocab()

        self.vectorization(self.train)
        self.vectorization(self.test)
        if 'val' in self.__dict__:
            self.vectorization(self.val)

        data_to_save = {'train': self.train, 'test': self.test}
        if 'val' in self.__dict__:
            data_to_save['val'] = self.val
        torch.save(data_to_save, self.processed_file_paths['data'])

        vocab_to_save = self.vocab_model
        torch.save(vocab_to_save, self.processed_file_paths['vocab'])


  Let us look at the complete example:

class Dataset:
    def build_topology(self, data_items):
        """
        Build graph topology for each item in the dataset. The generated graph is bound to the `graph` attribute of the
        DataItem.
        """
        total = len(data_items)
        thread_number = min(total, self.thread_number)
        pool = Pool(thread_number)
        res_l = []
        for i in range(thread_number):
            start_index = total * i // thread_number
            end_index = total * (i + 1) // thread_number

            """
            data_items, topology_builder,
                                graph_type, dynamic_graph_type, dynamic_init_topology_builder,
                                merge_strategy, edge_strategy, dynamic_init_topology_aux_args,
                                lower_case, tokenizer, port, timeout
            """
            r = pool.apply_async(self._build_topology_process,
                                 args=(data_items[start_index:end_index], self.topology_builder, self.graph_type,
                                       self.dynamic_graph_type, self.dynamic_init_topology_builder,
                                       self.merge_strategy, self.edge_strategy, self.dynamic_init_topology_aux_args,
                                       self.lower_case, self.tokenizer, self.port, self.timeout))
            res_l.append(r)
        pool.close()
        pool.join()

        data_items = []
        for i in range(thread_number):
            res = res_l[i].get()
            for data in res:
                if data.graph is not None:
                    data_items.append(data)

        return data_items

    def build_vocab(self):
        """
        Build the vocabulary. If `self.use_val_for_vocab` is `True`, use both training set and validation set for building
        the vocabulary. Otherwise only the training set is used.

        """
        data_for_vocab = self.train
        if self.use_val_for_vocab:
            data_for_vocab = self.val + data_for_vocab

        vocab_model = VocabModel.build(saved_vocab_file=self.processed_file_paths['vocab'],
                                       data_set=data_for_vocab,
                                       tokenizer=self.tokenizer,
                                       lower_case=self.lower_case,
                                       max_word_vocab_size=self.max_word_vocab_size,
                                       min_word_vocab_freq=self.min_word_vocab_freq,
                                       share_vocab=self.share_vocab,
                                       pretrained_word_emb_name=self.pretrained_word_emb_name,
                                       pretrained_word_emb_url=self.pretrained_word_emb_url,
                                       pretrained_word_emb_cache_dir=self.pretrained_word_emb_cache_dir,
                                       target_pretrained_word_emb_name=self.target_pretrained_word_emb_name,
                                       target_pretrained_word_emb_url=self.target_pretrained_word_emb_url,
                                       word_emb_size=self.word_emb_size)
        self.vocab_model = vocab_model

        return self.vocab_model

class Text2TextDataset:
    def vectorization(self, data_items):
        if self.topology_builder == IEBasedGraphConstruction:
            use_ie = True
        else:
            use_ie = False
        for item in data_items:
            graph: GraphData = item.graph
            token_matrix = []
            for node_idx in range(graph.get_node_num()):
                node_token = graph.node_attributes[node_idx]['token']
                node_token_id = self.vocab_model.in_word_vocab.getIndex(node_token, use_ie)
                graph.node_attributes[node_idx]['token_id'] = node_token_id

                token_matrix.append([node_token_id])
            if self.topology_builder == IEBasedGraphConstruction:
                for i in range(len(token_matrix)):
                    token_matrix[i] = np.array(token_matrix[i][0])
                token_matrix = pad_2d_vals_no_size(token_matrix)
                token_matrix = torch.tensor(token_matrix, dtype=torch.long)
                graph.node_features['token_id'] = token_matrix
                pass
            else:
                token_matrix = torch.tensor(token_matrix, dtype=torch.long)
                graph.node_features['token_id'] = token_matrix

            if use_ie and 'token' in graph.edge_attributes[0].keys():
                edge_token_matrix = []
                for edge_idx in range(graph.get_edge_num()):
                    edge_token = graph.edge_attributes[edge_idx]['token']
                    edge_token_id = self.vocab_model.in_word_vocab.getIndex(edge_token, use_ie)
                    graph.edge_attributes[edge_idx]['token_id'] = edge_token_id
                    edge_token_matrix.append([edge_token_id])
                if self.topology_builder == IEBasedGraphConstruction:
                    for i in range(len(edge_token_matrix)):
                        edge_token_matrix[i] = np.array(edge_token_matrix[i][0])
                    edge_token_matrix = pad_2d_vals_no_size(edge_token_matrix)
                    edge_token_matrix = torch.tensor(edge_token_matrix, dtype=torch.long)
                    graph.edge_features['token_id'] = edge_token_matrix

            tgt = item.output_text
            tgt_token_id = self.vocab_model.out_word_vocab.to_index_sequence(tgt)
            tgt_token_id.append(self.vocab_model.out_word_vocab.EOS)
            tgt_token_id = np.array(tgt_token_id)
            item.output_np = tgt_token_id


class Text2TextDataset:
    @staticmethod
    def collate_fn(data_list: [Text2TextDataItem]):
        graph_list = [item.graph for item in data_list]
        graph_data = to_batch(graph_list)

        output_numpy = [deepcopy(item.output_np) for item in data_list]
        output_str = [deepcopy(item.output_text.lower().strip()) for item in data_list]
        output_pad = pad_2d_vals_no_size(output_numpy)

        tgt_seq = torch.from_numpy(output_pad).long()
        return {
            "graph_data": graph_data,
            "tgt_seq": tgt_seq,
            "output_str": output_str
        }


  build_topology builds a text graph for each DataItem in the dataset and binds it to the corresponding DataItem object. This routine usually relies on functionality provided by the GraphConstruction module. Moreover, since each text graph is constructed independently of the others, multiple graphs can be built concurrently, which is where Python's multiprocessing module comes in.


  build_vocab collects all tokens that appear in the data items and builds the vocabulary from them. By default, graph4nlp.utils.vocab_utils.VocabModel is responsible for building the vocabulary and also represents the vocabulary itself. The built vocabulary becomes a member of the Dataset instance.
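
  Once built, the vocabulary can be queried directly; the sketch below uses the same vocabulary methods that appear in the vectorization code above and assumes a processed dataset instance named dataset:

# A minimal sketch (assuming `dataset` is a processed JobsDataset instance):
vocab = dataset.vocab_model
word_id = vocab.in_word_vocab.getIndex('jobs')                      # token -> index
seq_ids = vocab.out_word_vocab.to_index_sequence('answer ( A )')    # text -> list of indices
seq_ids.append(vocab.out_word_vocab.EOS)                            # append the EOS index
print(word_id, seq_ids)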


  vectorization is a lookup step that converts tokens from raw text into the numeric indices used for word embeddings. Since there are many ways to assign embedding vectors to tokens, this step is usually overridden by downstream classes.


  The runtime iteration over the dataset is performed by PyTorch's DataLoader. Since the basic constituent element is a DataItem, our job is to convert the low-level list of DataItem objects fetched by torch.utils.data.DataLoader into the batched data we want; Dataset.collate_fn() is designed to do exactly that.
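
  Putting the pieces together, runtime iteration looks roughly like the sketch below; SomeGraphConstruction and the data directory are placeholders for illustration, not names from graph4nlp:

from torch.utils.data import DataLoader

# A sketch: `SomeGraphConstruction` stands in for a topology builder from the
# GraphConstruction module, and 'data/jobs' for a local data directory.
dataset = JobsDataset(root_dir='data/jobs',
                      topology_builder=SomeGraphConstruction,
                      topology_subdir='my_graph')
train_loader = DataLoader(dataset.train, batch_size=24, shuffle=True,
                          collate_fn=dataset.collate_fn)
for batch in train_loader:
    graph_data, tgt_seq = batch['graph_data'], batch['tgt_seq']
    print(graph_data.get_node_num(), tgt_seq.shape)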

3. Closing Remarks


  In follow-up posts we will continue with graph construction as well as graph encoders and decoders; stay tuned. Given limited time and a heavy workload, there may be mistakes; criticism and corrections are welcome. Thank you.

graph4nlp official website
