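1. Preface
In the earlier chapters on GNNs and NLP, we gave a broad overview of graph neural networks and natural language processing. This post turns to the engineering practice of applying graph deep learning to NLP, using the graph4nlp website as the basis. The aim is twofold: first, to promote practical tools such as graph4nlp; second, to show how graph neural network algorithms are applied in NLP.
Installing graph4nlp: follow the installation instructions on the official website. Source: the graph4nlp website.
2. Graph Datasets
Graph4NLP is an easy-to-use library for research at the intersection of graph deep learning and natural language processing, giving researchers and developers a solid set of modeling tools. Built on DGL, it combines runtime efficiency with good extensibility. Specifically, the library is easy to use, flexible, rich in learning examples, efficient, and highly extensible. The overall structure of Graph4NLP is as follows: [figure: Graph4NLP architecture]
With that out of the way, let us look at how to build a graph dataset. The GraphData class in Graph4NLP provides the methods for creating and manipulating graph data.
2.1 Building a Dataset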
import graph4nlp
from graph4nlp import data
g = data.GraphData() # Construct an empty graph
g.add_nodes(10) # Add 10 nodes to this graph sequentially.
g.get_node_num()
>>> 10
g.add_nodes(9) # Add another 9 nodes. This operation will append the new nodes to existing ones.
g.get_node_num()
>>> 19
g.add_edges([0, 1, 2], [1, 2, 3]) # Add 3 edges, connecting nodes 0~3 into a line.
g.get_all_edges()
>>> [(0, 1), (1, 2), (2, 3)]
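The GraphData class exposes two views, GraphData.nodes() and GraphData.edges(), for operating on nodes and edges respectively. A graph can also be populated in other ways, for example with from_dgl(), from_dense_adj(), from_scipy_sparse_matrix(), and from_graphdata().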
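To make the idea behind adjacency-based constructors such as from_dense_adj() more concrete, here is a minimal, library-independent sketch of how a dense adjacency matrix maps to an edge list. The helper name `dense_adj_to_edges` is hypothetical and is not part of Graph4NLP; the real method may treat edge weights and ordering differently.

```python
from typing import List, Tuple


def dense_adj_to_edges(adj: List[List[float]]) -> List[Tuple[int, int]]:
    """Return a (src, tgt) pair for every non-zero entry of a square adjacency matrix."""
    n = len(adj)
    if any(len(row) != n for row in adj):
        raise ValueError("adjacency matrix must be square")
    edges = []
    for src in range(n):
        for tgt in range(n):
            if adj[src][tgt] != 0:
                edges.append((src, tgt))
    return edges


# A 4-node path graph 0 -> 1 -> 2 -> 3 (hypothetical toy input).
path_adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(dense_adj_to_edges(path_adj))  # [(0, 1), (1, 2), (2, 3)]
```

The row index is read as the source node and the column index as the target node; that convention is an assumption of this sketch.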
import graph4nlp
from graph4nlp import data
import torch

g = data.GraphData()
g.add_nodes(10)
for i in range(10):
    g.add_edge(src=i, tgt=(i + 1) % 10)
g.node_features['node_feat'] = torch.randn((10, 10))
g.node_features['zero'] = torch.zeros(10)
g.node_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)
g.edge_features['edge_feat'] = torch.randn((10, 10))
g.edge_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)

# Test to_dgl
dgl_g = g.to_dgl()
for node_feat_name in g.node_feature_names():
    if g.node_features[node_feat_name] is None:
        assert node_feat_name not in dgl_g.ndata.keys()
    else:
        assert torch.all(torch.eq(dgl_g.ndata[node_feat_name], g.node_features[node_feat_name]))
for edge_feat_name in g.get_edge_feature_names():
    if g.edge_features[edge_feat_name] is None:
        assert edge_feat_name not in dgl_g.edata.keys()
    else:
        assert torch.all(torch.eq(dgl_g.edata[edge_feat_name], g.edge_features[edge_feat_name]))
print(g.get_node_num(), dgl_g.number_of_nodes())
src, tgt = dgl_g.all_edges()
dgl_g_edges = []
for i in range(src.shape[0]):
    dgl_g_edges.append((int(src[i]), int(tgt[i])))
print(g.get_all_edges(), dgl_g_edges)

# Test from_dgl
g = data.GraphData()
g1 = g.from_dgl(dgl_g)
for node_feat_name in g.node_feature_names():
    try:
        assert torch.all(torch.eq(g1.node_features[node_feat_name], g.node_features[node_feat_name]))
    except TypeError:
        assert g1.node_features[node_feat_name] == g.node_features[node_feat_name]
for edge_feat_name in g.get_edge_feature_names():
    try:
        assert torch.all(torch.eq(g1.edge_features[edge_feat_name], g.edge_features[edge_feat_name]))
    except TypeError:
        assert g1.edge_features[edge_feat_name] == g.edge_features[edge_feat_name]
print(g1.get_node_num(), g.get_node_num())
print(g1.get_all_edges(), g.get_all_edges())
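A graph can also be converted to other formats. Currently, to_dgl() converts a GraphData into a dgl.DGLGraph.
2.2 Manipulating a Dataset
A GraphData instance holds two groups of information, node-related and edge-related, and each group can be further divided into features and attributes.
2.2.1 Features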
import torch
import graph4nlp
from graph4nlp import data

g = data.GraphData()
g.add_nodes(10)
# Note that the first dimension of a feature tensor is the number of instances (nodes/edges).
# Any manipulation of the features must keep that dimension equal to the number of instances.
# An invalid example: 9 rows for 10 nodes. Uncommenting the next line is expected to fail.
# g.node_features['node_feat'] = torch.randn((9, 10))
g.node_features['node_feat'] = torch.randn((10, 10))
g.node_features['zero'] = torch.zeros(10)
g.node_features['idx'] = torch.tensor(list(range(10)), dtype=torch.long)
g.node_features
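To work with individual nodes and edges, access their features through GraphData.nodes[node_index].features or GraphData.edges[edge_index].features. Correspondingly, GraphData.node_features and GraphData.edge_features give access to the features of all nodes and edges in the graph. When new nodes are added, their features are zero-padded by default.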
g.add_nodes(1)
g.node_features # Zero padding is performed
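The two rules above (the first dimension must match the node count, and new nodes receive zero-padded features) can be sketched without any library. `ToyNodeFeatures` is a hypothetical stand-in that uses plain lists instead of tensors; it is not how Graph4NLP implements GraphData.

```python
class ToyNodeFeatures:
    """Toy feature store: feature name -> one row of numbers per node."""

    def __init__(self):
        self.num_nodes = 0
        self.features = {}

    def add_nodes(self, count):
        # Existing features are extended with all-zero rows for the new nodes.
        self.num_nodes += count
        for rows in self.features.values():
            width = len(rows[0]) if rows else 0
            rows.extend([0.0] * width for _ in range(count))

    def set_feature(self, name, rows):
        # The first dimension must equal the number of nodes.
        if len(rows) != self.num_nodes:
            raise ValueError(
                f"expected {self.num_nodes} rows for '{name}', got {len(rows)}"
            )
        self.features[name] = [list(row) for row in rows]


store = ToyNodeFeatures()
store.add_nodes(3)
store.set_feature("node_feat", [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
store.add_nodes(1)
print(store.features["node_feat"][-1])  # [0.0, 0.0]
```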
2.2.2 Attributes
g = data.GraphData()
g.add_nodes(2) # Add 2 nodes to an empty graph
print(g.node_attributes)
g.node_attributes[1]['node_attr'] = 'hello'
print(g.node_attributes)
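Differences between features and attributes:
- Storage type: features store only numeric data, usually as torch.Tensor. As for shape, in a graph with 10 nodes and 20 edges, every node feature tensor has a first dimension of 10 and every edge feature tensor has a first dimension of 20. Attributes, by contrast, can store data of any type and have no shape constraint.
- Access pattern: both features and attributes are addressed by two levels of keys, name and index, but in opposite order. Features are stored as a dictionary of feature_name (string) to feature_value (torch.Tensor), so a value is looked up by feature_name first. Attributes are stored as a list of dicts (see Figure 6), so they are looked up by the list index first.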
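The difference in key order is easy to see with plain Python containers. The variable names and values below are invented for illustration, and lists stand in for tensors.

```python
# Features: feature name first, then node index (numeric values only).
node_features = {
    "idx": [0, 1, 2],
    "score": [0.5, 0.1, 0.9],
}

# Attributes: node index first, then attribute name (values of any type).
node_attributes = [
    {"token": "graph"},
    {"token": "neural", "pos": "ADJ"},
    {},
]

score_of_node_1 = node_features["score"][1]
token_of_node_1 = node_attributes[1]["token"]
print(score_of_node_1, token_of_node_1)  # 0.1 neural
```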
2.3 Batching Data
import torch
import graph4nlp
from graph4nlp import data

g_list = []
batched_edges = []
graph_edges_list = []
# Build a number of graphs
for i in range(5):
    g = data.GraphData()
    g.add_nodes(10)
    for j in range(10):
        g.add_edge(src=j, tgt=(j + 1) % 10)
        batched_edges.append((i * 10 + j, i * 10 + ((j + 1) % 10)))
    g.node_features['idx'] = torch.ones(10) * i
    g.edge_features['idx'] = torch.ones(10) * i
    graph_edges_list.append(g.get_all_edges())
    g_list.append(g)
print(g_list)

# Test to_batch
batch = data.data.to_batch(g_list)
target_batch_idx = []
for i in range(5):
    for j in range(10):
        target_batch_idx.append(i)
# Expected behaviors
assert batch.batch == target_batch_idx
assert batch.get_node_num() == 50
assert batch.get_all_edges() == batched_edges

# Un-batching
graph_list = data.data.from_batch(batch)
for i in range(len(graph_list)):
    g = graph_list[i]
    # Expected behaviors
    print(g.get_all_edges(), graph_edges_list[i])
    print(g.get_node_num() == 10)
    print(torch.all(torch.eq(g.node_features['idx'], torch.ones(10) * i)))
    print(torch.all(torch.eq(g.edge_features['idx'], torch.ones(10) * i)))
    print("-" * 115)
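Batching is standard practice in deep learning: because the full dataset is too large, training feeds the data in batches. Graph deep learning does the same, grouping the GraphData instances in a list into small batches rather than feeding in all the graph data at once. graph4nlp.data.data.to_batch packs a list of GraphData instances into a single batch, and graph4nlp.data.data.from_batch unpacks a batch back into a list.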
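The bookkeeping behind batching can be sketched in a few lines: node indices of each graph are shifted by the number of nodes that came before it, and a per-node batch index records which graph every node belongs to. The functions `to_batch_sketch` and `from_batch_sketch` are hypothetical and represent a graph only as a `(num_nodes, edges)` pair; they illustrate the idea rather than the library's implementation.

```python
from typing import List, Tuple

Edge = Tuple[int, int]
Graph = Tuple[int, List[Edge]]  # (number of nodes, edge list)


def to_batch_sketch(graphs: List[Graph]):
    """Merge graphs into one big graph by offsetting node indices."""
    offset = 0
    batched_edges: List[Edge] = []
    batch_index: List[int] = []
    for graph_id, (num_nodes, edges) in enumerate(graphs):
        batched_edges.extend((s + offset, t + offset) for s, t in edges)
        batch_index.extend([graph_id] * num_nodes)
        offset += num_nodes
    return offset, batched_edges, batch_index


def from_batch_sketch(batched_edges: List[Edge], batch_index: List[int]) -> List[Graph]:
    """Split a batched graph back into the original small graphs."""
    num_graphs = max(batch_index) + 1 if batch_index else 0
    sizes = [0] * num_graphs
    for graph_id in batch_index:
        sizes[graph_id] += 1
    offsets = [0] * num_graphs
    for graph_id in range(1, num_graphs):
        offsets[graph_id] = offsets[graph_id - 1] + sizes[graph_id - 1]
    edges: List[List[Edge]] = [[] for _ in range(num_graphs)]
    for s, t in batched_edges:
        graph_id = batch_index[s]  # an edge never crosses graph boundaries
        edges[graph_id].append((s - offsets[graph_id], t - offsets[graph_id]))
    return [(sizes[g], edges[g]) for g in range(num_graphs)]


rings = [(3, [(0, 1), (1, 2), (2, 0)]), (2, [(0, 1), (1, 0)])]
total_nodes, merged_edges, merged_index = to_batch_sketch(rings)
print(total_nodes, merged_index)  # 5 [0, 0, 0, 1, 1]
print(from_batch_sketch(merged_edges, merged_index) == rings)  # True
```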
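2.4 Dataset and Custom Local Datasets
Building a dataset takes three steps: download or collect the raw data, preprocess it, and finally define how it is iterated. This mirrors how data is prepared for torch.utils.data.DataLoader in ordinary deep learning. The second step in turn consists of building the graph topology, building the vocabulary, and vectorizing.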
import os

class Dataset:
    @property
    def raw_dir(self) -> str:
        """The directory where the raw data is stored."""
        return os.path.join(self.root, 'raw')

    @property
    def processed_dir(self) -> str:
        """The directory where the processed data is stored."""
        return os.path.join(self.root, 'processed', self.topology_subdir)
To build a dataset from your own data, you can customize the modeling for raw text sentences, token lists, or graph data.
import graph4nlp
from graph4nlp import data

class Text2TextDataItem(data.DataItem):
    def __init__(self, input_text, output_text, tokenizer, share_vocab=True):
        super(Text2TextDataItem, self).__init__(input_text, tokenizer)
        self.output_text = output_text
        self.share_vocab = share_vocab

    def extract(self):
        g: GraphData = self.graph
        input_tokens = []
        for i in range(g.get_node_num()):
            if self.tokenizer is None:
                tokenized_token = g.node_attributes[i]['token'].strip().split(' ')
            else:
                tokenized_token = self.tokenizer(g.node_attributes[i]['token'])
            input_tokens.extend(tokenized_token)
        if self.tokenizer is None:
            output_tokens = self.output_text.strip().split(' ')
        else:
            output_tokens = self.tokenizer(self.output_text)
        if self.share_vocab:
            return input_tokens + output_tokens
        else:
            return input_tokens, output_tokens

class JobsDataset(data.Text2TextDataset):
    def __init__(self, root_dir,
                 topology_builder, topology_subdir,
                 # pretrained_word_emb_file=None,
                 pretrained_word_emb_name="6B",
                 pretrained_word_emb_url=None,
                 pretrained_word_emb_cache_dir=None,
                 graph_type='static',
                 merge_strategy="tailhead", edge_strategy=None,
                 seed=None,
                 word_emb_size=300, share_vocab=True, lower_case=True,
                 thread_number=1, port=9000,
                 dynamic_graph_type=None,
                 dynamic_init_topology_builder=None,
                 dynamic_init_topology_aux_args=None,
                 for_inference=None,
                 reused_vocab_model=None):
        # Initialize the dataset. If the preprocessed files are not found, then do the preprocessing and save them.
        super(JobsDataset, self).__init__(root_dir=root_dir, topology_builder=topology_builder,
                                          topology_subdir=topology_subdir, graph_type=graph_type,
                                          edge_strategy=edge_strategy, merge_strategy=merge_strategy,
                                          share_vocab=share_vocab, lower_case=lower_case,
                                          pretrained_word_emb_name=pretrained_word_emb_name,
                                          pretrained_word_emb_url=pretrained_word_emb_url,
                                          pretrained_word_emb_cache_dir=pretrained_word_emb_cache_dir,
                                          seed=seed, word_emb_size=word_emb_size,
                                          thread_number=thread_number, port=port,
                                          dynamic_graph_type=dynamic_graph_type,
                                          dynamic_init_topology_builder=dynamic_init_topology_builder,
                                          dynamic_init_topology_aux_args=dynamic_init_topology_aux_args,
                                          for_inference=for_inference,
                                          reused_vocab_model=reused_vocab_model)
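As the code above shows, Text2TextDataItem inherits from data.DataItem, which defines an extract() method. extract() returns the input tokens and output tokens, and users can override it. JobsDataset, in turn, inherits from data.Text2TextDataset; extract() returns the token list of the text graph.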
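The effect of share_vocab on the return value of extract() is easy to miss, so here is a standalone sketch of the same logic that runs without Graph4NLP. `extract_tokens` is a hypothetical free function; it takes the node tokens as a plain list instead of reading them from a graph.

```python
def extract_tokens(node_tokens, output_text, tokenizer=None, share_vocab=True):
    """Collect input and output tokens, mirroring the branching in extract()."""
    if tokenizer is None:
        # Fallback used when no tokenizer is given: split on single spaces.
        def tokenizer(text):
            return text.strip().split(' ')

    input_tokens = []
    for token in node_tokens:
        input_tokens.extend(tokenizer(token))
    output_tokens = tokenizer(output_text)

    if share_vocab:
        # One shared vocabulary: a single flat token list.
        return input_tokens + output_tokens
    # Separate vocabularies: a pair of token lists.
    return input_tokens, output_tokens


shared = extract_tokens(["list", "jobs"], "answer ( A )")
separate = extract_tokens(["list", "jobs"], "answer ( A )", share_vocab=False)
print(shared)    # ['list', 'jobs', 'answer', '(', 'A', ')']
print(separate)  # (['list', 'jobs'], ['answer', '(', 'A', ')'])
```

Callers therefore need to know the share_vocab setting to interpret the result: a flat list in one case, a tuple of two lists in the other.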
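A customized download routine looks like this: first check whether the files have already been downloaded, then download whatever is missing. In the base class Dataset, download() is an abstract method.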
import abc
import os

# Defined on the custom dataset subclass:
@property
def raw_file_names(self):
    """3 reserved keys: 'train', 'val' (optional), 'test'. Represent the split of dataset."""
    return {'train': 'train.txt', 'test': 'test.txt'}

class Dataset:
    def _download(self):
        # Skip the download when every raw file is already on disk.
        if all([os.path.exists(raw_path) for raw_path in self.raw_file_paths.values()]):
            return
        os.makedirs(self.raw_dir, exist_ok=True)
        self.download()

    @abc.abstractmethod
    def download(self):
        """Download the raw data from the Internet."""
        raise NotImplementedError
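The pattern above is a template method: the base class decides when to download, and the subclass decides how. The sketch below reproduces it in a runnable form; `ToyDataset` and `LocalToyDataset` are hypothetical classes, and the "download" simply writes placeholder files into a temporary directory.

```python
import abc
import os
import tempfile


class ToyDataset(abc.ABC):
    raw_file_names = {}  # split name -> file name, overridden by subclasses

    def __init__(self, root):
        self.root = root
        self.download_calls = 0
        self._download()

    @property
    def raw_dir(self):
        return os.path.join(self.root, 'raw')

    @property
    def raw_file_paths(self):
        return {split: os.path.join(self.raw_dir, name)
                for split, name in self.raw_file_names.items()}

    def _download(self):
        # Only call download() when at least one raw file is missing.
        if all(os.path.exists(path) for path in self.raw_file_paths.values()):
            return
        os.makedirs(self.raw_dir, exist_ok=True)
        self.download()

    @abc.abstractmethod
    def download(self):
        """Fetch the raw files into raw_dir."""


class LocalToyDataset(ToyDataset):
    raw_file_names = {'train': 'train.txt', 'test': 'test.txt'}

    def download(self):
        self.download_calls += 1
        for path in self.raw_file_paths.values():
            with open(path, 'w', encoding='utf-8') as f:
                f.write('placeholder\n')


with tempfile.TemporaryDirectory() as root:
    first = LocalToyDataset(root)
    second = LocalToyDataset(root)
    calls = (first.download_calls, second.download_calls)
print(calls)  # (1, 0)
```

The second instance finds both files already present, so its download() is never invoked.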
Next comes preprocessing. As mentioned above, it consists of three stages: build_topology, build_vocab, and vectorization.
def _process(self):
    if all([os.path.exists(processed_path) for processed_path in self.processed_file_paths.values()]):
        if 'val_split_ratio' in self.__dict__:
            # Note: this only constructs the warning object; it is not emitted.
            UserWarning(
                "Loading existing processed files on disk. Your `val_split_ratio` might not work since the data have"
                "already been split.")
        return
    if self.for_inference and \
            all([(os.path.exists(processed_path) or self.processed_file_names['data'] not in processed_path) for
                 processed_path in self.processed_file_paths.values()]):
        return

    os.makedirs(self.processed_dir, exist_ok=True)
    self.read_raw_data()

    if self.for_inference:
        # Inference: only the test split is processed, and no vocabulary is built here.
        self.test = self.build_topology(self.test)
        self.vectorization(self.test)
        data_to_save = {'test': self.test}
        torch.save(data_to_save, self.processed_file_paths['data'])
    else:
        # Step 1: build the graph topology for every split.
        self.train = self.build_topology(self.train)
        self.test = self.build_topology(self.test)
        if 'val' in self.__dict__:
            self.val = self.build_topology(self.val)

        # Step 2: build the vocabulary.
        self.build_vocab()

        # Step 3: vectorize every split.
        self.vectorization(self.train)
        self.vectorization(self.test)
        if 'val' in self.__dict__:
            self.vectorization(self.val)

        # Save the processed data and the vocabulary to disk.
        data_to_save = {'train': self.train, 'test': self.test}
        if 'val' in self.__dict__:
            data_to_save['val'] = self.val
        torch.save(data_to_save, self.processed_file_paths['data'])

        vocab_to_save = self.vocab_model
        torch.save(vocab_to_save, self.processed_file_paths['vocab'])
Here is the full example:
class Dataset:
    def build_topology(self, data_items):
        """
        Build graph topology for each item in the dataset. The generated graph is bound to the `graph` attribute of the
        DataItem.
        """
        total = len(data_items)
        thread_number = min(total, self.thread_number)
        pool = Pool(thread_number)
        res_l = []
        for i in range(thread_number):
            start_index = total * i // thread_number
            end_index = total * (i + 1) // thread_number

            """
            data_items, topology_builder,
            graph_type, dynamic_graph_type, dynamic_init_topology_builder,
            merge_strategy, edge_strategy, dynamic_init_topology_aux_args,
            lower_case, tokenizer, port, timeout
            """
            r = pool.apply_async(self._build_topology_process,
                                 args=(data_items[start_index:end_index], self.topology_builder, self.graph_type,
                                       self.dynamic_graph_type, self.dynamic_init_topology_builder,
                                       self.merge_strategy, self.edge_strategy, self.dynamic_init_topology_aux_args,
                                       self.lower_case, self.tokenizer, self.port, self.timeout))
            res_l.append(r)
        pool.close()
        pool.join()

        data_items = []
        for i in range(thread_number):
            res = res_l[i].get()
            for data in res:
                if data.graph is not None:
                    data_items.append(data)

        return data_items

    def build_vocab(self):
        """
        Build the vocabulary. If `self.use_val_for_vocab` is `True`, use both training set and validation set for building
        the vocabulary. Otherwise only the training set is used.
        """
        data_for_vocab = self.train
        if self.use_val_for_vocab:
            data_for_vocab = self.val + data_for_vocab

        vocab_model = VocabModel.build(saved_vocab_file=self.processed_file_paths['vocab'],
                                       data_set=data_for_vocab,
                                       tokenizer=self.tokenizer,
                                       lower_case=self.lower_case,
                                       max_word_vocab_size=self.max_word_vocab_size,
                                       min_word_vocab_freq=self.min_word_vocab_freq,
                                       share_vocab=self.share_vocab,
                                       pretrained_word_emb_name=self.pretrained_word_emb_name,
                                       pretrained_word_emb_url=self.pretrained_word_emb_url,
                                       pretrained_word_emb_cache_dir=self.pretrained_word_emb_cache_dir,
                                       target_pretrained_word_emb_name=self.target_pretrained_word_emb_name,
                                       target_pretrained_word_emb_url=self.target_pretrained_word_emb_url,
                                       word_emb_size=self.word_emb_size)
        self.vocab_model = vocab_model

        return self.vocab_model

class Text2TextDataset:
    def vectorization(self, data_items):
        if self.topology_builder == IEBasedGraphConstruction:
            use_ie = True
        else:
            use_ie = False
        for item in data_items:
            graph: GraphData = item.graph
            token_matrix = []
            for node_idx in range(graph.get_node_num()):
                node_token = graph.node_attributes[node_idx]['token']
                node_token_id = self.vocab_model.in_word_vocab.getIndex(node_token, use_ie)
                graph.node_attributes[node_idx]['token_id'] = node_token_id
                token_matrix.append([node_token_id])
            if self.topology_builder == IEBasedGraphConstruction:
                for i in range(len(token_matrix)):
                    token_matrix[i] = np.array(token_matrix[i][0])
                token_matrix = pad_2d_vals_no_size(token_matrix)
                token_matrix = torch.tensor(token_matrix, dtype=torch.long)
                graph.node_features['token_id'] = token_matrix
                pass
            else:
                token_matrix = torch.tensor(token_matrix, dtype=torch.long)
                graph.node_features['token_id'] = token_matrix

            if use_ie and 'token' in graph.edge_attributes[0].keys():
                edge_token_matrix = []
                for edge_idx in range(graph.get_edge_num()):
                    edge_token = graph.edge_attributes[edge_idx]['token']
                    edge_token_id = self.vocab_model.in_word_vocab.getIndex(edge_token, use_ie)
                    graph.edge_attributes[edge_idx]['token_id'] = edge_token_id
                    edge_token_matrix.append([edge_token_id])
                if self.topology_builder == IEBasedGraphConstruction:
                    for i in range(len(edge_token_matrix)):
                        edge_token_matrix[i] = np.array(edge_token_matrix[i][0])
                    edge_token_matrix = pad_2d_vals_no_size(edge_token_matrix)
                    edge_token_matrix = torch.tensor(edge_token_matrix, dtype=torch.long)
                    graph.edge_features['token_id'] = edge_token_matrix

            tgt = item.output_text
            tgt_token_id = self.vocab_model.out_word_vocab.to_index_sequence(tgt)
            tgt_token_id.append(self.vocab_model.out_word_vocab.EOS)
            tgt_token_id = np.array(tgt_token_id)
            item.output_np = tgt_token_id

class Text2TextDataset:
    @staticmethod
    def collate_fn(data_list: [Text2TextDataItem]):
        graph_list = [item.graph for item in data_list]
        graph_data = to_batch(graph_list)

        output_numpy = [deepcopy(item.output_np) for item in data_list]
        output_str = [deepcopy(item.output_text.lower().strip()) for item in data_list]
        output_pad = pad_2d_vals_no_size(output_numpy)

        tgt_seq = torch.from_numpy(output_pad).long()
        return {
            "graph_data": graph_data,
            "tgt_seq": tgt_seq,
            "output_str": output_str
        }
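build_topology builds a text graph for each DataItem in the dataset and binds it to the corresponding DataItem object. This routine usually relies on functionality provided by the GraphConstruction module. Because each text graph is built independently of the others, several graphs can be built concurrently, which is where Python's multiprocessing module comes in.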
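The way build_topology divides the work among workers comes down to integer arithmetic on slice boundaries. The sketch below isolates that computation; `chunk_bounds` is a hypothetical helper, and no processes are actually spawned.

```python
def chunk_bounds(total, workers):
    """Split `total` items into at most `workers` contiguous, near-equal slices."""
    workers = min(total, workers)  # never use more workers than items
    return [(total * i // workers, total * (i + 1) // workers)
            for i in range(workers)]


print(chunk_bounds(10, 3))  # [(0, 3), (3, 6), (6, 10)]
print(chunk_bounds(2, 4))   # [(0, 1), (1, 2)]
```

Each `(start, end)` pair corresponds to one `data_items[start:end]` slice, and the slices cover every item exactly once.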
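build_vocab collects all the tokens that appear in the data items and builds a vocabulary from them. By default, graph4nlp.utils.vocab_utils.VocabModel is responsible for constructing the vocabulary and also represents the vocabulary itself. The resulting vocabulary becomes a member of the Dataset instance.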
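As a rough illustration of what building a vocabulary involves, the following sketch counts tokens and applies a minimum frequency and a maximum size, loosely echoing the min_word_vocab_freq and max_word_vocab_size arguments seen above. The function `build_toy_vocab` and its special tokens are assumptions of this example and do not reflect how VocabModel works internally.

```python
from collections import Counter


def build_toy_vocab(token_lists, min_freq=1, max_size=None,
                    specials=("<pad>", "<unk>")):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = [tok for tok, freq in counts.items()
            if freq >= min_freq and tok not in specials]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda tok: (-counts[tok], tok))
    if max_size is not None:
        kept = kept[:max_size]
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


corpus = [["a", "b", "a"], ["b", "a", "c"]]
toy_vocab = build_toy_vocab(corpus, min_freq=2)
print(toy_vocab)  # {'<pad>': 0, '<unk>': 1, 'a': 2, 'b': 3}
```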
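vectorization is a lookup step: it converts tokens from strings into vocabulary indices, which are later used to obtain word embeddings. Because there are several ways to assign embedding vectors to tokens, this step is usually overridden by downstream classes.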
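In its simplest form the lookup is a dictionary access with a fallback for unknown tokens. The helper `tokens_to_ids` and the vocabulary below are hypothetical; the library's own vocabulary class exposes its own lookup methods.

```python
def tokens_to_ids(tokens, vocab, unk_token="<unk>"):
    """Replace each token by its id, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(tok, unk_id) for tok in tokens]


lookup_vocab = {"<pad>": 0, "<unk>": 1, "graph": 2, "neural": 3}
print(tokens_to_ids(["graph", "neural", "network"], lookup_vocab))  # [2, 3, 1]
```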
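Iterating over the dataset at runtime is handled by PyTorch's data loader. Since the basic element is a DataItem, it is our job to convert the low-level list of DataItem objects fetched by torch.utils.data.DataLoader into the batched data we want. Dataset.collate_fn() is designed to do exactly this.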
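The core of a collate function for sequence targets is padding variable-length id sequences to a common length. The sketch below shows that step with plain lists; `pad_sequences` and `collate_sketch` are hypothetical names, items are simple dicts rather than DataItem objects, and the graph batching part is left out.

```python
def pad_sequences(seqs, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length."""
    max_len = max((len(seq) for seq in seqs), default=0)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in seqs]


def collate_sketch(items, pad_id=0):
    """Turn a list of per-example dicts into one batch dict."""
    return {
        "tgt_seq": pad_sequences([item["output_ids"] for item in items], pad_id),
        "output_str": [item["output_text"].lower().strip() for item in items],
    }


examples = [
    {"output_ids": [5, 6, 2], "output_text": " Answer A "},
    {"output_ids": [7, 2], "output_text": "answer b"},
]
collated = collate_sketch(examples)
print(collated["tgt_seq"])     # [[5, 6, 2], [7, 2, 0]]
print(collated["output_str"])  # ['answer a', 'answer b']
```

A function with this shape would typically be passed to a data loader through its collate_fn argument, with the padded lists converted to tensors.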
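3. Closing Remarks
Later posts will cover graph construction as well as graph encoders and decoders, so stay tuned. This post was written in a hurry amid a heavy workload; corrections and criticism are welcome. Thank you.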