Project repository
https://github.com/yao8839836/text_gcn
Environment
Python 3.6
Sample 20NG document
From: decay@cbnewsj.cb.att.com (dean.kaflowitz) Subject: Re: about the bible quiz answers Organization: AT&T Distribution: na Lines: 18 In article healta.153.735242337@saturn.wwc.edu, healta@saturn.wwc.edu (Tammy R Healy) writes: > > > #12) The 2 cheribums are on the Ark of the Covenant. When God said make no > graven image, he was refering to idols, which were created to be worshipped. > The Ark of the Covenant wasn’t wrodhipped and only the high priest could > enter the Holy of Holies where it was kept once a year, on the Day of > Atonement. I am not familiar with, or knowledgeable about the original language, but I believe there is a word for “idol” and that the translator would have used the word “idol” instead of “graven image” had the original said “idol.” So I think you’re wrong here, but then again I could be too. I just suggesting a way to determine whether the interpretation you offer is correct. Dean Kaflowitz
python remove_words.py 20ng
dataset = sys.argv[1]  # here: '20ng'
Count word frequencies.
Filter out low-frequency words and stopwords.
Write all processed documents to 20ng.clean.txt ('word1 word2 word3 …\n word1 word2 …\n…').
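A minimal sketch of this cleaning step (the file paths, the NLTK stopword list and the frequency threshold are assumptions for illustration, not copied verbatim from remove_words.py):

from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# raw corpus: one document per line (path follows the repo layout, may differ)
doc_content_list = open('data/corpus/20ng.txt').read().splitlines()

# 1. count word frequencies over the whole corpus
word_freq = Counter()
for doc in doc_content_list:
    word_freq.update(doc.split())

# 2. drop stopwords and low-frequency words (the repo uses a small threshold, e.g. 5)
clean_docs = []
for doc in doc_content_list:
    kept = [w for w in doc.split() if w not in stop_words and word_freq[w] >= 5]
    clean_docs.append(' '.join(kept))

# 3. one cleaned document per line, exactly the 'word1 word2 ...\n' format above
with open('data/corpus/20ng.clean.txt', 'w') as f:
    f.write('\n'.join(clean_docs))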
python build_graph.py 20ng
doc_train_list[0]:
doc_content_list[0]:
train_ids_str:
“idx1\nidx2\n…”
shuffle_doc_name_str (for the train ids above):
“name1\nname2\n…”
shuffle_doc_words_str
(omitted)
word_doc_list
{word1:[1,2,3,4], word2:[2,3,4,5],word3:[100,203,…]}
This means word1 appears in documents 1, 2, 3 and 4.
word_doc_freq
{word1:4,word2:4,…}
This means word1 appears in 4 documents.
word_id_map
{word1:0,word2:1,word3:2…}
vocab_str
label_set
label_list_str
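A rough sketch of how these structures fall out of the shuffled documents (variable names follow the ones listed above; details may differ slightly from build_graph.py):

shuffle_doc_words_list = shuffle_doc_words_str.split('\n')

word_doc_list = {}   # word -> list of doc ids the word appears in
for doc_id, doc_words in enumerate(shuffle_doc_words_list):
    for word in set(doc_words.split()):          # set(): each document counted at most once per word
        word_doc_list.setdefault(word, []).append(doc_id)

word_doc_freq = {word: len(doc_ids) for word, doc_ids in word_doc_list.items()}
vocab = list(word_doc_list.keys())
word_id_map = {word: i for i, word in enumerate(vocab)}
vocab_str = '\n'.join(vocab)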
x
row_x (length real_train_size × word_embeddings_dim):
[0,0,0,0,…0, 1,1,1,1,…1, 2,2,2,2,…2,…] (each document index repeated 300 times, i.e. word_embeddings_dim)
col_x (length real_train_size × word_embeddings_dim):
[0,1,2,3,…299, 0,1,2,3,…299, …]
data_x (length real_train_size × word_embeddings_dim)
x = sp.csr_matrix((data_x, (row_x, col_x)), shape=(
real_train_size, word_embeddings_dim))
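A tiny, hypothetical illustration (3 docs, embedding dim 4) of how the (data, (row, col)) layout above fills the matrix:

import numpy as np
import scipy.sparse as sp

real_train_size, word_embeddings_dim = 3, 4
row_x = np.repeat(np.arange(real_train_size), word_embeddings_dim)  # [0,0,0,0, 1,1,1,1, 2,2,2,2]
col_x = np.tile(np.arange(word_embeddings_dim), real_train_size)    # [0,1,2,3, 0,1,2,3, 0,1,2,3]
data_x = np.random.rand(real_train_size * word_embeddings_dim)      # one value per (doc, dim) cell
x = sp.csr_matrix((data_x, (row_x, col_x)),
                  shape=(real_train_size, word_embeddings_dim))
print(x.toarray().shape)  # (3, 4): every cell is filled, so the "sparse" matrix is in fact dense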
y
[[0,1,0,0,0,…], [1,0,0,0,…],…]
tx
test x
ty
test y
allx(doc+word)
word_vectors:(vocab_size, word_embeddings_dim)
row_allx / col_allx / data_allx are built the same way as above, except that they now cover all training documents plus the whole vocab.
More concretely:
row_allx:[0,0,…,1,1,…, train_size-1,train_size-1,…train_size+vocab_size-1, train_size+vocab_size-1,…]
ally
[[0,1,0,0,0,…], [1,0,0,0,…],…,[0,0,0,0,…]]
labeled documents | unlabeled words (all-zero label rows)
Summary so far
print(x.shape, y.shape, tx.shape, ty.shape, allx.shape, ally.shape)
(10183, 300) (10183, 20) (7532, 300) (7532, 20) (54071, 300) (54071, 20)
windows
window_size = 20
[[w1,w2,w3,…w14], [w1,w2,…w20],…]
If a document has only 14 words (fewer than window_size), the whole document becomes a single window.
Otherwise the window slides over the document with stride 1, so a document with length words produces length - window_size + 1 windows; see the sketch below.
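A sketch of that sliding-window construction (assuming the shuffle_doc_words_list from above):

window_size = 20
windows = []
for doc_words in shuffle_doc_words_list:
    words = doc_words.split()
    length = len(words)
    if length <= window_size:
        windows.append(words)                       # short doc: the whole doc is one window
    else:
        for i in range(length - window_size + 1):   # stride 1 -> length - window_size + 1 windows
            windows.append(words[i: i + window_size])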
word_window_freq
{word1:freq1,word2:freq2,word3:freq3}
word1 appears in freq1 distinct windows.
word_pair_count
{'1498,2066': 3, '2066,1498': 3, …}
Edges are stored in both directions: vocab word 1498 and vocab word 2066 co-occur 3 times across all windows.
Note that this count is not the same thing as the word frequency above:
- word_window_freq counts, for a single word, the number of windows that contain it (each window counted at most once)
- word_pair_count counts, for a word pair, the total number of co-occurrences over all windows; a single window can contribute several pairs
PMI(word word)
Consider a word pair where word1 is the i-th word in the vocab and word2 is the j-th.
row:
[train_size+i, …]
col
[train_size+j,…]
weight
[pmi_i_j]
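The PMI weight itself follows the paper's formula PMI(i, j) = log( p(i, j) / (p(i) p(j)) ), with the probabilities estimated from window counts. A sketch (it assumes the row/col/weight lists and the structures above already exist):

from math import log

num_window = len(windows)
for pair, count in word_pair_count.items():
    i, j = (int(t) for t in pair.split(','))
    freq_i = word_window_freq[vocab[i]]       # windows containing word i
    freq_j = word_window_freq[vocab[j]]       # windows containing word j
    pmi = log((count / num_window) /
              ((freq_i / num_window) * (freq_j / num_window)))
    if pmi <= 0:                              # only positive PMI becomes an edge
        continue
    row.append(train_size + i)                # word node i
    col.append(train_size + j)                # word node j
    weight.append(pmi)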
Question: what about the rest of the matrix, e.g. the train_size × train_size block?
doc_word_freq
{'doc_id1,word_id1': 3, …}
This means the word with id word_id1 appears 3 times in the document with id doc_id1.
TF-IDF(doc word)
This answers the question above:
row:
[train_size+i, …, train_size+vocab_size-1 | 0, 1, 2, …, train_size-1 | train_size+vocab_size, …, train_size+vocab_size+test_size-1]
(first the word-node rows from the PMI edges above, then the rows of the training documents, then the rows of the test documents)
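A sketch of the doc-word (TF-IDF) edges that fill those blocks, roughly what build_graph.py does (again assuming the row/col/weight lists and the structures from above):

from math import log

total_docs = len(shuffle_doc_words_list)            # train + test documents
for pair, freq in doc_word_freq.items():
    doc_id, word_id = (int(t) for t in pair.split(','))
    if doc_id < train_size:
        row.append(doc_id)                          # training-doc node
    else:
        row.append(doc_id + vocab_size)             # test-doc node, shifted past the word nodes
    col.append(train_size + word_id)                # word node
    idf = log(total_docs / word_doc_freq[vocab[word_id]])
    weight.append(freq * idf)                       # TF-IDF edge weight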
adj = sp.csr_matrix(
(weight, (row, col)), shape=(node_size, node_size))
python train.py 20ng
load_corpus
- adj: the puzzling line adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj): wherever adj.T > adj, the entry is replaced by the transposed value, so the net effect is an element-wise max(adj, adj.T), i.e. the adjacency matrix is symmetrized (see the toy check after this list)
- features: (train_size(doc) + vocab_size + test_size) x 300
- y_train: (train_size(doc) + vocab_size + test_size) x label_num, but only the positions that are 1 in [1,1,1(real_train_size), 0,0,0,…] carry a label; everything else is unlabeled
- y_val: [0,0,0(real_train_size), 1,1,1(val_size), 0,0,0,…]
- y_test: [0,0,0(real_train_size), 0,0,0(val_size), 0,0,0(vocab_size), 1,1,1(test_size)]
- train_mask
- val_mask
- test_mask
- train_size
- test_size
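A quick toy check that the adj expression above really is an element-wise max(adj, adj.T):

import numpy as np
import scipy.sparse as sp

adj = sp.csr_matrix(np.array([[0., 2., 0.],
                              [1., 0., 3.],
                              [0., 0., 0.]]))
sym = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)
print(sym.toarray())
# [[0. 2. 0.]
#  [2. 0. 3.]
#  [0. 3. 0.]]  == np.maximum(adj.toarray(), adj.toarray().T)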
features = sp.identity(features.shape[0])
The code below walks through it with features = sp.identity(3) as input.
import scipy.sparse as sp
import numpy as np
def preprocess_features(features):
    """Row-normalize feature matrix and convert to tuple representation"""
    # features (the sp.identity(3) input):
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    rowsum = np.array(features.sum(1))      # sum of each row
    print("rowsum", rowsum)
    # rowsum:
    # [[1.]
    #  [1.]
    #  [1.]]
    r_inv = np.power(rowsum, -1).flatten()  # reciprocal of each row sum; multiplying every element of a row by it normalizes that row
    print("np.power(rowsum, -1)", np.power(rowsum, -1))
    # r_inv: [1. 1. 1.]
    r_inv[np.isinf(r_inv)] = 0.             # rows summing to 0 give inf; set those to 0
    r_mat_inv = sp.diags(r_inv)             # diagonal matrix of the reciprocals, used to multiply the original matrix
    print("r_mat_inv", r_mat_inv)
    # r_mat_inv:
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    features = r_mat_inv.dot(features)      # matrix multiplication: the actual row-normalization
    print("feature", features)
    # features:
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    return sparse_to_tuple(features)
def sparse_to_tuple(sparse_mx):
    """Convert sparse matrix to tuple representation."""
    def to_tuple(mx):
        if not sp.isspmatrix_coo(mx):
            mx = mx.tocoo()
        coords = np.vstack((mx.row, mx.col)).transpose()
        values = mx.data
        shape = mx.shape
        return coords, values, shape  # (row, col) coordinates, the corresponding values, the matrix shape
    if isinstance(sparse_mx, list):
        for i in range(len(sparse_mx)):
            sparse_mx[i] = to_tuple(sparse_mx[i])
    else:
        sparse_mx = to_tuple(sparse_mx)
    return sparse_mx
So preprocess_features(features) ultimately returns (coords, values, shape); for the identity(3) example that is coords = [[0,0],[1,1],[2,2]], values = [1. 1. 1.], shape = (3, 3).
support = [preprocess_adj(adj)]
def normalize_adj(adj):
    """Symmetrically normalize adjacency matrix."""
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()

def preprocess_adj(adj):
    """Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation."""
    adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))
    return sparse_to_tuple(adj_normalized)
This should be exactly the renormalization from the paper (to look at more closely).
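For reference, normalize_adj plus the added self-loops implement the standard GCN renormalization from Kipf & Welling, which Text GCN reuses:
$\tilde{A} = A + I_N$, $\quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\quad \hat{A} = \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}$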
placeholders
placeholders = {
    'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
    'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
    'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
    'labels_mask': tf.placeholder(tf.int32),
    'dropout': tf.placeholder_with_default(0., shape=()),
    # helper variable for sparse dropout
    'num_features_nonzero': tf.placeholder(tf.int32)
}
The sparse_placeholder here has to be fed (indices, values, shape), i.e. exactly the three things preprocess_features returns.
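From memory, the helper that builds the feed dict looks roughly like construct_feed_dict in utils.py (inherited from Kipf's gcn code); treat this as a sketch rather than the exact implementation:

def construct_feed_dict(features, support, labels, labels_mask, placeholders):
    feed_dict = dict()
    feed_dict.update({placeholders['labels']: labels})
    feed_dict.update({placeholders['labels_mask']: labels_mask})
    feed_dict.update({placeholders['features']: features})          # the (coords, values, shape) tuple
    feed_dict.update({placeholders['support'][i]: support[i] for i in range(len(support))})
    feed_dict.update({placeholders['num_features_nonzero']: features[1].shape})  # number of nonzeros
    return feed_dict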
create_model
A two-layer GCN:
self.layers.append(GraphConvolution) x 2
Then build() constructs the static graph:
for layer in self.layers:
    hidden = layer(self.activations[-1])
    self.activations.append(hidden)
self.outputs = self.activations[-1]
Store the variables:
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
loss
# Why only the first layer? (L2 weight decay is applied to the first layer's weights only, following Kipf's GCN)
for var in self.layers[0].vars.values():
    self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
# The loss is computed only over nodes that carry a label (masked)
def masked_softmax_cross_entropy(preds, labels, mask):
    """Softmax cross-entropy loss with masking."""
    print(preds)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)
    loss *= mask
    return tf.reduce_mean(loss)
acc
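acc presumably mirrors the masked loss: only masked-in nodes count. A sketch along the lines of the repo's masked_accuracy:

def masked_accuracy(preds, labels, mask):
    """Accuracy with masking: only labeled (masked-in) nodes contribute."""
    correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
    accuracy_all = tf.cast(correct_prediction, tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)       # rescale so the mean over all nodes equals the mean over masked nodes
    accuracy_all *= mask
    return tf.reduce_mean(accuracy_all)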
opt_op
self.opt_op = self.optimizer.minimize(self.loss)
train (the forward computation)
First, the inputs are passed into the first layer:
self.layers.append(GraphConvolution(input_dim=self.input_dim,
                                    output_dim=FLAGS.hidden1,  # 200
                                    placeholders=self.placeholders,
                                    act=tf.nn.relu,
                                    dropout=True,
                                    featureless=True,
                                    sparse_inputs=True,
                                    logging=self.logging))
First, dropout is applied (sparse dropout):
# dropout
if self.sparse_inputs:
    x = sparse_dropout(x, 1-self.dropout, self.num_features_nonzero)
else:
    x = tf.nn.dropout(x, 1-self.dropout)
def sparse_dropout(x, keep_prob, noise_shape):
    """Dropout for sparse tensors."""
    random_tensor = keep_prob                           # probability of keeping an element
    random_tensor += tf.random_uniform(noise_shape)     # noise_shape = number of nonzeros in the features; one uniform [0,1) value per nonzero is added to keep_prob
    dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool)  # floors to 1 (True) when the sum is >= 1, i.e. with probability keep_prob; else 0 (False)
    pre_out = tf.sparse_retain(x, dropout_mask)         # keep only the elements of x where dropout_mask is True
    return pre_out * (1./keep_prob)                     # inverted dropout: rescale the survivors by 1/keep_prob so the expected value stays the same
Then the convolve step. The first layer has featureless=True, which looks odd at first: if featureless is True, the computation is just the normalized adjacency matrix times the weight matrix, with the features left out entirely? It only makes sense once you realize the input features are an identity matrix, so X · W = W and the multiplication can simply be skipped:
supports = list()
for i in range(len(self.support)):
    if not self.featureless:
        pre_sup = dot(x, self.vars['weights_' + str(i)],
                      sparse=self.sparse_inputs)
    else:
        pre_sup = self.vars['weights_' + str(i)]
    support = dot(self.support[i], pre_sup, sparse=True)
    supports.append(support)
output = tf.add_n(supports)  # tf.add_n sums all the support tensors in the list element-wise
def dot(x, y, sparse=False):
    """Wrapper for tf.matmul (sparse vs dense)."""
    if sparse:
        res = tf.sparse_tensor_dense_matmul(x, y)
    else:
        res = tf.matmul(x, y)
    return res
bias & embedding: the embedding is just the first layer's output, while self.act(output) is what gets passed on to the next layer:
# bias
if self.bias:
    output += self.vars['bias']
self.embedding = output  # saved as the node embedding (before the activation)
return self.act(output)
Then self.act(output) is fed into the second layer:
self.layers.append(GraphConvolution(input_dim=FLAGS.hidden1,
                                    output_dim=self.output_dim,
                                    placeholders=self.placeholders,
                                    act=lambda x: x,
                                    dropout=True,
                                    logging=self.logging))
Note that featureless is False here (the default).
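Putting both layers together, this is the two-layer model from the Text GCN paper, $Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W_0)\,W_1\big)$; since $X = I$ here, the first layer can skip the multiplication by $X$, which is exactly what featureless=True exploits.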
Miscellaneous odd errors
AttributeError: module 'tensorflow' has no attribute 'random_uniform'
Fix:
https://blog.csdn.net/weixin_43763859/article/details/104537392
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11
I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Fix (install cudatoolkit=11.0):
https://blog.csdn.net/qq_28193019/article/details/103146116
Cannot use GPU when output.shape[1] * nnz(a) > 2^31
Fix:
https://blog.csdn.net/weixin_35970195/article/details/112585490