Project repository
https://github.com/yao8839836/text_gcn
Environment
Python 3.6
Sample 20NG document
From: decay@cbnewsj.cb.att.com (dean.kaflowitz) Subject: Re: about the bible quiz answers Organization: AT&T Distribution: na Lines: 18 In article healta.153.735242337@saturn.wwc.edu, healta@saturn.wwc.edu (Tammy R Healy) writes: > > > #12) The 2 cheribums are on the Ark of the Covenant. When God said make no > graven image, he was refering to idols, which were created to be worshipped. > The Ark of the Covenant wasn’t wrodhipped and only the high priest could > enter the Holy of Holies where it was kept once a year, on the Day of > Atonement. I am not familiar with, or knowledgeable about the original language, but I believe there is a word for “idol” and that the translator would have used the word “idol” instead of “graven image” had the original said “idol.” So I think you’re wrong here, but then again I could be too. I just suggesting a way to determine whether the interpretation you offer is correct. Dean Kaflowitz
python remove_words.py 20ng
dataset = sys.argv[1]  # here: '20ng'
Count word frequencies.
Filter out low-frequency words and stopwords.
Write all processed documents to 20ng.clean.txt ('word1 word2 word3 …\n word1 word2 …\n…').
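A minimal sketch of this cleaning step (the file paths, the NLTK stopword list and the frequency threshold are assumptions for illustration, not copied verbatim from remove_words.py):

from collections import Counter
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# raw corpus: one document per line (path follows the repo layout, may differ)
doc_content_list = open('data/corpus/20ng.txt').read().splitlines()

# 1. count word frequencies over the whole corpus
word_freq = Counter()
for doc in doc_content_list:
    word_freq.update(doc.split())

# 2. drop stopwords and low-frequency words (the repo uses a small threshold, e.g. 5)
clean_docs = []
for doc in doc_content_list:
    kept = [w for w in doc.split() if w not in stop_words and word_freq[w] >= 5]
    clean_docs.append(' '.join(kept))

# 3. one cleaned document per line, exactly the 'word1 word2 ...\n' format above
with open('data/corpus/20ng.clean.txt', 'w') as f:
    f.write('\n'.join(clean_docs))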
python build_graph.py 20ng
doc_train_list[0]:
doc_content_list[0]:
train_ids_str:
“idx1\nidx2\n…”
shuffle_doc_name_str (for the train ids above):
“name1\nname2\n…”
shuffle_doc_words_str
(omitted)
word_doc_list
{word1:[1,2,3,4], word2:[2,3,4,5],word3:[100,203,…]}
This means word1 appears in documents 1, 2, 3 and 4.
word_doc_freq
{word1:4,word2:4,…}
This means word1 appears in 4 documents.
word_id_map
{word1:0,word2:1,word3:2…}
vocab_str
label_set
label_list_str
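A rough sketch of how these structures fall out of the shuffled documents (variable names follow the ones listed above; details may differ slightly from build_graph.py):

shuffle_doc_words_list = shuffle_doc_words_str.split('\n')

word_doc_list = {}   # word -> list of doc ids the word appears in
for doc_id, doc_words in enumerate(shuffle_doc_words_list):
    for word in set(doc_words.split()):          # set(): each document counted at most once per word
        word_doc_list.setdefault(word, []).append(doc_id)

word_doc_freq = {word: len(doc_ids) for word, doc_ids in word_doc_list.items()}
vocab = list(word_doc_list.keys())
word_id_map = {word: i for i, word in enumerate(vocab)}
vocab_str = '\n'.join(vocab)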
x
row_x (length real_train_size × word_embeddings_dim):
[0,0,0,0,…0, 1,1,1,1,…1, 2,2,2,2,…2,…] (each document index repeated 300 times, i.e. word_embeddings_dim)
col_x (length real_train_size × word_embeddings_dim):
[0,1,2,3,…299, 0,1,2,3,…299, …]
data_x (length real_train_size × word_embeddings_dim)
x = sp.csr_matrix((data_x, (row_x, col_x)), shape=(
real_train_size, word_embeddings_dim))
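A tiny, hypothetical illustration (3 docs, embedding dim 4) of how the (data, (row, col)) layout above fills the matrix:

import numpy as np
import scipy.sparse as sp

real_train_size, word_embeddings_dim = 3, 4
row_x = np.repeat(np.arange(real_train_size), word_embeddings_dim)  # [0,0,0,0, 1,1,1,1, 2,2,2,2]
col_x = np.tile(np.arange(word_embeddings_dim), real_train_size)    # [0,1,2,3, 0,1,2,3, 0,1,2,3]
data_x = np.random.rand(real_train_size * word_embeddings_dim)      # one value per (doc, dim) cell
x = sp.csr_matrix((data_x, (row_x, col_x)),
                  shape=(real_train_size, word_embeddings_dim))
print(x.toarray().shape)  # (3, 4): every cell is filled, so the "sparse" matrix is in fact dense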
y
[[0,1,0,0,0,…], [1,0,0,0,…],…]
tx
test x
ty
test y
allx(doc+word)
word_vectors:(vocab_size, word_embeddings_dim)
row_allx / col_allx / data_allx are built the same way as above, except that they now cover all training documents plus the whole vocab.
More concretely:
row_allx:[0,0,…,1,1,…, train_size-1,train_size-1,…train_size+vocab_size-1, train_size+vocab_size-1,…]
ally
[[0,1,0,0,0,…], [1,0,0,0,…],…,[0,0,0,0,…]]
labeled documents | unlabeled words (all-zero label rows)
Summary so far
print(x.shape, y.shape, tx.shape, ty.shape, allx.shape, ally.shape)
(10183, 300) (10183, 20) (7532, 300) (7532, 20) (54071, 300) (54071, 20)
windows
window_size = 20
[[w1,w2,w3,…w14], [w1,w2,…w20],…]
If a document has only 14 words (fewer than window_size), the whole document becomes a single window.
Otherwise the window slides over the document with stride 1, so a document with length words produces length - window_size + 1 windows; see the sketch below.
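A sketch of that sliding-window construction (assuming the shuffle_doc_words_list from above):

window_size = 20
windows = []
for doc_words in shuffle_doc_words_list:
    words = doc_words.split()
    length = len(words)
    if length <= window_size:
        windows.append(words)                       # short doc: the whole doc is one window
    else:
        for i in range(length - window_size + 1):   # stride 1 -> length - window_size + 1 windows
            windows.append(words[i: i + window_size])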
word_window_freq
{word1:freq1,word2:freq2,word3:freq3}
word1 appears in freq1 distinct windows.
word_pair_count
{'1498,2066': 3, '2066,1498': 3, …}
Edges are stored in both directions: vocab word 1498 and vocab word 2066 co-occur 3 times across all windows.
Note that this count is not the same thing as the word frequency above:
- word_window_freq counts, for a single word, the number of windows that contain it (each window counted at most once)
- word_pair_count counts, for a word pair, the total number of co-occurrences over all windows; a single window can contribute several pairs
PMI(word word)
Consider a word pair where word1 is the i-th word in the vocab and word2 is the j-th.
row:
[train_size+i, …]
col
[train_size+j,…]
weight
[pmi_i_j]
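The PMI weight itself follows the paper's formula PMI(i, j) = log( p(i, j) / (p(i) p(j)) ), with the probabilities estimated from window counts. A sketch (it assumes the row/col/weight lists and the structures above already exist):

from math import log

num_window = len(windows)
for pair, count in word_pair_count.items():
    i, j = (int(t) for t in pair.split(','))
    freq_i = word_window_freq[vocab[i]]       # windows containing word i
    freq_j = word_window_freq[vocab[j]]       # windows containing word j
    pmi = log((count / num_window) /
              ((freq_i / num_window) * (freq_j / num_window)))
    if pmi <= 0:                              # only positive PMI becomes an edge
        continue
    row.append(train_size + i)                # word node i
    col.append(train_size + j)                # word node j
    weight.append(pmi)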
Question: what about the rest of the matrix, e.g. the train_size × train_size block?
doc_word_freq
{'doc_id1,word_id1': 3, …}
This means the word with id word_id1 appears 3 times in the document with id doc_id1.
TF-IDF(doc word)
This answers the question above:
row:
[train_size+i, …, train_size+vocab_size-1 | 0, 1, 2, …, train_size-1 | train_size+vocab_size, …, train_size+vocab_size+test_size-1]
(first the word-node rows from the PMI edges above, then the rows of the training documents, then the rows of the test documents)
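A sketch of the doc-word (TF-IDF) edges that fill those blocks, roughly what build_graph.py does (again assuming the row/col/weight lists and the structures from above):

from math import log

total_docs = len(shuffle_doc_words_list)            # train + test documents
for pair, freq in doc_word_freq.items():
    doc_id, word_id = (int(t) for t in pair.split(','))
    if doc_id < train_size:
        row.append(doc_id)                          # training-doc node
    else:
        row.append(doc_id + vocab_size)             # test-doc node, shifted past the word nodes
    col.append(train_size + word_id)                # word node
    idf = log(total_docs / word_doc_freq[vocab[word_id]])
    weight.append(freq * idf)                       # TF-IDF edge weight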
adj = sp.csr_matrix(
(weight, (row, col)), shape=(node_size, node_size))
python train.py 20ng
load_corpus
- adj: the puzzling line adj = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj): wherever adj.T > adj, the entry is replaced by the transposed value, so the net effect is an element-wise max(adj, adj.T), i.e. the adjacency matrix is symmetrized (see the toy check after this list)
- features: (train_size(doc) + vocab_size + test_size) x 300
- y_train: (train_size(doc) + vocab_size + test_size) x label_num, but only the positions that are 1 in [1,1,1(real_train_size), 0,0,0,…] carry a label; everything else is unlabeled
- y_val: [0,0,0(real_train_size), 1,1,1(val_size), 0,0,0,…]
- y_test: [0,0,0(real_train_size), 0,0,0(val_size), 0,0,0(vocab_size), 1,1,1(test_size)]
- train_mask
- val_mask
- test_mask
- train_size
- test_size
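A quick toy check that the adj expression above really is an element-wise max(adj, adj.T):

import numpy as np
import scipy.sparse as sp

adj = sp.csr_matrix(np.array([[0., 2., 0.],
                              [1., 0., 3.],
                              [0., 0., 0.]]))
sym = adj + adj.T.multiply(adj.T > adj) - adj.multiply(adj.T > adj)
print(sym.toarray())
# [[0. 2. 0.]
#  [2. 0. 3.]
#  [0. 3. 0.]]  == np.maximum(adj.toarray(), adj.toarray().T)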
features = sp.identity(features.shape[0])
The code below walks through it with features = sp.identity(3) as input.
import scipy.sparse as sp
import numpy as np
def preprocess_features(features):
    """Row-normalize feature matrix and convert to tuple representation"""
    # features (the sp.identity(3) input):
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    rowsum = np.array(features.sum(1))      # sum of each row
    print("rowsum", rowsum)
    # rowsum:
    # [[1.]
    #  [1.]
    #  [1.]]
    r_inv = np.power(rowsum, -1).flatten()  # reciprocal of each row sum; multiplying every element of a row by it normalizes that row
    print("np.power(rowsum, -1)", np.power(rowsum, -1))
    # r_inv: [1. 1. 1.]
    r_inv[np.isinf(r_inv)] = 0.             # rows summing to 0 give inf; set those to 0
    r_mat_inv = sp.diags(r_inv)             # diagonal matrix of the reciprocals, used to multiply the original matrix
    print("r_mat_inv", r_mat_inv)
    # r_mat_inv:
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    features = r_mat_inv.dot(features)      # matrix multiplication: the actual row-normalization
    print("feature", features)
    # features:
    #   (0, 0)  1.0
    #   (1, 1)  1.0
    #   (2, 2)  1.0
    return sparse_to_tuple(features)
def sparse_to_tuple(sparse_mx):
    """Convert sparse matrix to tuple representation."""
    def to_tuple(mx):
        if not sp.isspmatrix_coo(mx):
            mx = mx.tocoo()
        coords = np.vstack((mx.row, mx.col)).transpose()
        values = mx.data
        shape = mx.shape
        return coords, values, shape  # (row, col) coordinates, the corresponding values, the matrix shape
    if isinstance(sparse_mx, list):
        for i in range(len(sparse_mx)):
            sparse_mx[i] = to_tuple(sparse_mx[i])
    else:
        sparse_mx = to_tuple(sparse_mx)
    return sparse_mx
So preprocess_features(features) ultimately returns (coords, values, shape); for the identity(3) example that is coords = [[0,0],[1,1],[2,2]], values = [1. 1. 1.], shape = (3, 3).
support = [preprocess_adj(adj)]
def normalize_adj(adj):
    """Symmetrically normalize adjacency matrix."""
    adj = sp.coo_matrix(adj)
    rowsum = np.array(adj.sum(1))
    d_inv_sqrt = np.power(rowsum, -0.5).flatten()
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.
    d_mat_inv_sqrt = sp.diags(d_inv_sqrt)
    return adj.dot(d_mat_inv_sqrt).transpose().dot(d_mat_inv_sqrt).tocoo()

def preprocess_adj(adj):
    """Preprocessing of adjacency matrix for simple GCN model and conversion to tuple representation."""
    adj_normalized = normalize_adj(adj + sp.eye(adj.shape[0]))
    return sparse_to_tuple(adj_normalized)
This should be exactly the renormalization from the paper (to look at more closely).
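For reference, normalize_adj plus the added self-loops implement the standard GCN renormalization from Kipf & Welling, which Text GCN reuses:
$\tilde{A} = A + I_N$, $\quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$, $\quad \hat{A} = \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}$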
placeholders
placeholders = {
    'support': [tf.sparse_placeholder(tf.float32) for _ in range(num_supports)],
    'features': tf.sparse_placeholder(tf.float32, shape=tf.constant(features[2], dtype=tf.int64)),
    'labels': tf.placeholder(tf.float32, shape=(None, y_train.shape[1])),
    'labels_mask': tf.placeholder(tf.int32),
    'dropout': tf.placeholder_with_default(0., shape=()),
    # helper variable for sparse dropout
    'num_features_nonzero': tf.placeholder(tf.int32)
}
The sparse_placeholder here has to be fed (indices, values, shape), i.e. exactly the three things preprocess_features returns.
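From memory, the helper that builds the feed dict looks roughly like construct_feed_dict in utils.py (inherited from Kipf's gcn code); treat this as a sketch rather than the exact implementation:

def construct_feed_dict(features, support, labels, labels_mask, placeholders):
    feed_dict = dict()
    feed_dict.update({placeholders['labels']: labels})
    feed_dict.update({placeholders['labels_mask']: labels_mask})
    feed_dict.update({placeholders['features']: features})          # the (coords, values, shape) tuple
    feed_dict.update({placeholders['support'][i]: support[i] for i in range(len(support))})
    feed_dict.update({placeholders['num_features_nonzero']: features[1].shape})  # number of nonzeros
    return feed_dict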
create_model
A two-layer GCN:
self.layers.append(GraphConvolution) x 2
Then build() constructs the static graph:
for layer in self.layers:
    hidden = layer(self.activations[-1])
    self.activations.append(hidden)
self.outputs = self.activations[-1]
Store the variables:
variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
self.vars = {var.name: var for var in variables}
loss
# Why only the first layer? (L2 weight decay is applied to the first layer's weights only, following Kipf's GCN)
for var in self.layers[0].vars.values():
    self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)
# The loss is computed only over nodes that carry a label (masked)
def masked_softmax_cross_entropy(preds, labels, mask):
    """Softmax cross-entropy loss with masking."""
    print(preds)
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)
    loss *= mask
    return tf.reduce_mean(loss)
acc
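acc presumably mirrors the masked loss: only masked-in nodes count. A sketch along the lines of the repo's masked_accuracy:

def masked_accuracy(preds, labels, mask):
    """Accuracy with masking: only labeled (masked-in) nodes contribute."""
    correct_prediction = tf.equal(tf.argmax(preds, 1), tf.argmax(labels, 1))
    accuracy_all = tf.cast(correct_prediction, tf.float32)
    mask = tf.cast(mask, dtype=tf.float32)
    mask /= tf.reduce_mean(mask)       # rescale so the mean over all nodes equals the mean over masked nodes
    accuracy_all *= mask
    return tf.reduce_mean(accuracy_all)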
opt_op
self.opt_op = self.optimizer.minimize(self.loss)
train (the forward computation)
First, the inputs are passed into the first layer:
self.layers.append(GraphConvolution(input_dim=self.input_dim,
                                    output_dim=FLAGS.hidden1,  # 200
                                    placeholders=self.placeholders,
                                    act=tf.nn.relu,
                                    dropout=True,
                                    featureless=True,
                                    sparse_inputs=True,
                                    logging=self.logging))
First, dropout is applied (sparse dropout):
# dropout
if self.sparse_inputs:
    x = sparse_dropout(x, 1-self.dropout, self.num_features_nonzero)
else:
    x = tf.nn.dropout(x, 1-self.dropout)
def sparse_dropout(x, keep_prob, noise_shape):
    """Dropout for sparse tensors."""
    random_tensor = keep_prob                           # probability of keeping an element
    random_tensor += tf.random_uniform(noise_shape)     # noise_shape = number of nonzeros in the features; one uniform [0,1) value per nonzero is added to keep_prob
    dropout_mask = tf.cast(tf.floor(random_tensor), dtype=tf.bool)  # floors to 1 (True) when the sum is >= 1, i.e. with probability keep_prob; else 0 (False)
    pre_out = tf.sparse_retain(x, dropout_mask)         # keep only the elements of x where dropout_mask is True
    return pre_out * (1./keep_prob)                     # inverted dropout: rescale the survivors by 1/keep_prob so the expected value stays the same
Then the convolve step. The first layer has featureless=True, which looks odd at first: if featureless is True, the computation is just the normalized adjacency matrix times the weight matrix, with the features left out entirely? It only makes sense once you realize the input features are an identity matrix, so X · W = W and the multiplication can simply be skipped:
supports = list()
for i in range(len(self.support)):
    if not self.featureless:
        pre_sup = dot(x, self.vars['weights_' + str(i)],
                      sparse=self.sparse_inputs)
    else:
        pre_sup = self.vars['weights_' + str(i)]
    support = dot(self.support[i], pre_sup, sparse=True)
    supports.append(support)
output = tf.add_n(supports)  # tf.add_n sums all the support tensors in the list element-wise
def dot(x, y, sparse=False):
    """Wrapper for tf.matmul (sparse vs dense)."""
    if sparse:
        res = tf.sparse_tensor_dense_matmul(x, y)
    else:
        res = tf.matmul(x, y)
    return res
bias & embedding: the embedding is just the first layer's output, while self.act(output) is what gets passed on to the next layer:
# bias
if self.bias:
    output += self.vars['bias']
self.embedding = output  # saved as the node embedding (before the activation)
return self.act(output)
Then self.act(output) is fed into the second layer:
self.layers.append(GraphConvolution(input_dim=FLAGS.hidden1,
                                    output_dim=self.output_dim,
                                    placeholders=self.placeholders,
                                    act=lambda x: x,
                                    dropout=True,
                                    logging=self.logging))
Note that featureless is False here (the default).
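Putting both layers together, this is the two-layer model from the Text GCN paper, $Z = \mathrm{softmax}\big(\hat{A}\,\mathrm{ReLU}(\hat{A} X W_0)\,W_1\big)$; since $X = I$ here, the first layer can skip the multiplication by $X$, which is exactly what featureless=True exploits.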
Miscellaneous odd errors
AttributeError: module 'tensorflow' has no attribute 'random_uniform'
Fix:
https://blog.csdn.net/weixin_43763859/article/details/104537392
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11
I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Fix (install cudatoolkit=11.0):
https://blog.csdn.net/qq_28193019/article/details/103146116
Cannot use GPU when output.shape[1] * nnz(a) > 2^31
Fix:
https://blog.csdn.net/weixin_35970195/article/details/112585490