TensorFlow Study Notes (3): Implementing Word2Vec in TensorFlow



By the way, my WeChat official account 【野指针小李】 is now live; I look forward to discussing these topics with you there!

0 Preface

The code in this article comes from Natural Language Processing with TensorFlow (《TensorFlow自然语言处理》) by Thushan Ganegedara. On top of the author's code I have added some comments of my own (the author's original comments are in English; the additional comments are mine). The code has been uploaded to GitHub; here is the link.

If anything is wrong or not explained clearly, please leave a comment below and I will fix it once I see it.

For the underlying theory of Word2Vec and its two optimization techniques, hierarchical softmax and negative sampling, see my two earlier articles: Word2Vec原理与公式详细推导 and Word2Vec之Hierarchical Softmax与Negative Sampling.

The TensorFlow version used here is 1.8.0.

1 Preparing the dataset

There is not much to say about this step: it just downloads the data.

url = 'http://www.evanjones.ca/software/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        print('Downloading file...')
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('wikipedia2text-extracted.txt.bz2', 18377035)

I am not sure why, but opening this URL in a browser gives a Not Found page, even though the download itself still works.

2 Reading the data without preprocessing

def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words"""

    with bz2.BZ2File(filename) as f:
        data = []
        file_string = f.read().decode('utf-8')
        file_string = nltk.word_tokenize(file_string)
        data.extend(file_string)
    return data
  
words = read_data(filename)
print('Data size %d' % len(words))
print('Example words (start): ',words[:10])
print('Example words (end): ',words[-10:])

This step reads the data from the downloaded file and tokenizes it. Since no preprocessing is done and there are more than ten million tokens, this line runs very slowly. Moreover, the later sections do not actually use this unpreprocessed version of the data.

The output is as follows:

Data size 11634727
Example words (start):  ['Propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end):  ['useless', 'for', 'cultivation', '.', 'and', 'people', 'have', 'sex', 'there', '.']

3 Reading the data with preprocessing

def read_data(filename):
    """
    Extract the first file enclosed in a zip file as a list of words
    and pre-processes it using the nltk python library
    """

    with bz2.BZ2File(filename) as f:

        data = []
        file_size = os.stat(filename).st_size
        chunk_size = 1024 * 1024 # reading 1 MB at a time as the dataset is moderately large
        print('Reading data...')
        for i in range(ceil(file_size//chunk_size)+1):
            bytes_to_read = min(chunk_size,file_size-(i*chunk_size))
            file_string = f.read(bytes_to_read).decode('utf-8')
            file_string = file_string.lower()
            # tokenizes a string to words residing in a list
            file_string = nltk.word_tokenize(file_string)
            data.extend(file_string)
    return data

words = read_data(filename)
print('Data size %d' % len(words))
print('Example words (start): ',words[:10])
print('Example words (end): ',words[-10:])

The preprocessing here has two main parts: first, the file is read 1 MB at a time; second, all words are converted to lowercase.

The output is as follows:

Reading data...
Data size 3361192
Example words (start):  ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
Example words (end):  ['favorable', 'long-term', 'outcomes', 'for', 'around', 'half', 'of', 'those', 'diagnosed', 'with']

4 Building the dictionary

This step builds the mapping between words and IDs. Taking "I like to go to school" as an example, the mappings are:

  1. dictionary: the word-to-ID mapping (e.g. {'I': 0, 'like': 1, 'to': 2, 'go': 3, 'school': 4}).
  2. reverse_dictionary: the ID-to-word mapping, i.e. dictionary with keys and values swapped (e.g. {0: 'I', 1: 'like', 2: 'to', 3: 'go', 4: 'school'}).
  3. count: a list whose elements are (word, frequency) tuples (e.g. [('I', 1), ('like', 1), ('to', 2), ('go', 1), ('school', 1)]).
  4. data: the words of the text, represented by their IDs (e.g. [0, 1, 2, 3, 2, 4]).
  5. UNK: rare words, i.e. all words outside the 50000 most frequent ones.

# we restrict our vocabulary size to 50000
vocabulary_size = 50000 

def build_dataset(words):
    count = [['UNK', -1]]  # a list (not a tuple) because this -1 will be updated later
    # Gets only the vocabulary_size most common words as the vocabulary
    # All the other words will be replaced with UNK token
    # i.e. keep the 50000 most common words and map everything else to 'UNK'
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()

    # Create an ID for each word by giving the current length of the dictionary
    # And adding that item to the dictionary
    # word: the word, _: its frequency (count)
    # this step builds the word-to-ID mapping
    # dictionary: {'word1': 0, 'word2': 1, 'word3': 2, ...}
    for word, _ in count:
        dictionary[word] = len(dictionary)
    
    data = list()
    unk_count = 0  # counts how many UNK tokens there are
    # Traverse through all the text we have and produce a list
    # where each element corresponds to the ID of the word found at that index
    # if the word is in dictionary, use that word's ID
    # otherwise it is UNK, whose ID is 0 (UNK is also the first entry of dictionary)
    for word in words:
        # If word is in the dictionary use the word ID,
        # else use the ID of the special token "UNK"
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    
    # update the count variable with the number of UNK occurences
    # update count, i.e. fill in the number of UNK occurrences
    count[0][1] = unk_count
  
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
    # Make sure the dictionary is of size of the vocabulary
    assert len(dictionary) == vocabulary_size
    
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
del words  # Hint to reduce memory.

First, about count: the description above says that the elements of count are tuples, yet the first line of build_dataset defines count = [['UNK', -1]]. This is because the -1 still needs to be updated later, so count[0] is defined as a list. In other words, count looks like this:

count = [
		['UNK', 68751],
		('the', 226893),
		...
		('suggested', 336)
]

Next, the way dictionary is built is quite elegant: the words are taken from count (where they are already sorted from most to least frequent) and inserted into dictionary. Since count is produced by collections.Counter, it contains no duplicates; when a word is inserted, its ID is the current length of the dictionary, and because each iteration adds exactly one key-value pair, the length grows by 1 every round, so the IDs increase by 1 each time.

After that, the ID of each word in the text is looked up in dictionary and appended to data, and reverse_dictionary is built by swapping dictionary's keys and values.

The output is as follows:

Most common words (+UNK) [['UNK', 68751], ('the', 226893), (',', 184013), ('.', 120919), ('of', 116323)]
Sample data [1721, 9, 8, 16479, 223, 4, 5168, 4459, 26, 11597]
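
To see these mappings concretely, here is a minimal sketch of the same logic applied to the toy sentence "I like to go to school", with a deliberately tiny vocabulary so that 'school' falls into UNK (the toy_* names are mine, and the dict comprehension replaces the author's loop but produces the same result):

import collections

toy_words = "I like to go to school".split()
toy_vocabulary_size = 5  # 'UNK' + the 4 most common words

toy_count = [['UNK', -1]]
toy_count.extend(collections.Counter(toy_words).most_common(toy_vocabulary_size - 1))
toy_dictionary = {word: idx for idx, (word, _) in enumerate(toy_count)}
toy_data = [toy_dictionary.get(word, 0) for word in toy_words]
toy_count[0][1] = toy_data.count(0)
toy_reverse_dictionary = dict(zip(toy_dictionary.values(), toy_dictionary.keys()))

print(toy_count)       # [['UNK', 1], ('to', 2), ('I', 1), ('like', 1), ('go', 1)]
print(toy_dictionary)  # {'UNK': 0, 'to': 1, 'I': 2, 'like': 3, 'go': 4}
print(toy_data)        # [2, 3, 1, 4, 1, 0] -> 'school' is mapped to UNK (0)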

5 Defining the skip-gram batch

Skip-gram predicts the context from a single input word; its network structure is shown below:

(Figure: the skip-gram network structure)
Note that the hidden layer of Word2Vec is linear, i.e. there is no activation function in the hidden layer.

This step defines the inputs and their labels. Let batch be the input words and labels the context words of those inputs; let span be the window plus the target word, of size $2 \times {\rm window\_size} + 1$. Since window_size is the window size on one side only, it is multiplied by 2, which means the context itself contains $2 \times {\rm window\_size}$ words.

data_index = 0

def generate_batch_skip_gram(batch_size, window_size):
    # data_index is updated by 1 everytime we read a data point
    # refer to the variable defined outside the function
    global data_index 
    # print('global data_index:', data_index)
    # print('batch_size:', batch_size)
    # print('window_size:', window_size)
    
    # two numpy arras to hold target words (batch)
    # and context words (labels)
    # batch: uninitialized array of shape (batch_size,)
    # labels: uninitialized array of shape (batch_size, 1)
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    
    # span defines the total window size, where
    # data we consider at an instance looks as follows. 
    # [ skip_window target skip_window ]
    # span: 2 * window size + 1, i.e. left/right context + target word
    span = 2 * window_size + 1 
    
    # The buffer holds the data contained within the span
    # a deque allows appending from both the left and the right
    buffer = collections.deque(maxlen=span)
  
    # Fill the buffer and update the data_index
    # append word IDs to the buffer
    # after this loop, data_index points to the position right after the window
    # e.g. window_size = 2, span = 5, so data_index = 5 when the loop ends
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # print('data_index: ', data_index)
    # print('buffer before loop:', buffer)
    
    # This is the number of context words we sample for a single target word
    # size of the context
    num_samples = 2*window_size 

    # We break the batch reading into two for loops
    # The inner for loop fills in the batch and labels with 
    # num_samples data points using data contained within the span
    # The outper for loop repeat this for batch_size//num_samples times
    # to produce a full batch
    # suppose window_size = 2 and batch_size = 8
    # range(batch_size // num_samples): 0, 1
    for i in range(batch_size // num_samples):
        k=0
        # avoid the target word itself as a prediction
        # fill in batch and label numpy arrays
        # suppose window_size = 2
        # list(range(window_size)): [0, 1]
        # list(range(window_size + 1, 2 * window_size + 1)): [3, 4]
        # so j loops over [0, 1, 3, 4], skipping 2, i.e. the target word
        for j in list(range(window_size))+list(range(window_size+1,2*window_size+1)):
            batch[i * num_samples + k] = buffer[window_size]
            labels[i * num_samples + k, 0] = buffer[j]
            k += 1 
    
        # Everytime we read num_samples data points,
        # we have created the maximum number of datapoints possible
        # withing a single span, so we need to move the span by 1
        # to create a fresh new span
        # the buffer's maxlen is only span (context + target),
        # so appending a new element pushes the first element out,
        # which slides the window one step to the right
        buffer.append(data[data_index])
        # print('buffer after change:', buffer)
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for window_size in [1, 2]:
    data_index = 0
    # batch: the target words
    # labels: the context words
    # labels[i] is a context word of the target word batch[i]
    batch, labels = generate_batch_skip_gram(batch_size=8, window_size=window_size)
    print('\nwith window_size = %d:' %window_size)
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

Here buffer is a queue created with collections.deque(). Such a queue can be pushed and popped on both ends, and it can be given a fixed length: once the maximum length is reached, a newly appended element pushes out the element furthest away (appending on the right pops the leftmost element).

In other words, buffer holds the target word and its context words for the current iteration. Take the earlier example "I like to go to school": with window_size=1 we have span=3, so in the first round buffer holds ['I', 'like', 'to'], batch gets 'like' and labels gets ['I', 'to'].
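
Since this deque behaviour is the heart of the sliding window, here is a tiny standalone sketch (toy values of my own) showing that with maxlen=span, appending one more element pushes the oldest one out, which is exactly the "slide the window one step to the right" operation:

import collections

toy_buffer = collections.deque(maxlen=3)   # window_size = 1  ->  span = 3
for w in ['I', 'like', 'to']:
    toy_buffer.append(w)
print(toy_buffer)                          # deque(['I', 'like', 'to'], maxlen=3)

toy_buffer.append('go')                    # 'I' is pushed out on the left
print(toy_buffer)                          # deque(['like', 'to', 'go'], maxlen=3)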

There is one detail worth noting, namely this piece of code:

for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)

The core is data_index = (data_index + 1) % len(data): even in the final iteration, data_index is incremented one more time. For example, with window_size=1 (span=3), data_index equals 3 after the loop. Combine that with this code from the later loop:

buffer.append(data[data_index])
# print('buffer after change:', buffer)
data_index = (data_index + 1) % len(data)

and the sliding window falls out naturally: at the end of the loop body, buffer appends the element at data_index=3 and pushes out the element at data_index=0.

The output is as follows:

data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']

with window_size = 1:
    batch: ['is', 'is', 'a', 'a', 'concerted', 'concerted', 'set', 'set']
    labels: ['propaganda', 'a', 'is', 'concerted', 'a', 'set', 'concerted', 'of']

with window_size = 2:
    batch: ['a', 'a', 'a', 'a', 'concerted', 'concerted', 'concerted', 'concerted']
    labels: ['propaganda', 'is', 'concerted', 'set', 'is', 'a', 'set', 'of']

In this output, the indices of batch and labels correspond one to one: labels[i] is a context word of the target word batch[i].
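
As a quick sanity check (a small sketch of my own), you can pair every target word with the context word sampled for it; the indices in batch and labels line up one to one:

pairs = [(reverse_dictionary[b], reverse_dictionary[l])
         for b, l in zip(batch, labels.reshape(-1))]
print(pairs)
# with window_size = 2 this prints
# [('a', 'propaganda'), ('a', 'is'), ('a', 'concerted'), ('a', 'set'), ('concerted', 'is'), ...]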

6 Skip-gram

6.1 Defining the hyperparameters

The hyperparameters defined here are:

  1. batch_size: the number of samples in a single batch
  2. embedding_size: the size of the embedding vector (the number of hidden-layer neurons)
  3. window_size: the size of the context window
  4. valid_size: the number of validation words to pick
  5. valid_window: the validation window size (validation indices are sampled at random from this window)
  6. num_sampled: the number of negative samples

batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
window_size = 4 # How many words to consider left and right.

# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50

# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)

num_sampled = 32 # Number of negative examples to sample.

Here random.sample() draws elements at random from a sequence without affecting the ordering of the sequence itself.
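
A quick sketch of that behaviour (toy values of my own): random.sample draws without replacement and leaves the source sequence untouched.

import random

pool = list(range(10))
picked = random.sample(pool, 4)
print(picked)  # e.g. [7, 0, 3, 9] -- four distinct values (differs per run)
print(pool)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- the original list is unchanged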

6.2 Defining the input and output placeholders

tf.reset_default_graph()

# Training input data (target word IDs).
train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

train_dataset: a length-$128$ vector holding the target-word IDs fed in each batch.
train_labels: of size $128 \times 1$, the labels (context-word IDs) for each batch.
valid_dataset: a length-$32$ constant holding the validation word IDs.

6.3 Defining the model parameters and other variables

# Variables

# Embedding layer, contains the word embeddings
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Softmax Weights and Biases
softmax_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=0.5 / math.sqrt(embedding_size))
)
softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01))

truncated_normal(): draws from a truncated normal distribution; samples that fall more than two standard deviations from the mean are redrawn.
embeddings: $W$, the input-to-hidden weight matrix, a $50000 \times 128$ tensor, uniformly distributed in $[-1, 1]$.
softmax_weights: $W'$, the hidden-to-output weight matrix, a $50000 \times 128$ tensor, truncated normal with mean 0 and standard deviation $\frac{0.5}{\sqrt{128}}$.
softmax_biases: $b$, the output-layer bias vector of length $50000$, uniformly distributed in $[0, 0.01]$.

You might expect softmax_weights to be of size $128 \times 50000$. It is $50000 \times 128$ instead because of tf.nn.sampled_softmax_loss(), which consumes these weights below; its documentation defines the weights argument as:

weights: A Tensor of shape [num_classes, dim], or a list of Tensor objects whose concatenation along dimension 0 has shape [num_classes, dim]. The (possibly-sharded) class embeddings.

If it is not clear which of num_classes and dim is which, check the definition of biases as well:

biases: A Tensor of shape [num_classes]. The class biases.

Now it is clear: num_classes is the number of output-layer neurons, so softmax_weights has size $50000 \times 128$.
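
To keep the shapes straight, here is a small bookkeeping sketch (my own summary, mirroring the variable names above) of what gets handed to the sampled softmax in the next section:

# shapes of the tensors fed to tf.nn.sampled_softmax_loss in this model
vocabulary_size, embedding_size, batch_size = 50000, 128, 128

shapes = {
    'softmax_weights': (vocabulary_size, embedding_size),  # [num_classes, dim]
    'softmax_biases':  (vocabulary_size,),                 # [num_classes]
    'embed (inputs)':  (batch_size, embedding_size),       # one embedding per target word
    'train_labels':    (batch_size, 1),                    # [batch_size, num_true]
}
for name, shape in shapes.items():
    print('%-16s %s' % (name, shape))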

6.4 Defining the model computation

For the model computation, we first use the lookup function embedding_lookup() to map the given inputs to their hidden-layer vectors, and we also define the negative-sampling loss tf.nn.sampled_softmax_loss.

# Model.
# Look up embeddings for a batch of inputs.
embed = tf.nn.embedding_lookup(embeddings, train_dataset)

# Compute the softmax loss, using a sample of the negative labels each time.
# this computes the average loss over the batch
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(
        weights=softmax_weights, biases=softmax_biases, inputs=embed,
        labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size)
)

As for why the inputs can be used directly to look up embedding vectors, I will keep you in suspense for now; it is explained together with the model run later on.

6.5 Computing word similarity

Cosine similarity is used here to measure how similar two words are.

# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

The formula for cosine similarity is:

$\cos\theta = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}| \times |\vec{B}|}$

norm: the norm of each row of the matrix (an $N \times 1$ vector).
normalized_embeddings: this is TensorFlow's broadcast division: if matrix $\mathbf{A}$ has size $N \times M$, the divisor $\vec{v}$ must have size $N \times 1$, and every element in row $i$ of $\mathbf{A}$ is divided by the $i$-th entry of $\vec{v}$. The result is that every element of embeddings is divided by the norm of its own row, i.e. an L2 normalization.
valid_embeddings: extracts the rows of normalized_embeddings that belong to the validation words.
similarity: after the L2 normalization every row has norm 1, i.e. $|\vec{A}| \times |\vec{B}| = 1$, so the cosine similarity reduces to $\cos\theta = \vec{A} \cdot \vec{B}$ and the only thing left to do is the dot product of the two vectors.
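
Here is a small numpy sketch (my own) of why the normalization removes the denominator: after dividing each row by its L2 norm, the plain dot product of two rows already equals their cosine similarity.

import numpy as np

emb = np.random.uniform(-1.0, 1.0, size=(5, 8))
row_norm = np.sqrt(np.sum(np.square(emb), axis=1, keepdims=True))
normalized = emb / row_norm

cos_full = emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))
cos_norm = normalized[0] @ normalized[1]
print(np.allclose(cos_full, cos_norm))  # True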

6.6 The optimizer

The optimizer is Adagrad with the learning rate set to 1.0. Adagrad addresses the problem that different parameters should be updated at different rates: it adaptively assigns each parameter its own learning rate.
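
For reference, the standard per-parameter Adagrad update is roughly what tf.train.AdagradOptimizer performs (a sketch of the textbook rule, not TensorFlow's exact implementation):

$G_{t,i} = G_{t-1,i} + g_{t,i}^{2}, \qquad \theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}$

where $\eta$ is the base learning rate (1.0 here), $g_{t,i}$ is the gradient of parameter $\theta_i$ at step $t$, and $\epsilon$ (or an initial accumulator value) keeps the denominator positive. Frequently updated parameters accumulate a large $G$ and therefore take smaller effective steps.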

# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

6.7 Running skip-gram

num_steps = 100001
skip_losses = []
# ConfigProto is a way of providing various configuration settings 
# required to execute the graph
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    # Initialize the variables in the graph
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0

    # Train the Word2vec model for num_step iterations
    for step in range(num_steps):

        # Generate a single batch of data
        batch_data, batch_labels = generate_batch_skip_gram(
            batch_size, window_size)

        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss
        # _: discarded; it just runs the optimizer op
        # l: the loss value
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)

        # Update the average loss variable
        average_loss += l
        
        # every 2000 steps, compute the average loss over those steps
        if (step + 1) % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000

            skip_losses.append(average_loss)
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step + 1, average_loss))
            average_loss = 0

        # Evaluating validation set word similarities
        if (step + 1) % 10000 == 0:
            sim = similarity.eval()
            # Here we compute the top_k closest words for a given validation word
            # in terms of the cosine distance
            # We do this for all the words in the validation set
            # Note: This is an expensive step
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]  # the validation word itself
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k + 1]  # indices sorted by similarity from high to low, skipping the word itself
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    skip_gram_final_embeddings = normalized_embeddings.eval()
    
# We will save the word vectors learned and the loss over time
# as this information is required later for comparisons
np.save('skip_embeddings',skip_gram_final_embeddings)

with open('skip_losses.csv', 'wt') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(skip_losses)

The main difficulty here, I think, is the question raised in 6.4: why can we use the input directly to look up a row of the input-to-hidden weight matrix $W$?

You could call this the author showing off with a code shortcut, or simply say it hurts readability. Look at the code: train_dataset receives batch_data, which (as noted in section 5) contains the IDs of the input words. This train_dataset is then used in the embed lookup of 6.4 to pull vectors out of embeddings, which finally go into loss. Since the input layer of Word2Vec takes each word's one-hot encoding, the hidden layer receives $\mathbf{h} = \mathbf{W}^{\rm T}\mathbf{x} = \mathbf{W}_{k,\cdot}^{\rm T}$, i.e. the $k$-th row of the matrix $\mathbf{W}$. So directly looking up the row of $\mathbf{W}$ that corresponds to the input word yields exactly $\mathbf{W}_{k,\cdot}^{\rm T}$.
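
A small numpy sketch (my own) of that argument: multiplying a one-hot row vector by $W$ simply selects the $k$-th row of $W$, so looking the row up directly is equivalent and much cheaper.

import numpy as np

V, N = 6, 4                      # tiny vocabulary / embedding size for illustration
W = np.random.randn(V, N)        # stands in for the input-to-hidden weights
k = 3                            # a word ID

one_hot = np.zeros(V)
one_hot[k] = 1.0

h_matmul = one_hot @ W           # hidden activation computed from the one-hot input
h_lookup = W[k]                  # what tf.nn.embedding_lookup effectively returns
print(np.allclose(h_matmul, h_lookup))  # True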

The rest computes word similarities and prints the loss; the comments should be enough to understand it.

6.8 Visualizing what skip-gram has learned

I only glanced at the visualization part, since the focus is on the training process, so here I just paste the code together with the notes I wrote while reading it.

6.8.1 Keeping only clustered words rather than sparsely distributed ones

def find_clustered_embeddings(embeddings,distance_threshold,sample_threshold):
    ''' 
    Find only the closely clustered embeddings. 
    This gets rid of more sparsly distributed word embeddings and make the visualization clearer
    This is useful for t-SNE visualization
    
    distance_threshold: maximum distance between two points to qualify as neighbors
    sample_threshold: number of neighbors required to be considered a cluster
    '''
    
    # calculate cosine similarity
    cosine_sim = np.dot(embeddings,np.transpose(embeddings))
    norm = np.dot(np.sum(embeddings**2,axis=1).reshape(-1,1),np.sum(np.transpose(embeddings)**2,axis=0).reshape(1,-1))
    assert cosine_sim.shape == norm.shape
    cosine_sim /= norm  # the cosine similarity
    
    # make all the diagonal entries zero otherwise this will be picked as highest
    # a diagonal entry is the dot product of word i's vector with itself,
    # i.e. we ignore each word's similarity with itself (it is always the highest)
    np.fill_diagonal(cosine_sim, -1.0)  # set the diagonal of cosine_sim to -1
    
    argmax_cos_sim = np.argmax(cosine_sim, axis=1)
    mod_cos_sim = cosine_sim
    # find the maximums in a loop to count if there are more than n items above threshold
    # on each pass, set the current per-row maximum to -1
    for _ in range(sample_threshold-1):
        argmax_cos_sim = np.argmax(cosine_sim, axis=1)  # pick each row's current maximum
        mod_cos_sim[np.arange(mod_cos_sim.shape[0]),argmax_cos_sim] = -1  # and set it to -1
    
    max_cosine_sim = np.max(mod_cos_sim,axis=1)  # after the loop, the largest remaining value in each row

    return np.where(max_cosine_sim>distance_threshold)[0]

6.8.2 Computing the t-SNE visualization of the embeddings with sklearn

num_points = 1000 # we will use a large sample space to build the T-SNE manifold and then prune it using cosine similarity

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)

print('Fitting embeddings to T-SNE. This can take some time ...')
# get the T-SNE manifold
selected_embeddings = skip_gram_final_embeddings[:num_points, :]  # take the first 1000 embedding vectors
two_d_embeddings = tsne.fit_transform(selected_embeddings)

print('Pruning the T-SNE embeddings')
# prune the embeddings by getting ones only more than n-many sample above the similarity threshold
# this unclutters the visualization
selected_ids = find_clustered_embeddings(selected_embeddings,.25,10)  # IDs of the clustered words
two_d_embeddings = two_d_embeddings[selected_ids,:]

print('Out of ',num_points,' samples, ', selected_ids.shape[0],' samples were selected by pruning')

6.8.3 Plotting the t-SNE result with matplotlib

def plot(embeddings, labels):
  
    n_clusters = 20 # number of clusters
    # automatically build a discrete set of colors, each for cluster
    label_colors = [pylab.cm.Spectral(float(i) /n_clusters) for i in range(n_clusters)]
  
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  
    # Define K-Means
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(embeddings)
    kmeans_labels = kmeans.labels_
  
    pylab.figure(figsize=(15,15))  # in inches
    
    # plot all the embeddings and their corresponding words
    for i, (label,klabel) in enumerate(zip(labels,kmeans_labels)):
        x, y = embeddings[i,:]
        pylab.scatter(x, y, c=label_colors[klabel])    

        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom',fontsize=10)

    # use for saving the figure if needed
    #pylab.savefig('word_embeddings.png')
    pylab.show()

words = [reverse_dictionary[i] for i in selected_ids]
plot(two_d_embeddings, words)

The resulting plot:
(Figure: t-SNE visualization of the skip-gram embeddings)

7 CBOW

7.1 Changing the data generation process

Because CBOW's network structure differs from skip-gram's, the data generation process has to change. The CBOW network structure is shown below:
(Figure: the CBOW network structure)
Since CBOW takes multiple inputs, the input size changes from ${\rm batch\_size} \times 1$ to ${\rm batch\_size} \times (2 \times {\rm context\_window})$.

data_index = 0

def generate_batch_cbow(batch_size, window_size):
    # window_size is the amount of words we're looking at from each side of a given word
    # creates a single batch
    # the inputs are the context words of word i, with window_size words on each side
    
    # data_index is updated by 1 everytime we read a set of data point
    global data_index

    # span defines the total window size, where
    # data we consider at an instance looks as follows. 
    # [ skip_window target skip_window ]
    # e.g if skip_window = 2 then span = 5
    # left and right context words + the target word
    span = 2 * window_size + 1 # [ skip_window target skip_window ]

    # two numpy arras to hold target words (batch)
    # and context words (labels)
    # Note that batch has span-1=2*window_size columns
    # batch: the context words of the target word
    # labels: the target word
    batch = np.ndarray(shape=(batch_size,span-1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    
    # The buffer holds the data contained within the span
    buffer = collections.deque(maxlen=span)

    # Fill the buffer and update the data_index
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    # Here we do the batch reading
    # We iterate through each batch index
    # For each batch index, we iterate through span elements
    # to fill in the columns of batch array
    # i: index of the target word within the batch
    for i in range(batch_size):
        target = window_size  # target label at the center of the buffer
        target_to_avoid = [ window_size ] # we only need to know the words around a given word, not the word itself

        # add selected target to avoid_list for next time
        col_idx = 0  # column index of the context word within batch
        # j: index of the context word within buffer
        for j in range(span):
            # ignore the target word when creating the batch
            if j==span//2:
                continue
            batch[i,col_idx] = buffer[j] 
            col_idx += 1
        labels[i, 0] = buffer[target]

        # Everytime we read a data point,
        # we need to move the span by 1
        # to create a fresh new span
        # slide the window
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)

    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])

for window_size in [1,2]:
    data_index = 0
    batch, labels = generate_batch_cbow(batch_size=8, window_size=window_size)
    print('\nwith window_size = %d:' % (window_size))
    print('    batch:', [[reverse_dictionary[bii] for bii in bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

This is similar to generate_batch_skip_gram() in section 5; the difference lies in batch and labels:

for i in range(batch_size):
    target = window_size  # target label at the center of the buffer
    target_to_avoid = [ window_size ] # we only need to know the words around a given word, not the word itself

    # add selected target to avoid_list for next time
    col_idx = 0  # column index of the context word within batch
    # j: index of the context word within buffer
    for j in range(span):
        # ignore the target word when creating the batch
        if j==span//2:
            continue
        batch[i,col_idx] = buffer[j] 
        col_idx += 1
    labels[i, 0] = buffer[target]

    # Everytime we read a data point,
    # we need to move the span by 1
    # to create a fresh new span
    # slide the window
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)

Here span // 2 is used to skip the target word. Since CBOW predicts the target word from its context, the target word becomes labels and the inputs become the target word's context. Because buffer holds the target word together with its context, the loop has to skip the target word's index. In each iteration, labels records the word to predict and batch records its context words.

The output is as follows:

data: ['propaganda', 'is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed']

with window_size = 1:
    batch: [['propaganda', 'a'], ['is', 'concerted'], ['a', 'set'], ['concerted', 'of'], ['set', 'messages'], ['of', 'aimed'], ['messages', 'at'], ['aimed', 'influencing']]
    labels: ['is', 'a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at']

with window_size = 2:
    batch: [['propaganda', 'is', 'concerted', 'set'], ['is', 'a', 'set', 'of'], ['a', 'concerted', 'of', 'messages'], ['concerted', 'set', 'messages', 'aimed'], ['set', 'of', 'aimed', 'at'], ['of', 'messages', 'at', 'influencing'], ['messages', 'aimed', 'influencing', 'the'], ['aimed', 'at', 'the', 'opinions']]
    labels: ['a', 'concerted', 'set', 'of', 'messages', 'aimed', 'at', 'influencing']
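
To make the inner loop above concrete, here is a worked sketch of a single iteration with window_size = 2 (using words instead of IDs for readability; these names are just stand-ins of my own):

buffer_example = ['propaganda', 'is', 'a', 'concerted', 'set']  # what the deque holds
span = 5
context = [buffer_example[j] for j in range(span) if j != span // 2]
target = buffer_example[span // 2]
print(context)  # ['propaganda', 'is', 'concerted', 'set']  -> one row of batch
print(target)   # 'a'                                       -> the corresponding label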

7.2 Defining the hyperparameters

batch_size = 128 # Data points in a single batch
embedding_size = 128 # Dimension of the embedding vector.
# How many words to consider left and right.
# Skip gram by design does not require to have all the context words in a given step
# However, for CBOW that's a requirement, so we limit the window size
window_size = 2 

# We pick a random validation set to sample nearest neighbors
valid_size = 16 # Random set of words to evaluate similarity on.
# We sample valid datapoints randomly from a large window without always being deterministic
valid_window = 50

# When selecting valid examples, we select some of the most frequent words as well as
# some moderately rare words as well
valid_examples = np.array(random.sample(range(valid_window), valid_size))
valid_examples = np.append(valid_examples,random.sample(range(1000, 1000+valid_window), valid_size),axis=0)

num_sampled = 32 # Number of negative examples to sample.

batch_size: the number of samples in a single batch
embedding_size: the size of the embedding vector (the hidden layer)
window_size: the size of the context window

7.3 Defining the inputs and outputs

tf.reset_default_graph()

# Training input data (target word IDs). Note that it has 2*window_size columns
train_dataset = tf.placeholder(tf.int32, shape=[batch_size,2*window_size])
# Training input label data (context word IDs)
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
# Validation input data, we don't need a placeholder
# as we have already defined the IDs of the words selected
# as validation data
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

train_dataset: the training inputs for each batch, of size $128 \times 4$ (each example has 4 context words).
train_labels: the input labels, of size $128 \times 1$ (each group of 4 context words has a single output word).
valid_dataset: the validation data.

7.4 Defining the model parameters and other variables

embeddings: the input-to-hidden weight matrix $W$, of size $V \times N$, uniformly distributed in $[-1, 1]$.

softmax_weights: the hidden-to-output weights $W'$, of size $V \times N$, truncated normal with mean 0 and standard deviation $\frac{0.5}{\sqrt{128}}$.

softmax_biases: the output-layer bias $b$, of length $V$, uniformly distributed in $[0, 0.01]$.

# Variables.

# Embedding layer, contains the word embeddings
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0,dtype=tf.float32))

# Softmax Weights and Biases
softmax_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                              stddev=0.5 / math.sqrt(embedding_size),dtype=tf.float32))
softmax_biases = tf.Variable(tf.random_uniform([vocabulary_size],0.0,0.01))

7.5 Defining the model computation

# Model.
# Look up embeddings for a batch of inputs.
# Here we do embedding lookups for each column in the input placeholder
# and then average them to produce an embedding_size word vector
stacked_embedings = None  # will hold the stacked context vectors for each target word
print('Defining %d embedding lookups representing each word in the context'%(2*window_size))
for i in range(2*window_size):
    embedding_i = tf.nn.embedding_lookup(embeddings, train_dataset[:,i])  # embeddings of the words in column i of train_dataset
    x_size,y_size = embedding_i.get_shape().as_list()
    if stacked_embedings is None:
        stacked_embedings = tf.reshape(embedding_i,[x_size,y_size,1])
    else:
        stacked_embedings = tf.concat(axis=2,values=[stacked_embedings,tf.reshape(embedding_i,[x_size,y_size,1])])

assert stacked_embedings.get_shape().as_list()[2]==2*window_size
print("Stacked embedding size: %s"%stacked_embedings.get_shape().as_list())
mean_embeddings =  tf.reduce_mean(stacked_embedings,2,keepdims=False)  # average the stacked input vectors
print("Reduced mean embedding size: %s"%mean_embeddings.get_shape().as_list())

# Compute the softmax loss, using a sample of the negative labels each time.
# inputs are embeddings of the train words
# with this loss we optimize weights, biases, embeddings
loss = tf.reduce_mean(
    tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases, inputs=mean_embeddings,
                               labels=train_labels, num_sampled=num_sampled, num_classes=vocabulary_size))

The output is as follows:

Defining 4 embedding lookups representing each word in the context
Stacked embedding size: [128, 128, 4]
Reduced mean embedding size: [128, 128]

This step mainly computes the input vector. Since CBOW takes several context words as input, their embeddings have to be averaged before being handed to the hidden layer. The for loop stacks the looked-up input vectors, and tf.reduce_mean() then averages them.
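
Here is a numpy sketch (my own) of that stack-then-average step: 2*window_size lookup results of shape [batch_size, embedding_size] are stacked along a third axis and averaged over it, leaving one [batch_size, embedding_size] input per example.

import numpy as np

batch_size, embedding_size, num_context = 128, 128, 4
context_embeddings = [np.random.randn(batch_size, embedding_size)
                      for _ in range(num_context)]

stacked = np.stack(context_embeddings, axis=2)  # shape [128, 128, 4]
mean_embeddings_np = stacked.mean(axis=2)       # shape [128, 128]
print(stacked.shape, mean_embeddings_np.shape)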

The loss is defined exactly as in the skip-gram model, so I will not repeat it.

7.6 The optimizer for the model parameters

Adagrad is used as the optimizer here as well.

# Optimizer.
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)

7.7 Computing word similarity

# Compute the similarity between minibatch examples and all embeddings.
# We use the cosine distance:
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

This is identical to the earlier definition, so again nothing to add.

7.8 Running the CBOW model

num_steps = 100001
cbow_losses = []

# ConfigProto is a way of providing various configuration settings 
# required to execute the graph
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as session:
    
    # Initialize the variables in the graph
    tf.global_variables_initializer().run()
    print('Initialized')
    
    average_loss = 0
    
    # Train the Word2vec model for num_step iterations
    for step in range(num_steps):
        
        # Generate a single batch of data
        batch_data, batch_labels = generate_batch_cbow(batch_size, window_size)
        
        # Populate the feed_dict and run the optimizer (minimize loss)
        # and compute the loss
        feed_dict = {train_dataset : batch_data, train_labels : batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        
        # Update the average loss variable
        average_loss += l
        
        if (step+1) % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
                # The average loss is an estimate of the loss over the last 2000 batches.
            cbow_losses.append(average_loss)
            print('Average loss at step %d: %f' % (step+1, average_loss))
            average_loss = 0
            
        # Evaluating validation set word similarities
        if (step+1) % 10000 == 0:
            sim = similarity.eval()
            # Here we compute the top_k closest words for a given validation word
            # in terms of the cosine distance
            # We do this for all the words in the validation set
            # Note: This is an expensive step
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    cbow_final_embeddings = normalized_embeddings.eval()
    

np.save('cbow_embeddings',cbow_final_embeddings)

with open('cbow_losses.csv', 'wt') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(cbow_losses)

The only difference from skip-gram is that the generated batch_data and batch_labels are different, so the arguments fed into loss differ as well.

7.9 Visualizing what CBOW has learned

7.9.1 Computing the t-SNE visualization of the embeddings with sklearn

num_points = 1000 # we will use a large sample space to build the T-SNE manifold and then prune it using cosine similarity

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)

print('Fitting embeddings to T-SNE. This can take some time ...')
# get the T-SNE manifold
selected_embeddings = cbow_final_embeddings[:num_points, :]  # take the first 1000 embedding vectors
two_d_embeddings = tsne.fit_transform(selected_embeddings)

print('Pruning the T-SNE embeddings')
# prune the embeddings by getting ones only more than n-many sample above the similarity threshold
# this unclutters the visualization
selected_ids = find_clustered_embeddings(selected_embeddings,.25,10)  # IDs of the clustered words
two_d_embeddings = two_d_embeddings[selected_ids,:]

print('Out of ',num_points,' samples, ', selected_ids.shape[0],' samples were selected by pruning')

7.9.2 Plotting the t-SNE result with matplotlib

def plot(embeddings, labels):
  
    n_clusters = 20 # number of clusters
    # automatically build a discrete set of colors, each for cluster
    label_colors = [pylab.cm.Spectral(float(i) /n_clusters) for i in range(n_clusters)]
  
    assert embeddings.shape[0] >= len(labels), 'More labels than embeddings'
  
    # Define K-Means
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', random_state=0).fit(embeddings)
    kmeans_labels = kmeans.labels_
  
    pylab.figure(figsize=(15,15))  # in inches
    
    # plot all the embeddings and their corresponding words
    for i, (label,klabel) in enumerate(zip(labels,kmeans_labels)):
        x, y = embeddings[i,:]
        pylab.scatter(x, y, c=label_colors[klabel])    

        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points',
                       ha='right', va='bottom',fontsize=10)

    # use for saving the figure if needed
    #pylab.savefig('word_embeddings.png')
    pylab.show()

words = [reverse_dictionary[i] for i in selected_ids]
plot(two_d_embeddings, words)

(Figure: t-SNE visualization of the CBOW embeddings)

8 References

[1] Thushan Ganegedara. Natural Language Processing with TensorFlow (TensorFlow自然语言处理)[M]. 北京: 机械工业出版社, 2019: 42-46.
[2] 野指针小李. Word2Vec原理与公式详细推导[EB/OL]. (2021-04-28)[2021-06-18]. https://blog.csdn.net/qq_35357274/article/details/116240180
[3] 野指针小李. Word2Vec之Hierarchical Softmax与Negative Sampling[EB/OL]. (2021-05-03)[2021-06-18]. https://blog.csdn.net/qq_35357274/article/details/116381205
[4] 101欢欢鱼. python——random.sample()的用法[EB/OL]. (2019-08-12)[2021-06-18]. https://www.cnblogs.com/fish-101/p/11339909.html
[5] TaoTao Yu. embedding_lookup的学习笔记[EB/OL]. (2019-08-04)[2021-06-18]. https://blog.csdn.net/hit0803107/article/details/98377030
[6] 阿常呓语. collections中 deque的使用[EB/OL]. (2018-06-21)[2021-06-18]. https://blog.csdn.net/u010339879/article/details/80767293
