Reproducing the GloVe Paper Code

In [13]:

import tf_glove
print('done')
done

In [4]:

#!ls -al

Instantiating the model

To create a new GloVe model, simply call tf_glove.GloVeModel():

In [14]:

model = tf_glove.GloVeModel(embedding_size=50, context_size=10, min_occurrences=25,
                            learning_rate=0.05, batch_size=512)
print('done')
done

GloVeModel() has several parameters:

  • embedding_size: the target dimensionality of the trained word representations. Typically between 50 and 300.
  • context_size: how many tokens on either side of a given word to include in each context window. Can be either a tuple of two ints, indicating how many tokens to include on the left and right respectively, or a single int, which is interpreted as a symmetric context window (see the sketch after this list for the tuple form).
  • max_vocab_size (Optional): the maximum size of the model's vocabulary. The model's vocabulary will be the most frequently occurring words in the corpus up to this amount. The default is 100,000.
  • min_occurrences (Optional): the minimum number of times a word must have appeared in the corpus to be included in the model's vocabulary. Default is 1.
  • scaling_factor (Optional): the alpha term in Eqn. 9 of Pennington et al.'s paper. Default is 3/4, which is the paper's recommendation
  • cooccurrence_cap (Optional): the x_max term in Eqn. 9 of Pennington et al.'s paper. Default is 100, which is the paper's recommendation
  • batch_size (Optional): the number of cooccurrences per minibatch in training. Default is 512, which seems to work well on my machine. If training is very slow, consider playing with this.
  • learning_rate (Optional): the Adagrad learning rate used in training. Default is 0.05, which is the paper's recommendation
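
For a quick illustration of the tuple form of context_size and the optional parameters above, here's a minimal sketch (the values are just illustrative, not recommendations):

# A sketch: an asymmetric context window of 5 tokens to the left and 3 to the
# right, with the optional parameters spelled out explicitly.
model = tf_glove.GloVeModel(embedding_size=50,
                            context_size=(5, 3),     # (left, right) instead of a single int
                            max_vocab_size=100000,
                            min_occurrences=25,
                            scaling_factor=0.75,     # alpha in Eqn. 9
                            cooccurrence_cap=100,    # x_max in Eqn. 9
                            batch_size=512,
                            learning_rate=0.05)

# For reference, scaling_factor and cooccurrence_cap parameterize the
# weighting function in Eqn. 9 of the paper:
#   f(x) = (x / x_max) ** alpha   if x < x_max,  else 1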

Reading the corpus

tf_glove needs to be fit to a corpus in order to learn word representations. To do this, we'll use GloVeModel.fit_to_corpus(corpus).

This method expects an iterable of iterables of strings, where each string is a token, like this:

[["this", "is", "a", "comment", "."], ["this", "is", "another", "comment", "."]]

That was a list of lists, but any iterable of iterables of strings should work.

Note on getting the dataset (if you want to follow along with these examples exactly)

For these examples, I'm going to use the dataset of Reddit comments described here: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment

tf_glove is designed to work with any corpus, so there's no need to download this dataset. However, if you'd like to, that post has a link to a torrent for all of the comments as well as a link for just the comments from January 2015. Even just the January 2015 file is quite large (~5 GB).

I downloaded it and used

$ head -n 1000000 RC_2015-01 > /path/to/RC_2015-01-1m_sample

to get the 1 million comment sample file referenced below. You could also take a 100k sample if you want to save some time. Fitting 1 million comments takes ~15 minutes on my machine.

The code:

In [20]:

!ls -al
total 8
drwxr-xr-x 2 nbuser nbuser       0 Jan  1  1970 .
drwxr-xr-x 1 nbuser nbuser    4096 Aug 16 20:58 ..
-rw-r--r-- 1 nbuser nbuser    6590 Aug 16 20:51 1
-rw-r--r-- 1 nbuser nbuser 1009979 Aug 16 21:02 GettingStartedwithGloveImplement.ipynb
drwxr-xr-x 2 nbuser nbuser       0 Aug 15 19:16 .ipynb_checkpoints
-rw-r--r-- 1 nbuser nbuser 2106991 Aug 16 21:01 RC_2006-01
-rw-r--r-- 1 nbuser nbuser      94 Aug 15 19:17 README.md
-rw-r--r-- 1 nbuser nbuser  350093 Aug 16 20:56 reddit2005-1
-rw-r--r-- 1 nbuser nbuser   11214 Aug 15 19:19 tf_glove.py
-rw-r--r-- 1 nbuser nbuser   10868 Aug 15 19:20 tf_glove.pyc

In [21]:

import re
import nltk

def extract_reddit_comments(path):
    # A regex for extracting the comment body from one line of JSON (faster than parsing)
    body_snatcher = re.compile(r"\{.*?(?<!\\)\"body(?<!\\)\":(?<!\\)\"(.*?)(?<!\\)\".*}")
    with open(path) as file_:
        for line in file_:
            match = body_snatcher.match(line)
            if match:
                body = match.group(1)
                # Ignore deleted comments
                if body != '[deleted]':
                    # Return the comment as a string (not yet tokenized)
                    yield body
                        
def tokenize_comment(comment_str):
    # Use the excellent NLTK to tokenize the comment body
    #
    # Note that we're lower-casing the comments here. tf_glove is case-sensitive,
    # so if you want 'You' and 'you' to be considered the same word, be sure to lower-case everything.
    return nltk.wordpunct_tokenize(comment_str.lower())

def reddit_comment_corpus(path):
    # A generator that returns lists of tokens representing individual words in the comment
    return (tokenize_comment(comment) for comment in extract_reddit_comments(path))

# Replace the path with the path to your corpus file
corpus = reddit_comment_corpus("RC_2006-01")

In [16]:

type(corpus)

Out[16]:

generator

Now, to fit the model to the corpus:

In [22]:

model.fit_to_corpus(corpus)

Training the model

GloVeModel.fit_to_corpus() builds the vocabulary and cooccurrence matrix that will be used in training, but it doesn't actually train the word representations. It's time to kick off TensorFlow and train the model for real:

In [23]:

model.train(num_epochs=50, log_dir="log/example", summary_batch_interval=1000)
print('done training')
Writing TensorBoard summaries to log/example
done training

GloVeModel.train() has a few parameters:

  • num_epochs: How many passes through the cooccurrence matrix the training should make. The paper recommends at least 50 for embedding_size < 300, and 100 otherwise.
  • log_dir (Optional): The path of the directory in which to log summaries for TensorBoard and t-SNE visualizations. Default is None, i.e. don't log anything.
  • summary_batch_interval (Optional): How many minibatches between logging events for TensorBoard. Default is 1000.
  • tsne_epoch_interval (Optional): How many epochs (full passes through the cooccurrence matrix) between outputting a t-SNE visualization of the model's embeddings for the 1000 most frequent words in the vocabulary. Default is None, i.e. don't output t-SNE visualizations during training (see the sketch after this list).
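
For a hedged sketch of these parameters used together (the log directory and interval values here are just illustrative):

# A sketch: train for 100 epochs, write TensorBoard summaries every 1000
# minibatches, and output a t-SNE visualization every 10 epochs.
model.train(num_epochs=100,
            log_dir="log/example",
            summary_batch_interval=1000,
            tsne_epoch_interval=10)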

Checking out the results

Now that we've trained the model, let's look at the results.

Use GloVeModel.embedding_for() to get the trained embedding for a single word:

In [24]:

model.embedding_for("reddit")

Out[24]:

array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)

You can also get the model's embeddings for every word in the vocabulary like this:

In [25]:

model.embeddings

Out[25]:

array([[ 1.8047712 ,  1.4226124 ,  0.81635815, ..., -0.7931991 ,
        -0.21225452, -1.1910183 ],
       [ 0.26826242, -0.2517047 ,  1.3441598 , ...,  0.22778615,
        -0.4677268 , -0.73932177],
       [ 0.18769039,  0.05553362,  1.0482045 , ...,  1.0999475 ,
        -0.9687896 , -1.0854678 ],
       ...,
       [-0.23779865, -0.2531533 , -0.5582521 , ..., -0.22597912,
         0.21925761,  0.11462189],
       [ 0.24282733,  0.44581434, -0.30409908, ...,  0.9912765 ,
         0.21773858,  1.3758885 ],
       [-0.5071667 , -0.53841263, -0.10033393, ...,  0.37520498,
        -0.70765084,  0.07273079]], dtype=float32)

GloVeModel.embeddings will give you a NumPy matrix where each row is the model's embedding for a single word.

To make use of this, you'll want to know what row corresponds to a particular word. You can do that with GloVeModel.id_for_word:

In [26]:

model.embeddings[model.id_for_word('reddit')]

Out[26]:

array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)

And if you want to see a 2D visualization of the learned vector space, you can use GloVeModel.generate_tsne():

In [27]:

%matplotlib inline
model.generate_tsne()

You might want to open that image in a new tab.

Called with no parameters, GloVeModel.generate_tsne() works interactively, as in this notebook, but it also has parameters that let you save the visualization to a file and adjust the image size and how many words appear (see the sketch after this list):

  • path (Optional): The path at which to save the generated PNG image. Default is None, which only really makes sense for interactive environments.
  • size (Optional): A tuple of (width, height) in inches. (Yeah, I know right? This is inherited from matplotlib.) Default is 100 x 100.
  • word_count (Optional): How many words to plot in the visualization. Default is 1000, which works fairly well for a (100 x 100) visualization.
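
Here's a minimal sketch using these parameters (the path and size values are just illustrative):

# A sketch: save a smaller t-SNE plot of the 500 most frequent words to a PNG file.
model.generate_tsne(path="log/example/tsne.png", size=(40, 40), word_count=500)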