In [13]:
import tf_glove
print('done')
done
In [4]:
#!ls -al
Instantiating the model
To create a new GloVe model, simply call tf_glove.GloVeModel():
In [14]:
model = tf_glove.GloVeModel(embedding_size=50, context_size=10, min_occurrences=25,
                            learning_rate=0.05, batch_size=512)
print('done')
done
GloVeModel() has several parameters:

- embedding_size: the target dimensionality of the trained word representations. Typically between 50 and 300.
- context_size: how many tokens on either side of a given word to include in each context window. Can be either a tuple of two ints, indicating how many tokens on the left and right to include, or a single int, which will be interpreted to mean a symmetric context.
- max_vocab_size (Optional): the maximum size of the model's vocabulary. The model's vocabulary will be the most frequently occurring words in the corpus, up to this amount. The default is 100,000.
- min_occurrences (Optional): the minimum number of times a word must have appeared in the corpus to be included in the model's vocabulary. Default is 1.
- scaling_factor (Optional): the alpha term in Eqn. 9 of Pennington et al.'s paper. Default is 3/4, which is the paper's recommendation.
- cooccurrence_cap (Optional): the x_max term in Eqn. 9 of Pennington et al.'s paper. Default is 100, which is the paper's recommendation.
- batch_size (Optional): the number of cooccurrences per minibatch in training. Default is 512, which seems to work well on my machine. If training is very slow, consider playing with this.
- learning_rate (Optional): the AdaGrad learning rate used in training. Default is 0.05, which is the paper's recommendation.
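To make the context_size semantics concrete, here's a small illustration in plain Python (not tf_glove internals) of the window a tuple value like (2, 3) would describe — the tokens here are made up for the example:

```python
# With context_size=(2, 3): 2 tokens of left context, 3 of right context.
tokens = ["this", "is", "a", "comment", "about", "glove", "."]
left, right = 2, 3

i = tokens.index("comment")                  # i == 3
left_context = tokens[max(0, i - left):i]    # ['is', 'a']
right_context = tokens[i + 1:i + 1 + right]  # ['about', 'glove', '.']
print(left_context, right_context)
```

A single int like context_size=10 would correspond to left == right == 10.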
Reading the corpus
tf_glove needs to be fit to a corpus in order to learn word representations. To do this, we'll use GloVeModel.fit_to_corpus(corpus).
This method expects an iterable of iterables of strings, where each string is a token, like this:
[["this", "is", "a", "comment", "."], ["this", "is", "another", "comment", "."]]
That was a list of lists, but any iterable of iterables of strings should work.
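For instance, a generator of token lists qualifies just as well. Here's a tiny sketch (the sentences and the naive whitespace tokenizer are made up for illustration):

```python
# Any iterable of iterables of token strings works as a corpus -- it does
# not have to be a list of lists. A generator lets you tokenize lazily.
def toy_corpus():
    comments = ["this is a comment .", "this is another comment ."]
    for comment in comments:
        # Naive whitespace tokenization, just for illustration
        yield comment.split()

# Materialize it here to inspect; fit_to_corpus would consume it directly.
tokenized = list(toy_corpus())
print(tokenized[0])  # ['this', 'is', 'a', 'comment', '.']
```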
Note on getting the dataset (if you want to follow along with these examples exactly)
For these examples, I'm going to use the dataset of Reddit comments described here: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
tf_glove is designed to work with any corpus, so there's no need to download this dataset. However, if you'd like to, that post has a link to a torrent for all of the comments as well as a link for just the comments from January 2015. Even just the January 2015 file is quite large (~5 GB).
I downloaded it and used
$ head -n 1000000 RC_2015-01 > /path/to/RC_2015-01-1m_sample
to get the 1 million comment sample file referenced below. You could also use 100k if you want to save some time. 1 million comments takes ~15 minutes to fit on my machine.
The code:
In [20]:
!ls -al
total 8
drwxr-xr-x 2 nbuser nbuser       0 Jan  1  1970 .
drwxr-xr-x 1 nbuser nbuser    4096 Aug 16 20:58 ..
-rw-r--r-- 1 nbuser nbuser    6590 Aug 16 20:51 1
-rw-r--r-- 1 nbuser nbuser 1009979 Aug 16 21:02 GettingStartedwithGloveImplement.ipynb
drwxr-xr-x 2 nbuser nbuser       0 Aug 15 19:16 .ipynb_checkpoints
-rw-r--r-- 1 nbuser nbuser 2106991 Aug 16 21:01 RC_2006-01
-rw-r--r-- 1 nbuser nbuser      94 Aug 15 19:17 README.md
-rw-r--r-- 1 nbuser nbuser  350093 Aug 16 20:56 reddit2005-1
-rw-r--r-- 1 nbuser nbuser   11214 Aug 15 19:19 tf_glove.py
-rw-r--r-- 1 nbuser nbuser   10868 Aug 15 19:20 tf_glove.pyc
In [21]:
import re
import nltk

def extract_reddit_comments(path):
    # A regex for extracting the comment body from one line of JSON (faster than parsing)
    body_snatcher = re.compile(r"\{.*?(?<!\\)\"body(?<!\\)\":(?<!\\)\"(.*?)(?<!\\)\".*}")
    with open(path) as file_:
        for line in file_:
            match = body_snatcher.match(line)
            if match:
                body = match.group(1)
                # Ignore deleted comments
                if not body == '[deleted]':
                    # Return the comment as a string (not yet tokenized)
                    yield body

def tokenize_comment(comment_str):
    # Use the excellent NLTK to tokenize the comment body
    #
    # Note that we're lower-casing the comments here. tf_glove is case-sensitive,
    # so if you want 'You' and 'you' to be considered the same word, be sure to lower-case everything.
    return nltk.wordpunct_tokenize(comment_str.lower())

def reddit_comment_corpus(path):
    # A generator that returns lists of tokens representing individual words in the comment
    return (tokenize_comment(comment) for comment in extract_reddit_comments(path))

# Replace the path with the path to your corpus file
corpus = reddit_comment_corpus("RC_2006-01")
In [16]:
type(corpus)
Out[16]:
generator
Now, to fit the model to the corpus:
In [22]:
model.fit_to_corpus(corpus)
Training the model
GloVeModel.fit_to_corpus() builds the vocabulary and cooccurrence matrix that will be used in training, but it doesn't actually train the word representations. It's time to kick off TensorFlow and train the model for real:
In [23]:
model.train(num_epochs=50, log_dir="log/example", summary_batch_interval=1000)
print('done training')
Writing TensorBoard summaries to log/example
done training
GloVeModel.train() has a few parameters:

- num_epochs: How many passes through the cooccurrence matrix the training should make. The paper recommends at least 50 for embedding_size < 300, and 100 otherwise.
- log_dir (Optional): The path of the directory in which to log summaries for TensorBoard and t-SNE visualizations. Default is None, i.e. don't log anything.
- summary_batch_interval (Optional): How many minibatches between logging events for TensorBoard. Default is 1000.
- tsne_epoch_interval (Optional): How many epochs (full passes through the cooccurrence matrix) between outputting a t-SNE visualization of the model's embeddings for the most frequent 1000 words in the vocabulary. Default is None, i.e. don't output t-SNE visualizations during training.
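As a rough sketch of how these knobs interact — the cooccurrence count below is hypothetical; yours depends on the corpus, context_size, and min_occurrences:

```python
import math

# Hypothetical: suppose fit_to_corpus produced 2,000,000 cooccurrence pairs.
cooccurrences = 2_000_000
batch_size = 512                 # from the GloVeModel instantiated above
num_epochs = 50
summary_batch_interval = 1000

# One epoch visits every cooccurrence once, in minibatches of batch_size.
batches_per_epoch = math.ceil(cooccurrences / batch_size)
total_batches = batches_per_epoch * num_epochs
# Roughly how many TensorBoard logging events training would produce.
summary_events = total_batches // summary_batch_interval

print(batches_per_epoch, total_batches, summary_events)  # 3907 195350 195
```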
Checking out the results
Now that we've trained the model, let's look at the results.
Use GloVeModel.embedding_for() to get the trained embedding for a single word:
In [24]:
model.embedding_for("reddit")
Out[24]:
array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)
You can also get the model's embeddings for every word in the vocabulary like this:
In [25]:
model.embeddings
Out[25]:
array([[ 1.8047712 ,  1.4226124 ,  0.81635815, ..., -0.7931991 ,
        -0.21225452, -1.1910183 ],
       [ 0.26826242, -0.2517047 ,  1.3441598 , ...,  0.22778615,
        -0.4677268 , -0.73932177],
       [ 0.18769039,  0.05553362,  1.0482045 , ...,  1.0999475 ,
        -0.9687896 , -1.0854678 ],
       ...,
       [-0.23779865, -0.2531533 , -0.5582521 , ..., -0.22597912,
         0.21925761,  0.11462189],
       [ 0.24282733,  0.44581434, -0.30409908, ...,  0.9912765 ,
         0.21773858,  1.3758885 ],
       [-0.5071667 , -0.53841263, -0.10033393, ...,  0.37520498,
        -0.70765084,  0.07273079]], dtype=float32)
GloVeModel.embeddings will give you a NumPy matrix where each row is the model's embedding for a single word.
To make use of this, you'll want to know which row corresponds to a particular word. You can do that with GloVeModel.id_for_word:
In [26]:
model.embeddings[model.id_for_word('reddit')]
Out[26]:
array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)
And if you want to see a 2D visualization of the learned vector space, you can use GloVeModel.generate_tsne():
In [27]:
%matplotlib inline
model.generate_tsne()
You might want to open that image in a new tab.
With no parameters, GloVeModel.generate_tsne() can be used interactively like in this notebook, but it also has parameters that will let you save the visualization to a file and adjust the size of the image and how many words appear:

- path (Optional): The path at which to save the generated PNG image. Default is None, which only really makes sense for interactive environments.
- size (Optional): A tuple of (width, height) in inches. (Yeah, I know, right? This is inherited from matplotlib.) Default is 100 x 100.
- word_count (Optional): How many words to plot in the visualization. Default is 1000, which works fairly well for a (100 x 100) visualization.