In [13]:
import tf_glove
print('done')
done
In [4]:
#!ls -al
Instantiating the model
To create a new GloVe model, simply call tf_glove.GloVeModel():
In [14]:
model = tf_glove.GloVeModel(embedding_size=50, context_size=10, min_occurrences=25,
                            learning_rate=0.05, batch_size=512)
print('done')
done
GloVeModel() has several parameters:

- embedding_size: the target dimensionality of the trained word representations. Typically between 50 and 300.
- context_size: how many tokens on either side of a given word to include in each context window. Can be either a tuple of two ints, indicating how many tokens on the left and right to include, or a single int, which will be interpreted to mean a symmetric context.
- max_vocab_size (Optional): the maximum size of the model's vocabulary. The model's vocabulary will be the most frequently occurring words in the corpus, up to this amount. The default is 100,000.
- min_occurrences (Optional): the minimum number of times a word must have appeared in the corpus to be included in the model's vocabulary. Default is 1.
- scaling_factor (Optional): the alpha term in Eqn. 9 of Pennington et al.'s paper. Default is 3/4, which is the paper's recommendation.
- cooccurrence_cap (Optional): the x_max term in Eqn. 9 of Pennington et al.'s paper. Default is 100, which is the paper's recommendation.
- batch_size (Optional): the number of cooccurrences per minibatch in training. Default is 512, which seems to work well on my machine. If training is very slow, consider playing with this.
- learning_rate (Optional): the AdaGrad learning rate used in training. Default is 0.05, which is the paper's recommendation.
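To make the context_size semantics concrete, here's a small illustration in plain Python (not tf_glove internals) of the window a tuple value like (2, 3) would describe — the tokens here are made up for the example:

```python
# With context_size=(2, 3): 2 tokens of left context, 3 of right context.
tokens = ["this", "is", "a", "comment", "about", "glove", "."]
left, right = 2, 3

i = tokens.index("comment")                  # i == 3
left_context = tokens[max(0, i - left):i]    # ['is', 'a']
right_context = tokens[i + 1:i + 1 + right]  # ['about', 'glove', '.']
print(left_context, right_context)
```

A single int like context_size=10 would correspond to left == right == 10.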
Reading the corpus
tf_glove needs to be fit to a corpus in order to learn word representations. To do this, we'll use GloVeModel.fit_to_corpus(corpus).
This method expects an iterable of iterables of strings, where each string is a token, like this:
[["this", "is", "a", "comment", "."], ["this", "is", "another", "comment", "."]]
That was a list of lists, but any iterable of iterables of strings should work.
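For instance, a generator of token lists qualifies just as well. Here's a tiny sketch (the sentences and the naive whitespace tokenizer are made up for illustration):

```python
# Any iterable of iterables of token strings works as a corpus -- it does
# not have to be a list of lists. A generator lets you tokenize lazily.
def toy_corpus():
    comments = ["this is a comment .", "this is another comment ."]
    for comment in comments:
        # Naive whitespace tokenization, just for illustration
        yield comment.split()

# Materialize it here to inspect; fit_to_corpus would consume it directly.
tokenized = list(toy_corpus())
print(tokenized[0])  # ['this', 'is', 'a', 'comment', '.']
```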
Note on getting the dataset (if you want to follow along with these examples exactly)
For these examples, I'm going to use the dataset of Reddit comments described here: https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment
tf_glove is designed to work with any corpus, so there's no need to download this dataset. However, if you'd like to, that post has a link to a torrent for all of the comments as well as a link for just the comments from January 2015. Even just the January 2015 file is quite large (~5 GB).
I downloaded it and used
$ head -n 1000000 RC_2015-01 > /path/to/RC_2015-01-1m_sample
to get the 1 million comment sample file referenced below. You could also use 100k if you want to save some time. 1 million comments takes ~15 minutes to fit on my machine.
The code:
In [20]:
!ls -al
total 8
drwxr-xr-x 2 nbuser nbuser       0 Jan  1  1970 .
drwxr-xr-x 1 nbuser nbuser    4096 Aug 16 20:58 ..
-rw-r--r-- 1 nbuser nbuser    6590 Aug 16 20:51 1
-rw-r--r-- 1 nbuser nbuser 1009979 Aug 16 21:02 GettingStartedwithGloveImplement.ipynb
drwxr-xr-x 2 nbuser nbuser       0 Aug 15 19:16 .ipynb_checkpoints
-rw-r--r-- 1 nbuser nbuser 2106991 Aug 16 21:01 RC_2006-01
-rw-r--r-- 1 nbuser nbuser      94 Aug 15 19:17 README.md
-rw-r--r-- 1 nbuser nbuser  350093 Aug 16 20:56 reddit2005-1
-rw-r--r-- 1 nbuser nbuser   11214 Aug 15 19:19 tf_glove.py
-rw-r--r-- 1 nbuser nbuser   10868 Aug 15 19:20 tf_glove.pyc
In [21]:
import re
import nltk

def extract_reddit_comments(path):
    # A regex for extracting the comment body from one line of JSON (faster than parsing)
    body_snatcher = re.compile(r"\{.*?(?<!\\)\"body(?<!\\)\":(?<!\\)\"(.*?)(?<!\\)\".*}")
    with open(path) as file_:
        for line in file_:
            match = body_snatcher.match(line)
            if match:
                body = match.group(1)
                # Ignore deleted comments
                if not body == '[deleted]':
                    # Return the comment as a string (not yet tokenized)
                    yield body

def tokenize_comment(comment_str):
    # Use the excellent NLTK to tokenize the comment body
    #
    # Note that we're lower-casing the comments here. tf_glove is case-sensitive,
    # so if you want 'You' and 'you' to be considered the same word, be sure to lower-case everything.
    return nltk.wordpunct_tokenize(comment_str.lower())

def reddit_comment_corpus(path):
    # A generator that returns lists of tokens representing individual words in the comment
    return (tokenize_comment(comment) for comment in extract_reddit_comments(path))

# Replace the path with the path to your corpus file
corpus = reddit_comment_corpus("RC_2006-01")
In [16]:
type(corpus)
Out[16]:
generator
Now, to fit the model to the corpus:
In [22]:
model.fit_to_corpus(corpus)
Training the model
GloVeModel.fit_to_corpus() builds the vocabulary and cooccurrence matrix that will be used in training, but it doesn't actually train the word representations. It's time to kick off TensorFlow and train the model for real:
In [23]:
model.train(num_epochs=50, log_dir="log/example", summary_batch_interval=1000)
print('done training')
Writing TensorBoard summaries to log/example
done training
GloVeModel.train() has a few parameters:

- num_epochs: How many passes through the cooccurrence matrix the training should make. The paper recommends at least 50 for embedding_size < 300, and 100 otherwise.
- log_dir (Optional): The path of the directory in which to log summaries for TensorBoard and t-SNE visualizations. Default is None, i.e. don't log anything.
- summary_batch_interval (Optional): How many minibatches between logging events for TensorBoard. Default is 1000.
- tsne_epoch_interval (Optional): How many epochs (full passes through the cooccurrence matrix) between outputting a t-SNE visualization of the model's embeddings for the most frequent 1000 words in the vocabulary. Default is None, i.e. don't output t-SNE visualizations during training.
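As a rough sketch of how these knobs interact — the cooccurrence count below is hypothetical; yours depends on the corpus, context_size, and min_occurrences:

```python
import math

# Hypothetical: suppose fit_to_corpus produced 2,000,000 cooccurrence pairs.
cooccurrences = 2_000_000
batch_size = 512                 # from the GloVeModel instantiated above
num_epochs = 50
summary_batch_interval = 1000

# One epoch visits every cooccurrence once, in minibatches of batch_size.
batches_per_epoch = math.ceil(cooccurrences / batch_size)
total_batches = batches_per_epoch * num_epochs
# Roughly how many TensorBoard logging events training would produce.
summary_events = total_batches // summary_batch_interval

print(batches_per_epoch, total_batches, summary_events)  # 3907 195350 195
```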
Checking out the results
Now that we've trained the model, let's look at the results.
Use GloVeModel.embedding_for() to get the trained embedding for a single word:
In [24]:
model.embedding_for("reddit")
Out[24]:
array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)
You can also get the model's embeddings for every word in the vocabulary like this:
In [25]:
model.embeddings
Out[25]:
array([[ 1.8047712 ,  1.4226124 ,  0.81635815, ..., -0.7931991 ,
        -0.21225452, -1.1910183 ],
       [ 0.26826242, -0.2517047 ,  1.3441598 , ...,  0.22778615,
        -0.4677268 , -0.73932177],
       [ 0.18769039,  0.05553362,  1.0482045 , ...,  1.0999475 ,
        -0.9687896 , -1.0854678 ],
       ...,
       [-0.23779865, -0.2531533 , -0.5582521 , ..., -0.22597912,
         0.21925761,  0.11462189],
       [ 0.24282733,  0.44581434, -0.30409908, ...,  0.9912765 ,
         0.21773858,  1.3758885 ],
       [-0.5071667 , -0.53841263, -0.10033393, ...,  0.37520498,
        -0.70765084,  0.07273079]], dtype=float32)
GloVeModel.embeddings will give you a NumPy matrix where each row is the model's embedding for a single word.
To make use of this, you'll want to know which row corresponds to a particular word. You can do that with GloVeModel.id_for_word:
In [26]:
model.embeddings[model.id_for_word('reddit')]
Out[26]:
array([-0.69618446,  0.20859328,  0.95094717, -0.6289609 ,  1.0386745 ,
        0.17344241,  0.9935566 ,  0.8475526 , -0.7601383 , -0.15705493,
       -0.13119277,  1.1211975 , -0.18929769, -0.14709277, -0.35697016,
       -0.26167196,  0.4804169 , -0.11896043, -0.14235029,  0.55361015,
       -0.31584463,  0.4823669 , -0.4825324 ,  0.54852855,  0.29927242,
       -0.46764416, -0.31799427, -0.84117365,  0.05095249,  1.105447  ,
        0.8208274 ,  0.1659146 , -0.25713712,  0.1382741 ,  0.33468646,
        0.7997937 ,  0.05455755, -0.19944909,  0.45362073,  0.09918478,
       -0.05857491, -0.03431866,  0.00754084,  0.19290435,  0.8423282 ,
       -0.0155649 ,  0.09942843, -0.2352441 , -0.31153223,  0.23810734],
      dtype=float32)
And if you want to see a 2D visualization of the learned vector space, you can use GloVeModel.generate_tsne():
In [27]:
%matplotlib inline
model.generate_tsne()
You might want to open that image in a new tab.
With no parameters, GloVeModel.generate_tsne() can be used interactively like in this notebook, but it also has parameters that will let you save the visualization to a file and adjust the size of the image and how many words appear:

- path (Optional): The path at which to save the generated PNG image. Default is None, which only really makes sense for interactive environments.
- size (Optional): A tuple of (width, height) in inches. (Yeah, I know, right? This is inherited from matplotlib.) Default is 100 x 100.
- word_count (Optional): How many words to plot in the visualization. Default is 1000, which works fairly well for a (100 x 100) visualization.