word2vec Python interface: installation and usage

https://github.com/danielfrg/word2vec

Installation

I recommend the Anaconda Python distribution.

pip install word2vec

Wheel: Wheel packages for OS X and Windows are provided on PyPI on a best-effort basis. The code is quite easy to compile, so consider using --no-use-wheel on Linux and OS X.

Linux: There is no wheel support for Linux, so you have to compile the C code. The only requirement is gcc. You can override the compilation flags if needed: CFLAGS='-march=corei7' pip install word2vec

Windows: Very experimental support, based on this win32 port.
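A quick way to confirm the package installed correctly (a minimal sanity check; nothing is assumed beyond the import itself):

import word2vec

# Show where the package was installed.
print(word2vec.__file__)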

%load_ext autoreload
%autoreload 2

word2vec

This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.

Training

Download some data, for example: http://mattmahoney.net/dc/text8.zip
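If you prefer to fetch it from Python, here is a minimal sketch using only the standard library (the Downloads destination just mirrors the paths used in the rest of this walkthrough; adjust as needed):

import os
import urllib.request
import zipfile

url = 'http://mattmahoney.net/dc/text8.zip'
dest_dir = os.path.expanduser('~/Downloads')
zip_path = os.path.join(dest_dir, 'text8.zip')

# Download the archive and extract it, producing a plain text file named "text8".
urllib.request.urlretrieve(url, zip_path)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(dest_dir)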

In [2]:
import word2vec

Run word2phrase to group related words, e.g. "Los Angeles" becomes "Los_Angeles".

In [3]:
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)
[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K     Vocab size: 4399K  
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206

This will create a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
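For example, skipping word2phrase just means pointing the same training call at the raw file (a sketch; the parameters mirror the cell below):

# Train directly on the raw text8 file, without phrase grouping.
word2vec.word2vec('/Users/drodriguez/Downloads/text8',
                  '/Users/drodriguez/Downloads/text8.bin',
                  size=100, verbose=True)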

Train the model using the word2phrase output.

In [4]:
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 286.52k  

That generated a text8.bin file containing the word vectors in a binary format.

Do the clustering of the vectors based on the trained model.

In [5]:
word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 287.55k  

That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
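The output is a plain text file; here is a quick peek at its first few lines (assuming the usual word-and-cluster-number-per-line layout written by the original C tool):

# Print the first few lines of the clusters file to inspect its layout.
with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for _ in range(5):
        print(f.readline().rstrip())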

Predictions

In [1]:
import word2vec

Import the word2vec binary file created above

In [2]:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')

We can take a look at the vocabulary as a numpy array

In [3]:
model.vocab
Out[3]:
array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], 
      dtype='<U78')

Or take a look at the whole matrix

In [4]:
model.vectors.shape
Out[4]:
(98331, 100)
In [5]:
model.vectors
Out[5]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,
         0.10955409,  0.00693387],
       [ 0.1220774 ,  0.04939618,  0.09545057, ..., -0.00804222,
        -0.05441621, -0.10076696],
       [ 0.16844609,  0.03734054,  0.22085373, ...,  0.05854521,
         0.04685341,  0.02546694],
       ..., 
       [-0.06760896,  0.03737842,  0.09344187, ...,  0.14559349,
        -0.11704484, -0.05246212],
       [ 0.02228479, -0.07340827,  0.15247506, ...,  0.01872172,
        -0.18154132, -0.06813737],
       [ 0.02778879, -0.06457976,  0.07102411, ..., -0.00270281,
        -0.0471223 , -0.135444  ]])

We can retrieve the vector of individual words

In [6]:
model['dog'].shape
Out[6]:
(100,)
In [7]:
model['dog'][:10]
Out[7]:
array([ 0.05753701,  0.0585594 ,  0.11341395,  0.02016246,  0.11514406,
        0.01246986,  0.00801256,  0.17529851,  0.02899276,  0.0203866 ])

We can do simple queries to retrieve words similar to "socks" based on cosine similarity:

In [8]:
indexes, metrics = model.cosine('socks')
indexes, metrics
Out[8]:
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]),
 array([ 0.8375354 ,  0.83590846,  0.82818749,  0.82533614,  0.82278399,
         0.81476386,  0.8139092 ,  0.81253798,  0.8105933 ,  0.80850171]))

This returned a tuple with two items:

  1. a numpy array with the indexes of the similar words in the vocabulary
  2. a numpy array with the cosine similarity to each word
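As a sanity check, the metric for the top match can be recomputed by hand from the stored vectors (a sketch; it simply recomputes the cosine with numpy, so the value should be close to metrics[0]):

import numpy as np

# Cosine similarity between "socks" and its closest neighbour, computed manually.
v1 = model['socks']
v2 = model.vectors[indexes[0]]
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))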

It's possible to get the words for those indexes

In [9]:
model.vocab[indexes]
Out[9]:
array([u'hairy', u'pumpkin', u'gravy', u'nosed', u'plum', u'winged',
       u'bock', u'petals', u'biscuits', u'striped'], 
      dtype='<U78')

There is a helper function to create a combined response: a numpy record array

In [10]:
model.generate_response(indexes, metrics)
Out[10]:
rec.array([(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809),
       (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071),
       (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592),
       (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767),
       (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)], 
      dtype=[(u'word', '<U78'), (u'metric', '<f8')])

It's easy to make that numpy array a pure Python response:

In [11]:
model.generate_response(indexes, metrics).tolist()
Out[11]:
[(u'hairy', 0.8375353970603848),
 (u'pumpkin', 0.8359084628493809),
 (u'gravy', 0.8281874915608026),
 (u'nosed', 0.8253361379785071),
 (u'plum', 0.8227839904046932),
 (u'winged', 0.8147638561412592),
 (u'bock', 0.8139092031538545),
 (u'petals', 0.8125379796045767),
 (u'biscuits', 0.8105933044655644),
 (u'striped', 0.8085017054444408)]

Phrases

Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases":

In [12]:
indexes, metrics = model.cosine('los_angeles')
model.generate_response(indexes, metrics).tolist()
Out[12]:
[(u'san_francisco', 0.886558000570455),
 (u'san_diego', 0.8731961018831669),
 (u'seattle', 0.8455603712285231),
 (u'las_vegas', 0.8407843553947962),
 (u'miami', 0.8341796009062884),
 (u'detroit', 0.8235412519780195),
 (u'cincinnati', 0.8199138493085706),
 (u'st_louis', 0.8160655356728751),
 (u'chicago', 0.8156786240847214),
 (u'california', 0.8154244925085712)]

Analogies

It's possible to do more complex queries like analogies, such as: king - man + woman = queen. This method returns the same as cosine: the indexes of the words in the vocabulary and the metric.

In [13]:
indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
indexes, metrics
Out[13]:
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826,  648, 1426]),
 array([ 0.2917969 ,  0.27353295,  0.26877692,  0.26596514,  0.26487509,
         0.26428581,  0.26315492,  0.26261258,  0.26136635,  0.26099078]))
In [14]:
model.generate_response(indexes, metrics).tolist()
Out[14]:
[(u'queen', 0.2917968955611075),
 (u'prince', 0.27353295205311695),
 (u'empress', 0.2687769174818083),
 (u'monarch', 0.2659651399832089),
 (u'regent', 0.26487508713026797),
 (u'wife', 0.2642858109968327),
 (u'aragon', 0.2631549214361766),
 (u'throne', 0.26261257728511833),
 (u'emperor', 0.2613663460665488),
 (u'bishop', 0.26099078142148696)]
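Under the hood this is the usual additive word-vector analogy; a rough numpy sketch of the idea follows (the library's exact scoring and filtering of the query words may differ, so the numbers will not match Out[13] exactly, and the query words themselves may appear near the top):

import numpy as np

# king - man + woman, then rank the whole vocabulary by cosine similarity
# to that target vector.
target = model['king'] - model['man'] + model['woman']
target = target / np.linalg.norm(target)

norms = np.linalg.norm(model.vectors, axis=1)
scores = model.vectors.dot(target) / norms

top = np.argsort(-scores)[:10]
print(model.vocab[top])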

Clusters

In [15]:
clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')

We can get the cluster number for individual words

In [16]:
clusters['dog']
Out[16]:
11

We can get all the words grouped in a specific cluster

In [17]:
clusters.get_words_on_cluster(90).shape
Out[17]:
(221,)
In [18]:
clusters.get_words_on_cluster(90)[:10]
Out[18]:
array(['along', 'together', 'associated', 'relationship', 'deal',
       'combined', 'contact', 'connection', 'bond', 'respect'], dtype=object)

We can add the clusters to the word2vec model and generate a response that includes the clusters

In [19]:
model.clusters = clusters
In [20]:
indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)
In [21]:
model.generate_response(indexes, metrics).tolist()
Out[21]:
[(u'berlin', 0.32333651414395953, 20),
 (u'munich', 0.28851564633559, 20),
 (u'vienna', 0.2768927258877336, 12),
 (u'leipzig', 0.2690537010929304, 91),
 (u'moscow', 0.26531859560322785, 74),
 (u'st_petersburg', 0.259534503067277, 61),
 (u'prague', 0.25000637367753303, 72),
 (u'dresden', 0.2495974800117785, 71),
 (u'bonn', 0.24403155303236473, 8),
 (u'frankfurt', 0.24199720792200027, 31)]