Deep learning with word2vec and gensim

Neural networks have been a bit of a punching bag historically: neither particularly fast, nor robust or accurate, nor open to introspection by humans curious to gain insights from them. But things have been changing lately, with deep learning becoming a hot topic in academia and delivering spectacular results. I decided to check out one deep learning algorithm via gensim.

Word2vec: the good, the bad (and the fast)

The kind folks at Google have recently published several new unsupervised deep learning algorithms in this article.

Selling point: “Our model can answer the query ‘give me a word like king, like woman, but unlike man’ with ‘queen’.” Pretty cool.

Not only do these algorithms boast great performance, accuracy and a theoretically-not-so-well-founded-but-pragmatically-superior model (all three solid plusses in my book), but they were also devised by my fellow country and county-man, Tomáš Mikolov from Brno! The googlers have also released an open source implementation of these algorithms (https://code.google.com/p/word2vec/), which always helps with uptake of fresh academic ideas. Brilliant.

Although, in the words of word2vec’s authors, the toolkit is meant for “research purposes”, it’s actually optimized C, down to cache alignments, memory look-up tables, static memory allocations and a penchant for single-letter variable names. Somebody obviously spent time profiling this, which is good news for people running it, and bad news for people wanting to understand it, extend it or integrate it (as researchers are wont to do).

In short, the spirit of word2vec fits gensim’s tagline of topic modelling for humans, but the actual code doesn’t, tight and beautiful as it is. I therefore decided to reimplement word2vec in gensim, starting with the hierarchical softmax skip-gram model, because that’s the one with the best reported accuracy. I reimplemented it from scratch, de-obfuscating word2vec into a less menial state. No need for a custom implementation of hashing, lists, dicts, random number generators… all of these come built-in with Python.

Free, fast, pretty — pick any two. As the ratio of clever code to comments shrank and shrank (down to ~100 Python lines, with 40% of them comments), so did the performance. About 1000x. Yuck. I rewrote the explicit Python loops in NumPy, speeding things up ~50x (yay), but that means it’s still ~20x slower than the original (ouch). I could optimize it further, using Cython and whatnot, but that would lead back to obfuscation, defeating the purpose of this exercise. I may still do it anyway, for selected hotspots. EDIT: Done, see Part II: Optimizing word2vec in Python — performance of the Python port is now on par with the C code, and sometimes even faster.
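To give a flavour of what that vectorized code looks like, here is a simplified sketch of the hierarchical softmax skip-gram update for a single (input word, output word) pair. The names syn0/syn1 follow the original C code; this is an illustrative sketch, not the exact gensim implementation:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair_hs(syn0, syn1, word_idx, path_points, path_codes, alpha):
    # syn0: input word vectors, shape (vocab_size, dim)
    # syn1: vectors of the inner Huffman-tree nodes, shape (vocab_size - 1, dim)
    # word_idx: row of syn0 being trained (the input word)
    # path_points: indices of the inner nodes on the output word's Huffman path
    # path_codes: the output word's Huffman code, one 0/1 bit per inner node
    # alpha: learning rate
    l1 = syn0[word_idx]                              # (dim,)
    l2 = syn1[path_points]                           # (path_len, dim) -- all path nodes at once
    f = sigmoid(np.dot(l2, l1))                      # predicted probability at each inner node
    g = (1.0 - np.asarray(path_codes) - f) * alpha   # error, scaled by the learning rate
    neu1e = np.dot(g, l2)                            # error to propagate back to the input vector
    syn1[path_points] += np.outer(g, l1)             # update the inner-node vectors
    syn0[word_idx] += neu1e                          # update the input word vector

The whole Huffman path is handled in a handful of matrix operations, which is where the ~50x speed-up over plain Python loops comes from.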

For now, the code lives in a git branch, to be merged into gensim proper once I’m happy with its functionality and performance. In the meanwhile, the gensim version is already good enough to be unleashed on reasonably-sized corpora, taking on natural language processing tasks “the Python way”.

So, what can it do?

Distributional semantics goodness; see here and the original article for more background. Basically, the algorithm takes some unstructured text and learns “features” about each word. The neat thing is (apart from it learning the features completely automatically, without any human input/supervision!) that these features capture different relationships — both semantic and syntactic. This allows some (very basic) algebraic operations, like the above-mentioned “king-man+woman=queen”. More concretely:


>>> # import modules and set up logging
>>> from gensim.models import word2vec
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> # load up unzipped corpus from http://mattmahoney.net/dc/text8.zip
>>> sentences = word2vec.Text8Corpus('/tmp/text8')
>>> # train the skip-gram model; default window=5
>>> model = word2vec.Word2Vec(sentences, size=200)
>>> # ... and some hours later... just as advertised...
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.5359965)]

>>> # pickle the entire model to disk, so we can load & resume training later
>>> model.save('/tmp/text8.model')
>>> # store the learned weights, in a format the original C tool understands
>>> model.save_word2vec_format('/tmp/text8.model.bin', binary=True)
>>> # or, import word weights created by the (faster) C word2vec
>>> # this way, you can switch between the C/Python toolkits easily
>>> model = word2vec.Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)

>>> # "boy" is to "father" as "girl" is to ...?
>>> model.most_similar(['girl', 'father'], ['boy'], topn=3)
[('mother', 0.61849487), ('wife', 0.57972813), ('daughter', 0.56296098)]
>>> more_examples = ["he his she", "big bigger bad", "going went being"]
>>> for example in more_examples:
...     a, b, x = example.split()
...     predicted = model.most_similar([x, b], [a])[0][0]
...     print "'%s' is to '%s' as '%s' is to '%s'" % (a, b, x, predicted)
'he' is to 'his' as 'she' is to 'her'
'big' is to 'bigger' as 'bad' is to 'worse'
'going' is to 'went' as 'being' is to 'was'

>>> # which word doesn't go with the others?
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'


This already beats the English of some of my friends :-)
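Beyond the analogy queries above, the trained model also exposes pairwise similarities and the raw learned vectors. A minimal sketch (outputs omitted, since the exact numbers vary from run to run):

>>> # cosine similarity between two words
>>> model.similarity('woman', 'man')
>>> # the raw learned vector for a single word: a NumPy array of length `size`
>>> model['computer']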

Python, sweet home

Having deep learning available in Python allows us to plug in the multitude of NLP tools written in Python. More intelligent tokenization/sentence splitting, named entity recognition? Just use NLTK. Web crawling, lemmatization? Try pattern. Removing boilerplate HTML and extracting meaningful, plain text? jusText. Continue the learning pipeline with k-means or other machine learning algos? Scikit-learn has loads.
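For instance, a tokenized corpus can be streamed straight into training, sentence by sentence. A minimal sketch, assuming a plain-text corpus at a made-up path and using NLTK for sentence splitting and tokenization (the class name and file path are purely illustrative):

import nltk
from gensim.models import word2vec

class TokenizedCorpus(object):
    """Yield one NLTK-tokenized sentence (a list of lowercase tokens) at a time."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # requires NLTK's 'punkt' models: nltk.download('punkt')
        with open(self.path) as fin:
            for sentence in nltk.sent_tokenize(fin.read()):
                yield [word.lower() for word in nltk.word_tokenize(sentence)]

# Word2Vec iterates over the corpus more than once (vocabulary scan + training),
# so we pass a restartable iterable rather than a one-shot generator.
sentences = TokenizedCorpus('/tmp/my_corpus.txt')  # hypothetical input file
model = word2vec.Word2Vec(sentences, size=200)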

Needless to say, better integration with gensim is also under way.

Part II: Optimizing word2vec in Python

