Paper reading summary for From Word Embeddings to Document Distance

Paper Link: http://proceedings.mlr.press/v37/kusnerb15.html

This paper introduces a new way of computing the distance between two text documents: the Word Mover's Distance (WMD). The formulation reduces to a well-studied problem for which very efficient solvers already exist. The method has no hyperparameters, is easy to implement, and achieves low classification error rates.

The authors build on a word representation called word2vec, a word embedding procedure that uses a shallow neural network to maximize the log probability of neighboring words. As a result, the distance between two words with similar meanings is small. The WMD between two documents is then defined as the minimum total distance that the words of one document need to "travel" to reach the words of the other. This metric is an instance of the Earth Mover's Distance (EMD), a well-studied transportation problem.
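
As a minimal sketch of this idea (not the paper's implementation), the transportation problem can be written as a linear program and solved directly: each word in document 1 carries its normalized bag-of-words weight, and flow is moved to the words of document 2 at a cost equal to the Euclidean distance between their embeddings. The function name `wmd` and the use of `scipy.optimize.linprog` are my choices here, assuming toy embedding matrices rather than real word2vec vectors.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X1, w1, X2, w2):
    """Word Mover's Distance as a transportation LP (sketch).

    X1: (n, d) embeddings of doc-1 words, w1: their normalized nBOW weights.
    X2: (m, d) embeddings of doc-2 words, w2: their normalized nBOW weights.
    """
    n, m = len(w1), len(w2)
    # Cost c(i, j) = Euclidean distance between word embeddings.
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    # Equality constraints on the flow matrix T (flattened to a vector):
    # each row of T must sum to w1[i], each column to w2[j].
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([w1, w2]), bounds=(0, None))
    return res.fun

# With one word per document, WMD is just the embedding distance:
d = wmd(np.array([[0., 0.]]), np.array([1.]),
        np.array([[3., 4.]]), np.array([1.]))  # 5.0
```

Dedicated EMD solvers are much faster than a generic LP solver; this sketch only shows the structure of the optimization.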

One traditional way of representing documents is Bag of Words (BOW). But it fails to capture the similarity between two sentences when they share no common words. One example the authors give is "Obama speaks to the media in Illinois" and "The President greets the press in Chicago". These two sentences have essentially the same meaning but do not share a single content word, yet WMD is able to capture this closeness.
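
To make the BOW failure concrete, here is a small sketch (my own toy stopword list, not the paper's preprocessing) showing that the two example sentences have zero vocabulary overlap after stopword removal, so any BOW-based similarity between them is zero:

```python
# Toy demonstration: the paper's two example sentences share no content words.
doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

stopwords = {"to", "the", "in"}  # assumed minimal stopword list
bow1 = {w for w in doc1 if w not in stopwords}
bow2 = {w for w in doc2 if w not in stopwords}

overlap = bow1 & bow2  # empty: BOW sees these sentences as unrelated
```

WMD instead pairs up semantically close words (Obama/President, media/press, Illinois/Chicago), whose word2vec embeddings lie near each other.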
Diagram from the paper

The objective function also has a lower bound that can be calculated quickly. This is very useful for early termination and pruning, which speed up the search for the k-nearest neighbors.

The authors then use WMD for kNN classification on 8 document data sets and compare the results against 7 baselines. WMD has the lowest error on average. They also test different word embedding mechanisms and find that word2vec models perform best.

This new model gives more accurate results when searching for similar text documents, and it achieves this by capturing the meaning behind each sentence rather than simply comparing the words or word frequencies of the documents.
