Paper reading summary for From Word Embeddings to Document Distance

Paper Link: http://proceedings.mlr.press/v37/kusnerb15.html

This paper introduces a new way of computing the distance between two text documents: the Word Mover's Distance (WMD). The formulation reduces to a well-studied problem for which very efficient solvers already exist. The method has no hyperparameters, is easy to implement, and achieves low classification error rates.

The authors build on a word representation called word2vec, a word embedding procedure that uses a shallow neural network to maximize the log probability of neighboring words. As a result, the distance between two words with similar meanings is small. The WMD between two documents is then defined as the minimum total distance that the words of one document need to "travel" to reach the words of the other. This metric is an instance of the Earth Mover's Distance (EMD), a well-studied transportation problem.
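
As a minimal sketch of this idea (not the paper's implementation), the transportation problem can be written as a linear program and solved directly: each word in document 1 carries its normalized bag-of-words weight, and flow is moved to the words of document 2 at a cost equal to the Euclidean distance between their embeddings. The function name `wmd` and the use of `scipy.optimize.linprog` are my choices here, assuming toy embedding matrices rather than real word2vec vectors.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X1, w1, X2, w2):
    """Word Mover's Distance as a transportation LP (sketch).

    X1: (n, d) embeddings of doc-1 words, w1: their normalized nBOW weights.
    X2: (m, d) embeddings of doc-2 words, w2: their normalized nBOW weights.
    """
    n, m = len(w1), len(w2)
    # Cost c(i, j) = Euclidean distance between word embeddings.
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    # Equality constraints on the flow matrix T (flattened to a vector):
    # each row of T must sum to w1[i], each column to w2[j].
    A_eq = []
    for i in range(n):
        row = np.zeros((n, m)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(m):
        col = np.zeros((n, m)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([w1, w2]), bounds=(0, None))
    return res.fun

# With one word per document, WMD is just the embedding distance:
d = wmd(np.array([[0., 0.]]), np.array([1.]),
        np.array([[3., 4.]]), np.array([1.]))  # 5.0
```

Dedicated EMD solvers are much faster than a generic LP solver; this sketch only shows the structure of the optimization.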

One traditional way of representing documents is Bag of Words (BOW). But it fails to capture the similarity between two sentences when they share no common words. One example the authors give is "Obama speaks to the media in Illinois" and "The President greets the press in Chicago". These two sentences have essentially the same meaning but do not share a single content word, yet WMD is able to capture this closeness.
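
To make the BOW failure concrete, here is a small sketch (my own toy stopword list, not the paper's preprocessing) showing that the two example sentences have zero vocabulary overlap after stopword removal, so any BOW-based similarity between them is zero:

```python
# Toy demonstration: the paper's two example sentences share no content words.
doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()

stopwords = {"to", "the", "in"}  # assumed minimal stopword list
bow1 = {w for w in doc1 if w not in stopwords}
bow2 = {w for w in doc2 if w not in stopwords}

overlap = bow1 & bow2  # empty: BOW sees these sentences as unrelated
```

WMD instead pairs up semantically close words (Obama/President, media/press, Illinois/Chicago), whose word2vec embeddings lie near each other.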
Diagram from the paper

The objective function also has a lower bound that can be calculated quickly. This is very useful for early termination and pruning, which speed up the search for the k-nearest neighbors.

The authors then use WMD for kNN classification on 8 document data sets and compare the results against 7 baselines. WMD has the lowest error on average. They also test different word embedding mechanisms and find that word2vec models perform best.

This new model gives more accurate results when searching for similar text documents, and it achieves this by capturing the meaning behind each sentence rather than simply comparing the words or word frequencies of the documents.
