English Stemming Algorithm (Porter stemmer)

whuslei

Article link: https://blog.csdn.net/whuslei/article/details/7398443

Preface

I recently needed to process English text and wanted to reduce each word to its base form, for example turning boys into boy.

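Introduction

I found a good tool, the Porter stemmer, whose home page is http://tartarus.org/~martin/PorterStemmer/. It has been implemented in many languages, including C, Java, and Perl.

Here is a brief introduction to it: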

Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffixes from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.

There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
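
The two approaches described above can be sketched side by side. This is only a toy illustration: the dictionary entries, the suffix list, and the names `dictionary_stem` and `rule_stem` are made up for this example, and the rules are not Porter's actual rules.

```python
# Toy illustration only: these are NOT the Porter rules.

# Approach 1: a (tiny) dictionary that maps words to their stems.
STEM_DICTIONARY = {
    "connected": "connect",
    "connecting": "connect",
    "connection": "connect",
    "connections": "connect",
}


def dictionary_stem(word):
    """Look the word up; words missing from the dictionary come back unchanged."""
    return STEM_DICTIONARY.get(word, word)


# Approach 2: a handful of suffix rules, tried in order (longest first).
TOY_SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def rule_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# A word the dictionary has never seen is left alone,
# while the rules still handle it.
print(dictionary_stem("disconnected"))  # disconnected
print(rule_stem("disconnected"))        # disconnect

# The rules also make the occasional mistake.
print(rule_stem("news"))                # new
```

The dictionary is exact for the words it contains but has to grow with the vocabulary; the rule list is a few lines long and copes with unseen words, at the cost of errors such as "news" becoming "new".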

In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html.

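I had tried this algorithm before but gave up on it, for the following reason.

Given "create" and "created" as input, the result is "creat" in both cases. That was a big disappointment: the word was not restored to its original form at all.

This time I had no choice; I still needed this functionality. After a long time searching on Google, I found that Lucene has an English analysis module, but it is too complex for my simple use case. Only later did I learn that Lucene actually uses this same method.

So I forced myself to read through his home page, and in the FAQ I found the following sentence, which made everything clear.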

The purpose of stemming is to bring variant forms of a word together, not to map a word onto its ‘paradigm’ form.

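The Porter stemmer does not try to turn a word into its canonical original form; it only maps the many variants of a word to one common form. In other words, it does not guarantee that a word is restored to its base form, so "created" will not necessarily become "create", but it does map both "create" and "created" to "creat".

Example

For example, if I input "create" and "created", it produces "creat" for both.

So all you need to do is apply the same processing at query time. For example, for the query "create created", the lookup in the database only needs to search for "creat".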
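
A minimal sketch of this idea follows: stem the words when building the index, then stem the query words the same way before looking them up. The function `toy_stem` is a hypothetical stand-in, not the Porter algorithm; its suffix list was picked only so that "create" and "created" both become "creat" as in the example above. In real use it would be replaced by one of the Porter stemmer implementations. The names `build_index` and `search` and the sample documents are likewise assumptions for illustration.

```python
from collections import defaultdict


def toy_stem(word):
    """Stand-in for a real Porter stemmer (toy rules, NOT the Porter rules)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Map each stem to the set of ids of documents containing that stem."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem the query words exactly as at index time, then union the hits."""
    results = set()
    for word in query.split():
        results |= index.get(toy_stem(word), set())
    return results


# Hypothetical sample data.
documents = {
    1: "she created a table",
    2: "create an index first",
    3: "nothing relevant here",
}
index = build_index(documents)

# Both query words reduce to "creat", so a single key is looked up.
print(sorted(search(index, "create created")))  # [1, 2]
```

The stem "creat" never has to be a real word; it only has to be the same key on the indexing side and the query side.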

Appendix

Before-and-after comparison for a sample vocabulary: http://snowball.tartarus.org/algorithms/porter/diffs.txt

Main program (remarkably compact): http://tartarus.org/martin/PorterStemmer/java.txt