Lemmatisation & Stemming

Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

1. Stemmer: extracts the stem or root form of a word (which does not necessarily carry the full meaning on its own).

Porter Stemmer, based on the Porter stemming algorithm:

  >>> from nltk.stem.porter import PorterStemmer
  >>> porter_stemmer = PorterStemmer()
  >>> porter_stemmer.stem('maximum')
  'maximum'
  >>> porter_stemmer.stem('presumably')
  'presum'
  >>> porter_stemmer.stem('multiply')
  'multipli'
  >>> porter_stemmer.stem('provision')
  'provis'
  >>> porter_stemmer.stem('owed')
  'owe'

Lancaster Stemmer, based on the Lancaster stemming algorithm:

  >>> from nltk.stem.lancaster import LancasterStemmer
  >>> lancaster_stemmer = LancasterStemmer()
  >>> lancaster_stemmer.stem('maximum')
  'maxim'
  >>> lancaster_stemmer.stem('presumably')
  'presum'
  >>> lancaster_stemmer.stem('multiply')
  'multiply'
  >>> lancaster_stemmer.stem('provision')
  'provid'
  >>> lancaster_stemmer.stem('owed')
  'ow'

Snowball Stemmer, based on the Snowball stemming algorithm:

  >>> from nltk.stem import SnowballStemmer
  >>> snowball_stemmer = SnowballStemmer('english')
  >>> snowball_stemmer.stem('maximum')
  'maximum'
  >>> snowball_stemmer.stem('presumably')
  'presum'
  >>> snowball_stemmer.stem('multiply')
  'multipli'
  >>> snowball_stemmer.stem('provision')
  'provis'
  >>> snowball_stemmer.stem('owed')
  'owe'
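Why a context-free stemmer cannot tell the noun and verb senses of a word apart can be illustrated with a toy sketch. The rules and the tiny lemma dictionary below are invented for illustration and are not NLTK code:

```python
# A context-free stemmer sees only the surface form, so "meeting"
# is reduced the same way whether it is a noun or a verb.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# A lemmatizer can use the part of speech to pick the right base form.
TOY_LEMMAS = {("meeting", "n"): "meeting", ("meeting", "v"): "meet"}

def toy_lemmatize(word, pos):
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))            # "meet", context ignored
print(toy_lemmatize("meeting", "n"))  # "meeting" (noun: an event)
print(toy_lemmatize("meeting", "v"))  # "meet" (verb: to meet)
```

The stemmer collapses both senses to one string, while the lemmatizer keeps them apart once the part of speech is known, which is exactly the limitation described above.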

2. Lemmatization: reduces any inflected form of a word to its base form; it works best when the part of speech is also supplied.

  >>> from nltk.stem.wordnet import WordNetLemmatizer
  >>> lmtzr = WordNetLemmatizer()
  >>> lmtzr.lemmatize('cars')
  'car'
  >>> lmtzr.lemmatize('feet')
  'foot'
  >>> lmtzr.lemmatize('people')
  'people'
  >>> lmtzr.lemmatize('fantasized', pos='v')  # POS tag
  'fantasize'

One problem with this lemmatizer in NLTK is that the part of speech has to be specified manually. The lemmatizer treats every word as a noun by default, so for 'fantasized' in the example above, leaving out the pos argument would simply return 'fantasized' itself.

To use NLTK for lemmatisation in a real application, a complete solution is:

  1. Take a complete sentence as input
  2. Tokenize the sentence and tag its parts of speech with NLTK's tools
  3. Convert the resulting POS tags to WordNet's format
  4. Lemmatize each word with WordNetLemmatizer

Tokenization and POS tagging in turn have data dependencies:

import nltk

nltk.download("punkt")
nltk.download("maxent_treebank_pos_tagger")
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer


def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


def lemmatize_sentence(sentence):
    res = []
    lemmatizer = WordNetLemmatizer()
    for word, pos in pos_tag(word_tokenize(sentence)):
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))

    return res
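The tag-conversion step can be exercised on its own without NLTK installed, since WordNet's POS constants are simply the single letters 'a', 'v', 'n' and 'r'. The tagged sentence below is hand-written for illustration, mimicking what pos_tag would return:

```python
def treebank_to_wordnet(treebank_tag):
    # Penn Treebank tags start with J/V/N/R for adjectives, verbs,
    # nouns and adverbs; WordNet uses 'a', 'v', 'n', 'r'.
    prefix_map = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return prefix_map.get(treebank_tag[:1])

# Hand-written pos_tag-style output for "The cats are running":
tagged = [("The", "DT"), ("cats", "NNS"), ("are", "VBP"), ("running", "VBG")]
# Fall back to noun when no mapping exists, as lemmatize_sentence does above.
mapped = [(word, treebank_to_wordnet(tag) or "n") for word, tag in tagged]
print(mapped)  # [('The', 'n'), ('cats', 'n'), ('are', 'v'), ('running', 'v')]
```

Unmapped tags such as DT deliberately fall back to the noun tag, matching the `or wordnet.NOUN` default in the full pipeline.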


3. MaxMatch: an algorithm commonly used for word segmentation in Chinese natural language processing.

  from nltk.stem import WordNetLemmatizer
  from nltk.corpus import words

  wordlist = set(words.words())
  wordnet_lemmatizer = WordNetLemmatizer()

  def max_match(text):
      pos2 = len(text)
      result = ''
      while len(text) > 0:
          word = wordnet_lemmatizer.lemmatize(text[0:pos2])
          # Accept the longest prefix whose lemma is a known word; fall
          # back to a single character so the loop always terminates.
          if word in wordlist or pos2 == 1:
              result = result + text[0:pos2] + ' '
              text = text[pos2:]
              pos2 = len(text)
          else:
              pos2 = pos2 - 1
      return result[0:-1]
      
  >>> string = 'theyarebirds'  
  >>> print(max_match(string))  
  they are birds     
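The same greedy longest-match idea can be tried without NLTK's word list by passing in a small hand-built dictionary. The dictionary below is an assumption chosen for illustration, and the function is renamed to avoid clashing with the NLTK-backed version above:

```python
def forward_max_match(text, dictionary):
    """Greedy forward maximum matching: repeatedly take the longest
    prefix found in the dictionary; fall back to a single character
    when nothing matches, so the loop always terminates."""
    result = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in dictionary or end == 1:
                result.append(text[:end])
                text = text[end:]
                break
    return " ".join(result)

toy_dict = {"they", "the", "are", "bird", "birds"}
print(forward_max_match("theyarebirds", toy_dict))  # they are birds
```

Note that the greedy search prefers "they" over the shorter prefix "the", and "birds" over "bird", which is the whole point of taking the maximum match at each step.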


Reposted from: https://www.cnblogs.com/lemonding/p/5978946.html
