Processing Text with NLTK

NLTK (the Natural Language Toolkit) is a powerful library for working with text.

Installation

pip install nltk

In a Python shell, import nltk and call nltk.download() to open a downloader for the corpora and other data NLTK needs.
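If you would rather fetch only what this post uses instead of everything, you can download individual packages; a minimal sketch covering the data used below (punkt for tokenization, the perceptron model behind nltk.pos_tag, and the Brown corpus used in the last section):

import nltk

nltk.download('punkt')                       # tokenizer models
nltk.download('averaged_perceptron_tagger')  # model used by nltk.pos_tag
nltk.download('brown')                       # Brown corpus, used in the last section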

Tokenization

Splitting into words (pass in a sentence):

import nltk

sentence = 'hello,world!'
tokens = nltk.word_tokenize(sentence)

tokens is the resulting list of tokens:

['hello', ',', 'world', '!']

Splitting into sentences (pass in a document made up of several sentences):

text = 'This is a text. I want to split it.'
sens = nltk.sent_tokenize(text)

sens is the resulting list of sentences:

['This is a text.', 'I want to split it.'] 

Part-of-Speech Tagging

Tokenize each sentence first, then run nltk.pos_tag on each token list:

text = 'This is a text for test. And I want to learn how to use nltk.'
words = [nltk.word_tokenize(sen) for sen in nltk.sent_tokenize(text)]
tags = [nltk.pos_tag(tokens) for tokens in words]

tags is a list of tagged sentences, each a list of (word, tag) pairs:

[[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('text', 'NN'), ('for', 'IN'), ('test', 'NN'), ('.', '.')], [('And', 'CC'), ('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('learn', 'VB'), ('how', 'WRB'), ('to', 'TO'), ('use', 'VB'), ('nltk', 'NN'), ('.', '.')]]
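As a quick illustration of putting these tags to work, a minimal sketch that keeps only the noun tokens from the tagged output above:

# Keep every token whose tag starts with 'NN' (nouns of any kind)
nouns = [word for sent in tags for word, tag in sent if tag.startswith('NN')]
# ['text', 'test', 'nltk']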

Appendix: NLTK part-of-speech tags:

  1. CC    Coordinating conjunction
  2. CD    Cardinal number
  3. DT    Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
  4. EX    Existential there
  5. FW    Foreign word
  6. IN    Preposition or subordinating conjunction
  7. JJ    Adjective (also ordinal numbers)
  8. JJR   Adjective, comparative
  9. JJS   Adjective, superlative
  10. LS   List item marker
  11. MD   Modal
  12. NN   Noun, singular or mass
  13. NNS  Noun, plural
  14. NNP  Proper noun, singular
  15. NNPS Proper noun, plural
  16. PDT  Predeterminer
  17. POS  Possessive ending
  18. PRP  Personal pronoun
  19. PRP$ Possessive pronoun
  20. RB   Adverb
  21. RBR  Adverb, comparative
  22. RBS  Adverb, superlative
  23. RP   Particle
  24. SYM  Symbol
  25. TO   to (as preposition or infinitive marker)
  26. UH   Interjection
  27. VB   Verb, base form
  28. VBD  Verb, past tense
  29. VBG  Verb, gerund or present participle
  30. VBN  Verb, past participle
  31. VBP  Verb, non-3rd person singular present
  32. VBZ  Verb, 3rd person singular present
  33. WDT  Wh-determiner (relative: whose, which; interrogative: what, which, whose)
  34. WP   Wh-pronoun (who, whose, which)
  35. WP$  Possessive wh-pronoun
  36. WRB  Wh-adverb (how, where, when)
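There is no need to memorize this table; NLTK can print these definitions itself (a sketch assuming the 'tagsets' data package has been fetched via nltk.download):

import nltk

nltk.download('tagsets')        # one-time download of the tag documentation
nltk.help.upenn_tagset('NN.*')  # describe every noun tag, with examples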

Keyword Extraction

How do you extract keywords from a passage? The main idea is to tokenize first and then tag parts of speech; the tagged chunks are then merged and filtered, as the code below does.

# -*- coding: utf-8 -*-
import nltk
from nltk.corpus import brown
from nltk.stem import SnowballStemmer


# This is our fast Part of Speech tagger
#############################################################################
brown_train = brown.tagged_sents(categories='news')
regexp_tagger = nltk.RegexpTagger(
    [(r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
     (r'(-|:|;)$', ':'),                 # punctuation
     (r'\'*$', 'MD'),                    # modals
     (r'(The|the|A|a|An|an)$', 'AT'),    # articles
     (r'.*able$', 'JJ'),                 # adjectives
     (r'^[A-Z].*$', 'NNP'),              # capitalized words -> proper nouns
     (r'.*ness$', 'NN'),                 # nouns formed from adjectives
     (r'.*ly$', 'RB'),                   # adverbs
     (r'.*s$', 'NNS'),                   # plural nouns
     (r'.*ing$', 'VBG'),                 # gerunds
     (r'.*ed$', 'VBD'),                  # past tense verbs
     (r'.*', 'NN')                       # default: nouns
])
unigram_tagger = nltk.UnigramTagger(brown_train, backoff=regexp_tagger)
bigram_tagger = nltk.BigramTagger(brown_train, backoff=unigram_tagger)
#############################################################################
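# The three taggers above form a backoff chain: the bigram tagger falls back
# to the unigram tagger, which falls back to the regex rules, so every token
# receives some tag. A quick check looks like this (hypothetical call; the
# actual output depends on the Brown training data):
#   bigram_tagger.tag(['this', 'is', 'a', 'test'])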


# This is our semi-CFG; Extend it according to your own needs
#############################################################################
cfg = {}
cfg["NNP+NNP"] = "NNP"
cfg["NN+NN"] = "NNI"
cfg["NNI+NN"] = "NNI"
cfg["JJ+JJ"] = "JJ"
cfg["JJ+NN"] = "NNI"
#############################################################################
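# How extract() below applies these rules (a hypothetical trace, not output
# produced by this script): adjacent tags are merged pairwise, so
#   [('machine', 'NN'), ('learning', 'NN')]        NN+NN -> NNI
# becomes
#   [('machine learning', 'NNI')]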


class NPExtractor(object):
    # Split the sentence into single words/tokens
    def tokenize_sentence(self, sentence):
        tokens = nltk.word_tokenize(sentence)
        # Keep alphabetic tokens longer than one character (drops punctuation
        # and digits; with tf-idf downstream, stopword removal is unnecessary)
        tokens = [w.lower() for w in tokens if w.isalpha() and len(w) > 1]
        # Stemming
        stemmer = SnowballStemmer('english')
        tokens = [stemmer.stem(w) for w in tokens]
        return tokens

    # Normalize brown corpus' tags ("NN", "NN-PL", "NNS" > "NN")
    def normalize_tags(self, tagged):
        n_tagged = []
        for t in tagged:
            if t[1] == "NP-TL" or t[1] == "NP":
                n_tagged.append((t[0], "NNP"))
                continue
            if t[1].endswith("-TL"):
                n_tagged.append((t[0], t[1][:-3]))
                continue
            if t[1].endswith("S"):
                n_tagged.append((t[0], t[1][:-1]))
                continue
            n_tagged.append((t[0], t[1]))
        return n_tagged

    # Extract the main topics from the sentence
    def extract(self, sentence):
        tokens = self.tokenize_sentence(sentence)
        tags = self.normalize_tags(bigram_tagger.tag(tokens))

        # Repeatedly merge adjacent tag pairs that match a cfg rule,
        # restarting the scan after each merge
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = "%s+%s" % (t1[1], t2[1])
                value = cfg.get(key, '')
                if value:
                    merge = True
                    # Remove the pair and insert the merged chunk in its place
                    tags.pop(x)
                    tags.pop(x)
                    match = "%s %s" % (t1[0], t2[0])
                    tags.insert(x, (match, value))
                    break

        matches = []
        for t in tags:
            if t[1] == "NNP" or t[1] == "NNI" or t[1]=="NN":
                matches.append(t[0])
        return matches

Calling the extract function here on a passage gives you its keywords.
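A minimal usage sketch (the sample sentence is made up; note that because tokenize_sentence lowercases and stems its tokens before tagging, the keywords come back as stemmed forms):

extractor = NPExtractor()
keywords = extractor.extract('Natural language processing is a fascinating field.')
print(keywords)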

For more, see the official NLTK documentation: https://www.nltk.org/

Reposted from: https://www.cnblogs.com/mengnan/p/9307645.html
