Reading Text Nodes: Python Text Processing Tutorial (2)

Latest recommended article published on 2022-02-15 23:59:42

weixin_39836726

Article tags: reading text nodes

Article link: https://blog.csdn.net/weixin_39836726/article/details/111700309


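Text Processing State Machine

A state machine is a way of designing a program to control the flow of an application. It is a directed graph consisting of a set of nodes and a set of transition functions. Processing a text file usually means reading each chunk of the file in sequence and performing some action in response to each chunk read. The meaning of a chunk depends on the types of chunks that came before it and on the chunks that come after it.

Consider a scenario in which the text must be a continuous repetition of the sequence AGC (used in protein analysis). As long as the input keeps to this sequence, the state of the machine stays TRUE. As soon as the sequence deviates, the state becomes FALSE and stays FALSE from then on. This ensures that further processing stops even if more chunks of the correct sequence appear later.

The program below defines a state machine with functions to start the machine, take the input text, and step through the processing.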

class StateMachine:

    # Initialize
    def start(self):
        self.state = self.startState

    # Step through the input
    def step(self, inp):
        (s, o) = self.getNextValues(self.state, inp)
        self.state = s
        return o

    # Loop through the input
    def feeder(self, inputs):
        self.start()
        return [self.step(inp) for inp in inputs]

# Determine the TRUE or FALSE state
class TextSeq(StateMachine):
    startState = 0
    def getNextValues(self, state, inp):
        if state == 0 and inp == 'A':
            return (1, True)
        elif state == 1 and inp == 'G':
            return (2, True)
        elif state == 2 and inp == 'C':
            return (0, True)
        else:
            # State 3 is a dead end: no branch above ever leaves it
            return (3, False)


InSeq = TextSeq()

x = InSeq.feeder(['A','A','A'])
print(x)

y = InSeq.feeder(['A', 'G', 'C', 'A', 'C', 'A', 'G'])
print(y)

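Running the program above produces the following output:

[True, False, False]
[True, True, True, True, False, False, False]

In the result for x, the AGC pattern fails at the second input, right after the first 'A'. From then on the result stays False. In the result for y, the AGC pattern holds through the 4th input, so the result stays True up to that point. From the 5th input onward the result becomes False, because G was expected but C was found.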
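The same behaviour can also be written as a lookup table instead of an if/elif chain. The following is only a sketch of that alternative; the names TRANSITIONS, FAILED and run_machine are invented for this illustration and are not part of the program above.

```python
# Transition table for the repeating A -> G -> C pattern.
# Keys are (current_state, input_symbol); any pair not listed is a mismatch.
TRANSITIONS = {
    (0, 'A'): 1,
    (1, 'G'): 2,
    (2, 'C'): 0,
}
FAILED = 3  # absorbing state: it has no outgoing entries in the table


def run_machine(inputs):
    """Return one True/False flag per input symbol."""
    state = 0
    flags = []
    for symbol in inputs:
        # A missing key means the pattern broke (or had already broken).
        state = TRANSITIONS.get((state, symbol), FAILED)
        flags.append(state != FAILED)
    return flags


print(run_machine(['A', 'A', 'A']))
print(run_machine(['A', 'G', 'C', 'A', 'C', 'A', 'G']))
```

Keeping the transitions in a dictionary makes it easy to change the expected sequence without touching the control flow.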

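Capitalization

Capitalizing strings is a routine need in any text processing system. Python supports it through the standard library. In the example below we use capwords() from the string module and the upper() string method. capwords() capitalizes the first letter of each word, while upper() converts the whole string to uppercase.

import string

text = 'Yiibaipoint - simple easy learning.'

print(string.capwords(text))
print(text.upper())

Running the program above produces the following output:

Yiibaipoint - Simple Easy Learning.
YIIBAIPOINT - SIMPLE EASY LEARNING.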
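As a side note, capwords() is not the same as the built-in title() string method. The short sketch below (the sample phrase is made up for illustration) shows the difference on text that contains apostrophes.

```python
import string

phrase = "they're bill's friends"

# capwords() splits on whitespace only, so apostrophes stay inside a word.
capwords_result = string.capwords(phrase)

# title() treats every non-letter as a word boundary.
title_result = phrase.title()

print(capwords_result)
print(title_result)
```

For ordinary prose, capwords() is usually the safer choice of the two.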

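Translation in Python essentially means replacing specific letters with other letters. It can be used to encrypt and decrypt strings.

text = 'Yiibaipoint - simple easy learning.'

# str.maketrans maps each character of the first argument to the one at the same position in the second
transtable = str.maketrans('tpol', 'wxyz')
print(text.translate(transtable))

Running the program above produces the following output:

Yiibaixyinw - simxze easy zearning.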
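To illustrate the encryption and decryption use mentioned above, here is a minimal sketch that builds a translation table together with its inverse. The sample message is invented, and the round trip only works because the message contains none of the substitute letters (w, x, y, z) to begin with.

```python
# Build a substitution table and the inverse table that undoes it.
plain = 'tpol'
cipher = 'wxyz'
encrypt_table = str.maketrans(plain, cipher)
decrypt_table = str.maketrans(cipher, plain)

message = 'top plot'
encrypted = message.translate(encrypt_table)
decrypted = encrypted.translate(decrypt_table)

print(encrypted)
print(decrypted)
```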

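Tokenization

In Python, tokenization basically means splitting a larger body of text into smaller lines or words, or even creating words for a non-English language. Various tokenization functions are built into the nltk module and can be used in programs as shown below.

Line Tokenization

In the example below, the function sent_tokenize divides the given text into separate lines (sentences).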

import nltk
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python,Django and Data Ananlysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)

Running the program above produces the following output:

['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python,Django and Data Ananlysis here.']

Non-English Tokenization

The example below tokenizes German text.

import nltk

german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens=german_tokenizer.tokenize('Wie geht es Ihnen?  Gut, danke.')
print(german_tokens)

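Running the program above produces the following output:

['Wie geht es Ihnen?', 'Gut, danke.']

Word Tokenization

We use nltk's word_tokenize function to split text into words, as in the following code: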

import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)

Running the program above produces the following output:

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
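To give a feel for what word tokenization involves, the sketch below splits text with a single regular expression from the standard library. It is a rough approximation only and is not the algorithm nltk uses; simple_tokenize is a name invented for this illustration.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("You can learn Python, Django and more here.")
print(tokens)
```

A dedicated tokenizer handles contractions, abbreviations and quotes far more carefully than this pattern does.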

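Removing Stop Words

Stop words are English words that add little meaning to a sentence. They can be ignored safely without sacrificing the meaning of the sentence, for example the, he, and have. Such words are already collected in an nltk corpus named stopwords. We first download it into the Python environment with the following code:

import nltk
nltk.download('stopwords')

This downloads a file containing the English stop words.

Verifying the Stop Words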

from nltk.corpus import stopwords
stopwords.words('english')
print(stopwords.words()[620:680])

Running the program above produces the following output:

['your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she',
"she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be',
'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
'while', 'of', 'at']

Besides English, stop words are available for the following languages.

from nltk.corpus import stopwords
print(stopwords.fileids())

Running the program above produces the following output:

['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish',
'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian',
'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian',
'spanish', 'swedish', 'turkish']

Example

The example below shows how to remove stop words from a list of words.

from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']
for word in all_words: 
    if word not in en_stops:
        print(word)

Running the program above produces the following output:

There
tree
near
river
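Note that "There" survives in the output because the membership test is case-sensitive and the stop-word list is lowercase. A possible variation, sketched here with a tiny hand-written set standing in for stopwords.words('english') (sample_stops is a hypothetical stand-in, not the real list), lowercases each word before the lookup.

```python
# Hypothetical miniature stop-word set used only for this illustration.
sample_stops = {'there', 'is', 'a', 'the'}

all_words = ['There', 'is', 'a', 'tree', 'near', 'the', 'river']

# Compare in lowercase so that capitalized stop words are filtered too.
kept = [word for word in all_words if word.lower() not in sample_stops]
print(kept)
```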

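Synonyms and Antonyms

Synonyms and antonyms are available as part of WordNet, a lexical database for English. It is accessible as one of the nltk corpora. In WordNet, synonyms are words that denote the same concept and are interchangeable in many contexts, so they are grouped into unordered sets (synsets). We use these synsets to derive synonyms and antonyms, as the programs below show.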

from nltk.corpus import wordnet

synonyms = []

for syn in wordnet.synsets("Soil"):
    for lm in syn.lemmas():
        synonyms.append(lm.name())
print (set(synonyms))

Running the program above produces the following output:

{'grease', 'filth', 'dirt', 'begrime', 'soil',
'grime', 'land', 'bemire', 'dirty', 'grunge',
'stain', 'territory', 'colly', 'ground'}

To get antonyms, we simply use the antonyms() function.

from nltk.corpus import wordnet
antonyms = []

for syn in wordnet.synsets("ahead"):
    for lm in syn.lemmas():
        if lm.antonyms():
            antonyms.append(lm.antonyms()[0].name())

print(set(antonyms))

Running the program above produces the following output:

{'backward', 'back'}

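Text Translation

Translating text from one language to another is increasingly common on websites. The Python package that helps us do this is called translate.

The package can be installed as follows. It provides translation for the major languages.

pip install translate

Below is an example of translating a simple sentence from English to German. The default source language is English.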

from translate import Translator
translator= Translator(to_lang="German")
translation = translator.translate("Good Morning!")
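print(translation)

Running the program above produces the following output:

Guten Morgen!

Between Any Two Languages

If you need to specify both the source language and the target language, do so as in the program below.

from translate import Translator
translator= Translator(from_lang="german",to_lang="spanish")
translation = translator.translate("Guten Morgen")
print(translation)

Running the example code above gives the following result: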

Buenos días

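Word Replacement

Replacing a whole string or part of a string is a very common requirement in text processing. The replace() method returns a copy of the string in which occurrences of old are replaced with new, optionally limiting the number of replacements to max.

The syntax of the replace() method is:

str.replace(old, new[, max])

old - The substring to be replaced.
new - The new substring that replaces the old one.
max - Optional. If given, only the first max occurrences are replaced.

The method returns a copy of the string with all occurrences of old replaced by new. If the optional argument max is given, only the first max occurrences are replaced.

Example

The following example shows how to use the replace() method.

# Named text rather than str, to avoid shadowing the built-in type
text = "this is string example....wow!!! this is really string"
print (text.replace("is", "was"))
print (text.replace("is", "was", 3))

Running the program above produces the following result: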

thwas was string example....wow!!! thwas was really string
thwas was string example....wow!!! thwas is really string

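Case-Insensitive Replacement

import re
# re.IGNORECASE makes the pattern match "Tutor", "tutor", "TUTOR", and so on
sourceline  = re.compile("Tutor", re.IGNORECASE)

Replacedline  = sourceline.sub("Tutor","Tutorialyiibai has the best tutorials for learning.")
print (Replacedline)

Running the program above produces the following output:

Tutorialyiibai has the best Tutorials for learning.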
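The same idea can be written without a precompiled pattern by passing flags to re.sub directly, and the count argument plays the role that max plays in replace(). This is a small sketch with an invented sample sentence.

```python
import re

line = "Tutor, tutor and TUTOR all match."

# flags=re.IGNORECASE makes the pattern match regardless of letter case.
replaced_all = re.sub("tutor", "teacher", line, flags=re.IGNORECASE)

# count=1 limits the substitution to the first match only.
replaced_first = re.sub("tutor", "teacher", line, count=1, flags=re.IGNORECASE)

print(replaced_all)
print(replaced_first)
```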

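Spell Checking

Checking spelling is a basic requirement in any text processing or analysis. The Python package pyspellchecker provides this feature: it finds words that may be misspelled and suggests possible corrections.

First, we need to install the required package in the Python environment with the following command.

pip install pyspellchecker

Below we see how to use the package to point out misspelled words and to suggest possible correct words.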

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Running the program above produces the following output:

group
{'group', 'ground', 'groan', 'grout', 'grown', 'groin'}
walk
{'flak', 'weak', 'walk'}

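Case Sensitivity
If we use Let instead of let, the word is compared case-sensitively against the closest matching words in the dictionary, and the result now looks different.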

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['Let', 'us', 'wlak','on','the','groun'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Running the program above produces the following output:

group
{'groin', 'ground', 'groan', 'group', 'grown', 'grout'}
walk
{'walk', 'flak', 'weak'}
get
{'aet', 'ret', 'get', 'cet', 'bet', 'vet', 'pet', 'wet', 'let', 'yet', 'det', 'het', 'set', 'et', 'jet', 'tet', 'met', 'fet', 'net'}
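Candidate lists like the ones above can be produced by generating every string within a small edit distance of the input and keeping those found in a dictionary. The sketch below shows that idea for edit distance 1 only. It is an illustration, not the package's actual code, and known_words is a hypothetical miniature dictionary; a real checker works from a large word-frequency list.

```python
import string


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


# Hypothetical mini-dictionary for the illustration.
known_words = {'walk', 'weak', 'flak', 'ground', 'group', 'let', 'us', 'on', 'the'}


def candidates(word):
    # Keep only the generated strings that are real dictionary words.
    return edits1(word) & known_words


print(sorted(candidates('wlak')))
print(sorted(candidates('groun')))
```

Choosing the single best correction then comes down to ranking the candidates, for example by how often each word occurs in a reference text.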

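The WordNet Interface

WordNet is a dictionary of English, similar to a traditional thesaurus, and NLTK includes the English WordNet. We can use it as a reference for looking up words, usage examples, and definitions. A collection of similar words is called lemmas. The words in WordNet are organized as nodes and edges, where nodes represent the word text and edges represent the relations between words. Below we learn how to use the WordNet module.

All Lemmas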

from nltk.corpus import wordnet as wn
res=wn.synset('locomotive.n.01').lemma_names()
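print(res)

Running the program above produces the following output:

['locomotive', 'engine', 'locomotive_engine', 'railway_locomotive']

Word Definition
The dictionary definition of a word can be obtained with the definition() function. It describes the meaning of the word as it would appear in an ordinary dictionary, as in the following code: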

from nltk.corpus import wordnet as wn
resdef = wn.synset('ocean.n.01').definition()
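print(resdef)

Running the program above produces the following output:

a large body of water constituting a principal part of the hydrosphere

Usage Examples
Example sentences that show how a word is used can be obtained with the examples() function.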

from nltk.corpus import wordnet as wn
res_exm = wn.synset('good.n.01').examples()
print(res_exm)

Running the example code above gives the following result:

['for your own good', "what's the good of worrying?"]

Antonyms

Use the antonyms() function to get all opposite words.

from nltk.corpus import wordnet as wn
# get all the antonyms
res_a = wn.lemma('horizontal.a.01.horizontal').antonyms()
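print(res_a)

Running the program above produces the following output:

[Lemma('inclined.a.02.inclined'), Lemma('vertical.a.01.vertical')]

Corpus Access

Corpora is a group of collections of text documents, and a single collection is called a corpus. One well-known corpus is the Gutenberg corpus, a selection of texts from the Project Gutenberg archive, which contains some 25,000 free electronic books and is hosted at http://www.gutenberg.org/. In the example below we access only the names of the files in the corpus, which are plain text files with names ending in .txt.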

from nltk.corpus import gutenberg
fields = gutenberg.fileids()

print(fields)

Running the example code above gives the following result:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Accessing the Raw Text

The raw text of these files can be read with gutenberg.raw() and then split with the sent_tokenize function, which is also available in nltk. The example below retrieves the first two chunks (sentences) of the blake-poems text.

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = sent_tokenize(sample)

for para in range(2):
    print(token[para])

Running the program above produces the following output:

[Poems by William Blake 1789]


SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE


 INTRODUCTION

 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:

 "Pipe a song about a Lamb!"
So I piped with merry cheer.

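Tagging Words

Tagging is a basic feature of text processing in which we label words with their grammatical category. We use tokenization together with the pos_tag function to create a tag for each word.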

import nltk

text = nltk.word_tokenize("A Python is a serpent which eats eggs from the nest")
tagged_text=nltk.pos_tag(text)
print(tagged_text)

Running the example code above gives the following result:

[('A', 'DT'), ('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('serpent', 'NN'), 
('which', 'WDT'), ('eats', 'VBZ'), ('eggs', 'NNS'), ('from', 'IN'), 
('the', 'DT'), ('nest', 'JJS')]

Tag Descriptions

The meaning of each tag can be looked up with the following program, which displays the built-in descriptions.

import nltk

nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('IN')
nltk.help.upenn_tagset('DT')

Running the program above produces the following output:

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those

Tagging a Corpus

We can also tag corpus data and view the tagging result for each word in the corpus, as in the following code:

import nltk

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg
sample = gutenberg.raw("blake-poems.txt")
tokenized = sent_tokenize(sample)
for i in tokenized[:2]:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    print(tagged)

Running the example code above gives the following result:

[('[', 'JJ'), ('Poems', 'NNP'), ('by', 'IN'), ('William', 'NNP'), ('Blake', 'NNP'), ('1789', 'CD'),
(']', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('AND', 'NNP'), ('OF', 'NNP'),
('EXPERIENCE', 'NNP'), ('and', 'CC'), ('THE', 'NNP'), ('BOOK', 'NNP'), ('of', 'IN'),
('THEL', 'NNP'), ('SONGS', 'NNP'), ('OF', 'NNP'), ('INNOCENCE', 'NNP'), ('INTRODUCTION', 'NNP'),
('Piping', 'VBG'), ('down', 'RP'), ('the', 'DT'), ('valleys', 'NN'), ('wild', 'JJ'),
(',', ','), ('Piping', 'NNP'), ('songs', 'NNS'), ('of', 'IN'), ('pleasant', 'JJ'), ('glee', 'NN'),
 (',', ','), ('On', 'IN'), ('a', 'DT'), ('cloud', 'NN'), ('I', 'PRP'), ('saw', 'VBD'),
 ('a', 'DT'), ('child', 'NN'), (',', ','), ('And', 'CC'), ('he', 'PRP'), ('laughing', 'VBG'),
 ('said', 'VBD'), ('to', 'TO'), ('me', 'PRP'), (':', ':'), ('``', '``'), ('Pipe', 'VB'),
 ('a', 'DT'), ('song', 'NN'), ('about', 'IN'), ('a', 'DT'), ('Lamb', 'NN'), ('!', '.'), ("''", "''")]

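Chunks and Chinks

Chunking is the process of grouping similar words together based on the nature of the words. In the example below we define the grammar by which the chunks must be generated. The grammar describes the sequence of phrase components, such as nouns and adjectives, that is followed when creating chunks. The program also draws the chunks graphically.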

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),("flower", "NN"), 
("flew", "VBD"), ("through", "IN"),  ("the", "DT"), ("window", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence) 
print(result)
result.draw()

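Running the program above prints the chunk tree and opens a window that draws it (the figure is not reproduced here).

Changing the grammar gives a different output, as the following code shows: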

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),("flower", "NN"),
 ("flew", "VBD"), ("through", "IN"),  ("the", "DT"), ("window", "NN")]

grammar = "NP: {<DT>?<JJ>*}"

chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence) 
print(result)
result.draw()

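The result is printed and drawn in the same way (the figure is not reproduced here).

Chinking

Chinking is the process of removing a sequence of tokens from a chunk. If the token sequence appears in the middle of a chunk, those tokens are removed, leaving two chunks where there was only one before.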

import nltk

sentence = [("The", "DT"), ("small", "JJ"), ("red", "JJ"),("flower", "NN"), ("flew", "VBD"), ("through", "IN"),  ("the", "DT"), ("window", "NN")]

grammar = r"""
  NP:
    {<.*>+}         # Chunk everything
    }<JJ|NN>+{      # Chink sequences of JJ and NN
  """
chunkprofile = nltk.RegexpParser(grammar)
result = chunkprofile.parse(sentence) 
print(result)
result.draw()

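Running the program above prints and draws the resulting tree (the figure is not reproduced here).

As the result shows, the parts that match the chink pattern are left out of the noun phrase, which splits it into separate chunks. This process of extracting text that does not belong in the required chunk is called chinking.

Chunk Classification

Classification-based chunking classifies text as groups of words rather than as individual words. A simple scenario is tagging the text in sentences, and we use a corpus to demonstrate the classification. We choose the conll2000 corpus, which contains data from the Wall Street Journal (WSJ) corpus and is used for noun-phrase-based chunking.

First, add the corpus to the environment with the following commands.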

import nltk
nltk.download('conll2000')

Let us look at the first few sentences in this corpus.

from nltk.corpus import conll2000

x = (conll2000.sents())
for i in range(3):
     print(x[i])
     print('\n')

Running the program above produces the following output:

['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']


['Chancellor', 'of', 'the', 'Exchequer', 'Nigel', 'Lawson', "'s", 'restated', 'commitment', 'to', 'a', 'firm', 'monetary', 'policy', 'has', 'helped', 'to', 'prevent', 'a', 'freefall', 'in', 'sterling', 'over', 'the', 'past', 'week', '.']


['But', 'analysts', 'reckon', 'underlying', 'support', 'for', 'sterling', 'has', 'been', 'eroded', 'by', 'the', 'chancellor', "'s", 'failure', 'to', 'announce', 'any', 'new', 'policy', 'measures', 'in', 'his', 'Mansion', 'House', 'speech', 'last', 'Thursday', '.']

Next, we use the function tagged_sents() to get the sentences with each word tagged with its part of speech.

from nltk.corpus import conll2000

x = (conll2000.tagged_sents())
for i in range(3):
     print(x[i])
     print('\n')

Running the program above produces the following output:

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ('pound', 'NN'), ('is', 'VBZ'), ('widely', 'RB'), ('expected', 'VBN'), ('to', 'TO'), ('take', 'VB'), ('another', 'DT'), ('sharp', 'JJ'), ('dive', 'NN'), ('if', 'IN'), ('trade', 'NN'), ('figures', 'NNS'), ('for', 'IN'), ('September', 'NNP'), (',', ','), ('due', 'JJ'), ('for', 'IN'), ('release', 'NN'), ('tomorrow', 'NN'), (',', ','), ('fail', 'VB'), ('to', 'TO'), ('show', 'VB'), ('a', 'DT'), ('substantial', 'JJ'), ('improvement', 'NN'), ('from', 'IN'), ('July', 'NNP'), ('and', 'CC'), ('August', 'NNP'), ("'s", 'POS'), ('near-record', 'JJ'), ('deficits', 'NNS'), ('.', '.')]


[('Chancellor', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Exchequer', 'NNP'), ('Nigel', 'NNP'), ('Lawson', 'NNP'), ("'s", 'POS'), ('restated', 'VBN'), ('commitment', 'NN'), ('to', 'TO'), ('a', 'DT'), ('firm', 'NN'), ('monetary', 'JJ'), ('policy', 'NN'), ('has', 'VBZ'), ('helped', 'VBN'), ('to', 'TO'), ('prevent', 'VB'), ('a', 'DT'), ('freefall', 'NN'), ('in', 'IN'), ('sterling', 'NN'), ('over', 'IN'), ('the', 'DT'), ('past', 'JJ'), ('week', 'NN'), ('.', '.')]


[('But', 'CC'), ('analysts', 'NNS'), ('reckon', 'VBP'), ('underlying', 'VBG'), ('support', 'NN'), ('for', 'IN'), ('sterling', 'NN'), ('has', 'VBZ'), ('been', 'VBN'), ('eroded', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('chancellor', 'NN'), ("'s", 'POS'), ('failure', 'NN'), ('to', 'TO'), ('announce', 'VB'), ('any', 'DT'), ('new', 'JJ'), ('policy', 'NN'), ('measures', 'NNS'), ('in', 'IN'), ('his', 'PRP$'), ('Mansion', 'NNP'), ('House', 'NNP'), ('speech', 'NN'), ('last', 'JJ'), ('Thursday', 'NNP'), ('.', '.')]

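Text Classification

We often need to sort the available text into categories according to some predefined criteria. nltk provides this kind of feature as part of several corpora. In the example below we look at the movie reviews corpus and check the categories it offers.

# Let's see how the movies are classified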
from nltk.corpus import movie_reviews

all_cats = []
for w in movie_reviews.categories():
    all_cats.append(w.lower())
print(all_cats)

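Running the program above produces the following output:

['neg', 'pos']

Now let us look at the content of one of the files with a positive review. We split the file into sentences and print the first four to see a sample.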

from nltk.corpus import movie_reviews
from nltk.tokenize import sent_tokenize
fields = movie_reviews.fileids()

sample = movie_reviews.raw("pos/cv944_13521.txt")

token = sent_tokenize(sample)
for lines in range(4):
    print(token[lines])

Running the program above produces the following output:

meteor threat set to blow away all volcanoes & twisters !
summer is here again !
this season could probably be the most ambitious = season this decade with hollywood churning out films 
like deep impact , = godzilla , the x-files , armageddon , the truman show , 
all of which has but = one main aim , to rock the box office .
leading the pack this summer is = deep impact , one of the first few film 
releases from the = spielberg-katzenberg-geffen's dreamworks production company .

Next, we collect the words from every file and use nltk's FreqDist function to find the most frequently used words.

import nltk
from nltk.corpus import movie_reviews
fields = movie_reviews.fileids()

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(10))

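Running the program above produces the following output:

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576),
('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822)]

Bigrams

Some English words occur together more frequently than others, for example sky high, do or die, best performance, and heavy rain. In a text document we may therefore need to identify such pairs of words, which helps in sentiment analysis. First we need to generate these word pairs from the existing sentence while keeping their current order. Such pairs are called bigrams. NLTK provides a bigrams function that helps us generate these pairs.

Example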

import nltk

word_data = "The best performance can bring in sky high success."
nltk_tokens = nltk.word_tokenize(word_data)      

print(list(nltk.bigrams(nltk_tokens)))

Running the program above produces the following output:

[('The', 'best'), ('best', 'performance'), ('performance', 'can'), ('can', 'bring'), 
('bring', 'in'), ('in', 'sky'), ('sky', 'high'), ('high', 'success'), ('success', '.')]

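The result can be used to compute frequency statistics for such pairs in a given text, which in turn relates to the general sentiment expressed in the body of the text.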
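As a rough sketch of that frequency count, the following example builds bigrams with zip and tallies them with collections.Counter from the standard library. The sample sentence and the helper name bigrams are invented for this illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter


def bigrams(tokens):
    # Pair each token with its successor; zip stops at the shorter sequence.
    return list(zip(tokens, tokens[1:]))


tokens = "the best performance is the best reward".split()
counts = Counter(bigrams(tokens))

# The most frequent pair and how many times it occurs.
print(counts.most_common(1))
```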