Chapter 5: Categorizing and Tagging Words

Latest recommended article published on 2022-01-30 22:15:47

SherryLovesCoding

Article link: https://blog.csdn.net/sherrylovescoding/article/details/90242299

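This chapter addresses the following questions:

1. What are lexical categories, and how are they used in natural language processing?
2. What is a good Python data structure for storing words and their categories?
3. How can we automatically tag each word of a text with its word class?

Along the way, we cover some fundamental techniques in NLP, including sequence labeling, n-gram models, backoff, and evaluation.
These techniques are useful in many areas, and tagging gives us a simple context in which to present them.
We will also see that tagging is the second step in the typical NLP pipeline, following tokenization.
The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags and on tagging text automatically.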

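5.1 Using a Tagger

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word (don't forget to import nltk):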

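import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))       # list of (word, tag) pairs
nltk.help.upenn_tagset('CC')    # prints the documentation for the CC tag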

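Here 'CC' is a coordinating conjunction, 'RB' an adverb, 'IN' a preposition, 'NN' a noun, and 'JJ' an adjective.
Words such as "refuse" and "permit" can each be a present-tense verb (VBP) or a noun (NN), and the two readings are pronounced differently (refUSE the verb versus REFuse the noun). A system has to know which one is meant in order to pronounce the text correctly, which is why text-to-speech systems usually perform part-of-speech tagging.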

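text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')  # 'woman' is just an example query word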

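The similar() method of an nltk.Text lists the words that occur in the same contexts as a given word, and these tend to share its part of speech. A tagger can correctly identify the tags on such words in the context of a sentence.
A tagger can also model our knowledge of unknown words. For example, we can guess that scrobbling is probably a verb with the root scrobble, and that it is likely to occur in contexts like "he was scrobbling".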
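The intuition about unknown words can be sketched with a few suffix rules. This is only a toy illustration and not how nltk.pos_tag works internally; the helper guess_tag and its rule list are hypothetical, and the tag names are the Penn Treebank tags used above.

```python
# Toy suffix-based guesser for unknown words (hypothetical helper, not an NLTK API).
SUFFIX_RULES = [
    ("ing", "VBG"),  # gerund or present participle, e.g. "scrobbling"
    ("ed", "VBD"),   # simple past
    ("ly", "RB"),    # adverb
    ("s", "NNS"),    # plural noun
]


def guess_tag(word, default="NN"):
    """Return the tag of the first matching suffix rule, else the default tag."""
    for suffix, tag in SUFFIX_RULES:
        if word.lower().endswith(suffix):
            return tag
    return default


print(guess_tag("scrobbling"))  # VBG
print(guess_tag("scrobble"))    # NN
```

Rules are tried in order, so the longer and more specific suffixes come first. A real tagger would also use the surrounding words rather than the suffix alone.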

5.2 Tagged Corpora

5.2.1 Representing Tagged Tokens

tagged_token = nltk.tag.str2tuple('fly/NN')

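We can construct a list of tagged tokens directly from a string.
The first step is to split the string into its individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).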

sent = 'The/AT grand/JJ jury/NN commented/VBD'  # example input in word/TAG form
[nltk.tag.str2tuple(t) for t in sent.split()]
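To make the conversion step concrete, here is a minimal pure-Python sketch of splitting word/TAG strings into tuples. The function split_tagged_token is a hypothetical stand-in written for illustration; it assumes the tag follows the last "/" and is upper-cased, and it is not NLTK's own implementation.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' at the last separator; return (token, None) if there is none."""
    word, found, tag = token.rpartition(sep)
    if not found:
        return (token, None)
    return (word, tag.upper())


sent = "The/AT grand/JJ jury/NN commented/VBD"  # example sentence in word/TAG form
tagged = [split_tagged_token(t) for t in sent.split()]
print(tagged)
# [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD')]
```

Splitting at the last separator keeps tokens such as "and/or/CC" intact as the word "and/or".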

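5.2.2 Reading Tagged Corpora

NLTK's corpus readers provide a uniform interface, so you don't have to worry about the different file formats. To avoid the complications of corpus-specific tagsets, you can set tagset to 'universal'.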

nltk.corpus.brown.tagged_words()
nltk.corpus.brown.tagged_words(tagset='universal')

Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.
Here are some more examples, again to show the output format:

nltk.corpus.nps_chat.tagged_words()
nltk.corpus.conll2000.tagged_words()
nltk.corpus.treebank.tagged_words()

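If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides the tagged words into sentences rather than presenting them as one big list.
This is useful when we come to develop automatic taggers, because they are trained and tested on lists of sentences, not words.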
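Because taggers work on lists of sentences, a common first step is to split the tagged sentences into training and test portions. The sketch below uses made-up data; the toy sentences and the 50% split are assumptions for illustration, and with a real corpus the list would come from tagged_sents().

```python
from itertools import chain

# Toy stand-in for the output of a corpus reader's tagged_sents() method.
tagged_sents = [
    [("The", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
    [("It", "PRON"), ("slept", "VERB")],
]

# Flattening the sentences gives the same shape as tagged_words().
tagged_words = list(chain.from_iterable(tagged_sents))

# Split on sentence boundaries so that no sentence is cut in half.
split = int(len(tagged_sents) * 0.5)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

print(len(tagged_words), len(train_sents), len(test_sents))  # 5 1 1
```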

print(nltk.corpus.indian.tagged_sents())

5.2.3 A Universal Part-of-Speech Tagset

List the tags used in the news category of the Brown corpus, from most to least frequent:

from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())

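5.2.4 Nouns

In the simplified tagset, N denotes common nouns such as book, and NP denotes proper nouns such as Scotland. In the universal tagset used in the code here, both fall under NOUN.
Look at the tags of the words that precede nouns in the news category of the Brown corpus, from most to least frequent: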

word_tag_pairs = nltk.bigrams(brown_news_tagged)
noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
fdist = nltk.FreqDist(noun_preceders)
[tag for (tag, _) in fdist.most_common()]
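The same bigram idea can be tried without downloading a corpus. In this sketch the tagged sentence is made up for illustration, zip(seq, seq[1:]) stands in for nltk.bigrams, and collections.Counter stands in for nltk.FreqDist.

```python
from collections import Counter

# Toy tagged sentence (invented data, universal-style tags).
tagged = [("The", "DET"), ("jury", "NOUN"), ("praised", "VERB"),
          ("the", "DET"), ("fair", "ADJ"), ("election", "NOUN"),
          ("in", "ADP"), ("the", "DET"), ("city", "NOUN")]

# Adjacent pairs of (word, tag) tuples.
pairs = list(zip(tagged, tagged[1:]))

# Count the tag of the first item whenever the second item is a noun.
noun_preceders = Counter(a[1] for (a, b) in pairs if b[1] == "NOUN")
print(noun_preceders.most_common())  # [('DET', 2), ('ADJ', 1)]
```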

5.2.5 Verbs

List the verbs in the treebank corpus, from most to least frequent (tagset='universal'):

wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
word_tag_fd = nltk.FreqDist(wsj)
[wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']

A conditional frequency distribution shows how the tags of a given word are distributed in the treebank corpus (tagset='universal'):

cfd1 = nltk.ConditionalFreqDist(wsj)
print(cfd1['yield'].most_common()) #[('VERB', 28), ('NOUN', 20)]
print(cfd1['cut'].most_common())   #[('VERB', 25), ('NOUN', 3)]
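A conditional frequency distribution is essentially one counter per condition. The sketch below builds that structure with the standard library on invented data, so the counts are not the corpus counts shown above; it also shows that swapping the pair order groups words by tag instead of tags by word.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs; the counts are invented for illustration.
tagged = [("yield", "VERB"), ("cut", "VERB"), ("yield", "NOUN"),
          ("cut", "VERB"), ("yield", "VERB"), ("cut", "NOUN")]

# Condition = word, sample = tag.
cfd = defaultdict(Counter)
for word, tag in tagged:
    cfd[word][tag] += 1
print(cfd["yield"].most_common())  # [('VERB', 2), ('NOUN', 1)]

# Condition = tag, sample = word (the pairs are reversed).
by_tag = defaultdict(Counter)
for word, tag in tagged:
    by_tag[tag][word] += 1
print(sorted(by_tag["NOUN"]))  # ['cut', 'yield']
```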

List the words that carry a given tag in the treebank corpus (using the corpus's default Penn Treebank tags rather than the universal tagset):

wsj = nltk.corpus.treebank.tagged_words()
cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
print(list(cfd2['VBN']))
nltk.help.upenn_tagset('VBN')  # prints the tag documentation itself and returns None

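To clarify the distinction between VBD (past tense) and VBN (past participle), let's find words that can be both VBD and VBN and look at some surrounding text (again with the default Penn Treebank tags):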

wsj = nltk.corpus.treebank.tagged_words()
cfd1 = nltk.ConditionalFreqDist(wsj)
[w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]
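Once an ambiguous word is found, its surrounding text can be inspected by locating a (word, tag) pair and slicing the list around it. The sketch below does this on an invented tagged sequence; the helper context_before and the data are hypothetical, and with the real corpus the list would be the tagged words loaded above.

```python
# Toy tagged text in which "kicked" occurs once as VBD and once as VBN.
tagged = [("While", "IN"), ("program", "NN"), ("trades", "NNS"), ("swiftly", "RB"),
          ("kicked", "VBD"), ("in", "IN"), (",", ","), ("the", "DT"),
          ("ball", "NN"), ("was", "VBD"), ("kicked", "VBN"), ("away", "RB")]


def context_before(tagged_words, target, width=4):
    """Return the first occurrence of target together with up to `width` preceding tokens."""
    idx = tagged_words.index(target)
    return tagged_words[max(0, idx - width): idx + 1]


print(context_before(tagged, ("kicked", "VBD")))
print(context_before(tagged, ("kicked", "VBN"), width=2))
# [('ball', 'NN'), ('was', 'VBD'), ('kicked', 'VBN')]
```

In the second context the participle follows a form of "be", which is a typical cue for VBN.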

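5.2.6 Adjectives and Adverbs

Two other important word classes are adjectives and adverbs.
Adjectives describe nouns and can be used as modifiers (e.g. large in the large pizza) or as predicates (e.g. the pizza is large).
English adjectives can have internal structure (e.g. fall+ing in the falling stocks).
Adverbs modify verbs to specify the time, manner, place, or direction of the event described by the verb (e.g. quickly in the stocks fell quickly).
Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).
Besides prepositions, English has several other closed classes of words, such as articles (also often called determiners, e.g. the, a), modals (e.g. should, may), and personal pronouns (e.g. she, they). Each dictionary and grammar classifies these words differently.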

5.2.7 Unsimplified Tags

For each tag in tagged_text that starts with tag_prefix, get the five most frequent words:

def findtags(tag_prefix, tagged_text):
  cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                 if tag.startswith(tag_prefix))
  return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))

Display the results sorted by tag:

for tag in sorted(tagdict):
    print(tag, tagdict[tag])

5.2.8 Exploring Tagged Corpora

Suppose we are studying the word 'often' and want to see how it is used in text. We can look at the words that follow 'often':

brown_learned_text = brown.words(categories='learned')
print(sorted(set(b for (a, b) in nltk.bigrams(brown_learned_text) if a == 'often')))

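Here is the distribution of the tags of the words that follow 'often'. Note that the most frequent tags in this position are verbs, and that nouns never appear here (in this particular corpus).

brown_lrnd_tagged = brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

Find sequences of the form verb + to + verb: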

def process(sentence):
    # Print every verb + "to" + verb sequence in one tagged sentence.
    for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
        if t1.startswith('V') and t2 == 'TO' and t3.startswith('V'):
            print(w1, w2, w3)

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)

Print the words that occur with more than three different tags, together with those tags ordered by frequency:

brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
data = nltk.ConditionalFreqDist((word.lower(), tag)
                                for (word, tag) in brown_news_tagged)
for word in sorted(data.conditions()):
    if len(data[word]) > 3:
        tags = [tag for (tag, _) in data[word].most_common()]
        print(word, ' '.join(tags))

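5.3 Mapping Words to Properties Using Python Dictionaries

In this section, we look at dictionaries and see how they can represent a variety of language information, including parts of speech.

5.3.1 Indexing Lists vs Dictionaries

Unlike a list, which is indexed by integer position, a dictionary lets us map between pieces of information of different types, for example from a word to its part of speech.

5.3.2 Dictionaries in Python

Python provides a dictionary data type that maps between values of different types. Like a conventional dictionary, it gives us an efficient way to look things up. A key maps to exactly one entry, but that entry can be a list.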
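A short sketch of these points, using only built-in dictionary operations; the words and tags are example data.

```python
pos = {}
pos["colorless"] = "ADJ"
pos["sleep"] = ["NOUN", "VERB"]  # one key, one entry, but the entry is a list

print(pos["sleep"])                  # ['NOUN', 'VERB']
print("ideas" in pos)                # False: membership test does not add the key
print(pos.get("ideas", "UNKNOWN"))   # UNKNOWN: fallback value for a missing key

# Assigning to an existing key replaces its entry rather than adding a second one.
pos["colorless"] = "ADV"
print(len(pos))                      # 2
```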

5.3.3 Defining Dictionaries

There are two ways to define a dictionary:

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

Note that dictionary keys must be immutable types, such as strings and tuples.
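The sketch below illustrates the rule with a tuple key and a list key. The (word, tag) keys and the stress notation in the values are informal examples chosen for illustration.

```python
# A tuple is immutable, so a (word, tag) pair can serve as a key.
stress = {("refuse", "VBP"): "re-FUSE", ("refuse", "NN"): "REF-use"}
print(stress[("refuse", "NN")])  # REF-use

# A list is mutable and unhashable, so using it as a key raises TypeError.
try:
    bad = {["refuse", "NN"]: "REF-use"}
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```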

5.3.4 Default Dictionaries

A defaultdict is a dictionary that supplies a default value for missing keys:

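from collections import defaultdict

frequency = defaultdict(int)       # missing keys default to int(), i.e. 0
frequency['colorless'] = 4
print(frequency['ideas'])          # 0
pos = defaultdict(list)            # missing keys default to an empty list
pos['sleep'] = ['NOUN', 'VERB']
print(pos['ideas'])                # []
print(list(pos.items()))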

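pos = defaultdict(lambda: 'NOUN')  # any zero-argument callable can supply the default
pos['colorless'] = 'ADJ'
pos['blog']                        # looking up a missing key inserts it with the default
print(list(pos.items()))           # [('colorless', 'ADJ'), ('blog', 'NOUN')]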
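One practical use of a lambda default is to collapse rare words into a single placeholder token. The sketch below is an illustration on an invented sentence; the threshold of two occurrences and the "UNK" placeholder are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

words = "the cat sat on the mat and the dog sat too".split()

# Keep only the words that occur at least twice (here: "the" and "sat").
vocab = Counter(words)
frequent = [w for (w, n) in vocab.most_common() if n >= 2]

# Every word outside the frequent list falls back to the default "UNK".
mapping = defaultdict(lambda: "UNK")
for w in frequent:
    mapping[w] = w

normalized = [mapping[w] for w in words]
print(normalized)
# ['the', 'UNK', 'sat', 'UNK', 'the', 'UNK', 'UNK', 'the', 'UNK', 'sat', 'UNK']
```

Note that each lookup of a rare word also stores it in mapping with the value "UNK", which is the insertion behaviour of defaultdict on missing keys.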
