Study notes on Natural Language Processing with Python (2nd edition, Steven Bird et al.): Chapter 6, Learning to Classify Text

Article link: https://blog.csdn.net/weixin_43935926/article/details/86513689

# -*- coding: utf-8 -*-

import nltk, re, pprint
from nltk import word_tokenize

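6.1 Supervised Classification

Classification is called supervised when it is built on a training corpus that contains the correct label for each input.

(a) During training, a feature extractor converts each input value into a feature set. These feature sets capture the basic information about each input that should be used to classify it. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor converts unseen inputs into feature sets. These feature sets are then fed into the model, which generates predicted labels.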
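
The two phases can be sketched in plain Python. This is only an illustration under stated assumptions: `last_letter_features`, `train` and `predict` are hypothetical names, and the "learning algorithm" here is a deliberately trivial lookup table (most frequent label per feature set), not any of the NLTK classifiers.

```python
from collections import Counter, defaultdict

def last_letter_features(word):
    """Hypothetical feature extractor: map an input to a feature set (a dict)."""
    return {'last_letter': word[-1].lower()}

def train(labeled_featuresets):
    """Toy learner: remember the most frequent label for each feature set."""
    counts = defaultdict(Counter)
    overall = Counter()
    for features, label in labeled_featuresets:
        key = tuple(sorted(features.items()))
        counts[key][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]   # fallback for unseen feature sets
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    return table, default

def predict(model, features):
    table, default = model
    return table.get(tuple(sorted(features.items())), default)

# (a) Training: inputs -> feature sets, paired with labels -> model
training_data = [('Anna', 'female'), ('Maria', 'female'), ('Mark', 'male'),
                 ('Frank', 'male'), ('Julia', 'female')]
model = train([(last_letter_features(name), label)
               for name, label in training_data])

# (b) Prediction: the same extractor is applied to unseen inputs
prediction_seen = predict(model, last_letter_features('Derek'))    # 'k' was seen
prediction_unseen = predict(model, last_letter_features('Boris'))  # 's' was not
print(prediction_seen, prediction_unseen)
```

The point of the sketch is that the feature extractor is shared between both phases; only the model changes when a different learning algorithm is used.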

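Gender Identification

Male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r and s are likely to be male. Let's build a classifier to model these differences more precisely.

# Define a feature extractor
def gender_features(word):
    '''
    The dictionary returned by this function is called a feature set;
    it maps feature names to their values.
    '''
    return {'last_letter': word[-1]}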

gender_features('Shrek')

{'last_letter': 'k'}

Prepare a list of examples and their corresponding class labels.

from nltk.corpus import names

import random

names = ([(name, 'male') for name in names.words('male.txt')] +
    [(name, 'female') for name in names.words('female.txt')])

random.shuffle(names)

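Next, we use the feature extractor to process the names data, and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new "naive Bayes" classifier.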

featuresets = [(gender_features(n), g) for (n,g) in names]

train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

classifier.classify(gender_features('Neo'))

'male'

classifier.classify(gender_features('Trinity'))

'female'

print(nltk.classify.accuracy(classifier, test_set))

0.762

Examine the classifier to determine which features are most effective for distinguishing the names' genders.

classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     33.4 : 1.0
             last_letter = 'k'              male : female =     32.2 : 1.0
             last_letter = 'v'              male : female =     17.5 : 1.0
             last_letter = 'f'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0

The function nltk.classify.apply_features returns an object that acts like a list but does not store all the feature sets in memory.

from nltk.classify import apply_features

train_set = apply_features(gender_features, names[500:])

test_set = apply_features(gender_features, names[:500])
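
To make the idea of a "list-like object that does not store every feature set" concrete, here is a minimal sketch. It is not NLTK's actual implementation; `LazyFeatureList` and `last_letter` are hypothetical names introduced only for illustration.

```python
class LazyFeatureList:
    """List-like view that computes (feature set, label) pairs on demand."""

    def __init__(self, feature_func, labeled_items):
        self._func = feature_func
        self._items = labeled_items   # only the raw (item, label) pairs are stored

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view over the sliced raw data
            return LazyFeatureList(self._func, self._items[index])
        item, label = self._items[index]
        return (self._func(item), label)   # computed only when requested

def last_letter(word):
    return {'last_letter': word[-1]}

labeled = [('Neo', 'male'), ('Trinity', 'female'), ('Shrek', 'male')]
lazy = LazyFeatureList(last_letter, labeled)
first = lazy[0]
tail = [pair for pair in lazy[1:]]
print(len(lazy), first, tail)
```

Because feature sets are rebuilt on every access, this trades some CPU time for memory, which matters mainly for large corpora.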

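Choosing the Right Features

Example 6-1. A feature extractor that overfits gender features. The feature sets returned by this extractor contain a large number of specific features, which leads to overfitting on the relatively small names corpus.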

def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

gender_features2('John')

{'count(a)': 0,
 'count(b)': 0,
 'count(c)': 0,
 'count(d)': 0,
 'count(e)': 0,
 'count(f)': 0,
 'count(g)': 0,
 'count(h)': 1,
 'count(i)': 0,
 'count(j)': 1,
 'count(k)': 0,
 'count(l)': 0,
 'count(m)': 0,
 'count(n)': 1,
 'count(o)': 1,
 'count(p)': 0,
 'count(q)': 0,
 'count(r)': 0,
 'count(s)': 0,
 'count(t)': 0,
 'count(u)': 0,
 'count(v)': 0,
 'count(w)': 0,
 'count(x)': 0,
 'count(y)': 0,
 'count(z)': 0,
 'firstletter': 'j',
 'has(a)': False,
 'has(b)': False,
 'has(c)': False,
 'has(d)': False,
 'has(e)': False,
 'has(f)': False,
 'has(g)': False,
 'has(h)': True,
 'has(i)': False,
 'has(j)': True,
 'has(k)': False,
 'has(l)': False,
 'has(m)': False,
 'has(n)': True,
 'has(o)': True,
 'has(p)': False,
 'has(q)': False,
 'has(r)': False,
 'has(s)': False,
 'has(t)': False,
 'has(u)': False,
 'has(v)': False,
 'has(w)': False,
 'has(x)': False,
 'has(y)': False,
 'has(z)': False,
 'lastletter': 'n'}

Overfitting is especially problematic when working with small training sets.

featuresets = [(gender_features2(n), g) for (n,g) in names]

train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.774

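Once an initial set of features has been chosen, a very productive way to refine it is error analysis. First, we select a development set containing the corpus data used for creating the model. This development set is then subdivided into a training set and a dev-test set.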

train_names = names[1500:]

devtest_names = names[500:1500]

test_names = names[:500]

train_set = [(gender_features(n), g) for (n,g) in train_names]

devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]

test_set = [(gender_features(n), g) for (n,g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))

0.753

Using the dev-test set, we can generate a list of the errors the classifier makes when predicting name genders.

errors = []

for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append( (tag, guess, name) )

for (tag, guess, name) in sorted(errors):
    print('correct={:<8} guess={:<8s} name={:<30}'.format(tag, guess, name))

correct=female   guess=male     name=Aigneis                       
correct=female   guess=male     name=Aileen                        
         
correct=male     guess=female   name=Yehudi                        
correct=male     guess=female   name=Zolly

# Adjust the feature extractor to include features for two-letter suffixes
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

train_set = [(gender_features(n), gender) for (n, gender) in train_names]

devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, devtest_set))

0.784

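This error analysis procedure can be repeated as often as needed.
Keep the test set separate and unused until the final evaluation.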
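
The train / dev-test / test split used above can be wrapped in a small helper. The following is a sketch under assumptions: `three_way_split` is a hypothetical function, and the sizes are arbitrary illustration values rather than the 500 / 1000 split used in these notes.

```python
import random

def three_way_split(items, n_test, n_devtest, seed=0):
    """Shuffle a copy of items and cut it into train, dev-test and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

items = list(range(100))
train, devtest, test = three_way_split(items, n_test=10, n_devtest=20)
print(len(train), len(devtest), len(test))
```

Error analysis is then done only on `devtest`, while `test` is touched once, at the very end.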

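Document Classification

from nltk.corpus import movie_reviews  # Movie Reviews corpus: each review is categorized as positive or negative

First, construct a list of documents labeled with the appropriate categories.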

documents = [(list(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

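Next, define a feature extractor for documents. For document topic identification, we can define a feature for each word that indicates whether the document contains that word.

Example 6-2. A feature extractor for document classification, whose features indicate whether or not individual words are present in a given document.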

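all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Use the 2000 most frequent words. (list(all_words)[:2000] is not ordered by
# frequency in NLTK 3, which is why the sample output below shows rare words.)
word_features = [word for (word, count) in all_words.most_common(2000)]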
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(document_features(movie_reviews.words('pos/cv957_8737.txt')))

{'contains(soothe)': False, 'contains(flunked)': False, 'contains(marvel)': False, 'contains(monomaniacal)': False, 'contains(competiton)': False, 'contains(trolley)': False, 'contains(gesture)': False, 'contains(bearings)': False, 'contains(langenkamp)': False, 'contains(leisure)': False, 'contains(outsmarting)': False, 'contains(control)': False, 'contains(hormonal)': False, 'contains(warped)': False, 'contains(husk)': False}

Example 6-3. Training and testing a classifier for document classification.

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.61

classifier.show_most_informative_features(5)

Most Informative Features
       contains(idiotic) = True              neg : pos    =      7.4 : 1.0
    contains(schumacher) = True              neg : pos    =      7.4 : 1.0
         contains(anger) = True              pos : neg    =      7.0 : 1.0
          contains(lore) = True              pos : neg    =      7.0 : 1.0
        contains(suvari) = True              neg : pos    =      7.0 : 1.0

Train a classifier to work out which suffixes are most informative for choosing a word's part-of-speech tag.

from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]

print(common_suffixes)

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of', 'the', 'y', 'r', 'to', 'in', 'f', 'o', 'ed', 'nd', 'is', 'on', 'l', 'g', 'and', 'ng', 'er', 'as', 'ing', 'h', 'at', 'es', 'or', 're', 'it', '``', 'an', "''", 'm', ';', 'i', 'ly', 'ion', 'en', 'al', '?', 'nt', 'be', 'hat', 'st', 'his', 'th', 'll', 'le', 'ce', 'by', 'ts', 'me', 've', "'", 'se', 'ut', 'was', 'for', 'ent', 'ch', 'k', 'w', 'ld', '`', 'rs', 'ted', 'ere', 'her', 'ne', 'ns', 'ith', 'ad', 'ry', ')', '(', 'te', '--', 'ay', 'ty', 'ot', 'p', 'nce', "'s", 'ter', 'om', 'ss', ':', 'we', 'are', 'c', 'ers', 'uld', 'had', 'so', 'ey']

Next, we define a feature extractor function that checks a given word for these suffixes:

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.DecisionTreeClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

classifier.classify(pos_features('cats'))

A nice property of decision tree models is that they are often fairly easy to interpret.

print(classifier.pseudocode(depth=4))

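Exploiting Context

Example 6-4. A part-of-speech classifier whose feature detector examines the context in which a word appears in order to determine which part-of-speech tag should be assigned. In particular, the previous word is included as a feature.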

def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
        "suffix(2)": sentence[i][-2:],
        "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

pos_features(brown.sents()[0], 8)

{'prev-word': 'an', 'suffix(1)': 'n', 'suffix(2)': 'on', 'suffix(3)': 'ion'}

tagged_sents = brown.tagged_sents(categories='news')

featuresets = []

for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append( (pos_features(untagged_sent, i), tag) )

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.7891596220785678

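Exploiting contextual features improves the performance of the part-of-speech tagger.

Sequence Classification

Example 6-5. Part-of-speech tagging with a consecutive classifier.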

def pos_features(sentence, i, history):
    features = {"suffix(1)": sentence[i][-1:],
        "suffix(2)": sentence[i][-2:],
        "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    return features
class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append( (featureset, tag) )
                history.append(tag)
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))  # in Python 3, zip() is lazy; return a real list

tagged_sents = brown.tagged_sents(categories='news')

size = int(len(tagged_sents) * 0.1)

train_sents, test_sents = tagged_sents[size:], tagged_sents[:size]

tagger = ConsecutivePosTagger(train_sents)

print(tagger.evaluate(test_sents))

0.7980528511821975

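Other Methods for Sequence Classification

Brill tagger: transformational joint classification works by creating an initial assignment of labels for the inputs, and then iteratively refining that assignment in an attempt to repair inconsistencies between related inputs.
Hidden Markov models: another approach is to assign scores to all of the possible sequences of part-of-speech tags and choose the sequence whose overall score is highest. A hidden Markov model looks at both the inputs and the history of predicted tags. It requires that the feature extractor only look at the most recent tag (or the most recent n tags, where n is fairly small). Because of this restriction, dynamic programming can be used to find the most likely tag sequence efficiently. In particular, for each consecutive word index i, a score is computed for each possible current and previous tag. The same basic approach is taken by two more advanced models, Maximum Entropy Markov Models and Linear-Chain Conditional Random Field Models, but they use different algorithms to score tag sequences.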
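
The dynamic programming idea can be sketched with the Viterbi algorithm, the standard way to find the best-scoring tag sequence when each score depends only on the previous tag. Everything below is an assumption made for illustration: the function name `viterbi`, the two-tag tag set, and all probability values are invented toy numbers, not estimates from the Brown corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (best tag sequence, its score) for a non-empty list of words."""
    # best[i][tag] = (score of the best sequence for words[:i+1] ending in tag,
    #                 previous tag on that sequence)
    best = [{tag: (start_p[tag] * emit_p[tag].get(words[0], 0.0), None)
             for tag in tags}]
    for i in range(1, len(words)):
        column = {}
        for tag in tags:
            # Only the previous tag matters, so one max per (position, tag) suffices
            prev_tag = max(tags,
                           key=lambda prev: best[i - 1][prev][0] * trans_p[prev][tag])
            score = (best[i - 1][prev_tag][0] * trans_p[prev_tag][tag]
                     * emit_p[tag].get(words[i], 0.0))
            column[tag] = (score, prev_tag)
        best.append(column)
    # Recover the sequence by following back-pointers from the best final tag
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

tags = ['N', 'V']
start_p = {'N': 0.8, 'V': 0.2}                      # P(first tag)
trans_p = {'N': {'N': 0.2, 'V': 0.8},               # P(tag | previous tag)
           'V': {'N': 0.7, 'V': 0.3}}
emit_p = {'N': {'fish': 0.8, 'sleep': 0.2},         # P(word | tag)
          'V': {'fish': 0.5, 'sleep': 0.5}}

best_path, best_score = viterbi(['fish', 'sleep'], tags, start_p, trans_p, emit_p)
print(best_path, best_score)
```

For a sentence of n words and t tags, this examines n × t × t combinations instead of scoring all t^n possible tag sequences.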

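6.2 Further Examples of Supervised Classification

Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features: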

import nltk

sents = nltk.corpus.treebank_raw.sents()

tokens = []   # tokens is the merged list of tokens from the individual sentences

boundaries = set()  # boundaries is a set containing the indexes of all sentence-boundary tokens

offset = 0

for sent in nltk.corpus.treebank_raw.sents():
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)

Next, we need to specify the features of the data that will be used to decide whether punctuation indicates a sentence boundary.

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
        'prev-word': tokens[i-1].lower(),
        'punct': tokens[i],
        'prev-word-is-one-char': len(tokens[i-1]) == 1}

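Based on this feature extractor, we can create a list of labeled feature sets by selecting all the punctuation tokens and tagging whether each one is a boundary token: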

featuresets = [(punct_features(tokens, i), (i in boundaries))
    for i in range(1, len(tokens)-1)
    if tokens[i] in '.?!']

Using these feature sets, we can train and evaluate a punctuation classifier.

size = int(len(featuresets) * 0.1)

train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)

nltk.classify.accuracy(classifier, test_set)

0.936026936026936

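To use this classifier for sentence segmentation, we simply check each punctuation mark to see whether it is labeled as a boundary, and divide the list of words at the boundary marks.

Example 6-6. Classification-based sentence segmenter.

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Skip the last token: punct_features needs words[i+1]
        if (word in '.?!' and i < len(words) - 1
                and classifier.classify(punct_features(words, i)) == True):
            sents.append(words[start:i+1])
            start = i+1
    # Whatever remains after the last detected boundary is the final sentence
    if start < len(words):
        sents.append(words[start:])
    return sents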

Identifying Dialogue Act Types

The first step is to extract the basic messaging data.

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

Next, we define a simple feature extractor that checks what words the post contains.

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

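Finally, we construct the training and testing data by applying the feature extractor to each post (using post.get('class') to get a post's dialogue act type), and create a new classifier.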

featuresets = [(dialogue_act_features(post.text), post.get('class'))
    for post in posts]

size = int(len(featuresets) * 0.1)

train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

0.668

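Recognizing Textual Entailment

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis".

RTE can be treated as a classification task in which we try to predict a True/False label for each pair. Although it seems likely that successful approaches to this task will involve a combination of parsing, semantics and real-world knowledge, many early attempts at RTE achieved reasonably good results with shallow analysis, based on similarity between the text and the hypothesis at the word level. In the ideal case, we would expect that if there is an entailment, then all the information expressed by the hypothesis should also be present in the text. Conversely, if there is information found in the hypothesis that is absent from the text, then there will be no entailment.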
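
The word-level intuition can be sketched without NLTK. This is a simplified stand-in, not the behaviour of any NLTK class: `STOPWORDS`, `bag_of_words` and `overlap_features` are hypothetical, the stopword list is a tiny illustrative sample, and the sentences are invented.

```python
STOPWORDS = {'a', 'an', 'the', 'is', 'of', 'in', 'to', 'was', 'and'}  # illustrative only

def bag_of_words(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    words = {w.strip('.,;:!?').lower() for w in text.split()}
    return words - STOPWORDS - {''}

def overlap_features(text, hypothesis):
    text_words = bag_of_words(text)
    hyp_words = bag_of_words(hypothesis)
    return {
        # Hypothesis words that are supported by the text
        'word_overlap': len(hyp_words & text_words),
        # Hypothesis words with no counterpart in the text (evidence against entailment)
        'word_hyp_extra': len(hyp_words - text_words),
    }

features = overlap_features('The cat sat on the mat in the kitchen.',
                            'A cat was in the garden.')
print(features)
```

Here "garden" appears only in the hypothesis, so `word_hyp_extra` is nonzero, which a classifier could learn to treat as a sign that the text does not entail the hypothesis.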

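Example 6-7. "Recognizing Textual Entailment" feature extractor.

The RTEFeatureExtractor class builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.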

def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]

extractor = nltk.RTEFeatureExtractor(rtepair)

print(extractor.text_words)

{'fledgling', 'representing', 'together', 'association', 'China', 'Soviet', 'was', 'Davudi', 'Parviz', 'at', 'former', 'that', 'Russia', 'Organisation', 'Shanghai', 'Co', 'four', 'binds', 'terrorism.', 'meeting', 'Asia', 'Iran', 'fight', 'central', 'operation', 'republics', 'SCO'}

print(extractor.hyp_words)

{'member', 'China', 'SCO.'}

print(extractor.overlap('word'))

set()

print(extractor.overlap('ne'))

{'China'}

print(extractor.hyp_extra('word'))

{'member'}

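Scaling Up to Large Datasets

When training classifiers with large amounts of training data or a large number of features, using NLTK's interfaces to external machine learning packages is significantly faster than the pure-Python classifier implementations.

6.3 Evaluation

The Test Set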

import random

from nltk.corpus import brown

tagged_sents = list(brown.tagged_sents(categories='news'))

random.shuffle(tagged_sents)

size = int(len(tagged_sents) * 0.1)

train_set, test_set = tagged_sents[size:], tagged_sents[:size]

Make sure that the training set and test set are taken from different files.

file_ids = brown.fileids(categories='news')

size = int(len(file_ids) * 0.1)

train_set = brown.tagged_sents(file_ids[size:])

test_set = brown.tagged_sents(file_ids[:size])

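If we want to perform a more stringent evaluation, we can draw the test set from documents that are less closely related to those in the training set.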

train_set = brown.tagged_sents(categories='news')

test_set = brown.tagged_sents(categories='fiction')

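If we build a classifier that performs well on this test set, then we can be confident that it has the power to generalize well beyond the data that it was trained on.

Accuracy

For example, a name gender classifier that predicts the correct name 60 times in a test set containing 80 names has an accuracy of 60/80 = 75%. The function nltk.classify.accuracy() calculates the accuracy of a classifier model on a given test set.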

#classifier = nltk.NaiveBayesClassifier.train(train_set)

#print('Accuracy: {:4.2f}'.format(nltk.classify.accuracy(classifier, test_set)))
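
The 60/80 example can be reproduced with a few lines of plain Python. The `accuracy` helper below is a hypothetical stand-in written for illustration, and the label lists are constructed so that exactly 60 of the 80 predictions match.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError('label lists must have the same length')
    correct = sum(1 for g, p in zip(gold_labels, predicted_labels) if g == p)
    return correct / len(gold_labels)

# 80 names: 30 + 30 predicted correctly, 10 + 10 predicted incorrectly
gold = ['male'] * 40 + ['female'] * 40
predicted = ['male'] * 30 + ['female'] * 10 + ['female'] * 30 + ['male'] * 10
score = accuracy(gold, predicted)
print('Accuracy: {:4.2f}'.format(score))
```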

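Precision and Recall

The number of items in each of four categories:

True positives are relevant items that we correctly identified as relevant.
True negatives are irrelevant items that we correctly identified as irrelevant.
False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.
Given these four numbers, we can define the following metrics:
Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
Recall, which indicates how many of the relevant items we identified, is TP/(TP+FN).
The F-Measure (or F-Score), which combines the precision and recall to give a single score, is defined to be the harmonic mean of the precision and recall: (2 × Precision × Recall)/(Precision + Recall).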
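
These definitions translate directly into set operations. The sketch below assumes the relevant items and the items the system identified are given as Python sets; `precision_recall_f` and the document ids are hypothetical.

```python
def precision_recall_f(relevant, retrieved):
    """Compute precision, recall and F-measure from two sets of item ids."""
    tp = len(relevant & retrieved)    # relevant and identified
    fp = len(retrieved - relevant)    # identified but not relevant
    fn = len(relevant - retrieved)    # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

relevant = {'d1', 'd2', 'd3', 'd4'}
retrieved = {'d1', 'd2', 'd5'}
precision, recall, f_measure = precision_recall_f(relevant, retrieved)
print(precision, recall, f_measure)
```

True negatives do not appear in any of the three formulas, which is why these metrics remain informative when irrelevant items vastly outnumber relevant ones.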

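Confusion Matrices

A confusion matrix is a table where each cell [i,j] indicates how often label j was predicted when the correct label was i. Thus, the diagonal entries (i.e., cells[i,i]) indicate labels that were correctly predicted, and the off-diagonal entries indicate errors.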

def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='editorial'))

#test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))

#cm = nltk.ConfusionMatrix(gold, test)
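
Since the lines above depend on a tagger that is not defined in these notes, here is a self-contained sketch of what a confusion matrix stores. It uses `collections.Counter` rather than `nltk.ConfusionMatrix`, and the tag lists are invented toy data.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Map (correct label, predicted label) pairs to how often they occur."""
    return Counter(zip(gold, predicted))

gold_tags = ['NN', 'NN', 'VB', 'JJ', 'NN', 'VB']
predicted_tags = ['NN', 'JJ', 'VB', 'JJ', 'NN', 'NN']
cells = confusion_counts(gold_tags, predicted_tags)

labels = sorted(set(gold_tags) | set(predicted_tags))
# Diagonal cells (i, i) are the correct predictions
correct = sum(cells[(label, label)] for label in labels)

# Print rows = correct label, columns = predicted label
print('     ' + ' '.join('{:>3}'.format(l) for l in labels))
for i in labels:
    print('{:>3} |'.format(i) + ' '.join('{:>3}'.format(cells[(i, j)]) for j in labels))
```

Reading along a row shows how a given correct tag was distributed over predictions, which makes systematic confusions (here NN predicted as JJ, VB predicted as NN) easy to spot.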

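Cross-Validation

Performing multiple evaluations on different test sets and then combining the scores from those evaluations is a technique known as cross-validation. In particular, we subdivide the original corpus into N subsets called folds.
For each of these folds, we train a model using all of the data except the data in that fold, and then test that model on the fold. Even though the individual folds might be too small to give accurate evaluation scores on their own, the combined evaluation score is based on a large amount of data, and is therefore quite reliable.

Another, equally important, advantage of using cross-validation is that it allows us to examine how widely the performance varies across different training sets. If we get very similar scores for all N training sets, then we can be fairly confident that the score is accurate. On the other hand, if scores vary widely across the N training sets, then we should probably be skeptical about the accuracy of the evaluation score.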
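
The fold construction can be sketched as follows. `k_fold_splits` and `mean_and_spread` are hypothetical helpers; training and scoring a real classifier on each split is left out, and the three scores at the end are made-up numbers used only to show how the fold scores would be combined.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    folds = [items[i::k] for i in range(k)]   # every k-th item goes to the same fold
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def mean_and_spread(scores):
    """Combined score plus the gap between the best and worst fold."""
    return sum(scores) / len(scores), max(scores) - min(scores)

data = list(range(10))
splits = list(k_fold_splits(data, 5))
for train_part, test_part in splits:
    print(sorted(train_part), test_part)

# Illustrative fold scores (invented): a small spread suggests a stable estimate
mean_score, spread = mean_and_spread([0.78, 0.80, 0.79])
```

If the data is ordered (for example by source file), it is usually worth shuffling it before splitting so that each fold is representative.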

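6.4 Decision Trees

Entropy and Information Gain

Entropy is defined as the negative of the sum of the probability of each label times the log probability of that same label:
(1) $H = -\sum_{l \in \text{labels}} P(l) \times \log_2 P(l)$

Example 6-8. Calculating the entropy of a list of labels.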

import math
import nltk

def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in nltk.FreqDist(labels)]
    return -sum([p * math.log(p,2) for p in probs])

print(entropy(['male', 'male', 'male', 'male']))

-0.0

print(entropy(['male', 'female', 'male', 'male']))

0.8112781244591328

print(entropy(['female', 'male', 'female', 'male']))

1.0

print(entropy(['female', 'female', 'male', 'female']))

0.8112781244591328

print(entropy(['female', 'female', 'female', 'female']))

-0.0

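Once we have calculated the entropy of the original set of input values' labels, we can determine how much more organized the labels become once we apply the decision stump. To do so, we calculate the entropy for each of the decision stump's leaves, and take the average of those leaf entropy values (weighted by the number of samples in each leaf). The information gain is then equal to the original entropy minus this new, reduced entropy. The higher the information gain, the better job the decision stump does of dividing the input values into coherent groups, so we can build decision trees by selecting the decision stumps with the highest information gain.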
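
The information gain computation described above can be sketched in a few lines. The `entropy` function here is re-implemented with `collections.Counter` so the block runs without NLTK; `information_gain` is a hypothetical helper, and each leaf is represented simply as the list of labels that ended up in it.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, leaves):
    """Original entropy minus the size-weighted average entropy of the leaves."""
    total = len(labels)
    weighted = sum(len(leaf) / total * entropy(leaf) for leaf in leaves)
    return entropy(labels) - weighted

labels = ['male', 'male', 'female', 'female']
# A stump that separates the labels perfectly recovers the full 1 bit
perfect = information_gain(labels, [['male', 'male'], ['female', 'female']])
# A stump whose leaves are as mixed as the original gains nothing
useless = information_gain(labels, [['male', 'female'], ['male', 'female']])
print(perfect, useless)
```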

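6.5 Naive Bayes Classifiers

In naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label. The label whose likelihood estimate is the highest is then assigned to the input value.

How are naive Bayes label likelihood scores calculated?

Naive Bayes begins by calculating the prior probability of each label, based on how frequently each label occurs in the training data. Every feature then contributes to the likelihood estimate for each label, by multiplying it by the probability that input values with that label will have that feature. The resulting likelihood score can be thought of as an estimate of the probability that a randomly selected value from the training set would have both the given label and the set of features, assuming that the feature probabilities are all independent.

Underlying Probabilistic Model

The naive Bayes assumption (or independence assumption)

Based on the independence assumption, we can calculate an expression for P(label|features):

$P(\text{label} \mid \text{features}) = P(\text{features}, \text{label}) / P(\text{features})$
$P(\text{features}) = \sum_{\text{label} \in \text{labels}} P(\text{features}, \text{label})$
$P(\text{features}, \text{label}) = P(\text{label}) \times P(\text{features} \mid \text{label})$
$P(\text{features}, \text{label}) = P(\text{label}) \times \prod_{f \in \text{features}} P(f \mid \text{label})$

P(label) is the prior probability for a given label, and each P(f|label) is the contribution of a single feature to the label likelihood.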
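
As a worked instance of these equations, the sketch below multiplies a prior by per-feature likelihoods and then normalizes. The function `naive_bayes_posterior` is hypothetical, and every probability in `prior` and `likelihood` is an invented number chosen for illustration, not an estimate from the names corpus.

```python
def naive_bayes_posterior(prior, likelihood, features):
    """Return P(label | features) for every label under the independence assumption."""
    joint = {}
    for label, p_label in prior.items():
        score = p_label                       # P(label)
        for f in features:
            score *= likelihood[label][f]     # times each P(f | label)
        joint[label] = score                  # P(features, label)
    p_features = sum(joint.values())          # P(features), summed over labels
    return {label: score / p_features for label, score in joint.items()}

prior = {'male': 0.4, 'female': 0.6}
likelihood = {
    'male':   {'last_letter=a': 0.02, 'first_letter=m': 0.10},
    'female': {'last_letter=a': 0.35, 'first_letter=m': 0.10},
}
posterior = naive_bayes_posterior(prior, likelihood,
                                  ['last_letter=a', 'first_letter=m'])
best_label = max(posterior, key=posterior.get)
print(posterior, best_label)
```

The feature `first_letter=m` has the same likelihood under both labels, so it cancels out in the normalization and the decision is driven entirely by the last letter and the prior.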

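Zero Counts and Smoothing

$P(f \mid \text{label}) = \text{count}(f, \text{label}) / \text{count}(\text{label})$

Although count(f,label)/count(label) is a good estimate for P(f|label) when count(f,label) is relatively high, this estimate becomes less reliable when count(f) becomes smaller. In particular, if a feature never occurs with a given label in the training set, the estimate is zero, which forces the whole product for that label to zero. Therefore, when building naive Bayes models, we usually employ more sophisticated techniques, known as smoothing techniques, for calculating P(f|label), the probability of a feature given a label.

Expected Likelihood Estimation
Heldout Estimation

The nltk.probability module provides support for a wide variety of smoothing techniques.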
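
To see what smoothing changes, the sketch below compares the unsmoothed estimate with add-0.5 smoothing, which is the usual formulation of expected likelihood estimation. The helpers `mle` and `add_gamma` are written from scratch for illustration and do not call nltk.probability; `n_values` (the number of distinct values the feature can take) and the counts are assumed toy numbers.

```python
def mle(count_f_label, count_label):
    """Unsmoothed estimate: count(f, label) / count(label)."""
    return count_f_label / count_label

def add_gamma(count_f_label, count_label, n_values, gamma=0.5):
    """Add gamma to every count; gamma=0.5 corresponds to expected likelihood estimation."""
    return (count_f_label + gamma) / (count_label + gamma * n_values)

# A binary feature observed with 10 training items of some label:
# the value True was never seen, the value False was seen 10 times.
unsmoothed_true = mle(0, 10)
smoothed_true = add_gamma(0, 10, n_values=2)
smoothed_false = add_gamma(10, 10, n_values=2)
print(unsmoothed_true, smoothed_true, smoothed_false)
```

The unseen value now gets a small nonzero probability, so a single unseen feature can no longer veto a label outright, and the smoothed estimates still sum to one.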

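Non-Binary Features

Binning
Regression

The Naivete of Independence

What happens if we ignore the independence assumption and use the naive Bayes classifier with features that are not independent? One problem that arises is that the classifier can end up "double-counting" the effect of highly correlated features, pushing the classifier closer to a given label than is justified.

The Cause of Double-Counting

The reason for the double-counting problem is that during training, feature contributions are computed separately; but when using the classifier to choose labels for new inputs, those feature contributions are combined. One solution, therefore, is to consider the possible interactions between feature contributions during training. We could then use those interactions to adjust the contributions that individual features make.

Rewrite the equation used to calculate the likelihood of a label, separating out the contribution made by each feature (or label):

$P(\text{features}, \text{label}) = w[\text{label}] \times \prod_{f \in \text{features}} w[f, \text{label}]$

We call w[label] and w[f, label] the parameters or weights of the model.
For naive Bayes, w[label] = P(label) and w[f, label] = P(f|label).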

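6.6 Maximum Entropy Classifiers

The maximum entropy classifier uses search techniques to find a set of parameters that will maximize the performance of the classifier. In particular, it looks for the set of parameters that maximizes the total likelihood of the training corpus, which is defined as:

$P(\text{features}) = \sum_{x \in \text{corpus}} P(\text{label}(x) \mid \text{features}(x))$

where P(label|features), the probability that an input whose features are features will have class label label, is defined as:

$P(\text{label} \mid \text{features}) = P(\text{label}, \text{features}) / \sum_{\text{label}} P(\text{label}, \text{features})$

The maximum entropy classifier chooses the model parameters using iterative optimization techniques, which initialize the model's parameters to random values and then repeatedly refine those parameters to bring them closer to the optimal solution.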

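The Maximum Entropy Model

Joint-features are properties of labeled values, whereas (simple) features are properties of unlabeled values. In literature that describes and discusses maximum entropy models, the term "features" often refers to joint-features; the term "contexts" refers to what we have been calling (simple) features.

$P(\text{input}, \text{label}) = \prod_{\text{joint-features}(\text{input}, \text{label})} w[\text{joint-feature}]$

Maximizing Entropy

Intuitively, entropy is a measure of how "disorganized" a set of labels is. In particular, if a single label dominates then entropy is low, but if the labels are more evenly distributed then entropy is high. In general, the Maximum Entropy principle states that, among the distributions that are consistent with what we know, we should choose the distribution whose entropy is highest.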

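Generative vs. Conditional Classifiers

An important difference between the naive Bayes classifier and the maximum entropy classifier concerns the types of questions they can be used to answer.

The naive Bayes classifier is an example of a generative classifier, which builds a model that predicts P(input, label), the joint probability of an (input, label) pair. As a result, generative models can be used to answer the following questions:

1. What is the most likely label for a given input?
2. How likely is a given label for a given input?
3. What is the most likely input value?
4. How likely is a given input value?
5. How likely is a given input value with a given label?
6. What is the most likely label for an input that might have one of two values (but we don't know which)?

The maximum entropy classifier, on the other hand, is an example of a conditional classifier. Conditional classifiers build models that predict P(label|input), the probability of a label given the input value. Thus, conditional models can still be used to answer questions 1 and 2. However, conditional models cannot be used to answer the remaining questions 3-6.

In general, generative models are strictly more powerful than conditional models, since we can calculate the conditional probability P(label|input) from the joint probability P(input, label), but not vice versa.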
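
The asymmetry can be checked on a toy joint distribution. The table `joint` below is invented for illustration (two inputs, two labels), and `marginal_input` and `conditional` are hypothetical helper names.

```python
# Toy joint distribution P(input, label); the four values sum to 1
joint = {('a', 'x'): 0.3, ('a', 'y'): 0.1,
         ('b', 'x'): 0.2, ('b', 'y'): 0.4}

def marginal_input(joint, inp):
    """P(input): sum the joint probability over all labels."""
    return sum(p for (i, _), p in joint.items() if i == inp)

def conditional(joint, label, inp):
    """P(label | input) = P(input, label) / P(input)."""
    return joint[(inp, label)] / marginal_input(joint, inp)

p_a = marginal_input(joint, 'a')          # how likely is input 'a'?
p_x_given_a = conditional(joint, 'x', 'a')
p_x_given_b = conditional(joint, 'x', 'b')
print(p_a, p_x_given_a, p_x_given_b)
```

Going the other way is not possible: knowing only P(x|a) and P(x|b) says nothing about how likely the inputs 'a' and 'b' themselves are, so the joint table cannot be rebuilt from the conditional model.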

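6.7 Modeling Linguistic Patterns

Descriptive vs. Explanatory Models

Descriptive models capture patterns in the data, but they don't provide any information about why the data contains those patterns. In contrast, explanatory models attempt to capture properties and relationships that cause the linguistic patterns.

Descriptive models provide information about correlations in the data, while explanatory models go further to postulate causal relationships.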

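6.8 Summary

Modeling the linguistic data found in corpora can help us to understand linguistic patterns, and can be used to make predictions about new language data.
Supervised classifiers use labeled training corpora to build models that predict the label of an input based on specific features of that input.
Supervised classifiers can perform a wide variety of NLP tasks, including document classification, part-of-speech tagging, sentence segmentation, dialogue act type identification, determining entailment relations, and many other tasks.
When training a supervised classifier, you should split your corpus into three datasets: a training set for building the classifier model, a dev-test set for helping select and tune the model's features, and a test set for evaluating the final model's performance.
When evaluating a supervised classifier, it is important that you use fresh data that was not included in the training or dev-test set. Otherwise, your evaluation results may be unrealistically optimistic.
Decision trees are automatically constructed tree-structured flowcharts that are used to assign labels to input values based on their features. Although they are easy to interpret, they are not very good at handling cases where feature values interact in determining the proper label.
In naive Bayes classifiers, each feature independently contributes to the decision of which label should be used. This allows feature values to interact, but can be problematic when two or more features are highly correlated with one another.
Maximum entropy classifiers use a basic model that is similar to the model used by naive Bayes; however, they employ iterative optimization to find the set of feature weights that maximizes the probability of the training set.
Most of the models that are automatically constructed from a corpus are descriptive: they let us know which features are relevant to a given pattern or construction, but they don't give any information about causal relationships between those features and patterns.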

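Acknowledgments
Natural Language Processing with Python ¹²³ ⁴, by Steven Bird, Ewan Klein & Edward Loper, is a very hands-on introductory text; the first edition appeared in 2009 and the second in 2015. These study notes draw on the editions above and extend some of the material with further study and exercises. I am sharing them here in the hope that they are useful. You are welcome to add me on WeChat (verification: NLP) to study and discuss together, and corrections are welcome.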

References

1. http://nltk.org/
2. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2009.
3. Bird (UK), Klein (UK) & Loper (US), Python自然语言处理 (Chinese edition of Natural Language Processing with Python), Southeast University Press, 2010.
4. Steven Bird, Ewan Klein & Edward Loper, Natural Language Processing with Python, 2015.