From http://www.nltk.org/book/ch06.html: Learning to Classify Text
Classification means choosing the correct label for a given input.
Classification comes in two kinds: supervised classification and unsupervised classification. This post covers supervised classification.
If every input in the training set is paired with its correct label, the resulting classifier is supervised. The process is shown in the figure:
A feature extractor turns each input into a set of features that capture the basic information needed for classification; these feature sets, together with their labels, are used to train the model. At prediction time, the same feature extractor is applied to the new input, and the model predicts its label.
1.1 Predicting Gender from Names
In English names, female names tend to end in letters such as a, e, and i, while male names more often end in letters such as k, o, and t. Here we write a classifier that labels a given name. The feature extractor:
def gender_features(word):
    return {'last_letter': word[-1]}
It returns a dictionary containing the last letter of the given name as the input's feature set.
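For example, given the definition above, the feature set is just a one-key dictionary:

```python
def gender_features(word):
    return {'last_letter': word[-1]}

# The extractor maps a name to a single feature: its final letter.
print(gender_features('Shrek'))  # {'last_letter': 'k'}
```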
Load the data:
from nltk.corpus import names
from nltk.classify import apply_features
import nltk
import random

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
If this raises an error, you may need to download the corpus first:
nltk.download('names')
labeled_names now holds the full labeled dataset.
Build the classifier:
featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
This takes each name's last letter as its feature, keeps the gender label, splits all the data into a training set and a test set, and trains a naive Bayes classifier on the training set.
Try two names that do not appear in the training set:
print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))
The test set can be used to measure the classifier's accuracy:
print(nltk.classify.accuracy(classifier, test_set))
This comes out around 0.77, varying slightly from run to run.
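For reference, nltk.classify.accuracy is simply the fraction of test items whose predicted label matches the gold label. A minimal hand-rolled equivalent (the stub classifier here is made up for illustration, not part of the original example):

```python
class VowelStub:
    """Hypothetical stand-in classifier: guesses 'female' for vowel endings."""
    def classify(self, feats):
        return 'female' if feats['last_letter'] in 'aei' else 'male'

def manual_accuracy(classifier, gold):
    # Fraction of (features, label) pairs the classifier labels correctly.
    correct = sum(1 for feats, label in gold
                  if classifier.classify(feats) == label)
    return correct / len(gold)

gold = [({'last_letter': 'a'}, 'female'),
        ({'last_letter': 'k'}, 'male'),
        ({'last_letter': 'e'}, 'male')]
print(manual_accuracy(VowelStub(), gold))  # 2 of 3 correct
```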
Finally, we can list the features that were most informative for deciding gender:
classifier.show_most_informative_features(5)
The output (each ratio says how much more likely one label is than the other given that feature; for example, names ending in 'a' were about 35.6 times more likely to be female than male in this training data):
Most Informative Features
last_letter = 'a' female : male = 35.6 : 1.0
last_letter = 'k' male : female = 30.3 : 1.0
last_letter = 'f' male : female = 16.6 : 1.0
last_letter = 'v' male : female = 15.3 : 1.0
last_letter = 'p' male : female = 11.9 : 1.0
The complete code:
from nltk.corpus import names
from nltk.classify import apply_features
import nltk
import random

def gender_features(word):
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]
# train_set, test_set = featuresets[500:], featuresets[:500]
train_set = apply_features(gender_features, labeled_names[500:])
test_set = apply_features(gender_features, labeled_names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features('Neo')))
print(classifier.classify(gender_features('Trinity')))
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)
The commented-out lines do the same job as apply_features, except that apply_features returns a lazy, list-like object instead of building every feature set in memory at once, so it uses less memory on large corpora.
You can change the return value of gender_features() to see how different features affect the classifier's accuracy.
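For instance, one variant (my own illustration, not from the original tutorial) also looks at the first letter:

```python
def gender_features_v2(word):
    # Hypothetical alternative extractor: first and last letter together.
    word = word.lower()
    return {'first_letter': word[0], 'last_letter': word[-1]}

print(gender_features_v2('Neo'))  # {'first_letter': 'n', 'last_letter': 'o'}
```

Retraining the classifier with this extractor in place of gender_features and comparing accuracy on the same test set shows whether the extra feature helps.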
1.2 Choosing the Right Features
Selecting relevant features and encoding them in a form the learning algorithm can use has a major impact on the algorithm's performance. Feature extractors are usually built by repeated trial and error, keeping the features that prove useful. With too many features the model can overfit: it becomes too closely tuned to the training set and performs poorly on other data. Once an initial feature set is chosen, error analysis is an effective way to refine it: besides the test set, split the remaining data into a training set and a dev-test set, train the model on the training set, then evaluate on the dev-test set and inspect the errors it makes.
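A minimal sketch of that three-way split (the 500/1000 slice sizes follow the NLTK book; the stand-in list here replaces the real names corpus):

```python
import random

random.seed(0)
# Stand-in for the labeled_names list built from the names corpus.
labeled_names = [('Name%d' % i, random.choice(['male', 'female']))
                 for i in range(2500)]
random.shuffle(labeled_names)

test_names = labeled_names[:500]         # held out for final evaluation only
devtest_names = labeled_names[500:1500]  # used for error analysis
train_names = labeled_names[1500:]       # used for training

print(len(train_names), len(devtest_names), len(test_names))  # 1000 1000 500
```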
For the name-classification example above, after choosing the train, dev-test, and test sets, train on the training set, measure accuracy on the dev-test set, and print the misclassified names:
errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
The errors show that the suffix formed by a name's last two letters also matters, so we update the feature extractor:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}
Running again shows that accuracy improves.
1.3 Document Classification
Using the movie_reviews data shipped with NLTK, build a dataset pairing each file's word list with its category:
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
The feature extractor checks whether a document contains each of the most frequent words in the whole corpus:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())  # FreqDist counts how often each word occurs
word_features = list(all_words)[:2000]  # the 2000 most frequent words

# Checks whether each of these words is present in a given document.
def document_features(document):
    # We build a set of the document's words rather than testing `word in document`
    # directly, because membership tests on a set are much faster than on a list.
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features
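As a quick self-contained check (word_features here is a tiny hand-picked list, not the real 2000 most frequent movie-review words):

```python
word_features = ['plot', 'boring', 'great']  # toy vocabulary for illustration

def document_features(document):
    document_words = set(document)  # set membership tests are fast
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

print(document_features(['a', 'great', 'plot']))
# {'contains(plot)': True, 'contains(boring)': False, 'contains(great)': True}
```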
Split into training and test sets, then train the model:
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set = featuresets[100:]
test_set = featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))