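Text Classification
1. Supervised classification
Let's start with a classic diagram.
(1) Gender identification
We run a feature extractor over the names data and split the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier. We then try the classifier on some names that did not appear in the training data (Neo and Trinity from The Matrix):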
>>> import nltk
>>> import random
>>> from nltk.corpus import names
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
...
>>> # Note: this rebinds `names` from the corpus reader to a plain list of (name, label) pairs.
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> f = [(gender_features(n), g) for (n, g) in names]
>>> trainset, testset = f[500:], f[:500]   # hold out the first 500 names for testing
>>> c = nltk.NaiveBayesClassifier.train(trainset)
>>> c.classify(gender_features('Neo'))
'male'
>>> c.classify(gender_features('Trinity'))
'female'
>>> print(nltk.classify.accuracy(c, testset))
0.76
>>> c.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     34.4 : 1.0
             last_letter = u'k'             male : female =     29.9 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.5 : 1.0
>>> print(nltk.classify.accuracy(c, trainset))
0.763030628694
>>> # `names` is now a list, so it has to be indexed with [], not called:
>>> print(names(0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' object is not callable
>>> print(names[0])
(u'Kourtney', 'female')
>>> print(names[1])
(u'Mariellen', 'female')
>>> print(names[100])
(u'Effie', 'female')
>>> print(names[500])
(u'Kalindi', 'female')
>>> print(names[300])
(u'Loraine', 'female')
>>> print(names[30])
(u'Munroe', 'male')
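To make the ratios reported by show_most_informative_features a little less of a black box, here is a rough sketch of how such a ratio can be computed by hand: for each last letter, estimate P(letter | label) for both labels and divide the larger estimate by the smaller one. The helper name likelihood_ratios, the toy data and the add-0.5 smoothing are my own assumptions for illustration; NLTK's exact estimator may differ. The code is written for Python 3.

```python
from collections import Counter

def likelihood_ratios(labeled_names, smoothing=0.5):
    """Return {letter: (favoured_label, ratio)} for the last-letter feature.

    Hypothetical helper, not part of NLTK. `smoothing` is an assumed
    additive constant so that unseen letters do not give a zero probability.
    """
    counts = {"male": Counter(), "female": Counter()}
    for name, label in labeled_names:
        counts[label][name[-1].lower()] += 1

    letters = set(counts["male"]) | set(counts["female"])
    totals = {label: sum(c.values()) for label, c in counts.items()}

    ratios = {}
    for letter in letters:
        # Smoothed estimate of P(last_letter = letter | label) for each label.
        p = {
            label: (counts[label][letter] + smoothing)
                   / (totals[label] + smoothing * len(letters))
            for label in counts
        }
        favoured = max(p, key=p.get)
        other = "male" if favoured == "female" else "female"
        ratios[letter] = (favoured, p[favoured] / p[other])
    return ratios

# Tiny made-up sample, only to exercise the function.
toy = [("Anna", "female"), ("Maria", "female"), ("Laura", "female"),
       ("Mark", "male"), ("Jack", "male"), ("Luca", "male")]
ratios = likelihood_ratios(toy)

for letter, (label, ratio) in sorted(ratios.items(), key=lambda kv: -kv[1][1]):
    print("last_letter = %r  favours %-6s  %.1f : 1.0" % (letter, label, ratio))
```

A large ratio means the letter is much more likely under one label than under the other, which is why a and k top the list in the real output above.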
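(2) Choosing the right features
Start with every feature you can think of intuitively, then use trial and error together with error analysis to check which features are actually useful.
The number of features you should use with a given learning algorithm is limited: if you provide too many, the algorithm becomes highly dependent on idiosyncrasies of your training data and does not generalize well to new examples. This problem is known as overfitting, and it is especially troublesome when working with small training sets. The book's overfitting example is shown below. (Note that in the book this example scores an accuracy of 0.748, the point being that more features lead to overfitting. In my run the accuracy actually went up a little. Given the extra computation that the additional features cost, the gain may still not be worth it, though the dataset is too small to verify this.)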
>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
...
>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}
>>> featuresets2 = [(gender_features2(n), g) for (n,g) in names]
>>> train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
>>> classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
>>> print(nltk.classify.accuracy(classifier2, test_set2))
0.782
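A common way to check for overfitting is to compare accuracy on the training data with accuracy on held-out data: a large gap suggests the model has latched onto quirks of the training set. The sketch below illustrates the idea with a deliberately extreme "classifier" that simply memorizes the training names. The helpers accuracy and make_memorizer and the toy data are hypothetical, and the code assumes Python 3.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (input, label) pairs for which predict(input) == label."""
    correct = sum(1 for x, y in labeled_examples if predict(x) == y)
    return correct / len(labeled_examples)

def make_memorizer(train, default):
    """An extreme over-fitter: looks names up verbatim, else guesses `default`."""
    table = dict(train)
    return lambda name: table.get(name, default)

# Made-up data, only for illustration.
train = [("Anna", "female"), ("Mark", "male"),
         ("Laura", "female"), ("Jack", "male")]
held_out = [("Maria", "female"), ("Luca", "male"),
            ("Erik", "male"), ("Nina", "female")]

memorizer = make_memorizer(train, default="female")
train_acc = accuracy(memorizer, train)        # perfect on what it has seen
held_out_acc = accuracy(memorizer, held_out)  # much worse on unseen names
gap = train_acc - held_out_acc

print("train=%.2f held-out=%.2f gap=%.2f" % (train_acc, held_out_acc, gap))
```

For comparison, the one-feature classifier in part (1) scored 0.763 on its training set and 0.76 on the test set, so its gap is negligible.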
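(3) Error analysis
Once an initial set of features has been chosen, a very productive way to refine it is error analysis. First, we select a development set containing the corpus data used to create the model. This development set is then subdivided into a training set and a dev-test set. The first 500 names are set aside as the test set.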
>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n,g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.77
Using the dev-test set, we can then generate a list of the errors the classifier makes when predicting name genders.
>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append((tag, guess, name))
...
>>> for (tag, guess, name) in sorted(errors):
...     print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
...
correct=female guess=male name=Aileen
correct=female guess=male name=Alexis
correct=female guess=male name=Allsun
correct=female guess=male name=Alyss
correct=female guess=male name=Amber
correct=female guess=male name=Anabel
correct=female guess=male name=Anett
correct=female guess=male name=Arden
correct=female guess=male name=Ariel
correct=female guess=male name=Barb
correct=female guess=male name=Blondell
correct=female guess=male name=Brear
correct=female guess=male name=Brett
correct=female guess=male name=Bridget
correct=female guess=male name=Brier
correct=female guess=male name=Brook
correct=female guess=male name=Carmon
correct=female guess=male name=Caro
correct=female guess=male name=Carolan
correct=female guess=male name=Carolyn
correct=female guess=male name=Carolynn
correct=female guess=male name=Cathrin
correct=female guess=male name=Cherlyn
correct=female guess=male name=Clio
correct=female guess=male name=Daniel
correct=female guess=male name=Deb
correct=female guess=male name=Demeter
correct=female guess=male name=Devon
correct=female guess=male name=Dido
correct=female guess=male name=Doralynn
correct=female guess=male name=Doreen
correct=female guess=male name=Dyann
correct=female guess=male name=Eilis
correct=female guess=male name=Emlynn
correct=female guess=male name=Eran
correct=female guess=male name=Ester
correct=female guess=male name=Ethel
correct=female guess=male name=Faun
correct=female guess=male name=Felicdad
correct=female guess=male name=Flor
correct=female guess=male name=Gabriel
correct=female guess=male name=Garland
correct=female guess=male name=Gates
correct=female guess=male name=Gill
correct=female guess=male name=Glyn
correct=female guess=male name=Glynnis
correct=female guess=male name=Gredel
correct=female guess=male name=Harriot
correct=female guess=male name=Hildegaard
correct=female guess=male name=Ingaberg
correct=female guess=male name=Isabel
correct=female guess=male name=Izabel
correct=female guess=male name=Jacquelin
correct=female guess=male name=Jannel
correct=female guess=male name=Jazmin
correct=female guess=male name=Jo-Ann
correct=female guess=male name=Jonell
correct=female guess=male name=Karyl
correct=female guess=male name=Katheryn
correct=female guess=male name=Katleen
correct=female guess=male name=Kellyann
correct=female guess=male name=Keriann
correct=female guess=male name=Kial
correct=female guess=male name=Koo
correct=female guess=male name=Kristal
correct=female guess=male name=Kylynn
correct=female guess=male name=Leanor
correct=female guess=male name=Lilas
correct=female guess=male name=Lilias
correct=female guess=male name=Lind
correct=female guess=male name=Linnell
correct=female guess=male name=Lorain
correct=female guess=male name=Mab
correct=female guess=male name=Mag
correct=female guess=male name=Magdalen
correct=female guess=male name=Mair
correct=female guess=male name=Marilyn
correct=female guess=male name=Marion
correct=female guess=male name=Maryann
correct=female guess=male name=Meaghan
correct=female guess=male name=Merilyn
correct=female guess=male name=Merl
correct=female guess=male name=Michal
correct=female guess=male name=Millisent
correct=female guess=male name=Moll
correct=female guess=male name=Nert
correct=female guess=male name=Nichol
correct=female guess=male name=Peg
correct=female guess=male name=Phil
correct=female guess=male name=Philis
correct=female guess=male name=Pier
correct=female guess=male name=Rahal
correct=female guess=male name=Raquel
correct=female guess=male name=Rayshell
correct=female guess=male name=Rhianon
correct=female guess=male name=Roselin
correct=female guess=male name=Shannon
correct=female guess=male name=Sharl
correct=female guess=male name=Shaun
correct=female guess=male name=Sheilakathryn
correct=female guess=male name=Sheril
correct=female guess=male name=Sherill
correct=female guess=male name=Sioux
correct=female guess=male name=Star
correct=female guess=male name=Stoddard
correct=female guess=male name=Theo
correct=female guess=male name=Wallis
correct=female guess=male name=Wileen
correct=female guess=male name=Yoko
correct=male guess=female name=Abbey
correct=male guess=female name=Aguste
correct=male guess=female name=Ali
correct=male guess=female name=Anatole
correct=male guess=female name=Andri
correct=male guess=female name=Arie
correct=male guess=female name=Ash
correct=male guess=female name=Ashby
correct=male guess=female name=Ashley
correct=male guess=female name=Avery
correct=male guess=female name=Baillie
correct=male guess=female name=Barde
correct=male guess=female name=Barney
correct=male guess=female name=Barnie
correct=male guess=female name=Benny
correct=male guess=female name=Bertie
correct=male guess=female name=Billy
correct=male guess=female name=Bjorne
correct=male guess=female name=Carlie
correct=male guess=female name=Chance
correct=male guess=female name=Chaunce
correct=male guess=female name=Christoph
correct=male guess=female name=Claire
correct=male guess=female name=Clare
correct=male guess=female name=Claude
correct=male guess=female name=Conway
correct=male guess=female name=Curtice
correct=male guess=female name=Davide
correct=male guess=female name=Davie
correct=male guess=female name=Dewey
correct=male guess=female name=Dickie
correct=male guess=female name=Dominique
correct=male guess=female name=Donnie
correct=male guess=female name=Dougie
correct=male guess=female name=Doyle
correct=male guess=female name=Dudley
correct=male guess=female name=Duffie
correct=male guess=female name=Dwayne
correct=male guess=female name=Emmy
correct=male guess=female name=Eugene
correct=male guess=female name=Ezra
correct=male guess=female name=Felipe
correct=male guess=female name=Garth
correct=male guess=female name=Gerome
correct=male guess=female name=Gerry
correct=male guess=female name=Graeme
correct=male guess=female name=Grove
correct=male guess=female name=Guillaume
correct=male guess=female name=Hadley
correct=male guess=female name=Harry
correct=male guess=female name=Hartley
correct=male guess=female name=Hercule
correct=male guess=female name=Jay
correct=male guess=female name=Jedediah
correct=male guess=female name=Jeramie
correct=male guess=female name=Jeremiah
correct=male guess=female name=Jody
correct=male guess=female name=Keefe
correct=male guess=female name=Kennedy
correct=male guess=female name=Lance
correct=male guess=female name=Lawrence
correct=male guess=female name=Locke
correct=male guess=female name=Lorrie
correct=male guess=female name=Luce
correct=male guess=female name=Marlowe
correct=male guess=female name=Matty
correct=male guess=female name=Maurise
correct=male guess=female name=Meredeth
correct=male guess=female name=Mitch
correct=male guess=female name=Mordecai
correct=male guess=female name=Morty
correct=male guess=female name=Noah
correct=male guess=female name=Noe
correct=male guess=female name=Paddie
correct=male guess=female name=Pearce
correct=male guess=female name=Pierce
correct=male guess=female name=Quincy
correct=male guess=female name=Radcliffe
correct=male guess=female name=Rafe
correct=male guess=female name=Ravi
correct=male guess=female name=Ray
correct=male guess=female name=Rene
correct=male guess=female name=Rodolphe
correct=male guess=female name=Rolfe
correct=male guess=female name=Rourke
correct=male guess=female name=Ruddie
correct=male guess=female name=Rusty
correct=male guess=female name=Sawyere
correct=male guess=female name=Sergei
correct=male guess=female name=Seth
correct=male guess=female name=Sheffie
correct=male guess=female name=Sherlocke
correct=male guess=female name=Shorty
correct=male guess=female name=Slade
correct=male guess=female name=Smith
correct=male guess=female name=Stearne
correct=male guess=female name=Steve
correct=male guess=female name=Stevy
correct=male guess=female name=Tanny
correct=male guess=female name=Temple
correct=male guess=female name=Terrence
correct=male guess=female name=Thorny
correct=male guess=female name=Trace
correct=male guess=female name=Troy
correct=male guess=female name=Tulley
correct=male guess=female name=Ty
correct=male guess=female name=Ulrich
correct=male guess=female name=Valentine
correct=male guess=female name=Vance
correct=male guess=female name=Vassily
correct=male guess=female name=Verge
correct=male guess=female name=Vinny
correct=male guess=female name=Vite
correct=male guess=female name=Wallace
correct=male guess=female name=Wayne
correct=male guess=female name=Willie
correct=male guess=female name=Yance
correct=male guess=female name=Yule
correct=male guess=female name=Zachary
correct=male guess=female name=Zary
correct=male guess=female name=Zollie
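Scanning a list this long by eye is tedious. One possible shortcut is to tally the misclassified names by suffix and correct label, so that recurring endings stand out. The helper suffix_error_counts below is hypothetical; it expects the same (correct, guess, name) tuples as the errors list, and the sample is a handful of names copied from the output above. The code assumes Python 3.

```python
from collections import Counter

def suffix_error_counts(errors, suffix_len=2):
    """Count errors by (suffix, correct_label).

    `errors` is an iterable of (correct_label, guessed_label, name) tuples.
    """
    counts = Counter()
    for correct, _guess, name in errors:
        counts[(name[-suffix_len:].lower(), correct)] += 1
    return counts

# A few entries taken from the error list above.
sample_errors = [
    ("female", "male", "Carolyn"),
    ("female", "male", "Marilyn"),
    ("female", "male", "Cherlyn"),
    ("male", "female", "Mitch"),
    ("male", "female", "Ulrich"),
    ("male", "female", "Steve"),
]

counts = suffix_error_counts(sample_errors)
for (suffix, correct), n in counts.most_common():
    print("suffix=%-3s correct=%-6s errors=%d" % (suffix, correct, n))
```

Run on the full errors list, a tally like this makes frequent endings easy to spot without reading every line.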
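Looking through the errors reveals some patterns.
For example, names ending in yn appear to be predominantly female, even though names ending in n tend to be male; and names ending in ch are usually male, even though names ending in h tend to be female. We therefore adjust the feature extractor to include two-letter suffix features. (Note that the gain here is tiny: dev-test accuracy goes from 0.77 to 0.771, whereas in the book's example it rose by about 2 points to 0.78. The result is also below the 0.782 we got earlier by adding many more features, although that figure was measured on a different split, so the comparison is only rough. Perhaps the Bayes implementation in NLTK 3.0 has been improved so that more features simply work better?)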
>>> def gender_features(word):
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:]}
...
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, devtest_set))
0.771
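This error analysis procedure can be repeated, checking for patterns in the errors made by the newly improved classifier. Each time it is repeated, we should select a different dev-test/training split, to ensure that the classifier does not start to reflect idiosyncrasies of the dev-test set.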
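One way to draw a fresh dev-test/training split for each round of error analysis, while leaving the test set untouched, might look like the sketch below. The function name resplit_development and its parameters are hypothetical, not part of NLTK; the seed is only there to make a given round reproducible.

```python
import random

def resplit_development(dev_items, n_devtest, seed):
    """Shuffle the development data and cut it into (train, devtest).

    The test set must NOT be included in `dev_items`; it stays unused
    until model development is finished.
    """
    rng = random.Random(seed)      # local RNG, does not disturb global state
    shuffled = list(dev_items)     # copy, so the caller's list is unchanged
    rng.shuffle(shuffled)
    return shuffled[n_devtest:], shuffled[:n_devtest]

# Stand-in for the development names; integers keep the example small.
dev_items = list(range(100))

train_round1, devtest_round1 = resplit_development(dev_items, n_devtest=20, seed=1)
train_round2, devtest_round2 = resplit_development(dev_items, n_devtest=20, seed=2)

print(len(train_round1), len(devtest_round1))
```

In the session above, dev_items would correspond to names[500:], that is, everything except the 500 held-out test names.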
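However, once we have used the dev-test set to help develop the model, we can no longer trust it to give an accurate idea of how well the model would perform on new data. It is therefore important to keep the test set separate and unused until model development is complete. At that point, we can use the test set to evaluate how well the model performs on new input values. (Unfortunately, accuracy on the test set comes out at 0.62, far below the initial 0.76. This most likely does not mean the new feature is bad, though: test_set was built earlier with the old last_letter extractor and was not rebuilt after gender_features was redefined, so the classifier sees none of the features it was trained on and effectively falls back on the class prior; roughly 63% of the names in the corpus are female.)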
>>> print(nltk.classify.accuracy(classifier, test_set))
0.62
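A mismatch between the features a classifier was trained on and the features in an evaluation set is easy to miss, because nothing raises an error. In this session test_set dates from before gender_features was redefined, so it still carries only last_letter, while the retrained classifier knows only suffix1 and suffix2. A small sanity check along the following lines could catch that. The helpers feature_names and shares_features are hypothetical, and old_features / new_features simply mirror the two versions of gender_features used above.

```python
def feature_names(featuresets):
    """Collect every feature name used in a list of (features, label) pairs."""
    names = set()
    for features, _label in featuresets:
        names.update(features)
    return names

def shares_features(train_set, eval_set):
    """True if the evaluation set uses at least one feature seen in training."""
    return bool(feature_names(train_set) & feature_names(eval_set))

def old_features(word):   # the extractor from part (1)
    return {'last_letter': word[-1]}

def new_features(word):   # the extractor with one- and two-letter suffixes
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

labeled = [("Carolyn", "female"), ("Ulrich", "male")]
train_set = [(new_features(n), g) for (n, g) in labeled]
stale_test_set = [(old_features(n), g) for (n, g) in labeled]
fresh_test_set = [(new_features(n), g) for (n, g) in labeled]

print(shares_features(train_set, stale_test_set))
print(shares_features(train_set, fresh_test_set))
```

In the session, rebuilding test_set as [(gender_features(n), g) for (n, g) in test_names] before calling nltk.classify.accuracy should give a like-for-like number. I have not rerun it, so no figure is quoted here.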