NLTK Study Notes (4)

Text Classification

1. Supervised Classification

Let's start with a classic diagram. [figure missing from the source]

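(1) Gender Identification

We use a feature extractor to process the names data and split the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier. We then test it on some names that did not appear in the training data (Neo and Trinity from The Matrix):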

>>> import nltk
>>> import random
>>> from nltk.corpus import names
>>> def gender_features(word):
...     return {'last_letter': word[-1]}
... 
>>> # Note: this rebinds `names` from the corpus reader to a list of (name, label) pairs
>>> names = ([(name, 'male') for name in names.words('male.txt')] +
...          [(name, 'female') for name in names.words('female.txt')])
>>> random.shuffle(names)
>>> # f holds the feature sets; the first 500 are held out as the test set
>>> f = [(gender_features(n), g) for (n, g) in names]
>>> trainset, testset = f[500:], f[:500]
>>> c = nltk.NaiveBayesClassifier.train(trainset)
>>> c.classify(gender_features('Neo'))
'male'
>>> c.classify(gender_features('Trinity'))
'female'

>>> print nltk.classify.accuracy(c,testset)
0.76
>>> c.show_most_informative_features(5)
Most Informative Features
             last_letter = u'a'           female : male   =     34.4 : 1.0
             last_letter = u'k'             male : female =     29.9 : 1.0
             last_letter = u'f'             male : female =     16.7 : 1.0
             last_letter = u'p'             male : female =     11.9 : 1.0
             last_letter = u'v'             male : female =     10.5 : 1.0
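
The ratios in this table can be read as likelihood ratios: how much more likely a feature value is under one label than under the other. The sketch below reproduces the idea in plain Python 3 on a handful of made-up (name, label) pairs. The toy data, the helper names `last_letter` and `likelihood_ratio`, and the add-0.5 smoothing are assumptions for illustration; this is not NLTK's own code or corpus.

```python
from collections import Counter

# Toy labeled names (made up for illustration; not the NLTK names corpus).
toy_names = [
    ("Anna", "female"), ("Maria", "female"), ("Linda", "female"),
    ("Emma", "female"), ("Karen", "female"),
    ("Mark", "male"), ("Frank", "male"), ("Erik", "male"),
    ("Joshua", "male"), ("John", "male"),
]

def last_letter(name):
    """Hypothetical helper: the single feature used in this section."""
    return name[-1].lower()

def likelihood_ratio(data, value, label_a, label_b, smoothing=0.5):
    """Return P(value | label_a) / P(value | label_b) with additive smoothing."""
    counts = {label_a: Counter(), label_b: Counter()}
    for name, label in data:
        if label in counts:
            counts[label][last_letter(name)] += 1
    # Number of distinct feature values seen anywhere, used as smoothing bins.
    bins = len({last_letter(name) for name, _ in data})

    def prob(label):
        total = sum(counts[label].values())
        return (counts[label][value] + smoothing) / (total + smoothing * bins)

    return prob(label_a) / prob(label_b)

print("a: female : male = %.1f : 1.0" % likelihood_ratio(toy_names, "a", "female", "male"))
print("k: male : female = %.1f : 1.0" % likelihood_ratio(toy_names, "k", "male", "female"))
```

On this toy data a final "a" is three times as likely for a female name and a final "k" seven times as likely for a male name, which is the same kind of statement the table makes for the real corpus.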

>>> # Accuracy on the training set, for comparison with the test-set figure above
>>> print nltk.classify.accuracy(c,trainset)
0.763030628694
>>> # Inspect a few of the shuffled (name, label) pairs
>>> print names[0]
(u'Kourtney', 'female')
>>> print names[1]
(u'Mariellen', 'female')
>>> print names[100]
(u'Effie', 'female')
>>> print names[500]
(u'Kalindi', 'female')
>>> print names[300]
(u'Loraine', 'female')
>>> print names[30]
(u'Munroe', 'male')

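(2) Choosing the Right Features

Start with every feature your intuition suggests, then use trial and error and error analysis to check which features are actually useful.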

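The number of features you should use with a given learning algorithm is limited: if you supply too many, the algorithm will rely heavily on idiosyncrasies of your training data and will not generalize well to new examples. This problem is called overfitting, and it is especially troublesome with small training sets. The book's overfitting example is shown below. (Note that the book reports an accuracy of 0.748 for this example, the point being that more features lead to overfitting. In my run the accuracy actually rose slightly, to 0.782. Even so, given the extra computation the added features require, the gain may not be worth it, though the dataset is too small to verify this.)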

>>> def gender_features2(name):
...     features = {}
...     features["firstletter"] = name[0].lower()
...     features["lastletter"] = name[-1].lower()
...     for letter in 'abcdefghijklmnopqrstuvwxyz':
...         features["count(%s)" % letter] = name.lower().count(letter)
...         features["has(%s)" % letter] = (letter in name.lower())
...     return features
... 

>>> gender_features2('John')
{'count(u)': 0, 'has(d)': False, 'count(b)': 0, 'count(w)': 0, 'has(b)': False, 'count(l)': 0, 'count(q)': 0, 'count(n)': 1, 'has(j)': True, 'count(s)': 0, 'count(h)': 1, 'has(h)': True, 'has(y)': False, 'count(j)': 1, 'has(f)': False, 'has(o)': True, 'count(x)': 0, 'has(m)': False, 'count(z)': 0, 'has(k)': False, 'has(u)': False, 'count(d)': 0, 'has(s)': False, 'count(f)': 0, 'lastletter': 'n', 'has(q)': False, 'has(w)': False, 'has(e)': False, 'has(z)': False, 'count(t)': 0, 'count(c)': 0, 'has(c)': False, 'has(x)': False, 'count(v)': 0, 'count(m)': 0, 'has(a)': False, 'has(v)': False, 'count(p)': 0, 'count(o)': 1, 'has(i)': False, 'count(i)': 0, 'has(r)': False, 'has(g)': False, 'count(k)': 0, 'firstletter': 'j', 'count(y)': 0, 'has(n)': True, 'has(l)': False, 'count(e)': 0, 'has(t)': False, 'count(g)': 0, 'count(r)': 0, 'count(a)': 0, 'has(p)': False}
>>> featuresets2 = [(gender_features2(n), g) for (n,g) in names]
>>> train_set2, test_set2 = featuresets2[500:], featuresets2[:500]
>>> classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
>>> print nltk.classify.accuracy(classifier2, test_set2)
0.782
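
A common way to check for overfitting is to compare accuracy on the training data with accuracy on held-out data: a large gap suggests the model has learned quirks of the training set. The sketch below (Python 3, standard library only) makes the gap visible with a deliberately overfit "model" that simply memorizes its training names. The data and the helper names `accuracy` and `train_memorizer` are made up for illustration.

```python
from collections import Counter

def accuracy(classify, labeled_names):
    """Fraction of (name, label) pairs for which classify(name) == label."""
    correct = sum(1 for name, label in labeled_names if classify(name) == label)
    return correct / len(labeled_names)

def train_memorizer(labeled_names):
    """An intentionally overfit 'model': it memorizes every training name
    and falls back to the most frequent training label for unseen names."""
    table = dict(labeled_names)
    fallback = Counter(label for _, label in labeled_names).most_common(1)[0][0]
    return lambda name: table.get(name, fallback)

# Made-up data for illustration only.
train = [("Anna", "female"), ("Maria", "female"), ("Emma", "female"),
         ("Mark", "male"), ("John", "male")]
held_out = [("Linda", "female"), ("Frank", "male"),
            ("Erik", "male"), ("Julia", "female")]

model = train_memorizer(train)
train_acc = accuracy(model, train)
held_out_acc = accuracy(model, held_out)
print("train=%.2f held-out=%.2f gap=%.2f"
      % (train_acc, held_out_acc, train_acc - held_out_acc))
```

With the NLTK session above, the equivalent comparison would be `nltk.classify.accuracy(classifier2, train_set2)` against `nltk.classify.accuracy(classifier2, test_set2)`.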

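(3) Error Analysis

Once an initial feature set has been chosen, a very productive way to refine it is error analysis. First, we select a development set containing the corpus data used to create the model. This development set is then split into a training set and a dev-test set.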

>>> train_names = names[1500:]
>>> devtest_names = names[500:1500]
>>> test_names = names[:500]
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> test_set = [(gender_features(n), g) for (n,g) in test_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.77

Then, using the dev-test set, we can generate a list of the errors the classifier makes when predicting name gender.

>>> errors = []
>>> for (name, tag) in devtest_names:
...     guess = classifier.classify(gender_features(name))
...     if guess != tag:
...         errors.append( (tag, guess, name) )
... 
>>> for (tag, guess, name) in sorted(errors):
...     print 'correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name)
... 
correct=female   guess=male     name=Aileen                        
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Allsun                        
correct=female   guess=male     name=Alyss                         
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Anabel                        
correct=female   guess=male     name=Anett                         
correct=female   guess=male     name=Arden                         
correct=female   guess=male     name=Ariel                         
correct=female   guess=male     name=Barb                          
correct=female   guess=male     name=Blondell                      
correct=female   guess=male     name=Brear                         
correct=female   guess=male     name=Brett                         
correct=female   guess=male     name=Bridget                       
correct=female   guess=male     name=Brier                         
correct=female   guess=male     name=Brook                         
correct=female   guess=male     name=Carmon                        
correct=female   guess=male     name=Caro                          
correct=female   guess=male     name=Carolan                       
correct=female   guess=male     name=Carolyn                       
correct=female   guess=male     name=Carolynn                      
correct=female   guess=male     name=Cathrin                       
correct=female   guess=male     name=Cherlyn                       
correct=female   guess=male     name=Clio                          
correct=female   guess=male     name=Daniel                        
correct=female   guess=male     name=Deb                           
correct=female   guess=male     name=Demeter                       
correct=female   guess=male     name=Devon                         
correct=female   guess=male     name=Dido                          
correct=female   guess=male     name=Doralynn                      
correct=female   guess=male     name=Doreen                        
correct=female   guess=male     name=Dyann                         
correct=female   guess=male     name=Eilis                         
correct=female   guess=male     name=Emlynn                        
correct=female   guess=male     name=Eran                          
correct=female   guess=male     name=Ester                         
correct=female   guess=male     name=Ethel                         
correct=female   guess=male     name=Faun                          
correct=female   guess=male     name=Felicdad                      
correct=female   guess=male     name=Flor                          
correct=female   guess=male     name=Gabriel                       
correct=female   guess=male     name=Garland                       
correct=female   guess=male     name=Gates                         
correct=female   guess=male     name=Gill                          
correct=female   guess=male     name=Glyn                          
correct=female   guess=male     name=Glynnis                       
correct=female   guess=male     name=Gredel                        
correct=female   guess=male     name=Harriot                       
correct=female   guess=male     name=Hildegaard                    
correct=female   guess=male     name=Ingaberg                      
correct=female   guess=male     name=Isabel                        
correct=female   guess=male     name=Izabel                        
correct=female   guess=male     name=Jacquelin                     
correct=female   guess=male     name=Jannel                        
correct=female   guess=male     name=Jazmin                        
correct=female   guess=male     name=Jo-Ann                        
correct=female   guess=male     name=Jonell                        
correct=female   guess=male     name=Karyl                         
correct=female   guess=male     name=Katheryn                      
correct=female   guess=male     name=Katleen                       
correct=female   guess=male     name=Kellyann                      
correct=female   guess=male     name=Keriann                       
correct=female   guess=male     name=Kial                          
correct=female   guess=male     name=Koo                           
correct=female   guess=male     name=Kristal                       
correct=female   guess=male     name=Kylynn                        
correct=female   guess=male     name=Leanor                        
correct=female   guess=male     name=Lilas                         
correct=female   guess=male     name=Lilias                        
correct=female   guess=male     name=Lind                          
correct=female   guess=male     name=Linnell                       
correct=female   guess=male     name=Lorain                        
correct=female   guess=male     name=Mab                           
correct=female   guess=male     name=Mag                           
correct=female   guess=male     name=Magdalen                      
correct=female   guess=male     name=Mair                          
correct=female   guess=male     name=Marilyn                       
correct=female   guess=male     name=Marion                        
correct=female   guess=male     name=Maryann                       
correct=female   guess=male     name=Meaghan                       
correct=female   guess=male     name=Merilyn                       
correct=female   guess=male     name=Merl                          
correct=female   guess=male     name=Michal                        
correct=female   guess=male     name=Millisent                     
correct=female   guess=male     name=Moll                          
correct=female   guess=male     name=Nert                          
correct=female   guess=male     name=Nichol                        
correct=female   guess=male     name=Peg                           
correct=female   guess=male     name=Phil                          
correct=female   guess=male     name=Philis                        
correct=female   guess=male     name=Pier                          
correct=female   guess=male     name=Rahal                         
correct=female   guess=male     name=Raquel                        
correct=female   guess=male     name=Rayshell                      
correct=female   guess=male     name=Rhianon                       
correct=female   guess=male     name=Roselin                       
correct=female   guess=male     name=Shannon                       
correct=female   guess=male     name=Sharl                         
correct=female   guess=male     name=Shaun                         
correct=female   guess=male     name=Sheilakathryn                 
correct=female   guess=male     name=Sheril                        
correct=female   guess=male     name=Sherill                       
correct=female   guess=male     name=Sioux                         
correct=female   guess=male     name=Star                          
correct=female   guess=male     name=Stoddard                      
correct=female   guess=male     name=Theo                          
correct=female   guess=male     name=Wallis                        
correct=female   guess=male     name=Wileen                        
correct=female   guess=male     name=Yoko                          
correct=male     guess=female   name=Abbey                         
correct=male     guess=female   name=Aguste                        
correct=male     guess=female   name=Ali                           
correct=male     guess=female   name=Anatole                       
correct=male     guess=female   name=Andri                         
correct=male     guess=female   name=Arie                          
correct=male     guess=female   name=Ash                           
correct=male     guess=female   name=Ashby                         
correct=male     guess=female   name=Ashley                        
correct=male     guess=female   name=Avery                         
correct=male     guess=female   name=Baillie                       
correct=male     guess=female   name=Barde                         
correct=male     guess=female   name=Barney                        
correct=male     guess=female   name=Barnie                        
correct=male     guess=female   name=Benny                         
correct=male     guess=female   name=Bertie                        
correct=male     guess=female   name=Billy                         
correct=male     guess=female   name=Bjorne                        
correct=male     guess=female   name=Carlie                        
correct=male     guess=female   name=Chance                        
correct=male     guess=female   name=Chaunce                       
correct=male     guess=female   name=Christoph                     
correct=male     guess=female   name=Claire                        
correct=male     guess=female   name=Clare                         
correct=male     guess=female   name=Claude                        
correct=male     guess=female   name=Conway                        
correct=male     guess=female   name=Curtice                       
correct=male     guess=female   name=Davide                        
correct=male     guess=female   name=Davie                         
correct=male     guess=female   name=Dewey                         
correct=male     guess=female   name=Dickie                        
correct=male     guess=female   name=Dominique                     
correct=male     guess=female   name=Donnie                        
correct=male     guess=female   name=Dougie                        
correct=male     guess=female   name=Doyle                         
correct=male     guess=female   name=Dudley                        
correct=male     guess=female   name=Duffie                        
correct=male     guess=female   name=Dwayne                        
correct=male     guess=female   name=Emmy                          
correct=male     guess=female   name=Eugene                        
correct=male     guess=female   name=Ezra                          
correct=male     guess=female   name=Felipe                        
correct=male     guess=female   name=Garth                         
correct=male     guess=female   name=Gerome                        
correct=male     guess=female   name=Gerry                         
correct=male     guess=female   name=Graeme                        
correct=male     guess=female   name=Grove                         
correct=male     guess=female   name=Guillaume                     
correct=male     guess=female   name=Hadley                        
correct=male     guess=female   name=Harry                         
correct=male     guess=female   name=Hartley                       
correct=male     guess=female   name=Hercule                       
correct=male     guess=female   name=Jay                           
correct=male     guess=female   name=Jedediah                      
correct=male     guess=female   name=Jeramie                       
correct=male     guess=female   name=Jeremiah                      
correct=male     guess=female   name=Jody                          
correct=male     guess=female   name=Keefe                         
correct=male     guess=female   name=Kennedy                       
correct=male     guess=female   name=Lance                         
correct=male     guess=female   name=Lawrence                      
correct=male     guess=female   name=Locke                         
correct=male     guess=female   name=Lorrie                        
correct=male     guess=female   name=Luce                          
correct=male     guess=female   name=Marlowe                       
correct=male     guess=female   name=Matty                         
correct=male     guess=female   name=Maurise                       
correct=male     guess=female   name=Meredeth                      
correct=male     guess=female   name=Mitch                         
correct=male     guess=female   name=Mordecai                      
correct=male     guess=female   name=Morty                         
correct=male     guess=female   name=Noah                          
correct=male     guess=female   name=Noe                           
correct=male     guess=female   name=Paddie                        
correct=male     guess=female   name=Pearce                        
correct=male     guess=female   name=Pierce                        
correct=male     guess=female   name=Quincy                        
correct=male     guess=female   name=Radcliffe                     
correct=male     guess=female   name=Rafe                          
correct=male     guess=female   name=Ravi                          
correct=male     guess=female   name=Ray                           
correct=male     guess=female   name=Rene                          
correct=male     guess=female   name=Rodolphe                      
correct=male     guess=female   name=Rolfe                         
correct=male     guess=female   name=Rourke                        
correct=male     guess=female   name=Ruddie                        
correct=male     guess=female   name=Rusty                         
correct=male     guess=female   name=Sawyere                       
correct=male     guess=female   name=Sergei                        
correct=male     guess=female   name=Seth                          
correct=male     guess=female   name=Sheffie                       
correct=male     guess=female   name=Sherlocke                     
correct=male     guess=female   name=Shorty                        
correct=male     guess=female   name=Slade                         
correct=male     guess=female   name=Smith                         
correct=male     guess=female   name=Stearne                       
correct=male     guess=female   name=Steve                         
correct=male     guess=female   name=Stevy                         
correct=male     guess=female   name=Tanny                         
correct=male     guess=female   name=Temple                        
correct=male     guess=female   name=Terrence                      
correct=male     guess=female   name=Thorny                        
correct=male     guess=female   name=Trace                         
correct=male     guess=female   name=Troy                          
correct=male     guess=female   name=Tulley                        
correct=male     guess=female   name=Ty                            
correct=male     guess=female   name=Ulrich                        
correct=male     guess=female   name=Valentine                     
correct=male     guess=female   name=Vance                         
correct=male     guess=female   name=Vassily                       
correct=male     guess=female   name=Verge                         
correct=male     guess=female   name=Vinny                         
correct=male     guess=female   name=Vite                          
correct=male     guess=female   name=Wallace                       
correct=male     guess=female   name=Wayne                         
correct=male     guess=female   name=Willie                        
correct=male     guess=female   name=Yance                         
correct=male     guess=female   name=Yule                          
correct=male     guess=female   name=Zachary                       
correct=male     guess=female   name=Zary                          
correct=male     guess=female   name=Zollie 
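
A list this long is hard to scan by eye. One option is to tally the misclassified names by their ending, so that recurring suffixes stand out. The sketch below (Python 3) does this for ten entries copied from the list above; the helper name `suffix_error_counts` is introduced here for illustration.

```python
from collections import Counter

# A few (correct, guess, name) triples copied from the error list above.
errors = [
    ("female", "male", "Carolyn"), ("female", "male", "Cherlyn"),
    ("female", "male", "Katheryn"), ("female", "male", "Marilyn"),
    ("female", "male", "Merilyn"), ("female", "male", "Doreen"),
    ("male", "female", "Mitch"), ("male", "female", "Ulrich"),
    ("male", "female", "Seth"), ("male", "female", "Barney"),
]

def suffix_error_counts(errors, n=2):
    """Count misclassified names by (correct label, last n letters)."""
    return Counter((tag, name[-n:].lower()) for tag, guess, name in errors)

# Show the two most frequent (label, suffix) combinations among the errors.
for (tag, suffix), count in suffix_error_counts(errors).most_common(2):
    print("correct=%-8s suffix=%-3s errors=%d" % (tag, suffix, count))
```

In the interactive session, the full `errors` list built earlier could be passed to the same function instead of this small sample.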

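Looking through the errors reveals some patterns.

For example, names ending in yn appear to be predominantly female, even though names ending in n tend to be male; names ending in ch are usually male, even though names ending in h tend to be female. We therefore adjust the feature extractor to include two-letter suffix features. (Note that the gain here is tiny: dev-test accuracy moves from 0.77 to 0.771, whereas in the book's example it rose by about 2% to 0.78. The result is also still below the 0.782 obtained earlier with the larger feature set. Perhaps the Bayes implementation in NLTK 3.0 has improved so that more features give better results?)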

>>> def gender_features(word):
...     return {'suffix1': word[-1:],
...             'suffix2': word[-2:]}
... 
>>> train_set = [(gender_features(n), g) for (n,g) in train_names]
>>> devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, devtest_set)
0.771

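This error analysis procedure can be repeated, checking for patterns in the errors made by the newly improved classifier. Each time it is repeated, we should select a different dev-test/training split so that the classifier does not start to reflect idiosyncrasies of the dev-test set.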
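
Drawing a fresh dev-test/training split for each round can be sketched as follows (Python 3). The held-out test set is never passed to the function, so it stays untouched. The helper name `resplit`, the seed-per-round scheme, and the placeholder data are assumptions for illustration.

```python
import random

def resplit(dev_names, devtest_size, seed):
    """Shuffle a copy of the development names and split it into
    (train_names, devtest_names). The held-out test set is not involved."""
    shuffled = list(dev_names)
    random.Random(seed).shuffle(shuffled)
    return shuffled[devtest_size:], shuffled[:devtest_size]

# Placeholder data standing in for names[500:] from the session above.
dev_names = [("name%d" % i, "female" if i % 2 else "male") for i in range(20)]

for round_number in range(3):
    train_names, devtest_names = resplit(dev_names, devtest_size=5,
                                         seed=round_number)
    # Retrain on train_names and inspect the errors on devtest_names here.
    print(round_number, len(train_names), len(devtest_names))
```

Using a separate `random.Random(seed)` instance keeps each round reproducible without disturbing the global random state.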

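However, once we have used the dev-test set to help develop the model, we can no longer trust it to give an accurate picture of how the model will perform on new data. It is therefore important to keep the test set separate and unused until model development is complete. At that point we can use the test set to evaluate how well the model performs on new input. (Unfortunately, the accuracy I got on the test set was 0.62, far below the initial 0.76. Note, however, that test_set was built with the earlier last_letter extractor, before gender_features was redefined, so the classifier was given none of the features it was trained on. This figure therefore says little about the new suffix features.)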

>>> print nltk.classify.accuracy(classifier, test_set)
0.62
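
The 0.62 figure deserves a second look. In the session above, `test_set` was built before `gender_features` was redefined, so it still carries `last_letter` features, while the classifier was trained on `suffix1` and `suffix2`. The sketch below (Python 3) shows a simple consistency check on feature names; `old_gender_features`, `new_gender_features`, and `feature_names` are names introduced here for illustration, standing in for the two versions of the extractor.

```python
def old_gender_features(word):
    # The extractor from section (1).
    return {'last_letter': word[-1]}

def new_gender_features(word):
    # The two-suffix extractor defined in section (3).
    return {'suffix1': word[-1:], 'suffix2': word[-2:]}

def feature_names(featuresets):
    """Collect every feature name used in a list of (features, label) pairs."""
    names_seen = set()
    for features, _label in featuresets:
        names_seen.update(features)
    return names_seen

# Two names taken from the printed samples earlier in these notes.
sample = [('Kourtney', 'female'), ('Munroe', 'male')]

train_like = [(new_gender_features(n), g) for n, g in sample]
stale_test = [(old_gender_features(n), g) for n, g in sample]
fresh_test = [(new_gender_features(n), g) for n, g in sample]

# No overlap: the stale test set shares no feature names with training.
print(feature_names(train_like) & feature_names(stale_test))   # set()
# After rebuilding with the current extractor the names match.
print(feature_names(train_like) == feature_names(fresh_test))  # True
```

In the session, rebuilding `test_set` from `test_names` with the current `gender_features` before calling `nltk.classify.accuracy(classifier, test_set)` should give a figure that is comparable with the dev-test result; that figure is not reported in these notes.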




