1. Code on P126
To define the thresholds, the book says to modify the initialization method, adding a new instance variable to the classifier:

    def __init__(self, getfeatures):
        classifier.__init__(self, getfeatures)
        self.thresholds = {}

When making this change, you should add only the last line directly to the definition of __init__() inside the classifier class itself, and drop the preceding classifier.__init__ line (the class cannot call its own __init__ as if it were a superclass).
The modified __init__() is:

    class classifier:
        def __init__(self, getfeatures, filename=None):
            # count of feature/category combinations
            self.fc = {}
            # count of documents in each category
            self.cc = {}
            self.getfeatures = getfeatures
            # classifier.__init__(self, getfeatures)
            self.thresholds = {}
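For context, a minimal self-contained sketch of how this thresholds dict ends up being used; the setthreshold()/getthreshold() accessors mirror the ones the book defines next (getthreshold() falls back to 1.0 for categories without an explicit threshold):

```python
# Minimal sketch: why self.thresholds belongs in classifier.__init__.
# setthreshold()/getthreshold() mirror the book's accessors; getthreshold()
# returns 1.0 when no threshold has been set for a category.
class classifier:
    def __init__(self, getfeatures, filename=None):
        self.fc = {}                  # feature/category counts
        self.cc = {}                  # documents per category
        self.getfeatures = getfeatures
        self.thresholds = {}          # the new instance variable

    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        return self.thresholds.get(cat, 1.0)

cl = classifier(str.split)
print(cl.getthreshold('bad'))   # 1.0 -- default before any threshold is set
cl.setthreshold('bad', 3.0)
print(cl.getthreshold('bad'))   # 3.0
```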
2. P131
When verifying the code in the interpreter, if you reload(docclass) and then immediately call docclass.sampletrain(c1) and c1.classify('quick rabbit') on the existing c1 object, you get an AttributeError: reload() replaces the module's classes, but c1 still references the fisherclassifier class object from before the reload, which lacks the prob() method. The correct approach is to re-create c1 after reloading the file; then the error no longer appears. As follows:
>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "docclass.py", line 94, in classify
probs[cat] = self.prob(item, cat)
AttributeError: fisherclassifier instance has no attribute 'prob'
>>> c1 = docclass.fisherclassifier(docclass.getwords)
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
'good'
>>> c1.classify('quick money')
'bad'
>>> c1.setminimum('bad', 0.8)
>>> c1.classify('quick money')
'good'
>>> c1.setminimum('good', 0.4)
>>> c1.classify('quick money')
'good'
>>>
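The underlying cause is a general Python behaviour: reload() builds new class objects inside the module, but instances created earlier still point at the old class objects. A toy simulation of that effect (no actual module file needed):

```python
# A reload() binds new class objects into the module namespace, but
# instances created earlier keep referencing the old class object.
class fisherclassifier:          # "version 1", before prob() was added
    pass

old = fisherclassifier()

class fisherclassifier:          # "version 2", as after reload(docclass)
    def prob(self, item, cat):
        return 0.5

new = fisherclassifier()
print(hasattr(old, 'prob'))  # False -- the old instance never sees prob()
print(hasattr(new, 'prob'))  # True
```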
3. P128
In the normalization step on this page, the formula given in the text is:

    cprob = clf/(clf+nclf)

but the program actually computes

    p = clf / (freqsum)

In my view, nclf already includes clf when it is computed (it is the sum of this feature's probabilities over all categories), so there is no need to add clf again to achieve normalization; the formula in the text should therefore be:

    cprob = clf/nclf

That said, whether or not clf is added again does not affect the final result: it changes the numeric probabilities, but not the ranking of the categories.
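A quick numeric check of that last claim, using made-up fprob values for the two categories: normalizing by the category sum S or by (clf + S) gives different numbers but the same ordering, since x/(x+S) is increasing in x for S > 0.

```python
# Hypothetical fprob values for one feature across the two categories.
clfs = {'good': 0.1, 'bad': 0.5}
S = sum(clfs.values())                        # freqsum in the book's code
p_book = {c: v / S for c, v in clfs.items()}        # clf / freqsum
p_alt = {c: v / (v + S) for c, v in clfs.items()}   # clf / (clf + freqsum)
rank = lambda p: sorted(p, key=p.get, reverse=True)
print(rank(p_book))  # ['bad', 'good']
print(rank(p_alt))   # ['bad', 'good'] -- same order, different values
```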
4. P129
The sentence in the text saying "the probability that a document containing the word 'casino' is spam is 0.9" is wrong; by my calculation, the probability that a document containing the word 'casino' is spam should be 1.0.
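Working the number out, assuming the book's sampletrain() data, where 'casino' occurs in one of the two bad documents and in none of the three good ones:

```python
# fprob(f, cat) = (docs in cat containing f) / (docs in cat)
clf_bad = 1 / 2      # 'casino' in 1 of 2 bad documents
clf_good = 0 / 3     # 'casino' in 0 of 3 good documents
# normalized probability for the 'bad' category
cprob = clf_bad / (clf_good + clf_bad)
print(cprob)         # 1.0 -- not 0.9 as the text states
```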
5. P137
Code from the book:

    def entryfeatures(entry):
        splitter = re.compile('\\W*')
        f = {}
        # get the words in the title and annotate them
        titlewords = [s.lower() for s in splitter.split(entry['title'])
                      if len(s) > 2 and len(s) < 20]
        for w in titlewords: f['Title: ' + w] = 1
        # get the words in the summary
        summarywords = [s.lower() for s in splitter.split(entry['summary'])
                        if len(s) > 2 and len(s) < 20]
        # count the uppercase words
        uc = 0
        for i in range(len(summarywords)):
            w = summarywords[i]
            f[w] = 1
            if w.isupper(): uc += 1
            # use pairs of consecutive summary words as features
            if i < len(summarywords) - 1:
                twowords = ' '.join(summarywords[i : i + 2])
                f[twowords] = 1
        # keep the article's creator and publisher names whole as a feature
        f['Publisher: ' + entry['publisher']] = 1
        # UPPERCASE is a virtual word, flagging the presence of too many uppercase words
        if float(uc) / len(summarywords) > 0.3: f['UPPERCASE'] = 1
        return f
When counting the uppercase words, the code relies on the summarywords variable extracted earlier. But look at how summarywords is built:

    summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

As you can see, the lower() call has already converted every word in summarywords to lowercase, so counting uppercase words afterwards is meaningless (uc is always 0). I therefore think it should be changed to:

    summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

Please forgive my poor English.
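The point can be confirmed in a couple of lines: a string that has been run through lower() can never satisfy isupper(), so uc stays at 0 unless the lower() call is dropped (or the count is taken before lowercasing):

```python
words = ['NASA', 'launch', 'IPO']              # hypothetical summary words
lowered = [w.lower() for w in words]           # what the book's code does
print(sum(1 for w in lowered if w.isupper()))  # 0 -- lower() erased the case information
print(sum(1 for w in words if w.isupper()))    # 2 -- the count works without lower()
```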