1. Code on P126
To define the thresholds, the book says to modify the initialization method, adding a new instance variable to the classifier:

    def __init__(self, getfeatures):
        classifier.__init__(self, getfeatures)
        self.thresholds = {}

When making this change, you should add only the last line directly to the definition of __init__() inside the classifier class itself, and drop the preceding classifier.__init__ line (the class cannot call its own __init__ as if it were a superclass).
The modified __init__() is:

    class classifier:
        def __init__(self, getfeatures, filename=None):
            # count of feature/category combinations
            self.fc = {}
            # count of documents in each category
            self.cc = {}
            self.getfeatures = getfeatures
            # classifier.__init__(self, getfeatures)
            self.thresholds = {}
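For context, a minimal self-contained sketch of how this thresholds dict ends up being used; the setthreshold()/getthreshold() accessors mirror the ones the book defines next (getthreshold() falls back to 1.0 for categories without an explicit threshold):

```python
# Minimal sketch: why self.thresholds belongs in classifier.__init__.
# setthreshold()/getthreshold() mirror the book's accessors; getthreshold()
# returns 1.0 when no threshold has been set for a category.
class classifier:
    def __init__(self, getfeatures, filename=None):
        self.fc = {}                  # feature/category counts
        self.cc = {}                  # documents per category
        self.getfeatures = getfeatures
        self.thresholds = {}          # the new instance variable

    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        return self.thresholds.get(cat, 1.0)

cl = classifier(str.split)
print(cl.getthreshold('bad'))   # 1.0 -- default before any threshold is set
cl.setthreshold('bad', 3.0)
print(cl.getthreshold('bad'))   # 3.0
```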
2. P131
When verifying the code in the interpreter, if you reload(docclass) and then immediately call docclass.sampletrain(c1) and c1.classify('quick rabbit') on the existing c1 object, you get an AttributeError: reload() replaces the module's classes, but c1 still references the fisherclassifier class object from before the reload, which lacks the prob() method. The correct approach is to re-create c1 after reloading the file; then the error no longer appears. As follows:
>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "docclass.py", line 94, in classify
probs[cat] = self.prob(item, cat)
AttributeError: fisherclassifier instance has no attribute 'prob'
>>> c1 = docclass.fisherclassifier(docclass.getwords)
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
'good'
>>> c1.classify('quick money')
'bad'
>>> c1.setminimum('bad', 0.8)
>>> c1.classify('quick money')
'good'
>>> c1.setminimum('good', 0.4)
>>> c1.classify('quick money')
'good'
>>>
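The underlying cause is a general Python behaviour: reload() builds new class objects inside the module, but instances created earlier still point at the old class objects. A toy simulation of that effect (no actual module file needed):

```python
# A reload() binds new class objects into the module namespace, but
# instances created earlier keep referencing the old class object.
class fisherclassifier:          # "version 1", before prob() was added
    pass

old = fisherclassifier()

class fisherclassifier:          # "version 2", as after reload(docclass)
    def prob(self, item, cat):
        return 0.5

new = fisherclassifier()
print(hasattr(old, 'prob'))  # False -- the old instance never sees prob()
print(hasattr(new, 'prob'))  # True
```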
3. P128
In the normalization step on this page, the formula given in the text is:

    cprob = clf/(clf+nclf)

but the program actually computes

    p = clf / (freqsum)

In my view, nclf already includes clf when it is computed (it is the sum of this feature's probabilities over all categories), so there is no need to add clf again to achieve normalization; the formula in the text should therefore be:

    cprob = clf/nclf

That said, whether or not clf is added again does not affect the final result: it changes the numeric probabilities, but not the ranking of the categories.
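A quick numeric check of that last claim, using made-up fprob values for the two categories: normalizing by the category sum S or by (clf + S) gives different numbers but the same ordering, since x/(x+S) is increasing in x for S > 0.

```python
# Hypothetical fprob values for one feature across the two categories.
clfs = {'good': 0.1, 'bad': 0.5}
S = sum(clfs.values())                        # freqsum in the book's code
p_book = {c: v / S for c, v in clfs.items()}        # clf / freqsum
p_alt = {c: v / (v + S) for c, v in clfs.items()}   # clf / (clf + freqsum)
rank = lambda p: sorted(p, key=p.get, reverse=True)
print(rank(p_book))  # ['bad', 'good']
print(rank(p_alt))   # ['bad', 'good'] -- same order, different values
```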
4. P129
The sentence in the text saying "the probability that a document containing the word 'casino' is spam is 0.9" is wrong; by my calculation, the probability that a document containing the word 'casino' is spam should be 1.0.
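Working the number out, assuming the book's sampletrain() data, where 'casino' occurs in one of the two bad documents and in none of the three good ones:

```python
# fprob(f, cat) = (docs in cat containing f) / (docs in cat)
clf_bad = 1 / 2      # 'casino' in 1 of 2 bad documents
clf_good = 0 / 3     # 'casino' in 0 of 3 good documents
# normalized probability for the 'bad' category
cprob = clf_bad / (clf_good + clf_bad)
print(cprob)         # 1.0 -- not 0.9 as the text states
```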
5. P137
Code from the book:

    def entryfeatures(entry):
        splitter = re.compile('\\W*')
        f = {}
        # get the words in the title and annotate them
        titlewords = [s.lower() for s in splitter.split(entry['title'])
                      if len(s) > 2 and len(s) < 20]
        for w in titlewords: f['Title: ' + w] = 1
        # get the words in the summary
        summarywords = [s.lower() for s in splitter.split(entry['summary'])
                        if len(s) > 2 and len(s) < 20]
        # count the uppercase words
        uc = 0
        for i in range(len(summarywords)):
            w = summarywords[i]
            f[w] = 1
            if w.isupper(): uc += 1
            # use pairs of consecutive summary words as features
            if i < len(summarywords) - 1:
                twowords = ' '.join(summarywords[i : i + 2])
                f[twowords] = 1
        # keep the article's creator and publisher names whole as a feature
        f['Publisher: ' + entry['publisher']] = 1
        # UPPERCASE is a virtual word, flagging the presence of too many uppercase words
        if float(uc) / len(summarywords) > 0.3: f['UPPERCASE'] = 1
        return f
When counting the uppercase words, the code relies on the summarywords variable extracted earlier. But look at how summarywords is built:

    summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

As you can see, the lower() call has already converted every word in summarywords to lowercase, so counting uppercase words afterwards is meaningless (uc is always 0). I therefore think it should be changed to:

    summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

Please forgive my poor English.
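The point can be confirmed in a couple of lines: a string that has been run through lower() can never satisfy isupper(), so uc stays at 0 unless the lower() call is dropped (or the count is taken before lowercasing):

```python
words = ['NASA', 'launch', 'IPO']              # hypothetical summary words
lowered = [w.lower() for w in words]           # what the book's code does
print(sum(1 for w in lowered if w.isupper()))  # 0 -- lower() erased the case information
print(sum(1 for w in words if w.isupper()))    # 2 -- the count works without lower()
```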