Writing a Naive Bayes Spam Email/SMS Detector from Scratch in Python

Implementing these classic algorithms by hand, without calling sklearn or other library APIs, and tuning the parameters yourself is quite rewarding and illuminating.

Code on Github: https://gist.github.com/JackonYang/5d354a2985f1bd77ead1c8a260649225

Dataset

Source: the SMS Spam Collection Dataset on kaggle.

Overview: 5,572 SMS messages, about 13% spam.
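
A quick way to verify those numbers (a sketch; it assumes the Kaggle spam.csv layout with the label in the first column and the Latin-1 encoding the file ships with):

import csv

# count messages and the spam ratio in the Kaggle spam.csv
# (label in column 0, text in column 1, Latin-1 encoded)
with open('input/spam.csv', encoding='latin-1', newline='') as f:
    rows = list(csv.reader(f))[1:]  # drop the header row

labels = [row[0] for row in rows]
print('%d messages, %.0f%% spam' % (
    len(labels), 100.0 * labels.count('spam') / len(labels)))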

Why this dataset:

  1. SMS text needs less preprocessing than email, so the computation is lighter and it is easier to focus on the algorithm itself.
  2. The dataset comes from kaggle, where the sampling is relatively sound, so it reflects the algorithm's real effectiveness more accurately.

My backup of the data: spam.csv on github.com

How the Algorithm Works

Objective: given a document (d), compute the probability that it belongs to each class (c), and output the class with the highest probability.

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)$$

For spam email/SMS detection there are only 2 classes: spam and not-spam.

Within spam detection specifically, not-spam is conventionally called ham. There is no deeper reason; the early practitioners simply liked the name. So the two class names become spam and ham.

By Bayes' theorem, and dropping the denominator P(d), which is the same for every class, this becomes

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) = \operatorname*{argmax}_{c \in C} P(f_1, f_2, \ldots, f_n \mid c)\,P(c)$$

where f1, f2, ..., fn are the document's features.

There are many ways to pick features: raw word frequency, TF/IDF scores, word frequency after stop-word removal, and so on. The choice of feature has nothing to do with Bayes itself; it is driven by the problem we want to solve. Different features are discussed and compared experimentally later on.
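
As a rough sketch of those three candidates (the tokenizer and the stop-word list here are illustrative placeholders, not part of the final code):

import math
import string
from collections import Counter

# toy stop-word list; real lists are much longer
STOP_WORDS = {'the', 'a', 'to', 'and', 'is', 'you'}

def tokenize(text):
    return [w.strip(string.punctuation) for w in text.split()]

def word_count_features(text):
    # raw term frequency
    return Counter(tokenize(text))

def stopword_filtered_features(text):
    # term frequency after dropping common function words
    return Counter(w for w in tokenize(text) if w.lower() not in STOP_WORDS)

def tf_idf_features(text, doc_freq, n_docs):
    # needs corpus-level document frequencies: doc_freq[w] = #docs containing w
    tf = word_count_features(text)
    return {w: cnt * math.log(1.0 * n_docs / (1 + doc_freq.get(w, 0)))
            for w, cnt in tf.items()}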

We assume the features are mutually independent. In practice, the model works well even when they are not.

This yields the new formula:

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

As with other language models, we switch to log space:

$$\hat{c} = \operatorname*{argmax}_{c \in C} \Bigl[\log P(c) + \sum_{i=1}^{n} \log P(f_i \mid c)\Bigr]$$
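
Switching to log space is not just convention: multiplying hundreds of tiny likelihoods underflows a float to 0, while summing their logs stays well-scaled. A tiny illustration with made-up numbers:

import math

probs = [1e-5] * 80  # 80 made-up feature likelihoods

product = 1.0
for p in probs:
    product *= p
print(product)  # prints 0.0: the true value 1e-400 underflows float64

print(sum(math.log(p) for p in probs))  # about -921.03, no underflow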

Feature Selection

Spam email detection is by now a mature field; commercial versions also use non-text features such as the sender's address.

Here we keep the discussion within text / natural-language features.

Before machine learning, there were already rule-based filters built from patterns people had spotted, for example:

  1. online pharmaceutical
  2. WITHOUT ANY COST
  3. Dear Winner

The pattern: spam likes to put certain keywords in ALL CAPS, and its phrasing differs from ordinary text.

Common NLP preprocessing steps, such as lowercasing everything or TF/IDF weighting, would actually wipe out some of these features, so they may be inappropriate here.

We take the simplest possible metric, word occurrence counts, as our feature.

All words under all classes in the training corpus form a vocabulary; then, within each class, we count each word's occurrences and normalize the counts into probabilities:

$$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w \in V} \mathrm{count}(w, c)}$$

Any word that never occurs under a class gets probability 0, which forces the final product to 0 as well. To fix this, we use add-one (Laplace) smoothing:

$$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$
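
A toy sketch of the smoothed estimate (the counts and the vocabulary size here are made up):

from collections import Counter

spam_counts = Counter({'free': 3, 'winner': 2, 'call': 1})  # toy counts
V = 5000                                                    # toy vocabulary size

def smoothed(word, counts, vocab_size):
    # (count(w, c) + 1) / (total words in the class + |V|)
    return (counts[word] + 1.0) / (sum(counts.values()) + vocab_size)

print(smoothed('free', spam_counts, V))    # seen word: 4 / 5006
print(smoothed('viagra', spam_counts, V))  # unseen word: 1 / 5006, not 0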

Pseudocode

TrainNaiveBayes(D, C):
    N_doc = number of documents in D
    V = vocabulary of D
    for each class c in C:
        N_c = number of documents in D labeled c
        logprior[c] = log(N_c / N_doc)
        bigdoc[c] = concatenation of all documents labeled c
        for each word w in V:
            loglikelihood[w][c] = log((count(w, bigdoc[c]) + 1) /
                                      (len(bigdoc[c]) + |V|))
    return logprior, loglikelihood, V

PredictNaiveBayes(testdoc, logprior, loglikelihood, V):
    for each class c in C:
        sum[c] = logprior[c]
        for each word w in testdoc:
            if w in V:
                sum[c] += loglikelihood[w][c]
    return the c with the largest sum[c]

Model Evaluation

We treat the spam class as the positive class.

A prediction then falls into one of four buckets:

                 predicted spam   predicted ham
    actual spam       TP               FN
    actual ham        FP               TN

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Python Implementation

import csv
import math
import string
from collections import Counter

import numpy as np


def load_data(filename, train_ratio):
    # the Kaggle spam.csv is Latin-1 encoded, not UTF-8
    with open(filename, encoding='latin-1', newline='') as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        dataset = [(line[0], line[1]) for line in csv_reader]

    np.random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]


def train(train_set):
    total_doc_cnt = len(train_set)

    label_doc_cnt = {}  # number of documents per label, for the priors
    bigdoc_words = {}   # all words of all documents under a label

    for label, doc in train_set:
        if label not in label_doc_cnt:
            # init
            label_doc_cnt[label] = 0
            bigdoc_words[label] = []

        label_doc_cnt[label] += 1
        bigdoc_words[label].extend([
            w.strip(string.punctuation) for w in doc.split()])

    vocabulary = set()
    for words in bigdoc_words.values():
        vocabulary |= set(words)
    # fix the word order so predict() can align likelihoods with it
    vocabulary = sorted(vocabulary)

    V = len(vocabulary)
    log_priors = {label: math.log(1.0 * cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}

    # add-one smoothed log likelihoods, aligned with the order of `vocabulary`
    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        word_cnt = len(words) + V
        counts = Counter(words)  # one O(N) pass instead of words.count(w) per word
        log_likelihoods[label] = [
            math.log(1.0 * (1 + counts[w]) / word_cnt) for w in vocabulary]

    return log_priors, log_likelihoods, vocabulary


def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    words = {w.strip(string.punctuation) for w in input_text.split()}

    prob_max = None  # log probs are negative, so 0 is not a safe sentinel
    label_max = None

    probs = {}  # kept only for logging misclassifications
    for label, likelihood in log_likelihoods.items():
        # log prior + log likelihood of every vocabulary word seen in the text
        prob = log_priors[label] + sum(
            p for w, p in zip(vocabulary, likelihood) if w in words)
        probs[label] = prob

        if prob_max is None or prob > prob_max:
            prob_max = prob
            label_max = label

    if expect_label and expect_label != label_max:
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)

    return label_max


def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)

    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)

    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    pos_true = 0   # spam predicted as spam (true positives, TP)
    pos_false = 0  # spam predicted as ham (false negatives, FN)
    neg_false = 0  # ham predicted as spam (false positives, FP)
    neg_true = 0   # ham predicted as ham (true negatives, TN)

    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1

    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + neg_false),
        100.0 * pos_true / (pos_true + pos_false),
        ))


if __name__ == '__main__':
    main()

Results

load_data shuffles the dataset, so every run produces slightly different results, but Recall stays around 90% and Precision around 96%.

data loaded. train: 4179, test: 1393
model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996
positive(spam) true: 156, false: 23
negative true: 1213, false: 1
Precision: 99.36%, Recall: 87.15%
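
Runs can be made reproducible by pinning numpy's random seed before load_data shuffles (not in the original code; the seed value is arbitrary):

import numpy as np

np.random.seed(42)  # any fixed value gives repeatable splits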

Details of the misclassified messages

---
expect: spam, got: ham
{'ham': -84.45531071975456, 'spam': -90.70055052987975}
You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -52.676090263258466, 'spam': -56.80155705621162}
Are you unique enough? Find out from 30th August. www.areyouunique.co.uk
---
expect: spam, got: ham
{'ham': -70.17115950997167, 'spam': -72.51873090685471}
This message is brought to you by GMW Ltd. and is not connected to the
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -168.26318051822577, 'spam': -178.0225162634835}
Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES
---
expect: spam, got: ham
{'ham': -142.83829011312517, 'spam': -160.32809534946767}
ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.
---
expect: spam, got: ham
{'ham': -153.87952746055848, 'spam': -157.51773572622523}
How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)
---
expect: spam, got: ham
{'ham': -70.49633709940345, 'spam': -71.4324288990577}
Latest News! Police station toilet stolen, cops have nothing to go on!
---
expect: spam, got: ham
{'ham': -167.8034762079193, 'spam': -181.29820672852267}
Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1
---
expect: spam, got: ham
{'ham': -201.64646891761794, 'spam': -221.190966006233}
Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
---
expect: spam, got: ham
{'ham': -179.72840292764013, 'spam': -198.04988429454147}
Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!
---
expect: spam, got: ham
{'ham': -169.32983912276305, 'spam': -171.08314763971316}
Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.
---
expect: spam, got: ham
{'ham': -94.5052592368405, 'spam': -100.35544827044257}
Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.
---
expect: spam, got: ham
{'ham': -92.32667266264504, 'spam': -98.95596194645321}
Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.
---
expect: spam, got: ham
{'ham': -72.48756276984723, 'spam': -76.59107843644884}
Missed call alert. These numbers called but left no message. 07008009200
---
expect: spam, got: ham
{'ham': -77.48695175645791, 'spam': -93.71448200582458}
Did you hear about the new Divorce Barbie"? It comes with all of Ken's stuff!"
---
expect: spam, got: ham
{'ham': -178.2932238903798, 'spam': -184.67604680539299}
Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL
---
expect: spam, got: ham
{'ham': -146.47067170071077, 'spam': -149.22917007660618}
Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.
---
expect: spam, got: ham
{'ham': -50.71127368750657, 'spam': -55.32646466594103}
Money i have won wining number 946 wot do i do next
---
expect: spam, got: ham
{'ham': -86.61673741525858, 'spam': -100.67904461391855}
Sorry I missed your call let's talk when you have the time. I'm on 07090201529
---
expect: spam, got: ham
{'ham': -135.8452563916395, 'spam': -138.23271598529016}
Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3
---
expect: spam, got: ham
{'ham': -106.60938155575177, 'spam': -115.73546864261806}
INTERFLORA - It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.
---
expect: ham, got: spam
{'ham': -109.18333505635592, 'spam': -107.80307740800048}
MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED

Tuning

We run just one very simple experiment: if we lowercase everything first and then count occurrences, do the results improve?

Three consecutive runs with the current code:

positive(spam) true: 188, false: 21
negative true: 1182, false: 2
Precision: 98.95%, Recall: 89.95%
---
positive(spam) true: 160, false: 16
negative true: 1209, false: 8
Precision: 95.24%, Recall: 90.91%
---
positive(spam) true: 167, false: 13
negative true: 1208, false: 5
Precision: 97.09%, Recall: 92.78%

The code change:

In load_data, change

dataset = [(line[0], line[1]) for line in csv_reader]

to

dataset = [(line[0], line[1].lower()) for line in csv_reader]

Three consecutive runs:

positive(spam) true: 164, false: 18
negative true: 1205, false: 6
Precision: 96.47%, Recall: 90.11%
---
positive(spam) true: 174, false: 20
negative true: 1197, false: 2
Precision: 98.86%, Recall: 89.69%
---
positive(spam) true: 162, false: 23
negative true: 1204, false: 4
Precision: 97.59%, Recall: 87.57%

The results barely change: Recall drops slightly while Precision rises slightly.

With only 3 runs the variance is too high to conclude which feature is better.

But neither did we see any obvious improvement or degradation.
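
To make such comparisons less noisy, one option (a sketch built on the load_data / train / predict functions defined above) is to average Precision and Recall over many random splits:

import numpy as np

def evaluate(n_runs=20, filename='input/spam.csv', train_ratio=0.75):
    precisions, recalls = [], []
    for _ in range(n_runs):
        train_data, test_data = load_data(filename, train_ratio)
        log_priors, log_likelihoods, vocabulary = train(train_data)

        tp = fp = fn = 0  # spam is the positive class
        for label, text in test_data:
            got = predict(log_priors, log_likelihoods, vocabulary, text)
            if got == 'spam' and label == 'spam':
                tp += 1
            elif got == 'spam':
                fp += 1
            elif label == 'spam':
                fn += 1

        precisions.append(100.0 * tp / (tp + fp))
        recalls.append(100.0 * tp / (tp + fn))

    print('Precision: %.2f%% (std %.2f)' % (np.mean(precisions), np.std(precisions)))
    print('Recall: %.2f%% (std %.2f)' % (np.mean(recalls), np.std(recalls)))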

Summary

  1. The training phase mainly computes the prior and the likelihood. Neither depends on the text inside any single document; both come from counting documents and words over the whole training set of each label.
  2. What drives the result is not the relative frequency of word A versus word B within one training set, but the gap in a word's probability across different label sets (see the sketch below).
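
Point 2 can be inspected directly with the values returned by train() above: rank each word by the gap between its log likelihood under spam and under ham (a sketch; the ranking helper is not part of the original code):

def top_spammy_words(log_likelihoods, vocabulary, n=10):
    # a positive gap means the word is far more likely under 'spam'
    gaps = [(spam_lp - ham_lp, w) for w, spam_lp, ham_lp in zip(
        vocabulary, log_likelihoods['spam'], log_likelihoods['ham'])]
    return sorted(gaps, reverse=True)[:n]

# for gap, word in top_spammy_words(log_likelihoods, vocabulary):
#     print('%6.2f  %s' % (gap, word))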