Writing a Naive Bayes Spam Email/SMS Detector from Scratch in Python

Implementing these classic algorithms by hand, without calling sklearn or other library APIs, and tuning the parameters yourself is quite rewarding and illuminating.

Code on Github: https://gist.github.com/JackonYang/5d354a2985f1bd77ead1c8a260649225

Dataset

Source: the SMS Spam Collection Dataset on kaggle.

Overview: 5,572 SMS messages, about 13% spam.
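
A quick way to verify those numbers (a sketch; it assumes the Kaggle spam.csv layout with the label in the first column and the Latin-1 encoding the file ships with):

import csv

# count messages and the spam ratio in the Kaggle spam.csv
# (label in column 0, text in column 1, Latin-1 encoded)
with open('input/spam.csv', encoding='latin-1', newline='') as f:
    rows = list(csv.reader(f))[1:]  # drop the header row

labels = [row[0] for row in rows]
print('%d messages, %.0f%% spam' % (
    len(labels), 100.0 * labels.count('spam') / len(labels)))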

Why this dataset:

  1. SMS text needs less preprocessing than email, so the computation is lighter and it is easier to focus on the algorithm itself.
  2. The dataset comes from kaggle, where the sampling is relatively sound, so it reflects the algorithm's real effectiveness more accurately.

My backup of the data: spam.csv on github.com

How the Algorithm Works

Objective: given a document (d), compute the probability that it belongs to each class (c), and output the class with the highest probability.

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d)$$

For spam email/SMS detection there are only 2 classes: spam and not-spam.

Within spam detection specifically, not-spam is conventionally called ham. There is no deeper reason; the early practitioners simply liked the name. So the two class names become spam and ham.

By Bayes' theorem, and dropping the denominator P(d), which is the same for every class, this becomes

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) = \operatorname*{argmax}_{c \in C} P(f_1, f_2, \ldots, f_n \mid c)\,P(c)$$

where f1, f2, ..., fn are the document's features.

There are many ways to pick features: raw word frequency, TF/IDF scores, word frequency after stop-word removal, and so on. The choice of feature has nothing to do with Bayes itself; it is driven by the problem we want to solve. Different features are discussed and compared experimentally later on.
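
As a rough sketch of those three candidates (the tokenizer and the stop-word list here are illustrative placeholders, not part of the final code):

import math
import string
from collections import Counter

# toy stop-word list; real lists are much longer
STOP_WORDS = {'the', 'a', 'to', 'and', 'is', 'you'}

def tokenize(text):
    return [w.strip(string.punctuation) for w in text.split()]

def word_count_features(text):
    # raw term frequency
    return Counter(tokenize(text))

def stopword_filtered_features(text):
    # term frequency after dropping common function words
    return Counter(w for w in tokenize(text) if w.lower() not in STOP_WORDS)

def tf_idf_features(text, doc_freq, n_docs):
    # needs corpus-level document frequencies: doc_freq[w] = #docs containing w
    tf = word_count_features(text)
    return {w: cnt * math.log(1.0 * n_docs / (1 + doc_freq.get(w, 0)))
            for w, cnt in tf.items()}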

We assume the features are mutually independent. In practice, the model works well even when they are not.

This yields the new formula:

$$\hat{c} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(f_i \mid c)$$

As with other language models, we switch to log space:

$$\hat{c} = \operatorname*{argmax}_{c \in C} \Bigl[\log P(c) + \sum_{i=1}^{n} \log P(f_i \mid c)\Bigr]$$
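
Switching to log space is not just convention: multiplying hundreds of tiny likelihoods underflows a float to 0, while summing their logs stays well-scaled. A tiny illustration with made-up numbers:

import math

probs = [1e-5] * 80  # 80 made-up feature likelihoods

product = 1.0
for p in probs:
    product *= p
print(product)  # prints 0.0: the true value 1e-400 underflows float64

print(sum(math.log(p) for p in probs))  # about -921.03, no underflow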

Feature Selection

Spam email detection is by now a mature field; commercial versions also use non-text features such as the sender's address.

Here we keep the discussion within text / natural-language features.

Before machine learning, there were already rule-based filters built from patterns people had spotted, for example:

  1. online pharmaceutical
  2. WITHOUT ANY COST
  3. Dear Winner

The pattern: spam likes to put certain keywords in ALL CAPS, and its phrasing differs from ordinary text.

Common NLP preprocessing steps, such as lowercasing everything or TF/IDF weighting, would actually wipe out some of these features, so they may be inappropriate here.

We take the simplest possible metric, word occurrence counts, as our feature.

All words under all classes in the training corpus form a vocabulary; then, within each class, we count each word's occurrences and normalize the counts into probabilities:

$$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c)}{\sum_{w \in V} \mathrm{count}(w, c)}$$

Any word that never occurs under a class gets probability 0, which forces the final product to 0 as well. To fix this, we use add-one (Laplace) smoothing:

$$P(w_i \mid c) = \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \mathrm{count}(w, c) + |V|}$$
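
A toy sketch of the smoothed estimate (the counts and the vocabulary size here are made up):

from collections import Counter

spam_counts = Counter({'free': 3, 'winner': 2, 'call': 1})  # toy counts
V = 5000                                                    # toy vocabulary size

def smoothed(word, counts, vocab_size):
    # (count(w, c) + 1) / (total words in the class + |V|)
    return (counts[word] + 1.0) / (sum(counts.values()) + vocab_size)

print(smoothed('free', spam_counts, V))    # seen word: 4 / 5006
print(smoothed('viagra', spam_counts, V))  # unseen word: 1 / 5006, not 0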

Pseudocode

TrainNaiveBayes(D, C):
    N_doc = number of documents in D
    V = vocabulary of D
    for each class c in C:
        N_c = number of documents in D labeled c
        logprior[c] = log(N_c / N_doc)
        bigdoc[c] = concatenation of all documents labeled c
        for each word w in V:
            loglikelihood[w][c] = log((count(w, bigdoc[c]) + 1) /
                                      (len(bigdoc[c]) + |V|))
    return logprior, loglikelihood, V

PredictNaiveBayes(testdoc, logprior, loglikelihood, V):
    for each class c in C:
        sum[c] = logprior[c]
        for each word w in testdoc:
            if w in V:
                sum[c] += loglikelihood[w][c]
    return the c with the largest sum[c]

Model Evaluation

We treat the spam class as the positive class.

A prediction then falls into one of four buckets:

                 predicted spam   predicted ham
    actual spam       TP               FN
    actual ham        FP               TN

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Python Implementation

import csv
import math
import string
from collections import Counter

import numpy as np


def load_data(filename, train_ratio):
    # the Kaggle spam.csv is Latin-1 encoded, not UTF-8
    with open(filename, encoding='latin-1', newline='') as f:
        csv_reader = csv.reader(f)
        next(csv_reader)  # skip the header row
        dataset = [(line[0], line[1]) for line in csv_reader]

    np.random.shuffle(dataset)
    train_size = int(len(dataset) * train_ratio)
    return dataset[:train_size], dataset[train_size:]


def train(train_set):
    total_doc_cnt = len(train_set)

    label_doc_cnt = {}  # number of documents per label, for the priors
    bigdoc_words = {}   # all words of all documents under a label

    for label, doc in train_set:
        if label not in label_doc_cnt:
            # init
            label_doc_cnt[label] = 0
            bigdoc_words[label] = []

        label_doc_cnt[label] += 1
        bigdoc_words[label].extend([
            w.strip(string.punctuation) for w in doc.split()])

    vocabulary = set()
    for words in bigdoc_words.values():
        vocabulary |= set(words)
    # fix the word order so predict() can align likelihoods with it
    vocabulary = sorted(vocabulary)

    V = len(vocabulary)
    log_priors = {label: math.log(1.0 * cnt / total_doc_cnt)
                  for label, cnt in label_doc_cnt.items()}

    # add-one smoothed log likelihoods, aligned with the order of `vocabulary`
    log_likelihoods = {}
    for label, words in bigdoc_words.items():
        word_cnt = len(words) + V
        counts = Counter(words)  # one O(N) pass instead of words.count(w) per word
        log_likelihoods[label] = [
            math.log(1.0 * (1 + counts[w]) / word_cnt) for w in vocabulary]

    return log_priors, log_likelihoods, vocabulary


def predict(log_priors, log_likelihoods, vocabulary, input_text, expect_label=None):
    words = {w.strip(string.punctuation) for w in input_text.split()}

    prob_max = None  # log probs are negative, so 0 is not a safe sentinel
    label_max = None

    probs = {}  # kept only for logging misclassifications
    for label, likelihood in log_likelihoods.items():
        # log prior + log likelihood of every vocabulary word seen in the text
        prob = log_priors[label] + sum(
            p for w, p in zip(vocabulary, likelihood) if w in words)
        probs[label] = prob

        if prob_max is None or prob > prob_max:
            prob_max = prob
            label_max = label

    if expect_label and expect_label != label_max:
        print('---')
        print('expect: %s, got: %s' % (expect_label, label_max))
        print(probs)
        print(input_text)

    return label_max


def main():
    filename = 'input/spam.csv'
    train_ratio = 0.75
    train_data, test_data = load_data(filename, train_ratio)

    print('data loaded. train: {}, test: {}'.format(
        len(train_data), len(test_data)))

    # train the model
    log_priors, log_likelihoods, vocabulary = train(train_data)

    print('model trained. log_priors: {}, V(vocabulary word count): {}'.format(
        log_priors, len(vocabulary)))

    pos_true = 0   # spam predicted as spam (true positives, TP)
    pos_false = 0  # spam predicted as ham (false negatives, FN)
    neg_false = 0  # ham predicted as spam (false positives, FP)
    neg_true = 0   # ham predicted as ham (true negatives, TN)

    for label, text in test_data:
        got = predict(log_priors, log_likelihoods, vocabulary, text, label)
        if label != got:
            if label == 'spam':
                pos_false += 1
            else:
                neg_false += 1
        else:
            if label == 'spam':
                pos_true += 1
            else:
                neg_true += 1

    print('positive(spam) true: %s, false: %s' % (pos_true, pos_false))
    print('negative true: %s, false: %s' % (neg_true, neg_false))
    # Precision = TP / (TP + FP), Recall = TP / (TP + FN)
    print('Precision: %.2f%%, Recall: %.2f%%' % (
        100.0 * pos_true / (pos_true + neg_false),
        100.0 * pos_true / (pos_true + pos_false),
        ))


if __name__ == '__main__':
    main()

Results

load_data shuffles the dataset, so every run produces slightly different results, but Recall stays around 90% and Precision around 96%.

data loaded. train: 4179, test: 1393
model trained. log_priors: {'ham': -0.14608724117045765, 'spam': -1.995705843726764}, V(vocabulary word count): 9996
positive(spam) true: 156, false: 23
negative true: 1213, false: 1
Precision: 99.36%, Recall: 87.15%
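
Runs can be made reproducible by pinning numpy's random seed before load_data shuffles (not in the original code; the seed value is arbitrary):

import numpy as np

np.random.seed(42)  # any fixed value gives repeatable splits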

Details of the misclassified messages

---
expect: spam, got: ham
{'ham': -84.45531071975456, 'spam': -90.70055052987975}
You can donate £2.50 to UNICEF's Asian Tsunami disaster support fund by texting DONATE to 864233. £2.50 will be added to your next bill
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -52.676090263258466, 'spam': -56.80155705621162}
Are you unique enough? Find out from 30th August. www.areyouunique.co.uk
---
expect: spam, got: ham
{'ham': -70.17115950997167, 'spam': -72.51873090685471}
This message is brought to you by GMW Ltd. and is not connected to the
---
expect: spam, got: ham
{'ham': -139.40891845750357, 'spam': -147.473600840229}
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
---
expect: spam, got: ham
{'ham': -168.26318051822577, 'spam': -178.0225162634835}
Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES
---
expect: spam, got: ham
{'ham': -142.83829011312517, 'spam': -160.32809534946767}
ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.
---
expect: spam, got: ham
{'ham': -153.87952746055848, 'spam': -157.51773572622523}
How about getting in touch with folks waiting for company? Just txt back your NAME and AGE to opt in! Enjoy the community (150p/SMS)
---
expect: spam, got: ham
{'ham': -70.49633709940345, 'spam': -71.4324288990577}
Latest News! Police station toilet stolen, cops have nothing to go on!
---
expect: spam, got: ham
{'ham': -167.8034762079193, 'spam': -181.29820672852267}
Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1
---
expect: spam, got: ham
{'ham': -201.64646891761794, 'spam': -221.190966006233}
Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
---
expect: spam, got: ham
{'ham': -179.72840292764013, 'spam': -198.04988429454147}
Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!
---
expect: spam, got: ham
{'ham': -169.32983912276305, 'spam': -171.08314763971316}
Talk sexy!! Make new friends or fall in love in the worlds most discreet text dating service. Just text VIP to 83110 and see who you could meet.
---
expect: spam, got: ham
{'ham': -94.5052592368405, 'spam': -100.35544827044257}
Reminder: You have not downloaded the content you have already paid for. Goto http://doit. mymoby. tv/ to collect your content.
---
expect: spam, got: ham
{'ham': -92.32667266264504, 'spam': -98.95596194645321}
Dont forget you can place as many FREE Requests with 1stchoice.co.uk as you wish. For more Information call 08707808226.
---
expect: spam, got: ham
{'ham': -72.48756276984723, 'spam': -76.59107843644884}
Missed call alert. These numbers called but left no message. 07008009200
---
expect: spam, got: ham
{'ham': -77.48695175645791, 'spam': -93.71448200582458}
Did you hear about the new Divorce Barbie"? It comes with all of Ken's stuff!"
---
expect: spam, got: ham
{'ham': -178.2932238903798, 'spam': -184.67604680539299}
Am new 2 club & dont fink we met yet Will B gr8 2 C U Please leave msg 2day wiv ur area 09099726553 reply promised CARLIE x Calls£1/minMobsmore LKPOBOX177HP51FL
---
expect: spam, got: ham
{'ham': -146.47067170071077, 'spam': -149.22917007660618}
Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry scores with a simple shot from 6 yards from a pass by Bergkamp to give Arsenal a 2 goal margin after 78 mins.
---
expect: spam, got: ham
{'ham': -50.71127368750657, 'spam': -55.32646466594103}
Money i have won wining number 946 wot do i do next
---
expect: spam, got: ham
{'ham': -86.61673741525858, 'spam': -100.67904461391855}
Sorry I missed your call let's talk when you have the time. I'm on 07090201529
---
expect: spam, got: ham
{'ham': -135.8452563916395, 'spam': -138.23271598529016}
Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 £3
---
expect: spam, got: ham
{'ham': -106.60938155575177, 'spam': -115.73546864261806}
INTERFLORA - It's not too late to order Interflora flowers for christmas call 0800 505060 to place your order before Midnight tomorrow.
---
expect: ham, got: spam
{'ham': -109.18333505635592, 'spam': -107.80307740800048}
MAKE SURE ALEX KNOWS HIS BIRTHDAY IS OVER IN FIFTEEN MINUTES AS FAR AS YOU'RE CONCERNED

Tuning

We run just one very simple experiment: if we lowercase everything first and then count occurrences, do the results improve?

Three consecutive runs with the current code:

positive(spam) true: 188, false: 21
negative true: 1182, false: 2
Precision: 98.95%, Recall: 89.95%
---
positive(spam) true: 160, false: 16
negative true: 1209, false: 8
Precision: 95.24%, Recall: 90.91%
---
positive(spam) true: 167, false: 13
negative true: 1208, false: 5
Precision: 97.09%, Recall: 92.78%

The code change:

In load_data, change

dataset = [(line[0], line[1]) for line in csv_reader]

to

dataset = [(line[0], line[1].lower()) for line in csv_reader]

Three consecutive runs:

positive(spam) true: 164, false: 18
negative true: 1205, false: 6
Precision: 96.47%, Recall: 90.11%
---
positive(spam) true: 174, false: 20
negative true: 1197, false: 2
Precision: 98.86%, Recall: 89.69%
---
positive(spam) true: 162, false: 23
negative true: 1204, false: 4
Precision: 97.59%, Recall: 87.57%

The results barely change: Recall drops slightly while Precision rises slightly.

With only 3 runs the variance is too high to conclude which feature is better.

But neither did we see any obvious improvement or degradation.
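
To make such comparisons less noisy, one option (a sketch built on the load_data / train / predict functions defined above) is to average Precision and Recall over many random splits:

import numpy as np

def evaluate(n_runs=20, filename='input/spam.csv', train_ratio=0.75):
    precisions, recalls = [], []
    for _ in range(n_runs):
        train_data, test_data = load_data(filename, train_ratio)
        log_priors, log_likelihoods, vocabulary = train(train_data)

        tp = fp = fn = 0  # spam is the positive class
        for label, text in test_data:
            got = predict(log_priors, log_likelihoods, vocabulary, text)
            if got == 'spam' and label == 'spam':
                tp += 1
            elif got == 'spam':
                fp += 1
            elif label == 'spam':
                fn += 1

        precisions.append(100.0 * tp / (tp + fp))
        recalls.append(100.0 * tp / (tp + fn))

    print('Precision: %.2f%% (std %.2f)' % (np.mean(precisions), np.std(precisions)))
    print('Recall: %.2f%% (std %.2f)' % (np.mean(recalls), np.std(recalls)))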

Summary

  1. The training phase mainly computes the prior and the likelihood. Neither depends on the text inside any single document; both come from counting documents and words over the whole training set of each label.
  2. What drives the result is not the relative frequency of word A versus word B within one training set, but the gap in a word's probability across different label sets (see the sketch below).
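
Point 2 can be inspected directly with the values returned by train() above: rank each word by the gap between its log likelihood under spam and under ham (a sketch; the ranking helper is not part of the original code):

def top_spammy_words(log_likelihoods, vocabulary, n=10):
    # a positive gap means the word is far more likely under 'spam'
    gaps = [(spam_lp - ham_lp, w) for w, spam_lp, ham_lp in zip(
        vocabulary, log_likelihoods['spam'], log_likelihoods['ham'])]
    return sorted(gaps, reverse=True)[:n]

# for gap, word in top_spammy_words(log_likelihoods, vocabulary):
#     print('%6.2f  %s' % (gap, word))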