Spam Email Classification with Naive Bayes + Python
Naive Bayes Principles
Python Implementation
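The core of the source code comes from the article "python实现贝叶斯推断——垃圾邮件分类" (Bayesian inference in Python: spam email classification).
I only added comments and extra output that reports statistics on the results.
Source download: GitHub: download NaiveBayesEmail.py
This article was originally published as: 基于朴素贝叶斯+Python实现垃圾邮件分类 (Spam Email Classification with Naive Bayes + Python)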
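Result Analysis
A word that appears only in spam has no observed frequency in non-spam (ham), and vice versa. For such a word, its probability in the class where it never appears is set to a fixed value, P(not_appear).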
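The fallback described above can be sketched as follows. This is a minimal illustration, not the code in NaiveBayesEmail.py: the helper `word_probabilities`, the per-email frequency estimate, and the toy emails are all assumptions made for the example.

```python
from collections import Counter

P_NOT_APPEAR = 0.01  # fallback value; the article also tries 0.05


def word_probabilities(emails, vocabulary, p_not_appear=P_NOT_APPEAR):
    """Return P(word | class) as the fraction of emails containing the word.

    Hypothetical helper: words that never occur in this class get
    p_not_appear instead of 0, so later products and ratios stay non-zero.
    """
    containing = Counter()
    for tokens in emails:
        containing.update(set(tokens))  # count each word once per email
    total = len(emails)
    return {
        word: containing[word] / total if containing[word] else p_not_appear
        for word in vocabulary
    }


# Toy data, made up for illustration only.
spam = [["free", "money"], ["free", "offer"]]
ham = [["meeting", "money"]]
vocabulary = {word for email in spam + ham for word in email}

p_word_given_spam = word_probabilities(spam, vocabulary)
p_word_given_ham = word_probabilities(ham, vocabulary)
print(p_word_given_spam["meeting"])  # 0.01, "meeting" never appears in spam
print(p_word_given_ham["free"])      # 0.01, "free" never appears in ham
```

Without the fallback, a single word with probability 0 in one class would force the whole estimate for that class to 0, regardless of the other words in the email.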
1) Results with P(not_appear) = 0.01:
With stop words removed:
Without stop-word removal:
2) Results with P(not_appear) = 0.05:
With stop words removed:
Without stop-word removal:
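As the results show:
- removing stop words or not makes little difference;
- the larger P(not_appear) is, the more often spam is misclassified as ham.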
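The second observation is consistent with how a single word's posterior reacts to the fallback value. The sketch below assumes Bayes' rule for one word with equal priors for spam and ham, and a hypothetical word that appears in 20% of spam emails and never in ham. The function `spam_posterior` and these numbers are illustrative assumptions, not taken from the original script.

```python
def spam_posterior(p_word_spam, p_word_ham, prior_spam=0.5):
    """P(spam | word) for a single word, by Bayes' rule."""
    spam_term = p_word_spam * prior_spam
    ham_term = p_word_ham * (1.0 - prior_spam)
    return spam_term / (spam_term + ham_term)


P_WORD_SPAM = 0.2  # hypothetical frequency of the word in spam

# The word never appears in ham, so P(word | ham) falls back to P(not_appear).
for p_not_appear in (0.01, 0.05):
    print(p_not_appear, round(spam_posterior(P_WORD_SPAM, p_not_appear), 3))
# 0.01 0.952
# 0.05 0.8
```

Raising P(not_appear) from 0.01 to 0.05 makes a spam-only word count as weaker evidence for spam, which pushes borderline spam emails toward the ham side.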
3) Inspecting the [number of spam emails misclassified as ham, total number of misclassifications] pairs
Spam misclassified as ham over 100 runs with P(not_appear) = 0.05, without stop-word removal.
Each entry has the form [wrong_spamToham, wrong].
Result 1:
[[1, 1], [1, 1], [2, 2], [2, 2], [2, 2], [3, 3], [1, 1], [2, 2], [2, 2], [4, 4], [3, 3], [1, 1], [1, 1], [2, 2], [5, 5], [1, 1], [1, 1], [1, 1], [1, 1], [2, 2], [2, 2], [-1], [2, 2], [1, 1], [2, 2], [-1], [3, 3], [2, 2], [1, 1], [1, 1], [2, 2], [-1], [4, 4], [1, 1], [3, 3], [2, 2], [2, 2], [3, 3], [2, 2], [3, 3], [2, 2], [2, 2], [1, 1], [1, 1], [-1], [1, 1], [1, 1], [2, 2], [-1], [2, 2], [1, 1], [2, 2], [1, 1], [-1], [2, 2], [2, 2], [2, 2], [3, 3], [4, 4], [1, 1], [2, 2], [1, 1], [2, 2], [3, 3], [3, 3], [-1], [3, 3], [2, 2], [2, 2], [2, 2], [2, 2], [3, 3], [3, 3], [2, 2], [5, 5], [2, 2], [-1], [4, 4], [3, 3], [4, 4], [1, 1], [3, 3], [1, 1], [1, 1], [-1], [1, 1], [1, 1], [1, 1], [3, 3], [2, 2], [1, 1], [2, 2], [4, 4], [2, 2], [3, 3], [3, 3], [2, 2], [1, 1], [2, 2], [1, 1]]
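A list like the one above can be summarized with a few lines of Python. This is a rough sketch: the helper `summarize` is hypothetical, and it assumes that a `[-1]` entry is a sentinel for a run with no counted misclassifications, which the article does not state. The sample is a short contiguous excerpt of Result 1.

```python
def summarize(runs):
    """Aggregate [wrong_spamToham, wrong] pairs across runs.

    Assumption: [-1] marks a run without counted errors and is skipped.
    """
    scored = [run for run in runs if run != [-1]]
    spam_to_ham = sum(run[0] for run in scored)
    wrong = sum(run[1] for run in scored)
    return {
        "runs": len(runs),
        "sentinel_runs": len(runs) - len(scored),
        "wrong_spamToham": spam_to_ham,
        "wrong": wrong,
        # Share of all errors that are spam classified as ham.
        "spam_to_ham_share": spam_to_ham / wrong if wrong else None,
    }


excerpt = [[1, 1], [2, 2], [2, 2], [-1], [2, 2], [1, 1], [2, 2], [-1]]
summary = summarize(excerpt)
print(summary["wrong_spamToham"], summary["wrong"])  # 10 10
print(summary["spam_to_ham_share"])                  # 1.0
```

In this excerpt both numbers of every pair are equal, so every counted error is a spam email classified as ham.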
Result 2:
[[1, 1], [1, 1], [4, 4], [1, 1], [2, 2], [1, 1], [3, 3], [-1], [-1], [4, 4], [1, 1], [2, 2], [-1], [3, 3], [5, 5], [2, 2], [1, 1]