Document Filtering


Document filtering demonstrates how to classify documents based on their contents. Perhaps the most useful and well-known application of document filtering is the elimination of spam.

Because such classifiers simply learn to recognize whether a document belongs in one category or another, they can also be used for less unsavory purposes.


Filtering Spam

Early attempts to filter spam were all rule-based classifiers, where a person would design a set of rules that was supposed to indicate whether or not a message was spam. The shortcomings of rule-based classifiers quickly became apparent: spammers learned all the rules and stopped exhibiting the obvious behaviors, getting around the filters.

The other problem with rule-based filters is that what can be considered spam varies depending on where it’s being posted and for whom it is being written. Keywords that would strongly indicate spam for one particular user, message board, or Wiki may be quite normal for others.

Documents and Words

The classifier that you will be building needs features to use for classifying different
items. A feature is anything that you can determine as being either present or absent
in the item. When considering documents for classification, the items are the documents
and the features are the words in the document.

Determining which features to use is both very tricky and very important.
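As a concrete illustration, a minimal word-extraction function might look like the following sketch (getwords is a hypothetical name for the feature extractor, and the 2-to-20-character length filter is an illustrative choice, not necessarily the original code):

import re

def getwords(doc):
    splitter = re.compile(r'\W+')
    # Split the document on any run of non-alphanumeric characters
    words = [s.lower() for s in splitter.split(doc) if 2 < len(s) < 20]
    # Each word becomes a present/absent feature; repeat counts are ignored
    return dict((w, 1) for w in words)

Note that this treats a word the same whether it appears once or ten times, which matches the present-or-absent notion of a feature described above.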

Training the Classifier

The first thing you’ll need is a class to represent the classifier. This class will encapsulate
what the classifier has learned so far. The advantage of structuring the module
this way is that you can instantiate multiple classifiers for different users, groups, or
queries, and train them differently to respond to a particular group’s needs.


The classifier has three instance variables: fc, cc, and getfeatures.

1) The fc variable will store the counts for different features in different classifications. For example:
{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}

This shows how often the word "python" appears in documents classified as spam ("bad") versus non-spam ("good").

2) The cc variable is a dictionary of how many times every classification has been used.

3) getfeatures, is the function that will be used to extract the features from the items being classified.
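A minimal sketch of the constructor under these assumptions (the names classifier, fc, cc, and getfeatures come from the text; the use of defaultdict is my own convenience, not necessarily how the original code stores the counts):

from collections import defaultdict

class classifier:
    def __init__(self, getfeatures):
        # fc: counts of feature/category pairs,
        # e.g. {'python': {'bad': 0, 'good': 6}}
        self.fc = defaultdict(lambda: defaultdict(int))
        # cc: how many items have been seen in each category
        self.cc = defaultdict(int)
        # getfeatures: the feature-extraction function, e.g. getwords
        self.getfeatures = getfeatures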


The train method takes an item (a document in this case) and a classification. It uses the getfeatures function of the class to break the item into its separate features. It then calls incf to increase the counts for this classification for every feature. Finally, it increases the total count for this classification:
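A sketch of these methods, continuing the class above (incf is named in the text; incc is an assumed name for the category-count helper):

    def incf(self, f, cat):
        # Increase the count of a feature/category pair
        self.fc[f][cat] += 1

    def incc(self, cat):
        # Increase the count of a category
        self.cc[cat] += 1

    def train(self, item, cat):
        features = self.getfeatures(item)
        # Increment the count for every feature with this category
        for f in features:
            self.incf(f, cat)
        # Increment the total count for this category
        self.incc(cat)

For example, cl = classifier(getwords) followed by cl.train('the quick brown fox jumps', 'good') would bump cc['good'] by one and fc[w]['good'] by one for each extracted word w.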


Calculating Probabilities


The first number to calculate is Pr(word | classification), the fraction of documents in a given classification that contain the word; the classifier's fprob method returns exactly this:

>>> cl.fprob('quick','good')
0.66666666666666663


You can see that the word "quick" (the feature) appears in two of the three documents classified as good (the category): 2/3 ≈ 0.667.

However, fprob is unreliable when you have very little information about the feature in question: a word seen only once pushes the estimate all the way to 0 or 1. To compensate, start each feature at an assumed probability and let it move toward the observed value as more training data accumulates. A good assumed probability to start with is 0.5.
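Continuing the classifier sketch, fprob and a smoothed variant along these lines might look as follows (fprob appears in the session above; weightedprob, the weight of 1.0, and ap=0.5 are assumed names and defaults for the assumed-probability idea just described):

    def fcount(self, f, cat):
        # How many times a feature has appeared in a category
        return float(self.fc[f][cat])

    def catcount(self, cat):
        # Total number of items seen in a category
        return float(self.cc[cat])

    def fprob(self, f, cat):
        if self.catcount(cat) == 0:
            return 0.0
        # Pr(feature | category): fraction of documents in this
        # category that contain the feature
        return self.fcount(f, cat) / self.catcount(cat)

    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        # Probability from the observed counts alone
        basicprob = prf(f, cat)
        # Number of times this feature has appeared in any category
        totals = sum(self.fcount(f, c) for c in self.cc)
        # Weighted average of the assumed probability and the observed
        # one, so rarely seen features start near 0.5 rather than 0 or 1
        return (weight * ap + totals * basicprob) / (weight + totals)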


Naïve Bayes Classifier

This method is called naïve because it assumes that the probabilities being combined
are independent of each other.


This is actually a false assumption, since you’ll probably find that documents
containing the word “casino” are much more likely to contain the word
“money” than documents about Python programming are.

To use the naïve Bayesian classifier, you’ll first have to determine the probability of
an entire document being given a classification.


For example, suppose you’ve noticed that the word “Python” appears in 20 percent
of your bad documents—Pr(Python | Bad) = 0.2—and that the word “casino”
appears in 80 percent of your bad documents (Pr(Casino | Bad) = 0.8). You would
then expect the independent probability of both words appearing in a bad document—
Pr(Python & Casino | Bad)—to be 0.8 × 0.2 = 0.16. From this you can see
that calculating the entire document probability is just a matter of multiplying
together all the probabilities of the individual words in that document.
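As a sketch, that multiplication might live in a subclass of the hypothetical classifier above (naivebayes, docprob, and prob are assumed names; prob multiplies in the category's prior, anticipating the Bayes formula discussed next):

class naivebayes(classifier):
    def docprob(self, item, cat):
        features = self.getfeatures(item)
        # Multiply together Pr(feature | category) for every feature
        p = 1.0
        for f in features:
            p *= self.weightedprob(f, cat, self.fprob)
        return p

    def prob(self, item, cat):
        # Pr(category | document) is proportional to
        # Pr(document | category) * Pr(category); the shared divisor
        # Pr(document) can be ignored when only comparing categories
        catprob = self.catcount(cat) / sum(self.cc.values())
        return self.docprob(item, cat) * catprob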


If it is still not clear what naïve Bayes classification really is, here is an easier-to-follow explanation:


This theorem solves a problem we constantly run into in real life: given one conditional probability, how do we obtain the probability with the two events swapped? That is, knowing P(A|B), how do we find P(B|A)?


Bayes' theorem is useful because this situation comes up all the time: P(A|B) is often easy to obtain directly while P(B|A) is hard to measure, yet P(B|A) is what we actually care about. Bayes' theorem gives us the path from P(A|B) to P(B|A). The formulas:

P(A|B) = P(AB) / P(B)

P(B|A) = P(AB) / P(A) = P(A|B) P(B) / P(A)
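For example, with made-up numbers: if P(A|B) = 0.8, P(B) = 0.1, and P(A) = 0.2, then P(B|A) = 0.8 × 0.1 / 0.2 = 0.4.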


http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html

After working through the real-versus-fake SNS account detection example in that blog post, it should all make sense.

That example also shows that when there are enough feature attributes, naïve Bayes classification is resistant to interference from any single attribute.


Input:

1) Feature attributes: X = {a1, a2, a3, ..., an}

2) Classes: Y = {Y1, Y2, ..., Ym}

3) Training samples


Output:

Given an individual x, find the class Yk to which x belongs.


Example: suppose we want to determine which Chinese province a man comes from. His feature values are {1.72, 0.75, yellow}.


Feature attributes: X = {height, accent, skin color}

Classes: {DB, HN, GD}

Feature attribute partitions: height is split into three ranges (z < 1.65, 1.65 <= z < 1.75, z >= 1.75); accent, measured by closeness to standard Mandarin, into (h < 0.3, 0.3 <= h < 0.7, h >= 0.7); and skin color into {white, yellow, black}.


Training sample statistics:

1) Compute the frequency of each class in the training samples:

P(DB) = 0.32, P(HN) = 0.43, P(GD) = 0.25

2) Compute the frequency of each feature attribute partition conditioned on each class.

For class DB:

P(z < 1.65 | DB) = 0.1
P(1.65 <= z < 1.75 | DB) = 0.3
P(z >= 1.75 | DB) = 0.6

P(h < 0.3 | DB) = 0.15
P(0.3 <= h < 0.7 | DB) = 0.1
P(h >= 0.7 | DB) = 0.75

P(white | DB) = 0.35
P(yellow | DB) = 0.55
P(black | DB) = 0.1


For class HN:

P(z < 1.65 | HN) = 0.1
P(1.65 <= z < 1.75 | HN) = 0.65
P(z >= 1.75 | HN) = 0.25

P(h < 0.3 | HN) = 0.1
P(0.3 <= h < 0.7 | HN) = 0.15
P(h >= 0.7 | HN) = 0.75

P(white | HN) = 0.2
P(yellow | HN) = 0.6
P(black | HN) = 0.2

For class GD:

P(z < 1.65 | GD) = 0.35
P(1.65 <= z < 1.75 | GD) = 0.45
P(z >= 1.75 | GD) = 0.2

P(h < 0.3 | GD) = 0.8
P(0.3 <= h < 0.7 | GD) = 0.1
P(h >= 0.7 | GD) = 0.1

P(white | GD) = 0.1
P(yellow | GD) = 0.55
P(black | GD) = 0.35

3) Determine the sample's class.

Score for sample x = {1.72, 0.75, yellow} belonging to DB:

P(DB) P(x|DB) = P(DB) × P(1.65 <= z < 1.75 | DB) × P(h >= 0.7 | DB) × P(yellow | DB) = 0.32 × 0.3 × 0.75 × 0.55 = 0.0396


Score for HN:

P(HN) P(x|HN) = P(HN) × P(1.65 <= z < 1.75 | HN) × P(h >= 0.7 | HN) × P(yellow | HN) = 0.43 × 0.65 × 0.75 × 0.6 = 0.125775


Score for GD:

P(GD) P(x|GD) = P(GD) × P(1.65 <= z < 1.75 | GD) × P(h >= 0.7 | GD) × P(yellow | GD) = 0.25 × 0.45 × 0.1 × 0.55 = 0.0061875


HN has the largest score, so x belongs to HN.
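A minimal Python sketch of the whole computation, with the priors and conditional frequencies hard-coded from the tables above (the partition labels such as 'z_mid' and 'h_high' are my own shorthand):

# Class priors from the training sample
priors = {'DB': 0.32, 'HN': 0.43, 'GD': 0.25}

# P(partition | class) for the partitions that x = {1.72, 0.75, yellow}
# falls into: height 1.72 -> 1.65 <= z < 1.75 ('z_mid'),
# accent 0.75 -> h >= 0.7 ('h_high'), skin color -> 'yellow'
cond = {
    'DB': {'z_mid': 0.30, 'h_high': 0.75, 'yellow': 0.55},
    'HN': {'z_mid': 0.65, 'h_high': 0.75, 'yellow': 0.60},
    'GD': {'z_mid': 0.45, 'h_high': 0.10, 'yellow': 0.55},
}

x = ['z_mid', 'h_high', 'yellow']

scores = {}
for cat in priors:
    p = priors[cat]
    for attr in x:
        p *= cond[cat][attr]   # naive independence assumption
    scores[cat] = p

# scores is approximately {'DB': 0.0396, 'HN': 0.125775, 'GD': 0.0061875}
print(max(scores, key=scores.get))   # prints 'HN'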














