Document filtering demonstrates how to classify documents based on their contents. Perhaps the most useful and well-known application of document filtering is the elimination of spam.
Because such classifiers learn to recognize whether a document belongs in one category or another, they can also be used for many less unsavory purposes.
Filtering Spam
Early attempts to filter spam were all rule-based classifiers, where a person would design a set of rules that was supposed to indicate whether or not a message was spam. The shortcomings of rule-based classifiers quickly became apparent: spammers learned all the rules and stopped exhibiting the obvious behaviors to get around the filters.
The other problem with rule-based filters is that what can be considered spam varies depending on where it’s being posted and for whom it is being written. Keywords that would strongly indicate spam for one particular user, message board, or Wiki may be quite normal for others.
Documents and Words
The classifier that you will be building needs features to use for classifying different
items. A feature is anything that you can determine as being either present or absent
in the item. When considering documents for classification, the items are the documents
and the features are the words in the document.
Determining which features to use is both very tricky and very important.
Training the classifier
The first thing you’ll need is a class to represent the classifier. This class will encapsulate
what the classifier has learned so far. The advantage of structuring the module
this way is that you can instantiate multiple classifiers for different users, groups, or
queries, and train them differently to respond to a particular group’s needs.
The classifier has three instance variables: fc, cc, and getfeatures.
1) The fc variable will store the counts for different features in different classifications. For example:
{'python': {'bad': 0, 'good': 6}, 'the': {'bad': 3, 'good': 3}}
This shows how often the word "python" has appeared in documents classified as spam (bad) versus legitimate documents (good).
2) The cc variable is a dictionary of how many times every classification has been used.
3) The getfeatures variable holds the function that will be used to extract the features from the items being classified.
The train method takes an item (a document in this case) and a classification. It uses the getfeatures function of the class to break the item into its separate features. It then calls incf to increase the counts for this classification for every feature. Finally, it increases the total count for this classification:
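A minimal sketch of such a classifier class, assuming a simple word-splitting feature extractor (the getwords helper and the exact dictionary layout here follow the description above, not any canonical source):

```python
import re

def getwords(doc):
    # Split on non-word characters, lowercase, and drop very short
    # or very long tokens; each distinct word is one feature.
    words = [w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20]
    return set(words)

class Classifier:
    def __init__(self, getfeatures):
        self.fc = {}          # {feature: {category: count}}
        self.cc = {}          # {category: count}
        self.getfeatures = getfeatures

    def incf(self, f, cat):
        # Increase the count of a feature/category pair.
        self.fc.setdefault(f, {}).setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):
        # Increase the count of a category.
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):
        # How many times a feature has appeared in a category.
        return self.fc.get(f, {}).get(cat, 0)

    def catcount(self, cat):
        # How many items are in a category.
        return self.cc.get(cat, 0)

    def train(self, item, cat):
        # Break the item into features, count each one for this
        # category, then bump the category total.
        for f in self.getfeatures(item):
            self.incf(f, cat)
        self.incc(cat)
```

Instantiating Classifier(getwords) and calling train repeatedly builds up the fc and cc dictionaries shown earlier.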
Calculating Probabilities
>>> cl.fprob('quick','good')
0.66666666666666663
This shows that the word "quick" (the feature) appears in two of the three documents classified as good (the category).
Raw counts give extreme probabilities when you have very little information about the feature in question, so it is better to blend in an assumed probability for every feature. A good number to start with is 0.5.
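A sketch of both probabilities, working directly on the fc/cc dictionaries described earlier (the function names and the weight/ap parameter names are illustrative, mirroring the description above):

```python
# fc: {feature: {category: count}}, cc: {category: count}

def fprob(fc, cc, f, cat):
    # Pr(feature | category): the fraction of documents in this
    # category that contain the feature.
    if cc.get(cat, 0) == 0:
        return 0.0
    return fc.get(f, {}).get(cat, 0) / cc[cat]

def weightedprob(fc, cc, f, cat, weight=1.0, ap=0.5):
    # Blend the observed probability with an assumed probability ap,
    # weighting the assumption as if it were `weight` observations.
    basic = fprob(fc, cc, f, cat)
    totals = sum(fc.get(f, {}).values())   # feature count over all categories
    return (weight * ap + totals * basic) / (weight + totals)

# "quick" appears in 2 of 3 good documents, as in the fprob example above.
fc = {'quick': {'good': 2, 'bad': 1}}
cc = {'good': 3, 'bad': 2}
fprob(fc, cc, 'quick', 'good')   # 2/3
```

A feature that has never been seen gets exactly the assumed 0.5, and the assumption fades as real observations accumulate.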
Naïve Bayes Classifier
This method is called naïve because it assumes that the probabilities being combined
are independent of each other.
This is actually a false assumption, since you’ll probably find that documents
containing the word “casino” are much more likely to contain the word
“money” than documents about Python programming are.
To use the naïve Bayesian classifier, you’ll first have to determine the probability of
an entire document being given a classification.
For example, suppose you’ve noticed that the word “Python” appears in 20 percent
of your bad documents—Pr(Python | Bad) = 0.2—and that the word “casino”
appears in 80 percent of your bad documents (Pr(Casino | Bad) = 0.8). You would
then expect the independent probability of both words appearing in a bad document—
Pr(Python & Casino | Bad)—to be 0.8 × 0.2 = 0.16. From this you can see
that calculating the entire document probability is just a matter of multiplying
together all the probabilities of the individual words in that document.
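Following the casino/python numbers above, the document probability is just a running product of the individual word probabilities; a minimal sketch (docprob is a hypothetical helper name):

```python
def docprob(word_probs):
    # Multiply together Pr(word | category) for every word in the
    # document, assuming the words are independent.
    p = 1.0
    for wp in word_probs:
        p *= wp
    return p

# Pr(Python & Casino | Bad) = 0.2 * 0.8 = 0.16
docprob([0.2, 0.8])
```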
I still didn't fully understand what a Naïve Bayes classifier is from the above, so here is an easier explanation:
Bayes' theorem solves a problem we constantly run into in real life: given one conditional probability, how do you get the probability with the two events swapped? That is, knowing P(A|B), how do you obtain P(B|A)?
Bayes' theorem is useful precisely because of this common situation: P(A|B) is often easy to measure directly, while P(B|A) is hard to measure directly, yet P(B|A) is what we actually care about. Bayes' theorem opens the road from P(A|B) to P(B|A). The formulas are:
P(A|B) = P(A∩B) / P(B)
P(B|A) = P(A∩B) / P(A) = P(A|B)P(B) / P(A)
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html
After reading the SNS fake-account detection example in that blog post, it should all make sense.
That example also shows that when there are enough feature attributes, naïve Bayes classification is robust against interference from any individual attribute.
Input:
1) Feature attributes: X = {a1, a2, a3, ..., an}
2) Classes: Y = {Y1, Y2, ..., Ym}
3) Training samples
Output:
Given an individual x, determine which class Yk it belongs to.
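The decision rule in the input/output description above can be sketched generically: pick the class Yk that maximizes P(Yk) multiplied by the product of P(ai | Yk) over x's attribute values (the function and parameter names here are illustrative):

```python
def classify(x, priors, cond):
    # x: list of observed attribute values
    # priors: {class: P(class)}
    # cond: {class: {attribute_value: P(value | class)}}
    best, best_score = None, -1.0
    for cls, prior in priors.items():
        score = prior
        for a in x:
            # Unseen attribute values get probability 0 in this sketch;
            # real implementations would smooth these.
            score *= cond[cls].get(a, 0.0)
        if score > best_score:
            best, best_score = cls, score
    return best, best_score
```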
Example: suppose we want to guess which Chinese province a man comes from. His feature values are {1.72, 0.75, yellow}.
Feature attributes: X = {height, accent, skin color}
Classes: {DB, HN, GD}
Feature attribute divisions: height z is split into three ranges (z < 1.65, 1.65 <= z < 1.75, z >= 1.75); accent h is scored by closeness to standard Mandarin and split into {h < 0.3, 0.3 <= h < 0.7, h >= 0.7}; skin color is split into {white, yellow, black}.
Training statistics:
1) Compute the frequency of each class in the training samples:
P(DB) = 0.32, P(HN) = 0.43, P(GD) = 0.25
2) Compute the frequency of each attribute division conditioned on each class.
For the three attributes within class DB:
P(z < 1.65 | DB) = 0.1
P(1.65 <= z < 1.75 | DB) = 0.3
P(z >= 1.75 | DB) = 0.6
P(h < 0.3 | DB) = 0.15
P(0.3 <= h < 0.7 | DB) = 0.1
P(h >= 0.7 | DB) = 0.75
P(white | DB) = 0.35
P(yellow | DB) = 0.55
P(black | DB) = 0.1
For the three attribute divisions within class HN:
P(z < 1.65 | HN) = 0.1
P(1.65 <= z < 1.75 | HN) = 0.65
P(z >= 1.75 | HN) = 0.25
P(h < 0.3 | HN) = 0.1
P(0.3 <= h < 0.7 | HN) = 0.15
P(h >= 0.7 | HN) = 0.75
P(white | HN) = 0.2
P(yellow | HN) = 0.6
P(black | HN) = 0.2
For the three attribute divisions within class GD:
P(z < 1.65 | GD) = 0.35
P(1.65 <= z < 1.75 | GD) = 0.45
P(z >= 1.75 | GD) = 0.2
P(h < 0.3 | GD) = 0.8
P(0.3 <= h < 0.7 | GD) = 0.1
P(h >= 0.7 | GD) = 0.1
P(white | GD) = 0.1
P(yellow | GD) = 0.55
P(black | GD) = 0.35
3) Classify the sample. The score for x = {1.72, 0.75, yellow} belonging to DB:
P(DB)P(x|DB) = P(DB) * P(1.65 <= z < 1.75 | DB) * P(h >= 0.7 | DB) * P(yellow | DB) = 0.32 * 0.3 * 0.75 * 0.55 = 0.0396
The score for HN:
P(HN)P(x|HN) = P(HN) * P(1.65 <= z < 1.75 | HN) * P(h >= 0.7 | HN) * P(yellow | HN) = 0.43 * 0.65 * 0.75 * 0.6 = 0.125775
The score for GD:
P(GD)P(x|GD) = P(GD) * P(1.65 <= z < 1.75 | GD) * P(h >= 0.7 | GD) * P(yellow | GD) = 0.25 * 0.45 * 0.1 * 0.55 = 0.0061875
HN has the highest score, so x is classified as HN.
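As a sanity check, the three scores can be recomputed with plain arithmetic (the class labels and numbers are the ones from the example above):

```python
# prior * P(height bucket | class) * P(accent bucket | class) * P(yellow | class)
scores = {
    'DB': 0.32 * 0.3  * 0.75 * 0.55,
    'HN': 0.43 * 0.65 * 0.75 * 0.6,
    'GD': 0.25 * 0.45 * 0.1  * 0.55,
}
best = max(scores, key=scores.get)  # 'HN' wins
```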