Notes on Introduction to Information Retrieval, Chapter 13 (in English)

Text classification and Naive Bayes

Abstract

To capture the generality and scope of the problem space to which standing queries belong, this chapter introduces the general notion of a classification problem.

Classification using standing queries is also called routing or filtering.

Most retrieval systems today contain multiple components that use some form of classifier.

Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. This approach is also called statistical text classification if the learning method is statistical.

In this situation, we require a number of good example documents for each class.

The text classification problem

Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes:

$$\gamma : \mathbb{X} \to \mathbb{C}$$

where $\mathbb{X}$ is the document space and $\mathbb{C} = \{c_1, c_2, \ldots, c_J\}$ is a fixed set of classes.

This type of learning is called supervised learning because a supervisor serves as a teacher directing the learning process.

  • Example - Reuters-RCV1 collection

[Figure: classes in the Reuters-RCV1 collection, with training and test documents]

The training set provides some typical examples for each class, so that we can learn the classification function γ.

Once we have learned γ, we can apply it to the test set (or test data).

Naive Bayes text classification

The probability of a document d being in class c is computed as

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

$\langle t_1, t_2, \ldots, t_{n_d} \rangle$ are the tokens in d that are part of the vocabulary we use for classification, and $n_d$ is the number of such tokens in d.

In text classification, our goal is to find the best class for the document.

  • MAP (maximum a posteriori) class

$$c_{map} = \arg\max_{c \in \mathbb{C}} \hat{P}(c \mid d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k \mid c)$$

The hat on P̂ indicates that these are estimates: we do not know the true values of the parameters, but estimate them from the training set, as we will see in a moment.

Multiplying many small conditional probabilities can result in floating point underflow, so the maximization that is actually done in most implementations of NB adds logarithms of probabilities instead:

$$c_{map} = \arg\max_{c \in \mathbb{C}} \Big[\, \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \,\Big]$$

The sum of log prior and term weights is then a measure of
how much evidence there is for the document being in the class, and the above equation selects the class for which we have the most evidence.
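
A quick illustration of why the computation is done in log space (a minimal sketch; the numbers are arbitrary). Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.

```python
import math

probs = [1e-7] * 200                          # many small conditional probabilities

product = 1.0
for p in probs:
    product *= p                              # underflows to 0.0 in double precision

log_sum = sum(math.log(p) for p in probs)     # stays finite: 200 * ln(1e-7) ≈ -3223.6

print(product, log_sum)                       # 0.0 -3223.6...
```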

  1. Estimate the prior $\hat{P}(c)$.

We use the maximum likelihood estimate (MLE):

$$\hat{P}(c) = \frac{N_c}{N}$$

where Nc is the number of documents in class c and N is the total number of documents

  2. Estimate the conditional probability $\hat{P}(t|c)$ as the relative frequency of term t in documents belonging to class c:

$$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$$

where Tct is the number of occurrences of t in training documents from class c, including multiple occurrences of a term in a document.

  3. To eliminate the zero estimates that arise for term-class combinations that did not occur in the training set, use add-one (Laplace) smoothing:

$$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\big(\sum_{t' \in V} T_{ct'}\big) + B}$$

where $B = |V|$ is the number of terms in the vocabulary.

  • Pseudo code

Training part

[Figure: pseudocode for training a multinomial Naive Bayes classifier]

Testing part

[Figure: pseudocode for applying a multinomial Naive Bayes classifier to a test document]
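
The pseudocode itself is not reproduced above; the following is a minimal Python sketch of the same two steps (training with add-one smoothing, testing with log-space scoring). Function and variable names are mine, not taken from the pseudocode.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Train a multinomial NB model.

    docs: list of (tokens, class_label) pairs.
    Returns the vocabulary, the priors P̂(c), and the smoothed conditionals P̂(t|c).
    """
    vocab = {t for tokens, _ in docs for t in tokens}
    doc_count = Counter(c for _, c in docs)               # N_c: documents per class
    term_count = defaultdict(Counter)                     # T_ct: token counts per class
    for tokens, c in docs:
        term_count[c].update(tokens)
    prior = {c: doc_count[c] / len(docs) for c in doc_count}
    cond_prob = {}
    for c in doc_count:
        denom = sum(term_count[c].values()) + len(vocab)  # add-one smoothing denominator
        cond_prob[c] = {t: (term_count[c][t] + 1) / denom for t in vocab}
    return vocab, prior, cond_prob

def apply_multinomial_nb(vocab, prior, cond_prob, tokens):
    """Return the class with the highest log-space score for the token list."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in tokens:
            if t in vocab:                                # out-of-vocabulary tokens are ignored
                score += math.log(cond_prob[c][t])
        scores[c] = score
    return max(scores, key=scores.get)
```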

  • Example
    [Figure: training set and test document for the worked example]

According to the above figure, we get:
[Figure: estimated parameters and classification result for the worked example]
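
The example figures are not reproduced above. As a stand-in, here is the worked example from the textbook (three training documents labeled "China", one labeled "not China", and the test document "Chinese Chinese Chinese Tokyo Japan"), run through the sketch from the previous block; the data comes from the book, not from the missing figure.

```python
train_docs = [
    ("Chinese Beijing Chinese".split(), "China"),
    ("Chinese Chinese Shanghai".split(), "China"),
    ("Chinese Macao".split(), "China"),
    ("Tokyo Japan Chinese".split(), "not China"),
]
vocab, prior, cond_prob = train_multinomial_nb(train_docs)

print(prior["China"])                    # 0.75, i.e. N_c / N = 3/4
print(cond_prob["China"]["Chinese"])     # (5 + 1) / (8 + 6) = 3/7 ≈ 0.4286

test_doc = "Chinese Chinese Chinese Tokyo Japan".split()
print(apply_multinomial_nb(vocab, prior, cond_prob, test_doc))   # -> China
```

The test document is assigned to "China": the three occurrences of "Chinese" outweigh the occurrences of "Tokyo" and "Japan", which appear only in the "not China" training document.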

The Bernoulli model

There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model: under its generative assumption, it generates one term from the vocabulary at each position of the document.

The different generation models imply different estimation strategies and different classification rules.

The Bernoulli model estimates Pˆ(t|c) as the fraction of documents of class c that contain term t.
In contrast, the multinomial model estimates Pˆ(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain term t.

When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences.

Note:
the Bernoulli model typically makes many mistakes when classifying long documents

  • Bernoulli model (NB Algorithm)
  1. Training

[Figure: pseudocode for training a Bernoulli Naive Bayes classifier]

  2. Testing

[Figure: pseudocode for applying a Bernoulli Naive Bayes classifier to a test document]
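
Again the pseudocode is not reproduced; below is a minimal sketch under the same conventions (names are mine). Note that testing iterates over the whole vocabulary, using 1 − P̂(t|c) for terms that are absent from the document.

```python
import math
from collections import Counter

def train_bernoulli_nb(docs):
    """docs: list of (tokens, class_label). P̂(t|c) = fraction of class-c documents containing t."""
    vocab = {t for tokens, _ in docs for t in tokens}
    doc_count = Counter(c for _, c in docs)                     # N_c: documents per class
    contains = {c: Counter() for c in doc_count}                # N_ct: docs of class c containing t
    for tokens, c in docs:
        contains[c].update(set(tokens))                         # binary occurrence per document
    prior = {c: doc_count[c] / len(docs) for c in doc_count}
    cond_prob = {
        c: {t: (contains[c][t] + 1) / (doc_count[c] + 2) for t in vocab}   # add-one smoothing
        for c in doc_count
    }
    return vocab, prior, cond_prob

def apply_bernoulli_nb(vocab, prior, cond_prob, tokens):
    """Every vocabulary term participates: present terms via P̂(t|c), absent terms via 1 - P̂(t|c)."""
    present = set(tokens) & vocab
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            p = cond_prob[c][t]
            score += math.log(p) if t in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)
```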

Properties of Naive Bayes

We decide class membership of a document by assigning it to the class with the maximum a posteriori probability

$$c_{map} = \arg\max_{c \in \mathbb{C}} P(c \mid d) = \arg\max_{c \in \mathbb{C}} P(d \mid c)\,P(c)$$

where the denominator $P(d)$ has been dropped because it is the same for all classes.

The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution P(d|c)

$$P(d \mid c) = P(\langle t_1, \ldots, t_{n_d} \rangle \mid c) \qquad \text{(multinomial)}$$

$$P(d \mid c) = P(\langle e_1, \ldots, e_M \rangle \mid c) \qquad \text{(Bernoulli)}$$

where $\langle e_1, \ldots, e_M \rangle$ is an M-dimensional vector of binary indicators of whether each vocabulary term occurs in d.

The multinomial NB model.
[Figure: the multinomial NB model as a generative model (the class generates one term at each token position)]

The Bernoulli NB model
[Figure: the Bernoulli NB model as a generative model (the class generates a binary occurrence indicator for each vocabulary term)]

  • we compare the two models

[Table: comparison of the multinomial and Bernoulli models (event model, document representation, parameter estimation, decision rule, handling of multiple occurrences and document length)]

Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true.

Note:
The computational granularity of the two models is different: the multinomial model works at the granularity of word tokens, whereas the Bernoulli model works at the granularity of documents. As a result, the two models compute the prior and the class-conditional probabilities differently.
When computing the posterior probability of a document d, only the terms that occur in d participate in the multinomial model. In the Bernoulli model, terms that do not occur in d but do occur in the vocabulary also participate in the computation, contributing the "opposite side", i.e. the non-occurrence factor 1 − P̂(t|c).

  • Python with sklearn
  1. BernoulliNB

from sklearn.naive_bayes import BernoulliNB
# X_train: binary document-term matrix; y_train: class labels
bnb = BernoulliNB()
bnb.fit(X_train, y_train)

  2. MultinomialNB

from sklearn.naive_bayes import MultinomialNB
# X_train: document-term count matrix; y_train: class labels
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
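
The snippets above assume that X_train and y_train already exist. A minimal end-to-end sketch with CountVectorizer, reusing the toy documents from the earlier example (alpha=1.0 corresponds to add-one smoothing):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy documents (the textbook example from earlier); replace with real data.
texts = ["Chinese Beijing Chinese", "Chinese Chinese Shanghai",
         "Chinese Macao", "Tokyo Japan Chinese"]
labels = ["China", "China", "China", "not China"]

vectorizer = CountVectorizer()                 # term-count features for the multinomial model
X_train = vectorizer.fit_transform(texts)

clf = MultinomialNB(alpha=1.0)                 # alpha=1.0 is add-one (Laplace) smoothing
clf.fit(X_train, labels)

X_test = vectorizer.transform(["Chinese Chinese Chinese Tokyo Japan"])
print(clf.predict(X_test))                     # -> ['China']
```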

A variant of the multinomial model

P(d|c) is then computed as follows

$$P(d \mid c) = P(\langle \mathrm{tf}_{t_1,d}, \ldots, \mathrm{tf}_{t_M,d} \rangle \mid c) \propto \prod_{1 \le i \le M} P(X = t_i \mid c)^{\mathrm{tf}_{t_i,d}}$$

where $\mathrm{tf}_{t_i,d}$ is the term frequency of $t_i$ in d.

This is an alternative formalization of the multinomial model, in which each document is represented as an M-dimensional vector of term counts rather than as a sequence of token positions.

Feature selection

Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.

  • Two main purposes
  1. it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
  2. feature selection often increases classification accuracy by eliminating noise features.
  • The basic feature selection algorithm is shown in the figure below; a Python sketch follows.

[Figure: basic feature selection algorithm for selecting the k best features for a class]
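
The figure is not reproduced; the generic algorithm is simple: compute a utility measure A(t, c) for each term of the vocabulary and keep the k terms with the highest values. A minimal sketch, where the utility function is passed in (it could be mutual information, χ², or frequency, as discussed below; names are mine):

```python
def select_features(vocabulary, cls, k, utility):
    """Return the k terms with the highest utility A(t, c) for class cls.

    `utility` is any function (term, cls) -> float, e.g. mutual information,
    chi-squared, or raw frequency.
    """
    scored = [(utility(t, cls), t) for t in vocabulary]
    scored.sort(reverse=True)                 # highest utility first
    return [t for _, t in scored[:k]]
```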

Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.

Mutual information

MI measures how much information the presence or absence of a term contributes to making the correct classification decision on c.

$$I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}$$

where U is 1 if the document contains term t (0 otherwise) and C is 1 if the document is in class c (0 otherwise).
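
A minimal sketch of estimating this MI from document counts, where N11 is the number of documents that contain t and are in c, N10 contain t but are not in c, N01 are in c without t, and N00 neither (the function name and the count convention are my own):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) estimated from a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),   # term present, in class
        (n10, n11 + n10, n10 + n00),   # term present, not in class
        (n01, n01 + n00, n11 + n01),   # term absent,  in class
        (n00, n01 + n00, n10 + n00),   # term absent,  not in class
    ]:
        if n_tc > 0:                   # 0 * log 0 is taken to be 0
            total += (n_tc / n) * math.log2(n * n_tc / (n_t * n_c))
    return total
```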

χ² feature selection

In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:

$$X^2(\mathbb{D}, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}$$

where $N_{e_t e_c}$ is the observed frequency and $E_{e_t e_c}$ the expected frequency of documents with term indicator $e_t$ and class indicator $e_c$, the expected frequency being computed under the assumption that term and class are independent.
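
A minimal sketch of the same statistic from the 2×2 counts used above, with expected frequencies computed from the marginals under the independence assumption (names are mine). scikit-learn's sklearn.feature_selection.chi2 computes a related per-feature statistic directly from a document-term matrix and can be used for the same ranking.

```python
def chi_squared(n11, n10, n01, n00):
    """X^2 statistic for the independence of term occurrence and class membership."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    for observed, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        expected = n_t * n_c / n        # E = N * P(term indicator) * P(class indicator)
        if expected > 0:
            total += (observed - expected) ** 2 / expected
    return total
```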

Frequency-based feature selection

Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
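
Frequency-based selection simply keeps the terms that are most common in the class, measured either by document frequency (number of class-c documents containing the term) or collection frequency (number of token occurrences in class-c documents). A minimal sketch over a NumPy count matrix restricted to the documents of one class (names are mine):

```python
import numpy as np

def top_k_terms_by_frequency(X_c, k, use_document_frequency=True):
    """Indices of the k terms most frequent in class c.

    X_c: NumPy document-term count matrix restricted to the documents of class c.
    """
    if use_document_frequency:
        scores = (X_c > 0).sum(axis=0)   # document frequency: # documents of c containing the term
    else:
        scores = X_c.sum(axis=0)         # collection frequency: # token occurrences in class c
    return np.argsort(scores)[::-1][:k]
```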

Comparison of feature selection methods

χ2 selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy.

All three methods – MI, χ2 and frequency based – are greedy methods.
They may select features that contribute no incremental information over previously selected features.
