Text classification and Naive Bayes
Abstract
To capture the generality and scope of the problem space to which standing queries belong, this chapter introduces the general notion of a classification problem.
Classification using standing queries is also called routing or filtering.
Most retrieval systems today contain multiple components that use some form of classifier.
Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. This approach is also called statistical text classification if the learning method is statistical.
In this situation, we require a number of good example documents for each class.
The text classification problem
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes: γ : X → C, where X is the document space and C = {c1, ..., cJ} is the fixed set of classes.
This type of learning is called supervised learning because a supervisor serves as a teacher directing the learning process.
- Example - Reuters-RCV1 collection
The training set provides some typical examples for each class, so that we can learn the classification function γ.
Once we have learned γ, we can apply it to the test set (or test data)
Naive Bayes text classification
The probability of a document d being in class c is computed as
P(c|d) ∝ P(c) · ∏(1 ≤ k ≤ nd) P(tk|c)
where ⟨t1, t2, ..., tnd⟩ are the tokens in d that are part of the vocabulary we use for classification and nd is the number of such tokens in d.
In text classification, our goal is to find the best class for the document.
- MAP: the best class in NB classification is the most likely or maximum a posteriori (MAP) class cmap:
cmap = arg max(c ∈ C) P̂(c|d) = arg max(c ∈ C) P̂(c) ∏(1 ≤ k ≤ nd) P̂(tk|c)
We write P̂ for P because we do not know the true values of the parameters P(c) and P(tk|c); we estimate the parameters from the training set as we will see in a moment.
Multiplying many small conditional probabilities can result in floating-point underflow, so the maximization that is actually done in most implementations of NB adds logarithms of probabilities instead of multiplying them:
cmap = arg max(c ∈ C) [ log P̂(c) + Σ(1 ≤ k ≤ nd) log P̂(tk|c) ]
The sum of the log prior and the term weights is then a measure of how much evidence there is for the document being in the class, and the above equation selects the class for which we have the most evidence.
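To see concretely why the log formulation is preferred, here is a small illustrative sketch (plain Python, values invented for illustration): multiplying many small conditional probabilities underflows to 0.0, while summing their logarithms stays well behaved.

import math

# 2000 token probabilities of 0.001 each: the product underflows,
# the sum of logs does not.
probs = [0.001] * 2000

product = 1.0
for p in probs:
    product *= p          # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # approximately -13815.5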
- To estimate the parameters P̂(c) and P̂(t|c), we first try the MLE, the maximum likelihood estimate, which is simply the relative frequency. For the prior:
P̂(c) = Nc / N
where Nc is the number of documents in class c and N is the total number of documents.
- We estimate the conditional probability P̂(t|c) as the relative frequency of term t in documents belonging to class c:
P̂(t|c) = Tct / Σ(t′ ∈ V) Tct′
where Tct is the number of occurrences of t in training documents from class c, including multiple occurrences of a term in a document.
- To solve the problem that the MLE is zero for a term-class combination that did not occur in the training set, we use add-one (Laplace) smoothing:
P̂(t|c) = (Tct + 1) / (Σ(t′ ∈ V) Tct′ + B)
where B = |V| is the number of terms in the vocabulary.
- Pseudo code (see the sketch below)
- Training part
- Testing part
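The pseudo code itself is not reproduced above, so the following is a minimal Python sketch of the two parts under the formulas given earlier (MLE prior, add-one smoothing, log-space scoring). The function and variable names are illustrative, not from any library.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Training part: docs is a list of (class_label, list_of_tokens) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    classes = {c for c, _ in docs}
    n_docs = len(docs)
    prior, cond_prob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for label, tokens in docs if label == c]
        prior[c] = len(class_docs) / n_docs                  # P(c) = Nc / N
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:                                      # add-one smoothing
            cond_prob[t][c] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, cond_prob, classes

def apply_multinomial_nb(vocab, prior, cond_prob, classes, tokens):
    """Testing part: return the class with the highest log score."""
    tokens = [t for t in tokens if t in vocab]               # ignore unknown terms
    scores = {}
    for c in classes:
        scores[c] = math.log(prior[c]) + sum(math.log(cond_prob[t][c]) for t in tokens)
    return max(scores, key=scores.get)

Note that apply_multinomial_nb ignores vocabulary terms that do not occur in the test document; this is exactly where the Bernoulli model of the next section behaves differently.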
- Example
According to the training and test documents in the example figure (not reproduced here), we get the prior and smoothed conditional probability estimates, and from them the class scores for the test document.
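Since the figure is missing, the numbers below assume the standard textbook example: three training documents in class China ("Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao"), one document not in the class ("Tokyo Japan Chinese"), and the test document "Chinese Chinese Chinese Tokyo Japan". A small Python sketch of the computation:

from math import prod

# Assumed training data (textbook example); "yes" = China, "no" = not-China.
train = [
    ("yes", "Chinese Beijing Chinese".split()),
    ("yes", "Chinese Chinese Shanghai".split()),
    ("yes", "Chinese Macao".split()),
    ("no",  "Tokyo Japan Chinese".split()),
]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {t for _, d in train for t in d}          # |V| = B = 6
B = len(vocab)

def cond_prob(t, c):
    docs = [d for label, d in train if label == c]
    T_ct = sum(d.count(t) for d in docs)
    total = sum(len(d) for d in docs)
    return (T_ct + 1) / (total + B)               # add-one smoothing

for c in ("yes", "no"):
    prior = sum(1 for label, _ in train if label == c) / len(train)
    score = prior * prod(cond_prob(t, c) for t in test)
    print(c, score)
# yes: 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
# no : 1/4 * (2/9)^3 * 2/9  * 2/9  ≈ 0.0001

Tokyo and Japan pull toward not-China, but the three occurrences of Chinese outweigh them, so the test document is assigned to class China.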
The Bernoulli model
There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model: it generates one term from the vocabulary in each position of the document, under the generative model we assumed there. The alternative is the multivariate Bernoulli model (Bernoulli model for short), which generates a binary indicator for each term of the vocabulary: 1 if the term occurs in the document, 0 if it does not.
The different generation models imply different estimation strategies and different classification rules.
The Bernoulli model estimates Pˆ(t|c) as the fraction of documents of class c that contain term t.
In contrast, the multinomial model estimates Pˆ(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain term t.
When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences.
Note:
Because it ignores the number of occurrences of each term, the Bernoulli model typically makes many mistakes when classifying long documents; for example, it may assign an entire book to the class China because of a single occurrence of the term China.
- Bernoulli model (NB Algorithm)
- Training
- Testing
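As with the multinomial case, the algorithm figure is not reproduced, so here is a minimal Python sketch of the Bernoulli training and testing parts under the description above: document-level counts, the common add-one-smoothed document fraction (Nct + 1) / (Nc + 2), and every vocabulary term, present or absent, contributing at test time. Names are illustrative.

import math
from collections import defaultdict

def train_bernoulli_nb(docs):
    """Training part: docs is a list of (class_label, list_of_tokens) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    classes = {c for c, _ in docs}
    n_docs = len(docs)
    prior, cond_prob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [set(tokens) for label, tokens in docs if label == c]
        n_c = len(class_docs)
        prior[c] = n_c / n_docs
        for t in vocab:
            n_ct = sum(1 for d in class_docs if t in d)   # documents of c containing t
            cond_prob[t][c] = (n_ct + 1) / (n_c + 2)      # add-one smoothing
    return vocab, prior, cond_prob, classes

def apply_bernoulli_nb(vocab, prior, cond_prob, classes, tokens):
    """Testing part: every vocabulary term contributes, present or absent."""
    present = set(tokens) & vocab
    scores = {}
    for c in classes:
        score = math.log(prior[c])
        for t in vocab:
            p = cond_prob[t][c]
            score += math.log(p) if t in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)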
Properties of Naive Bayes
We decide class membership of a document by assigning it to the class with the maximum a posteriori probability:
cmap = arg max(c ∈ C) P(c|d) = arg max(c ∈ C) P(d|c) P(c)
(by Bayes' rule, dropping the denominator P(d), which is the same for all classes). We can interpret this as a generative process: a class is first chosen with probability P(c), and a document is then generated from that class.
The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution P(d|c)
The multinomial NB model generates the document as a sequence of tokens: P(d|c) = P(⟨t1, ..., tnd⟩ | c).
The Bernoulli NB model generates the document as a vector of binary term indicators: P(d|c) = P(⟨e1, ..., eM⟩ | c), where ei = 1 if term ti occurs in d and ei = 0 otherwise.
- Comparing the two models:
- document representation: sequence of tokens (multinomial) vs. binary vector of term occurrences (Bernoulli)
- estimate of P̂(t|c): fraction of tokens or positions in documents of class c that are t (multinomial) vs. fraction of documents of class c that contain t (Bernoulli)
- multiple occurrences of a term in a document: counted (multinomial) vs. ignored (Bernoulli)
- vocabulary terms absent from the test document: ignored (multinomial) vs. contribute a factor 1 − P̂(t|c) (Bernoulli)
Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true.
Note:
The computational granularity of the two models is different: the multinomial model works at the level of individual words (tokens), whereas the Bernoulli model works at the level of whole documents. The prior probability and the class-conditional probabilities are therefore estimated differently in the two models.
When computing the posterior probability for a document D, only the words that appear in D participate in the multinomial model. In the Bernoulli model, words that do not appear in D but do appear in the global vocabulary also participate, contributing as non-occurrences (the factor for the term being absent).
- Python with sklearn
- BernoulliNB
from sklearn.naive_bayes import BernoulliNB
# X_train: binary document-term matrix, y_train: class labels
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
- MultinomialNB
from sklearn.naive_bayes import MultinomialNB
# X_train: term-count document-term matrix, y_train: class labels
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
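The two snippets above assume that X_train is already a numeric document-term matrix. A minimal end-to-end sketch using scikit-learn's CountVectorizer, with a toy corpus invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, purely illustrative.
texts = ["cheap viagra offer", "meeting agenda attached",
         "cheap offer now", "project meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# Term-count features for the multinomial model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts)

mnb = MultinomialNB()
mnb.fit(X_train, labels)

X_test = vectorizer.transform(["cheap meeting offer"])
print(mnb.predict(X_test))

The same pipeline works for BernoulliNB; it binarizes its input by default (binarize=0.0), so counts are reduced to presence/absence, or CountVectorizer(binary=True) can be used directly.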
A variant of the multinomial model
In this variant, each document d is represented by its vector of term counts ⟨tf(t1,d), ..., tf(tM,d)⟩, where tf(ti,d) is the frequency of term ti in d. P(d|c) is then computed as
P(d|c) ∝ ∏(1 ≤ i ≤ M) P̂(ti|c)^tf(ti,d)
This is an alternative formalization of the multinomial model: repeated tokens become exponents, so it yields the same classification decisions as the token-sequence representation.
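A quick illustrative check of this equivalence (probability values invented for illustration): the per-token product and the count-based form give the same likelihood, because repeated tokens simply turn into exponents.

from math import prod
from collections import Counter

tokens = ["chinese", "chinese", "chinese", "tokyo", "japan"]
p = {"chinese": 3/7, "tokyo": 1/14, "japan": 1/14}   # illustrative P(t|c) values

per_token   = prod(p[t] for t in tokens)                              # product over positions
count_based = prod(p[t] ** tf for t, tf in Counter(tokens).items())   # product over terms

assert abs(per_token - count_based) < 1e-12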
Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
- Two main purposes
- it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
- feature selection often increases classification accuracy by eliminating noise features.
- The basic feature selection algorithm: for a given class c, compute a utility measure A(t, c) for each term of the vocabulary and select the k terms with the highest values of A(t, c) (see the sketch below).
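Since the referenced figure is not included, here is a minimal sketch of that basic algorithm. The utility function is a parameter; mutual information, chi-square, and frequency (all discussed below) are used this way. Names are illustrative.

def select_features(docs_in_class, all_docs, k, utility):
    """Return the k terms with the highest utility A(t, c) for the class.

    `utility` is any function utility(term, docs_in_class, all_docs) -> float,
    e.g. mutual information or chi-square as sketched below.
    """
    vocab = {t for d in all_docs for t in d}
    scored = [(utility(t, docs_in_class, all_docs), t) for t in vocab]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]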
Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.
Mutual information
MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c
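The MI formula itself is not reproduced above; the sketch below shows the usual estimate from a 2x2 contingency table of document counts, where N11 counts documents that contain the term and belong to the class, N10 those that contain the term but are outside the class, and so on. This is an illustrative implementation, not library code.

from math import log2

def mutual_information(n11, n10, n01, n00):
    """Estimate I(U; C) from document counts.

    n11: docs containing the term and in the class
    n10: docs containing the term, not in the class
    n01: docs in the class, not containing the term
    n00: docs neither containing the term nor in the class
    """
    n = n11 + n10 + n01 + n00
    total = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:                      # 0 * log 0 is taken as 0
            total += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return total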
χ^2 Feature selection
In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:
X²(D, t, c) = Σ(et ∈ {0,1}) Σ(ec ∈ {0,1}) (N(et,ec) − E(et,ec))² / E(et,ec)
where et = 1 if the document contains t and 0 otherwise, ec = 1 if the document is in class c and 0 otherwise, N is the observed frequency in D, and E is the expected frequency assuming term and class occurrence are independent.
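A companion sketch computing this quantity from the same 2x2 table of document counts, with expected counts derived from the marginals under the independence assumption:

def chi_square(n11, n10, n01, n00):
    """Rank value X^2(D, t, c) from observed document counts (same table as above)."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_marginal = {1: n11 + n10, 0: n01 + n00}
    class_marginal = {1: n11 + n01, 0: n10 + n00}
    total = 0.0
    for (e_t, e_c), n_obs in observed.items():
        expected = term_marginal[e_t] * class_marginal[e_c] / n
        if expected > 0:
            total += (n_obs - expected) ** 2 / expected
    return total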
Frequency-based feature selection
Frequency-based selection simply keeps the terms that are most common in the class. Frequency can be defined either as document frequency (the number of documents in class c that contain the term t) or as collection frequency (the number of tokens of t that occur in documents of class c). Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
Comparison of feature selection methods
χ2 selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy.
All three methods – MI, χ2 and frequency based – are greedy methods.
They may select features that contribute no incremental information over previously selected features.