Text classification and Naive Bayes
Abstract
To capture the generality and scope of the problem space to which standing queries belong, this chapter introduces the general notion of a classification problem.
Classification using standing queries is also called routing or filtering.
Most retrieval systems today contain multiple components that use some form of classifier.
Apart from manual classification and hand-crafted rules, there is a third approach to text classification, namely, machine learning-based text classification. This approach is also called statistical text classification if the learning method is statistical.
In this situation, we require a number of good example documents for each class.
The text classification problem
Using a learning method or learning algorithm, we then wish to learn a classifier or classification function γ that maps documents to classes: γ : X → C, where X is the document space and C = {c1, ..., cJ} is the fixed set of classes.
This type of learning is called supervised learning because a supervisor serves as a teacher directing the learning process.
- Example - Reuters-RCV1 collection
The training set provides some typical examples for each class, so that we can learn the classification function γ.
Once we have learned γ, we can apply it to the test set (or test data)
Naive Bayes text classification
The probability of a document d being in class c is computed as
P(c|d) ∝ P(c) · ∏(1 ≤ k ≤ nd) P(tk|c)
where ⟨t1, t2, ..., tnd⟩ are the tokens in d that are part of the vocabulary we use for classification and nd is the number of such tokens in d.
In text classification, our goal is to find the best class for the document.
- MAP: the best class in NB classification is the most likely or maximum a posteriori (MAP) class cmap:
cmap = arg max(c ∈ C) P̂(c|d) = arg max(c ∈ C) P̂(c) ∏(1 ≤ k ≤ nd) P̂(tk|c)
We write P̂ for P because we do not know the true values of the parameters P(c) and P(tk|c); we estimate the parameters from the training set as we will see in a moment.
Multiplying many small conditional probabilities can result in floating-point underflow, so the maximization that is actually done in most implementations of NB adds logarithms of probabilities instead of multiplying them:
cmap = arg max(c ∈ C) [ log P̂(c) + Σ(1 ≤ k ≤ nd) log P̂(tk|c) ]
The sum of the log prior and the term weights is then a measure of how much evidence there is for the document being in the class, and the above equation selects the class for which we have the most evidence.
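To see concretely why the log formulation is preferred, here is a small illustrative sketch (plain Python, values invented for illustration): multiplying many small conditional probabilities underflows to 0.0, while summing their logarithms stays well behaved.

import math

# 2000 token probabilities of 0.001 each: the product underflows,
# the sum of logs does not.
probs = [0.001] * 2000

product = 1.0
for p in probs:
    product *= p          # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)

print(product)   # 0.0
print(log_sum)   # approximately -13815.5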
- To estimate the parameters P̂(c) and P̂(t|c), we first try the MLE, the maximum likelihood estimate, which is simply the relative frequency. For the prior:
P̂(c) = Nc / N
where Nc is the number of documents in class c and N is the total number of documents.
- We estimate the conditional probability P̂(t|c) as the relative frequency of term t in documents belonging to class c:
P̂(t|c) = Tct / Σ(t′ ∈ V) Tct′
where Tct is the number of occurrences of t in training documents from class c, including multiple occurrences of a term in a document.
- To solve the problem that the MLE is zero for a term-class combination that did not occur in the training set, we use add-one (Laplace) smoothing:
P̂(t|c) = (Tct + 1) / (Σ(t′ ∈ V) Tct′ + B)
where B = |V| is the number of terms in the vocabulary.
- Pseudo code (see the sketch below)
- Training part
- Testing part
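The pseudo code itself is not reproduced above, so the following is a minimal Python sketch of the two parts under the formulas given earlier (MLE prior, add-one smoothing, log-space scoring). The function and variable names are illustrative, not from any library.

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """Training part: docs is a list of (class_label, list_of_tokens) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    classes = {c for c, _ in docs}
    n_docs = len(docs)
    prior, cond_prob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [tokens for label, tokens in docs if label == c]
        prior[c] = len(class_docs) / n_docs                  # P(c) = Nc / N
        counts = Counter(t for tokens in class_docs for t in tokens)
        total = sum(counts.values())
        for t in vocab:                                      # add-one smoothing
            cond_prob[t][c] = (counts[t] + 1) / (total + len(vocab))
    return vocab, prior, cond_prob, classes

def apply_multinomial_nb(vocab, prior, cond_prob, classes, tokens):
    """Testing part: return the class with the highest log score."""
    tokens = [t for t in tokens if t in vocab]               # ignore unknown terms
    scores = {}
    for c in classes:
        scores[c] = math.log(prior[c]) + sum(math.log(cond_prob[t][c]) for t in tokens)
    return max(scores, key=scores.get)

Note that apply_multinomial_nb ignores vocabulary terms that do not occur in the test document; this is exactly where the Bernoulli model of the next section behaves differently.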
- Example
According to the training and test documents in the example figure (not reproduced here), we get the prior and smoothed conditional probability estimates, and from them the class scores for the test document.
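Since the figure is missing, the numbers below assume the standard textbook example: three training documents in class China ("Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao"), one document not in the class ("Tokyo Japan Chinese"), and the test document "Chinese Chinese Chinese Tokyo Japan". A small Python sketch of the computation:

from math import prod

# Assumed training data (textbook example); "yes" = China, "no" = not-China.
train = [
    ("yes", "Chinese Beijing Chinese".split()),
    ("yes", "Chinese Chinese Shanghai".split()),
    ("yes", "Chinese Macao".split()),
    ("no",  "Tokyo Japan Chinese".split()),
]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {t for _, d in train for t in d}          # |V| = B = 6
B = len(vocab)

def cond_prob(t, c):
    docs = [d for label, d in train if label == c]
    T_ct = sum(d.count(t) for d in docs)
    total = sum(len(d) for d in docs)
    return (T_ct + 1) / (total + B)               # add-one smoothing

for c in ("yes", "no"):
    prior = sum(1 for label, _ in train if label == c) / len(train)
    score = prior * prod(cond_prob(t, c) for t in test)
    print(c, score)
# yes: 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
# no : 1/4 * (2/9)^3 * 2/9  * 2/9  ≈ 0.0001

Tokyo and Japan pull toward not-China, but the three occurrences of Chinese outweigh them, so the test document is assigned to class China.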
The Bernoulli model
There are two different ways we can set up an NB classifier. The model we introduced in the previous section is the multinomial model: it generates one term from the vocabulary in each position of the document, under the generative model we assumed there. The alternative is the multivariate Bernoulli model (Bernoulli model for short), which generates a binary indicator for each term of the vocabulary: 1 if the term occurs in the document, 0 if it does not.
The different generation models imply different estimation strategies and different classification rules.
The Bernoulli model estimates Pˆ(t|c) as the fraction of documents of class c that contain term t.
In contrast, the multinomial model estimates Pˆ(t|c) as the fraction of tokens or fraction of positions in documents of class c that contain term t.
When classifying a test document, the Bernoulli model uses binary occurrence information, ignoring the number of occurrences, whereas the multinomial model keeps track of multiple occurrences.
Note:
Because it ignores the number of occurrences of each term, the Bernoulli model typically makes many mistakes when classifying long documents; for example, it may assign an entire book to the class China because of a single occurrence of the term China.
- Bernoulli model (NB Algorithm)
- Training
- Testing
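As with the multinomial case, the algorithm figure is not reproduced, so here is a minimal Python sketch of the Bernoulli training and testing parts under the description above: document-level counts, the common add-one-smoothed document fraction (Nct + 1) / (Nc + 2), and every vocabulary term, present or absent, contributing at test time. Names are illustrative.

import math
from collections import defaultdict

def train_bernoulli_nb(docs):
    """Training part: docs is a list of (class_label, list_of_tokens) pairs."""
    vocab = {t for _, tokens in docs for t in tokens}
    classes = {c for c, _ in docs}
    n_docs = len(docs)
    prior, cond_prob = {}, defaultdict(dict)
    for c in classes:
        class_docs = [set(tokens) for label, tokens in docs if label == c]
        n_c = len(class_docs)
        prior[c] = n_c / n_docs
        for t in vocab:
            n_ct = sum(1 for d in class_docs if t in d)   # documents of c containing t
            cond_prob[t][c] = (n_ct + 1) / (n_c + 2)      # add-one smoothing
    return vocab, prior, cond_prob, classes

def apply_bernoulli_nb(vocab, prior, cond_prob, classes, tokens):
    """Testing part: every vocabulary term contributes, present or absent."""
    present = set(tokens) & vocab
    scores = {}
    for c in classes:
        score = math.log(prior[c])
        for t in vocab:
            p = cond_prob[t][c]
            score += math.log(p) if t in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)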
Properties of Naive Bayes
We decide class membership of a document by assigning it to the class with the maximum a posteriori probability:
cmap = arg max(c ∈ C) P(c|d) = arg max(c ∈ C) P(d|c) P(c)
(by Bayes' rule, dropping the denominator P(d), which is the same for all classes). We can interpret this as a generative process: a class is first chosen with probability P(c), and a document is then generated from that class.
The two models differ in the formalization of the second step, the generation of the document given the class, corresponding to the conditional distribution P(d|c)
The multinomial NB model generates the document as a sequence of tokens: P(d|c) = P(⟨t1, ..., tnd⟩ | c).
The Bernoulli NB model generates the document as a vector of binary term indicators: P(d|c) = P(⟨e1, ..., eM⟩ | c), where ei = 1 if term ti occurs in d and ei = 0 otherwise.
- Comparing the two models:
- document representation: sequence of tokens (multinomial) vs. binary vector of term occurrences (Bernoulli)
- estimate of P̂(t|c): fraction of tokens or positions in documents of class c that are t (multinomial) vs. fraction of documents of class c that contain t (Bernoulli)
- multiple occurrences of a term in a document: counted (multinomial) vs. ignored (Bernoulli)
- vocabulary terms absent from the test document: ignored (multinomial) vs. contribute a factor 1 − P̂(t|c) (Bernoulli)
Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class. This is hardly ever true for terms in documents. In many cases, the opposite is true.
Note:
The computational granularity of the two models is different: the multinomial model works at the level of individual words (tokens), whereas the Bernoulli model works at the level of whole documents. The prior probability and the class-conditional probabilities are therefore estimated differently in the two models.
When computing the posterior probability for a document D, only the words that appear in D participate in the multinomial model. In the Bernoulli model, words that do not appear in D but do appear in the global vocabulary also participate, contributing as non-occurrences (the factor for the term being absent).
- Python with sklearn
- BernoulliNB
from sklearn.naive_bayes import BernoulliNB
# X_train: binary document-term matrix, y_train: class labels
bnb = BernoulliNB()
bnb.fit(X_train, y_train)
- MultinomialNB
from sklearn.naive_bayes import MultinomialNB
# X_train: term-count document-term matrix, y_train: class labels
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
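The two snippets above assume that X_train is already a numeric document-term matrix. A minimal end-to-end sketch using scikit-learn's CountVectorizer, with a toy corpus invented purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, purely illustrative.
texts = ["cheap viagra offer", "meeting agenda attached",
         "cheap offer now", "project meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

# Term-count features for the multinomial model.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts)

mnb = MultinomialNB()
mnb.fit(X_train, labels)

X_test = vectorizer.transform(["cheap meeting offer"])
print(mnb.predict(X_test))

The same pipeline works for BernoulliNB; it binarizes its input by default (binarize=0.0), so counts are reduced to presence/absence, or CountVectorizer(binary=True) can be used directly.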
A variant of the multinomial model
In this variant, each document d is represented by its vector of term counts ⟨tf(t1,d), ..., tf(tM,d)⟩, where tf(ti,d) is the frequency of term ti in d. P(d|c) is then computed as
P(d|c) ∝ ∏(1 ≤ i ≤ M) P̂(ti|c)^tf(ti,d)
This is an alternative formalization of the multinomial model: repeated tokens become exponents, so it yields the same classification decisions as the token-sequence representation.
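A quick illustrative check of this equivalence (probability values invented for illustration): the per-token product and the count-based form give the same likelihood, because repeated tokens simply turn into exponents.

from math import prod
from collections import Counter

tokens = ["chinese", "chinese", "chinese", "tokyo", "japan"]
p = {"chinese": 3/7, "tokyo": 1/14, "japan": 1/14}   # illustrative P(t|c) values

per_token   = prod(p[t] for t in tokens)                              # product over positions
count_based = prod(p[t] ** tf for t, tf in Counter(tokens).items())   # product over terms

assert abs(per_token - count_based) < 1e-12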
Feature selection
Feature selection is the process of selecting a subset of the terms occurring in the training set and using only this subset as features in text classification.
- Two main purposes
- it makes training and applying a classifier more efficient by decreasing the size of the effective vocabulary.
- feature selection often increases classification accuracy by eliminating noise features.
- The basic feature selection algorithm: for a given class c, compute a utility measure A(t, c) for each term of the vocabulary and select the k terms with the highest values of A(t, c) (see the sketch below).
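Since the referenced figure is not included, here is a minimal sketch of that basic algorithm. The utility function is a parameter; mutual information, chi-square, and frequency (all discussed below) are used this way. Names are illustrative.

def select_features(docs_in_class, all_docs, k, utility):
    """Return the k terms with the highest utility A(t, c) for the class.

    `utility` is any function utility(term, docs_in_class, all_docs) -> float,
    e.g. mutual information or chi-square as sketched below.
    """
    vocab = {t for d in all_docs for t in d}
    scored = [(utility(t, docs_in_class, all_docs), t) for t in vocab]
    scored.sort(reverse=True)
    return [t for _, t in scored[:k]]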
Of the two NB models, the Bernoulli model is particularly sensitive to noise features. A Bernoulli NB classifier requires some form of feature selection or else its accuracy will be low.
Mutual information
MI measures how much information the presence/absence of a term contributes to making the correct classification decision on c
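The MI formula itself is not reproduced above; the sketch below shows the usual estimate from a 2x2 contingency table of document counts, where N11 counts documents that contain the term and belong to the class, N10 those that contain the term but are outside the class, and so on. This is an illustrative implementation, not library code.

from math import log2

def mutual_information(n11, n10, n01, n00):
    """Estimate I(U; C) from document counts.

    n11: docs containing the term and in the class
    n10: docs containing the term, not in the class
    n01: docs in the class, not containing the term
    n00: docs neither containing the term nor in the class
    """
    n = n11 + n10 + n01 + n00
    total = 0.0
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:                      # 0 * log 0 is taken as 0
            total += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return total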
χ^2 Feature selection
In feature selection, the two events are occurrence of the term and occurrence of the class. We then rank terms with respect to the following quantity:
X²(D, t, c) = Σ(et ∈ {0,1}) Σ(ec ∈ {0,1}) (N(et,ec) − E(et,ec))² / E(et,ec)
where et = 1 if the document contains t and 0 otherwise, ec = 1 if the document is in class c and 0 otherwise, N is the observed frequency in D, and E is the expected frequency assuming term and class occurrence are independent.
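A companion sketch computing this quantity from the same 2x2 table of document counts, with expected counts derived from the marginals under the independence assumption:

def chi_square(n11, n10, n01, n00):
    """Rank value X^2(D, t, c) from observed document counts (same table as above)."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_marginal = {1: n11 + n10, 0: n01 + n00}
    class_marginal = {1: n11 + n01, 0: n10 + n00}
    total = 0.0
    for (e_t, e_c), n_obs in observed.items():
        expected = term_marginal[e_t] * class_marginal[e_c] / n
        if expected > 0:
            total += (n_obs - expected) ** 2 / expected
    return total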
Frequency-based feature selection
Frequency-based selection simply keeps the terms that are most common in the class. Frequency can be defined either as document frequency (the number of documents in class c that contain the term t) or as collection frequency (the number of tokens of t that occur in documents of class c). Document frequency is more appropriate for the Bernoulli model, collection frequency for the multinomial model.
Comparison of feature selection methods
χ2 selects more rare terms (which are often less reliable indicators) than mutual information. But the selection criterion of mutual information also does not necessarily select the terms that maximize classification accuracy.
All three methods – MI, χ2 and frequency based – are greedy methods.
They may select features that contribute no incremental information over previously selected features.