

Given a set of classes

Classification: Assign the correct class label to the given input

Examples of Text Classification:

Topic identification

Spam Detection

Sentiment analysis

Spelling correction

Supervised learning

Supervised Classification

Learn a classification model on properties (‘features’) and their importance (‘weights’) from labeled instances.

Apply the model on new instances to predict the label


Supervised classification: Phases and Datasets

Classification paradigms

Binary Classification: when there are only two possible classes.

Multi-class Classification: when there are more than two possible classes.

Multi-label Classification: when data instances can have two or more labels.


Questions to ask in Supervised Learning

Training phase:

What are the features? How do you represent them?

What is the classification model/algorithm?

What are the model parameters?

Inference phase:

What is the expected performance? What is a good measure?


Why is textual data unique

Textual data presents a unique set of challenges

All the information you need is in the text

But features can be pulled out from text at different granularities


Types of textual features


By far the most common class of features

Handling commonly-occurring words: Stop words

Normalization: Make lower case vs. Leave as-is


Characteristics of words: Capitalization

Parts of speech of words in a sentence

Grammatical structure, sentence parsing

Grouping words of similar meaning, semantics

Depending on the classification task, features may come from inside words and from word sequences.

Naive Bayes Classifiers

Case study: Classifying text search queries


Probabilistic model

Update the likelihood of the class given new information

Prior Probability: Pr(y = Entertainment), Pr(y = CS), Pr(y = Zoology)

When I have new information:

Posterior probability: Pr(y = Entertainment|x = ‘Python’)

Bayes’ Rule

Posterior probability = Prior probability * Likelihood / Evidence

Y* = argmax Pr(y|X) = argmax Pr(y) × Pr(X|y)

Naive assumption: Given the class label, features are assumed to be independent of each other.


Query: ‘Python download’

Y* = argmax Pr(y) × Pr(‘Python’|y)×Pr(‘download’|y)
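The decision rule for the query ‘Python download’ can be sketched in a few lines. The priors and likelihoods below are made-up numbers chosen only for illustration, and the helper name classify is hypothetical; nothing here comes from real data.

```python
# Hypothetical priors Pr(y) and word likelihoods Pr(word | y); illustrative values only.
priors = {"Entertainment": 0.5, "CS": 0.3, "Zoology": 0.2}
likelihoods = {
    "Entertainment": {"python": 0.01, "download": 0.05},
    "CS": {"python": 0.20, "download": 0.10},
    "Zoology": {"python": 0.10, "download": 0.001},
}


def classify(query_words):
    """Return the label maximizing Pr(y) * product of Pr(word | y), plus all scores."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in query_words:
            # Naive assumption: multiply per-word likelihoods independently.
            score *= likelihoods[label][word]
        scores[label] = score
    return max(scores, key=scores.get), scores


best_label, scores = classify(["python", "download"])
print(best_label, scores)
```

The evidence term is dropped because it is the same for every class and does not change the argmax.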


Naive Bayes: What are the parameters?

Prior probabilities: Pr(y) for all y in Y

Likelihood: Pr(xi|y) for all features xi and labels y in Y


Q: You are training a naive Bayes classifier, where the number of possible labels, |Y| = 3 and the dimension of the data element, |x| = 100, where every feature (dimension) is binary. How many parameters does the naive Bayes classification model have? Answer: 603
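One way to arrive at 603 is to count every entry in the probability tables: one prior per class, plus one likelihood per (class, feature, value) combination. The sketch below assumes that counting convention; it does not subtract the entries that are determined by the sum-to-one constraints. The function name is hypothetical.

```python
def count_nb_parameters(num_classes, num_features, values_per_feature=2):
    """Count all table entries: Pr(y) for each class, Pr(x_i = v | y) for each combination."""
    priors = num_classes
    likelihoods = num_classes * num_features * values_per_feature
    return priors + likelihoods


total = count_nb_parameters(num_classes=3, num_features=100)
print(total)  # 3 + 3 * 100 * 2 = 603
```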


Naive Bayes: Learning parameters

Prior probabilities: Pr(y) for all y in Y

  -Remember training data?

  -Count the number of instances in each class

  -If there are N instances in all, and n out of those are labeled as class y --->Pr(y) = n/N

Likelihood: Pr(xi|y) for all features xi and labels y in Y

  -Count how many times feature xi appears in instances labeled as class y

  -If there are p instances of class y, and xi appears in k of those, Pr(xi|y) = k/p
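The two counting steps above can be sketched on a toy training set. The data and the function name learn_parameters are invented for illustration; each instance is represented as the set of words it contains, which matches the "appears in k of p instances" definition.

```python
from collections import Counter, defaultdict

# Hypothetical labeled queries: (set of words present, class label).
training_data = [
    ({"python", "download"}, "CS"),
    ({"python", "tutorial"}, "CS"),
    ({"python", "snake"}, "Zoology"),
    ({"monty", "python"}, "Entertainment"),
]


def learn_parameters(data):
    """Estimate Pr(y) = n/N and Pr(x_i | y) = k/p by counting."""
    class_counts = Counter(label for _, label in data)
    total = len(data)
    priors = {y: n / total for y, n in class_counts.items()}

    # feature_counts[y][word] = number of class-y instances containing the word
    feature_counts = defaultdict(Counter)
    for words, label in data:
        feature_counts[label].update(words)

    likelihoods = {
        y: {word: k / class_counts[y] for word, k in counts.items()}
        for y, counts in feature_counts.items()
    }
    return priors, likelihoods


priors, likelihoods = learn_parameters(training_data)
print(priors)
print(likelihoods["CS"])
```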


Naive Bayes: Smoothing

What happens if Pr(xi|y) = 0?

  -Feature xi never occurs in documents labeled y

  -But then, the posterior probability Pr(y|xi) will be 0!

Instead, smooth the parameters

Laplace smoothing or Additive smoothing: Add a dummy count

  - Pr(xi|y) = (k+1)/(p+n); where n is number of features
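A minimal sketch of the smoothed estimate, using the formula exactly as written above. The counts in the example are invented for illustration.

```python
def smoothed_likelihood(k, p, n):
    """Laplace (add-one) estimate (k + 1) / (p + n), following the formula in these notes."""
    return (k + 1) / (p + n)


# Hypothetical counts: the feature never appears with this class (k = 0).
unsmoothed = 0 / 50
smoothed = smoothed_likelihood(k=0, p=50, n=100)
print(unsmoothed, smoothed)
```

With the dummy count, an unseen feature lowers the score of a class but no longer forces the whole product to zero.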


Take Home Concept

Naive Bayes is a probabilistic model

Naive, because it assumes features are independent of each other, given the class label.

For text classification problems, naive Bayes models typically provide very strong baselines.

Simple model, easy to learn parameters


Two Classic Naive Bayes Variants for Text

Multinomial Naive Bayes

   -Data follows a multinomial distribution

   -Each feature value is a count (word occurrence counts, TF-IDF weighting, ...)

Bernoulli Naive Bayes

   -Data follows a multivariate Bernoulli distribution

   -Each feature is binary(word is present/absent)

It does not matter how many times that word was present
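The difference between the two variants shows up in how the same text is turned into a feature vector. The sketch below uses a hypothetical four-word vocabulary and simple whitespace tokenization.

```python
from collections import Counter

# Hypothetical vocabulary, fixed for the example.
vocabulary = ["great", "phone", "bad", "battery"]


def multinomial_features(text):
    """Count how many times each vocabulary word occurs."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def bernoulli_features(text):
    """Record only whether each vocabulary word is present (1) or absent (0)."""
    present = set(text.lower().split())
    return [int(word in present) for word in vocabulary]


review = "great phone great battery"
count_vector = multinomial_features(review)
binary_vector = bernoulli_features(review)
print(count_vector)
print(binary_vector)
```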


Case study: Sentiment analysis

Words that you might find in typical reviews

Classifier = Function on input data


Decision Boundaries

Classification function is represented by decision surfaces


Choosing a Decision Boundary

Data overfitting: Decision boundary learned over training data doesn’t generalize to test data.


Linear Boundaries

   -Easy to find

   -Easy to evaluate

   -More generalizable: ‘Occam’s razor’


Finding a Linear Boundary

   -Find the linear boundary = Find w or the slope of the line

Many methods


   -Linear Discriminant Analysis

   -Linear least squares

Problem: If the data is linearly separable, there are infinitely many linear boundaries.

What is a reasonable boundary? Maximum margin

Support Vector Machines are maximum-margin classifiers


Support Vector Machine(SVM)

Uses optimization techniques to find the maximum-margin boundary

SVMs are linear classifiers that find a hyperplane to separate two classes of data: positive and negative.




SVM: Multi-class classification

SVMs work only for binary classification problems

One vs Rest

n-class SVM has n classifiers

One vs One

n-class SVM has C(n,2) classifiers
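The two counts can be checked by enumerating the binary problems each scheme creates. The class list below is hypothetical ("Sports" is added just to make n = 4).

```python
from itertools import combinations

# Hypothetical label set with n = 4 classes.
classes = ["Entertainment", "CS", "Zoology", "Sports"]

# One vs Rest: one binary problem per class (that class against all the others).
one_vs_rest = [(c, "rest") for c in classes]

# One vs One: one binary problem per unordered pair of classes.
one_vs_one = list(combinations(classes, 2))

print(len(one_vs_rest))  # n
print(len(one_vs_one))   # n * (n - 1) / 2
```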


SVM Parameters(I): Parameter C

Regularization: How much importance should you give to individual data points as compared to a better generalized model?

Regularization parameter C

  -Larger values of C = less regularization

   -Fit training data as well as possible, every data point important

  -Smaller values of C = more regularization

   -More tolerant to errors on individual data points
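To see why C trades off fit against regularization, it helps to look at the standard soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses). The notes do not state this formula, so treat it as an assumption about which formulation is meant; the data, weights, and function name below are invented for illustration.

```python
def soft_margin_objective(w, b, X, y, C):
    """Standard soft-margin SVM objective; labels are expected to be -1 or +1."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, label in zip(X, y):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # A point contributes loss only if it is misclassified or inside the margin.
        hinge_total += max(0.0, 1.0 - label * score)
    return regularizer + C * hinge_total


# Hypothetical data: the third point lies on the wrong side of the boundary.
X = [[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, -1, -1]
w, b = [1.0, 0.0], 0.0

small_c = soft_margin_objective(w, b, X, y, C=0.1)
large_c = soft_margin_objective(w, b, X, y, C=10.0)
print(small_c, large_c)
```

With a large C the single misplaced point dominates the objective, so the optimizer would work hard to fit it; with a small C its contribution is minor and the margin term matters more.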

SVM Parameters(II): Other params

Linear kernels usually work best for text data

  -Other kernels include rbf, polynomial

multi_class: ovr (one-vs-rest)

class_weight: Different classes can get different weights

  -Useful for skewed class distributions. For example, in spam detection, if spam makes up about 80% of the e-mails somebody gets and non-spam the other 20%, you would want to give different weights to these two classes.
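One common way to pick such weights is to make them inversely proportional to class frequency. The sketch below uses n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'; the function name and the 80/20 label list are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical skewed data set: 80% spam, 20% not spam.
labels = ["spam"] * 80 + ["not spam"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class ends up with the larger weight, so errors on it cost more during training.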


Take Home Messages

-Support Vector Machines tend to be among the most accurate classifiers, especially on high-dimensional data.

-Strong theoretical foundation

-Handles only numeric features

   -Convert categorical features to numeric features


-Hyperplane hard to interpret













Toolkits for Supervised Text Classification



  -Interfaces with sklearn and other ML toolkits (like Weka)!

Using Sklearn’s NaiveBayesClassifier

from sklearn import naive_bayes, metrics

clfrNB = naive_bayes.MultinomialNB()

clfrNB.fit(train_data, train_labels)  # learn priors and likelihoods from the training set

predicted_labels = clfrNB.predict(test_data)

metrics.f1_score(test_labels, predicted_labels, average='micro')

micro averaging and macro averaging
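The difference between the two averages can be sketched without any library. Macro averaging computes F1 per class and takes the unweighted mean; micro averaging pools the true-positive, false-positive, and false-negative counts over all classes first. The labels below are invented and the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_labels, predicted_labels):
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        per_class.append((tp, fp, fn))
    # Macro: average the per-class F1 scores, every class counts equally.
    macro = sum(f1(*counts) for counts in per_class) / len(classes)
    # Micro: pool the counts, so frequent classes dominate.
    micro = f1(*(sum(column) for column in zip(*per_class)))
    return micro, macro


# Hypothetical skewed example: class "a" is frequent, class "b" is rare.
true_labels = ["a", "a", "a", "a", "b", "b"]
predicted_labels = ["a", "a", "a", "a", "a", "b"]
micro, macro = micro_macro_f1(true_labels, predicted_labels)
print(micro, macro)
```

For single-label problems the micro average equals plain accuracy, while the macro average drops when a rare class is handled poorly.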


Using Sklearn’s SVM classifier

from sklearn import svm

clfrSVM = svm.SVC(kernel='linear', C=0.1)  # a linear kernel is the usual choice for text classification

# C is the soft-margin (regularization) parameter

clfrSVM.fit(train_data, train_labels)

predicted_labels = clfrSVM.predict(test_data)


Model Selection

Recall the discussion on multiple phases in a supervised learning task

Model Selection in Scikit-learn

from sklearn import model_selection

X_train, X_test, y_train, y_test = model_selection.train_test_split(train_data, train_labels, test_size=0.333, random_state=0)

predicted_labels = model_selection.cross_val_predict(clfrSVM, train_data, train_labels, cv=5)
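The idea behind cv = 5 is that the data is split into five folds, and each instance is predicted by a model trained on the other four. The sketch below only shows the index bookkeeping with plain contiguous folds; it is not scikit-learn's implementation, which for classifiers also stratifies the folds by label. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, non-overlapping folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every index appears in exactly one test fold, which is why cross-validated predictions cover the whole training set.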


Supervised Text Classification in NLTK

NLTK has some classification algorithms








Using NLTK’s NaiveBayesClassifier

import nltk

from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)



nltk.classify.util.accuracy(classifier, test_set)


classifier.show_most_informative_features()


Using NLTK’s SklearnClassifier

from nltk.classify import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB

from sklearn.svm import SVC

clfrNB = SklearnClassifier(MultinomialNB()).train(train_set)

clfrSVM = SklearnClassifier(SVC(kernel='linear')).train(train_set)




Demonstration: Case study - Sentiment Analysis

import pandas as pd

import numpy as np

df = pd.read_csv('Amazon_Unlocked_Mobile.csv')


df.dropna(inplace=True)  # drop rows with missing values

df = df[df['Rating'] != 3]  # remove neutral ratings

df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)  # 1 = rated above 3, 0 = rated below 3


df['Positively Rated'].mean()  # fraction of reviews that are positive

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Reviews'], df['Positively Rated'], random_state=0)

# CountVectorizer allows us to use the bag-of-words approach by converting a collection of text
# documents into a matrix of token counts. It only counts how often each word occurs.


from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)

vect.get_feature_names_out()[::2000]  # every 2000th feature; older scikit-learn versions used get_feature_names()


X_train_vectorized = vect.transform(X_train)

X_train_vectorized  # the entries in this matrix are the number of times each word appears in each document

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # logistic regression works well for high-dimensional sparse data

model.fit(X_train_vectorized, y_train)


from sklearn.metrics import roc_auc_score

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))  # AUC score

#Note that any words in X_test that didn’t appear in X_train will just be ignored.
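The fit/transform split, and the fact that unseen test words are ignored, can be sketched with a tiny hand-rolled bag-of-words. This is a simplification and not CountVectorizer itself: it tokenizes on whitespace only, and the documents and function names are invented for illustration.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct training word to a column index (alphabetical order)."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def transform(documents, vocabulary):
    """Turn each document into a row of counts over the fitted vocabulary."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc.lower().split()).items():
            if word in vocabulary:  # words never seen during fitting are skipped
                row[vocabulary[word]] = count
        rows.append(row)
    return rows


train_docs = ["good phone", "bad phone bad battery"]
test_docs = ["good battery amazing"]  # "amazing" is not in the training data

vocabulary = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocabulary)
test_matrix = transform(test_docs, vocabulary)
print(vocabulary)
print(train_matrix)
print(test_matrix)
```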



Tf-idf: Term frequency-inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer

# Tf-idf allows us to weight terms based on how important they are to a document.

# High weight is given to terms that appear often in a particular document but don't appear often in the corpus.

vect = TfidfVectorizer(min_df=5).fit(X_train)


# Features with high tf-idf are frequently used within specific documents, but rarely used across all documents.
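The weighting idea can be sketched with the textbook form tf * log(N / df). This is not the exact formula TfidfVectorizer uses by default (it applies a smoothed idf and normalizes each row), so the numbers will differ from scikit-learn's; the documents and function name are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights


docs = ["great phone great price", "bad phone", "phone arrived late"]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "phone" occurs in every document, so its weight is zero, while "great" is frequent in one document and absent elsewhere, so it gets the highest weight there.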



CountVectorizer and TfidfVectorizer both take an argument, min_df, which allows us to specify a minimum number of documents in which a token needs to appear to become part of the vocabulary. This helps us remove some words that might appear in only a few documents and are unlikely to be useful predictors. For example, here we'll pass in min_df = 5, which will remove any words from our vocabulary that appear in fewer than five documents.
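The effect of a minimum document frequency can be sketched on a toy corpus. This is a simplified stand-in for what the vectorizers do internally, with whitespace tokenization, invented documents, a hypothetical function name, and a threshold of 2 instead of 5 so the example stays small.

```python
from collections import Counter


def vocabulary_with_min_df(documents, min_df):
    """Keep only words that occur in at least min_df different documents."""
    # set() ensures a word is counted once per document, however often it repeats.
    df = Counter(word for doc in documents for word in set(doc.lower().split()))
    return sorted(word for word, n_docs in df.items() if n_docs >= min_df)


docs = ["good phone", "good battery", "good phone case", "terrible"]
vocab = vocabulary_with_min_df(docs, min_df=2)
print(vocab)
```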


from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(min_df = 5).fit(X_train)


X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))


#see notes3_1

vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)

X_train_vectorized = vect.transform(X_train)

len(vect.get_feature_names_out())  # vocabulary size, now including bigrams


model = LogisticRegression()

model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
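Setting ngram_range = (1,2) adds pairs of adjacent words to the single-word features, which lets a model tell phrases such as "not an issue" apart from "an issue". A rough sketch of the extraction step, using whitespace tokenization and a hypothetical function name rather than CountVectorizer's own tokenizer:

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams with lengths in the inclusive range given."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


features = word_ngrams("not an issue")
print(features)
```

Adding bigrams grows the vocabulary quickly, which is one reason to combine it with min_df.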



