Sentiment Analysis
Learning objectives:
- Cleaning and preparing text data
- Building feature vectors from text documents
- Training a machine learning model to classify positive and negative reviews
- Working with large text datasets using out-of-core learning
1.Obtaining the IMDb movie review dataset
Introduction:
- Sentiment analysis is sometimes also called opinion mining
- The IMDb (Internet Movie Database) movie review dataset was collected by Maas et al.
The author, Sebastian Raschka, uses a rather elaborate procedure built on the os module to download the files and collect the reviews, which is somewhat tedious to reproduce. We therefore skip that step and start directly from the resulting movie_data.csv file.
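As a minimal sketch of how such a file could be read, the helper below uses only the standard library. The function name load_reviews is hypothetical, and the column names review and sentiment are an assumption about how movie_data.csv is laid out; adjust them if your file differs.

```python
import csv

def load_reviews(path, text_col="review", label_col="sentiment"):
    """Read a CSV of reviews and return (texts, labels) as two lists.

    The column names are assumptions about the file layout.
    """
    texts, labels = [], []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            texts.append(row[text_col])
            labels.append(int(row[label_col]))  # labels stored as 0/1 integers
    return texts, labels
```

With the file in the working directory, it would be called as `texts, labels = load_reviews("movie_data.csv")`.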
2.Bag-of-words model
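The bag-of-words model allows us to represent text as numerical feature vectors. The idea can be summarized as follows:
- Create a vocabulary of unique tokens (for example, words) from the entire set of documents
- Construct a feature vector for each document that contains the counts of how often each word occurs in that document
Note that the feature vectors are sparse, as mentioned before: the words in any single document are only a small subset of the vocabulary, so the vectors consist mostly of zeros.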
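The two steps above can be sketched by hand in a few lines of plain Python. This is only an illustration under simple assumptions: the helper names tokenize and bag_of_words are hypothetical, and the tokenizer (lowercase, words of two or more characters) is a simplification, not necessarily what a library does internally.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    tokenized = [tokenize(d) for d in docs]
    # Step 1: vocabulary of unique tokens, indexed in alphabetical order
    unique_tokens = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: i for i, tok in enumerate(unique_tokens)}
    # Step 2: one count vector per document, one position per vocabulary entry
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

docs = ["The sun is shining", "The weather is sweet"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # {'is': 0, 'shining': 1, 'sun': 2, 'sweet': 3, 'the': 4, 'weather': 5}
print(vectors)  # [[1, 1, 1, 0, 1, 0], [1, 0, 0, 1, 1, 1]]
```

Even in this tiny example each vector already contains zeros; with a vocabulary of tens of thousands of words, almost every entry is zero.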
2.1Transforming words into feature vectors
# The n-gram range can be adjusted via the ngram_range parameter (default: 1-grams)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)
print(bag.toarray())
--------------------------------------------------
{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[[0 1 0 1 1 0 1 0 0]
[0 1 0 0 0 1 1 0 1]
[2 3 2 1 1 1 2 1 1]]
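The vocabulary above is built from single words (1-grams), which is the default of CountVectorizer. To show what a larger n means, here is a small pure-Python sketch; the helper name ngrams is hypothetical, and the whitespace split stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    # Slide a window of n consecutive tokens over the sequence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the sun is shining".split()
print(ngrams(tokens, 1))  # ['the', 'sun', 'is', 'shining']
print(ngrams(tokens, 2))  # ['the sun', 'sun is', 'is shining']
```

In scikit-learn the same effect is obtained by passing, for example, `ngram_range=(2, 2)` to CountVectorizer, so that the vocabulary consists of 2-grams instead of single words.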
3.Assessing word relevancy via term frequency-inverse document frequency
tf-idf, short for term frequency-inverse document frequency, is defined as the product of the term frequency and the inverse document frequency:
$$tf\text{-}idf(t,d) = tf(t,d) \times idf(t,d)$$
where $tf(t,d)$ means the number of times a term $t$ occurs in a document $d$.
Here, $idf(t,d)$ can be calculated as:
$$idf(t,d) = \log\frac{n_d}{1+df(d,t)}$$
where $n_d$ is the total number of documents, and d f (