Machine Learning Reading Notes (8): Sentiment Analysis (Natural Language Processing, i.e. NLP)

Flying Squirrel

Article link: https://blog.csdn.net/weixin_45783752/article/details/104076653

Sentiment Analysis

1.Obtaining IMDb movie review dataset
2.Bag-of-words model
- 2.1Transforming words into feature vectors
3.Assessing word relevancy via term frequency-inverse document frequency
4.Cleaning text data
5.Training a logistic regression model for document classification
Working with bigger data - online algorithms and out-of-core learning

Learning objectives:

Cleaning and preparing text data
Building feature vectors from text documents
Training a machine learning model to classify positive and negative reviews
Working with large text datasets using out-of-core learning

1.Obtaining IMDb movie review dataset

Introduction:

Sentiment analysis is sometimes also called opinion mining.
The data used here is the IMDb (Internet Movie Database) movie review dataset collected by Maas et al.

The author, Sebastian Raschka, downloads the review files and assembles them into a single file with Python's os module, which is somewhat tedious to reproduce. We therefore skip that step and start directly from the resulting movie_data.csv file.
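For readers who do want to rebuild the file, the following is a minimal sketch rather than the book's actual code. It assumes the unpacked archive has the layout `aclImdb/{train,test}/{pos,neg}/*.txt`, and the function name `build_movie_csv` and the column names `review` and `sentiment` are choices made for this illustration.

```python
import csv
import os


def build_movie_csv(base_dir, out_path):
    """Collect reviews from base_dir/{train,test}/{pos,neg}/*.txt into one CSV.

    Assumes the directory layout of the unpacked archive (an assumption of
    this sketch). Returns the number of reviews written.
    """
    labels = {"pos": 1, "neg": 0}  # 1 = positive review, 0 = negative review
    n_rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["review", "sentiment"])  # header row
        for split in ("test", "train"):
            for name, label in labels.items():
                folder = os.path.join(base_dir, split, name)
                if not os.path.isdir(folder):
                    continue  # skip missing folders instead of failing
                for fname in sorted(os.listdir(folder)):
                    if not fname.endswith(".txt"):
                        continue
                    path = os.path.join(folder, fname)
                    with open(path, encoding="utf-8") as review:
                        writer.writerow([review.read(), label])
                        n_rows += 1
    return n_rows


# Hypothetical usage, once the archive has been unpacked next to the script:
# build_movie_csv("aclImdb", "movie_data.csv")
```

The sketch uses only the standard library. The rows come out grouped by folder, so they should be shuffled before training.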

2.Bag-of-words model

The bag-of-words model allows us to represent text as numerical feature vectors. It can be summarized in two steps:

Create a vocabulary of unique tokens
Construct a feature vector from the counts of words

Note that the feature vectors are sparse: each document contains only a small subset of the words in the vocabulary, so the vectors consist mostly of zeros.
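To make the two steps concrete, here is a small from-scratch sketch in plain Python. It is only an illustration, not how any library implements the model. The helper names `tokenize` and `count_vector` are hypothetical, and the tokenization rule (lowercase, keep runs of two or more word characters) is an assumption of this sketch.

```python
import re
from collections import Counter

docs = [
    "The sun is shining",
    "The weather is sweet",
    "The sun is shining, the weather is sweet, and one and one is two",
]


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Step 1: build a vocabulary of unique tokens, each mapped to a column index.
unique_tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}


# Step 2: build one vector of word counts per document.
def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]


bag = [count_vector(doc) for doc in docs]

# Fraction of zero entries: a rough measure of how sparse the vectors are.
zero_fraction = sum(v == 0 for row in bag for v in row) / (len(bag) * len(vocabulary))

print(vocabulary)
print(bag)
print(zero_fraction)
```

With only nine words in the vocabulary the zero fraction stays modest. On a real corpus the vocabulary has tens of thousands of entries while each review uses a few hundred of them, which is why sparse matrix storage is used in practice.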

2.1Transforming words into feature vectors

# To count n-grams instead of single words, pass ngram_range, e.g. CountVectorizer(ngram_range=(2, 2))
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)
print(count.vocabulary_)
print(bag.toarray())
--------------------------------------------------
{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}
[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]

3.Assessing word relevancy via term frequency-inverse document frequency

tf-idf, short for term frequency-inverse document frequency, is defined as the product of the term frequency and the inverse document frequency:
$tf\text{-}idf(t,d)=tf(t,d)\times idf(t,d)$
where $tf(t,d)$ is the number of times a term $t$ occurs in a document $d$.
The inverse document frequency $idf(t,d)$ is calculated as:
$idf(t,d)=\log\frac{n_d}{1+df(d,t)}$
where $n_d$ is the total number of documents, and