Table of Contents
- Installing Jupyter Notebook
- 1. Text Sentiment Analysis: English & Chinese
  - Extracting bag-of-words features (TF, with sklearn's CountVectorizer)
  - Word2Vec
- 2. Chinese Application: chinese-sentiment-analysis
Installing Jupyter Notebook
.ipynb is, as the name suggests, an IPython Notebook file.
C:\Users\Administrator>python -m pip install jupyter notebook
Collecting jupyter
Downloading https://files.pythonhosted.org/packages/83/df/0f5dd132200728a86190397e1ea87cd76244e42d39ec5e88efd25b2abd7e/jupyter-1.0.0-py2.py3-none-any.whl
Collecting notebook
Downloading https://files.pythonhosted.org/packages/5e/7c/7fd8e9584779d65dfcad9fa2e09c76131a41f999f853a9c7026ed8585586/notebook-5.6.0-py2.py3-none-any.whl (8.9MB)
100% |████████████████████████████████| 8.9MB 227kB/s
After installation, typing jupyter notebook in cmd opens a page in the browser; first upload the .ipynb file there.
C:\Users\Administrator>jupyter notebook
[I 16:47:35.666 NotebookApp] Writing notebook server cookie secret to C:\Users\Administrator\AppData\Roaming\jupyter\runtime\notebook_cookie_secret
Then click the uploaded .ipynb file and press the Run button in the toolbar; the output appears below the cell on the page.
Clicking a cell's margin turns its border blue, meaning the cell is selected (command mode).
Clicking into the code turns the border green, meaning the cell is in edit mode.
With a cell selected (blue border), pressing the lowercase L key toggles line numbers.
Inside a cell you can press Tab for auto-completion, which is extremely handy.
1. Text Sentiment Analysis: English & Chinese
First example: https://www.kaggle.com/c/word2vec-nlp-tutorial/data
Bag of Words Meets Bags of Popcorn
1. Basic text preprocessing techniques (HTML parsing, text extraction, regular expressions, etc.)
2. word2vec word embeddings and machine-learning models for sentiment analysis
Data
Data Set
The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
File descriptions
- labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.
- testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.
- unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review.
- sampleSubmission - A comma-delimited sample submission file in the correct format.
Data fields
- id - Unique ID of each review
- sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
- review - Text of the review
Approach 1: bag_of_words_model
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import warnings  # suppress sklearn warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='sklearn')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
pandas loads the CSV/TSV into a DataFrame (an Excel-like table), BeautifulSoup parses the HTML, and sklearn extracts the text features.
import nltk
# nltk.download()  # run once to download the NLTK corpora (including stopwords)
from nltk.corpus import stopwords
Load the training data with pandas
datafile = os.path.join('E:/AI/NLP/NLTK/Python/3/', 'data', 'labeledTrainData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df.head()
Number of reviews: 25000
Out[19]:
| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hin... |
Take a quick look at the first review:
df['review'][0]
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
Preprocess the review data; the main steps are:
- Strip the HTML tags
- Remove punctuation
- Split into words/tokens
- Remove stopwords
- Rejoin the words into a new "sentence"
def display(text, title):
    print(title)
    print("\n---------- divider ----------\n")
    print(text)
Let's display the data for the second review:
raw_example = df['review'][1]
display(raw_example, 'Raw data')
Raw data
---------- divider ----------
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
Data with HTML tags removed
example = BeautifulSoup(raw_example, 'html.parser').get_text()
display(example, 'Data with HTML tags removed')
Data with HTML tags removed
---------- divider ----------
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those who watched his film with me, appreciated the fact that it was not the standard, predictable Hollywood fare that comes out every year, e.g. the Spielberg version with Tom Cruise that had only the slightest resemblance to the book. Obviously, everyone looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate a movie on more important bases,like being entertained, which is why most people never agree with the "critics". We enjoyed the effort Mr. Hines put into being faithful to H.G. Wells' classic novel, and we found it to be very entertaining. This made it easy to overlook what the "critics" perceive to be its shortcomings.
Data with punctuation removed
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
display(example_letters, 'Data with punctuation removed')
Data with punctuation removed
---------- divider ----------
The Classic War of the Worlds by Timothy Hines is a very entertaining film that obviously goes to great effort and lengths to faithfully recreate H G Wells classic book Mr Hines succeeds in doing so I and those who watched his film with me appreciated the fact that it was not the standard predictable Hollywood fare that comes out every year e g the Spielberg version with Tom Cruise that had only the slightest resemblance to the book Obviously everyone looks for different things in a movie Those who envision themselves as amateur critics look only to criticize everything they can Others rate a movie on more important bases like being entertained which is why most people never agree with the critics We enjoyed the effort Mr Hines put into being faithful to H G Wells classic novel and we found it to be very entertaining This made it easy to overlook what the critics perceive to be its shortcomings
Token list
words = example_letters.lower().split()  # lowercase everything, then split on whitespace
display(words, 'Token list')
Token list
---------- divider ----------
['the', 'classic', 'war', 'of', 'the', 'worlds', 'by', 'timothy', 'hines', 'is', 'a', 'very', 'entertaining', 'film', 'that', 'obviously', 'goes', 'to', 'great', 'effort', 'and', 'lengths', 'to', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'in', 'doing', 'so', 'i', 'and', 'those', 'who', 'watched', 'his', 'film', 'with', 'me', 'appreciated', 'the', 'fact', 'that', 'it', 'was', 'not', 'the', 'standard', 'predictable', 'hollywood', 'fare', 'that', 'comes', 'out', 'every', 'year', 'e', 'g', 'the', 'spielberg', 'version', 'with', 'tom', 'cruise', 'that', 'had', 'only', 'the', 'slightest', 'resemblance', 'to', 'the', 'book', 'obviously', 'everyone', 'looks', 'for', 'different', 'things', 'in', 'a', 'movie', 'those', 'who', 'envision', 'themselves', 'as', 'amateur', 'critics', 'look', 'only', 'to', 'criticize', 'everything', 'they', 'can', 'others', 'rate', 'a', 'movie', 'on', 'more', 'important', 'bases', 'like', 'being', 'entertained', 'which', 'is', 'why', 'most', 'people', 'never', 'agree', 'with', 'the', 'critics', 'we', 'enjoyed', 'the', 'effort', 'mr', 'hines', 'put', 'into', 'being', 'faithful', 'to', 'h', 'g', 'wells', 'classic', 'novel', 'and', 'we', 'found', 'it', 'to', 'be', 'very', 'entertaining', 'this', 'made', 'it', 'easy', 'to', 'overlook', 'what', 'the', 'critics', 'perceive', 'to', 'be', 'its', 'shortcomings']
Data with stopwords removed
from nltk.corpus import stopwords
words_nostop = [w for w in words if w not in stopwords.words('english')]
display(words_nostop, 'Data with stopwords removed')
Data with stopwords removed
---------- divider ----------
['classic', 'war', 'worlds', 'timothy', 'hines', 'entertaining', 'film', 'obviously', 'goes', 'great', 'effort', 'lengths', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'watched', 'film', 'appreciated', 'fact', 'standard', 'predictable', 'hollywood', 'fare', 'comes', 'every', 'year', 'e', 'g', 'spielberg', 'version', 'tom', 'cruise', 'slightest', 'resemblance', 'book', 'obviously', 'everyone', 'looks', 'different', 'things', 'movie', 'envision', 'amateur', 'critics', 'look', 'criticize', 'everything', 'others', 'rate', 'movie', 'important', 'bases', 'like', 'entertained', 'people', 'never', 'agree', 'critics', 'enjoyed', 'effort', 'mr', 'hines', 'put', 'faithful', 'h', 'g', 'wells', 'classic', 'novel', 'found', 'entertaining', 'made', 'easy', 'overlook', 'critics', 'perceive', 'shortcomings']
All of the steps above can be combined into one cleaning function (it returns the surviving words rejoined into a single string, the form CountVectorizer expects):
eng_stopwords = set(stopwords.words('english'))

def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)

display(clean_text(raw_example), 'Cleaned data')
Cleaned data
---------- divider ----------
classic war worlds timothy hines entertaining film obviously goes great effort lengths faithfully recreate h g wells classic book mr hines succeeds watched film appreciated fact standard predictable hollywood fare comes every year e g spielberg version tom cruise slightest resemblance book obviously everyone looks different things movie envision amateur critics look criticize everything others rate movie important bases like entertained people never agree critics enjoyed effort mr hines put faithful h g wells classic novel found entertaining made easy overlook critics perceive shortcomings
Add the cleaned text to the DataFrame
Apply the cleaning function to every review:
df['clean_review'] = df.review.apply(clean_text)
print(df.head())
id ... clean_review
0 5814_8 ... stuff going moment mj started listening music ...
1 2381_9 ... classic war worlds timothy hines entertaining ...
2 7759_3 ... film starts manager nicholas bell giving welco...
3 3630_4 ... must assumed praised film greatest filmed oper...
4 9495_8 ... superbly trashy wondrously unpretentious explo...
Extract bag-of-words features (TF, with sklearn's CountVectorizer)
vectorizer = CountVectorizer(max_features=5000)  # rank terms by corpus-wide term frequency (TF) and keep only the top 5000 as the vocabulary
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()  # fit_transform builds the document-term matrix; toarray converts the sparse result to a dense array
print(train_data_features.shape)
(25000, 5000)
That is, 25,000 rows (one per review), each with 5,000 features (one per vocabulary term).
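As a side note, here is a toy illustration (a hypothetical two-document corpus, not part of the tutorial data) of what CountVectorizer actually produces; the vocabulary is sorted alphabetically and each cell is a raw term count:

from sklearn.feature_extraction.text import CountVectorizer
toy = ['great movie great acting', 'boring movie']
toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy)
print(toy_vec.get_feature_names())  # ['acting', 'boring', 'great', 'movie']  (use get_feature_names_out() on scikit-learn >= 1.0)
print(toy_X.toarray())
# [[1 0 2 1]
#  [0 1 0 1]]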
train_data_features
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
Train a classifier
forest = RandomForestClassifier(n_estimators=100)  # a random forest with 100 trees
forest = forest.fit(train_data_features, df.sentiment)  # fit on the training features and labels
Run predict on the training set to see what it looks like:
print (confusion_matrix(df.sentiment, forest.predict(train_data_features)))
array([[12500, 0],
[ 0, 12500]], dtype=int64)
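A perfect confusion matrix on the training set only shows that the forest fit the training data; it says nothing about generalization. For a more honest estimate before submitting, you could cross-validate (a minimal sketch of my own, not part of the original tutorial; expect a score well below 1.0):

from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy on the training data
scores = cross_val_score(RandomForestClassifier(n_estimators=100), train_data_features, df.sentiment, cv=5)
print(scores.mean())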
Delete variables we no longer need, to free memory
del df
del train_data_features
Read the test data and predict
datafile = os.path.join('..', 'data', 'testData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df['clean_review'] = df.review.apply(clean_text)
df.head()
test_data_features = vectorizer.transform(df.clean_review).toarray()
test_data_features.shape
result = forest.predict(test_data_features)  # predicted labels
output = pd.DataFrame({'id':df.id, 'sentiment':result})
print(output.head())
output.to_csv(os.path.join('..', 'data', 'Bag_of_Words_model.csv'), index=False)
del df
del test_data_features
Number of reviews: 25000
Out[84]:
| | id | review | clean_review |
|---|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... | naturally film main themes mortality nostalgia... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... | movie disaster within disaster film full great... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... | movie kids saw tonight child loved one point k... |
The resulting Bag_of_Words_model.csv contains an id and a sentiment column for each test review.
Approach 2: Word2Vec
import os
import re
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings(action='ignore',category=UserWarning,module="gensim")
from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
Define a helper function to read the TSV files:
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join('..', 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df
Load the unlabeled data
It is used to train the word2vec embeddings.
df = load_dataset('unlabeled_train')
print(df.head())
Number of reviews: 50000
id review
0 9999_0 Watching Time Chasers, it obvious that it was ...
1 45057_0 I saw this film about 20 years ago and remembe...
2 15561_0 Minor Spoilers<br /><br />In New York, Joan Ba...
3 7161_0 I went to see this film with a great deal of e...
4 43971_0 Yes, I agree with everyone on this site this m...
Preprocess the data the same way as in the first notebook.
The one difference: stopword removal is now optional, and clean_text returns a list of words instead of a joined string.
from nltk.corpus import stopwords
eng_stopwords = set(stopwords.words('english'))

def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # pre-trained punkt sentence splitter

def print_call_counts(f):
    n = 0
    def wrapped(*args, **kwargs):
        nonlocal n
        n += 1
        if n % 1000 == 1:
            print('method {} called {} times'.format(f.__name__, n))
        return f(*args, **kwargs)
    return wrapped

# When the interpreter sees the @ syntax, it evaluates the expression after @, passes the
# function defined on the next line to it as an argument, and rebinds that name to the result.
@print_call_counts
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())  # strip removes surrounding whitespace; tokenize splits the review into sentences
    sentences = [clean_text(s) for s in raw_sentences if s]
    return sentences
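If the decorator syntax is unfamiliar: @print_call_counts is just shorthand for split_sentences = print_call_counts(split_sentences). A minimal standalone illustration (hypothetical names, for demonstration only):

def announce(f):
    def wrapped(*args, **kwargs):
        print('calling', f.__name__)
        return f(*args, **kwargs)
    return wrapped

@announce            # same as: greet = announce(greet)
def greet(name):
    return 'hello ' + name

print(greet('world'))  # prints "calling greet", then "hello world"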
sentences = sum(df.review.apply(split_sentences), [])
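sum(df.review.apply(split_sentences), []) flattens the per-review lists of sentences into one big list. Note that sum on lists copies the accumulator at every step, which is quadratic; itertools.chain does the same flattening in linear time, so on 50,000 reviews you may prefer:

from itertools import chain
# equivalent to sum(..., []) but linear-time
sentences = list(chain.from_iterable(df.review.apply(split_sentences)))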
Output:
................
method split_sentences called 46001 times
method split_sentences called 47001 times
method split_sentences called 48001 times
method split_sentences called 49001 times
Train the word-embedding model with gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Word2Vec training parameters
num_features = 300    # embedding dimensionality; 300 or 500 is a common choice
min_word_count = 40   # ignore words occurring fewer than 40 times
num_workers = 4       # number of training threads
context = 10          # context window size
downsampling = 1e-3   # downsample setting for frequent words
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)  # file name to save the model under
print('Training model...')
model = Word2Vec(sentences, workers=num_workers, size=num_features, min_count=min_word_count, window=context, sample=downsampling)
# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
model.init_sims(replace=True)
# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model.save(os.path.join('..', 'data', model_name))
.....
2018-08-16 17:10:55,051 : INFO : worker thread finished; awaiting finish of 0 more threads
2018-08-16 17:10:55,052 : INFO : EPOCH - 5 : training on 11877527 raw words (8394318 effective words) took 19.0s, 442045 effective words/s
2018-08-16 17:10:55,053 : INFO : training on a 59387635 raw words (41968004 effective words) took 95.8s, 438035 effective words/s
2018-08-16 17:14:34,696 : INFO : precomputing L2-norms of word weight vectors
2018-08-16 17:14:35,547 : INFO : saving Word2Vec object under ..\data\300features_40minwords_10context.model, separately None
2018-08-16 17:14:35,554 : INFO : not storing attribute vectors_norm
2018-08-16 17:14:35,562 : INFO : not storing attribute cum_table
2018-08-16 17:14:35,886 : INFO : saved ..\data\300features_40minwords_10context.model
Let's see how the trained word vectors turned out:
print(model.most_similar("man"))
[('woman', 0.6256189346313477),
('lady', 0.5953349471092224),
('lad', 0.576863169670105),
('person', 0.5407935380935669),
('farmer', 0.5382746458053589),
('chap', 0.536788821220398),
('soldier', 0.5292650461196899),
('men', 0.5261573791503906),
('monk', 0.5237958431243896),
('guy', 0.5213091373443604)]
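A few more quick probes of the embedding space (my additions; they use the pre-1.0 gensim API the tutorial already relies on, so on gensim >= 1.0 call them via model.wv instead, and they assume each word cleared the min_count threshold):

print(model.most_similar('awful'))  # should surface other strongly negative adjectives
print(model.doesnt_match(['comedy', 'thriller', 'romance', 'pizza']))  # pick the odd one out
print(model.similarity('film', 'movie'))  # cosine similarity of two near-synonyms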
Approach 2, continued: using the trained Word2Vec model
The trained model was saved earlier under the file name 300features_40minwords_10context.model.
import warnings
warnings.filterwarnings('ignore')
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models.word2vec import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans
The same helpers as before:
def load_dataset(name, nrows=None):
    datasets = {
        'unlabeled_train': 'unlabeledTrainData.tsv',
        'labeled_train': 'labeledTrainData.tsv',
        'test': 'testData.tsv'
    }
    if name not in datasets:
        raise ValueError(name)
    data_file = os.path.join('..', 'data', datasets[name])
    df = pd.read_csv(data_file, sep='\t', escapechar='\\', nrows=nrows)
    print('Number of reviews: {}'.format(len(df)))
    return df
eng_stopwords = set(stopwords.words('english'))

def clean_text(text, remove_stopwords=False):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in eng_stopwords]
    return words
Load the previously trained Word2Vec model
model_name = '300features_40minwords_10context.model'
model = Word2Vec.load(os.path.join('..', 'data', model_name))
We can use the word2vec result to encode each review as a vector.
The encoding is a bit crude: simply average the word vectors of all the words in the review.
df = load_dataset('labeled_train')
df.head()
Number of reviews: 25000
Out[14]:
| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | "The Classic War of the Worlds" by Timothy Hin... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
def to_review_vector(review):
    words = clean_text(review, remove_stopwords=True)
    array = np.array([model[w] for w in words if w in model])
    return pd.Series(array.mean(axis=0))
train_data_features = df.review.apply(to_review_vector)
print(train_data_features.head())
0 1 2 3 4 ... 295 296 297 298 299
0 -0.002746 0.005741 0.004646 -0.001938 0.009835 ... -0.005903 0.010316 0.000723 -0.014974 -0.007718
1 -0.003350 -0.006660 0.000073 0.004966 0.001066 ... 0.014924 0.002365 0.012350 -0.006034 -0.025690
2 -0.016884 -0.006035 0.000061 0.003758 0.008695 ... 0.006264 0.002883 0.002217 -0.026501 -0.041674
3 -0.009798 -0.000712 0.006659 -0.017110 0.006017 ... 0.015451 0.011731 0.008902 -0.020935 -0.036668
4 -0.008019 -0.006775 0.009767 0.002874 0.014989 ... -0.000688 -0.000424 -0.003103 -0.031588 -0.019807
[5 rows x 300 columns]
Build a classifier with a random forest
forest = RandomForestClassifier(n_estimators = 100, random_state=42)
forest = forest.fit(train_data_features, df.sentiment)
Again, try it on the training set to make sure the model works at all:
confusion_matrix(df.sentiment, forest.predict(train_data_features))
Clean up variables to free memory
del df
del train_data_features
Predict on the test set and submit to Kaggle
df = load_dataset('test')
df.head()
test_data_features = df.review.apply(to_review_vector)
print(test_data_features.head())
result = forest.predict(test_data_features)
output = pd.DataFrame({'id':df.id, 'sentiment':result})
output.to_csv(os.path.join('..', 'data', 'Word2Vec_model.csv'), index=False)
output.head()
del df
del test_data_features
del forest
Number of reviews: 25000
0 1 2 3 4 ... 295 296 297 298 299
0 0.003222 -0.002921 0.009352 -0.027743 0.018592 ... 0.011904 0.004627 0.015087 -0.016692 -0.018632
1 -0.013426 0.003515 0.002579 -0.022269 -0.009693 ... -0.008517 -0.005674 -0.007146 -0.026965 -0.019395
2 0.001031 -0.001867 0.021952 -0.033233 0.005209 ... -0.004877 0.008913 0.017697 -0.007476 -0.006233
3 -0.014347 0.002951 0.022032 -0.009660 0.005736 ... 0.003137 0.004633 0.020197 -0.016389 -0.033783
4 -0.000612 -0.006142 0.000142 -0.000970 0.011840 ... 0.007408 -0.011372 0.014652 -0.018350 -0.011623
[5 rows x 300 columns]
2. Chinese Application: chinese-sentiment-analysis
# We again use gensim for the word2vec processing, and sklearn's SVM for the model
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec
import numpy as np
import pandas as pd
import jieba
from sklearn.externals import joblib
from sklearn.svm import SVC
import sys
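One portability note before the code: from sklearn.externals import joblib works on the scikit-learn version this tutorial was written against, but that module was deprecated in 0.21 and removed in 0.23; on a newer environment, install the standalone joblib package and import it directly:

# pip install joblib
import joblib  # drop-in replacement for sklearn.externals.joblib on modern scikit-learn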
# Load the data, preprocess it (word segmentation), and split into train and test sets
def load_file_and_preprocessing():
    neg = pd.read_excel('../data/neg.xls', header=None, index=None)
    pos = pd.read_excel('../data/pos.xls', header=None, index=None)
    cw = lambda x: list(jieba.cut(x))  # segment each review into words with jieba
    pos['words'] = pos[0].apply(cw)
    neg['words'] = neg[0].apply(cw)
    #print(pos['words'])
    # use 1 for positive sentiment, 0 for negative
    y = np.concatenate((np.ones(len(pos)), np.zeros(len(neg))))  # ones/zeros build the label arrays, concatenate joins them
    # train_test_split randomly splits the samples and labels into training and test subsets
    x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos['words'], neg['words'])), y, test_size=0.2)
    np.save('../data/y_train.npy', y_train)
    np.save('../data/y_test.npy', y_test)
    return x_train, x_test
# Average all the word vectors in a sentence to build the sentence vector
def build_sentence_vector(text, size, imdb_w2v):
    vec = np.zeros(size).reshape((1, size))
    count = 0
    for word in text:
        try:
            vec += imdb_w2v[word].reshape((1, size))  # element-wise sum
            count += 1.
        except KeyError:
            continue  # skip words not in the vocabulary
    if count != 0:
        vec /= count
    return vec
# Compute the word vectors
def get_train_vecs(x_train, x_test):
    n_dim = 300
    # initialize the model and the vocabulary
    imdb_w2v = Word2Vec(size=n_dim, min_count=10)
    imdb_w2v.build_vocab(x_train)
    # train on the review training set (this may take a few minutes)
    imdb_w2v.train(x_train, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)
    train_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_train])
    np.save('../data/train_vecs.npy', train_vecs)
    print(train_vecs.shape)
    # continue training on the test set
    imdb_w2v.train(x_test, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)
    imdb_w2v.save('../data/w2v_model.pkl')
    # build test review vectors, then scale
    test_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_test])
    np.save('../data/test_vecs.npy', test_vecs)
    print(test_vecs.shape)
def get_data():
    train_vecs = np.load('../data/train_vecs.npy')
    y_train = np.load('../data/y_train.npy')
    test_vecs = np.load('../data/test_vecs.npy')
    y_test = np.load('../data/y_test.npy')
    return train_vecs, y_train, test_vecs, y_test
# Train the SVM model
def svm_train(train_vecs, y_train, test_vecs, y_test):
    clf = SVC(kernel='rbf', verbose=True)
    clf.fit(train_vecs, y_train)
    joblib.dump(clf, '../data/model.pkl')
    print(clf.score(test_vecs, y_test))
# Build the vector for a sentence to be classified
def get_predict_vecs(words):
    n_dim = 300
    imdb_w2v = Word2Vec.load('../data/w2v_model.pkl')
    train_vecs = build_sentence_vector(words, n_dim, imdb_w2v)
    return train_vecs
# Judge the sentiment of a single sentence
def svm_predict(string):
    words = jieba.lcut(string)
    words_vecs = get_predict_vecs(words)
    clf = joblib.load('../data/model.pkl')
    result = clf.predict(words_vecs)
    if int(result[0]) == 1:
        print(string, ' positive')
    else:
        print(string, ' negative')
# Initial training: the first run generates model.pkl; after that, these steps can be skipped
#x_train,x_test = load_file_and_preprocessing()
#get_train_vecs(x_train,x_test)
#train_vecs,y_train,test_vecs,y_test = get_data()
#svm_train(train_vecs, y_train, test_vecs, y_test)
## Judge the sentiment of the input sentences
string = '电池充完了电连手机都打不开.简直烂的要命.真是金玉其外,败絮其中!连5号电池都不如'  # a strongly negative phone review
svm_predict(string)
string = '牛逼的手机,从3米高的地方摔下去都没坏,质量非常好'  # a strongly positive phone review
svm_predict(string)
Loading model cost 0.742 seconds.
Prefix dict has been built succesfully.
电池充完了电连手机都打不开.简直烂的要命.真是金玉其外,败絮其中!连5号电池都不如 negative
牛逼的手机,从3米高的地方摔下去都没坏,质量非常好 positive