The Road to Learning Natural Language Processing, Part 10: Sentiment Classification of Movie Reviews with Bag of Words and word2vec

驭风少年君

Category: Natural Language Processing. Tags: word2vec, natural language processing, python

Article link: https://blog.csdn.net/qq_44951759/article/details/120452784

Diligence is the path up the mountain of books; hard work is the boat across the endless sea of learning.

1. Data preprocessing

1.1 Data cleaning

Import the libraries

import os
import re
import numpy as np
import pandas as pd

from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords

nltk.download()   # download the NLTK data; NLTK cannot segment Chinese text, but otherwise it works well

Read the training data with pandas

df = pd.read_csv("  ", sep="\t", escapechar="\\")
print("Number of reviews: {}".format(len(df)))
df.head()

Number of reviews: 25000

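Preprocessing the review text involves roughly the following steps:

1. Remove the HTML tags
2. Remove punctuation
3. Split into words/tokens
4. Remove stop words
5. Rejoin the words into a new sentence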

df['review'][1000]

# Example: strip the HTML tags from one review
example = BeautifulSoup(df['review'][1000],'html.parser').get_text()
example

# Remove punctuation
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)  # replace every non-letter character with a space
example_letters

# Lowercase and split into words
words = example_letters.lower().split()
words

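# Remove stop words
# " 停用词表 " ("stop-word list") is a placeholder for the path to a stop-word file, one word per line.
# Note that this rebinds the name stopwords imported from nltk.corpus above.
stopwords = {}.fromkeys([line.strip() for line in open(" 停用词表 ")])
words_nostop = [w for w in words if w not in stopwords]
words_nostop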

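Wrap the steps above in a function

eng_stopwords = set(stopwords)  # build the English stop-word set, dropping duplicates

def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()   # strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)                 # replace non-letters with spaces
    words = text.lower().split()                           # lowercase and split into words
    words = [w for w in words if w not in eng_stopwords]   # drop stop words
    return " ".join(words)

df['review'][1000]
clean_text(df['review'][1000])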
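The five steps can also be checked on a made-up string without pandas, BeautifulSoup, or a stop-word file. The sketch below is only an illustration: `clean_text_sketch`, `TOY_STOPWORDS`, and the sample review are hypothetical, and a crude regex stands in for BeautifulSoup's tag stripping.

```python
import re

# Hypothetical mini stop-word set, standing in for the stop-word file.
TOY_STOPWORDS = {"the", "a", "is", "it", "this", "and", "of"}

def clean_text_sketch(text, stop_words=TOY_STOPWORDS):
    # Step 1: strip HTML tags (crude regex stand-in for BeautifulSoup).
    text = re.sub(r"<[^>]+>", " ", text)
    # Step 2: replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # Step 3: lowercase and split on whitespace.
    words = text.lower().split()
    # Step 4: drop stop words.
    words = [w for w in words if w not in stop_words]
    # Step 5: rejoin the remaining words into one string.
    return " ".join(words)

sample = "This movie is GREAT!<br />It's a must-see, 10/10."
print(clean_text_sketch(sample))  # movie great s must see
```

The stray "s" comes from "It's": replacing the apostrophe with a space splits the contraction, which is a known side effect of letter-only cleaning.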

1.2 Add the cleaned text to the DataFrame

df['clean_review'] = df.review.apply(clean_text)  # apply the cleaning function to every review
df.head()

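2. Text features: the bag-of-words model

The bag-of-words model represents each text by how often each word occurs in it, ignoring word order.

Extract the bag-of-words features with sklearn's CountVectorizer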

vectorizer = CountVectorizer(max_features=50000)  # keep only the 50,000 most frequent words so the matrix does not become too sparse
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()
train_data_features
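To make the resulting matrix concrete, here is a hand-rolled sketch of the same idea in plain Python. It is not CountVectorizer itself: the three documents and the tiny `max_features` value are made up, and the code only mimics the behavior of keeping the most frequent words and counting them per document.

```python
from collections import Counter

# Hypothetical, already-cleaned toy documents.
docs = ["good movie good plot", "bad movie", "good acting bad plot"]

# Corpus-wide term counts decide which words survive the max_features cut.
total = Counter(w for d in docs for w in d.split())
max_features = 4  # toy value; the article uses a much larger limit
vocab = sorted(w for w, _ in total.most_common(max_features))

# One row per document, one column per kept word, each cell a raw count.
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)   # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)
# [0, 2, 1, 1]
# [1, 0, 1, 0]
# [1, 1, 0, 1]
```

"acting" occurs only once in the corpus, so it falls outside the four kept features and leaves no trace in the matrix.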

Split into training and test sets

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(train_data_features, df.sentiment, test_size=0.2, random_state=0)

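Plot the confusion matrix


import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Write each count inside its cell (the original listing breaks off above; this part is a reconstruction)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')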

Train the classifier

LR_model = LogisticRegression()
LR_model = LR_model.fit(X_train, y_train)
y_pred = LR_model.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)

print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("accuracy metric in the testing dataset: ", (cnf_matrix[1,1]+cnf_matrix[0,0])/cnf_matrix.sum())

# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
	,classes=class_names
	,title='Confusion matrix')
plt.show()

Recall metric in the testing dataset:  0.853181076672
accuracy metric in the testing dataset:  0.8454

Confusion matrix: the larger the values on the main diagonal, the better.

Accuracy: (2135 + 2092) / (2135 + 2092 + 413 + 360) = 0.8454
Recall: 2092 / (360 + 2092) ≈ 0.8532
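The arithmetic can be reproduced directly from the four counts. The matrix layout below is inferred from the formulas in the text and follows sklearn's convention (rows are true labels, columns are predicted labels); precision is added only to separate it from accuracy.

```python
# Confusion matrix reconstructed from the counts quoted above
# (rows = true label, columns = predicted label).
cm = [[2135, 413],
      [360, 2092]]

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all reviews classified correctly
recall = tp / (fn + tp)                      # share of positive reviews that were found
precision = tp / (fp + tp)                   # share of positive predictions that were right

print("accuracy:", accuracy)   # 0.8454
print("recall:", recall)
print("precision:", precision)
```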

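3. Text modeling with word2vec

3.1 Preparing the input data for word2vec

word2vec models individual words, not sentences.

Tokenize first: word2vec expects its input as a list of lists, in which each inner list holds the words of one sentence.

Clean, then tokenize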

import warnings
warnings.filterwarnings("ignore")

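tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # load NLTK's English sentence tokenizer; for Chinese, use Jieba

def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())        # split the review into sentences
    sentences = [clean_text(s) for s in raw_sentences if s]   # clean each sentence
    return sentences

# review_part is not defined in the original; it is assumed to be a pandas Series of raw reviews, e.g. df['review']
sentences = sum(review_part.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(review_part), len(sentences)))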

The result is a flat list of cleaned sentences (strings); it still has to be turned into a list of lists of individual words.

sentences_list = []
for line in sentences:
    sentences_list.append(nltk.word_tokenize(line))
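The difference between the two shapes is easy to see on toy data. This sketch uses `str.split` in place of `nltk.word_tokenize` so that it runs without the NLTK data files; the two sentences are made up.

```python
# Hypothetical cleaned sentences, as produced by the sentence-splitting step.
sentences = ["movie great must see", "plot weak acting strong"]

# Target shape: one inner list of word tokens per sentence.
sentences_list = [s.split() for s in sentences]
print(sentences_list)
# [['movie', 'great', 'must', 'see'], ['plot', 'weak', 'acting', 'strong']]

# Why the conversion matters: iterating over a plain string yields characters,
# so a list of strings would be read as sentences made of single letters.
print(list(sentences[0])[:5])
# ['m', 'o', 'v', 'i', 'e']
```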

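3.2 Training the word2vec model

word2vec parameters

sentences: the training corpus; can be a list of tokenized sentences.
sg: selects the training algorithm. The default, 0, is CBOW; sg=1 uses skip-gram.
size: dimensionality of the word vectors, default 100. Larger values need more training data but can give better results; typical values range from tens to a few hundred. (gensim 4.0 renamed this parameter to vector_size.)
window: maximum distance between the current word and the predicted word within a sentence, i.e. the size of the sliding window.
alpha: the initial learning rate.
seed: seed for the random number generator; it affects how the word vectors are initialized.
min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded; the default is 5.
max_vocab_size: limits RAM during vocabulary building. If there are more unique words than this, the least frequent ones are pruned. Every 10 million unique words need about 1 GB of RAM. Set to None for no limit.
workers: number of worker threads used for training.
hs: if 1, hierarchical softmax is used. If 0 (the default), negative sampling is used.
negative: if > 0, negative sampling is used, and the value sets how many noise words are drawn.
iter: number of training iterations over the corpus, default 5. (gensim 4.0 renamed this parameter to epochs.)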
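Two of these parameters, min_count and window, can be illustrated without gensim. The sketch below is a simplified picture, not gensim's implementation: the corpus and the helper `skipgram_pairs` are hypothetical, and the window is kept fixed, whereas gensim may shrink it at random for each center word.

```python
from collections import Counter

# Hypothetical tokenized corpus (a list of lists of words).
corpus = [["great", "movie", "great", "cast"],
          ["great", "movie", "weak", "plot"]]

# min_count: words seen fewer than min_count times are dropped from the vocabulary.
min_count = 2
counts = Counter(w for sent in corpus for w in sent)
vocab = {w for w, c in counts.items() if c >= min_count}
pruned = [[w for w in sent if w in vocab] for sent in corpus]
print(pruned)  # [['great', 'movie', 'great'], ['great', 'movie']]

def skipgram_pairs(sentence, window):
    """Return (center, context) pairs within `window` positions of each word."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# window: with window=1 each word is paired only with its direct neighbours.
print(skipgram_pairs(["a", "b", "c"], window=1))
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```

With sg=1 (skip-gram) the model learns to predict the context word from the center word in each pair; with sg=0 (CBOW) the context words are combined to predict the center word.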

# Set the parameters for training the word vectors
from gensim.models import Word2Vec

num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
# size= is the gensim 3.x spelling; gensim 4.0 and later expect vector_size=
model = Word2Vec(sentences_list, workers=num_workers, size=num_features, min_count=min_word_count, window=context)
# model.save(os.path.join('..', 'models', model_name))
model.init_sims(replace=True)  # L2-normalize the vectors in place to save memory; deprecated in gensim 4.0
