Diligence is the path up the mountain of books; hard work is the boat across the boundless sea of learning.
1. Data preprocessing
1.1 Data cleaning
Import libraries
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
import nltk
from nltk.corpus import stopwords
nltk.download()  # download the NLTK data; NLTK cannot segment Chinese text, but is otherwise adequate
Read the training data with pandas
df = pd.read_csv(" ",sep="\t",escapechar="\\")  # the file path is left blank in the source
print("Number of reviews: {}".format(len(df)))
df.head()
Number of reviews: 25000
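Preprocessing the review data involves roughly the following steps:
- 1. Remove HTML tags
- 2. Remove punctuation
- 3. Split into words/tokens
- 4. Remove stop words
- 5. Rejoin the tokens into a new sentence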
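The five steps can be sketched end to end on a toy string. This is only an illustration under stated assumptions: it swaps in the standard library's `html.parser` for BeautifulSoup so that it runs without extra packages, and `TOY_STOPWORDS`, `_TextExtractor`, and `preprocess` are hypothetical names that do not come from the original post.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


# Hypothetical, deliberately tiny stop-word set, for illustration only.
TOY_STOPWORDS = {"the", "a", "is", "it", "this", "and"}


def preprocess(raw_html, stop_words=TOY_STOPWORDS):
    # 1. Remove HTML tags (stdlib parser used here instead of BeautifulSoup).
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(extractor.parts)
    # 2. Remove punctuation: replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lowercase and split into tokens.
    tokens = letters_only.lower().split()
    # 4. Drop stop words.
    kept = [t for t in tokens if t not in stop_words]
    # 5. Rejoin the remaining tokens into a new sentence.
    return " ".join(kept)


cleaned = preprocess("This movie is <b>great</b>!<br />Loved it, truly.")
print(cleaned)  # movie great loved truly
```

Replacing non-letters with a space, rather than deleting them, keeps words on either side of a tag or punctuation mark from being glued together.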
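df['review'][1000]
# Example: strip the HTML tags
example = BeautifulSoup(df['review'][1000], 'html.parser').get_text()
example
# Remove punctuation
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)  # replace every non-letter character with a space
example_letters
# Lowercase and split into words
words = example_letters.lower().split()
words
# Remove stop words
# Note: this rebinds the name `stopwords`, shadowing the nltk.corpus import above.
stopwords = {}.fromkeys([line.strip() for line in open(" 停用词表 ")])  # placeholder path: the stop-word list file
words_nostop = [w for w in words if w not in stopwords]
words_nostop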
Wrap the steps above in a function
eng_stopwords = set(stopwords)  # build the English stop-word set (deduplicated)
def clean_text(text):
    text = BeautifulSoup(text, "html.parser").get_text()  # strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)  # replace non-letters with spaces
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]  # drop stop words
    return " ".join(words)
df['review'][1000]
clean_text(df['review'][1000])
1.2 Add the cleaned data to the DataFrame
df['clean_review'] = df.review.apply(clean_text)  # apply the custom function to every review
df.head()
2. Text features: the bag-of-words model
The bag-of-words model represents each document by how often each word occurs in it.
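To make the word-frequency idea concrete, here is a minimal pure-Python sketch on a hypothetical three-document corpus. It only illustrates the idea of count vectors; it is not how `CountVectorizer` is implemented, and the names `docs`, `vocab`, and `to_count_vector` are made up for this example.

```python
from collections import Counter

# Hypothetical toy corpus of already-cleaned reviews.
docs = ["good movie good plot", "bad movie", "good acting bad plot"]

# Vocabulary: every distinct word, sorted, one vector column per word.
vocab = sorted({word for doc in docs for word in doc.split()})


def to_count_vector(doc):
    """Return the number of occurrences of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


bow = [to_count_vector(doc) for doc in docs]

print(vocab)  # ['acting', 'bad', 'good', 'movie', 'plot']
for row in bow:
    print(row)
# [0, 0, 2, 1, 1]
# [0, 1, 0, 1, 0]
# [1, 1, 1, 0, 1]
```

Word order is discarded: each review becomes one row of counts, which is why the resulting matrix is wide and mostly zeros on a real vocabulary.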
Extract bag-of-words features (with sklearn's CountVectorizer)
vectorizer = CountVectorizer(max_features=50000)  # keep only the 50,000 most frequent words so the feature matrix is not overly sparse
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()
train_data_features
Split into training and test sets
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X_train,X_test,y_train, y_test = train_test_split(train_data_features, df.sentiment, test_size = 0.2, random_state=0)
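Plot the confusion matrix
import matplotlib.pyplot as plt
import itertools
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    # Assumed completion: the source listing stops at the tick labels.
    # Write each count into its cell and label the axes.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment='center')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')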
Train the classifier
LR_model = LogisticRegression()
LR_model = LR_model.fit(X_train, y_train)
y_pred = LR_model.predict(X_test)
cnf_matrix = confusion_matrix(y_test, y_pred)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))
print("accuracy metric in the testing dataset: ", (cnf_matrix[1,1]+cnf_matrix[0,0])/cnf_matrix.sum())
# Plot non-normalized confusion matrix
class_names = [0,1]
plt.figure()
plot_confusion_matrix(cnf_matrix
                      , classes=class_names
                      , title='Confusion matrix')
plt.show()
Recall metric in the testing dataset: 0.853181076672
accuracy metric in the testing dataset: 0.8454
Confusion matrix: the larger the values on the main diagonal, the better.
Accuracy: (2135+2092) / (2135+2092+413+360) = 0.8454
Recall: 2092 / (360+2092) ≈ 0.8532
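The arithmetic can be checked with a few lines of plain Python. The sketch below assumes the layout used by sklearn's `confusion_matrix` (rows are true classes, columns are predicted classes) and plugs in the four counts quoted above; the variable names are chosen for this example only.

```python
# Counts quoted above, laid out as rows = true class, columns = predicted class.
cm = [[2135, 413],
      [360, 2092]]

tn, fp = cm[0]  # true negatives, false positives
fn, tp = cm[1]  # false negatives, true positives

# Accuracy: correct predictions (main diagonal) over all predictions.
accuracy = (tn + tp) / (tn + fp + fn + tp)
# Recall: true positives over all actual positives (second row).
recall = tp / (fn + tp)

print(round(accuracy, 4))  # 0.8454
print(round(recall, 4))    # 0.8532
```

Both values agree with the printed metrics, which confirms that accuracy must be divided by the sum of all four cells and not just by the diagonal.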
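3. Text modeling with word2vec
3.1 Preparing the input data for word2vec
word2vec models individual words, not whole sentences.
Tokenize first: word2vec expects its input as a list of lists, in which each inner list holds the individual words of one sentence.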
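A small sketch of the expected shape, using hypothetical sentences and plain `str.split` as a stand-in tokenizer. The names `toy_sentences` and `toy_corpus` are illustrative and not part of the original code.

```python
# Hypothetical cleaned sentences: a flat list of strings.
toy_sentences = ["great movie", "plot was thin", "loved the acting"]

# Target shape: one inner list of word tokens per sentence.
toy_corpus = [sentence.split() for sentence in toy_sentences]
print(toy_corpus)
# [['great', 'movie'], ['plot', 'was', 'thin'], ['loved', 'the', 'acting']]

# Pitfall: iterating over a raw string yields characters, not words,
# so a flat list of strings would be read as sentences made of letters.
print(list(toy_sentences[0])[:5])  # ['g', 'r', 'e', 'a', 't']
```

This is why the sentences have to be tokenized before training: each sentence must already be an iterable of word tokens.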
Clean, then tokenize
import warnings
warnings.filterwarnings("ignore")
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # load NLTK's English sentence tokenizer; for Chinese, use the Jieba tokenizer
def split_sentences(review):
    raw_sentences = tokenizer.tokenize(review.strip())  # split the review into sentences
    sentences = [clean_text(s) for s in raw_sentences if s]  # clean each sentence
    return sentences
# review_part is not defined in this excerpt; it is assumed to be a pandas Series of review texts
sentences = sum(review_part.apply(split_sentences), [])
print('{} reviews -> {} sentences'.format(len(review_part), len(sentences)))
The result is a flat list of sentence strings; it needs to be converted into a list of lists of individual words
sentences_list = []
for line in sentences:
    sentences_list.append(nltk.word_tokenize(line))
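3.2 Training the word2vec model
word2vec parameters
- sentences: the training corpus; can be a list.
- sg: selects the training algorithm. The default, 0, is CBOW; sg=1 uses skip-gram.
- size: the dimensionality of the feature vectors, default 100. A larger size needs more training data but can give better results. Recommended values range from a few dozen to a few hundred. (gensim 4.x renamed this parameter to vector_size.)
- window: the maximum distance between the current word and the predicted word within a sentence, i.e. the size of the sliding window.
- alpha: the learning rate.
- seed: the seed for the random number generator; it affects the initialization of the word vectors.
- min_count: truncates the vocabulary. Words that occur fewer than min_count times are discarded. The default is 5.
- max_vocab_size: limits RAM use while the vocabulary is built. If the number of unique words exceeds this value, the least frequent ones are pruned. Every 10 million unique words need about 1 GB of RAM. Set to None for no limit.
- workers: the number of parallel training threads.
- hs: if 1, hierarchical softmax is used. If 0 (the default), negative sampling is used.
- negative: if > 0, negative sampling is used; the value sets how many noise words are drawn.
- iter: the number of training iterations, default 5. (gensim 4.x renamed this parameter to epochs.)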
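Two of these parameters, min_count and window, are easy to illustrate without any library. The sketch below is a simplification and not gensim's implementation: to the best of my understanding the real trainer also shrinks the window randomly per position and can subsample frequent words, both of which are ignored here. `prune_rare_words` and `skipgram_pairs` are hypothetical helper names.

```python
from collections import Counter


def prune_rare_words(corpus, min_count):
    """Drop every word that occurs fewer than `min_count` times in the corpus."""
    counts = Counter(word for sentence in corpus for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in corpus]


def skipgram_pairs(sentence, window):
    """List (center, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        stop = min(len(sentence), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


toy_corpus = [["good", "movie", "good", "plot"], ["good", "acting", "rare"]]
pruned = prune_rare_words(toy_corpus, min_count=2)
print(pruned)  # [['good', 'good'], ['good']]

pairs = skipgram_pairs(["a", "b", "c", "d"], window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

A larger window produces more pairs per word and captures broader topical context, while a higher min_count removes words that appear too rarely to learn a reliable vector.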
# Set the parameters for word-vector training
num_features = 300 # Word vector dimensionality
min_word_count = 40 # Minimum word count
num_workers = 4 # Number of threads to run in parallel
context = 10 # Context window size
model_name = '{}features_{}minwords_{}context.model'.format(num_features, min_word_count, context)
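from gensim.models import Word2Vec  # this import is missing from the original listing
# `size` is the gensim 3.x keyword; gensim 4.x renamed it to `vector_size`.
model = Word2Vec(sentences_list, workers=num_workers, size=num_features, min_count=min_word_count, window=context)
# model.save(os.path.join('..', 'models', model_name))
model.init_sims(replace=True)  # L2-normalize the vectors in place to save memory; obsolete in gensim 4.x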