NLP Study Notes (1): Data Preprocessing (Keywords: Bag of Words, Simple)

0. Preface

The data used here comes from the Kaggle tutorial Bag of Words Meets Bags of Popcorn. The competition asks you to predict sentiment from movie reviews. Download link: https://www.kaggle.com/c/word2vec-nlp-tutorial/data

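The "bag of words" in the title means that this preprocessing ignores word order and any other semantics, and works purely from the individual words themselves. Think of it as a bag that holds all the vocabulary we need to train the model.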
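To make the "bag" idea concrete, here is a minimal sketch using only the standard library. The helper name bag_of_words and the two sample sentences are made up for illustration and are not part of the tutorial data. It shows that two sentences with opposite meanings end up with the same bag, because word order is thrown away.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each whitespace-separated word occurs, ignoring order."""
    return Counter(text.lower().split())


# Two made-up sentences that use the same words in a different order.
a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

print(a == b)  # True
```

This is the trade-off of the approach: it is simple, but anything that depends on word order is lost.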

For writing the code I recommend Jupyter Notebook, which is nicely interactive. (I use the Python kernel here; Jupyter itself also works with other languages such as R.)

We start by loading the training data and taking a first look at it.

import pandas as pd

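train_df = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
# header=0 means the first line of the file holds the column names, delimiter='\t' means the fields are tab-separated, and quoting=3 (csv.QUOTE_NONE) tells the parser to treat double quotes as ordinary characters; otherwise reading the file may raise an error.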
train_df.shape
(25000, 3)
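The effect of quoting=3 can be seen on a tiny sample. The sketch below uses the standard-library csv module rather than pandas, and the sample string is made up to mimic the layout of the data file. The value 3 is the constant csv.QUOTE_NONE. With it, double quotes are kept as literal characters instead of being interpreted as field delimiters.

```python
import csv
import io

# A made-up two-line sample in the same tab-separated layout as the data file.
sample = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff"\n'


def read_rows(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


print(csv.QUOTE_NONE)  # 3

kept = read_rows(sample, csv.QUOTE_NONE)         # quotes stay in the fields
stripped = read_rows(sample, csv.QUOTE_MINIMAL)  # default: quotes are consumed

print(kept[1])      # ['"5814_8"', '1', '"With all this stuff"']
print(stripped[1])  # ['5814_8', '1', 'With all this stuff']
```

This also explains why the id and review values shown by train_df.head() still carry their surrounding double quotes.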

 

train_df.head()

 

 

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...
3  "3630_4"  0          "It must be assumed that those who praised thi...
4  "9495_8"  1          "Superbly trashy and wondrously unpretentious ...

1. Handling HTML tags

Sample text from the data:

(In the sample below, the parts that need handling are the <br /> tags.)

print(train_df['review'][0])
 

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci\'s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ\'s music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ\'s bestest buddy in this movie is a girl! 
Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i\'ve gave this subject....hmmm well i don\'t know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."'

HTML tags such as <br /><br /> can be cleaned up with BeautifulSoup.

from bs4 import BeautifulSoup

example1 = BeautifulSoup(train_df['review'][0], 'lxml')
print(example1.get_text())
 

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

As we can see, the <br /><br /> tags have been removed.
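One detail in the output above: the text on either side of a removed tag is joined with no space in between (for example "m'kay.Visually"). Here the next step turns the period into a space, so nothing is lost, but two words separated only by a tag would be merged. BeautifulSoup's get_text() accepts a separator argument for this. The sketch below is not the BeautifulSoup call used above. It is a standard-library stand-in, with a hypothetical class TagStripper and helper strip_tags, that shows the idea of putting a space wherever a tag was.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content and record a space wherever a tag appeared."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Remove tags from html and collapse runs of whitespace to one space."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text the parser is still buffering
    return " ".join("".join(parser.parts).split())


print(strip_tags("is a classic<br /><br />Visually impressive"))
# is a classic Visually impressive
```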

2. Handling punctuation and digits

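We handle these with a regular expression. Square brackets [] define a character class, and a leading ^ inside them means "not". The re.sub() call below finds every character that is not in a-z or A-Z and replaces it with a space, so punctuation and digits in the text all become spaces.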

import re
letters_only = re.sub('[^a-zA-Z]', # The pattern to search for
                      ' ',  # The replacement string
                      example1.get_text()) # The text to search

letters_only
 

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  
with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  '
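The same substitution on a short made-up string makes the behaviour easier to see. Each non-letter character is replaced one for one, so the length of the text does not change, contractions are split in two, and digits simply vanish into spaces.

```python
import re

sample = "i've watched 20 films!"  # made-up example string
cleaned = re.sub("[^a-zA-Z]", " ", sample)

print(repr(cleaned))
print(len(cleaned) == len(sample))  # True
print(cleaned.split())              # ['i', 've', 'watched', 'films']
```

This is why the output above contains fragments such as "i ve" and "MJ s", and a gap of several spaces where "20" used to be.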

3. Lowercasing and tokenization

We first convert everything to lowercase, then split the text into words.

lower_case = letters_only.lower() # Convert to lower case

lower_case
 

 ' with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    minutes or so excluding the smooth criminal sequence and joe pesci is convincing as a psychopathic all powerful drug lord  why he wants mj dead so bad is beyond me  because mj overheard his plans  nah  joe pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates mj s music lots of cool things in this like mj turning into a car and a robot and the whole speed demon sequence  also  the director must have had the patience of a saint when it came to filming the kiddy bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene bottom line  this movie is for people who like mj on one level or another  which i think is most people   if not  then stay away  it does try and give off a wholesome message and ironically mj s bestest buddy in this movie is a girl  michael jackson is truly one of the most talented people ever to grace this planet but is he guilty  well  
with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  he is either an extremely nice but stupid guy or one of the most sickest liars  i hope he is not the latter  '

words = lower_case.split()

words
['with',
 'all',
 'this',
 'stuff',
 'going',
 'down',
 'at',
 'the',
 'moment',
 'with',
 'mj',
 'i',

...
 'hope',
 'he',
 'is',
 'not',
 'the',
 'latter']

Note that split() returns a list.
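It is worth noting why split() is called with no argument. After the previous step the text contains runs of several spaces, and a bare split() treats any run of whitespace as one separator and drops leading and trailing whitespace. A small sketch with a made-up string shows the difference from splitting on a single space:

```python
text = " with all  this stuff "  # made-up string with extra spaces

default_split = text.split()      # any run of whitespace is one separator
explicit_split = text.split(" ")  # every single space is a separator

print(default_split)   # ['with', 'all', 'this', 'stuff']
print(explicit_split)  # ['', 'with', 'all', '', 'this', 'stuff', '']
```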

4. Handling stop words

Stop words are words that occur very frequently in text but carry little meaning of their own, such as 'a', 'and', 'is' and 'the'. We use NLTK to handle them.

from nltk.corpus import stopwords

stopwords.words('english')[:20] # print the first 20 stop words

Note: you may hit a  Resource 'corpora/stopwords' not found.  error here.

If so, run import nltk followed by nltk.download('stopwords'), and the error will not occur again.

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his']

Now pick the non-stop words out of words, the variable we just produced with split().

words_non_stop = [w for w in words if w not in stopwords.words('english') ]

words_non_stop
['stuff',
 'going',
 'moment',
 'mj',
 'started',
 'listening',
 'music',
 'watching',
 'odd',
...
 'stupid',
 'guy',
 'one',
 'sickest',
 'liars',
 'hope',
 'latter']

5. Summary

All of the steps above can be combined into a single function:

import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords


def review2words(raw_review):
    '''Convert a raw review to a string of words.

    The input is a single string (a raw movie review), and
    the output is a single string (a preprocessed movie review).'''

    # 1. Strip HTML tags (BeautifulSoup)
    review_text = BeautifulSoup(raw_review, "lxml").get_text()

    # 2. Replace punctuation, digits and other non-letters with spaces (regular expression)
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)

    # 3. Lowercase the text and split it into words
    words = letters_only.lower().split()

    # 4. Remove stop words
    stop_words = set(stopwords.words("english"))
    words_non_stop = [w for w in words if w not in stop_words]

    # 5. Join the remaining words into one space-separated string and return it
    return " ".join(words_non_stop)

Here stop_words is converted to a set because membership tests on a set are faster than on a list.
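The reason is that `w not in some_list` scans the list element by element, while a set uses a hash lookup that takes roughly constant time on average. A rough way to see the difference is sketched below. The vocabulary and word list are made-up stand-ins (the real list comes from stopwords.words("english")), and the timings will vary from machine to machine, so no particular numbers are claimed.

```python
import timeit

# Hypothetical stand-in vocabulary, not the real NLTK stop word list.
stop_list = [f"word{i}" for i in range(200)]
stop_set = set(stop_list)

# Made-up input: two words to keep and one "stop word" near the end of the list.
words = ["stuff", "word199", "moment"] * 1000


def filter_with(container):
    """Keep only the words that are not in the given container."""
    return [w for w in words if w not in container]


list_time = timeit.timeit(lambda: filter_with(stop_list), number=5)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=5)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both containers give the same filtered result; only the lookup cost differs.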

Good. Next, let's run it over the whole dataset.

from tqdm import tqdm

clean_reviews = []

for review in tqdm(train_df['review']):
    clean_reviews.append(review2words(review))
100%|██████████████████████████████████████████████████████████████████████████| 25000/25000 [00:17<00:00, 1389.74it/s]

tqdm is a very handy progress bar tool that lets us see in real time how far the processing has got.

clean_reviews
['stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter',
 'classic war worlds timothy hines entertaining film obviously goes great effort lengths faithfully recreate h g wells classic book mr hines succeeds watched film appreciated fact standard predictable hollywood fare comes every year e g spielberg version tom cruise slightest resemblance book obviously everyone looks different things movie envision amateur critics look criticize everything others rate movie important bases like entertained people never agree critics enjoyed effort mr hines put faithful h g wells classic novel found entertaining made easy overlook critics perceive shortcomings',
...
 'move tv last night guess time filler sucked bad movie excuse show tits ass start somewhere half way bad tits ass though story ridiculous words wolf call hardly shown fully save teeth fully view clearly see interns working cgi wolf runs like running treadmill cgi fur looks like waxed shiny movie full gore blood easily spot going get killed slashed eaten next even like kind splatter movies disappointed good job even get started actors corny lines girls scream everything every seconds someone asked bad acting give bucks hey sign overall boring laughable horror',
 ...]

With that, our preprocessing is done for now.
