Sentiment analysis in practice: how to apply word2vec to sentiment analysis on the IMDB movie review dataset...

Latest recommended article published on 2022-12-30 08:50:01

黄明轩


Removing HTML tags: the BeautifulSoup package

First, we use BeautifulSoup to strip the HTML tags. Install it as follows:

[mw_shl_code=shell,true]pip install beautifulsoup4[/mw_shl_code]


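Once a review has been passed through BeautifulSoup, the HTML tags are clearly gone. Some readers may point out that a regular expression can achieve the same effect. That is true, but HTML has so many tag variants that handling them all with regular expressions quickly becomes tedious.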
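To make the tag-removal step concrete, here is a minimal sketch. It deliberately uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages; the class name `TextExtractor`, the helper `strip_tags`, and the sample string are all made up for illustration. With BeautifulSoup installed, calling `get_text()` on the parsed review plays the same role.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)


def strip_tags(raw_html):
    """Return the text content of an HTML fragment with all tags dropped."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return "".join(parser.chunks)


# Made-up sample written in the style of an IMDB review
sample = "A great film.<br /><br />I <i>loved</i> it!"
print(strip_tags(sample))
```

Note that the text on either side of a removed tag is joined directly ("film.I"); the later step that replaces non-letters with spaces separates such words again.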

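Removing punctuation, numbers, and stop words: NLTK and regular expressions

Before cleaning the text, we should think about the problem we are trying to solve, because different tasks call for different cleaning. For many tasks, removing punctuation makes sense. In this sentiment-analysis task, however, punctuation such as "!!!" or ":(" may well carry sentiment, so such marks arguably deserve to be treated as words in their own right. To keep things simple, this walkthrough just removes punctuation, but if you want to improve the solution further, this is one angle to try.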
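If you do want to keep emotionally loaded punctuation, one possible approach is sketched below. It is not what this walkthrough does: the pattern, the function name `tokenize_keep_emotion`, and the sample sentence are assumptions chosen for illustration. The idea is to treat runs of `!` or `?` and a few simple emoticons as tokens alongside ordinary words.

```python
import re

# Hypothetical pattern: simple emoticons, runs of ! or ?, or alphabetic words.
# The emoticon alternative comes first so that ":(" is kept as one token.
TOKEN_PATTERN = re.compile(r"[:;]\s?[()]|[!?]+|[a-zA-Z]+")


def tokenize_keep_emotion(text):
    """Lower-case the text and keep emotive punctuation as separate tokens."""
    tokens = TOKEN_PATTERN.findall(text)
    # Drop the optional space inside emoticons so ": (" and ":(" are the same token
    return [t.replace(" ", "").lower() for t in tokens]


print(tokenize_keep_emotion("Worst movie ever!!! : ( Would not watch again"))
```

A pattern this simple will also match a colon followed by an opening parenthesis in ordinary prose, so it would need refinement before being used on real reviews.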

Here we use Python's built-in regular-expression module, re, to remove punctuation and numbers.

[mw_shl_code=python,true]import re

# example1 is assumed to be the BeautifulSoup object for the first review
# Use a regular expression to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",          # The pattern to search for
                      " ",                  # The string to replace it with
                      example1.get_text())  # The text to search
print(letters_only)[/mw_shl_code]


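In the pattern used above:

[mw_shl_code=text,true][^a-zA-Z] : [ ] denotes a character class (set membership), and a leading ^ negates it, so the pattern matches any character that is not a letter[/mw_shl_code]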
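The effect of this pattern can be checked on a short string. The sample text here is made up and is not taken from the dataset:

```python
import re

sample = "It's 2 hours long... 10/10!"  # made-up text, not from the dataset

# Every character that is NOT an ASCII letter is replaced by a single space,
# so the result has the same length as the input
letters_only = re.sub("[^a-zA-Z]", " ", sample)
print(repr(letters_only))

# split() collapses the resulting runs of spaces
print(letters_only.split())
```

Note that the apostrophe is replaced as well, so the contraction "It's" ends up as the two tokens "It" and "s".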

Next, convert all letters to lower case and split the text into words:

[mw_shl_code=python,true]lower_case = letters_only.lower()  # Convert to lower case
words = lower_case.split()         # Split into words[/mw_shl_code]


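Finally, we should consider removing words that occur frequently but carry little meaning. These are called stop words, for example "a", "and", and "the". Fortunately, the Python package Natural Language Toolkit (NLTK) ships with a list of common stop words. Install it as follows: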

[mw_shl_code=shell,true]pip install -U nltk[/mw_shl_code]

After installation, import the package and download the stop-word corpus.

[mw_shl_code=python,true]import nltk

nltk.download("stopwords")  # Download only the stop-word corpus[/mw_shl_code]

Next, display the stop words:

[mw_shl_code=python,true]from nltk.corpus import stopwords  # Import the stop word list

print(stopwords.words("english"))[/mw_shl_code]


Remove the stop words:

[mw_shl_code=python,true]# Remove stop words from "words"
words = [w for w in words if w not in stopwords.words("english")]
print(words)[/mw_shl_code]


So far we have cleaned and preprocessed only the first review. Next we need to process the whole dataset.

We define a function dedicated to this task.

[mw_shl_code=python,true]import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML (naming the parser explicitly avoids a bs4 warning)
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    #    and return the result.
    return " ".join(meaningful_words)[/mw_shl_code]

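Two lines in this function deserve attention:

stops = set(stopwords.words("english")) The stop words are stored in a set rather than a list, because a membership test on a set is much faster than one on a list in Python.

return " ".join(meaningful_words) This line joins the words back into a single string, which makes the result easier to feed into the bag-of-words model in the next step.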
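The set-versus-list point can be checked with a rough timing sketch. The stop-word list here is synthetic (it is not the NLTK data), the helper name `filter_with` is made up, and the exact timings will vary by machine; the only claim is that both containers give the same filtered result while the set lookup is typically much quicker.

```python
import timeit

# Synthetic stand-in for a stop-word list (hypothetical; not the NLTK data)
stop_list = ["stop{}".format(i) for i in range(200)]
stop_set = set(stop_list)

# 1000 tokens to filter: half are "stop words", half are not
words = ["movie", "stop199", "plot", "stop0"] * 250


def filter_with(container):
    # Same comprehension as in review_to_words, parameterised by container
    return [w for w in words if w not in container]


list_time = timeit.timeit(lambda: filter_with(stop_list), number=20)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=20)

print("list: {:.4f}s  set: {:.4f}s".format(list_time, set_time))
```

A list membership test scans the elements one by one, whereas a set uses hashing, so the gap widens as the stop-word list grows.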

Next, loop over the reviews and clean and preprocess each one:

[mw_shl_code=python,true]# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Loop over each review; create an index i that goes from 0 to the length
# of the movie review list
for i in range(0, num_reviews):
    # Call our function for each one, and add the result to the list of
    # clean reviews
    clean_train_reviews.append(review_to_words(train["review"][i]))[/mw_shl_code]
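On a large dataset this loop can run for a while, so a progress message may help. The following is a minimal sketch that uses a plain Python list and a simplified stand-in cleaner, `simple_clean` (both hypothetical), so that it runs without pandas or NLTK. It also iterates over the values directly instead of indexing by position.

```python
import re


def simple_clean(raw_review):
    # Simplified stand-in for review_to_words: letters only, lower-cased
    return " ".join(re.sub("[^a-zA-Z]", " ", raw_review).lower().split())


# Hypothetical miniature of the train["review"] column
reviews = ["Great film!", "Terrible... 1/10", "Not bad, not great."]

clean_reviews = []
for i, raw in enumerate(reviews):
    # Report progress every 2 reviews here; pick a larger step for real data
    if (i + 1) % 2 == 0:
        print("Review {} of {}".format(i + 1, len(reviews)))
    clean_reviews.append(simple_clean(raw))

print(clean_reviews)
```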
