1. # Extract the text of the review column from the raw data and show it with the display function
display(df["review"], "Raw data")
Output:
0 With all this stuff going down at the moment w...
1 "The Classic War of the Worlds" by Timothy Hin...
2 The film starts with a manager (Nicholas Bell)...
3 It must be assumed that those who praised this...
4 A friend of mine bought this film for £1, and ...
5 <br /><br />This movie is full of references. ...
--------------------------------------------
display(df["review"][1], "Raw data")
Output:
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that
obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic book.
Mr. Hines succeeds in doing so...
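The two-argument display(data, title) used above is not defined in this section, so it is presumably a small helper of the author's. A minimal sketch of what such a helper might look like, assuming it only prints a title followed by the data (the names format_display and display below are illustrative, not the original implementation):

```python
def format_display(data, title):
    """Return the title line followed by the data as one string."""
    return f"===== {title} =====\n{data}"


def display(data, title):
    """Print a labelled view of data (hypothetical stand-in, not the original helper)."""
    print(format_display(data, title))


display("sample review text", "Raw data")
```

Note that a helper with this name shadows IPython's built-in display when run in a notebook.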
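2. # Use BeautifulSoup to strip the HTML tags from the review obtained above
from bs4 import BeautifulSoup  # the "lxml" parser also requires the lxml package
df_01 = df["review"][1]
df_02 = BeautifulSoup(df_01, "lxml")
[s.extract() for s in df_02('script')]  # drop <script> elements so their code is not kept as text
df_03 = df_02.get_text()
display(df_03, "Data with HTML tags removed")
Output: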
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that obviously
goes to great effort and lengths to faithfully recreate H. G. Wells' classic book.
Mr. Hines succeeds in doing so.
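To make the effect of tag stripping visible on a review that actually contains markup, here is a small sketch that uses only the standard library's html.parser instead of BeautifulSoup. It is a stand-in for illustration, not the method used above; TextExtractor, strip_tags and the sample string are assumptions introduced here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> elements."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Sample modelled on review 5 above, with a script element added for illustration
sample = "<br /><br />This movie is full of references.<script>var x = 1;</script>"
print(strip_tags(sample))
```

This mirrors the two things the BeautifulSoup code does: remove script elements, then keep only the text.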
1.3 Remove punctuation from the data (chained str.replace calls here; a regex is used in 1.5)
df_04 = df_03.replace(",", "").replace(".", "").replace('"', '').replace('\'', '')
df_04
Output:
'The Classic War of the Worlds by Timothy Hines is a very entertaining film that
obviously goes to great effort and lengths to faithfully recreate H G Wells
classic book Mr Hines succeeds in doing so I and those who watched his film with me appreciated the fact that it was not the standard predictable Hollywood...'
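The chained replace calls only remove the four characters they list, so other punctuation such as ; or ! would survive. A regex-based sketch of the same step, assuming that "punctuation" means anything that is neither a word character nor whitespace (the sample sentence is made up for illustration):

```python
import re

# Made-up sample containing punctuation that the chained replace does not cover
sample = 'Mr. Hines succeeds in doing so, "faithfully"; H. G. Wells\' book!'

# Chained replace: removes only , . " and '
chained = sample.replace(",", "").replace(".", "").replace('"', "").replace("'", "")

# Regex: removes every character that is not a word character or whitespace
no_punct = re.sub(r"[^\w\s]", "", sample)

print(chained)
print(no_punct)
```

The regex version handles all punctuation in one pass, at the cost of also dropping characters such as currency symbols.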
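# Remove the English stop words from the previous step's data
"""
first = [1,2,3,4,5,6]
second = {}.fromkeys([4,5])
[w for w in first if w not in second]  # -> [1, 2, 3, 6]
"""
# Load the English stop words
stopwords = {}.fromkeys([line.rstrip() for line in open('nlp/stopwords.txt')])
# Use the loaded stop words to remove the English stop words from the previous step's data
# (str_03 is assumed to hold the words of the cleaned text, e.g. df_04.lower().split())
words_nostop = [w for w in str_03 if w not in stopwords]
display(words_nostop, 'Data with stop words removed')
# Use set() to make sure the loaded English stop words contain no duplicates
eng_stopwords = set(stopwords)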
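The filtering idiom above can be tried without the nlp/stopwords.txt file. The sketch below uses a tiny hand-written stop-word list (an assumption, including its deliberate duplicate) and a lowercased fragment of the review text:

```python
# Hypothetical stop-word list; "the" appears twice on purpose
stop_list = ["the", "is", "a", "of", "by", "the"]

# set() removes the duplicate and gives fast membership tests
eng_stopwords = set(stop_list)

# The text must be split into words first; iterating over a raw string yields characters
words = "the classic war of the worlds by timothy hines is a very entertaining film".split()
words_nostop = [w for w in words if w not in eng_stopwords]

print(len(eng_stopwords))
print(words_nostop)
```

Membership tests on a set (or on dict keys, as with {}.fromkeys) take constant time on average, which matters when every word of every review is checked.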
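1.5 Define a function that performs the text processing of (1.2-1.4)
import re

def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace every non-letter character (quotes, periods, commas, digits) with a space
    words = text.lower().split()  # lowercase the text and split it into words
    words = [w for w in words if w not in eng_stopwords]  # remove stop words
    return ' '.join(words)  # join the words with single spaces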
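A self-contained sketch of how such a function might be applied to several reviews. To stay dependency-free it swaps BeautifulSoup for a rough regex tag remover and uses a tiny hypothetical stop-word set, so clean_text_sketch and SAMPLE_STOPWORDS are illustrative names rather than the function defined above:

```python
import re

# Hypothetical stand-in for the stop words loaded from nlp/stopwords.txt
SAMPLE_STOPWORDS = {"this", "is", "of", "a", "the"}


def clean_text_sketch(text, stopwords=SAMPLE_STOPWORDS):
    """Strip tags, keep letters only, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)    # rough stand-in for BeautifulSoup(...).get_text()
    text = re.sub(r"[^a-zA-Z]", " ", text)  # non-letters become spaces
    words = text.lower().split()
    return " ".join(w for w in words if w not in stopwords)


# Two reviews modelled on the truncated output shown earlier
reviews = [
    "<br /><br />This movie is full of references.",
    "A friend of mine bought this film for £1",
]
cleaned = [clean_text_sketch(r) for r in reviews]
print(cleaned)
```

With a pandas DataFrame, the same function could be applied to the whole column via df["review"].apply(...), assuming df is loaded as in step 1.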