Scraping the Short Reviews of Chinese Doctors (中国医生)
This morning a junior classmate messaged me asking whether Chinese Doctors is worth watching. Someone like me hardly ever watches serious movies, but since she asked, I had to give her some kind of answer. So I skimmed the short reviews and found opinions all over the place. Not wanting to fob her off with a sloppy answer, I decided to scrape the reviews and turn them into a word cloud for analysis.
Examining the URL
Here is the URL to scrape:
https://movie.douban.com/subject/35087699/comments?start=0&limit=20&status=P&sort=new_score
Looking at a few pages, the first page has start=0, the second has start=20, and so on, increasing by 20 each time.
That makes things easy.
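As a quick sanity check (just an illustration; the number of pages here is arbitrary), the start parameter for each page can be generated like this:

for page in range(3):  # first 3 pages as an example
    start = page * 20  # page 1 -> 0, page 2 -> 20, page 3 -> 40
    print('https://movie.douban.com/subject/35087699/comments?start={}&limit=20&status=P&sort=new_score'.format(start))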
Here is the complete script:
import requests
from lxml import etree
import jieba
import wordcloud
import itertools

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
}
pl_list = []      # holds the raw reviews
cut_pllist = []   # holds the reviews after jieba word segmentation

for i in range(0, 20 * 20, 20):  # only scrape the first 20 pages for now
    url = 'https://movie.douban.com/subject/35087699/comments?start={}&limit=20&status=P&sort=new_score'.format(i)
    r = requests.get(url=url, headers=headers).text
    tree = etree.HTML(r)
    pl = tree.xpath('//span[@class="short"]/text()')  # each short review sits in a <span class="short">
    pl_list.append(pl)
pl_list = list(itertools.chain.from_iterable(pl_list))  # one-liner to flatten the list of lists

# word segmentation
for j in pl_list:
    a = jieba.lcut(j)
    cut_pllist.append(a)
cut_pllist = list(itertools.chain.from_iterable(cut_pllist))
pl_text = ' '.join(cut_pllist)

# build the word cloud
# msyh.ttc (Microsoft YaHei) is needed so Chinese characters render; adjust the path if the font file is elsewhere
pl1 = wordcloud.WordCloud(font_path="msyh.ttc", width=1000, height=700, max_words=50)
pl1.generate(pl_text)
pl1.to_file('5.png')
Here is the image it produced:
One look at it and I saw far too many irrelevant words. Time to optimize a bit.
So I removed these useless words from the list:
def delet(alist, str1):
    # iterate over a copy: removing from a list while iterating over it directly skips elements
    for i in alist[:]:
        if i == str1:
            alist.remove(i)
    return alist
cut_pllist = delet(cut_pllist,'的')
cut_pllist = delet(cut_pllist,'了')
cut_pllist = delet(cut_pllist,'电影')
cut_pllist = delet(cut_pllist,'我')
cut_pllist = delet(cut_pllist,'是')
cut_pllist = delet(cut_pllist,'和')
cut_pllist = delet(cut_pllist,'在')
cut_pllist = delet(cut_pllist,'我们')
cut_pllist = delet(cut_pllist,'很')
cut_pllist = delet(cut_pllist,'都')
cut_pllist = delet(cut_pllist,'人')
cut_pllist = delet(cut_pllist,'也')
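Calling delet once for every stop word gets tedious as the list grows. A more compact alternative (just a sketch; the stop-word set below simply collects the words removed above and can be extended at will) filters everything in a single pass:

stopwords = {'的', '了', '电影', '我', '是', '和', '在', '我们', '很', '都', '人', '也'}
cut_pllist = [w for w in cut_pllist if w not in stopwords]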
Running it again:
Much better. Now I had something to report back. After I told my classmate, she sent me her thanks:
If you found this useful, please give it a like.
Follow me for more web scraping tips.