Scraping Douban Short Reviews of Detective Chinatown 3 with Python and Generating a Word Cloud

Scraping the Detective Chinatown 3 short reviews
The URL to scrape:
https://movie.douban.com/subject/27619748/comments?start=20&limit=20&status=P&sort=new_score

url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' % (movie_id, (i - 1) * 20)
Here i is the current page number, starting from 1, so page 1 maps to start=0, page 2 to start=20, and so on.
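The page-to-offset mapping above can be sketched as a small helper. This is only an illustration: the name `build_comment_url` and the `page_size` parameter are made up for this example and are not part of the original script.

```python
# Hypothetical helper: build the short-review URL for a 1-based page number.
BASE_URL = "https://movie.douban.com/subject/%s/comments"


def build_comment_url(movie_id, page, page_size=20):
    """Return the URL of the given page; page 1 maps to start=0."""
    if page < 1:
        raise ValueError("page numbers start from 1")
    start = (page - 1) * page_size
    query = "?start=%d&limit=%d&sort=new_score&status=P" % (start, page_size)
    return (BASE_URL % movie_id) + query


if __name__ == "__main__":
    # Print the URLs of the first three pages.
    for page in (1, 2, 3):
        print(build_comment_url("27619748", page))
```

Rejecting page numbers below 1 guards against the negative `start` value that a 0-based page would produce with this formula.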

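In Chrome, press F12 to open the developer tools, inspect the page source, locate the markup that holds the short reviews, and note which div and which tag contain it.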
Analyzing the page source
The reviews sit under div[id='comments'], inside each div[class='comment-item'], in the first span[class='short']. A regular expression extracts the review text; the code is:

url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
      % (movie_id, (i - 1) * 20)
req = requests.get(url, headers=headers)
req.encoding = 'utf-8'
comments = re.findall('<span class="short">(.*)</span>', req.text)
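The extraction step can be tried offline on a hand-written snippet. The sketch below is an assumption-laden variant, not the original code: `SAMPLE_HTML` only imitates the structure described above (the live page may differ), `extract_short_comments` is a hypothetical name, and the pattern is made non-greedy with `re.DOTALL` so that reviews spanning several lines are still captured.

```python
import html
import re

# Non-greedy match; DOTALL lets "." cross line breaks inside a review.
SHORT_RE = re.compile(r'<span class="short">(.*?)</span>', re.DOTALL)


def extract_short_comments(page_html):
    """Return the text of every <span class="short"> element in page_html."""
    return [html.unescape(m).strip() for m in SHORT_RE.findall(page_html)]


# Hand-written sample that imitates the structure described in the text.
SAMPLE_HTML = """
<div id="comments">
  <div class="comment-item">
    <span class="short">First line
second line</span>
  </div>
  <div class="comment-item">
    <span class="short">Popcorn movie &amp; nothing more</span>
  </div>
</div>
"""

if __name__ == "__main__":
    print(extract_short_comments(SAMPLE_HTML))
```

The pattern relies on the attribute being written exactly as `class="short"`; an HTML parser would be more tolerant of markup changes.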
Segment the text with jieba, which splits Chinese text into words the way a Chinese reader would:
with open(file_name, 'r', encoding='utf8') as f:
    word_list = jieba.cut(f.read())
    result = " ".join(word_list)  # join the tokens with spaces
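Before drawing anything, it can help to check which tokens dominate the space-joined text. The sketch below uses only the standard library; `top_tokens` is a hypothetical helper, and `SAMPLE` is a hand-segmented string standing in for the `" ".join(word_list)` result rather than real jieba output.

```python
from collections import Counter


def top_tokens(segmented_text, n=3, min_length=2):
    """Count space-separated tokens, skipping those shorter than min_length."""
    tokens = [t for t in segmented_text.split() if len(t) >= min_length]
    return Counter(tokens).most_common(n)


# Hand-segmented sample; single-character tokens and punctuation are dropped.
SAMPLE = "电影 很 好看 , 剧情 一般 , 电影 笑点 很多 , 电影 好看"

if __name__ == "__main__":
    print(top_tokens(SAMPLE))
```

Dropping one-character tokens is a crude stand-in for a stop-word list; it removes punctuation and particles that would otherwise dominate the counts.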
Generate the word cloud with stylecloud:
if icon_name is not None and len(icon_name) > 0:
    gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=pic)
else:
    gen_stylecloud(text=result, font_path='simsun.ttc', output_name=pic)
Full code:
# Analyze the Douban reviews of Detective Chinatown 3 and generate word clouds
# https://movie.douban.com/subject/27619748/comments?start=20&limit=20&status=P&sort=new_score
# url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P'
#       % (movie_id, (i - 1) * 20)
import re

import jieba
import requests
from stylecloud import gen_stylecloud

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'
}


def jieba_cloud(file_name, icon):
    with open(file_name, 'r', encoding='utf8') as f:
        word_list = jieba.cut(f.read())
        result = " ".join(word_list)  # join the tokens with spaces
    # Build the Chinese word cloud; an empty icon_name means the default shape
    icon_name = ''
    if icon == "1":
        icon_name = ''
    elif icon == "2":
        icon_name = "fas fa-dragon"
    elif icon == "3":
        icon_name = "fas fa-dog"
    elif icon == "4":
        icon_name = "fas fa-cat"
    elif icon == "5":
        icon_name = "fas fa-dove"
    elif icon == "6":
        icon_name = "fab fa-qq"
    pic = str(icon) + '.png'
    if icon_name is not None and len(icon_name) > 0:
        gen_stylecloud(text=result, icon_name=icon_name, font_path='simsun.ttc', output_name=pic)
    else:
        gen_stylecloud(text=result, font_path='simsun.ttc', output_name=pic)
    return pic


# Scrape the short reviews
def spider_comment(movie_id, page):
    with open("douban.txt", "a+", encoding='utf-8') as f:
        for i in range(1, page + 1):
            url = 'https://movie.douban.com/subject/%s/comments?start=%s&limit=20&sort=new_score&status=P' \
                  % (movie_id, (i - 1) * 20)
            req = requests.get(url, headers=headers)
            req.encoding = 'utf-8'
            comments = re.findall('<span class="short">(.*)</span>', req.text)
            # The trailing newline keeps the reviews of consecutive pages apart
            f.write('\n'.join(comments) + '\n')
            print(comments)


# Main entry point
if __name__ == '__main__':
    movie_id = '27619748'
    page = 10
    spider_comment(movie_id, page)
    jieba_cloud("douban.txt", "1")
    jieba_cloud("douban.txt", "2")
    jieba_cloud("douban.txt", "3")
    jieba_cloud("douban.txt", "4")
    jieba_cloud("douban.txt", "5")
    jieba_cloud("douban.txt", "6")
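The icon selection in `jieba_cloud` could also be written as a dictionary lookup. The sketch below is one possible restructuring, not the author's code: `ICON_BY_CHOICE`, `pick_icon`, and `stylecloud_kwargs` are names invented here, and the function only builds the keyword arguments (`text`, `icon_name`, `font_path`, `output_name`, the same ones the script passes to `gen_stylecloud`) so it runs without stylecloud installed.

```python
# Mapping copied from the if/elif chain; "1" means no icon (default shape).
ICON_BY_CHOICE = {
    "1": "",
    "2": "fas fa-dragon",
    "3": "fas fa-dog",
    "4": "fas fa-cat",
    "5": "fas fa-dove",
    "6": "fab fa-qq",
}


def pick_icon(choice):
    """Return the icon name for a choice, or "" for unknown choices."""
    return ICON_BY_CHOICE.get(str(choice), "")


def stylecloud_kwargs(text, choice, font_path="simsun.ttc"):
    """Build keyword arguments for gen_stylecloud without calling it."""
    kwargs = {
        "text": text,
        "font_path": font_path,
        "output_name": "%s.png" % choice,
    }
    icon_name = pick_icon(choice)
    if icon_name:
        # Only pass icon_name when an icon was actually selected.
        kwargs["icon_name"] = icon_name
    return kwargs


if __name__ == "__main__":
    print(stylecloud_kwargs("电影 好看", "2"))
```

With stylecloud installed, the result could be passed on as `gen_stylecloud(**stylecloud_kwargs(result, "2"))`, which replaces the two near-identical calls with one.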
The generated douban.txt (excerpt):
[Image: excerpt of douban.txt]

The generated word clouds:

[Images: generated word clouds]
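Part of this article's content comes from the internet; please get in touch to have it removed if it infringes your rights.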
