Python 3 Web Scraping in Practice, Part 1: Scraping Douban's Now-Playing Movies and Drawing Word Clouds

HEERY551

Category: Python web crawlers. Tags: python crawler

Article link: https://blog.csdn.net/HEERY551/article/details/79817801

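This article walks through using Python 3 to scrape short reviews of the movies currently showing on Douban Movies and to draw a word cloud from them. It covers the requests, jieba, and wordcloud libraries, including page fetching, data cleaning, word segmentation, stop-word removal, and word cloud rendering.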

Reference: http://python.jobbole.com/88325/

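Tasks:

1. Scrape the id and title of every now-playing movie from the Douban Movies listing page.

2. Open each movie's detail page and extract the popular short reviews shown on its first page.

3. Draw a word cloud from each movie's short reviews.

Python version: 3.5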

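Preparation:

1. Third-party libraries:

requests, jieba, wordcloud, pandas, matplotlib, BeautifulSoup, numpy. The re module is also used, but it ships with the Python standard library and needs no installation. The parsing code in this article passes 'lxml' to BeautifulSoup, which requires the lxml package as well.

Installing wordcloud with pip can fail. If it does, download the matching whl file and install that instead. Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/

2. Other files

1) simhei.ttf font file: search for it online (the original suggests Baidu) and download it.

2) stopwords.txt stop-word file: likewise, search for it online and download it.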
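
Before running the rest of the article, it can help to check which of these libraries are actually importable. The following is a minimal sketch, not part of the original post: the helper name missing_modules and the REQUIRED_MODULES list are assumptions made for illustration. Note that BeautifulSoup is imported under the name bs4.

```python
import importlib.util

# Import names, which can differ from package names:
# the BeautifulSoup package is imported as "bs4".
# This list is an assumption based on the libraries named above.
REQUIRED_MODULES = [
    "requests", "jieba", "wordcloud", "pandas",
    "matplotlib", "bs4", "numpy", "lxml",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Prints an empty list when everything is installed.
print(missing_modules(REQUIRED_MODULES))
```

Any name that appears in the printed list still needs to be installed before continuing.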

I. Scraping the ids and titles of all now-playing movies from the Douban Movies listing page

Step 1: Request the page and get its HTML

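import requests

# Now-playing listing page (Nanjing)
root_url = "https://movie.douban.com/cinema/nowplaying/nanjing/"
# Send a browser-style User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(root_url, headers=headers)
# Decoded body of the response, i.e. the page's HTML
html = response.text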

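The code above retrieves the page's HTML. You can also open https://movie.douban.com/cinema/nowplaying/nanjing/ in a browser and use the Inspect tool to look at the HTML structure. The original post shows a screenshot of that structure here.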
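
The request above has no timeout and does not check the HTTP status, so a blocked or failed request would silently yield an error page as html. A slightly more defensive variant might look like the sketch below. The function name fetch_html, the 10-second default timeout, and the optional session parameter are all assumptions introduced here, not part of the original code.

```python
import requests

# Same browser-style header as in the article.
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/65.0.3325.181 Safari/537.36'
}


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the HTML of `url`, raising an exception on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    .get(url, headers=..., timeout=...) method, e.g. requests.Session().
    """
    http = session if session is not None else requests
    response = http.get(url, headers=headers or DEFAULT_HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the listing page would be fetched as fetch_html(root_url), and a non-200 response stops the script instead of being parsed as if it were the movie list.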

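Step 2: Analyze the HTML and parse out the information we need

We want every now-playing movie. From the HTML structure, all movie entries sit inside the div with id='nowplaying', and each movie is in its own li tag with class='list-item', so we can extract them with BeautifulSoup.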

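from bs4 import BeautifulSoup

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html, 'lxml')
# All now-playing entries live inside <div id="nowplaying">
nowplaying_movie = soup.find_all('div', id='nowplaying')
# One <li class="list-item"> per movie
nowplaying_movie_list = nowplaying_movie[0].find_all('li', class_='list-item')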

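Since we need each movie's id and title, we look through the HTML to find where they are stored inside each entry.

The id can be read directly from each list-item li, while the title is stored in the a tag inside li class='stitle'. So we create an empty list and loop over the entries, appending one dictionary of id and title per movie.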

nowplaying_list = []
for item in nowplaying_movie_list:
    nowplaying_dict = {}
    # The movie id is the id attribute of the list-item <li>
    nowplaying_dict['id'] = item['id']
    # The title is the text of the <a> inside <li class="stitle">
    for tag_img_item in item.find_all('li', class_='stitle'):
        nowplaying_dict['name'] = tag_img_item.a.text.strip()
    nowplaying_list.append(nowplaying_dict)

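Calling print(nowplaying_list) shows the resulting list: one dictionary per now-playing movie, holding its id and title. The original post shows a screenshot of the output here.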
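
To try the parsing logic without hitting the site, the same steps can be run against a small inline document. The sketch below is illustrative only: SAMPLE_HTML is a hypothetical, heavily simplified stand-in for the real page (the second entry is made up), parse_now_playing is a name introduced here, and Python's built-in 'html.parser' is swapped in for 'lxml' so the example has one less dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the now-playing markup.
SAMPLE_HTML = """
<div id="nowplaying">
  <ul class="lists">
    <li id="4920389" class="list-item">
      <ul>
        <li class="stitle"><a href="#"> 头号玩家 </a></li>
      </ul>
    </li>
    <li id="0000001" class="list-item">
      <ul>
        <li class="stitle"><a href="#">Example Movie</a></li>
      </ul>
    </li>
  </ul>
</div>
"""


def parse_now_playing(html, parser="html.parser"):
    """Return a list of {'id': ..., 'name': ...} dicts, one per movie."""
    soup = BeautifulSoup(html, parser)
    container = soup.find("div", id="nowplaying")
    if container is None:
        # Page layout changed or the request was blocked.
        return []
    movies = []
    for item in container.find_all("li", class_="list-item"):
        title_tag = item.find("li", class_="stitle")
        if title_tag is not None and title_tag.a is not None:
            name = title_tag.a.get_text(strip=True)
        else:
            name = None
        movies.append({"id": item["id"], "name": name})
    return movies


print(parse_now_playing(SAMPLE_HTML))
```

Unlike the loop above, this version returns an empty list when the nowplaying div is missing instead of raising an IndexError, which makes a blocked request easier to diagnose.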

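II. Opening each movie's detail page and extracting the popular short reviews on its first page

Step 1: Build each movie's page URL

Opening the first movie (here: 头号玩家, Ready Player One) shows that its URL is https://movie.douban.com/subject/4920389/?from=playing_poster and that other movies' pages follow the same pattern with only the id changed. Substituting a movie's id therefore gives its page URL. The ids can be obtained by looping over the movie list built above. Here we use the first movie as the example.

First, fetch the HTML of the first movie (头号玩家):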

first_movie = nowplaying_list[0]
comment_url = "https://movie.douban.com/subject/"+ first_movie['id']+"/?from=playing_poster"
resp = requests.get(comment_url,headers = headers)
first_movie_html = resp.text
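
The text mentions that the remaining movies can be handled by looping over the list and swapping in each id. A minimal sketch of that URL-building step follows. The names SUBJECT_URL_TEMPLATE, build_subject_url, and build_subject_urls are assumptions introduced for illustration, and the sample list contains only the one id shown in the article.

```python
# URL pattern observed in the article; only the id changes per movie.
SUBJECT_URL_TEMPLATE = "https://movie.douban.com/subject/{movie_id}/?from=playing_poster"


def build_subject_url(movie_id):
    """Return the detail-page URL for a single movie id."""
    return SUBJECT_URL_TEMPLATE.format(movie_id=movie_id)


def build_subject_urls(movies):
    """Map each movie title to its detail-page URL.

    `movies` has the same shape as nowplaying_list:
    a list of {'id': ..., 'name': ...} dictionaries.
    """
    return {movie["name"]: build_subject_url(movie["id"]) for movie in movies}


sample_movies = [{"id": "4920389", "name": "头号玩家"}]
print(build_subject_urls(sample_movies))
```

Passing nowplaying_list instead of sample_movies would give one URL per now-playing movie, each of which can be fetched the same way as the first.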

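Step 2: Parse the page and collect all the reviews

Inspect the HTML structure of the review section. The original post shows a screenshot of it here.

The page loads only the popular short reviews that it displays. Because there are so many reviews, we take only the short reviews on the first page here (the complete code also extracts the first ten pages of short reviews).

Each review sits in a p tag under a div with class=comment. The parsing code is as follows: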

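soup1 = BeautifulSoup(first_movie_html, 'lxml')
comment_div_list = soup1.find_all('div', class_='comment')

first_movie_comment_list = []
for item in comment_div_list:
    # Look up the first <p> once; skip blocks that have none
    comment_p = item.find('p')
    # .string is None when the <p> is not a single piece of text,
    # e.g. when it has several child nodes
    if comment_p is not None and comment_p.string is not None:
        first_movie_comment_list.append(comment_p.string)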

Calling print(first_movie_comment_list) then shows the reviews collected for the first movie. The original post shows a screenshot of the output here.
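
One caveat of the .string check above is that it skips any p tag whose text is mixed with other nodes, because .string is None in that case. If such reviews should be kept, get_text() is one alternative. The sketch below is a variation, not the author's method: SAMPLE_COMMENT_HTML is hypothetical markup, extract_comments is a name introduced here, and 'html.parser' is used in place of 'lxml' so the example runs without the lxml package.

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering three cases: plain text in <p>,
# text wrapped in a child tag with surrounding whitespace, and no <p>.
SAMPLE_COMMENT_HTML = """
<div class="comment"><p>A plain-text review.</p></div>
<div class="comment"><p>
  <span class="short">A review wrapped in a span.</span>
</p></div>
<div class="comment"><h3>No paragraph here</h3></div>
"""


def extract_comments(html, parser="html.parser"):
    """Return the stripped text of the first <p> in each comment div."""
    soup = BeautifulSoup(html, parser)
    comments = []
    for block in soup.find_all("div", class_="comment"):
        paragraph = block.find("p")
        if paragraph is None:
            continue
        # get_text() collects text from nested tags as well.
        text = paragraph.get_text(strip=True)
        if text:
            comments.append(text)
    return comments


print(extract_comments(SAMPLE_COMMENT_HTML))
```

The second sample review would be dropped by the .string check, since its p tag holds whitespace plus a span, but it is kept here. Which behavior is preferable depends on what the real page markup looks like at the time of scraping.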