Scraping Movie Box-Office Data with Selenium WebDriver and Plotting a Simple Analysis

中意灬

Article link: https://blog.csdn.net/qq_55977554/article/details/122544581

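1. Introduction to Selenium

First, let's look at what Selenium is. Selenium is an automated testing tool that operates a browser the way a person would. For dynamic pages, or pages whose content is encrypted or obfuscated, Selenium WebDriver can extract the content reasonably well.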

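2. Pros and cons of Selenium WebDriver

Pros

It is easy to learn; the scraping workflow resembles that of other popular tools such as BeautifulSoup.
Unlike BeautifulSoup and similar scraping libraries, Selenium WebDriver drives a real web browser, so it sees everything we see. This is very useful for modern, JavaScript-heavy sites whose content cannot be captured without a browser.
Selenium WebDriver supports human-like interactions, such as clicking buttons.

Cons

It is slow: scraping can only begin after the browser has finished loading the page.
It parallelizes less well than other libraries, because each page needs a real browser instance, which quickly uses up memory.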

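3. Basic Selenium WebDriver operations

from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

web = Chrome()  # create the browser object
web.get('URL')  # open a URL
web.find_element(By.XPATH, 'path')  # find the first matching element
web.find_elements(By.XPATH, 'path')  # find all matching elements (returns a list)
# Locate an element by XPath and click it (other locators such as By.TAG_NAME and By.CLASS_NAME work the same way)
web.find_element(By.XPATH, 'path').click()
# Locate an element by XPath, type some text, then press Enter
web.find_element(By.XPATH, 'path').send_keys('text', Keys.ENTER)
web.close()  # close the current page
web.switch_to.window(web.window_handles[-1])  # switch to the newly opened page
web.switch_to.window(web.window_handles[0])  # switch back to the first page
# Options for running without showing the browser window (pass them with Chrome(options=opt))
opt = Options()
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')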

Those are the most commonly used functions; you can explore the rest on your own.
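Dynamic pages often fill in their content some time after the initial load, so a script usually has to wait before reading elements. The sketch below is a plain-Python polling helper, not part of Selenium; the name `wait_until` and its parameters are assumptions made for illustration. Selenium also ships its own `WebDriverWait` class for this purpose, which is usually the better choice in a real script.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    `wait_until` is a hypothetical helper, not a Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a driver named `web`, a call such as `wait_until(lambda: web.find_elements(By.XPATH, 'path'))` would return as soon as at least one element matches, since `find_elements` returns an empty (falsy) list when nothing is found. That avoids always sleeping for a fixed time.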

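4. Preparation

I used Python 3.10. You also need to download a browser driver yourself and place it in the folder that contains the Python interpreter. I use Chrome and downloaded the driver from https://npm.taobao.org/mirrors/chromedriver (pick the driver version that matches your installed browser version).
The packages used are selenium (for scraping), pandas and csv (for storing and reading data), pyecharts (for plotting), and time (for pausing while the browser loads). csv and time ship with Python; install any of the others you are missing.
The page being scraped is https://www.endata.com.cn/BoxOffice/BO/Year/index.html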
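Before running the scraper, it can help to confirm that the third-party packages listed above are importable. This is a minimal sketch using only the standard library; the function name `missing_packages` and the `REQUIRED` list are assumptions for illustration, and the list uses import names, which here match the pip package names.

```python
import importlib.util

# Third-party packages used in this article (import names).
REQUIRED = ["selenium", "pandas", "pyecharts"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Install these first: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```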

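5. Steps

First, analyze the page. It has a year drop-down list (id OptionDate) and a results table (id TableList); these are the elements the script works with.

Then we can use Selenium WebDriver to imitate a person's actions and fetch the data.

Import the modules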

import time
from pyecharts.charts import Bar,Grid
from pyecharts import options as opts
import pandas as pd
import csv
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import Chrome  # import the class for whichever browser you use
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import  Keys
from selenium.webdriver.support.select import Select

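Fetch and save the data

import os

lis = []  # paths of the CSV files written, used later for plotting
os.makedirs('艺恩电影票房', exist_ok=True)  # make sure the output folder exists
# Configure Chrome to run without showing a window
opt = Options()
opt.add_argument('--headless')
opt.add_argument('--disable-gpu')
# 1. Create the browser object
web = Chrome(options=opt)
# 2. Open the target URL
web.get('https://www.endata.com.cn/BoxOffice/BO/Year/index.html')
options = web.find_element(By.XPATH, '//*[@id="OptionDate"]')  # locate the drop-down list
sel = Select(options)  # wrap the element as a drop-down list
for i in range(len(sel.options)):  # index of each drop-down option
    year = 2022 - i  # assumes the first option is 2022 and the years descend
    path = f'艺恩电影票房/{year}艺恩电影票房数据.csv'
    lis.append(path)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        sel.select_by_index(i)  # switch year by index
        time.sleep(2)  # give the table time to reload
        trs = web.find_elements(By.XPATH, '//*[@id="TableList"]/table/tbody/tr')
        writer.writerow([1, 2])  # header row: the columns are named '1' and '2'
        for tr in trs:
            # 'td.movie-name' and 'td:nth-child(4)' are CSS selectors, so use By.CSS_SELECTOR
            title = tr.find_element(By.CSS_SELECTOR, 'td.movie-name').text  # film title
            num = tr.find_element(By.CSS_SELECTOR, 'td:nth-child(4)').text.replace(',', '')  # box office
            num = int(num)
            writer.writerow([title, num])
web.quit()  # close the browser when done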
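The loop above mixes browser calls with number parsing and file writing, which makes it hard to check without a live page. The sketch below separates the last two steps so they can be tried on plain strings. The names `parse_box_office` and `save_rows` are hypothetical, and the sketch assumes each box-office cell is a whole number with optional thousands separators, as the `int()` call in the script implies; rows that do not parse are skipped rather than raising.

```python
import csv
import os


def parse_box_office(text):
    """Convert a cell such as '1,234' to int; return None if it is not a whole number."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdecimal() else None


def save_rows(path, rows):
    """Write (title, box_office_text) pairs to `path`, creating its folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([1, 2])  # same placeholder header as the script
        for title, raw in rows:
            value = parse_box_office(raw)
            if value is not None:  # skip rows whose figure could not be parsed
                writer.writerow([title, value])
```

Inside the scraping loop, the pairs passed to `save_rows` would come from the `.text` of the two table cells.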

Plotting

def draw(path):
    title = path[-16:-12]  # the four-digit year in the file name
    # Load the data
    data_raw = pd.read_csv(path, encoding='utf-8', index_col=0, header=0)
    grid = Grid()
    # Bar chart
    bar = (
        Bar(init_opts=opts.InitOpts(width="1350px"))
        .add_xaxis(list(data_raw.index))  # x-axis data: film titles
        .add_yaxis(f'{title}票房', data_raw['2'].tolist())  # y-axis data: box office, as plain Python ints
        .set_global_opts(xaxis_opts=opts.AxisOpts(name_rotate=30, axislabel_opts={"rotate": 315}))  # axis settings
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))  # hide the value labels
    )
    grid.add(bar, grid_opts=opts.GridOpts(pos_bottom="25%"))
    grid.render(f'{title}票房数据.html')


if __name__ == '__main__':
    for i in lis:
        draw(i)
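The plotting function reads the year by slicing fixed character positions out of the path, which stops working as soon as the file name changes length. A minimal alternative, assuming the file name contains exactly one four-digit year, is to search for it with a regular expression. The name `year_from_path` is hypothetical.

```python
import os
import re


def year_from_path(path):
    """Return the first four-digit run in the file name, or None if there is none."""
    match = re.search(r"\d{4}", os.path.basename(path))
    return match.group(0) if match else None
```

For a path such as `艺恩电影票房/2022艺恩电影票房数据.csv` this yields the same string as the slice `path[-16:-12]`.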

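Final result

Running the script writes one HTML file per year, named {year}票房数据.html, each containing a bar chart of that year's box office by film.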
