Multi-page automated scraping with Python's selenium library

come-昂-

Category: Notes. Tags: python, selenium

Article link: https://blog.csdn.net/weixin_45038397/article/details/107447542

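The Selenium Python bindings provide a simple API for writing functional and acceptance tests with Selenium WebDriver. Through this API you can access all of Selenium WebDriver's functionality in an intuitive way.
selenium lets a script drive a real browser, so combining it with locators such as XPath makes it convenient to scrape content across multiple pages.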

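Selenium with Python reference documentation: http://selenium-python.readthedocs.io/index.html
Driving a browser through selenium requires the browser and its matching driver to be downloaded beforehand.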

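Installing Firefox geckodriver

Install the latest version of Firefox and add the Firefox executable to the system PATH environment variable. Remember to turn off Firefox's automatic updates.

geckodriver download page: https://github.com/mozilla/geckodriver/releases

Put the downloaded geckodriver.exe in a directory that is on PATH, for example D:\Python\Python36\ (adjust to your own setup).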

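Installing ChromeDriver

http://chromedriver.storage.googleapis.com/index.html
Be sure to pick the ChromeDriver version that matches your Chrome version number, which you can check in Chrome's settings page.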
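
Before launching a browser, it can help to confirm that the driver executables are actually discoverable on PATH. The following is a minimal sketch using only the standard library; the helper name `find_driver` is hypothetical, and `geckodriver` and `chromedriver` are assumed to be the executable names of the two drivers.

```python
import shutil


def find_driver(name):
    """Return the full path of a driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    for driver_name in ("geckodriver", "chromedriver"):
        location = find_driver(driver_name)
        print(driver_name, "->", location or "not found on PATH")
```

If a driver is reported as not found, the directory it was copied to is probably not on PATH yet.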

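Today's test is to open the Maoyan Movies site, look up the hot movies ranking, and scrape the synopsis of each movie in turn.
Maoyan hot movies ranking URL: https://maoyan.com/board/7
Because the test may have to be repeated many times, you can access the site through a proxy IP so that the Meituan site does not block your own IP.
selenium provides a way to set a proxy IP: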

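from selenium import webdriver
# If the site has anti-scraping measures, go through a proxy IP. Start with a FirefoxProfile (Selenium 3 API)
profile = webdriver.FirefoxProfile()
# set_preference takes a Firefox preference name and a value; type 1 means manual proxy configuration
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.http', '171.35.169.220')
profile.set_preference('network.proxy.http_port', 9999)
# maoyan.com is served over HTTPS, so route SSL traffic through the same proxy
profile.set_preference('network.proxy.ssl', '171.35.169.220')
profile.set_preference('network.proxy.ssl_port', 9999)
firefox = webdriver.Firefox(firefox_profile=profile)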
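
`set_preference` expects a Firefox preference name and its value, so a proxy is described by several `network.proxy.*` preferences rather than by a bare IP and port. The sketch below builds that set of preferences from a host and a port. The helper names `firefox_proxy_prefs` and `apply_prefs` are hypothetical, and using one proxy for both HTTP and HTTPS traffic is an assumption.

```python
def firefox_proxy_prefs(host, port):
    """Build Firefox preferences that route HTTP and HTTPS traffic via one proxy."""
    port = int(port)  # the port preferences are integers, not strings
    return {
        "network.proxy.type": 1,  # 1 = manual proxy configuration
        "network.proxy.http": host,
        "network.proxy.http_port": port,
        "network.proxy.ssl": host,
        "network.proxy.ssl_port": port,
    }


def apply_prefs(target, prefs):
    """Copy every preference onto an object that has set_preference(name, value)."""
    for name, value in prefs.items():
        target.set_preference(name, value)


if __name__ == "__main__":
    for name, value in firefox_proxy_prefs("171.35.169.220", "9999").items():
        print(name, "=", value)
```

With a real profile this would be used as `apply_prefs(profile, firefox_proxy_prefs('171.35.169.220', 9999))` before the browser is started.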

The main code is as follows:

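firefox.get('https://maoyan.com/board/7')
for i in range(1, 11):
    # Build the XPath of the i-th movie title on the ranking page
    contextxpath = '//dd[{}]//p[@class="name"]'.format(i)
    # click() returns None, so there is nothing useful to keep from this call
    firefox.find_element_by_xpath(contextxpath).click()
    # On the newly opened page, keep searching through firefox (the driver), not through the clicked element
    brief = firefox.find_element_by_xpath('//div/span[@class="dra"]')
    with open('热门电影介绍.txt', 'a+', encoding='utf-8') as c:
        c.write(brief.text + '\n')
    # Return to the ranking page before moving on to the next movie
    firefox.back()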
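
The loop can also be wrapped in a function that receives the driver as a parameter, which makes the click, read, and go-back cycle easier to exercise without a browser. This is only a sketch: `scrape_synopses`, `FakeElement`, and `FakeDriver` are hypothetical names, the fake classes are offline stand-ins rather than part of selenium, and the string `"xpath"` is the locator strategy that selenium's `By.XPATH` constant stands for.

```python
def scrape_synopses(driver, count=10):
    """Visit the first `count` ranked movies and return their synopsis texts."""
    synopses = []
    for i in range(1, count + 1):
        # Open the detail page of the i-th entry on the ranking page
        driver.find_element("xpath", '//dd[{}]//p[@class="name"]'.format(i)).click()
        brief = driver.find_element("xpath", '//div/span[@class="dra"]')
        synopses.append(brief.text)
        driver.back()  # return to the ranking page for the next movie
    return synopses


class FakeElement:
    """Offline stand-in for a page element."""

    def __init__(self, text=""):
        self.text = text

    def click(self):
        pass


class FakeDriver:
    """Offline stand-in for a WebDriver, so the loop can run without a browser."""

    def __init__(self, texts):
        self._texts = texts
        self._current = None
        self.back_calls = 0

    def find_element(self, by, value):
        if 'class="name"' in value:
            # Remember which ranking entry is about to be "clicked"
            self._current = int(value.split("[")[1].split("]")[0])
            return FakeElement()
        return FakeElement(self._texts[self._current - 1])

    def back(self):
        self.back_calls += 1


if __name__ == "__main__":
    print(scrape_synopses(FakeDriver(["first", "second"]), count=2))
```

With a real browser, the call would be `scrape_synopses(firefox)` on an already started driver.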

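selenium provides several methods for locating elements in a page. The method names below belong to the Selenium 3 API; Selenium 4.3 removed them in favor of the generic find_element(By.XPATH, ...) form.

Locating a single element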

find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

Locating multiple elements

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
(the single-element names with an added s: find_element becomes find_elements)
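
Each method name above corresponds to a locator strategy that can also be passed to the generic `find_element(by, value)` and `find_elements(by, value)` methods. The sketch below maps a method name to that pair. The helper `to_generic_call` is hypothetical, and the strategy strings are assumed to be the values behind selenium's `By` constants.

```python
# Locator strategy strings, keyed by the suffix of the find_element_by_* name
STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_generic_call(method_name):
    """Translate e.g. 'find_elements_by_xpath' into ('find_elements', 'xpath')."""
    finder, _, suffix = method_name.partition("_by_")
    if finder not in ("find_element", "find_elements") or suffix not in STRATEGIES:
        raise ValueError("not a find_element(s)_by_* name: " + method_name)
    return finder, STRATEGIES[suffix]


if __name__ == "__main__":
    for name in ("find_element_by_xpath", "find_elements_by_class_name"):
        print(name, "->", to_generic_call(name))
```

Given a driver, `finder, by = to_generic_call('find_element_by_xpath')` followed by `getattr(driver, finder)(by, '//dd[1]')` performs the same lookup through the generic method.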

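Note

With find_element_by_link_text the text must match exactly, so matching on link text is not a very robust approach.
When using XPath, also note that if several elements match the XPath, find_element_by_xpath returns only the first match. If nothing matches, it raises NoSuchElementException.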
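
When a missing element should not abort the whole run, one option is to use the plural lookup, which returns an empty list instead of raising. This is a sketch under that assumption; `first_or_none` is a hypothetical helper, and `driver` is any object with a `find_elements(by, value)` method.

```python
def first_or_none(driver, xpath):
    """Return the first element matching `xpath`, or None when nothing matches."""
    matches = driver.find_elements("xpath", xpath)  # empty list if there is no match
    return matches[0] if matches else None
```

A caller can then skip a movie whose synopsis element is absent by checking for `None`, rather than catching NoSuchElementException.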

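This experiment uses XPath for locating.
Find the element we want in the browser's developer tools, then use a formatted XPath to locate the ten movies on the page: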

contextxpath = '//dd[{}]//p[@class="name"]'.format(i)
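
To see what the formatted expression expands to, the sketch below generates the XPath for each ranking entry. The names `XPATH_TEMPLATE` and `title_xpaths` are hypothetical. Note that XPath positions start at 1, not 0.

```python
XPATH_TEMPLATE = '//dd[{}]//p[@class="name"]'


def title_xpaths(count=10):
    """Return the XPath of each of the first `count` ranking entries."""
    return [XPATH_TEMPLATE.format(i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for xp in title_xpaths(3):
        print(xp)
```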

Click through to the detail page, then use XPath again to locate the content to scrape:

brief = firefox.find_element_by_xpath('//div/span[@class="dra"]')

Save the content to a text file:

    with open('热门电影介绍.txt','a+',encoding='utf-8') as c:
        c.write(brief.text +'\n')
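
Scraped text can contain line breaks or be empty, which makes a one-synopsis-per-line file harder to read back. The following sketch is one possible way to normalize the text before appending it. The helper `append_synopsis` is hypothetical, and collapsing whitespace is an added assumption, not something the original code does.

```python
import os
import tempfile


def append_synopsis(path, text):
    """Append one synopsis per line; return False and write nothing if it is blank."""
    cleaned = " ".join(text.split())  # collapse line breaks and runs of spaces
    if not cleaned:
        return False
    with open(path, "a", encoding="utf-8") as out:
        out.write(cleaned + "\n")
    return True


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    append_synopsis(demo_path, "first line\nwrapped")
    append_synopsis(demo_path, "   ")
    with open(demo_path, encoding="utf-8") as f:
        print(f.read())
```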

After scraping one movie, go back to the previous page:

firefox.back()

When all tasks are done, close the browser:

firefox.close()
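
If an exception is raised partway through the scrape, the line above is never reached and the browser stays open. A context manager is one way to guarantee cleanup. This is a sketch: `browser_session` is a hypothetical helper, and it calls `quit()`, which ends the whole driver session, whereas `close()` only closes the current window.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(make_driver):
    """Create a driver, hand it to the with-block, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the with-block raised
```

With selenium installed, usage would look like `with browser_session(webdriver.Firefox) as firefox:` followed by the scraping code in the indented block.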

The scrape completed successfully, and the synopses were written to 热门电影介绍.txt.

Complete code

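from selenium import webdriver

# If the site has anti-scraping measures, go through a proxy IP. Start with a FirefoxProfile (Selenium 3 API)
profile = webdriver.FirefoxProfile()
# set_preference takes a Firefox preference name and a value; type 1 means manual proxy configuration
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.http', '171.35.169.220')
profile.set_preference('network.proxy.http_port', 9999)
# maoyan.com is served over HTTPS, so route SSL traffic through the same proxy
profile.set_preference('network.proxy.ssl', '171.35.169.220')
profile.set_preference('network.proxy.ssl_port', 9999)
# Pass the profile in when creating the webdriver
firefox = webdriver.Firefox(firefox_profile=profile)

# From here on it is the normal flow
firefox.get('https://maoyan.com/board/7')
for i in range(1, 11):
    # Build the XPath of the i-th movie title on the ranking page
    contextxpath = '//dd[{}]//p[@class="name"]'.format(i)
    # click() returns None, so there is nothing useful to keep from this call
    firefox.find_element_by_xpath(contextxpath).click()
    # On the newly opened page, keep searching through firefox (the driver), not through the clicked element
    brief = firefox.find_element_by_xpath('//div/span[@class="dra"]')
    with open('热门电影介绍.txt', 'a+', encoding='utf-8') as c:
        c.write(brief.text + '\n')
    # Return to the ranking page before moving on to the next movie
    firefox.back()

firefox.close()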