Scraping dynamic website data with Selenium (weather forecast example)

Latest recommended article published on 2024-04-19 15:16:06

种瓜猹

Article tags: selenium python

Article link: https://blog.csdn.net/Skriften/article/details/134179201

I. Introduction

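This article uses the hourly forecast module of the weather site https://weather.com as an example to walk through scraping dynamically rendered data. The browser is Chrome, driven from Python with the selenium library.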

II. Preparation

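1. The Python selenium library (pip install selenium is enough);

2. ChromeDriver

(1) Check the Chrome version: open "About Chrome" in the browser settings and read the version number shown there;

(2) For Chrome 114 and earlier, the matching driver package is available at https://chromedriver.storage.googleapis.com/index.html; for Chrome 115 and later, it is available at https://googlechromelabs.github.io/chrome-for-testing/;

(3) After downloading, unzip the package into the Chrome installation directory;

(4) Add the Chrome installation directory to the PATH system environment variable to finish the setup;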
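The version split in step (2) can be written down as a small lookup. This is only a sketch: the function name driver_download_page is hypothetical, the cut-over at major version 115 is stated as an assumption in the code, and the helper only returns one of the two pages named above without downloading anything.

```python
# Hypothetical helper: choose the ChromeDriver download page for a Chrome version.
LEGACY_INDEX = "https://chromedriver.storage.googleapis.com/index.html"
CFT_DASHBOARD = "https://googlechromelabs.github.io/chrome-for-testing/"


def driver_download_page(chrome_version: str) -> str:
    """Return the download page matching a version such as '119.0.6045.105'."""
    major = int(chrome_version.strip().split(".")[0])
    # Assumption: the legacy index covers major versions up to 114,
    # and Chrome for Testing covers 115 and later.
    return LEGACY_INDEX if major <= 114 else CFT_DASHBOARD


print(driver_download_page("119.0.6045.105"))
```

Pass in the version string read from "About Chrome" to see which page to open.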

III. Procedure

1. First, configure the driver:

from selenium import webdriver
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
option.add_argument("--headless")  # run the browser without opening a window
option.add_experimental_option('excludeSwitches', ['enable-logging'])  # suppress some log messages
driver = webdriver.Chrome(options=option)
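A driver created this way keeps a browser process alive until quit() is called, so it may help to tie the two together. The sketch below is one possible way to do that with a context manager. The names managed_driver and FakeDriver are hypothetical, and FakeDriver only stands in for webdriver.Chrome so that the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Yield a driver created by `factory` and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        driver.quit()  # runs even if the scraping code raises


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome, used only for this demo."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


with managed_driver(FakeDriver) as d:
    pass  # scraping code would go here
print(d.closed)
```

With the real library, the factory would be something like lambda: webdriver.Chrome(options=option).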

2. Load the URL of the page to scrape with the driver:

# This is the URL of the hourly forecast module on weather.com
driver.get("https://weather.com/zh-CN/weather/hourbyhour/l"
           "/66e11121164bc0b202cf593d748940191571c8bba857e3a5905cbfeda13aaf7b")
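Because the hourly data is rendered dynamically, the elements may not be present the moment driver.get returns. A common remedy is to poll until they appear. The sketch below is a generic polling helper in plain Python, not Selenium's own API (Selenium ships WebDriverWait in selenium.webdriver.support.ui for the same purpose). The name wait_until and its default values are assumptions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

With a live driver this could be used as wait_until(lambda: driver.find_elements(By.CLASS_NAME, "DetailsTable--value--2YD0-")), since find_elements returns an empty list while nothing matches.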

3. Locate the required data:

(1) Locating with XPath:

Open the URL in the browser, select the element to scrape, right-click it and choose "Inspect".

In the "Elements" panel, find the node that holds the required data, right-click it, choose "Copy", then "Copy full XPath"; the clipboard now holds the path string.

In the code, pass the copied XPath to driver.find_element to retrieve the element:

# Locate the data by XPath
rainpor = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
                              "details[1]/div/div[2]/ul/li[6]/div/span[2]/span[1]")  # precipitation amount
pretime = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
                              "details[1]/summary/div/div/h3")  # forecast time
unit = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
                           "details[1]/div/div[2]/ul/li[6]/div/span[2]/span[2]")  # precipitation unit

Printing the text attribute shows that the page content was retrieved successfully:

print(rainpor.text, unit.text, pretime.text)

Output:

0 厘米 16:00
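An absolute XPath encodes the position of every element along the path, so it can point at a different node once the page adds or removes a sibling. The sketch below illustrates the effect with Python's standard xml.etree.ElementTree on two made-up fragments. The markup is an assumption for illustration and is not the real weather.com structure.

```python
import xml.etree.ElementTree as ET

# Made-up fragments, not the real weather.com markup: the second one has an
# extra <li> in front, as might happen when an additional notice is shown.
DRY = "<ul><li class='temp'>18</li><li class='rain'>0</li></ul>"
WET = "<ul><li class='alert'>!</li><li class='temp'>18</li><li class='rain'>3</li></ul>"


def by_position(markup: str) -> str:
    """Pick the second <li> by position, as a copied absolute XPath would."""
    return ET.fromstring(markup).find("li[2]").text


def by_class(markup: str) -> str:
    """Pick the <li> by its class attribute, independent of position."""
    return ET.fromstring(markup).find("li[@class='rain']").text


print(by_position(DRY), by_position(WET))
print(by_class(DRY), by_class(WET))
```

The positional lookup returns the rain value for the first fragment but the temperature for the second, while the class-based lookup returns the rain value in both.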

(2) Locating with class_name:

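In practice, the XPath of the data turned out to differ depending on whether precipitation is forecast, so class_name is used for locating instead. As before, use the browser's Inspect tool to find the class of the required element.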

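The class_name of the precipitation value is "DetailsTable--value--2YD0-". However, within the expanded panel for a given hour, all numeric values share this class_name, so find_elements is used to collect them in a list, and the precipitation value is then taken from that list: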

list_details = driver.find_elements(By.CLASS_NAME, "DetailsTable--value--2YD0-")
rainpor = list_details[5]  # this element contains both the number and the unit

The forecast time is obtained in the same way:

pretime = driver.find_element(By.CLASS_NAME, "DetailsSummary--daypartName--kbngc")
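The class names used here end in what looks like a generated suffix ("2YD0-", "kbngc"). If that suffix changes when the site is rebuilt, which is an assumption and not something verified here, an exact class_name match would stop working. One possible workaround is a CSS attribute selector that matches only the stable prefix. The helper name css_for_class_prefix is hypothetical.

```python
def css_for_class_prefix(prefix: str) -> str:
    """Build a CSS selector matching elements whose class attribute contains `prefix`."""
    if "'" in prefix or '"' in prefix:
        raise ValueError("prefix must not contain quote characters")
    return f"[class*='{prefix}']"


# Assumption: only the part before the trailing suffix is stable.
VALUE_SELECTOR = css_for_class_prefix("DetailsTable--value--")
TIME_SELECTOR = css_for_class_prefix("DetailsSummary--daypartName--")
print(VALUE_SELECTOR)
```

The resulting strings can be passed to driver.find_elements(By.CSS_SELECTOR, VALUE_SELECTOR) and driver.find_element(By.CSS_SELECTOR, TIME_SELECTOR).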

Printing these values gives the same data as the XPath approach, and this method works both with and without precipitation.
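With class_name locating, the number and the unit arrive in a single string such as "0 厘米". If the number is needed for calculation, it can be split off afterwards. A minimal sketch follows. The function name split_reading is hypothetical, and it assumes the text starts with a plain decimal number followed by an optional unit.

```python
import re

# A number, optional whitespace, then whatever remains as the unit.
READING = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*(.*?)\s*")


def split_reading(text: str):
    """Split text like '0 厘米' into (0.0, '厘米'); raise ValueError if no number leads."""
    match = READING.fullmatch(text)
    if match is None:
        raise ValueError(f"no leading number in {text!r}")
    return float(match.group(1)), match.group(2)


print(split_reading("0 厘米"))
```

For a live element this would be called as split_reading(rainpor.text).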

IV. Complete code

from selenium import webdriver
from selenium.webdriver.common.by import By


def data_get():
    # Configure the driver
    option = webdriver.ChromeOptions()
    option.add_argument("--headless")
    option.add_experimental_option('excludeSwitches', ['enable-logging'])
    driver = webdriver.Chrome(options=option)
    # Load the target page with the driver
    driver.get("https://weather.com/zh-CN/weather/hourbyhour/l"
               "/66e11121164bc0b202cf593d748940191571c8bba857e3a5905cbfeda13aaf7b")

    # Locate the data by XPath
    # rainpor = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
    #                                         "details[1]/div/div[2]/ul/li[6]/div/span[2]/span[1]")
    # pretime = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
    #                                         "details[1]/summary/div/div/h3")
    # unit = driver.find_element(By.XPATH, "/html/body/div[1]/main/div[2]/main/div[1]/section/div[2]/div[2]/"
    #                                      "details[1]/div/div[2]/ul/li[6]/div/span[2]/span[2]")

    # Locate the data by class_name
    list_details = driver.find_elements(By.CLASS_NAME, "DetailsTable--value--2YD0-")
    pretime = driver.find_element(By.CLASS_NAME, "DetailsSummary--daypartName--kbngc")
    rainpor = list_details[5]
   
    print(rainpor.text, pretime.text)
    
    driver.quit()


if __name__ == '__main__':
    data_get()
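The lookup logic in data_get can also be separated from the driver setup, which makes it possible to try it without a browser. The following is a sketch under assumptions: extract_hourly, FakeElement and FakeDriver are hypothetical names, the fake values are placeholders and not real site data, and the string "class name" is the value behind Selenium's By.CLASS_NAME.

```python
CLASS_NAME = "class name"  # the string value of Selenium's By.CLASS_NAME


def extract_hourly(driver, rain_index=5):
    """Return (precipitation text, forecast time text) from a loaded hourly page."""
    details = driver.find_elements(CLASS_NAME, "DetailsTable--value--2YD0-")
    if len(details) <= rain_index:
        raise LookupError(
            f"expected more than {rain_index} detail values, found {len(details)}")
    pretime = driver.find_element(CLASS_NAME, "DetailsSummary--daypartName--kbngc")
    return details[rain_index].text, pretime.text


class FakeElement:
    """Hypothetical stand-in for a WebElement; only .text is provided."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Hypothetical stand-in mimicking the two driver methods used above."""

    def find_elements(self, by, value):
        # Placeholder values, not real site data.
        return [FakeElement(t) for t in ["18", "17", "10", "60", "0", "0 厘米"]]

    def find_element(self, by, value):
        return FakeElement("16:00")


print(extract_hourly(FakeDriver()))
```

With the real driver, the same function would be called as extract_hourly(driver) after driver.get has loaded the page.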

That's all.
