Scraping Real-Time Temperature Data from China Weather Network (中国天气网) with Python

Fl_Sn

Article link: https://blog.csdn.net/bangdi12/article/details/83994528

Goal

Write a simple crawler in Python that scrapes temperature data from China Weather Network (weather.com.cn).

Early attempts

requests + re

Match the data returned by requests with a regular expression. While learning, I borrowed a snippet from an expert online: https://www.cnblogs.com/Rhythm-/p/9255255.html

import requests
import re


def get_weather(url):
    response = requests.get(url)
    response.encoding = 'utf-8'

    # Grab the day's temperature (not real-time). The Chinese characters in the
    # pattern (月/日/时 = month/day/hour) must stay: they match the page text.
    aim = re.findall('<input type="hidden" id="hidden_title" value="(.*?)月(.*?)日(.*?)时 (.*?)  (.*?)  (.*?)"',
                     response.text, re.S)
    print("Today's temperature: %s" % aim[0][5])


if __name__ == "__main__":
    url_bj = "http://www.weather.com.cn/weather1d/101010100.shtml"
    get_weather(url_bj)

Output:

Today's temperature: 14/2°C

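This code reads the tag highlighted in red in the figure below:
[Image: page source with the hidden_title input tag highlighted in red]
As you can see, this value is the day's high and low temperature, which is not the same as the real-time temperature.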
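To make the capture groups concrete, here is a minimal sketch that runs the same pattern against a made-up sample of the hidden input. The sample string and the helper name `parse_hidden_title` are assumptions for illustration, and the real page may format the value differently. Only the sixth group (index 5) is known from the code above to hold the high/low temperature. The helper also returns None when the tag is missing, whereas `aim[0]` would raise an IndexError on an empty result.

```python
import re

# Hypothetical sample of the hidden input; the real page may differ.
SAMPLE_HTML = '<input type="hidden" id="hidden_title" value="11月12日08时 周一  晴  14/2°C" />'

# Same pattern as in get_weather; each (.*?) is one capture group.
PATTERN = re.compile(
    r'<input type="hidden" id="hidden_title" '
    r'value="(.*?)月(.*?)日(.*?)时 (.*?)  (.*?)  (.*?)"',
    re.S,
)


def parse_hidden_title(html):
    """Return the six captured fields as a tuple, or None if the tag is absent."""
    match = PATTERN.search(html)
    return match.groups() if match else None


fields = parse_hidden_title(SAMPLE_HTML)
print(fields)                                # six strings; index 5 is the temperature
print(parse_hidden_title("<html></html>"))   # None, instead of an IndexError
```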

requests + bs4

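First, locate the tag that holds the real-time temperature:
[Image: page source showing the div with class "tem" that holds the real-time temperature]
With bs4, create a BeautifulSoup object, then use its find_all method to search for the tag and its content: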

from bs4 import BeautifulSoup
import requests


def real_time_weather(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, "html.parser")
    # Find every <div class="tem"> in the downloaded HTML
    tem = html.find_all("div", class_="tem")
    print(tem)


if __name__ == "__main__":
    url_bj = "http://www.weather.com.cn/weather1d/101010100.shtml"
    real_time_weather(url_bj)

Running the program prints:

[<div class="tem">
</div>]

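As you can see, the tag was found, but its content is missing.
When I searched online at the time, the apparent cause was that China Weather Network serves .shtml pages, so some content is not visible to requests or urllib. I have not studied the mechanism in detail, but a more likely explanation is that the temperature is filled in by JavaScript after the page loads: requests and urllib only download the raw HTML and never execute scripts, so the div is still empty.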
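The difference between the HTML that requests receives and the page a browser ends up showing can be sketched offline. The two snippets below are hypothetical stand-ins (the empty div copies the output above; the filled one assumes a span for the value and an em for the unit), and `extract_temperature` is an illustrative helper name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets: what requests receives vs. what a browser holds
# after the page's scripts have filled in the value.
STATIC_HTML = '<div class="tem">\n</div>'
RENDERED_HTML = '<div class="tem"><span>4</span><em>℃</em></div>'


def extract_temperature(html):
    """Return text such as '4℃', or None when div.tem has not been filled in."""
    soup = BeautifulSoup(html, "html.parser")
    tem = soup.find("div", class_="tem")
    if tem is None or tem.span is None or tem.em is None:
        return None
    return tem.span.text + tem.em.text


print(extract_temperature(STATIC_HTML))    # None: the div exists but is empty
print(extract_temperature(RENDERED_HTML))  # 4℃
```

Returning None for the empty case makes it easy to tell "tag found but not yet populated" apart from a real value.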

Scraping the shtml content with selenium

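selenium obtains the page source by driving a real browser. Installing selenium is not covered here; it can be installed through PyCharm or pip.
Because a browser has to be launched, this approach is comparatively slow. I will look for other approaches later.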

selenium + bs4

from bs4 import BeautifulSoup
from selenium import webdriver


def real_time_weather(url):
    browser = webdriver.Chrome()
    browser.get(url)
    content = browser.page_source
    # quit() closes the window and also stops the chromedriver process
    browser.quit()

    html = BeautifulSoup(content, "html.parser")
    tem = html.find_all("div", class_="tem")
    # On inspection, the first element returned by find_all holds the data we want:
    # the span contains the real-time temperature value, the em contains its unit
    result = tem[0].span.text + tem[0].em.text

    print("Real-time temperature: " + result)


if __name__ == "__main__":
    url_bj = "http://www.weather.com.cn/weather1d/101010100.shtml"
    real_time_weather(url_bj)

Running it returns:

Real-time temperature: 4℃
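Reading page_source immediately after get() can, in principle, race with the scripts that fill in the value. A somewhat more defensive sketch is to wait until the span inside div.tem has text and to shut the browser down even on errors. The function name `real_time_temperature`, the 10-second timeout, and the CSS selectors are assumptions based on the tag structure described above, not the author's code.

```python
def real_time_temperature(url, timeout=10):
    """Open url in Chrome and return value + unit once div.tem has been filled in."""
    # Imported inside the function so this file can be loaded without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # until() polls the callable until it returns a truthy value; a missing
        # element (NoSuchElementException) is ignored and retried until the timeout.
        value = WebDriverWait(browser, timeout).until(
            lambda d: d.find_element(By.CSS_SELECTOR, "div.tem span").text.strip()
        )
        unit = browser.find_element(By.CSS_SELECTOR, "div.tem em").text
        return value + unit
    finally:
        # quit() ends the session and the chromedriver process, even after an error.
        browser.quit()
```

It would be called as `real_time_temperature("http://www.weather.com.cn/weather1d/101010100.shtml")`. If the value never appears, WebDriverWait raises a TimeoutException instead of silently returning an empty string.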

Browser driver issues

selenium needs the driver for the browser you choose (I use Chrome), and the driver has to be downloaded in advance. Otherwise the call raises the following exception:

selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
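A quick way to check for this before launching the browser is to look for the driver on PATH with the standard library. This is only a sketch: `find_driver` is a hypothetical helper name, and it assumes the executable is called chromedriver.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH; download it and add its folder to PATH.")
else:
    print("Using chromedriver at " + path)
```

Recent Selenium releases can also locate or download a matching driver on their own, so this check matters mostly for older installations.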