A Simple Hands-On Python Crawler: Scraping Historical Weather Data for Fuyang

山海不见君

Last modified 2024-12-08 23:40:03

First published 2024-11-13 21:54:36

Article link: https://blog.csdn.net/2301_77408198/article/details/143752053

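Our school has started its internship program, and the project is a dashboard of historical weather data for Fuyang. It also involves some big-data tooling, which I won't cover here; instead, this post walks through a very simple crawler.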

1 Data source

The data comes from the Tianqihoubao (天气后报) site at http://tianqihoubao.com/ (page title: 历史天气查询|天气记录|天气预报|气温查询|过去天气_天气后报). Choose Anhui -> Fuyang to reach the Fuyang weather page.

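2 Scraping the data

Scroll down to the historical data, pick any month, open it, and start inspecting the page.

Open the Network tab in the developer tools and look at the page URL: it ends in 202301. Could the URLs follow a pattern? Open another month to check.

As expected, the URLs do follow a pattern: each one ends with the year and month being queried. So we copy over a User-Agent and the job is done, right? (Not so fast.) Here is the first draft of the code: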

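import requests

for year in range(2020, 2025):
    for j in range(1, 13):
        month = str(j).zfill(2)  # zero-pad the month: 1 -> "01"
        k = f"{year}{month}"     # e.g. "202301"
        url = 'http://tianqihoubao.com/lishi/fuyang/month/{}.html'.format(k)
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/130.0.0.0 Safari/537.36 Edg/130.0.0.0'
        }
        # note: the name `re` shadows the standard-library module of the same name
        re = requests.get(url, headers=headers).text
        print(re)
        break  # while testing, fetch a single month only
    break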

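The page is fetched successfully. While testing, don't forget the break statements, otherwise every run requests all 60 months. Now let's find where the data we need sits in the page.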
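The URL pattern can also be checked without touching the network. The following is only a minimal sketch: `BASE_URL` and `month_urls` are names I made up for illustration, and the only thing taken from the site is the YYYYMM suffix observed above.

```python
from typing import Iterator

# URL template observed in the browser; the YYYYMM suffix is filled in below.
BASE_URL = "http://tianqihoubao.com/lishi/fuyang/month/{}.html"


def month_urls(start_year: int, end_year: int) -> Iterator[str]:
    """Yield one URL per month, from January of start_year to December of end_year."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            # :02d zero-pads the month, so January 2023 becomes "202301"
            yield BASE_URL.format(f"{year}{month:02d}")


urls = list(month_urls(2020, 2024))
print(len(urls))   # 5 years x 12 months
print(urls[0])
print(urls[-1])
```

Printing the list first makes it easy to spot a malformed URL before any request is sent.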

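Next we need the XPath of the table, which can be copied straight from the developer tools. (I can't write regular expressions to save my life.)

Then write the XPath code: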

        tree = etree.HTML(re)
        list_li = tree.xpath('//*[@id="content"]/table/tbody/tr')
        print(list_li)

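Suddenly the result is empty. Is the XPath wrong, or is the code wrong? This is the most annoying catch of the whole exercise. It is easy to mistake for an anti-scraping measure, but the cause is simpler: the page source contains no tbody tag. The browser inserts tbody when it builds the DOM, so the XPath copied from the developer tools includes a tbody step that the raw HTML returned by requests does not have.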

Delete tbody from the path and try again, and the rows come back.
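The effect is easy to reproduce offline. This is a sketch under loud assumptions: the markup below is a made-up, simplified stand-in for the real page, and it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without any installs. The point is only that a path with a tbody step matches nothing when the source has no tbody.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source: a table with no <tbody>.
RAW = """
<html><body><div id="content">
<table>
<tr><td>header-1</td><td>header-2</td></tr>
<tr><td>cell-1</td><td>cell-2</td></tr>
</table>
</div></body></html>
""".strip()

root = ET.fromstring(RAW)

# Path as copied from the browser, which shows a <tbody> the source never had.
with_tbody = root.findall(".//div[@id='content']/table/tbody/tr")
# Same path with the tbody step removed.
without_tbody = root.findall(".//div[@id='content']/table/tr")

print(len(with_tbody), len(without_tbody))
```

If you would rather not depend on whether tbody is present, a path such as `//table//tr` sidesteps the question, at the cost of also matching rows of nested tables.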

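Next, iterate over the tr rows. The first row is the table header, which we don't need; the remaining rows hold their data in td cells, four values per row.

Extract the text of each cell and print it: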

        for li in list_li:
            date = li.xpath('./td[1]/text()')
            print(date)
            weather_condition = li.xpath('./td[2]/text()')
            print(weather_condition)
            max_min_temperature = li.xpath('./td[3]/text()')
            print(max_min_temperature)
            wind_direction = li.xpath('./td[4]/text()')
            print(wind_direction)

The data is there, but the values are full of junk characters. Strip the whitespace from both ends and look again:

        for li in list_li:
            date = li.xpath('./td[1]/text()')
            date = [i.strip().replace('\r\n', '') for i in date]
            print(date)
            weather_condition = li.xpath('./td[2]/text()')
            weather_condition = [i.strip().replace('\r\n', '') for i in weather_condition]
            print(weather_condition)
            max_min_temperature = li.xpath('./td[3]/text()')
            max_min_temperature = [i.strip().replace('\r\n', '') for i in max_min_temperature]
            print(max_min_temperature)
            wind_direction = li.xpath('./td[4]/text()')
            wind_direction = [i.strip().replace('\r\n', '') for i in wind_direction]
            print(wind_direction)

There are still many spaces in the middle of the values, so we replace them with the empty string. The final cleaning code is:

        for li in list_li:
            date = li.xpath('./td[1]/text()')
            date = [i.strip().replace('\r\n', '').replace(' ', '') for i in date]
            print(date)
            weather_condition = li.xpath('./td[2]/text()')
            weather_condition = [i.strip().replace('\r\n', '').replace(' ', '') for i in weather_condition]
            print(weather_condition)
            max_min_temperature = li.xpath('./td[3]/text()')
            max_min_temperature = [i.strip().replace('\r\n', '').replace(' ', '') for i in max_min_temperature]
            print(max_min_temperature)
            wind_direction = li.xpath('./td[4]/text()')
            wind_direction = [i.strip().replace('\r\n', '').replace(' ', '') for i in wind_direction]
            print(wind_direction)
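The same cleanup is repeated four times above, so it could be pulled into one helper. A minimal sketch, assuming you are happy to get one string per cell instead of a list; `clean_cell` is a name I made up, and the sample input is invented to mimic the messy text nodes.

```python
def clean_cell(fragments):
    """Join the text nodes of one <td> and remove every whitespace character."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, \r, \n, tabs), so re-joining the pieces drops all of it.
    return "".join("".join(fragments).split())


# Invented sample: leading/trailing line breaks plus spaces in the middle.
sample = ["\r\n        a  /\r\n      b   "]
print(clean_cell(sample))
```

Unlike chaining strip() and replace(), this also removes tabs and stray line breaks in the middle of a value.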

Now store the data in a pandas DataFrame:

result = pd.DataFrame(columns=['date', 'weather_condition', 'max_min_temperature', 'wind_direction'])

for year in range(2020, 2025):
    for j in range(1, 13):
        r = pd.DataFrame(columns=['date', 'weather_condition', 'max_min_temperature', 'wind_direction'])
        ...
        for li in list_li:
            ...
            data_to_add = {
                'date': date,
                'weather_condition': weather_condition,
                'max_min_temperature': max_min_temperature,
                'wind_direction': wind_direction
            }
            new_row = pd.DataFrame(data_to_add)
            r = pd.concat([r, new_row], ignore_index=True)
            print(r)
        break
    break

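The data is stored, but the first row of each month is blank. Once the loop over the rows has finished, drop that first row and concatenate the month onto the full result: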

        r = r.iloc[1:]
        result = pd.concat([result, r], ignore_index=True)
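Calling pd.concat once per row works, but it copies the frame every time. An alternative, sketched here in plain Python, is to collect the rows in a list and filter out the blank header row by content rather than by position. `rows_to_records` and the sample rows are made up for illustration, and the sketch assumes each row has already been cleaned into four strings.

```python
COLUMNS = ["date", "weather_condition", "max_min_temperature", "wind_direction"]


def rows_to_records(rows):
    """Turn cleaned table rows into dicts, skipping header or blank rows."""
    records = []
    for cells in rows:
        # Skip anything that is not four non-empty values,
        # which covers the blank first row of each month.
        if len(cells) != len(COLUMNS) or not all(cells):
            continue
        records.append(dict(zip(COLUMNS, cells)))
    return records


# Invented sample: one blank header row followed by one data row.
sample_rows = [["", "", "", ""], ["d1", "w1", "t1", "f1"]]
records = rows_to_records(sample_rows)
print(records)
```

The resulting list can be handed to `pd.DataFrame(records)` in a single call once all months have been collected.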

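3 Saving the data

I need to feed the data into MySQL, so I saved it to a MySQL instance running on a virtual machine. Save the data wherever suits your needs.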

config = {
    'user': 'root',
    'password': '...',
    'host': 'hadoop101',
    'database': 'weather_db',
    'raise_on_warnings': True
}
engine = create_engine(
    f"mysql+pymysql://{config['user']}:{config['password']}@{config['host']}/{config['database']}")
result.to_sql('history_weather', con=engine, if_exists='replace', index=False)
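If you have no MySQL server at hand, one option is SQLite from the standard library. This is only a sketch of that alternative, not what I used: `save_records` is a made-up helper, the rows are invented, and the table simply reuses the column names from above with an in-memory database.

```python
import sqlite3


def save_records(conn, records):
    """Create the table if needed, insert the rows, and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS history_weather ("
        "date TEXT, weather_condition TEXT, "
        "max_min_temperature TEXT, wind_direction TEXT)"
    )
    # Parameterized inserts avoid quoting problems in the scraped text.
    conn.executemany("INSERT INTO history_weather VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM history_weather").fetchone()[0]


# Invented sample rows; ":memory:" keeps the database out of the file system.
conn = sqlite3.connect(":memory:")
count = save_records(conn, [("d1", "w1", "t1", "f1"), ("d2", "w2", "t2", "f2")])
print(count)
conn.close()
```

Replace ":memory:" with a file name to keep the data between runs.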

4 Complete code

import requests
from lxml import etree
import pandas as pd
from sqlalchemy import create_engine

columns = ['date', 'weather_condition', 'max_min_temperature', 'wind_direction']
headers = {
    'User-Agent': '...'
}


def clean(values):
    # strip both ends, then drop line breaks and inner spaces
    return [i.strip().replace('\r\n', '').replace(' ', '') for i in values]


result = pd.DataFrame(columns=columns)

for year in range(2020, 2025):
    for j in range(1, 13):
        # December 2024 is skipped (this post was written in November 2024)
        if year == 2024 and j == 12:
            continue
        r = pd.DataFrame(columns=columns)
        month = str(j).zfill(2)
        k = f"{year}{month}"
        url = 'http://tianqihoubao.com/lishi/fuyang/month/{}.html'.format(k)
        # named `html` rather than `re` so the standard-library module is not shadowed
        html = requests.get(url, headers=headers, timeout=10).text
        tree = etree.HTML(html)
        # no tbody step: the raw page source does not contain one
        list_li = tree.xpath('//*[@id="content"]/table/tr')
        for li in list_li:
            data_to_add = {
                'date': clean(li.xpath('./td[1]/text()')),
                'weather_condition': clean(li.xpath('./td[2]/text()')),
                'max_min_temperature': clean(li.xpath('./td[3]/text()')),
                'wind_direction': clean(li.xpath('./td[4]/text()'))
            }
            new_row = pd.DataFrame(data_to_add)
            r = pd.concat([r, new_row], ignore_index=True)
        r = r.iloc[1:]  # drop the blank first (header) row of the month
        result = pd.concat([result, r], ignore_index=True)
        print(result)
    # the debugging `break` that stopped the crawl after 2020 has been removed here


config = {
    'user': 'root',
    'password': '...',
    'host': 'hadoop101',
    'database': 'weather_db',
    'raise_on_warnings': True
}
engine = create_engine(
    f"mysql+pymysql://{config['user']}:{config['password']}@{config['host']}/{config['database']}")
result.to_sql('history_weather', con=engine, if_exists='replace', index=False)