Scraping Historical Temperatures from 2345 Weather (tianqi.2345.com)

Latest recommended article published on 2025-02-03 20:56:03

意hongdouble绿

Tags: python

Article link: https://blog.csdn.net/weixin_42960991/article/details/141388028

Contents

Preface
I. Page analysis
II. Data scraping
- 1. Setting the request headers
- 2. Fetching the data
Summary

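Preface

Data analysis often calls for historical weather by administrative region. The China Meteorological Administration's historical weather service is frequently down for maintenance and unreachable, so I looked for a substitute. 2345 Weather (tianqi.2345.com) has data for roughly the last 13 years (2011-2024). That is not much, but it is good enough to get by. The goal is to export the data to Excel, and the whole process is quite simple.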

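I. Page analysis

Existing tutorials are fairly old. They say the page data is wrapped in a js file, but the current site appears to have changed.
In the browser developer tools, under Fetch/XHR, the Preview tab now shows the data stored as JSON, so we only need to fetch that response and parse it into a standard format.
Looking at the request URL, you can see the year and month parameters. The parameters before them carry the administrative region code.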
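
To make the URL structure concrete, here is a minimal sketch that assembles the same query string from its four parameters. The helper name `build_history_url` is hypothetical, and the default `area_id` and `area_type` values are simply the ones that appear in the URL used later in this post; other regions would need their own codes.

```python
from urllib.parse import urlencode

BASE_URL = "https://tianqi.2345.com/Pc/GetHistory"


def build_history_url(year, month, area_id=60239, area_type=2):
    """Build the history URL for one month (hypothetical helper)."""
    params = {
        "areaInfo[areaId]": area_id,      # region code seen in the request URL
        "areaInfo[areaType]": area_type,  # region type seen in the request URL
        "date[year]": year,
        "date[month]": month,
    }
    # urlencode percent-encodes "[" and "]" as %5B and %5D
    return BASE_URL + "?" + urlencode(params)


print(build_history_url(2011, 3))
```

Building the query from a dictionary keeps the region code and the date in one visible place instead of inside a long concatenated string.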

II. Data scraping

1. Setting the request headers

The requests need to imitate a browser, so define the headers up front:

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Referer': 'https://tianqi.2345.com/',
    'Cookie': 'your_cookie_here',
}

2. Fetching the data

A nested loop fetches the data for the chosen years and months and exports one Excel file per year.
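
As a side note on the loop bounds, the following sketch enumerates (year, month) pairs with `itertools.product`. The function name `iter_year_months` is hypothetical, and the default range 2011-2024 with all twelve months is an assumption based on the coverage mentioned in the preface, not something the site guarantees.

```python
from itertools import product


def iter_year_months(first_year=2011, last_year=2024, months=range(1, 13)):
    """Yield (year, month) pairs with the year as the outer loop (hypothetical helper)."""
    # last_year + 1 because range() excludes its end value
    yield from product(range(first_year, last_year + 1), months)


# Two years, March to May only
pairs = list(iter_year_months(2011, 2012, months=range(3, 6)))
print(pairs)
```

Because `range()` excludes its end value, `range(2011, 2015)` stops at 2014 and `range(3, 6)` stops at May; making the end year inclusive in a helper like this avoids that off-by-one surprise.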


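# Imports needed by the code below
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Loop over years and months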

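for year in range(2011, 2015):  # 2011-2014; range() excludes the end value
    # Empty list that collects every row for this year
    data = []

    for month in range(3, 6):  # March to May

        # Build the request URL
        url = 'https://tianqi.2345.com/Pc/GetHistory?areaInfo%5BareaId%5D=60239&areaInfo%5BareaType%5D=2&date%5B' \
              'year%5D='+str(year)+'&date%5Bmonth%5D='+str(month)
        # Send the GET request (the timeout keeps a stalled request from hanging forever)
        response = requests.get(url, headers=headers, timeout=10)

        # Parse the JSON response
        json_data = response.json()

        # Get the HTML fragment nested inside the JSON
        html_content = json_data['data']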

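        # Defensive cleanup of leftover escape sequences
        # (normally a no-op, since response.json() already decodes them)
        html_content = html_content.replace('\\"', '"').replace('\\/', '/').replace('\\n', '')

        # Parse the HTML fragment with BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract the table rows
        table = soup.find('table', class_='history-table')
        if table is None:
            # No table in the response for this month, so skip it
            continue
        rows = table.find_all('tr')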

        # Process each row
        for row in rows[1:]:  # skip the header row
            cells = row.find_all('td')

            # Take up to six cells: date, high, low, weather, wind, AQI
            values = [cell.text.strip() for cell in cells[:6]]
            # Pad short rows with empty strings so every row has six columns
            values += [''] * (6 - len(values))

            data.append(values)

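    # Convert this year's data to a pandas DataFrame
    df = pd.DataFrame(data, columns=['Date', 'High temperature', 'Low temperature', 'Weather', 'Wind', 'AQI'])

    # Export to an Excel file (pandas needs an Excel writer engine such as openpyxl)
    file_name = f'weather_data_{year}.xlsx'
    df.to_excel(file_name, index=False)

    # Report that this year's file was saved
    print(f'{file_name} saved successfully')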
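
The parsing step can also be tried offline, without hitting the site. The sketch below is not the BeautifulSoup approach used above: it swaps in the standard-library `html.parser` so that it runs with no extra packages. The `sample` payload is made up to imitate the assumed response shape (a JSON object whose `data` field holds an HTML table), and the class name `TableRows` is hypothetical.

```python
import json
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # cells of the row currently being read
        self._in_td = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # rows with only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()


# Made-up payload imitating the assumed shape {"data": "<table>...</table>"}
sample = json.dumps({
    "data": '<table class="history-table">'
            '<tr><th>Date</th><th>High</th><th>Low</th></tr>'
            '<tr><td>2011-03-01</td><td>10°</td><td>2°</td></tr>'
            '</table>'
})

parser = TableRows()
parser.feed(json.loads(sample)["data"])

# Pad every row to six columns, mirroring the padding done in the loop above
padded = [row[:6] + [""] * (6 - len(row[:6])) for row in parser.rows]
print(padded)
```

Testing against a saved or synthetic payload like this makes it easier to notice when the site changes its table layout, since the parser can be checked without sending any requests.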