A Hands-On Guide to Scraping Weather Data with Python (with Visualization Tips)

Article link: https://blog.csdn.net/fluxcode/article/details/147980029

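1. Introduction: Web Scraping Can Be Simple!

Lately a lot of readers have been messaging me with questions like "How do I scrape weather data?" and "What do I do when I can't make sense of the page structure?" (scratches head). So today, folks, here is a genuinely practical weather-scraping tutorial! No advanced skills needed: basic Python is enough to get started, and you can even turn the data into slick charts. (Full code at the end of the article.)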

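2. Preparation: Don't Rush into Writing Code!

2.1 Picking the Right Target Site Matters!

I recommend China Weather (www.weather.com.cn) as the data source, since official data is reliable and stable. Do check the site's robots.txt first (append /robots.txt to the site address to view it) and confirm which directories may be crawled.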
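
The robots.txt check can also be done in code. Here is a minimal sketch using the standard library's urllib.robotparser. The rules in sample_rules are made up for illustration and are not the real robots.txt of any site, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules already held in memory
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration only.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

allowed = is_allowed(sample_rules, "http://example.com/weather/101010100.shtml")
blocked = is_allowed(sample_rules, "http://example.com/private/data.html")
print(allowed, blocked)
```

Against a live site you would call parser.set_url(...) and parser.read() to download the real file instead of passing in sample lines.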

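2.2 The Three Must-Have Tools

Install these three libraries and you're ready to go:

pip install requests  # the go-to for HTTP requests
pip install beautifulsoup4  # handy HTML parser
pip install pyecharts  # visualization powerhouse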

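3. Hands-On: Scraping in Five Steps!

3.1 Analyze the Page Structure (Key!)

Press F12 to open the developer tools. The URL for Beijing's weather page is:

http://www.weather.com.cn/weather/101010100.shtml

Inspecting the DOM tree shows that each day's weather data sits inside an <li class="sky"> tag. One pitfall here: class names can change, so matching them with a regular expression is the safer bet!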
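
To make the regex suggestion concrete, here is a small sketch. BeautifulSoup accepts a compiled pattern for class_ and tests it against each class name on a tag. The HTML fragment and its class names (skyid, sky-v2 and so on) are invented for illustration, so the live page may look different.

```python
import re
from bs4 import BeautifulSoup

# Invented HTML fragment for illustration; the live page may differ.
SAMPLE_HTML = """
<ul>
  <li class="sky skyid lv2"><h1>7th (today)</h1></li>
  <li class="sky-v2 lv1"><h1>8th (tomorrow)</h1></li>
  <li class="other"><h1>ad</h1></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The pattern is tested against each class name of a tag, so this keeps
# any <li> that has at least one class starting with "sky".
sky_pattern = re.compile(r"^sky")
items = soup.find_all("li", class_=sky_pattern)

matched_dates = [li.find("h1").get_text(strip=True) for li in items]
print(matched_dates)
```

A looser pattern survives more renames but can also pick up unrelated tags, so it is worth checking what actually matches.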

3.2 Send the HTTP Request

import requests

url = 'http://www.weather.com.cn/weather/101010100.shtml'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

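response = requests.get(url, headers=headers, timeout=10)  # timeout so a stalled request cannot hang forever
response.raise_for_status()  # stop early on 4xx/5xx responses
response.encoding = 'utf-8'  # prevent garbled Chinese text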

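3.3 Parse the HTML

Use BeautifulSoup to extract the key fields:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
weather_list = soup.find_all('li', class_='sky')

for item in weather_list:
    date_tag = item.find('h1')
    temp_tag = item.find('p', class_='tem')
    if date_tag is None or temp_tag is None:
        continue  # skip entries missing a date or a temperature
    date = date_tag.text
    temp = temp_tag.text.replace('\n', '')
    print(f'{date}: {temp}')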
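
The loop above only prints each entry. For the storage and charting steps it helps to collect the fields into records first. The sketch below does that on an invented HTML fragment. The wea class for the weather description and the extract_records helper are assumptions of mine, so check them against the real page in the developer tools.

```python
from bs4 import BeautifulSoup

# Invented fragment; the "wea" class for the description is an assumption.
SAMPLE_HTML = """
<ul>
  <li class="sky"><h1>7th (today)</h1><p class="wea">Sunny</p>
    <p class="tem">
      <span>30</span>/<i>25℃</i>
    </p></li>
  <li class="sky"><h1>8th (tomorrow)</h1><p class="wea">Cloudy</p></li>
</ul>
"""

def extract_records(html):
    """Return one dict per forecast entry that has a date and a temperature."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.find_all("li", class_="sky"):
        date_tag = item.find("h1")
        temp_tag = item.find("p", class_="tem")
        wea_tag = item.find("p", class_="wea")
        if date_tag is None or temp_tag is None:
            continue  # incomplete entry, skip it
        records.append({
            "date": date_tag.get_text(strip=True),
            "temp": temp_tag.get_text(strip=True),
            "weather": wea_tag.get_text(strip=True) if wea_tag else "",
        })
    return records

records = extract_records(SAMPLE_HTML)
print(records)
```

The second entry has no temperature block, so it is skipped rather than raising an AttributeError.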

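3.4 Store the Data (Two Options)

Option 1: CSV file

import csv

# utf-8-sig lets Excel detect the encoding, so characters like ℃ display correctly
with open('weather.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'temperature', 'weather'])
    # write the data rows...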
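
As a rough sketch of the row-writing step, assuming the scraped data is already a list of (date, temperature, weather) tuples. The rows here are made-up sample values and write_weather_csv is a hypothetical helper. It writes to an in-memory buffer so the result is easy to inspect.

```python
import csv
import io

# Made-up sample rows in the shape (date, temperature, weather).
rows = [
    ("7th (today)", "30/25℃", "Sunny"),
    ("8th (tomorrow)", "28/22℃", "Cloudy"),
]

def write_weather_csv(f, rows):
    """Write a header row followed by the data rows to an open text file."""
    writer = csv.writer(f)
    writer.writerow(["date", "temperature", "weather"])
    writer.writerows(rows)

# In-memory buffer for the demo; pass a real file object opened with
# newline='' and an explicit encoding to write to disk instead.
buffer = io.StringIO()
write_weather_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```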

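Option 2: MySQL database (suited to large volumes of data)

import pymysql  # pip install pymysql

# Example credentials; replace them with your own
conn = pymysql.connect(host='localhost', user='root', password='123456', database='weather')
cursor = conn.cursor()
sql = "INSERT INTO data (date, temp) VALUES (%s, %s)"
# run the insert...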
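
The insert step needs a running MySQL server, so here is the same pattern sketched with the standard library's sqlite3 as a stand-in. It is not the MySQL code itself: sqlite3 uses ? placeholders where pymysql uses %s, and the rows are made-up sample values.

```python
import sqlite3

# Made-up sample rows in the shape (date, temp).
rows = [("7th (today)", "30/25℃"), ("8th (tomorrow)", "28/22℃")]

# sqlite3 stands in for MySQL here; an in-memory database keeps the
# example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (date TEXT, temp TEXT)")

# Parameterized insert: values are passed separately from the SQL text,
# which avoids quoting bugs and SQL injection.
conn.executemany("INSERT INTO data (date, temp) VALUES (?, ?)", rows)
conn.commit()  # without a commit the rows are not persisted

stored = conn.execute("SELECT date, temp FROM data ORDER BY rowid").fetchall()
conn.close()
print(stored)
```

With pymysql the equivalent would be roughly cursor.executemany(sql, rows) followed by conn.commit().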

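3.5 Visualize the Data (Super Cool!)

Use pyecharts to generate an interactive chart:

from pyecharts.charts import Line

# dates, max_temps, and min_temps are lists built from the scraped data
line = Line()
line.add_xaxis(dates)  # x-axis: dates
line.add_yaxis("High", max_temps)
line.add_yaxis("Low", min_temps)
line.render("weather.html")  # writes the chart as a web page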

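4. Pitfall Guide (Hard-Won Lessons!)

Getting past anti-scraping measures: hit a 403 error? Try:
- Adding a Referer request header
- Using a proxy IP pool (a paid service such as 快代理 (Kuaidaili) is recommended)
- Lowering the request rate (add time.sleep(2) between requests)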
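
The "slow down" tip can be sketched as a small retry helper that waits longer after each 403. fetch_with_backoff is a hypothetical name. It takes the fetch function as a parameter (any callable returning an object with a status_code attribute), so it makes no assumptions about your headers or proxy setup.

```python
import time

def fetch_with_backoff(fetch, url, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the status is not 403 or attempts run out.

    Returns the last response received, whatever its status.
    """
    response = None
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 403:
            return response
        if attempt < max_attempts - 1:
            # Wait longer after each refusal: 2s, 4s, ...
            sleep(base_delay * (attempt + 1))
    return response
```

With requests, the call would look something like fetch_with_backoff(lambda u: requests.get(u, headers=headers, timeout=10), url), where headers can also carry a Referer entry.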

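Data-cleaning tips:

import re

# strip surrounding whitespace
text = '  25℃~30℃  '.strip()
# extract the temperatures with a regex; -? keeps the sign of sub-zero values
temps = re.findall(r'-?\d+', text)  # gives ['25', '30']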
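
Building on the regex tip, here is a sketch that turns temperature strings into integer (low, high) pairs and then into the two lists a chart needs. parse_temp_range is a hypothetical helper and the sample strings are made up. The lookbehind stops a hyphen used as a range separator (as in 25-30) from being read as a minus sign.

```python
import re

def parse_temp_range(text):
    """Return (low, high) as ints from text like '25℃~30℃', or None."""
    # "-" counts as a minus sign only when it does not follow a digit.
    numbers = [int(n) for n in re.findall(r"(?<!\d)-?\d+", text)]
    if not numbers:
        return None
    return min(numbers), max(numbers)

# Made-up sample strings, including a sub-zero day.
temp_texts = ["  25℃~30℃  ", "-3℃~5℃", "30/25℃"]
ranges = [parse_temp_range(t) for t in temp_texts]

min_temps = [low for low, _ in ranges]
max_temps = [high for _, high in ranges]
print(min_temps, max_temps)
```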

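Scheduling (scrape automatically every day):

import schedule  # pip install schedule
import time

def job():
    print("Starting the scrape...")

schedule.every().day.at("08:00").do(job)  # run job every day at 08:00
while True:
    schedule.run_pending()  # run any job that is due
    time.sleep(1)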
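
If you'd rather not install schedule, the same idea can be sketched with the standard library alone: work out how long to sleep until the next 08:00. This is an alternative technique, not how schedule works internally, and seconds_until is a hypothetical helper.

```python
from datetime import datetime, timedelta

def seconds_until(hour, minute, now=None):
    """Seconds from now until the next occurrence of hour:minute."""
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)  # today's slot has passed, use tomorrow
    return (target - now).total_seconds()

# Fixed timestamp so the result is reproducible.
wait = seconds_until(8, 0, now=datetime(2024, 1, 1, 7, 0, 0))
print(wait)
```

A daily loop would then call time.sleep(seconds_until(8, 0)), run the job, and repeat.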