Web scraping is an important skill in data science and analytics. This article shows how to use Python to scrape air quality data for the city of Harbin and save it as a CSV file for further analysis.
Required Tools
We will use the following Python libraries:

- requests: sends the HTTP requests.
- BeautifulSoup (installed as beautifulsoup4): parses HTML and XML documents.
- pandas: handles data manipulation and analysis.
- chardet: detects the page's character encoding (a short sketch of why this matters follows the install command below).
Make sure these libraries are installed. If not, you can install them with:
pip install requests beautifulsoup4 pandas chardet
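The chardet step deserves a quick illustration. When a server response carries no charset information, requests falls back to ISO-8859-1 for text pages, which garbles Chinese content; chardet can guess the actual encoding from the raw bytes. Here is a minimal sketch of the detection logic that the full scraper below uses (the URL is the site's February 2023 page):

```python
import requests
import chardet

response = requests.get('http://www.tianqihoubao.com/aqi/haerbin-202302.html')

# requests falls back to ISO-8859-1 when the response has no charset header;
# in that case, guess the real encoding from the raw response bytes
if response.encoding == 'ISO-8859-1':
    detected = chardet.detect(response.content)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
    response.encoding = detected['encoding'] or 'utf-8'

print(response.encoding)
```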
A sample of the scraped data is shown in the table below (the first half of February). The Quality Level column keeps the site's Chinese labels: 优 (excellent), 良 (good), 轻度污染 (light pollution), 中度污染 (moderate pollution), 重度污染 (heavy pollution).
| Date | Quality Level | AQI Index | Ranking | PM2.5 (µg/m³) | PM10 (µg/m³) | SO2 (µg/m³) | NO2 (µg/m³) | CO (mg/m³) | O3 (µg/m³) |
|------|---------------|-----------|---------|---------------|--------------|-------------|-------------|------------|------------|
| 2023/2/1 | 中度污染 | 155 | 310 | 117 | 135 | 24 | 48 | 1.09 | 49 |
| 2023/2/2 | 良 | 51 | 147 | 34 | 48 | 16 | 28 | 0.67 | 49 |
| 2023/2/3 | 轻度污染 | 104 | 308 | 78 | 103 | 26 | 50 | 1.04 | 35 |
| 2023/2/4 | 轻度污染 | 140 | 314 | 106 | 129 | 21 | 44 | 1.1 | 44 |
| 2023/2/5 | 良 | 97 | 205 | 71 | 98 | 21 | 30 | 0.71 | 56 |
| 2023/2/6 | 重度污染 | 209 | 317 | 165 | 184 | 20 | 50 | 1.07 | 44 |
| 2023/2/7 | 良 | 96 | 227 | 70 | 94 | 20 | 45 | 0.9 | 39 |
| 2023/2/8 | 良 | 83 | 213 | 60 | 84 | 20 | 44 | 0.96 | 35 |
| 2023/2/9 | 良 | 73 | 248 | 51 | 75 | 20 | 44 | 0.97 | 40 |
| 2023/2/10 | 良 | 71 | 212 | 51 | 58 | 14 | 29 | 0.75 | 57 |
| 2023/2/11 | 轻度污染 | 110 | 292 | 83 | 93 | 12 | 34 | 0.96 | 57 |
| 2023/2/12 | 良 | 88 | 289 | 65 | 75 | 19 | 37 | 0.81 | 64 |
| 2023/2/13 | 轻度污染 | 114 | 333 | 84 | 102 | 16 | 46 | 1 | 49 |
| 2023/2/14 | 优 | 40 | 178 | 27 | 38 | 17 | 25 | 0.56 | 56 |
| 2023/2/15 | 良 | 71 | 260 | 51 | 69 | 19 | 40 | 0.86 | 43 |
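For English-language analysis it can help to translate those labels programmatically. Here is a small sketch using a lookup dictionary of my own, based on the six standard Chinese AQI categories (it is not something the source provides):

```python
# Mapping from the standard Chinese AQI categories to English
# (my addition for illustration; not part of the scraped data)
AQI_LEVELS = {
    '优': 'Excellent',
    '良': 'Good',
    '轻度污染': 'Light pollution',
    '中度污染': 'Moderate pollution',
    '重度污染': 'Heavy pollution',
    '严重污染': 'Severe pollution',
}

print(AQI_LEVELS['中度污染'])  # -> Moderate pollution
```

Applied to the DataFrame built later in this article, the translation is a one-liner: `df['Quality Level'].map(AQI_LEVELS)`.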
Scraper Code
Below is the complete scraper code, which fetches air quality data for February through March 2023 from http://www.tianqihoubao.com/aqi/haerbin- (the site serves one page per month, e.g. haerbin-202302.html).
import requests
from bs4 import BeautifulSoup
import pandas as pd
import chardet
import time  # needed for the retry delay in extract_data()
# Define the base URL and the range of months to scrape
base_url = 'http://www.tianqihoubao.com/aqi/haerbin-'
months = pd.date_range('2023-02', '2023-03', freq='MS').strftime("%Y%m").tolist()
# Initialize empty lists to store the data
dates = []
quality_levels = []
aqi_indices = []
rankings = []
pm25_values = []
pm10_values = []
so2_values = []
no2_values = []
co_values = []
o3_values = []
# Function to extract data from a given URL
def extract_data(url):
    retries = 3  # number of retry attempts
    for attempt in range(retries):
        try:
            response = requests.get(url)
            # Detect the real encoding if requests fell back to ISO-8859-1
            if response.encoding == 'ISO-8859-1':
                detected_encoding = chardet.detect(response.content)['encoding']
                response.encoding = detected_encoding if detected_encoding else 'utf-8'
            soup = BeautifulSoup(response.text, 'html.parser')
            table = soup.find('table', {'class': 'b'})
            if table:
                for row in table.find_all('tr')[1:]:  # skip the header row
                    columns = row.find_all('td')
                    if len(columns) < 10:
                        continue  # skip malformed rows
                    dates.append(columns[0].text.strip())
                    quality_levels.append(columns[1].text.strip())
                    aqi_indices.append(columns[2].text.strip())
                    rankings.append(columns[3].text.strip())
                    pm25_values.append(columns[4].text.strip())
                    pm10_values.append(columns[5].text.strip())
                    so2_values.append(columns[6].text.strip())
                    no2_values.append(columns[7].text.strip())
                    co_values.append(columns[8].text.strip())
                    o3_values.append(columns[9].text.strip())
            break  # exit the retry loop once the request succeeds
        except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError) as e:
            print(f"Error occurred: {e}. Retrying {attempt + 1}/{retries}...")
            time.sleep(5)  # wait 5 seconds before retrying
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            break
# Iterate through each month and extract its data
for month in months:
    url = base_url + month + '.html'
    extract_data(url)
# Create a DataFrame to store the data
data = {
    'Date': dates,
    'Quality Level': quality_levels,
    'AQI Index': aqi_indices,
    'Ranking': rankings,
    'PM2.5': pm25_values,
    'PM10': pm10_values,
    'SO2': so2_values,
    'NO2': no2_values,
    'CO': co_values,
    'O3': o3_values
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
# Save the DataFrame to a CSV file with utf-8 encoding
df.to_csv('harbin_air.csv', index=False, encoding='utf-8-sig')
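With the CSV written, the data is ready for analysis. One caveat: the scraper stored every field as text, so numeric columns should be converted before aggregating. A short sketch of reloading the file and computing some quick statistics (the column names match the DataFrame built above):

```python
import pandas as pd

# Reload the CSV produced by the scraper
df = pd.read_csv('harbin_air.csv')

# Convert pollutant columns from strings to numbers;
# errors='coerce' turns any stray non-numeric cell into NaN instead of raising
numeric_cols = ['AQI Index', 'Ranking', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(df['AQI Index'].describe())          # AQI summary statistics
print(df.groupby('Quality Level').size())  # number of days per quality level
```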