Web scraping is an important skill in data science and analytics. This article shows how to use Python to scrape air quality data for the city of Harbin and save it as a CSV file for further analysis.
Required Tools
We will use the following Python libraries:

- requests: sends the HTTP requests.
- BeautifulSoup (installed as beautifulsoup4): parses HTML and XML documents.
- pandas: handles data manipulation and analysis.
- chardet: detects the page's character encoding (a short sketch of why this matters follows the install command below).
Make sure these libraries are installed. If not, you can install them with:
pip install requests beautifulsoup4 pandas chardet
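The chardet step deserves a quick illustration. When a server response carries no charset information, requests falls back to ISO-8859-1 for text pages, which garbles Chinese content; chardet can guess the actual encoding from the raw bytes. Here is a minimal sketch of the detection logic that the full scraper below uses (the URL is the site's February 2023 page):

```python
import requests
import chardet

response = requests.get('http://www.tianqihoubao.com/aqi/haerbin-202302.html')

# requests falls back to ISO-8859-1 when the response has no charset header;
# in that case, guess the real encoding from the raw response bytes
if response.encoding == 'ISO-8859-1':
    detected = chardet.detect(response.content)  # e.g. {'encoding': 'GB2312', 'confidence': 0.99, ...}
    response.encoding = detected['encoding'] or 'utf-8'

print(response.encoding)
```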
A sample of the scraped data is shown in the table below (the first half of February). The Quality Level column keeps the site's Chinese labels: 优 (excellent), 良 (good), 轻度污染 (light pollution), 中度污染 (moderate pollution), 重度污染 (heavy pollution).
| Date | Quality Level | AQI Index | Ranking | PM2.5 (µg/m³) | PM10 (µg/m³) | SO2 (µg/m³) | NO2 (µg/m³) | CO (mg/m³) | O3 (µg/m³) |
|------|---------------|-----------|---------|---------------|--------------|-------------|-------------|------------|------------|
| 2023/2/1 | 中度污染 | 155 | 310 | 117 | 135 | 24 | 48 | 1.09 | 49 |
| 2023/2/2 | 良 | 51 | 147 | 34 | 48 | 16 | 28 | 0.67 | 49 |
| 2023/2/3 | 轻度污染 | 104 | 308 | 78 | 103 | 26 | 50 | 1.04 | 35 |
| 2023/2/4 | 轻度污染 | 140 | 314 | 106 | 129 | 21 | 44 | 1.1 | 44 |
| 2023/2/5 | 良 | 97 | 205 | 71 | 98 | 21 | 30 | 0.71 | 56 |
| 2023/2/6 | 重度污染 | 209 | 317 | 165 | 184 | 20 | 50 | 1.07 | 44 |
| 2023/2/7 | 良 | 96 | 227 | 70 | 94 | 20 | 45 | 0.9 | 39 |
| 2023/2/8 | 良 | 83 | 213 | 60 | 84 | 20 | 44 | 0.96 | 35 |
| 2023/2/9 | 良 | 73 | 248 | 51 | 75 | 20 | 44 | 0.97 | 40 |
| 2023/2/10 | 良 | 71 | 212 | 51 | 58 | 14 | 29 | 0.75 | 57 |
| 2023/2/11 | 轻度污染 | 110 | 292 | 83 | 93 | 12 | 34 | 0.96 | 57 |
| 2023/2/12 | 良 | 88 | 289 | 65 | 75 | 19 | 37 | 0.81 | 64 |
| 2023/2/13 | 轻度污染 | 114 | 333 | 84 | 102 | 16 | 46 | 1 | 49 |
| 2023/2/14 | 优 | 40 | 178 | 27 | 38 | 17 | 25 | 0.56 | 56 |
| 2023/2/15 | 良 | 71 | 260 | 51 | 69 | 19 | 40 | 0.86 | 43 |
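For English-language analysis it can help to translate those labels programmatically. Here is a small sketch using a lookup dictionary of my own, based on the six standard Chinese AQI categories (it is not something the source provides):

```python
# Mapping from the standard Chinese AQI categories to English
# (my addition for illustration; not part of the scraped data)
AQI_LEVELS = {
    '优': 'Excellent',
    '良': 'Good',
    '轻度污染': 'Light pollution',
    '中度污染': 'Moderate pollution',
    '重度污染': 'Heavy pollution',
    '严重污染': 'Severe pollution',
}

print(AQI_LEVELS['中度污染'])  # -> Moderate pollution
```

Applied to the DataFrame built later in this article, the translation is a one-liner: `df['Quality Level'].map(AQI_LEVELS)`.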
Scraper Code
Below is the complete scraper code, which fetches air quality data for February through March 2023 from http://www.tianqihoubao.com/aqi/haerbin- (the site serves one page per month, e.g. haerbin-202302.html).
import requests
from bs4 import BeautifulSoup
import pandas as pd
import chardet
import time  # needed for the retry delay in extract_data()
# Define the base URL and the range of months to scrape
base_url = 'http://www.tianqihoubao.com/aqi/haerbin-'
months = pd.date_range('2023-02', '2023-03', freq='MS').strftime("%Y%m").tolist()
# Initialize empty lists to store the data
dates = []
quality_levels = []
aqi_indices = []
rankings = []
pm25_values = []
pm10_values = []
so2_values = []
no2_values = []
co_values = []
o3_values = []
# Function to extract data from a given URL
def extract_data(url):
    retries = 3  # number of retry attempts
    for attempt in range(retries):
        try:
            response = requests.get(url)
            # Detect the real encoding if requests fell back to ISO-8859-1
            if response.encoding == 'ISO-8859-1':
                detected_encoding = chardet.detect(response.content)['encoding']
                response.encoding = detected_encoding if detected_encoding else 'utf-8'
            soup = BeautifulSoup(response.text, 'html.parser')
            table = soup.find('table', {'class': 'b'})
            if table:
                for row in table.find_all('tr')[1:]:  # skip the header row
                    columns = row.find_all('td')
                    if len(columns) < 10:
                        continue  # skip malformed rows
                    dates.append(columns[0].text.strip())
                    quality_levels.append(columns[1].text.strip())
                    aqi_indices.append(columns[2].text.strip())
                    rankings.append(columns[3].text.strip())
                    pm25_values.append(columns[4].text.strip())
                    pm10_values.append(columns[5].text.strip())
                    so2_values.append(columns[6].text.strip())
                    no2_values.append(columns[7].text.strip())
                    co_values.append(columns[8].text.strip())
                    o3_values.append(columns[9].text.strip())
            break  # exit the retry loop once the request succeeds
        except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError) as e:
            print(f"Error occurred: {e}. Retrying {attempt + 1}/{retries}...")
            time.sleep(5)  # wait 5 seconds before retrying
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            break
# Iterate through each month and extract its data
for month in months:
    url = base_url + month + '.html'
    extract_data(url)
# Create a DataFrame to store the data
data = {
    'Date': dates,
    'Quality Level': quality_levels,
    'AQI Index': aqi_indices,
    'Ranking': rankings,
    'PM2.5': pm25_values,
    'PM10': pm10_values,
    'SO2': so2_values,
    'NO2': no2_values,
    'CO': co_values,
    'O3': o3_values
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)
# Save the DataFrame to a CSV file with utf-8 encoding
df.to_csv('harbin_air.csv', index=False, encoding='utf-8-sig')
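With the CSV written, the data is ready for analysis. One caveat: the scraper stored every field as text, so numeric columns should be converted before aggregating. A short sketch of reloading the file and computing some quick statistics (the column names match the DataFrame built above):

```python
import pandas as pd

# Reload the CSV produced by the scraper
df = pd.read_csv('harbin_air.csv')

# Convert pollutant columns from strings to numbers;
# errors='coerce' turns any stray non-numeric cell into NaN instead of raising
numeric_cols = ['AQI Index', 'Ranking', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3']
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

print(df['AQI Index'].describe())          # AQI summary statistics
print(df.groupby('Quality Level').size())  # number of days per quality level
```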