Pseudocode
\begin{lstlisting}[language=Python, caption={Air Quality Data Crawler Algorithm}, label=alg:spider]
// Required Libraries
- selenium, pymysql, pandas, time, requests, re, BeautifulSoup, sqlalchemy

Function spider(URL):
    Initialize browser = new ChromeDriver(options)
    While True:
        Try:
            browser.get(URL)
            Extract HTML table data from browser.page_source -> df
            Delay for 1.5 seconds
            If df is not empty:
                Remove columns: NO2, NO2.1, NO2.3
                Return df
            Else:
                Continue loop
        Catch Exception e:
            Print("Error occurred while crawling " + URL + ": " + str(e))
            Continue loop

Main Program:
    // Initialize settings and parameters
    Set base_url = 'https://www.aqistudy.cn/...'
    Set city = ['青岛']
    Set dates = ['202401', ..., '202412']

    // Initialize browser options
    Initialize options = new ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)

    // Initialize browser
    Set browser = new ChromeDriver(options)
    Inject script so that navigator.webdriver reports false

    // Main loop to crawl data
    For each date in dates:
        Construct current_url = base_url + city[0] + '&month=' + date
        Call spider(current_url) -> df
        Delay for 1.5 seconds
        Save df to CSV file 'data.csv' with mode append

    // Termination
    Close browser
    Print("Data crawling completed successfully")
\end{lstlisting}
Code:
# coding=utf-8
from selenium import webdriver
import pymysql
import pandas as pd
import time
import requests
import re
from bs4 import BeautifulSoup
from sqlalchemy.exc import IntegrityError


def spider(url):
    while True:
        try:
            browser.get(url)
            df = pd.read_html(browser.page_source, header=0)[0]  # return the first DataFrame on the page
            # print(browser.page_source)
            time.sleep(1.5)
            if not df.empty:
                # print(df)
                df.drop(['NO2', 'NO2.1', 'NO2.3'], axis=1, inplace=True)
                # print(df)
                # df.to_csv('data.csv', mode='a', index=None)
                return df
            else:
                continue
        except Exception as e:
            print(f"Error occurred while crawling {url}: {str(e)}")
            continue


if __name__ == '__main__':
    print('Starting data crawl...\n')
    url = 'https://www.aqistudy.cn/historydata/daydata.php?city=%E9%9D%92%E5%B2%9B&month=202401'
    base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
    # Declare the browser object
    option = webdriver.ChromeOptions()
    option.add_argument("start-maximized")
    # option.add_argument("--headless")
    option.add_argument("--disable-blink-features=AutomationControlled")
    option.add_experimental_option("excludeSwitches", ["enable-automation"])
    option.add_experimental_option("useAutomationExtension", False)
    browser = webdriver.Chrome(options=option)
    # Inject a script on every new document so navigator.webdriver reports false
    browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        'source': '''Object.defineProperty(navigator, 'webdriver', {
            get: () => false
        })'''
    })
    time.sleep(2)
    city = ['青岛']
    dates = ['202401', '202402', '202403', '202404', '202405', '202406',
             '202407', '202408', '202409', '202410', '202411', '202412']
    list_data = []
    list_row = []
    for date in dates:
        url = base_url + city[0] + '&month=' + date
        df = spider(url)
        time.sleep(1.5)
        df.to_csv('data.csv', mode='a', index=False)  # append this month's rows to data.csv
    browser.quit()  # release the browser instance
    print('All data crawling completed!\n')
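Note that each monthly DataFrame is appended to data.csv with to_csv(mode='a') and the default header=True, so the file ends up with one header line per month. Below is a minimal sketch of how the file might be cleaned when loading it back; the column names depend on the site's table, so the first column is used generically here.
\begin{lstlisting}[language=Python]
import pandas as pd

# Load the appended file; to_csv(mode='a') wrote a header line on every append,
# so the interleaved header rows have to be dropped after reading.
df = pd.read_csv('data.csv')

# Rows whose first cell repeats the column name are leftover header lines.
first_col = df.columns[0]
df = df[df[first_col] != first_col].reset_index(drop=True)

print(df.shape)
print(df.head())
\end{lstlisting}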
Pseudocode:
# Pseudocode: Air Quality Data Crawler Algorithm
1. Browser initialization and configuration
   1.1 Create the browser instance
   1.2 Set browser options:
       - Maximize the window
       - Disable the automation-detection flag
       - Exclude the automation extension
   1.3 Inject a JavaScript snippet to hide the webdriver fingerprint
2. Data crawling algorithm spider(url)
   2.1 Loop until the data is fetched successfully
   2.2 Visit the target URL
   2.3 Parse the page's HTML content
   2.4 Extract the table data into a DataFrame
   2.5 Validate the data:
       - Check whether the DataFrame is empty
       - Remove redundant columns (NO2, NO2.1, NO2.3)
   2.6 Return the processed data, or keep retrying
3. Main control flow
   3.1 Define the base URL template
   3.2 Initialize the city list
   3.3 Set the date range (each month of 2024)
   3.4 For each date:
       - Construct the full URL
       - Call spider() to fetch the data
       - Append the data to the CSV file
       - Add a city identifier column (see the sketch after this outline)
   3.5 Release resources:
       - Close the browser instance
   3.6 Print a crawl-completion status message
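Step 3.4 above mentions adding a city identifier column, which the script shown earlier does not actually do. A minimal sketch of how the main loop could be extended, assuming a hypothetical city column name:
\begin{lstlisting}[language=Python]
for date in dates:
    url = base_url + city[0] + '&month=' + date
    df = spider(url)
    time.sleep(1.5)
    df['city'] = city[0]  # hypothetical column name: tag every row with the crawled city
    df.to_csv('data.csv', mode='a', index=False)
\end{lstlisting}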