A crawler for the air-quality data of every city listed on the site, including city name, AQI, and related fields, saved in .csv format.
Home page: https://www.aqistudy.cn/historydata/index.php
The first module collects the city names, i.e. it builds the list of all searchable "city" strings from the page:
import requests
from lxml import etree
import time
from urllib import parse
import pandas as pd
from selenium import webdriver
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.204 Safari/537.36'
}
url = "https://www.aqistudy.cn/historydata/"
response = requests.get(url, headers=headers)
text = response.content.decode('utf-8')
html = etree.HTML(text)
city_set = list()
citys = html.xpath("//div[@class='all']/div/ul")
for city in citys:
    messages = city.xpath(".//li")
    for message in messages:
        city_name = message.xpath(".//a/text()")
        city_name = "".join(city_name)
        # print(city_name)
        city_set.append(city_name)
print(len(city_set))  # number of cities available to crawl
print(city_set)  # print the full list of scraped cities
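Since the same `<ul>` blocks can repeat entries, a small sketch like the following can de-duplicate the scraped list while keeping the page order (the helper name `dedupe_keep_order` and the sample input are my own, for illustration only):

```python
# Sketch: drop empty strings and repeats from the scraped city list,
# preserving the order in which cities first appear on the page.
def dedupe_keep_order(names):
    seen = set()
    result = []
    for name in names:
        name = name.strip()
        if name and name not in seen:  # skip blanks and duplicates
            seen.add(name)
            result.append(name)
    return result

print(dedupe_keep_order(['北京', '上海', '北京', '']))  # ['北京', '上海']
```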
Next, decide which months of which years should be crawled for every city:
def get_month_set():
    month_set = list()
    for i in range(1, 10):
        month_set.append('2018-0%s' % i)  # months 2018-01 .. 2018-09
    for i in range(10, 13):
        month_set.append('2018-%s' % i)  # months 2018-10 .. 2018-12
    return month_set  # each page shows one month of data, so looping month by month is enough
month_set = get_month_set()
month_set.reverse()
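As an alternative sketch, the two hand-written loops can be replaced with pandas, which is already imported for the crawler; monthly `Period` objects stringify as 'YYYY-MM':

```python
import pandas as pd

# Build the same 'YYYY-MM' strings with period_range instead of two loops.
month_set = [str(p) for p in pd.period_range('2018-01', '2018-12', freq='M')]
month_set.reverse()  # newest month first, matching the reverse() above
print(month_set[0], month_set[-1])  # 2018-12 2018-01
```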
The last module does the actual scraping: it fetches the data for every chosen city and month;
driver = webdriver.PhantomJS(r'E:\phantomjs-2.1.1-windows\bin\phantomjs.exe')
base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='
file_name = 'AQI_2018.csv'
fp = open(file_name, 'w', encoding='utf-8-sig')
fp.write('%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n'%('city','date','AQI','grade','PM25','PM10','SO2','CO','NO2','O3_8h'))  # header row
for ct in range(0, len(city_set)):
    for i in range(len(month_set)):
        str_month = month_set[i]
        weburl = '%s%s&month=%s' % (base_url, parse.quote(city_set[ct]), str_month)
        driver.get(weburl)
        time.sleep(1)
        dfs = pd.read_html(driver.page_source, header=0)[0]
        time.sleep(1)  # give the page time to render, otherwise nothing is scraped
        if len(dfs) != 0:
            for j in range(0, len(dfs)):
                date = dfs.iloc[j, 0]
                aqi = dfs.iloc[j, 1]
                grade = dfs.iloc[j, 2]
                pm25 = dfs.iloc[j, 3]
                pm10 = dfs.iloc[j, 4]
                so2 = dfs.iloc[j, 5]
                co = dfs.iloc[j, 6]
                no2 = dfs.iloc[j, 7]
                o3 = dfs.iloc[j, 8]
                fp.write('%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n' % (city_set[ct], date, aqi, grade, pm25, pm10, so2, co, no2, o3))
            print('yes---%s,%s---DONE' % (city_set[ct], str_month))
            localtime = time.asctime(time.localtime(time.time()))
            print("time :", localtime)
        else:
            print('%s,%s--error' % (city_set[ct], str_month))
            localtime = time.asctime(time.localtime(time.time()))
            print("time :", localtime)
fp.close()
driver.quit()
print("All done, thanks!")
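As a side note, the hand-formatted '%s,...' rows above can also be written with the standard csv module, which additionally quotes any field that happens to contain a comma. A minimal sketch (the file name and the sample row are hypothetical, for illustration only):

```python
import csv

# Sketch: write the same header plus one made-up sample row via csv.writer.
with open('AQI_2018_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['city', 'date', 'AQI', 'grade', 'PM25',
                     'PM10', 'SO2', 'CO', 'NO2', 'O3_8h'])
    # hypothetical sample row, for illustration only:
    writer.writerow(['北京', '2018-01-01', 57, '良', 40, 73, 9, 1.0, 51, 32])
```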
Chained together, the three modules form the complete crawler for nationwide city air quality.
phantomjs.exe (search for it online) and the Python packages have to be downloaded separately.
Most of the code is adapted from others' work; on that basis I made some optimizations and added features, so the progress information shown is more complete.
If the crawl appears to hit anti-scraping measures (the run stalls halfway), try switching networks.
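For the "run stalls halfway" symptom, a retry wrapper with backoff is one possible mitigation. A sketch, where `fetch` stands in for whatever loads and parses one page (for example a small wrapper around driver.get plus pd.read_html); the helper name and parameters are my own:

```python
import time

# Sketch: retry a flaky page fetch a few times, backing off a bit
# longer after each failed attempt before giving up.
def fetch_with_retry(fetch, url, tries=3, wait=5):
    for attempt in range(1, tries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print('attempt %d on %s failed: %s' % (attempt, url, exc))
            time.sleep(wait * attempt)  # back off a bit longer each time
    raise RuntimeError('giving up on %s' % url)
```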
References:
https://zhuanlan.zhihu.com/p/132496133 (scraping the city list from the page)
https://blog.csdn.net/jancydc/article/details/107511400 (looping over cities to fetch AQI)
https://blog.csdn.net/weixin_40651515/article/details/84592530 (main body of the crawler / AQI retrieval)