Approach:
1. Target: the air-quality online monitoring platform https://www.aqistudy.cn/;
2. Analyze the site and locate the historical-data entry point https://www.aqistudy.cn/historydata/ — the front page lists the cities, the second level aggregates data by month, and the third level holds the daily data;
3. Use a CrawlSpider to collect the monthly URLs;
4. Parse the data with XPath and save it to CSV and to a Redis database.
Problem encountered:
Except for the front page, the pages are rendered with JavaScript, so the required data cannot be parsed directly out of the raw response.
Solution:
Rewrite the downloader middleware to send the GET request through Selenium's webdriver and return the rendered page.
I. Preparation
Create a Scrapy project:
scrapy startproject AQI
Create the spider file:
scrapy genspider -t crawl aqi aqistudy.cn
II. Building the framework
(1) items.py / define the item
import scrapy

# Fields for the city name, date, AQI and the pollutant readings
class AirItem(scrapy.Item):
    Title = scrapy.Field()
    Day = scrapy.Field()
    AQI = scrapy.Field()
    Quality_level = scrapy.Field()
    PM_2_5 = scrapy.Field()
    PM_10 = scrapy.Field()
    SO2 = scrapy.Field()
    CO = scrapy.Field()
    NO2 = scrapy.Field()
    O3_8H = scrapy.Field()
(2) spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from AQI.items import AirItem

class AirSpider(CrawlSpider):
    name = 'aqi'
    allowed_domains = ['aqistudy.cn']
    start_urls = ['https://www.aqistudy.cn/historydata/']
    # Example: fetch Wuhan's 2019 data. Rule 1 follows the Wuhan city URL,
    # rule 2 extracts the 2019 monthly URLs.
    rules = (
        # With no callback, follow defaults to True
        Rule(LinkExtractor(allow=r'city=武汉$')),
        Rule(LinkExtractor(allow=r'month=2019'), callback='parse_item', follow=False),
    )
    def parse_item(self, response):
        # Strip the fixed prefix/suffix from the page title, keeping the city name
        title = response.xpath('//*[@id="title"]/text()').extract_first()[8:-11]
        air_list = response.xpath('//tbody/tr')
        for air in air_list[1:]:
            # Create a fresh item per row so each yield carries its own data
            item = AirItem()
            item['Title'] = title
            item['Day'] = air.xpath('./td[1]/text()').extract_first()
            item['AQI'] = air.xpath('./td[2]/text()').extract_first()
            item['Quality_level'] = air.xpath('./td[3]/span/text()').extract_first()
            item['PM_2_5'] = air.xpath('./td[4]/text()').extract_first()
            item['PM_10'] = air.xpath('./td[5]/text()').extract_first()
            item['SO2'] = air.xpath('./td[6]/text()').extract_first()
            item['CO'] = air.xpath('./td[7]/text()').extract_first()
            item['NO2'] = air.xpath('./td[8]/text()').extract_first()
            item['O3_8H'] = air.xpath('./td[9]/text()').extract_first()
            print(item['Day'])
            yield item
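The two allow patterns in the rules above are plain regular expressions matched against candidate link URLs. The quick check below uses illustrative URL shapes (the exact query formats are assumptions, not captured from the site); note also that LinkExtractor canonicalizes URLs and may percent-encode non-ASCII query values, which is worth verifying if a rule with Chinese characters fails to match.

```python
import re

# The two Rule patterns from the spider above
city_rule = re.compile(r'city=武汉$')
month_rule = re.compile(r'month=2019')

# Hypothetical link shapes for the city page and one monthly page
city_url = 'https://www.aqistudy.cn/historydata/monthdata.php?city=武汉'
month_url = 'https://www.aqistudy.cn/historydata/daydata.php?city=武汉&month=201901'

assert city_rule.search(city_url) is not None    # rule 1 follows the city page
assert month_rule.search(month_url) is not None  # rule 2 hands 2019 pages to parse_item
assert month_rule.search('daydata.php?city=武汉&month=201812') is None  # other years skipped
```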
(3) middlewares.py
Rewrite the downloader middleware to send requests through webdriver:
import time

import scrapy
from selenium import webdriver

class ChromeMiddleware(object):
    # Steps for handling a request: take the url, drive the browser,
    # grab the rendered page, and return a response object Scrapy understands
    def process_request(self, request, spider):
        # 1. The requested url
        url = request.url
        # webdriver is slow, so use an if to skip pages that don't need it,
        # such as the front page, which renders without JS
        if url != 'https://www.aqistudy.cn/historydata/':
            # 2. Create the driver
            driver = webdriver.Chrome()
            # 3. Send the GET request through the driver
            driver.get(url)
            # Allow 3 seconds for the page's JS to finish loading
            time.sleep(3)
            # 4. Grab the rendered page
            data = driver.page_source
            # 5. Close the browser
            driver.quit()
            # 6. Return the response object the Scrapy framework expects.
            # In scrapy.http, HtmlResponse inherits from TextResponse, whose
            # default encoding is ASCII, so set it to utf-8 explicitly.
            # The base Response takes (url, status=200, headers=None, body=b'', flags=None, request=None);
            # page_source is a str, so it must be encoded to bytes for the body.
            return scrapy.http.HtmlResponse(
                url=url,
                status=200,
                body=data.encode('utf-8'),
                encoding='utf-8'
            )
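Step 6 hinges on the body type: driver.page_source is a str, while Response bodies must be bytes. A minimal stand-in for the conversion, with no browser required (the HTML string is invented):

```python
# Stand-in for driver.page_source: a str, possibly containing Chinese text
data = '<html><head><title>空气质量指数</title></head></html>'

# Encode with the same charset passed to HtmlResponse via encoding='utf-8',
# so Scrapy decodes the body back to the identical text
body = data.encode('utf-8')

assert isinstance(body, bytes)
assert body.decode('utf-8') == data  # round-trips losslessly
```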
(4) pipelines.py
from scrapy.exporters import CsvItemExporter
import redis
import json

# Pipeline 1: export to CSV
class AirPipeline(object):
    def open_spider(self, spider):
        self.file = open('airdata.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
# Pipeline 2: store in Redis
class RedisPipeline(object):
    def open_spider(self, spider):
        self.redis = redis.StrictRedis(host='127.0.0.1', port=6379)
        self.redis_key = 'spider_air'

    def process_item(self, item, spider):
        # Serialize the scraped item to a JSON string before pushing it
        self.redis.lpush(self.redis_key, json.dumps(dict(item)))
        return item
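What lpush actually stores is the JSON string produced by json.dumps(dict(item)). A round-trip sketch, with a plain dict standing in for a scraped AirItem (the field values are made up):

```python
import json

# A plain dict standing in for one scraped AirItem
item = {
    'Title': '武汉', 'Day': '2019-01-01', 'AQI': '55',
    'Quality_level': '良', 'PM_2_5': '39',
}

# What the pipeline pushes onto the 'spider_air' list
payload = json.dumps(dict(item), ensure_ascii=False)

# Reading a list entry back restores the original mapping
assert json.loads(payload) == item
```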
(5) settings.py
It is best to enable the relevant settings as soon as each piece of code is written, so nothing gets forgotten:
BOT_NAME = 'AQI'
SPIDER_MODULES = ['AQI.spiders']
NEWSPIDER_MODULE = 'AQI.spiders'
LOG_FILE = 'AQI.log'
LOG_LEVEL = 'WARNING'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DOWNLOADER_MIDDLEWARES = {
    'AQI.middlewares.ChromeMiddleware': 543,
}
ITEM_PIPELINES = {
    'AQI.pipelines.AirPipeline': 300,
    'AQI.pipelines.RedisPipeline': 400,
}
III. Running the spider
(1) Start the Redis server, then the client:
redis-server
redis-cli
(2) Run the spider:
scrapy crawl aqi
IV. With the steps above complete, just wait for the crawl to finish and check the data.
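To sanity-check the CSV side afterwards, the standard csv module can read the exported rows back. The sample below imitates the header-plus-rows layout CsvItemExporter writes (the values and column order are invented for illustration):

```python
import csv
import io

# Imitation of airdata.csv: CsvItemExporter writes a header row of
# field names followed by one row per exported item
sample = 'Title,Day,AQI,Quality_level\r\n武汉,2019-01-01,55,良\r\n'

rows = list(csv.DictReader(io.StringIO(sample)))
assert rows[0]['Day'] == '2019-01-01'
assert rows[0]['AQI'] == '55'
```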