Real-time data crawling with Scrapy (Python). Scrapy is used here because the data is rendered directly in the page, i.e. the body returned by request.content already contains the needed data, so a Scrapy spider can fetch it directly.
For near-real-time collection, the script is deployed on a server and scheduled with crontab -e.
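For the scheduling step, a crontab entry along these lines would run the spider every hour; the project path, scrapy binary path, and spider name below are placeholders, not values from the original.

```shell
# Added via `crontab -e`: run the spider at the top of every hour
# (project path and spider name are illustrative placeholders)
0 * * * * cd /path/to/project && /usr/local/bin/scrapy crawl spider_name >> /tmp/crawl.log 2>&1
```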
scrapy startproject project_name
cd project_name
# The spider name must not be the same as the project name
scrapy genspider spider_name site_domain
Run the spider:
scrapy crawl spider_name
This generates the standard Scrapy file structure.
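For reference, the layout scrapy startproject produces looks roughly like this (the project name here is an assumption inferred from the Sxair2Item/Sxair2Pipeline class names):

```
sxair2/
    scrapy.cfg            # deploy configuration
    sxair2/
        __init__.py
        items.py          # item field definitions
        middlewares.py
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            xian.py       # the spider (created by genspider)
```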
__init__.py needs no changes.
Next come the project files, which process the returned data.
Start with items.py: it declares a field for each variable, following the commented template in the generated file.
Next, settings.py needs a few changes.
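The original does not list the exact edits; the usual ones for a project like this are along these lines (the module path sxair2 is an assumption inferred from the class names):

```python
# settings.py -- the handful of lines that typically need changing
ROBOTSTXT_OBEY = False   # the JSON endpoint is fetched directly
DOWNLOAD_DELAY = 1       # be polite: roughly one request per second
# Enable the pipeline defined in pipelines.py (lower number = runs earlier)
ITEM_PIPELINES = {
    'sxair2.pipelines.Sxair2Pipeline': 300,
}
```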
The middlewares in middlewares.py can be left untouched.
Then edit the spider file itself: parse() instantiates the item and extracts the needed records from the response (the body here is JSON, so json.loads is used rather than XPath), then yields them to the pipeline for processing.
import json
import scrapy

from ..items import Sxair2Item


class XianSpider(scrapy.Spider):
    name = ''  # spider name (left blank in the original)

    # Builds and yields one request per station
    def start_requests(self):
        headers = {
            'User-Agent': '',
            'Connection': '',
            'Cookie': ''
        }
        # Further station IDs are omitted here
        site = [610100, ...]
        for i in site:
            url = '...cityCode={}'.format(i)
            # The response is delivered to the parse() callback
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        html = response.text
        item = Sxair2Item()
        # json.loads turns the crawled text into Python objects
        if json.loads(html):
            for data in json.loads(html):
                item['TIMEPOINT'] = data['TIMEPOINT']
                item['AREA'] = data['AREA']
                item['POSITIONNAME'] = data['POSITIONNAME']
                item['STATIONCODE'] = data['STATIONCODE']
                item['SO2'] = data['SO2']
                item['NO2'] = data['NO2']
                item['CO'] = data['CO']
                item['O3'] = data['O3']
                item['PM10'] = data['PM10']
                item['PM2_5'] = data['PM2_5']
                item['AQI'] = data['AQI']
                item['PRIMARYPOLLUTANT'] = data['PRIMARYPOLLUTANT']
                item['QUALITY'] = data['QUALITY']
                item['ATT'] = data['ATT']
                yield item
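The heart of parse() is just json.loads on the response body. A standalone sketch of the same loop, using a made-up two-record payload with a subset of the fields for brevity:

```python
import json

# A made-up response body in the shape the spider expects:
# a JSON array of per-station readings (sample values, not real data)
html = '''[
    {"AREA": "Xi'an", "STATIONCODE": "610100", "AQI": "76", "QUALITY": "Good"},
    {"AREA": "Baoji", "STATIONCODE": "610300", "AQI": "52", "QUALITY": "Good"}
]'''

rows = []
if json.loads(html):                 # skip empty responses
    for data in json.loads(html):    # each element is one station's reading
        rows.append((data['STATIONCODE'], data['AQI'], data['QUALITY']))

print(rows)  # one tuple per station
```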
pipelines.py
# Define your item pipelines here
# Remember to enable this pipeline via ITEM_PIPELINES in settings.py
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
class Sxair2Pipeline(object):
    def process_item(self, item, spider):
        fields = ['TIMEPOINT', 'AREA', 'POSITIONNAME', 'STATIONCODE',
                  'SO2', 'NO2', 'CO', 'O3', 'PM10', 'PM2_5', 'AQI',
                  'PRIMARYPOLLUTANT', 'QUALITY', 'ATT']
        with open('shanxi2.csv', 'a', encoding='utf-8') as f:
            # Turn the item's fields into one comma-separated row
            txt = ','.join(str(item.get(k)) for k in fields) + '\n'
            f.write(txt)
            print('row written')
        return item
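One design note: joining fields by hand breaks if a value ever contains a comma, while the stdlib csv module handles quoting automatically. A sketch of the same write step (a plain dict with sample values stands in for the Scrapy item, and io.StringIO stands in for the file):

```python
import csv
import io

# Shortened field list and made-up reading, for the sketch only
FIELDS = ['TIMEPOINT', 'AREA', 'AQI', 'QUALITY']
item = {'TIMEPOINT': '2020-01-01 08:00', 'AREA': "Xi'an", 'AQI': '76', 'QUALITY': 'Good'}

buf = io.StringIO()              # stands in for open('shanxi2.csv', 'a')
writer = csv.writer(buf)
# writerow quotes any field containing commas or quotes for us
writer.writerow([item.get(k, '') for k in FIELDS])
print(buf.getvalue())
```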
That completes crawling a page whose response is JSON data with Scrapy.