This post records basic Scrapy usage; it does not cover the framework's internals.
Basic command-line syntax
Create a project: scrapy startproject xxx
Enter the project directory: cd xxx
Create a spider: scrapy genspider xxx (spider name) xxx.com (domain to crawl)
Export results to a file: scrapy crawl xxx -o xxx.json (writes the output in the given file format)
Run a spider: scrapy crawl xxx
List all spiders: scrapy list
Show settings: scrapy settings [options]
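For example, the -o flag picks the export format from the file extension, so (assuming the tianqi spider created in the project below) the same crawl can be saved as JSON or as CSV:
> scrapy crawl tianqi -o result.json
> scrapy crawl tianqi -o result.csv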
The following walks through these commands with an example project that scrapes weather data.
(1) Command-line input
> scrapy startproject weather
> cd weather
> scrapy genspider tianqi tianqihoubao.com
(2) Initial project structure and notes
File structure:
weather/
├── weather/
│   ├── spiders/
│   │   └── tianqi.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   └── settings.py
└── scrapy.cfg
Each file, with its key code:
tianqi.py:
import scrapy
from weather.items import WeatherItem

class TianqiSpider(scrapy.Spider):
    name = 'tianqi'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['http://www.tianqihoubao.com/']  # the spider's initial entry point

    def parse(self, response):  # parse the page
        infoSelector = response.xpath('')  # extract data with a selector: xpath() / css()
        info = infoSelector
        item = WeatherItem(info=info)
        yield item  # hand the item over to the pipelines

        # crawl the next page
        next_url = ''
        if next_url:
            yield scrapy.Request(url=next_url, callback=self.parse)  # schedule a crawl of the new url
items.py:
import scrapy

class WeatherItem(scrapy.Item):
    # wraps the data coming from tianqi.py; field names must match the keys
    # the spider passes into the item object
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
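To make that coupling concrete, here is a minimal sketch (using the info field this project declares later): a key can only be assigned on an item if it was declared as a Field, otherwise Scrapy raises a KeyError.
import scrapy

class WeatherItem(scrapy.Item):
    info = scrapy.Field()

item = WeatherItem()
item['info'] = {'name': 'Nanchang'}  # fine: 'info' is a declared Field
# item['city'] = 'Nanchang'          # KeyError: 'city' is not declared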
pipelines.py:
# persists the items
# the pipeline must also be activated in settings.py:
# ITEM_PIPELINES = {'weather.pipelines.WeatherPipeline': 300}
class WeatherPipeline(object):
    def process_item(self, item, spider):
        # data cleaning / file-writing code goes here
        return item
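Besides storage, process_item is also where items can be filtered: returning the item passes it on down the pipeline chain, while raising DropItem discards it. A minimal sketch (the ValidatePipeline name and its empty-info check are illustrative, not part of this project):
from scrapy.exceptions import DropItem

class ValidatePipeline(object):
    def process_item(self, item, spider):
        if not item.get('info'):  # hypothetical sanity check
            raise DropItem('item has no info field')
        return item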
(3) Full project source code:
Project goal: scrape roughly the past month of weather for each place in Jiangxi and save it in JSON format.
tianqi.py:
import scrapy
import re
from weather.items import WeatherItem

class TianqiSpider(scrapy.Spider):
    name = 'tianqi'
    allowed_domains = ['tianqihoubao.com']
    start_urls = ['http://www.tianqihoubao.com/weather/province.aspx?id=360000']  # entry point: the Jiangxi province page

    def parse(self, response):  # first-level parse: the province page, which links to every place
        names = response.xpath('//tr/td/a/text()').extract()  # selectors: xpath() / css()
        urls = response.xpath('//tr/td/a/@href').extract()
        for i in range(len(names)):
            place = names[i]  # place name; parse_detail re-reads it from the detail page
            next_url = 'http://www.tianqihoubao.com/weather/' + urls[i]
            if next_url:
                yield scrapy.Request(url=next_url, callback=self.parse_detail)  # crawl each place's page

    def parse_detail(self, response):  # second-level parse: one place's weather table
        place = response.xpath('//table/tr[3]/td[1]/b/text()').extract_first()
        print("================ start scraping weather for [", place, "] ================")
        tr = response.xpath('//table/tr')
        weatherList = []
        for i in range(2, len(tr)):  # skip the header rows
            wea = {}
            wea['date'] = tr[i].xpath('./td[2]/b/a/text()').extract_first()
            s = tr[i].xpath('./td[3]/text()').extract_first()
            pattern = re.compile(r'([\u4E00-\u9FA5]{1,2})')  # weather type: the first 1-2 Chinese characters
            try:
                match = pattern.search(s)
                wea['type'] = match.group(1)
            except Exception:
                print("!!! failed to parse weather type:", s, "(row", i, ")")
            wea['wind'] = tr[i].xpath('./td[4]/text()').extract_first()
            high_temp = tr[i].xpath('./td[5]/text()').extract_first()
            s = tr[i].xpath('./td[8]/text()').extract_first()
            pattern = re.compile(r'(-?\d*℃)')  # low temperature, e.g. -3℃
            low_temp = "-999℃"  # sentinel kept when parsing fails
            try:
                match = pattern.search(s)
                low_temp = match.group(1)
            except Exception:
                print("!!! failed to parse temperature:", s, "(row", i, ")")
            wea['temperature'] = low_temp + "-" + high_temp
            weatherList.append(wea)
        print("================ success! ================")
        item = WeatherItem()
        info = {}
        info['name'] = place
        info['weather'] = weatherList
        item['info'] = info
        yield item
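The two regular expressions above do the actual cleaning of the table cells, so it helps to see them in isolation. A quick standalone check (the sample strings are made up to resemble the site's cells, not taken from it):
import re

type_pat = re.compile(r'([\u4E00-\u9FA5]{1,2})')  # first 1-2 Chinese characters = weather type
temp_pat = re.compile(r'(-?\d*℃)')                # optionally signed number ending in ℃

print(type_pat.search(' 多云 /晴 ').group(1))  # -> 多云
print(temp_pat.search(' -3℃ ').group(1))       # -> -3℃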
items.py:
import scrapy

class WeatherItem(scrapy.Item):
    # wraps the data from tianqi.py; the field name must match the key
    # the spider assigns (item['info'] = ...)
    info = scrapy.Field()
pipelines.py:
import json

# persists the items (activated in settings.py via ITEM_PIPELINES)
class WeatherPipeline(object):
    def process_item(self, item, spider):
        # append each place's data to tianqi.json as a pretty-printed JSON document
        with open('tianqi.json', 'a+', encoding='utf-8') as fp:
            json.dump(item['info'], fp=fp, skipkeys=True, indent=4, ensure_ascii=False)
        return item
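One caveat with calling json.dump once per item in append mode: the resulting file contains several JSON documents back to back, which most JSON parsers cannot read in one pass. A common alternative, sketched here after the JsonWriterPipeline example in the Scrapy docs (the file name tianqi.jl is illustrative), opens the file once per crawl and writes one JSON Lines record per item:
import json

class JsonWriterPipeline(object):
    def open_spider(self, spider):
        # called once when the crawl starts: keep a single file handle open
        self.fp = open('tianqi.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the crawl ends
        self.fp.close()

    def process_item(self, item, spider):
        # one compact JSON document per line (the JSON Lines format)
        self.fp.write(json.dumps(item['info'], ensure_ascii=False) + '\n')
        return item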
Summary
All in all, Scrapy is fairly easy to pick up: once the project is created, only three files need code written into them. The one genuinely tricky part is writing the selectors (for an XPath tutorial, see https://www.w3school.com.cn/xpath/index.asp).
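A practical way to take the sting out of selector writing is scrapy shell, which fetches a page and opens an interactive session with response already bound, so XPath expressions can be tried out before they go into the spider (the URL below is this project's start page):
> scrapy shell "http://www.tianqihoubao.com/weather/province.aspx?id=360000"
>>> response.xpath('//tr/td/a/text()').extract()
>>> response.xpath('//tr/td/a/@href').extract()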