How to Write Your Own Item Pipeline
After an item has been collected by a Spider, it is passed to the Item Pipeline, where a series of components process it in a defined order.
Typical applications of an item pipeline include:
- Cleaning HTML data
- Validating scraped data (checking that an item contains certain fields)
- Checking for (and dropping) duplicates
- Saving scraped results to a database
Writing your own item pipeline is simple: each item pipeline component is a standalone Python class that must implement the following method:
process_item(self, item, spider)
- This method is called for every item pipeline component. It must return an item object (or any subclass) or raise a DropItem exception; dropped items are no longer processed by subsequent pipeline components.
- Parameters: item (Item object) - the item that was scraped
  spider (Spider object) - the spider that scraped the item
In addition, the following methods may also be implemented:
open_spider(self, spider)
- Called when the spider is opened.
- Parameters: spider (Spider object) - the spider that was opened
close_spider(self, spider)
- Called when the spider is closed.
- Parameters: spider (Spider object) - the spider that was closed
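As a concrete illustration of process_item and DropItem, here is a minimal validation pipeline (a sketch added for this writeup, not part of the original project; the class name is hypothetical) that drops items missing a required field:

from scrapy.exceptions import DropItem


class RequireTitlePipeline(object):
    """Minimal validation pipeline: drop items that have no title."""

    def process_item(self, item, spider):
        if not item.get('title'):
            # Dropped items are not passed on to later pipeline components
            raise DropItem('Missing title in %s' % item)
        return item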
The next snippet, excerpted from the MongoDB pipeline shown in full later, demonstrates the from_crawler hook used to read values from the project settings:

class SaveToMongoPipeline(object):
    def __init__(self, mongo_url, db_name):
        self.mongo_url = mongo_url
        self.db_name = db_name
        self.client = None
        self.db = None

    # A classmethod can invoke the class constructor through cls. Define this
    # method whenever you need the crawler: once you have the crawler you can
    # reach everything in the project (e.g. its settings). This design pattern
    # is known as dependency injection.
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('MONGO_URL'),
                   crawler.settings.get('MONGO_DB'))
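When the crawler starts, Scrapy notices that from_crawler is defined and instantiates the pipeline through it, passing in the running Crawler object; crawler.settings then exposes everything configured in settings.py, which is why MONGO_URL and MONGO_DB can be read here instead of being hard-coded.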
The concrete code for the spider project follows.
The project file tree is as follows (a plausible layout inferred from the module paths used in the code below):
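image360/
├── scrapy.cfg
└── image360/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── image.py
        └── tobao.py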
Scraping 360 Images
- Set the following parameters in settings.py:
# Path where downloaded files are stored
IMAGES_STORE = './resources'
ITEM_PIPELINES = {
    'image360.pipelines.SaveImagePipeline': 300,
}
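The SaveImagePipeline registered above is not listed in the original post. A minimal sketch built on Scrapy's built-in ImagesPipeline, assuming the standard pattern from the Scrapy documentation was followed, might look like this:

from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class SaveImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Hand the image URL on the item to Scrapy's media-download machinery
        yield Request(url=item['url'])

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per request above
        image_paths = [data['path'] for ok, data in results if ok]
        if not image_paths:
            raise DropItem('Image download failed')
        return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # Store the file under IMAGES_STORE, named by the last URL segment
        return request.url.split('/')[-1]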
- Add the following code to items.py:
import scrapy


class BeautyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    tag = scrapy.Field()
    width = scrapy.Field()
    height = scrapy.Field()
    url = scrapy.Field()
- Add the following code to the image.py spider file:
from json import loads
from urllib.parse import urlencode

import scrapy

from image360.items import BeautyItem


class ImageSpider(scrapy.Spider):
    name = 'image'
    allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/']  # superseded by start_requests below

    # Overrides the parent-class method; builds the initial requests
    def start_requests(self):
        # Base URL of the JSON API
        base_url = 'http://image.so.com/zj?'
        # Dictionary of request parameters
        param = {'ch': 'beauty', 'listtype': 'new', 'temp': 1}
        # Build the paging parameter
        for page in range(10):
            # When sn is a multiple of 30, the API moves on to the next page
            param['sn'] = page * 30
            # urlencode(param) turns the dict into a standard query string,
            # e.g. ch=beauty&listtype=new&temp=1
            full_url = base_url + urlencode(param)
            # scrapy.Request(url=...) calls back to self.parse by default,
            # equivalent to: yield scrapy.Request(url=full_url, callback=self.parse)
            yield scrapy.Request(url=full_url)

    def parse(self, response):
        model_dict = loads(response.text)
        for elem in model_dict['list']:
            item = BeautyItem()
            item['title'] = elem['group_title']
            item['tag'] = elem['tag']
            item['width'] = elem['cover_width']
            item['height'] = elem['cover_height']
            item['url'] = elem['qhimg_url']
            yield item
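The spider can then be started from the project root with Scrapy's command-line tool:

scrapy crawl image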
Scraping Taobao Search Data (with Selenium)
- Configure settings.py as follows (the numbers are order values from 0 to 1000; pipelines with smaller numbers run earlier):

MONGO_URL = 'mongodb://116.196.107.20:27017'
MONGO_DB = 'image360'
DOWNLOADER_MIDDLEWARES = {
    'image360.middlewares.TaobaoDownloaderMiddleware': 544,
}
ITEM_PIPELINES = {
    'image360.pipelines.SaveToMongoPipeline': 301,
}
- The pipelines.py code is as follows:
from pymongo import MongoClient


class SaveToMongoPipeline(object):
    def __init__(self, mongo_url, db_name):
        self.mongo_url = mongo_url
        self.db_name = db_name
        self.client = None
        self.db = None

    def process_item(self, item, spider):
        coll = self.db['taobao']
        title = item['title']
        price = item['price']
        deal = item['deal']
        coll.insert_one({
            'title': title,
            'deal': deal,
            'price': price
        })
        return item

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_url)
        # Databases and collections can be addressed with [] indexing
        self.db = self.client[self.db_name]

    def close_spider(self, spider):
        self.client.close()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('MONGO_URL'),
                   crawler.settings.get('MONGO_DB'))
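Once the spider has run, the stored documents can be verified with a quick pymongo snippet (a usage sketch, reusing the connection settings above):

from pymongo import MongoClient

client = MongoClient('mongodb://116.196.107.20:27017')
# Print a few of the stored Taobao records
for doc in client['image360']['taobao'].find().limit(3):
    print(doc['title'], doc['price'], doc['deal'])
client.close()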
- Configure items.py as follows:
import scrapy


class TobaoItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    deal = scrapy.Field()
- Add the following middleware to middlewares.py and register it in settings.py (as shown above):
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException


class TaobaoDownloaderMiddleware(object):
    def __init__(self, timeout=None):
        self.timeout = timeout
        # Launching the browser in __init__ avoids reopening it on every request
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1000, 600)
        self.browser.set_page_load_timeout(self.timeout)

    def __del__(self):
        self.browser.close()

    def process_request(self, request, spider):
        # Handle the request: fetch the dynamically rendered content with Selenium
        try:
            # Note:
            # 1. HtmlResponse is a subclass of Response, so it can be returned here.
            # 2. process_request may only return None, a Request, or a Response,
            #    or raise IgnoreRequest.
            self.browser.get(request.url)
            return HtmlResponse(url=request.url, body=self.browser.page_source,
                                request=request, encoding='utf-8', status=200)
        except TimeoutException:
            # Selenium raises TimeoutException (not the built-in TimeoutError)
            # when page loading exceeds the configured timeout
            return HtmlResponse(url=request.url, status=500, request=request)

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    # Dependency injection, as described above
    @classmethod
    def from_crawler(cls, crawler):
        return cls(timeout=10)
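The middleware above returns page_source as soon as get() returns, which can race against the JavaScript that renders the result list. A common refinement (an addition, not part of the original code) is to wait explicitly for the container that the spider's XPath targets:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Inside process_request, after self.browser.get(request.url):
WebDriverWait(self.browser, self.timeout).until(
    EC.presence_of_element_located((By.ID, 'mainsrp-itemlist')))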
- Set up the tobao.py spider file as follows:
import re
from io import StringIO
from urllib.parse import urlencode

import scrapy

from image360.items import TobaoItem


class TobaoSpider(scrapy.Spider):
    name = 'tobao'
    allowed_domains = ['www.taobao.com']

    def start_requests(self):
        base_url = 'https://s.taobao.com/search?'
        params = {}
        for keyword in ['ipad', 'iphone', '小米手机']:
            params['q'] = keyword
            for page in range(10):
                # s is the result offset (44 results per page)
                params['s'] = page * 44
                full_url = base_url + urlencode(params)
                yield scrapy.Request(url=full_url, callback=self.parse,
                                     dont_filter=True)

    # After parsing, the spider yields items that flow into the pipelines
    # for filtering or persistence
    def parse(self, response):
        # Requests from start_requests pass through the downloader middleware
        # above; when it returns a Response object, parsing continues here
        goods_list = response.xpath('//*[@id="mainsrp-itemlist"]/div/div/div[1]/div')
        for goods in goods_list:
            item = TobaoItem()
            data_title = goods.xpath('div[2]/div[2]/a[1]/text()').extract()
            item['price'] = goods.xpath('div[2]/div[1]/div[1]/strong/text()').extract_first()
            item['deal'] = goods.xpath('div[2]/div[1]/div[2]/text()').extract_first()
            title = StringIO()
            for data in data_title:
                # Strip whitespace with a regex and write the pieces into a
                # StringIO (a mutable buffer), which performs much better than
                # repeated string concatenation
                title.write(re.sub(r'\s', '', data))
            item['title'] = title.getvalue()
            yield item
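As with the image spider, run it from the project root:

scrapy crawl tobao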