1. Introduction to pipelines
After a spider has scraped data, the data is sent to the pipeline. A pipeline is a class that implements a few fixed methods; these methods decide what happens to the extracted items.
What a pipeline is used for:
- cleaning HTML data (a minimal sketch follows this list)
- validating the scraped data
- checking for and dropping duplicate data
- storing data in a database (or a file, a cache, etc.)
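As an illustration of the first point, here is a minimal sketch of such a pipeline. It is not part of the original example code, and the field name `title` is only a hypothetical field used for demonstration:

```python
import re

from itemadapter import ItemAdapter


class CleanHtmlPipeline:
    """Minimal sketch: strip HTML tags and extra whitespace from one field."""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        # 'title' is a hypothetical field name used only for this example
        if adapter.get('title'):
            text = re.sub(r'<[^>]+>', '', adapter['title'])  # drop HTML tags
            adapter['title'] = ' '.join(text.split())        # collapse whitespace
        return item
```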
2. Pipeline structure

| Attribute / method | Arguments | Description |
|---|---|---|
| process_item | self, item, spider | Processes the data; every pipeline must implement it. It must return an item object or a Deferred, or raise a DropItem exception. |
| open_spider | self, spider | Called when the spider opens; optional. |
| close_spider | self, spider | Called when the spider closes, used to release resources; optional, usually paired with open_spider. |
| from_crawler | cls, crawler | Class method; provides access to the other core Scrapy components and hooks the pipeline's functionality into Scrapy. |
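The table maps directly onto a class skeleton like the one below. It only shows the method signatures and where each one is typically used; `SOME_SETTING` is a placeholder setting name for illustration:

```python
class SkeletonPipeline:

    def __init__(self, some_value):
        self.some_value = some_value

    @classmethod
    def from_crawler(cls, crawler):
        # read configuration from settings.py via the crawler object
        return cls(some_value=crawler.settings.get('SOME_SETTING'))

    def open_spider(self, spider):
        # acquire resources (open files, database connections, ...)
        pass

    def close_spider(self, spider):
        # release the resources acquired in open_spider
        pass

    def process_item(self, item, spider):
        # must return an item (or a Deferred), or raise DropItem
        return item
```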
3. Validating and checking data
3.1. Validating and dropping items that do not meet requirements
```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class PricePipeline:

    vat_factor = 1.15

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter.get('price'):
            if adapter.get('price_excludes_vat'):
                adapter['price'] = adapter['price'] * self.vat_factor
            return item
        else:
            raise DropItem(f"Missing price in {item}")
```
If the item has no price value, it is dropped; if price_excludes_vat is set, the price is multiplied by the VAT factor before the item is returned.
3.2. Checking for and dropping duplicate data
```python
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class DuplicatesPipeline:

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        if adapter['id'] in self.ids_seen:
            raise DropItem(f"Duplicate item found: {item!r}")
        else:
            self.ids_seen.add(adapter['id'])
            return item
```
If the item's unique identifier id has already been seen, the item is considered a duplicate and dropped.
4. Storing data
4.1. Writing to a file
The following example writes all items to a single file, one item per line serialized as JSON:
```python
import json

from itemadapter import ItemAdapter


class JsonWriterPipeline:

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
```
Note: this example only demonstrates how to write a pipeline. If you actually want to store items as JSON in a file, you should use Feed exports.
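For reference, a feed export of this kind can be configured entirely in settings.py (available since Scrapy 2.1); the file name items.jl below is just an example value:

```python
# settings.py -- export all scraped items as JSON Lines via feed exports
FEEDS = {
    'items.jl': {
        'format': 'jsonlines',
        'encoding': 'utf8',
    },
}
```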
4.2. Writing to a database
The following example stores the items in a MongoDB database:
```python
import pymongo

from itemadapter import ItemAdapter


class MongoPipeline:

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(ItemAdapter(item).asdict())
        return item
```
For details on MongoDB itself, refer to its own documentation.
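The from_crawler method above reads two settings, MONGO_URI and MONGO_DATABASE, which can be declared in settings.py; the connection string below is a placeholder:

```python
# settings.py -- values read by MongoPipeline.from_crawler
MONGO_URI = 'mongodb://localhost:27017'   # placeholder connection string
MONGO_DATABASE = 'items'                  # defaults to 'items' if omitted
```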
5. Activating pipelines
Enable a pipeline by uncommenting (or adding) the ITEM_PIPELINES setting in settings.py, for example:
```python
ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
```
- myproject: the project name
- PricePipeline: the name of the pipeline class defined in pipelines.py
- 300: the priority; the lower the number, the earlier the pipeline runs; values are customarily in the 0-1000 range
6. Example
Building on the example from the previous chapter, we now want to persist the constructed items to MySQL. The pipeline code is as follows:
```python
import pymysql


class CsdnMysqlPipeline:

    def __init__(self, host, port, db, user, password):
        self.host = host
        self.port = port
        self.db = db
        self.user = user
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST', '127.0.0.1'),
            port=crawler.settings.get('MYSQL_PORT', 3306),
            db=crawler.settings.get('MYSQL_DB'),
            user=crawler.settings.get('MYSQL_USER', 'root'),
            password=crawler.settings.get('MYSQL_PASSWORD', 'root')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            db=self.db,
            user=self.user,
            password=self.password
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # default the 'unlike' count to 0 when the spider did not populate it
        if 'unlike' not in item:
            item['unlike'] = 0
        keys = item.keys()
        values = list(item.values())
        # build the INSERT dynamically from the item's fields;
        # column names are backtick-quoted so keywords such as `comment` are safe
        sql = 'insert into blog({}) values ({})'.format(
            ','.join('`{}`'.format(k) for k in keys),
            ','.join(['%s'] * len(values))
        )
        # sql = 'insert into blog(title, publish, approval, unlike, `comment`, collection) values (%s, %s, %s, %s, %s, %s)'
        # debug: print the fully interpolated statement that will be sent to MySQL
        print(self.cur.mogrify(sql, values))
        self.cur.execute(sql, values)
        self.conn.commit()
        return item
```
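For this pipeline to run, it has to be enabled and the MYSQL_* settings it reads must be defined in settings.py. The sketch below uses placeholder values, assumes a `blog` table with the columns listed in the commented-out INSERT already exists, and uses 'myproject' to stand in for the actual project name:

```python
# settings.py -- placeholder values; adjust to your own project and MySQL instance
ITEM_PIPELINES = {
    'myproject.pipelines.CsdnMysqlPipeline': 300,
}
MYSQL_HOST = '127.0.0.1'
MYSQL_PORT = 3306
MYSQL_DB = 'csdn'        # hypothetical database name; the `blog` table must exist in it
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
```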
The link to the complete code is at the end of the article; below is a screenshot of the items stored in the database:
7. Summary
Reference video:
- https://www.bilibili.com/video/BV1R7411F7JV p561~563
Code repository: https://gitee.com/gaogzhen/python-study.git
QQ group: 433529853