System environment:
Anaconda 3, Windows 10 64-bit, Python 3.7
Python packages:
import pymongo
import pymysql
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
1. Pipeline for saving data to MongoDB:
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one() replaces the deprecated insert() in recent pymongo versions
        self.db[item.collection].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
Code explanation:
In this code, the crawler argument of from_crawler is used to read the project-wide settings defined in settings.py, such as the MongoDB address MONGO_URI and the database name MONGO_DB. open_spider sets up the MongoDB connection when the spider starts, and process_item then saves each item.
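For reference, a minimal sketch of the matching settings.py entries, together with an Item carrying the collection attribute that process_item reads; the values and module path below are placeholder assumptions, not part of the original project:

# settings.py (hypothetical values)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'scrapy_demo'
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,  # assumed module path
}

# items.py — the pipeline reads item.collection, so the Item must define it
import scrapy

class DemoItem(scrapy.Item):
    collection = 'demo'  # target MongoDB collection (hypothetical name)
    name = scrapy.Field()
    url = scrapy.Field()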
2. Pipeline for saving data to MySQL:
class MysqlPipeline(object):
    def __init__(self, host, database, user, password, port):
        self.host = host
        self.database = database
        self.user = user
        self.password = password
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('MYSQL_HOST'),
            database=crawler.settings.get('MYSQL_DATABASE'),
            user=crawler.settings.get('MYSQL_USER'),
            password=crawler.settings.get('MYSQL_PASSWORD'),
            port=crawler.settings.getint('MYSQL_PORT'),
        )

    def open_spider(self, spider):
        # pymysql 1.0+ requires keyword arguments for connect()
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.password, database=self.database,
                                  charset='utf8', port=self.port)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        data = dict(item)
        keys = ', '.join(data.keys())
        values = ', '.join(['%s'] * len(data))
        # Table and column names come from the item; the values are parameterized
        sql = 'insert into %s (%s) values (%s)' % (item.table, keys, values)
        self.cursor.execute(sql, tuple(data.values()))
        self.db.commit()
        return item
Code explanation:
As above, the crawler argument of from_crawler reads the project-wide settings defined in settings.py, such as the MySQL server address MYSQL_HOST and the database name MYSQL_DATABASE. open_spider sets up the MySQL connection when the spider starts, and process_item then saves each item.
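Again for reference, a sketch of the matching settings.py entries (all values are placeholder assumptions); the target table must already exist, and the Item needs a table attribute analogous to collection above:

# settings.py (hypothetical values)
MYSQL_HOST = 'localhost'
MYSQL_DATABASE = 'scrapy_demo'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'secret'
MYSQL_PORT = 3306

For an Item with fields name and url and table = 'demo', process_item assembles:

insert into demo (name, url) values (%s, %s)

and cursor.execute() fills the two %s placeholders with the item's values, so the values themselves are never string-formatted into the SQL.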
3. Pipeline for downloading images:
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # item is passed by newer Scrapy versions; accepted here for compatibility.
        # Name the file after the last segment of the image URL.
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        # results is a list of (success, result) tuples, one per request
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Download Failed.')
        return item

    def get_media_requests(self, item, info):
        yield Request(item['url'])
Code explanation:
In this code, get_media_requests yields the download Request for each image, file_path determines the file name the image is saved under, and item_completed collects the paths of the successfully saved images, dropping the item if none succeeded. Beforehand, the folder where images are saved must be set in the settings file:
IMAGES_STORE = './images'
Note: the name IMAGES_STORE must not be changed to anything else; it is the setting name the framework expects by default.
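Putting the three pipelines together, a sketch of the full registration in settings.py (the module path and priority numbers are assumptions; lower numbers run earlier):

# settings.py (hypothetical module path)
IMAGES_STORE = './images'
ITEM_PIPELINES = {
    'myproject.pipelines.ImagePipeline': 300,
    'myproject.pipelines.MongoPipeline': 301,
    'myproject.pipelines.MysqlPipeline': 302,
}

Placing ImagePipeline first means an item whose image download fails is dropped by DropItem before it ever reaches the two database pipelines.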