scrapy-pipeline

This post walks through setting up ImagesPipeline in a Scrapy project to download images, and configuring a MysqlPipeline to store the scraped data in MySQL. The image storage path and database connection parameters live in `settings.py`, while `pipelines.py` holds the download and persistence logic, so that images are saved successfully and data lands in the database correctly. Sketches of the assumed item fields and of the `ITEM_PIPELINES` registration follow the two examples.

pipeline

  • ImagePipeline (downloading images)

    # settings.py
    IMAGES_STORE = './images'
    
    # pipelines.py
    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline
    class ImagePipeline(ImagesPipeline):
    	# Receive the item generated by the spider, take its url and build a Request
    	def get_media_requests(self, item, info):
    		yield Request(item['url'])
    	
    	# Return the file name the downloaded image is saved under
    	def file_path(self, request, response=None, info=None):
    		url = request.url
    		file_name = url.split('/')[-1]
    		return file_name
    	
    	# Called when all media requests for the item have finished;
    	# results holds the download result for each request
    	def item_completed(self, results, item, info):
    		image_paths = [x['path'] for ok, x in results if ok]
    		if not image_paths:
    			raise DropItem('Image Download Failed')
    		return item
    
  • MysqlPipeline (MySQL database)

    # settings.py
    MYSQL_HOST = 'localhost'
    MYSQL_DATABASE = 'database'
    MYSQL_PORT = 3306
    MYSQL_USER = 'root'
    MYSQL_PASSWORD = 'root'
    
    # pipelines.py
    import pymysql
    
    class MysqlPipeline():
    	def __init__(self, host, database, user, password, port):
    		self.host = host
    		self.database = database
    		self.user = user
    		self.password = password
    		self.port = port
    	
    	# Fetch the MySQL-related parameters from settings.py
    	@classmethod
    	def from_crawler(cls, crawler):
    		return cls(
    			host = crawler.settings.get('MYSQL_HOST'),
    			database = crawler.settings.get('MYSQL_DATABASE'),
    			user = crawler.settings.get('MYSQL_USER'),
    			password = crawler.settings.get('MYSQL_PASSWORD'),
    			port = crawler.settings.get('MYSQL_PORT'),
    		)
    	
    	# Called automatically when the spider opens
    	def open_spider(self, spider):
    		# pymysql expects keyword arguments, and the charset name is 'utf8mb4', not 'utf-8'
    		self.db = pymysql.connect(host=self.host, user=self.user, password=self.password,
    		                          database=self.database, charset='utf8mb4', port=self.port)
    		self.cursor = self.db.cursor()
    	
    	# Called automatically when the spider closes
    	def close_spider(self, spider):
    		self.db.close()
    	
    	# process_item is mandatory; the pipeline calls it for every item by default
    	def process_item(self, item, spider):
    		data = dict(item)
    		keys = ','.join(data.keys())
    		values = ','.join(['%s'] * len(data))
    		# 'table_name' is a placeholder -- replace it with your actual table
    		sql = 'INSERT INTO table_name ({}) VALUES ({})'.format(keys, values)
    		self.cursor.execute(sql, tuple(data.values()))
    		self.db.commit()
    		return item
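
Both pipelines make assumptions about the item's shape: ImagePipeline reads `item['url']`, and MysqlPipeline inserts every field of the item as a column. A minimal item definition that satisfies both might look like the sketch below (the field names besides `url` are purely illustrative):

    # items.py -- hypothetical sketch; only 'url' is required by ImagePipeline
    import scrapy
    
    class ImageItem(scrapy.Item):
    	url = scrapy.Field()    # image URL consumed by get_media_requests
    	title = scrapy.Field()  # example extra field; MysqlPipeline stores it as a column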
    
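Neither pipeline takes effect until it is registered in `ITEM_PIPELINES`. A minimal sketch, assuming the project package is named `project` (adjust the module path to your layout):

    # settings.py
    ITEM_PIPELINES = {
    	'project.pipelines.ImagePipeline': 300,
    	'project.pipelines.MysqlPipeline': 400,
    }

Lower numbers run first, so placing ImagePipeline at 300 ahead of MysqlPipeline at 400 lets failed downloads be dropped via DropItem before anything is written to MySQL.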