Scrapy Crawler Tutorial Series 3: Storing Scraped Data in a Database and Validating It

Scraped items are sent to the Item Pipeline for processing.

Item Pipelines are commonly used for:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database


Writing your own item pipeline

Just write a Python class that implements the process_item(item, spider) method. The method must either return an Item (or any subclass of Item) object or raise a DropItem exception.
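A minimal skeleton looks like the following (the class name and the validity check are placeholders for illustration, not part of the original post):

from scrapy.exceptions import DropItem

class MyPipeline(object):

    def process_item(self, item, spider):
        # validate or transform the item here
        if not item.get('title'):   # hypothetical validity check
            raise DropItem("Missing title in %s" % item)
        return item                 # hand the item on to the next pipeline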

Price validation and dropping items with no prices

The following pipeline adjusts the price attribute for items whose price does not include VAT (indicated by the price_excludes_vat attribute), and drops items that contain no price at all:

from scrapy.exceptions import DropItem

class PricePipeline(object):

    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Writing items to a JSON file

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
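The example above never closes the file. Scrapy pipelines may also define open_spider and close_spider methods, which Scrapy calls when the spider starts and finishes; a variant using them (a sketch with the same behaviour, just tidier resource handling) could look like:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called once when the spider is opened
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider is closed
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item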

Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let's say our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
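To see the filter behave outside a crawl, you can feed it plain dicts by hand (a quick sketch; in a real crawl Scrapy calls process_item for you):

pipeline = DuplicatesPipeline()
for item in [{'id': 1}, {'id': 2}, {'id': 1}]:
    try:
        print(pipeline.process_item(item, spider=None))
    except DropItem as exc:
        print(exc)   # prints: Duplicate item found: {'id': 1}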

Activating an Item Pipeline component

Add the following to settings.py:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer values determine the order in which the pipelines run: items pass through the lower-valued pipelines first, and the values conventionally fall in the 0-1000 range.

Building on Scrapy Crawler Tutorial Series 2: Example Tutorial, let's add JSON output.

  • 1. First, write the pipeline:

import json

class TutorialPipeline(object):

    def __init__(self):
        self.file = open('output.json', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
  • 2. Then add the following to settings.py:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 400,
}

Finally, run scrapy crawl dmoz and output.json will be generated.
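Each scraped item ends up as one JSON object per line. Assuming the items from Series 2 carry title, link and desc fields (as in the standard dmoz tutorial), the output will look roughly like this (the values here are hypothetical):

{"title": "Python Resources", "link": "http://example.org/python", "desc": "A hypothetical entry"}
{"title": "Python Books", "link": "http://example.org/books", "desc": "Another hypothetical entry"}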

Storing data in a database

Open pipelines.py and enter the following:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy import log
from twisted.enterprise import adbapi

import MySQLdb
import MySQLdb.cursors

class TutorialPipeline(object):

    def __init__(self):
        # adbapi runs the blocking MySQLdb calls in a thread pool,
        # so they don't block Scrapy's event loop
        self.dbpool = adbapi.ConnectionPool('MySQLdb', db='scrapy',
                user='root', passwd='pwd',
                cursorclass=MySQLdb.cursors.DictCursor,
                charset='utf8',
                use_unicode=False)

    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(self.handle_error)
        return item

    def _conditional_insert(self, tx, item):
        # insert the item only if its title is not already stored
        tx.execute("select * from item where title = %s", (item['title'],))
        result = tx.fetchone()
        if result:
            log.msg("Item already stored in db: %s" % item, level=log.DEBUG)
        else:
            tx.execute("insert into item (title) values (%s)", (item['title'],))

    def handle_error(self, e):
        log.err(e)

Pay close attention to Python indentation here, or you will get errors.

Then add the following to settings.py:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 400,
}
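The pipeline expects a MySQL database named scrapy with an item table that has a title column. A minimal one-off setup script might look like this (the schema is an assumption matching the queries above; adjust as needed):

import MySQLdb

conn = MySQLdb.connect(user='root', passwd='pwd')
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS scrapy CHARACTER SET utf8")
cur.execute("""
    CREATE TABLE IF NOT EXISTS scrapy.item (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255)
    ) CHARACTER SET utf8
""")
conn.commit()
conn.close()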

Run scrapy crawl dmoz and you will see the data successfully inserted into the database.

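To double-check the result from Python rather than the mysql client, a quick query (a hypothetical check, reusing the connection settings assumed above) prints the stored titles:

import MySQLdb

conn = MySQLdb.connect(db='scrapy', user='root', passwd='pwd', charset='utf8')
cur = conn.cursor()
cur.execute("SELECT title FROM item LIMIT 5")
for (title,) in cur.fetchall():
    print(title)
conn.close()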

If you get the following error:

No module named MySQLdb

install the MySQL Python bindings with either of:

yum install MySQL-python
pip install mysql-python

Source code download: 艺搜

艺搜 References

http://doc.scrapy.org/en/latest/topics/item-pipeline.html

http://stackoverflow.com/questions/10845839/writing-items-to-a-mysql-database-in-scrapy

http://www.cnblogs.com/lchd/p/3820968.html
