scrapy-2: Some Scrapy components

Latest recommended article published on 2024-10-12 12:26:23

dyeee

Category: scrapy  Tags: python json shell

Article link: https://blog.csdn.net/dyeee/article/details/84817451

JD.com product pages

[root@localhost pytest]# cat jdspider.py
#!/usr/bin/env python
# coding=utf-8
import scrapy


class JdSpider(scrapy.Spider):
    name = 'jd'
    start_urls = ['http://list.jd.com/list.html?cat=737,794,798']

    def parse(self, response):
        # Follow every product link found on the listing page.
        for href in response.css('#plist .p-name a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_goods)

    def parse_goods(self, response):
        # Emit one item per product page: its title and its URL.
        # extract_first() returns None instead of raising IndexError
        # when the page has no .sku-name element.
        yield {
            'title': response.css('.sku-name::text').extract_first(),
            'link': response.url,
        }

Run

[root@localhost pytest]# scrapy runspider jdspider.py -o abc.csv

Result

[root@localhost pytest]# less abc.csv
link,title
http://item.jd.com/1927536.html,长虹（CHANGHONG）55U3C 55英寸双64位4K安卓智能LED液晶电视(黑色)
http://item.jd.com/1589946.html,创维（Skyworth）55M6 55英寸 4K超高清智能酷开网络液晶电视（黑色）
http://item.jd.com/1366436.html,飞利浦（PHILIPS）55PFL6840/T3 55英寸 流光溢彩 4K超高清智能电视（京东微联APP控制）
http://item.jd.com/1612016.html,创维（Skyworth）58M6 58英寸 4K超高清智能酷开网络液晶电视（黑色）

Components:

Selectors

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/selectors.html#topics-selectors

Using selectors

To explain how to use selectors, we will use the Scrapy shell (which provides interactive testing) and a sample page hosted on the Scrapy documentation server:

http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Here is its HTML source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>
 
   

First, open the shell:

scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html

Once the shell has loaded, the response is available as the shell variable response, with a selector bound to its response.selector attribute.

Because we are working with HTML, the selector automatically uses an HTML parser.

Looking at the HTML source of the page, we can build an XPath that selects the text inside the title tag:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

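Querying responses with XPath and CSS is so common that Scrapy provides two convenient shortcuts, response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

As you can see, .xpath() and .css() return an instance of the SelectorList class, which is a list of new selectors. This API makes it quick to extract nested data.

To extract the actual text data, call the .extract() method:

>>> response.xpath('//title/text()').extract()
[u'Example website']

Note that Scrapy's CSS selectors accept the pseudo-elements ::text and ::attr(name), which are Scrapy extensions rather than standard CSS, to select text or attribute nodes:

>>> response.css('title::text').extract()
[u'Example website']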

Now let's get the base URL and some image links:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']
 
   

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example:

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.extract()
[u'<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 u'<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 u'<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 u'<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 u'<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
...     print('Link number %d points to url %s and image %s' % args)

Link number 0 points to url [u'image1.html'] and image [u'image1_thumb.jpg']
Link number 1 points to url [u'image2.html'] and image [u'image2_thumb.jpg']
Link number 2 points to url [u'image3.html'] and image [u'image3_thumb.jpg']
Link number 3 points to url [u'image4.html'] and image [u'image4_thumb.jpg']
Link number 4 points to url [u'image5.html'] and image [u'image5_thumb.jpg']
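
For readers without a Scrapy shell at hand, the same nesting idea can be sketched with the standard library. This is only a stand-in: it uses xml.etree.ElementTree rather than Scrapy selectors, it works only because the sample markup happens to be well-formed XML, and the SAMPLE string and the link_pairs helper are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample page; it parses as XML because every tag is closed.
SAMPLE = """
<html>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
  </div>
 </body>
</html>
"""


def link_pairs(markup):
    """Return (href, img src) pairs, querying each <a> element in turn."""
    root = ET.fromstring(markup)
    pairs = []
    # Outer query: every <a> below the root.
    for link in root.findall('.//a'):
        # Inner lookups are relative to the current <a>, much like link.xpath(...).
        pairs.append((link.get('href'), link.find('img').get('src')))
    return pairs


for index, (href, src) in enumerate(link_pairs(SAMPLE)):
    print('Link number %d points to url %s and image %s' % (index, href, src))
```

The point to take away is that the inner lookups run against each link, not against the whole document.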
 
   

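Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. Unlike .xpath() and .css(), however, .re() returns a list of unicode strings, so you cannot chain nested .re() calls.

Here is an example that extracts the image names from the HTML code above:

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')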
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
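
What .re() does can be approximated with the standard re module: apply the pattern to each extracted string and collect the captured groups into one flat list. The sketch below is a stand-in under that assumption, not Scrapy's implementation. The texts list and the re_all helper are hypothetical, and the extra strip() call is a choice made for this example only.

```python
import re

# Text nodes as a selector's .extract() might return them (made-up sample).
texts = ['Name: My image 1 ', 'Name: My image 2 ', 'no match here']


def re_all(strings, pattern):
    """Apply pattern to every string and flatten the captured groups."""
    compiled = re.compile(pattern)
    results = []
    for s in strings:
        # findall returns the group contents when the pattern has one group.
        results.extend(m.strip() for m in compiled.findall(s))
    return results


names = re_all(texts, r'Name:\s*(.*)')
print(names)
# The result is a plain list of strings, so there is nothing left to query
# with .xpath() or .css().
```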
 
   

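Working with relative XPaths

Keep in mind that if you are nesting selectors and use an XPath that starts with /, that XPath is absolute to the document and not relative to the Selector you call it from.

For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all the <div> elements:

>>> divs = response.xpath('//div')

At first, you may be tempted to use the following approach, which is wrong, because it actually extracts all <p> elements from the whole document, not only those inside the <div> elements:

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())

This is the proper way to do it (note the dot prefixing the .//p XPath):

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.extract())

Another common case is extracting all direct <p> children:

>>> for p in divs.xpath('p'):
...     print(p.extract())

For more details about relative XPaths, see the Location Paths section of the XPath specification.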
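
The difference between the two relative forms can be tried out with the standard library. This is a rough sketch using xml.etree.ElementTree as a stand-in for Scrapy selectors; the DOC string is invented for the example. As far as I know ElementTree does not accept an absolute path on an element, so only the './/p' and 'p' forms are shown.

```python
import xml.etree.ElementTree as ET

# Made-up document: one <p> directly inside the <div>, one nested deeper,
# and one outside the <div> altogether.
DOC = """
<body>
  <p>outside</p>
  <div>
    <p>direct child</p>
    <section><p>nested</p></section>
  </div>
</body>
"""

root = ET.fromstring(DOC)
div = root.find('div')

# './/p' searches all descendants of the <div>, and only of the <div>.
descendants = [p.text for p in div.findall('.//p')]

# 'p' matches only the direct children of the <div>.
children = [p.text for p in div.findall('p')]

print(descendants)
print(children)
```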

Items

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/items.html#module-scrapy.item

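The main goal of scraping is to extract structured data from unstructured sources, typically web pages. Scrapy provides the Item class for this purpose.

Item objects are simple containers that hold the scraped data. They provide a dictionary-like API and a simple syntax for declaring their available fields.

Declaring Items

Items are declared using a simple class definition syntax and Field objects. For example:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)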
 
    

Item Pipeline

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/item-pipeline.html#item-pipeline

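After an Item has been collected by a Spider, it is sent to the Item Pipeline, where several components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and no longer processed.

Typical uses of item pipelines are:

Cleaning HTML data
Validating scraped data (checking that the items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped items in a database

Item pipeline examples

Price validation and dropping items with no price

Consider the following hypothetical pipeline, which adjusts the price attribute of items that do not include VAT (price_excludes_vat attribute) and drops items that have no price:

from scrapy.exceptions import DropItem

class PricePipeline(object):

    # Multiplier applied to prices that exclude VAT.
    vat_factor = 1.15

    def process_item(self, item, spider):
        # .get() avoids a KeyError when a field is missing entirely.
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)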
 
     

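Writing items to a JSON file

The following pipeline stores all scraped items (from all spiders) in a single items.jl file, one JSON-serialized item per line:

import json

class JsonWriterPipeline(object):

    def __init__(self):
        # Text mode: json.dumps() returns str, which a 'wb' file rejects on Python 3.
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Note

The purpose of JsonWriterPipeline is only to show how to write item pipelines. If you really want to store all scraped items in a single JSON file, use Feed exports instead.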
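
To see why the one-item-per-line layout is convenient, here is a small standard-library sketch that writes two made-up items in that layout and reads them back. It uses an in-memory buffer instead of items.jl, and the sample items are assumptions for illustration.

```python
import io
import json

items = [
    {'title': 'first', 'link': 'http://example.com/1'},
    {'title': 'second', 'link': 'http://example.com/2'},
]

# Write: one JSON document per line, the same layout the pipeline produces.
buffer = io.StringIO()
for item in items:
    buffer.write(json.dumps(item) + "\n")

# Read back: each line can be decoded independently of the others,
# so a consumer never has to load the whole file at once.
buffer.seek(0)
loaded = [json.loads(line) for line in buffer if line.strip()]

print(loaded == items)
```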

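Duplicates filter

A filter that looks for duplicate items and drops those that were already processed. Let's say that our items have a unique id, but our spider returns multiple items with the same id:

from scrapy.exceptions import DropItem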

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
 
     

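Activating an Item Pipeline component

To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer value assigned to each class determines the order in which they run: items pass through the pipelines from the lowest number to the highest. It is customary to define these numbers in the 0-1000 range.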
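
The ordering rule can be simulated without Scrapy. The sketch below is not how the Scrapy engine is implemented; it only mimics the behaviour described above. The local DropItem class, the PIPELINES mapping of instances, the run_pipelines helper and the sample items are all assumptions for illustration.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        raise DropItem("Missing price in %s" % item)


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item


# Same shape as ITEM_PIPELINES, but mapping instances instead of class paths.
dedupe = DuplicatesPipeline()
PIPELINES = {dedupe: 800, PricePipeline(): 300}


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through the pipelines, lowest number first."""
    ordered = sorted(pipelines, key=pipelines.get)
    kept = []
    for item in items:
        try:
            for pipeline in ordered:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue  # a dropped item skips the remaining pipelines
        kept.append(item)
    return kept


scraped = [
    {'id': 1, 'price': 100, 'price_excludes_vat': False},
    {'id': 1, 'price': 100, 'price_excludes_vat': False},  # duplicate id
    {'id': 2},                                             # no price
    {'id': 3, 'price': 200, 'price_excludes_vat': False},
]
kept = run_pipelines(scraped, PIPELINES)
print([item['id'] for item in kept])
```

Because the price check has the lower number, the item without a price is dropped before the duplicates filter ever sees its id.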

Feed exports

http://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/feed-exports.html#feed-exports

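New in version 0.10.

One of the most frequently required features when implementing scrapers is being able to store the scraped data properly, which quite often means generating an "export file" with the scraped data (commonly called an "export feed") to be consumed by other systems.

Scrapy provides this functionality out of the box with Feed exports, which support multiple serialization formats and storage backends.

Serialization formats

Feed exports use the Item exporters to serialize the scraped data. The formats supported out of the box are: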

JSON
JSON lines
CSV
XML

You can also extend the supported formats through the FEED_EXPORTERS setting.

JSON

FEED_FORMAT: json
Exporter used: JsonItemExporter
If you use JSON with large feeds, see the warning in the JsonItemExporter documentation

JSON lines

FEED_FORMAT: jsonlines
Exporter used: JsonLinesItemExporter

CSV

FEED_FORMAT: csv
Exporter used: CsvItemExporter

XML

FEED_FORMAT: xml
Exporter used: XmlItemExporter

Pickle

FEED_FORMAT: pickle
Exporter used: PickleItemExporter

Marshal

FEED_FORMAT: marshal
Exporter used: MarshalItemExporter
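
To make the difference between three of these formats concrete, here is a small sketch that serializes the same two made-up items as JSON, JSON lines and CSV. It uses only the standard json and csv modules, not Scrapy's exporters, so the exact whitespace and field order that Scrapy emits may differ.

```python
import csv
import io
import json

items = [
    {'title': 'first', 'link': 'http://example.com/1'},
    {'title': 'second', 'link': 'http://example.com/2'},
]

# JSON: the whole feed is a single array.
as_json = json.dumps(items)

# JSON lines: one object per line, no enclosing array.
as_jsonlines = "".join(json.dumps(item) + "\n" for item in items)

# CSV: a header row followed by one row per item.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=['link', 'title'], lineterminator="\n")
writer.writeheader()
writer.writerows(items)
as_csv = out.getvalue()

print(as_json)
print(as_jsonlines, end="")
print(as_csv, end="")
```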

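Storages

When using feed exports, you define where to store the feed with a URI (through the FEED_URI setting). Feed exports support multiple storage backend types, selected by the URI scheme.

The storage backends supported out of the box are:

Local filesystem
FTP
S3 (requires boto)
Standard output

Some storage backends may be unavailable if the required external libraries are not installed. For example, the S3 backend is only available if the boto library is installed.

Storage URI parameters

The storage URI can also contain parameters that are replaced when the feed is created:

%(time)s - replaced by a timestamp when the feed is created
%(name)s - replaced by the spider name

Any other named parameter is replaced by the spider attribute of the same name. For example, %(site_id)s is replaced by the spider.site_id attribute at the moment the feed is created.

Here are some examples to illustrate:

Store in FTP using one directory per spider:
ftp://user:password@ftp.example.com/scraping/feeds/%(name)s/%(time)s.json
Store in S3 using one directory per spider:
s3://mybucket/scraping/feeds/%(name)s/%(time)s.json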
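
The %(name)s syntax is ordinary Python %-formatting with named keys, so the substitution can be sketched in a few lines. This is an illustration of the idea rather than Scrapy's own code: the FakeSpider class, the expand_uri helper, the site_id value and the timestamp format are all assumptions.

```python
from datetime import datetime


class FakeSpider(object):
    """Hypothetical spider-like object with a custom site_id attribute."""
    name = 'jd'
    site_id = 42


def expand_uri(uri_template, spider, now=None):
    """Fill %(name)s, %(time)s and other spider attributes into a feed URI."""
    now = now or datetime.now()
    params = {}
    # Start with the spider's own public attributes, e.g. name and site_id.
    for attr in dir(spider):
        if not attr.startswith('_'):
            params[attr] = getattr(spider, attr)
    # The timestamp format is an assumption made for this sketch.
    params['time'] = now.strftime('%Y-%m-%dT%H-%M-%S')
    return uri_template % params


uri = expand_uri(
    's3://mybucket/scraping/feeds/%(name)s/%(site_id)s/%(time)s.json',
    FakeSpider(),
    now=datetime(2024, 1, 2, 3, 4, 5),
)
print(uri)
```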