Learning Scrapy 1

  • ipython is an enhanced Python shell with syntax highlighting, tab completion, built-in helper functions, and more.
    pip install ipython
  • XPath positions start at 1, not 0; e.g. the predicate ...[1] selects the first match.
  • Limit the number of items scraped: scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90

UR2IM process

The basic scraping workflow: UR2IM (URL, Request, Response, Items, More URLs)

  • URL
    scrapy shell is Scrapy's interactive console, used to quickly test selectors and requests.
    Start it with scrapy shell 'http://scrapy.org'
    It returns a set of objects that can be manipulated through IPython:
$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x101ade4a8>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1028b09e8>
[s]   spider     <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
  • Request and response
    Operate on the response object; for example, print the first 50 bytes of the body:
    >>> response.body[:50]

  • The item
    Extract the data from the response and put it into the corresponding item fields, using XPath.

A typical page has elements such as a logo, search boxes, and buttons.
What we actually need is the specific information: name, phone number, and so on.
Locate the element, then extract it (copy its XPath from the browser and simplify it).

Using
response.xpath('//h1/text()').extract()
extracts the text of all h1 elements on the page.

//h1/text() selects only the text content, not the h1 tags themselves.
Here we assume there is a single h1 element; a site should ideally have only one h1 per page for SEO (Search Engine Optimization) reasons.

If the page element is <h1 itemprop="name" class="space-mbs">...</h1>,
it can also be extracted with //*[@itemprop="name"][1]/text().
Remember that XPath positions start at 1, not 0.
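
As a quick sketch in scrapy shell (the example text is illustrative, not actual output): extract() returns a list of all matches, while extract_first(), available since Scrapy 1.0, returns only the first match or None.
>>> response.xpath('//h1/text()').extract()
[u'Example Property Title']
>>> response.xpath('//*[@itemprop="name"][1]/text()').extract_first()
u'Example Property Title'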

CSS selectors

response.css('.ad-price')
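
.ad-price is just the example class from the book; as a sketch, the ::text pseudo-element and .re() work on CSS selections the same way as on XPath selections:
>>> response.css('.ad-price')                       # list of matching selectors
>>> response.css('.ad-price::text').extract()       # only the text nodes
>>> response.css('.ad-price::text').re('[.0-9]+')   # numbers only, via a regex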

The primary fields we want and the XPath expressions that select them:

Primary fields | XPath expression
title | //*[@itemprop="name"][1]/text()
price | //*[@itemprop="price"][1]/text()
description | //*[@itemprop="description"][1]/text()
address | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls | //*[@itemprop="image"][1]/@src
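
These expressions can be sanity-checked in scrapy shell before writing the spider; a minimal sketch, assuming the book's demo server is running at http://web:9312:
$ scrapy shell http://web:9312/properties/property_000000.html
>>> response.xpath('//*[@itemprop="name"][1]/text()').extract()
>>> response.xpath('//*[@itemprop="price"][1]/text()').re('[.0-9]+')
>>> response.xpath('//*[@itemprop="image"][1]/@src').extract()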

A Scrapy Project

scrapy startproject properties
Directory structure:

├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Item planning

Decide which data you need; not everything has to be used, and you can feel free to add fields later.

from scrapy.item import Item, Field


class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
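
As a quick usage sketch (the value is made up), a Scrapy Item behaves like a dict that only accepts the declared fields:
>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = 'Example title'
>>> item
{'title': 'Example title'}
>>> item['color'] = 'red'   # raises KeyError: field not declared on the item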

Writing the spider

Create a new spider with scrapy genspider mydomain mydomain.com
The default template:

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass

After editing, it looks like this:
start_urls holds the target URLs.
self exposes the spider's built-in helpers; the log() method prints values, e.g.
self.log(response.xpath('//@src').extract())

import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())

Start the crawl from the project directory with scrapy crawl basic.
Alternatively, use scrapy parse,
which fetches the given URL and processes it with the spider that matches it.
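
For example (a sketch using the book's demo URL), the spider can be selected explicitly:
scrapy parse --spider=basic http://web:9312/properties/property_000000.html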

填充item

在爬虫basic.py中,导入item
导入from properties.items import PropertiesItem
把item各项接收对应返回

item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()

The complete spider:

import scrapy
from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        return item

Saving to a file

When running the spider, save the scraped items to a file by specifying the format and path:
scrapy crawl basic -o items.json   (JSON format)
scrapy crawl basic -o items.xml    (XML format)
scrapy crawl basic -o items.csv    (CSV format)
scrapy crawl basic -o "ftp://user:pass@ftp.scrapybook.com/items.jl"   (JSON Lines format, written to FTP)
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"   (JSON format, written to S3)

Using an ItemLoader to simplify parse

ItemLoader(item=..., response=...) takes an item and the response; fields are then populated with XPath expressions:

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)

        l.add_xpath('title', '//*[@itemprop="name"][1]/text()')

There are also several built-in processors:
Join() merges multiple extracted values into one.
MapCompose() applies Python functions to each value:
MapCompose(unicode.strip) removes leading and trailing whitespace characters.
MapCompose(unicode.strip, unicode.title) same as above, but also title-cases the result.
MapCompose(float) converts strings to numbers.
MapCompose(lambda i: i.replace(',', ''), float) converts strings to numbers, ignoring possible ',' characters.
MapCompose(lambda i: urlparse.urljoin(response.url, i)) turns relative paths into absolute URLs.
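
A minimal sketch of how these processors behave on their own (Python 2, matching the book's code; the sample strings are made up):
from scrapy.loader.processors import MapCompose, Join

MapCompose(unicode.strip)([u'  hello  ', u' world '])
# [u'hello', u'world']
MapCompose(unicode.strip, unicode.title)([u'  nice family home  '])
# [u'Nice Family Home']
MapCompose(lambda i: i.replace(',', ''), float)([u'1,234.56'])
# [1234.56]
Join()([u'first line', u'second line'])
# u'first line second line'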

add_value() adds a single literal value to an item field:

def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)

    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"]'
                '[1]/text()', MapCompose(unicode.strip), Join())
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))

    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())

The full spider now looks like this:

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.
        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)
        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"]'
                    '[1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"]'
                    '[1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"]'
                    '[1]/@src', MapCompose(
                        lambda i: urlparse.urljoin(response.url, i)))
        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        return l.load_item()
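
The docstring above uses Scrapy contracts (@url, @returns, @scrapes); they can be verified with:
scrapy check basic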

Multiple URLs

When the listing spans multiple pages,
the URLs can be entered manually one by one:

 start_urls = (
       'http://web:9312/properties/property_000000.html',
       'http://web:9312/properties/property_000001.html',
       'http://web:9312/properties/property_000002.html',
)

Or put the URLs in a file and read them in:

start_urls = [i.strip() for i in
   open('todo.urls.txt').readlines()]

There are two crawling directions:
- Horizontal: from an index page to the next index page, where the layout stays basically the same.
- Vertical: from an index page to a specific item page, where the layout changes, e.g. from a listing page to a product detail page.

urlparse.urljoin(base, url) is the Python way to join two URLs.
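
For instance (a sketch with the book's demo URLs):
>>> import urlparse
>>> urlparse.urljoin('http://web:9312/properties/index_00000.html',
...                  'property_000000.html')
'http://web:9312/properties/property_000000.html'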

Find the set of URLs for horizontal crawling:

urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']

Then combine them with urljoin:

[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... /property_000029.html']
Horizontal and vertical crawling

Obtain both the index-page URLs and the item URLs;
they are just different sets of URLs, extracted separately and then combined.

def parse(self, response):
    # Get the next index page URLs (horizontal crawling)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get the item page URLs (vertical crawling)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
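
Note that this snippet assumes Request has already been imported (from scrapy.http import Request) and that parse_item contains the ItemLoader code shown earlier.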

scrapy genspider -t crawl webname web.org
generates a spider based on the CrawlSpider template:

...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']
    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...

In the rules, a Rule without a callback simply follows the links it extracts, while a Rule with callback='parse_item' hands the matched pages to parse_item instead:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
)
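
Putting it together, a sketch of the full CrawlSpider version (the start URL is the book's demo index page; parse_item would hold the ItemLoader code from the basic spider):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://web:9312/properties/index_00000.html']

    rules = (
        # Follow "next page" links (horizontal crawling)
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # Hand item pages to parse_item (vertical crawling)
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        pass  # same ItemLoader code as in the basic spider's parse()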