Python Web Crawling Notes (Crawling with Scrapy)

One of the challenges of writing web crawlers is that you’re often performing the same
tasks again and again: find all links on a page, evaluate the difference between internal and
external links, go to new pages. These basic patterns are useful to know about and to be
able to write from scratch, but there are options if you want something else to handle the
details for you.
Scrapy is a Python library that handles much of the complexity of finding and evaluating
links on a website, crawling domains or lists of domains with ease. Unfortunately, Scrapy
has not yet been released for Python 3.x, though it is compatible with Python 2.7.
The good news is that multiple versions of Python (e.g., Python 2.7 and 3.4) usually work
well when installed on the same machine. If you want to use Scrapy for a project but also
want to use various other Python 3.4 scripts, you shouldn’t have a problem doing both.
The Scrapy website offers the tool for download, as well as instructions
for installing Scrapy with third-party installation managers such as pip. Keep in mind that
you will need to install Scrapy using Python 2.7 (it is not compatible with 2.6 or 3.x) and
run all programs that use Scrapy with Python 2.7 as well.

Installing Scrapy

sudo pip install scrapy
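
To confirm the install worked, a quick sanity check is to ask Scrapy to report its version (the exact version string will vary by install):

$ scrapy version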

Creating a Scrapy project

$ scrapy startproject wikiSpider

This creates the following directory structure:

scrapy.cfg
wikiSpider/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py

These files are:

scrapy.cfg: the project configuration file
wikiSpider/: the project's Python module; your code will be imported from here
wikiSpider/items.py: the project's items file
wikiSpider/pipelines.py: the project's pipelines file
wikiSpider/settings.py: the project's settings file
wikiSpider/spiders/: the directory where your spiders live

Defining an Item

Items are containers that will hold the scraped data. They work like Python dictionaries, but they provide extra protection, such as rejecting attempts to fill in undeclared fields, which guards against typos.

An Item is declared by creating a class that subclasses scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM). We model the data we want to collect as an Item; for example, to capture the name, URL, and description of a site such as dmoz.org, we would define a field for each of those three attributes. To do this, edit the items.py file in the wikiSpider directory; our Item class will look like this:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
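
To make the "dictionary with protection" behavior concrete, here is a minimal sketch (assuming the Article item defined above) showing that a declared field can be assigned like a dictionary key, while an undeclared field raises a KeyError:

from wikiSpider.items import Article

article = Article()
article['title'] = 'Python (programming language)'  # declared field: works like a dict
print(article['title'])

try:
    article['url'] = 'http://en.wikipedia.org/wiki/Python'  # 'url' was never declared
except KeyError as err:
    print(err)  # Scrapy rejects fields that are not defined on the Item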

Each Scrapy Item object represents a single page on the website. Obviously, you can
define as many fields as you'd like (url, content, header image, etc.), but for now I'm
simply collecting the title field from each page.
In a new file named articleSpider.py, created inside the spiders directory, write the following:

from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item

Run this from the project directory:

$ scrapy crawl article

The scraper goes to the two pages listed as the start_urls, gathers information, and then
terminates. Not much of a crawler, but using Scrapy in this way can be useful if you have
a list of URLs you need to scrape. To turn it into a fully fledged crawler, you need to define
a set of rules that Scrapy can use to seek out new URLs on each page it encounters:

from scrapy.contrib.spiders import CrawlSpider, Rule
from wikiSpider.items import Article
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),),
                  callback="parse_item", follow=True)]

    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item

This crawler is run from the command line in the same way as the previous one, but it will
not terminate (at least not for a very, very long time) until you halt execution using Ctrl+C
or by closing the terminal.
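
If you want a crawl like this to stop on its own while you experiment, one option (a sketch relying on Scrapy's built-in CloseSpider extension) is to cap the number of pages from the command line:

$ scrapy crawl article -s CLOSESPIDER_PAGECOUNT=100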

Logging with Scrapy

The debug information generated by Scrapy can be useful, but it is often too verbose. You can easily adjust
the level of logging by adding a line to the settings.py file in your Scrapy project:

LOG_LEVEL = 'ERROR'

There are five levels of logging in Scrapy, listed in order here:

CRITICAL
ERROR
WARNING
INFO
DEBUG

If logging is set to ERROR, only CRITICAL and ERROR logs will be displayed. If logging is set to DEBUG, all logs
will be displayed, and so on.
To output logs to a separate logfile instead of the terminal, simply define a logfile when running from the
command line:

$ scrapy crawl article -s LOG_FILE=wiki.log

This will create a new logfile, if one does not exist, in your current directory and output all logs and print
statements to it.
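
Alternatively, a small sketch assuming the same settings mechanism used for LOG_LEVEL above: you can set the logfile once in settings.py instead of passing -s on every run:

LOG_FILE = 'wiki.log'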

Scrapy uses the Item objects to determine which pieces of information it should save from
the pages it visits. This information can be saved by Scrapy in a variety of formats, such as
CSV, JSON, or XML files, using the following commands:

$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml
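
If you need more control than these built-in exports give you, the pipelines.py file created by startproject is the usual place to hook in. Below is a minimal sketch (my own illustration, not from the original notes) of an item pipeline that strips whitespace from the scraped title; the class name and priority value are assumptions for this example, and the pipeline must be enabled via ITEM_PIPELINES in settings.py:

# wikiSpider/pipelines.py -- illustrative sketch only
class WikispiderPipeline(object):
    def process_item(self, item, spider):
        # normalize the scraped title before it is exported
        item['title'] = item['title'].strip()
        return item

# and in settings.py (assumed class path and priority):
# ITEM_PIPELINES = {'wikiSpider.pipelines.WikispiderPipeline': 300}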