One of the challenges of writing web crawlers is that you’re often performing the same
tasks again and again: find all links on a page, evaluate the difference between internal and
external links, go to new pages. These basic patterns are useful to know about and to be
able to write from scratch, but there are options if you want something else to handle the
details for you.
Scrapy is a Python library that handles much of the complexity of finding and evaluating
links on a website, crawling domains or lists of domains with ease. Unfortunately, Scrapy
has not yet been released for Python 3.x, though it is compatible with Python 2.7.
The good news is that multiple versions of Python (e.g., Python 2.7 and 3.4) usually work
well when installed on the same machine. If you want to use Scrapy for a project but also
want to use various other Python 3.4 scripts, you shouldn’t have a problem doing both.
The Scrapy website offers the tool for download, as well as instructions for installing it
with third-party package managers such as pip. Keep in mind that you will need to install
Scrapy using Python 2.7 (it is not compatible with 2.6 or 3.x) and run all programs that use
Scrapy with Python 2.7 as well.
Install Scrapy:
sudo pip install scrapy
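If you have more than one Python version installed, make sure the pip you call belongs to
Python 2.7. One way to keep things tidy (a sketch that assumes the virtualenv package is
available) is to give Scrapy its own Python 2.7 environment:
$ virtualenv -p python2.7 scrapyenv
$ source scrapyenv/bin/activate
(scrapyenv)$ pip install scrapy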
Create a new Scrapy project:
$ scrapy startproject wikiSpider
This creates the following directory structure:
scrapy.cfg
wikiSpider/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
These files are, briefly:
scrapy.cfg: the project configuration file
wikiSpider/: the project's Python module; your code will be imported from here later
wikiSpider/items.py: the project's items file
wikiSpider/pipelines.py: the project's pipelines file
wikiSpider/settings.py: the project's settings file
wikiSpider/spiders/: the directory where your spiders live
Defining an Item
Items are containers for the data you scrape. They work much like Python dictionaries, but
add extra protection, such as raising an error when you try to fill in a field that was never
declared, which guards against typos.
An item is declared by creating a class that inherits from scrapy.Item and defining its
attributes as scrapy.Field objects, much as you would in an object-relational mapping (ORM).
By modeling the data you need as an item, you control exactly what gets collected from each
page. In this example we want only the title of each article, so we define a single field for
that attribute by editing the items.py file in the wikiSpider directory. Our Item class will
look like this:
# -*- coding: utf-8 -*-
#
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

class Article(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
Each Scrapy Item object represents a single page on the website. Obviously, you can define
as many fields as you'd like (url, content, header image, etc.), but I'm simply collecting
the title field from each page, for now.
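Because Item behaves like a dictionary, you can experiment with the Article class in an
interactive Python session. The minimal sketch below (session output shown as comments)
illustrates the protection against undefined fields mentioned earlier:
from wikiSpider.items import Article

article = Article()
article['title'] = "Python (programming language)"
print(article['title'])                # Python (programming language)
article['url'] = "http://example.com"  # raises KeyError: url was never declared as a field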
Next, create a file called articleSpider.py inside the spiders directory and write the following:
from scrapy.selector import Selector
from scrapy import Spider
from wikiSpider.items import Article

class ArticleSpider(Spider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page",
                  "http://en.wikipedia.org/wiki/Python_%28programming_language%29"]

    def parse(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
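The parse method relies on an XPath expression, //h1/text(), to grab the page's top-level
heading. If you want to see what that selector returns without running a crawl, you can feed
Scrapy's Selector a snippet of HTML directly; a minimal sketch (the HTML here is made up):
from scrapy.selector import Selector

html = '<html><body><h1>Python (programming language)</h1></body></html>'
title = Selector(text=html).xpath('//h1/text()')[0].extract()
print(title)  # Python (programming language)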
Run the spider from inside the project directory:
$ scrapy crawl article
The scraper goes to the two pages listed as the start_urls, gathers information, and then
terminates. Not much of a crawler, but using Scrapy in this way can be useful if you have
a list of URLs you need to scrape. To turn it into a fully fledged crawler, you need to define a
set of rules that Scrapy can use to seek out new URLs on each page it encounters:
from scrapy.contrib.spiders import CrawlSpider, Rule
from wikiSpider.items import Article
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Python_%28programming_language%29"]
    rules = [Rule(SgmlLinkExtractor(allow=('(/wiki/)((?!:).)*$'),),
                  callback="parse_item", follow=True)]

    def parse_item(self, response):
        item = Article()
        title = response.xpath('//h1/text()')[0].extract()
        print("Title is: " + title)
        item['title'] = title
        return item
This crawler is run from the command line in the same way as the previous one, but it will
not terminate (at least not for a very, very long time) until you halt execution using Ctrl+C
or by closing the terminal.
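The regular expression passed to SgmlLinkExtractor's allow argument, (/wiki/)((?!:).)*$, is
what keeps this crawler on ordinary article pages: it accepts /wiki/ URLs whose remainder
contains no colon, which filters out namespace pages such as Talk: or File: pages. A quick
sketch with Python's re module (the URLs are made up) shows the effect:
import re

pattern = re.compile('(/wiki/)((?!:).)*$')
print(bool(pattern.search('/wiki/Monty_Python')))       # True: an ordinary article link
print(bool(pattern.search('/wiki/Talk:Monty_Python')))  # False: colon marks a namespace page
print(bool(pattern.search('/wiki/File:Logo.png')))      # False: file pages are skipped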
Logging with Scrapy
The debug information generated by Scrapy can be useful, but it is often too verbose. You can easily adjust
the level of logging by adding a line to the settings.py file in your Scrapy project:
LOG_LEVEL = 'ERROR'
There are five levels of logging in Scrapy, listed in order here:
CRITICAL
ERROR
WARNING
INFO
DEBUG
If logging is set to ERROR, only CRITICAL and ERROR logs will be displayed. If logging is set to DEBUG, all logs
will be displayed, and so on.
To output logs to a separate logfile instead of the terminal, simply define a logfile when running from the
command line:
scrapy crawl article -s LOG_FILE=wiki.log
This will create a new logfile, if one does not exist, in your current directory and output all logs and print
statements to it.
Scrapy uses the Item objects to determine which pieces of information it should save from
the pages it visits. This information can be saved by Scrapy in a variety of ways, such as
CSV, JSON, or XML files, using the following commands:
$ scrapy crawl article -o articles.csv -t csv
$ scrapy crawl article -o articles.json -t json
$ scrapy crawl article -o articles.xml -t xml
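The JSON export, for example, contains one object per Item the spider returns, keyed by the
fields you defined. For the Article item above, articles.json should look roughly like this
(the exact titles depend on the pages crawled):
[
    {"title": "Main Page"},
    {"title": "Python (programming language)"}
]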