Scrapy Learning Notes (Part 1)

Installing Scrapy with pip

Command: pip install Scrapy

Creating a Scrapy project

Command: scrapy startproject tutorial (where tutorial is the project name)

Scrapy project directory structure

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider

Our first Spider needs to go in the spiders/ folder of the project directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
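The filename logic in parse() can be checked standalone: splitting the URL on "/" leaves the page number as the second-to-last piece. A minimal sketch, run without Scrapy:

```python
# Reproduces the filename logic from parse() above, outside of Scrapy.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]            # second-to-last path segment: '1'
filename = 'quotes-%s.html' % page   # 'quotes-1.html'
print(filename)
```

Per the official Scrapy tutorial, the spider itself is run from the project root with `scrapy crawl quotes`, which writes quotes-1.html and quotes-2.html.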

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:


name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.


start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
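The "list or generator" part of that contract can be shown without Scrapy at all. A minimal sketch, using plain URL strings as stand-ins for scrapy.Request objects (the function names here are my own, purely illustrative):

```python
# Two valid shapes for start_requests(): a list, or a generator function.
# Both are iterable, which is all the framework requires.

def requests_as_list():
    # returning a concrete list of "requests"
    return ['http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/']

def requests_as_generator():
    # yielding them lazily from a generator function
    for n in (1, 2):
        yield 'http://quotes.toscrape.com/page/%d/' % n

# Iterating either form produces the same sequence.
assert list(requests_as_generator()) == requests_as_list()
```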


parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.


The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
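A hedged sketch of that dual role, yielding scraped data as dicts and also surfacing new URLs to follow. LinkExtractor and the plain-string yields are stand-ins of my own so the logic runs without a live crawl; real Scrapy code would use response selectors and yield scrapy.Request objects instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (a stand-in for Scrapy selectors)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

def parse(url, body):
    # 1. Extract the scraped data as a dict ...
    yield {'url': url}
    # 2. ... and find new URLs to follow.
    extractor = LinkExtractor()
    extractor.feed(body)
    for href in extractor.links:
        yield href  # real Scrapy would wrap this in a new Request

results = list(parse('http://quotes.toscrape.com/page/1/',
                     '<a href="/page/2/">Next</a>'))
# results holds one data dict followed by one URL to follow
```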
