Scrapy Learning Notes (Part 1)

Installing Scrapy with pip

Command: pip install Scrapy

Creating a Scrapy project

Command: scrapy startproject tutorial (where tutorial is the project name)

Scrapy project directory structure

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider

Our first Spider needs to go in the spiders/ folder of the project directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
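The filename logic in parse() can be checked standalone: splitting the URL on "/" leaves the page number as the second-to-last piece. A minimal sketch, run without Scrapy:

```python
# Reproduces the filename logic from parse() above, outside of Scrapy.
url = 'http://quotes.toscrape.com/page/1/'
page = url.split("/")[-2]            # second-to-last path segment: '1'
filename = 'quotes-%s.html' % page   # 'quotes-1.html'
print(filename)
```

Per the official Scrapy tutorial, the spider itself is run from the project root with `scrapy crawl quotes`, which writes quotes-1.html and quotes-2.html.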

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:


name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.


start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
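The "list or generator" part of that contract can be shown without Scrapy at all. A minimal sketch, using plain URL strings as stand-ins for scrapy.Request objects (the function names here are my own, purely illustrative):

```python
# Two valid shapes for start_requests(): a list, or a generator function.
# Both are iterable, which is all the framework requires.

def requests_as_list():
    # returning a concrete list of "requests"
    return ['http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/']

def requests_as_generator():
    # yielding them lazily from a generator function
    for n in (1, 2):
        yield 'http://quotes.toscrape.com/page/%d/' % n

# Iterating either form produces the same sequence.
assert list(requests_as_generator()) == requests_as_list()
```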


parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.


The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
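A hedged sketch of that dual role, yielding scraped data as dicts and also surfacing new URLs to follow. LinkExtractor and the plain-string yields are stand-ins of my own so the logic runs without a live crawl; real Scrapy code would use response selectors and yield scrapy.Request objects instead:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (a stand-in for Scrapy selectors)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

def parse(url, body):
    # 1. Extract the scraped data as a dict ...
    yield {'url': url}
    # 2. ... and find new URLs to follow.
    extractor = LinkExtractor()
    extractor.feed(body)
    for href in extractor.links:
        yield href  # real Scrapy would wrap this in a new Request

results = list(parse('http://quotes.toscrape.com/page/1/',
                     '<a href="/page/2/">Next</a>'))
# results holds one data dict followed by one URL to follow
```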
