Scrapy Study Notes (2): Creating a Scrapy Project and Common Commands

DivingKitten

Category: Python

Article link: https://blog.csdn.net/weixin_42404727/article/details/96766675

I. Creating a Scrapy Project and Analyzing Its Structure

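To avoid affecting the packages in the global environment, the Scrapy project should be built inside a virtual environment. Here I use PyCharm to create an empty Python project and then install the required packages into it.

In PyCharm's Terminal, change to the root directory of the current project and run the startproject command to create a Scrapy crawler project.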

scrapy startproject myapp

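Once the project has been created, Scrapy prints hints about the commands to run next. First, change into the newly created Scrapy project directory.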

cd myapp

With the project in place, use PyCharm's project file browser to examine its structure:

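A Scrapy project created with this command contains a folder with the same name as the project; for now I will call this second-level folder the project's core directory. Next to it is a scrapy.cfg file, which is the configuration file of the new project.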
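To make the role of scrapy.cfg more concrete, here is a minimal sketch that reads the file with Python's standard configparser and looks up which settings module it points to. The sample contents below are my assumption of what startproject generates for a project named myapp; the real file may carry extra comments or differ between Scrapy versions.

```python
import configparser

# Assumed contents of scrapy.cfg for a project called "myapp".
# The generated file may contain additional comments or keys.
SAMPLE_CFG = """
[settings]
default = myapp.settings

[deploy]
project = myapp
"""


def settings_module(cfg_text):
    """Return the dotted path of the settings module named in scrapy.cfg."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("settings", "default")


print(settings_module(SAMPLE_CFG))
```

The value under [settings] is a dotted module path, which is how the top-level scrapy.cfg connects to settings.py in the core directory.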

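The core directory contains:

spiders: the directory that holds the spiders
__init__.py: the project's initialization file
items.py: the data container file, which defines the format of the scraped data
pipelines.py: the pipeline file, used for further processing of the data in items
settings.py: the project settings file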

II. Common Scrapy Commands

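Scrapy commands fall into two categories: global commands and project commands.

Global commands can be run from anywhere and do not depend on a Scrapy project. Project commands must be run inside a Scrapy project.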
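As far as I understand it, "inside a Scrapy project" means that a scrapy.cfg file exists in the current directory or in one of its parents. The sketch below re-implements that lookup with pathlib. The function names find_scrapy_cfg and inside_project are my own, and this is an approximation rather than Scrapy's actual code.

```python
from pathlib import Path


def find_scrapy_cfg(start="."):
    """Walk upward from `start` and return the first scrapy.cfg found, or None."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


def inside_project(start="."):
    """True when `start` lies in a directory tree that has a scrapy.cfg."""
    return find_scrapy_cfg(start) is not None


print("inside a Scrapy project:", inside_project())
```

Run from myapp or any of its subfolders, this should report True; from an unrelated directory it should report False, which matches the "no active project" banner that scrapy -h prints there.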

1. Global commands

Global commands can be executed without entering a project. Run scrapy -h to list them:

(venv) F:\Workplaces\PycharmWS\scrapy>scrapy -h
Scrapy 1.5.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

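(1) The fetch command

Downloads a URL with the Scrapy downloader, which shows how a spider would fetch the page. When called outside a Scrapy project, it crawls the page with Scrapy's default spider; when called inside a project, it uses that project's spider.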

(venv) F:\Workplaces\PycharmWS\scrapy>scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body
--no-redirect           do not handle HTTP 3xx status codes and print response
                        as-is

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

The following fetches the Baidu home page and prints the request and response headers (url=http://www.baidu.com):

(venv) F:\Workplaces\PycharmWS\scrapy>scrapy fetch http://www.baidu.com --headers --nolog
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: bfe/1.0.8.18
< Date: Sun, 21 Jul 2019 14:16:38 GMT
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:23 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
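In the output above, lines starting with ">" are the request headers Scrapy sent and lines starting with "<" are the response headers it received. If you want to post-process that text, a small helper along these lines could split it into two dictionaries. The name split_headers is hypothetical, and the sample is an abbreviated copy of the output above.

```python
def split_headers(output):
    """Split `scrapy fetch --headers` text into (request, response) dicts."""
    request, response = {}, {}
    for line in output.splitlines():
        marker, _, rest = line.partition(" ")
        if marker not in (">", "<") or ":" not in rest:
            continue  # skip the bare ">" separator and blank lines
        name, _, value = rest.partition(":")
        target = request if marker == ">" else response
        # A repeated header (e.g. several Set-Cookie lines) keeps only the last value.
        target[name.strip()] = value.strip()
    return request, response


SAMPLE = """\
> Accept-Language: en
> User-Agent: Scrapy/1.5.0 (+https://scrapy.org)
>
< Server: bfe/1.0.8.18
< Content-Type: text/html
"""

request_headers, response_headers = split_headers(SAMPLE)
print(request_headers["User-Agent"])
print(response_headers["Content-Type"])
```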

(2) The runspider command

Runs a spider file directly, without relying on a Scrapy project.

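(3) The settings command

Called inside a Scrapy project, the settings command shows that project's configuration values (that is, the values in settings.py); called anywhere else, it shows Scrapy's default configuration values.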
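On the command line a single value can be read with, for example, scrapy settings --get BOT_NAME. The layering behind the answer can be modelled with a ChainMap, as in the rough sketch below. This is only an illustration of "project values shadow defaults", not how Scrapy implements its settings, and the PROJECT values are hypothetical.

```python
from collections import ChainMap

# A few of Scrapy's built-in defaults (as documented for Scrapy 1.x).
DEFAULTS = {"BOT_NAME": "scrapybot", "ROBOTSTXT_OBEY": False, "DOWNLOAD_DELAY": 0}

# Hypothetical values that a project's settings.py might define.
PROJECT = {"BOT_NAME": "myapp", "ROBOTSTXT_OBEY": True}


def get_setting(name, inside_project):
    """Look up a setting, letting project values shadow the defaults."""
    layers = ChainMap(PROJECT, DEFAULTS) if inside_project else ChainMap(DEFAULTS)
    return layers[name]


print(get_setting("BOT_NAME", inside_project=True))        # project value
print(get_setting("BOT_NAME", inside_project=False))       # built-in default
print(get_setting("DOWNLOAD_DELAY", inside_project=True))  # falls back to the default
```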

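(4) The shell command

The shell command starts Scrapy's interactive console. It lets you debug a page's response without starting a Scrapy spider, and you can also write Python code in it while debugging. Once inside, you will see the Python shell prompt '>>>'.

For example, to get the title tag of the Baidu home page: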

(venv) F:\Workplaces\PycharmWS\scrapy>scrapy shell https://www.baidu.com --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001C4110DAD68>
[s]   item       {}
[s]   request    <GET https://www.baidu.com>
[s]   response   <200 https://www.baidu.com>
[s]   settings   <scrapy.settings.Settings object at 0x000001C413832EB8>
[s]   spider     <DefaultSpider 'default' at 0x1c413aade10>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> ti=response.xpath("/html/head/title")
>>> print(ti)
[<Selector xpath='/html/head/title' data='<title>百度一下，你就知道</title>'>]
>>>
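The session above returns a Selector that wraps the whole title element. To try the same path lookup without Scrapy or a network connection, the sketch below swaps in the standard library's ElementTree on a tiny hand-written page. ElementTree supports only a subset of XPath and needs well-formed markup, so it is a stand-in for Scrapy selectors, not a replacement.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a downloaded page (hypothetical content).
PAGE = "<html><head><title>百度一下，你就知道</title></head><body></body></html>"

root = ET.fromstring(PAGE)         # root is the <html> element
title = root.find("./head/title")  # path relative to <html>
print(title.text)
```

Inside the Scrapy shell itself, the text alone can be obtained with response.xpath("/html/head/title/text()").extract_first().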

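(5) The startproject command

Creates a project.

(6) The version command

Shows Scrapy's version information.

(7) The view command

Opens a downloaded web page in the browser so you can inspect its content.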

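2. Project commands

(1) The bench command

Runs a quick benchmark to test the performance of the local hardware. As the scrapy -h output above shows, bench can also be run outside a project.

(2) The genspider command

Quickly generates a new spider file from one of the existing spider templates. Like bench, it also appears in the command list shown outside a project.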

(venv) ~/PycharmProjects/scrapy/mapp$ scrapy genspider --h
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template

The available spider templates are:

(venv) ~/PycharmProjects/scrapy/mapp$ scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
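To give an idea of what genspider produces, here is a sketch that fills a simplified version of the basic template with string.Template. The template text and the class-naming rule are my approximation of what Scrapy 1.x does, and render_spider is a hypothetical helper; the real template file may differ in detail.

```python
from string import Template

# Simplified stand-in for the "basic" template; the real template file
# shipped with Scrapy may differ in detail between versions.
BASIC_TEMPLATE = Template('''\
import scrapy


class ${classname}(scrapy.Spider):
    name = '${name}'
    allowed_domains = ['${domain}']
    start_urls = ['http://${domain}/']

    def parse(self, response):
        pass
''')


def render_spider(name, domain):
    """Fill the template roughly the way `scrapy genspider <name> <domain>` would."""
    classname = "".join(part.capitalize() for part in name.split("_")) + "Spider"
    return BASIC_TEMPLATE.substitute(classname=classname, name=name, domain=domain)


print(render_spider("baidu", "baidu.com"))
```

The printed text is roughly the file that scrapy genspider baidu baidu.com would place in the spiders directory; the parse method is where the extraction logic goes.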

(3) The check command

Runs contract checks on a spider.
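Contracts are written as lines starting with "@" in the docstring of a spider callback, for example @url, @returns and @scrapes. The sketch below only shows what such a docstring looks like and pulls the contract lines out with a regular expression. The callback and the contract_lines helper are hypothetical, and Scrapy's own parsing may work differently.

```python
import re


def parse(response):
    """Hypothetical spider callback, shown only for its docstring.

    @url http://www.baidu.com
    @returns items 1 1
    @scrapes title
    """


def contract_lines(callback):
    """Return (name, arguments) pairs for each @contract in a docstring."""
    found = []
    for line in (callback.__doc__ or "").splitlines():
        match = re.match(r"\s*@(\w+)\s*(.*)", line)
        if match:
            found.append((match.group(1), match.group(2).split()))
    return found


for name, args in contract_lines(parse):
    print(name, args)
```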

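(4) The crawl command

Starts a spider with the crawl command.

(5) The list command

Lists the spiders available in the current project.

(6) The edit command

Opens a spider file for editing, which is convenient when working on the Linux command line.

(7) The parse command

Fetches the given URL and processes and parses it with the corresponding spider.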
