1. Crawler projects
1) Create a crawler project:
scrapy startproject <project_name>
> scrapy startproject myfirstpjt
2) Enter the project directory:
cd <project directory>
> cd myfirstpjt
3) View the options a command accepts:
> scrapy startproject -h
4) --logfile=FILE specifies the log file, and --loglevel sets the logging level:

Level    | Meaning
---------|--------
CRITICAL | the most severe errors occurred
ERROR    | a problem occurred that must be handled immediately
WARNING  | warning messages; a potential problem exists
INFO     | informational messages
DEBUG    | debugging output, mainly used during development
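Scrapy's five log levels are simply Python's standard logging levels; a quick stdlib check (no Scrapy needed) confirms the severity order in the table:

```python
import logging

# The five Scrapy log levels, from most to least severe.
levels = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]
values = [getattr(logging, name) for name in levels]

# Numeric severity strictly decreases down the table.
assert values == sorted(values, reverse=True)
print(dict(zip(levels, values)))
```

With these levels, `scrapy crawl <spider> --loglevel=WARNING --logfile=run.log` would keep only warnings and errors in run.log.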
5) Global commands
> scrapy -h
Scrapy 1.4.0 - project: myfirstpjt
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
fetch command: fetch a URL using the Scrapy downloader (--headers prints the request/response headers, --nolog suppresses log output)
> scrapy fetch --headers --nolog http://news.sina.com.cn/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: nginx
< Date: Wed, 04 Oct 2017 04:14:24 GMT
< Content-Type: text/html
< Last-Modified: Wed, 04 Oct 2017 04:12:07 GMT
< Vary: Accept-Encoding
< Expires: Wed, 04 Oct 2017 04:14:21 GMT
< Cache-Control: max-age=60
< X-Powered-By: shci_v1.03
< Age: 32
< Via: http/1.1 ctc.ningbo.ha2ts4.81 (ApacheTrafficServer/4.2.1.1 [cHs f ]), http/1.1 ctc.ningbo.ha2ts4.106 (ApacheTrafficServer/4.2.1.1 [cRs f ])
< X-Cache: HIT.81
< X-Cache: HIT.106
< X-Via-Cdn: f=edge,s=ctc.ningbo.ha2ts4.107.nb.sinaedge.com,c=61.164.56.98;f=Edge,s=ctc.ningbo.ha2ts4.106,c=61.164.56.98;f=edge,s=ctc.ningbo.ha2ts4.73.nb.sinaedge.com,c=115.238.190.106;f=Edge,s=ctc.ningbo.ha2ts4.81,c=106.38.241.153
< X-Via-Edge: jgwjigaqtn
runspider command: run a self-contained spider file directly, without relying on a Scrapy project
> scrapy runspider --loglevel=INFO runspider.py
settings command: view Scrapy configuration values
> scrapy settings --get BOT_NAME
scrapybot
shell command: start Scrapy's interactive scraping shell
> scrapy shell http://www.baidu.com --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000023F01A34630>
[s] item {}
[s] request <GET http://www.baidu.com>
[s] response <200 http://www.baidu.com>
[s] settings <scrapy.settings.Settings object at 0x0000023F02EFB940>
[s] spider <DefaultSpider 'default' at 0x23f0318e978>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: ti = response.xpath("/html/head/title")
In [2]: print(ti)
[<Selector xpath='/html/head/title' data='<title>百度一下,你就知道</title>'>]
In [3]: exit()
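The same title query can be reproduced outside the shell; a stdlib-only sketch using xml.etree.ElementTree on a toy document (Scrapy itself uses parsel selectors, which support full XPath, unlike ElementTree's limited subset):

```python
import xml.etree.ElementTree as ET

# A toy page standing in for the downloaded response body.
html = "<html><head><title>example title</title></head><body/></html>"
root = ET.fromstring(html)

# 'head/title' is relative to the <html> root element, mirroring
# the /html/head/title query used in the shell session above.
title = root.find("head/title").text
print(title)  # -> example title
```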
version command: print the Scrapy version
> scrapy version
Scrapy 1.4.0
> scrapy version -v
Scrapy : 1.4.0
lxml : 3.7.3.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.18.0
Twisted : 17.9.0
Python : 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2l 25 May 2017)
Platform : Windows-10-10.0.15063-SP0
view command: download a page and open it in the browser, as Scrapy sees it
> scrapy view http://news.163.com/
6) Project commands
Run scrapy -h inside the project directory to list the commands available there:
> scrapy -h
(output identical to the listing shown above under "Global commands")
bench command: run a quick benchmark to test local hardware performance
genspider command: generate a new spider file from a template
scrapy genspider -l lists the templates currently available:
Available templates: basic, crawl, csvfeed, xmlfeed
Generate a spider from the basic template: scrapy genspider -t basic weisuen iqianyue.com (template, new spider name, domain to crawl)
Show the contents of the csvfeed template: scrapy genspider -d csvfeed
check command: run contract checks on a spider
scrapy check <spider_name>
crawl command: start a spider
scrapy crawl <spider_name>
list command: list the spiders available in the current project
scrapy list
edit command: open a spider file in the editor (problematic on Windows; generally fine on Linux)
parse command: fetch the given URL and process it with the corresponding spider
Options accepted by the parse command:

Option   | Meaning
---------|--------
--spider=SPIDER | force a specific spider to handle the URL
-a NAME=VALUE | set a spider argument (may be repeated)
--pipelines | process items through the pipelines
--nolinks | don't show extracted links
--nocolour | avoid colourised output
--rules, -r | use CrawlSpider rules to determine the callback
--callback=CALLBACK, -c CALLBACK | spider callback used to process the response
--noitems | don't show scraped items
--depth=DEPTH, -d DEPTH | depth at which requests are followed (default: 1)
--verbose, -v | print information for each depth level