1. Crawler projects
1) Create a crawler project:
scrapy startproject <project_name>
> scrapy startproject myfirstpjt
2) Enter the project directory:
cd <project directory>
> cd myfirstpjt
3) View the options a command accepts:
> scrapy startproject -h
4) --logfile=FILE specifies the log file, and --loglevel sets the logging level:

Level    | Meaning
---------|--------
CRITICAL | the most severe errors occurred
ERROR    | a problem occurred that must be handled immediately
WARNING  | warning messages; a potential problem exists
INFO     | informational messages
DEBUG    | debugging output, mainly used during development
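Scrapy's five log levels are simply Python's standard logging levels; a quick stdlib check (no Scrapy needed) confirms the severity order in the table:

```python
import logging

# The five Scrapy log levels, from most to least severe.
levels = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG"]
values = [getattr(logging, name) for name in levels]

# Numeric severity strictly decreases down the table.
assert values == sorted(values, reverse=True)
print(dict(zip(levels, values)))
```

With these levels, `scrapy crawl <spider> --loglevel=WARNING --logfile=run.log` would keep only warnings and errors in run.log.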
5) Global commands
> scrapy -h
Scrapy 1.4.0 - project: myfirstpjt
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
fetch command: fetch a URL using the Scrapy downloader (--headers prints the request/response headers, --nolog suppresses log output)
> scrapy fetch --headers --nolog http://news.sina.com.cn/
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Server: nginx
< Date: Wed, 04 Oct 2017 04:14:24 GMT
< Content-Type: text/html
< Last-Modified: Wed, 04 Oct 2017 04:12:07 GMT
< Vary: Accept-Encoding
< Expires: Wed, 04 Oct 2017 04:14:21 GMT
< Cache-Control: max-age=60
< X-Powered-By: shci_v1.03
< Age: 32
< Via: http/1.1 ctc.ningbo.ha2ts4.81 (ApacheTrafficServer/4.2.1.1 [cHs f ]), http/1.1 ctc.ningbo.ha2ts4.106 (ApacheTrafficServer/4.2.1.1 [cRs f ])
< X-Cache: HIT.81
< X-Cache: HIT.106
< X-Via-Cdn: f=edge,s=ctc.ningbo.ha2ts4.107.nb.sinaedge.com,c=61.164.56.98;f=Edge,s=ctc.ningbo.ha2ts4.106,c=61.164.56.98;f=edge,s=ctc.ningbo.ha2ts4.73.nb.sinaedge.com,c=115.238.190.106;f=Edge,s=ctc.ningbo.ha2ts4.81,c=106.38.241.153
< X-Via-Edge: jgwjigaqtn
runspider command: run a self-contained spider file directly, without relying on a Scrapy project
> scrapy runspider --loglevel=INFO runspider.py
settings command: view Scrapy configuration values
> scrapy settings --get BOT_NAME
scrapybot
shell command: start Scrapy's interactive scraping shell
> scrapy shell http://www.baidu.com --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000023F01A34630>
[s] item {}
[s] request <GET http://www.baidu.com>
[s] response <200 http://www.baidu.com>
[s] settings <scrapy.settings.Settings object at 0x0000023F02EFB940>
[s] spider <DefaultSpider 'default' at 0x23f0318e978>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]: ti = response.xpath("/html/head/title")
In [2]: print(ti)
[<Selector xpath='/html/head/title' data='<title>百度一下,你就知道</title>'>]
In [3]: exit()
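The same title query can be reproduced outside the shell; a stdlib-only sketch using xml.etree.ElementTree on a toy document (Scrapy itself uses parsel selectors, which support full XPath, unlike ElementTree's limited subset):

```python
import xml.etree.ElementTree as ET

# A toy page standing in for the downloaded response body.
html = "<html><head><title>example title</title></head><body/></html>"
root = ET.fromstring(html)

# 'head/title' is relative to the <html> root element, mirroring
# the /html/head/title query used in the shell session above.
title = root.find("head/title").text
print(title)  # -> example title
```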
version command: print the Scrapy version
> scrapy version
Scrapy 1.4.0
> scrapy version -v
Scrapy : 1.4.0
lxml : 3.7.3.0
libxml2 : 2.9.4
cssselect : 1.0.1
parsel : 1.2.0
w3lib : 1.18.0
Twisted : 17.9.0
Python : 3.6.1 |Anaconda 4.4.0 (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 17.0.0 (OpenSSL 1.0.2l 25 May 2017)
Platform : Windows-10-10.0.15063-SP0
view command: download a page and open it in the browser, as Scrapy sees it
> scrapy view http://news.163.com/
6) Project commands
Run scrapy -h inside the project directory to list the commands available there:
> scrapy -h
(output identical to the listing shown above under "Global commands")
bench command: run a quick benchmark to test local hardware performance
genspider command: generate a new spider file from a template
scrapy genspider -l lists the templates currently available:
Available templates: basic, crawl, csvfeed, xmlfeed
Generate a spider from the basic template: scrapy genspider -t basic weisuen iqianyue.com (template, new spider name, domain to crawl)
Show the contents of the csvfeed template: scrapy genspider -d csvfeed
check command: run contract checks on a spider
scrapy check <spider_name>
crawl command: start a spider
scrapy crawl <spider_name>
list command: list the spiders available in the current project
scrapy list
edit command: open a spider file in the editor (problematic on Windows; generally fine on Linux)
parse command: fetch the given URL and process it with the corresponding spider
Options accepted by the parse command:

Option   | Meaning
---------|--------
--spider=SPIDER | force a specific spider to handle the URL
-a NAME=VALUE | set a spider argument (may be repeated)
--pipelines | process items through the pipelines
--nolinks | don't show extracted links
--nocolour | avoid colourised output
--rules, -r | use CrawlSpider rules to determine the callback
--callback=CALLBACK, -c CALLBACK | spider callback used to process the response
--noitems | don't show scraped items
--depth=DEPTH, -d DEPTH | depth at which requests are followed (default: 1)
--verbose, -v | print information for each depth level