The Scrapy command line

Ghost_02

Article link: https://blog.csdn.net/ghost_leader/article/details/78459769

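Scrapy is controlled through the scrapy command-line tool, referred to here as the "Scrapy tool" to distinguish it from its sub-commands, which are simply called "commands" or "Scrapy commands".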

First, create a Scrapy project, then generate a spider inside it.

[root@lol spider]# scrapy startproject testproject
New Scrapy project 'testproject', using template directory '/root/.pyenv/versions/3.6.1/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /root/PycharmProjects/spider/testproject

You can start your first spider with:
    cd testproject
    scrapy genspider example example.com
[root@lol spider]# cd testproject/
[root@lol testproject]# scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
  testproject.spiders.baidu

genspider can generate a spider from several templates. Use -l to list them.

[root@lol testproject]# scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

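To generate a spider from a specific template, pass its name with -t, for example the crawl template: scrapy genspider -t crawl <name> <domain>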
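When project scaffolding is scripted, the genspider invocation can be assembled programmatically. The following is only a minimal sketch: `genspider_argv` is a hypothetical helper name, and the spider name `zhihu` and domain `www.zhihu.com` are illustrative values that do not come from the commands shown above. It only builds the argument list and does not run anything.

```python
from typing import List, Optional


def genspider_argv(name: str, domain: str, template: Optional[str] = None) -> List[str]:
    """Build the argument list for `scrapy genspider` (hypothetical helper)."""
    argv = ["scrapy", "genspider"]
    if template is not None:
        # -t selects one of the templates printed by `scrapy genspider -l`.
        argv += ["-t", template]
    argv += [name, domain]
    return argv


# Illustrative values; pass the list to subprocess.run() inside a project to execute it.
print(" ".join(genspider_argv("zhihu", "www.zhihu.com", template="crawl")))
```

Omitting `template` leaves out `-t`, in which case Scrapy falls back to the basic template, as in the baidu example above.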

Run a spider.

For example, to run the baidu spider: scrapy crawl baidu

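Run the contract checks declared in the spiders' callback docstrings. This is not a general syntax check; no contracts are defined in this project yet, so zero contracts are run.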

[root@lol testproject]# scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
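The run above reports 0 contracts because the generated spiders declare none. Contracts are written as `@`-prefixed lines in a callback's docstring; `@url` and `@returns` are built-in Scrapy contracts. The sketch below shows such a docstring on a stand-in class and a simplified scan that lists its contract lines. The assumptions are loud ones: `BaiduSpiderSketch` is not a real `scrapy.Spider`, the URL and bounds are illustrative, and `contract_lines` is a hypothetical helper that only mimics the idea, not Scrapy's own implementation.

```python
import re
from typing import List, Tuple


class BaiduSpiderSketch:
    """Stand-in for the generated spider; a real one subclasses scrapy.Spider."""

    name = "baidu"

    def parse(self, response):
        """Parse the start page.

        @url http://www.baidu.com/
        @returns items 0 10
        @returns requests 0 10
        """
        return []


def contract_lines(callback) -> List[Tuple[str, str]]:
    """Return (contract name, argument string) pairs found in a docstring."""
    found = []
    for line in (callback.__doc__ or "").splitlines():
        match = re.match(r"@(\w+)\s*(.*)", line.strip())
        if match:
            found.append((match.group(1), match.group(2)))
    return found


for contract_name, contract_args in contract_lines(BaiduSpiderSketch.parse):
    print(contract_name, contract_args)
```

With a docstring like this on a real spider's callback, scrapy check would have contracts to run instead of reporting zero.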

List the spiders in the project.

[root@lol testproject]# scrapy list
baidu
zhihu

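Download a page with the Scrapy downloader, the same way a spider would fetch it, and print its source. With --headers the HTTP headers are printed instead of the body; with --no-redirect HTTP 3xx redirects are not followed.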

[root@lol testproject]# scrapy fetch --nolog http://www.iqiyi.com
[HTML page content]

[root@lol testproject]# scrapy fetch --nolog --headers http://www.iqiyi.com
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
> Accept-Encoding: gzip,deflate
>
< Date: Mon, 06 Nov 2017 11:56:07 GMT
< Content-Type: text/html
< Expires: Mon, 06 Nov 2017 11:53:23 GMT
< Cache-Control: max-age=300
< Last-Modified: Mon, 06 Nov 2017 11:46:26 GMT
< Server: Apache 1.3.29
< X-Cache: HIT from 101.227.22.100
< X-Cache: HIT from 115.238.189.1
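In the header dump above, lines starting with `>` are request headers and lines starting with `<` are response headers. If that output needs to be inspected in a script, it could be split along those markers. This is a rough sketch based only on the format visible above; `split_header_dump` is a hypothetical helper, and the sample is a shortened copy of that output.

```python
from typing import Dict, List, Tuple

HeaderMap = Dict[str, List[str]]


def split_header_dump(text: str) -> Tuple[HeaderMap, HeaderMap]:
    """Split `scrapy fetch --headers` output into request and response headers."""
    request: HeaderMap = {}
    response: HeaderMap = {}
    for line in text.splitlines():
        if line.startswith("> "):
            target = request
        elif line.startswith("< "):
            target = response
        else:
            continue  # bare ">" separator or blank line
        name, _, value = line[2:].partition(": ")
        # A header may repeat (X-Cache above), so keep every value.
        target.setdefault(name, []).append(value)
    return request, response


sample = """\
> Accept-Language: en
> User-Agent: Scrapy/1.4.0 (+http://scrapy.org)
>
< Content-Type: text/html
< X-Cache: HIT from 101.227.22.100
< X-Cache: HIT from 115.238.189.1
"""

req, resp = split_header_dump(sample)
print(req["User-Agent"])
print(resp["X-Cache"])
```

Values are kept as lists so that repeated headers such as the two X-Cache lines are not lost.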


[root@lol testproject]# scrapy fetch --nolog --no-redirect http://www.xiaomi.com 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by MIWS</body>
</html>

scrapy view

scrapy view <url> downloads the page and opens the local copy in a browser, so you can see the page as the spider sees it.

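scrapy shell opens an interactive console for the given URL, which is useful for debugging selectors against the downloaded response.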

# scrapy shell --nolog http://www.qq.com/
In [1]: response
Out[1]: <200 http://www.qq.com/>

In [2]: response.headers
Out[2]: 
{b'Cache-Control': b'max-age=60',
 b'Content-Type': b'text/html; charset=GB2312',
 b'Date': b'Mon, 06 Nov 2017 12:12:38 GMT',
 b'Expires': b'Mon, 06 Nov 2017 12:13:38 GMT',
 b'Server': b'squid/3.5.20',
 b'Vary': b'Accept-Encoding',
 b'X-Cache': b'HIT from shenzhen.qq.com'}

In [3]: response.css('title::text').extract_first()
Out[3]: '腾讯首页'
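The Content-Type header in this session declares charset=GB2312, yet the selector returns a readable string, because Scrapy's text responses decode the body before selectors run. Outside Scrapy, that decoding step might look roughly like the sketch below. It uses only the standard library rather than Scrapy's own encoding detection; `decode_body` is a hypothetical helper, and the sample bytes are constructed here for illustration.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset named in a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter, or None.
    charset = msg.get_content_charset() or default
    return body.decode(charset, errors="replace")


# Illustrative bytes, encoded the way the header in the session above declares.
raw = "<title>腾讯首页</title>".encode("gb2312")
print(decode_body(raw, "text/html; charset=GB2312"))
```

When the header carries no charset parameter, the sketch falls back to `default`, which is a simplification.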

Look up a value from the project settings with scrapy settings.

[root@lol quote]# scrapy settings --get=MONGO_DB
quotes
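MONGO_DB is not a built-in Scrapy setting; the command prints whatever the current project defines. The settings file of the quote project is not shown in this post, so the fragment below is an assumption about what it presumably contains.

```python
# settings.py of the quote project (assumed content, not shown in the original post)

# Custom, project-specific setting; `scrapy settings --get=MONGO_DB` prints this value.
MONGO_DB = "quotes"
```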

Show the Scrapy version. With -v, the versions of its dependencies, Python, and the platform are printed as well.

[root@lol quote]# scrapy version
Scrapy 1.4.0
[root@lol quote]# scrapy version -v
Scrapy    : 1.4.0
lxml      : 4.1.0.0
libxml2   : 2.9.5
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.18.0
Twisted   : 17.9.0
Python    : 3.6.1 (default, Oct 21 2017, 18:51:01) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
pyOpenSSL : 17.3.0 (OpenSSL 1.1.0g  2 Nov 2017)
Platform  : Linux-3.10.0-514.26.1.el7.x86_64-x86_64-with-centos-7.3.1611-Core
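The verbose output is a simple "name : value" listing, which makes it easy to capture for bug reports or environment checks. A minimal sketch, assuming the format stays as shown above; `parse_versions` is a hypothetical helper and the sample is a shortened copy of that output.

```python
from typing import Dict


def parse_versions(text: str) -> Dict[str, str]:
    """Turn `scrapy version -v` output into a {component: version} mapping."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only; the Python line contains more of them.
        name, sep, value = line.partition(":")
        if sep:
            versions[name.strip()] = value.strip()
    return versions


sample = """\
Scrapy    : 1.4.0
lxml      : 4.1.0.0
Twisted   : 17.9.0
Python    : 3.6.1 (default, Oct 21 2017, 18:51:01) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
"""

info = parse_versions(sample)
print(info["Scrapy"], info["Twisted"])
```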
