Python Crawler Framework Scrapy, Lesson 1: Framework Introduction and Environment Setup

Introduction to the Scrapy Framework

(Scrapy architecture diagram; image source: Baidu Images)

Setting aside the Scrapy Engine (which coordinates everything and dispatches the other components), the overall flow is roughly this: you write a Spider, which hands requests to the Scheduler; the Scheduler enqueues them, then later takes a request out of the queue and passes it to the Downloader; the response returned by the Downloader goes back to the Spider for extraction. If the Spider extracts a URL, the cycle repeats; if it extracts Item data, the data is handed to the Item Pipeline for storage.
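As a minimal sketch of that cycle (this example is mine, not from the original post; the spider name, start URL and selectors are placeholders):

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"                             # spider name (placeholder)
    start_urls = ["http://www.baidu.com"]     # initial request handed to the Scheduler

    def parse(self, response):
        # A yielded dict is an item: it goes to the Item Pipeline for storage.
        yield {"title": response.xpath("//title/text()").extract_first()}
        # A yielded Request goes back to the Scheduler and the cycle repeats.
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)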

Setting Up the Environment on Windows

Open a command-line window and run the following:
pip install --upgrade pip
pip install Scrapy

If pip is already up to date, the first command simply reports that there is nothing to upgrade. While running the second command, I ran into the following error:
(screenshot: error while building Twisted during pip install Scrapy)

The error above means that Twisted failed to install. The fix is to first download a pre-built Twisted wheel of the appropriate version, place it in a directory, and install it from that directory, roughly as follows:
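The commands below are an illustrative reconstruction; the directory and wheel filename are placeholders, so pick the wheel matching your Python version and architecture (cp36/win32 for the 32-bit Python 3.6.5 used in this post):

cd C:\path\to\downloads
pip install Twisted-17.9.0-cp36-cp36m-win32.whl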

Once Twisted has been installed this way, re-run pip install Scrapy and the installation completes.

While running the scrapy bench command afterwards, I hit another error: ModuleNotFoundError: No module named 'win32api'. It means pywin32 is not installed. Install it from the command line with pip install pywin32; once that succeeds, scrapy bench can be used to benchmark the machine.

A Few Commands

Typing scrapy at the command line prints the following:

Scrapy 1.5.0 - no active project

Usage:
scrapy <command> [options] [args]

Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy

[ more ] More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

bench

Run quick benchmark test
Benchmarks the machine: roughly how many pages it can crawl per minute.

scrapy bench

fetch

Fetch a URL using the Scrapy downloader

scrapy fetch "http://www.baidu.com"
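A common variant (my addition, not from the original post) is to suppress the log and redirect the page body to a file; --nolog is a global Scrapy option and the output filename is arbitrary:

scrapy fetch --nolog "http://www.baidu.com" > baidu.html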

genspider

Generate new spider using pre-defined templates
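For example, to generate a spider named baidu limited to the baidu.com domain (both names are placeholders):

scrapy genspider baidu baidu.com

This produces a baidu.py spider file from the default "basic" template; scrapy genspider -l lists the other built-in templates (crawl, csvfeed, xmlfeed).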

runspider

Run a self-contained spider (without creating a project)
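If the DemoSpider sketch from the framework section were saved as demo_spider.py (the filename is arbitrary), it could be run directly, without creating a project:

scrapy runspider demo_spider.py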

settings

Get settings values
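For example, to print a single setting value (BOT_NAME is a built-in Scrapy setting; outside a project this prints the default, scrapybot):

scrapy settings --get BOT_NAME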

shell

Interactive scraping console

scrapy shell "http://www.baidu.com"

Output:

2018-04-05 21:12:14 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-04-05 21:12:14 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.15063-SP0
2018-04-05 21:12:14 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2018-04-05 21:12:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-04-05 21:12:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-04-05 21:12:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-05 21:12:15 [scrapy.core.engine] INFO: Spider opened
2018-04-05 21:12:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)

response.body

Output (abridged): a bytes object containing the raw HTML of the Baidu homepage; the \x.. escape sequences are UTF-8-encoded Chinese text (for example, \xe7\x99\xbe\xe5\xba\xa6\xe4\xb8\x80\xe4\xb8\x8b\xef\xbc\x8c\xe4\xbd\xa0\xe5\xb0\xb1\xe7\x9f\xa5\xe9\x81\x93 decodes to 百度一下，你就知道, the page title).
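Once the shell is open, the downloaded response can be explored interactively. The selectors below are illustrative examples, not taken from the original post:

response.status                                    # 200
response.xpath('//title/text()').extract_first()   # '百度一下，你就知道'
response.css('a::attr(href)').extract()[:5]        # the first few link URLs on the page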

startproject

Create new project
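For example (the project name is a placeholder):

scrapy startproject mySpider

This creates a mySpider/ directory containing scrapy.cfg plus a package with items.py, middlewares.py, pipelines.py, settings.py and a spiders/ folder, which is the skeleton a full crawler is built in.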

version

Print Scrapy version
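For example:

scrapy version      # prints Scrapy 1.5.0 on the setup used here
scrapy version -v   # additionally prints the lxml, Twisted, Python, pyOpenSSL and platform versions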

view

Open URL in browser, as seen by Scrapy
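scrapy view "http://www.baidu.com"

This fetches the page with the Scrapy downloader and opens the downloaded copy in the default browser, which is useful for checking whether what Scrapy receives differs from what the browser normally renders (for example, content filled in by JavaScript).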
