Scrapy Crawler

Installation

Linux: pip3 install scrapy

Windows:
- a. pip3 install wheel
- b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
- c. In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl (pick the wheel matching your Python version and architecture)
- d. pip3 install scrapy
- e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/
Shell debugging (install ipython with pip first for a nicer shell):
scrapy shell "http://www.baidu.com"
- Inspect the response headers: response.headers
- Inspect the response body: response.body
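Inside the shell you can also try selectors interactively; a minimal sketch (the XPath expression is only an illustration):

    # run a selector against the fetched page
    response.xpath('//title/text()').extract_first()
    # re-open the downloaded page in your browser
    view(response)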
Create a project:
scrapy startproject first_obj
Directory structure:
- first_obj/ (project root)
  - scrapy.cfg: project configuration info
  - first_obj/ (project module)
    - middlewares.py: middlewares
    - items.py: item definitions (structured data)
    - pipelines.py: persistence
    - settings.py: settings file
    - spiders/: spider code lives here
Create a spider:
cd first_obj
scrapy genspider baidu baidu.com
Run a spider:
scrapy crawl baidu [--nolog] [-o baidu.json | -o baidu.csv]
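genspider writes a skeleton spider into the spiders/ directory; it looks roughly like this (the exact template varies by Scrapy version):

    import scrapy

    class BaiduSpider(scrapy.Spider):
        name = 'baidu'
        allowed_domains = ['baidu.com']
        start_urls = ['http://baidu.com/']

        def parse(self, response):
            # parse the downloaded page here
            pass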
Other commands:
- List all available spiders in the current project:
scrapy list
- Download the given URL and write the fetched content to standard output:
scrapy fetch <url>
- Open the given URL in a browser, shown as the Scrapy spider "sees" it:
scrapy view <url>
- Read a Scrapy setting value:
scrapy settings --get <setting>
- Show the Scrapy version:
scrapy version
- Run a quick benchmark:
scrapy bench
Configuration file (settings.py):
ROBOTSTXT_OBEY: whether to obey the target site's robots.txt rules
CONCURRENT_REQUESTS: maximum number of concurrent requests Scrapy will perform
All setting names must be uppercase.
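A minimal settings.py sketch for the two options above (values are illustrative; Scrapy's default for CONCURRENT_REQUESTS is 16):

    # settings.py
    ROBOTSTXT_OBEY = False      # do not consult robots.txt before crawling
    CONCURRENT_REQUESTS = 32    # raise the concurrency ceiling from the default 16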
- Basic operations
  - Selector
    from scrapy.selector import Selector
    hxs = Selector(response=response)
    img_list = hxs.xpath("//div[@class='item']")
    for item in img_list:
        title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
        url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
  - yield (schedule follow-up requests; Request comes from scrapy.http)
    page_list = hxs.xpath('//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
    for page in page_list:
        yield Request(url=page, callback=self.parse)
- Pipeline
  - A spider that yields Item objects:
    import scrapy
    from scrapy.selector import Selector
    from scrapy.http import Request
    from ..items import ChouTiItem

    class ChoutiSpider(scrapy.Spider):
        name = 'chouti'
        allowed_domains = ['dig.chouti.com']
        start_urls = ['https://dig.chouti.com/']

        def parse(self, response):
            hxs = Selector(response=response)
            img_list = hxs.xpath("//div[@class='item']")
            for item in img_list:
                title = item.xpath("./div[@class='news-content']/div[@class='part2']/@share-title").extract()[0]
                url = item.xpath("./div[@class='news-pic']/img/@original").extract_first().strip('//')
                obj = ChouTiItem(title=title, url=url)
                yield obj
  - items.py:
    import scrapy

    class ChouTiItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
  - pipelines.py
  - Pipeline execution order (sketched in pseudo-code after this list):
    1. Check whether the Pipeline class defines a from_crawler method.
       If it does: obj = Pipeline.from_crawler(crawler)
       If it does not: obj = Pipeline()
    2. When the spider opens: obj.open_spider(spider)
    3. While the spider runs, parse() keeps yielding items and each one is passed to obj.process_item(item, spider)
    4. When the spider closes: obj.close_spider(spider)
  - Usually overriding process_item alone is enough.
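The lifecycle above can be summarised as a conceptual sketch (not Scrapy's actual source; crawler, spider and items_yielded_by_parse are stand-in names):

    # conceptual sketch of how Scrapy drives one pipeline object
    if hasattr(SavePipeline, 'from_crawler'):
        obj = SavePipeline.from_crawler(crawler)
    else:
        obj = SavePipeline()
    obj.open_spider(spider)
    for item in items_yielded_by_parse:   # items yielded by parse()
        item = obj.process_item(item, spider)
    obj.close_spider(spider)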
  - Example (pipelines.py):
    from scrapy.exceptions import DropItem

    class SavePipeline(object):
        def __init__(self, v):
            self.v = v
            self.file = open('chouti.txt', 'a+')

        def process_item(self, item, spider):
            # Persist the item here.
            # Returning the item hands it on to any later pipelines.
            self.file.write(str(item))
            return item
            # To discard the item so later pipelines never see it:
            # raise DropItem()

        @classmethod
        def from_crawler(cls, crawler):
            """
            Called at start-up to create the pipeline object.
            :param crawler:
            :return:
            """
            val = crawler.settings.get('SIX')
            return cls(val)

        def open_spider(self, spider):
            """
            Called when the spider starts.
            :param spider:
            :return:
            """
            print('spider opened')

        def close_spider(self, spider):
            """
            Called when the spider closes.
            :param spider:
            :return:
            """
            print('spider closed')

    # In settings.py. The integer after each entry determines the running order:
    # items pass through the pipelines from the lowest number to the highest,
    # and the numbers are conventionally kept in the 0-1000 range.
    # Once raise DropItem() is hit, later pipelines are skipped.
    ITEM_PIPELINES = {
        'fone.pipelines.SavePipeline': 300,
    }
- Note: ITEM_PIPELINES in settings.py is global (every spider in the project runs these pipelines). To treat individual spiders differently, check spider.name inside the pipeline method in pipelines.py:
    def process_item(self, item, spider):
        if spider.name == 'chouti':
            pass
- Deduplication
  - Default dedup settings:
    DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'   # 'scrapy.dupefilters.RFPDupeFilter' in newer Scrapy versions
    DUPEFILTER_DEBUG = False
    # Directory where visited-request fingerprints are stored, e.g. "/root/" gives the final path /root/requests.seen
    JOBDIR = ""
  - Custom dedup rule:
    - Create a new file rfd.py; the key method to implement is request_seen.
    class RepeatUrl:
        def __init__(self):
            self.visited_url = set()

        @classmethod
        def from_settings(cls, settings):
            """
            Called at start-up.
            :param settings:
            :return:
            """
            return cls()

        def request_seen(self, request):
            """
            Check whether this request has already been visited.
            :param request:
            :return: True if already visited; False if not visited yet
            """
            if request.url in self.visited_url:
                return True
            self.visited_url.add(request.url)
            return False

        def open(self):
            """
            Called when crawling starts.
            :return:
            """
            print('open replication')

        def close(self, reason):
            """
            Called when the crawl finishes.
            :param reason:
            :return:
            """
            print('close replication')

        def log(self, request, spider):
            """
            Log a duplicate request.
            :param request:
            :param spider:
            :return:
            """
            print('repeat', request.url)
    - In settings.py, point DUPEFILTER_CLASS at the new class (the module path matches the rfd.py file and RepeatUrl class above):
    DUPEFILTER_CLASS = 'fone.rfd.RepeatUrl'
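For a single request you can also bypass the dedup filter entirely with the standard dont_filter argument of Request:

    # this request is scheduled even if the URL has been seen before
    yield Request(url=page, callback=self.parse, dont_filter=True)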
- Custom extension
    from scrapy import signals

    class MyExtension(object):
        def __init__(self, value):
            self.value = value

        @classmethod
        def from_crawler(cls, crawler):
            val = crawler.settings.getint('SIX')
            ext = cls(val)
            # hook the extension into spider lifecycle signals
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            print('open')

        def spider_closed(self, spider):
            print('close')

    # In settings.py:
    EXTENSIONS = {
        'fone.extensions.MyExtension': 100,
    }
- For the full list of available signals, see scrapy.signals (from scrapy import signals).
-
Spider middleware (SpiderMiddleware) methods and when they run (execution order in parentheses); a sketch of such a class follows this list:
- process_spider_input: called after a download finishes, before the response is handed to parse (2)
- process_spider_output: called with what the spider returns; must return an iterable of Request or Item objects (3)
- process_spider_exception: called on an exception; return None to let later middlewares keep handling it, or an iterable of Request or Item objects to hand to the scheduler or the pipeline (4)
- process_start_requests: called when the spider starts, with an iterable of Request objects (1)
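A minimal SpiderMiddleware sketch with these four hooks (the method signatures follow the Scrapy middleware interface; the class itself and its behaviour are illustrative):

    class MySpiderMiddleware(object):
        def process_spider_input(self, response, spider):
            # runs for each response before it reaches the spider callback
            return None

        def process_spider_output(self, response, result, spider):
            # runs on whatever the callback returns; must yield Request/Item objects
            for item_or_request in result:
                yield item_or_request

        def process_spider_exception(self, response, exception, spider):
            # return None to let other middlewares handle the exception
            return None

        def process_start_requests(self, start_requests, spider):
            # runs once over the spider's start requests
            for request in start_requests:
                yield request

    # enable it in settings.py ('fone' is the example project name used above)
    SPIDER_MIDDLEWARES = {
        'fone.middlewares.MySpiderMiddleware': 543,
    }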
-
Downloader middleware (DownloaderMiddleware) example code and notes:
- Downloader middlewares have a wide range of uses; the most important cases are the different return values of process_request, in particular None versus a Response object.
    class DownMiddleware1(object):
        def process_request(self, request, spider):
            """
            Called for every request that needs downloading, through every downloader middleware's process_request.
            :param request:
            :param spider:
            :return:
                None: continue to the next middleware and download as usual
                Response object: stop running process_request and start running process_response
                Request object: stop the middleware chain and hand the Request back to the scheduler
                raise IgnoreRequest: stop running process_request and start running process_exception
            """
            pass

        def process_response(self, request, response, spider):
            """
            Called with the response returned from the downloader.
            :param request:
            :param response:
            :param spider:
            :return:
                Response object: passed on to the other middlewares' process_response
                Request object: stop the middleware chain; the request is rescheduled for download
                raise IgnoreRequest: Request.errback is called
            """
            print('response1')
            return response

        def process_exception(self, request, exception, spider):
            """
            Called when the download handler or a process_request (downloader middleware) raises an exception.
            :param request:
            :param exception:
            :param spider:
            :return:
                None: keep passing the exception to later middlewares
                Response object: stop running the remaining process_exception methods
                Request object: stop the middleware chain; the request is rescheduled for download
            """
            return None
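To actually enable the middleware, register it in settings.py; the module path below assumes the class lives in fone/middlewares.py:

    DOWNLOADER_MIDDLEWARES = {
        'fone.middlewares.DownMiddleware1': 543,
    }

As a sketch of the "return a Response" case, process_request can short-circuit the real download and answer with a fabricated response (illustrative only):

    from scrapy.http import HtmlResponse

    class StubDownloaderMiddleware(object):
        def process_request(self, request, spider):
            # skip the real download and answer with a stubbed page
            return HtmlResponse(url=request.url, body=b'<html></html>', encoding='utf-8', request=request)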
- Custom commands
  - Create a directory at the same level as spiders/, e.g. commands
  - Inside it create crawlall.py (the file name becomes the command name):
    from scrapy.commands import ScrapyCommand

    class Command(ScrapyCommand):
        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def run(self, args, opts):
            # crawl every spider registered in the project, then start the crawl
            spider_list = self.crawler_process.spiders.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()
  - In settings.py add: COMMANDS_MODULE = '<project name>.<directory name>'
  - Run in the project directory: scrapy crawlall