Python crawler sleep: can time.sleep() be used in pyspider?

Latest recommended article published on 2023-12-27 16:12:40

weixin_39665992


Article tags: python crawler sleep

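I recently started writing crawlers with pyspider. Because I keep getting banned, I wanted to lower the crawl rate. I tried calling time.sleep() in the script, but it does not behave the way I expected.

A minimal example script: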

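import time

from pyspider.libs.base_handler import *


# Wrapper class from the default pyspider script template.
class Handler(BaseHandler):

    @every(seconds=1)
    def on_start(self):
        # Log the time at which the list page is queued.
        cur_time = time.ctime()
        with open('/var/www/pyspider/time.txt', 'a') as file_object:
            file_object.write("url:http://xxx/list.html time:" + cur_time + "\n")
        self.crawl('http://xxx/list.html', callback=self.index_page)

    @config(age=1)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            timestr = time.time()
            # taskid was undefined in the original post; building it from the
            # URL and timestr is an assumption made so the script runs.
            taskid = '%s-%s' % (each.attr.href, timestr)
            self.crawl(each.attr.href, taskid=taskid, callback=self.detail_page)
            # Intended to space the detail requests 5 seconds apart.
            time.sleep(5)

    @config(priority=2)
    def detail_page(self, response):
        # Log the time at which each detail page was actually fetched.
        cur_time = time.ctime()
        with open('/var/www/pyspider/time.txt', 'a') as file_object:
            file_object.write("url:" + response.url + " time:" + cur_time + "\n")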

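rate/burst is set to 0.5/3.

The script does not fetch one page every 5 seconds. Instead it sleeps for 35 seconds in total (the for loop runs 7 times), and after that the fetches still follow the rate/burst setting. The log file contains: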

url:http://xxx/list.html time:Thu Dec 29 01:46:24 2016

url:http://xxx/6.html time:Thu Dec 29 01:47:00 2016

url:http://xxx/4.html time:Thu Dec 29 01:47:00 2016

url:http://xxx/1.html time:Thu Dec 29 01:47:00 2016

url:http://xxx/2.html time:Thu Dec 29 01:47:02 2016

url:http://xxx/3.html time:Thu Dec 29 01:47:04 2016

url:http://xxx/5.html time:Thu Dec 29 01:47:06 2016

url:http://xxx/7.html time:Thu Dec 29 01:47:08 2016
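
For what it is worth, the spacing in this log (three fetches in the same second, then one every 2 seconds) is what a token bucket with rate 0.5 and burst 3 would produce if all 7 tasks arrived at once. The sketch below is a simplified model written for illustration, not pyspider's own scheduler code; `dispatch_offsets` is a hypothetical name, and it assumes the bucket starts full and each task costs one token.

```python
def dispatch_offsets(n_tasks, rate, burst):
    """Return the offsets (in seconds) at which n_tasks queued at t=0 are released.

    Simplified token bucket: starts with `burst` tokens, refills at
    `rate` tokens per second, and each task consumes one token.
    """
    offsets = []
    tokens = float(burst)
    now = 0.0
    for _ in range(n_tasks):
        if tokens < 1.0:
            # Wait just long enough for one full token to accumulate.
            now += (1.0 - tokens) / rate
            tokens = 1.0
        tokens -= 1.0
        offsets.append(now)
    return offsets


print(dispatch_offsets(7, rate=0.5, burst=3))
# [0.0, 0.0, 0.0, 2.0, 4.0, 6.0, 8.0]
```

Measured from 01:47:00, these offsets line up with the seven detail-page timestamps above.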

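So after on_start ran, the script slept for 35 seconds and then fetched according to rate/burst. What mechanism causes this? Besides rate, is there any other way to customize the crawl rate?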
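
One likely explanation, as far as I understand pyspider: self.crawl() inside a callback only records a new task, and the collected tasks are handed to the scheduler after the callback returns. A time.sleep() in the loop therefore delays the whole batch (7 x 5 = 35 seconds) rather than spacing the individual fetches. A possible alternative is the documented `exetime` parameter of self.crawl(), which takes a unix timestamp for when the task should run. The sketch below only computes such timestamps; `staggered_exetimes` is a hypothetical helper, the pyspider usage is shown in comments and has not been tested against this setup, and the scheduler's rate/burst would still apply on top.

```python
import time


def staggered_exetimes(count, interval, start=None):
    """Return `count` unix timestamps spaced `interval` seconds apart."""
    if start is None:
        start = time.time()
    return [start + i * interval for i in range(count)]


# Hypothetical use inside index_page (requires pyspider, not run here):
#   links = [each.attr.href for each in response.doc('a[href^="http"]').items()]
#   for url, when in zip(links, staggered_exetimes(len(links), 5)):
#       self.crawl(url, callback=self.detail_page, exetime=when)

print(staggered_exetimes(3, 5, start=1000))
# [1000, 1005, 1010]
```

With this approach the callback returns immediately, and the delay is expressed per task instead of blocking the processor.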
