The project requirement is to collect every enterprise phone number beginning with 400.
Enterprise 400 numbers start with one of six prefixes, ['4000', '4001', '4006', '4007', '4008', '4009'], for a total of 6,000,000 numbers; each number takes one request, so 6,000,000 requests in all.
Since time is tight, a distributed crawling strategy is needed.
Task analysis:
1. scrapy-redis stores the URLs to be crawled, but the URLs are long and would waste memory on the Redis server; storing only the 10-digit number is cheaper.
2. Redis should not hold too many phone numbers at once; the queue length must be kept within a controlled range.
3. Pushes to Redis should be batched, to keep insertion efficient and avoid one network round trip per number.
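Point 1 can be quantified with a quick back-of-the-envelope check. The sketch below compares storing the bare 10-digit number against storing the full search URL that the spider later rebuilds (payload bytes only, ignoring Redis's per-key overhead):

```python
# One queue entry: bare phone number vs. the full search URL built from it.
phone = "4000010000"
url = "http://www.baidu.com.cn/s?wd=" + phone + "&cl=3"

per_entry_saved = len(url) - len(phone)  # bytes saved per entry (payload only)
total_saved_mb = per_entry_saved * 6_000_000 / 1024 / 1024

print(len(phone), len(url), round(total_saved_mb, 1))
```

Across 6,000,000 entries the difference is on the order of a couple of hundred megabytes of payload alone, which is why only the number goes into Redis.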
Based on this analysis, we implement it with a plain scrapy + redis setup.
The Redis feeder script:
# coding: utf-8
import time

from settings import get_redis  # project helper that returns a configured redis client

red = get_redis()
MAX_LENGTH = 100000

if __name__ == '__main__':
    for phone_bef in ['4000', '4001', '4006', '4007', '4008', '4009']:
        for phone_mid in range(0, 100):  # 00..99; range(1, 100) would skip the '00' block
            while True:
                l = red.llen('phone400list')
                if l > MAX_LENGTH:
                    print('400call length: {}'.format(l))
                    time.sleep(20)
                else:
                    pipe = red.pipeline()
                    for phone_la in range(0, 10000):
                        pipe.lpush('phone400list',
                                   '{}{:02d}{:04d}'.format(phone_bef, phone_mid, phone_la))
                    pipe.execute()  # one round trip pushes all 10,000 numbers
                    break
The overall idea: check the queue length in Redis; if it exceeds MAX_LENGTH, sleep 20 seconds and check again, otherwise push the next batch of 10,000 numbers.
pipeline.lpush batches all the inserts into a single round trip, avoiding a separate network call for each number.
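The three-part format (4-digit prefix + 2-digit middle + 4-digit tail) enumerates exactly the 6,000,000 numbers claimed at the top. A standalone check, pure Python with no Redis required:

```python
prefixes = ['4000', '4001', '4006', '4007', '4008', '4009']

def all_phones():
    """Yield every 10-digit number the feeder script would push."""
    for pre in prefixes:
        for mid in range(100):         # middle block '00'..'99'
            for last in range(10000):  # tail '0000'..'9999'
                yield f"{pre}{mid:02d}{last:04d}"

count = sum(1 for _ in all_phones())
print(count)  # 6 prefixes x 100 middles x 10000 tails = 6,000,000
```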
The Scrapy spider is modified as follows:
# coding: utf-8
import re
import time

import redis
from scrapy import Spider, Request

from items import Phone400Item  # project item with 'phone' and 'company' fields
from settings import get_redis  # project helper that returns a configured redis client


class Phone400Spider(Spider):
    name = 'phone400'
    start_urls = ['https://open.onebox.so.com/dataApi']
    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.Phone400Pipeline': 200,
        },
    }

    def parse(self, response):
        red = get_redis()
        while True:
            try:
                # blpop blocks until an entry arrives or the timeout expires;
                # it returns a (key, value) tuple, or None on timeout.
                popped = red.blpop('phone400list', timeout=20)
                if popped is None:
                    continue
                phone = popped[1].decode('utf-8')  # redis-py returns bytes by default
            except redis.RedisError:
                time.sleep(20)
                continue
            url = 'http://www.baidu.com.cn/s?wd=' + phone + '&cl=3'
            request = Request(url=url, callback=self.parse_detail, dont_filter=True)
            request.meta['phone'] = phone
            yield request

    def parse_detail(self, response):
        item = Phone400Item()
        item['phone'] = response.meta['phone']
        # company_re and company_re2 are regexes defined elsewhere in the project
        company = re.search(company_re, response.text)
        company2 = re.search(company_re2, response.text)
        # "客服热线" is the "customer-service hotline" label on the result page
        company3 = response.xpath(
            u'//td[text()="客服热线" and @class="op_kefu_td1"]//ancestor::div[@class="c-row"]/ancestor::div[2]//text()').extract()
        if company:
            item['company'] = company.group(1)
            yield item
        elif company2:
            item['company'] = company2.group(1)
            yield item
        elif company3:
            item['company'] = ''.join(company3).split(u'客服热线')[0].strip()
            yield item
Only the parse part was modified; everything else is no different from a normal Scrapy project.
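The `Phone400Pipeline` registered in `custom_settings` is not shown above. A minimal sketch of what it might look like, assuming the results simply go to a CSV file (the file name is an assumption; only the `phone` and `company` fields come from the item as populated in `parse_detail`):

```python
import csv

class Phone400Pipeline:
    """Hypothetical item pipeline: append each scraped record to a CSV file."""

    def open_spider(self, spider):
        # Open in append mode so restarts do not clobber earlier results.
        self.file = open('phone400_results.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['phone'], item['company']])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Scrapy calls `open_spider`/`close_spider` once per run and `process_item` per item, so the pipeline holds the file handle open instead of reopening it for every record.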