The project requirement is to collect every enterprise phone number beginning with 400.
Enterprise 400 numbers start with one of six prefixes, ['4000', '4001', '4006', '4007', '4008', '4009'], for a total of 6,000,000 numbers; each number takes one request, so 6,000,000 requests in all.
Since time is tight, a distributed crawling strategy is needed.
Task analysis:
1. scrapy-redis stores the URLs to be crawled, but the URLs are long and would waste memory on the Redis server; storing only the 10-digit number is cheaper.
2. Redis should not hold too many phone numbers at once; the queue length must be kept within a controlled range.
3. Pushes to Redis should be batched, to keep insertion efficient and avoid one network round trip per number.
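Point 1 can be quantified with a quick back-of-the-envelope check. The sketch below compares storing the bare 10-digit number against storing the full search URL that the spider later rebuilds (payload bytes only, ignoring Redis's per-key overhead):

```python
# One queue entry: bare phone number vs. the full search URL built from it.
phone = "4000010000"
url = "http://www.baidu.com.cn/s?wd=" + phone + "&cl=3"

per_entry_saved = len(url) - len(phone)  # bytes saved per entry (payload only)
total_saved_mb = per_entry_saved * 6_000_000 / 1024 / 1024

print(len(phone), len(url), round(total_saved_mb, 1))
```

Across 6,000,000 entries the difference is on the order of a couple of hundred megabytes of payload alone, which is why only the number goes into Redis.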
Based on this analysis, we implement it with a plain scrapy + redis setup.
The Redis feeder script:
# coding: utf-8
import time

from settings import get_redis  # project helper that returns a configured redis client

red = get_redis()
MAX_LENGTH = 100000

if __name__ == '__main__':
    for phone_bef in ['4000', '4001', '4006', '4007', '4008', '4009']:
        for phone_mid in range(0, 100):  # 00..99; range(1, 100) would skip the '00' block
            while True:
                l = red.llen('phone400list')
                if l > MAX_LENGTH:
                    print('400call length: {}'.format(l))
                    time.sleep(20)
                else:
                    pipe = red.pipeline()
                    for phone_la in range(0, 10000):
                        pipe.lpush('phone400list',
                                   '{}{:02d}{:04d}'.format(phone_bef, phone_mid, phone_la))
                    pipe.execute()  # one round trip pushes all 10,000 numbers
                    break
The overall idea: check the queue length in Redis; if it exceeds MAX_LENGTH, sleep 20 seconds and check again, otherwise push the next batch of 10,000 numbers.
pipeline.lpush batches all the inserts into a single round trip, avoiding a separate network call for each number.
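The three-part format (4-digit prefix + 2-digit middle + 4-digit tail) enumerates exactly the 6,000,000 numbers claimed at the top. A standalone check, pure Python with no Redis required:

```python
prefixes = ['4000', '4001', '4006', '4007', '4008', '4009']

def all_phones():
    """Yield every 10-digit number the feeder script would push."""
    for pre in prefixes:
        for mid in range(100):         # middle block '00'..'99'
            for last in range(10000):  # tail '0000'..'9999'
                yield f"{pre}{mid:02d}{last:04d}"

count = sum(1 for _ in all_phones())
print(count)  # 6 prefixes x 100 middles x 10000 tails = 6,000,000
```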
The Scrapy spider is modified as follows:
# coding: utf-8
import re
import time

import redis
from scrapy import Spider, Request

from items import Phone400Item  # project item with 'phone' and 'company' fields
from settings import get_redis  # project helper that returns a configured redis client


class Phone400Spider(Spider):
    name = 'phone400'
    start_urls = ['https://open.onebox.so.com/dataApi']
    custom_settings = {
        'ITEM_PIPELINES': {
            'pipelines.Phone400Pipeline': 200,
        },
    }

    def parse(self, response):
        red = get_redis()
        while True:
            try:
                # blpop blocks until an entry arrives or the timeout expires;
                # it returns a (key, value) tuple, or None on timeout.
                popped = red.blpop('phone400list', timeout=20)
                if popped is None:
                    continue
                phone = popped[1].decode('utf-8')  # redis-py returns bytes by default
            except redis.RedisError:
                time.sleep(20)
                continue
            url = 'http://www.baidu.com.cn/s?wd=' + phone + '&cl=3'
            request = Request(url=url, callback=self.parse_detail, dont_filter=True)
            request.meta['phone'] = phone
            yield request

    def parse_detail(self, response):
        item = Phone400Item()
        item['phone'] = response.meta['phone']
        # company_re and company_re2 are regexes defined elsewhere in the project
        company = re.search(company_re, response.text)
        company2 = re.search(company_re2, response.text)
        # "客服热线" is the "customer-service hotline" label on the result page
        company3 = response.xpath(
            u'//td[text()="客服热线" and @class="op_kefu_td1"]//ancestor::div[@class="c-row"]/ancestor::div[2]//text()').extract()
        if company:
            item['company'] = company.group(1)
            yield item
        elif company2:
            item['company'] = company2.group(1)
            yield item
        elif company3:
            item['company'] = ''.join(company3).split(u'客服热线')[0].strip()
            yield item
Only the parse part was modified; everything else is no different from a normal Scrapy project.
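The `Phone400Pipeline` registered in `custom_settings` is not shown above. A minimal sketch of what it might look like, assuming the results simply go to a CSV file (the file name is an assumption; only the `phone` and `company` fields come from the item as populated in `parse_detail`):

```python
import csv

class Phone400Pipeline:
    """Hypothetical item pipeline: append each scraped record to a CSV file."""

    def open_spider(self, spider):
        # Open in append mode so restarts do not clobber earlier results.
        self.file = open('phone400_results.csv', 'a', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item['phone'], item['company']])
        return item

    def close_spider(self, spider):
        self.file.close()
```

Scrapy calls `open_spider`/`close_spider` once per run and `process_item` per item, so the pipeline holds the file handle open instead of reopening it for every record.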