Optimizing Scrapy request order with priority

# Problem: the download-and-store step never runs early (for the first few minutes the spider only requests URLs that return more URLs, never reaching the database write); fix this by optimizing the request order.
Spider file:
Method: pass priority=number to scrapy.Request (default 0; the higher the value, the sooner the request is scheduled).
import json
from urllib.parse import quote

import scrapy

# Assumed import path; adjust to match your project's items module
from ..items import LiepinItem

def parse(self, response):
    res = response.selector.re('<a><span>(.*?)</span></a>')
    for val in res:
        val = quote(val)
        # use range(1, 61) to cover all 60 pages
        for i in range(1, 60):
            url = f'https://fe-api.zhaopin.com/c/i/sou?start={60*i}&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw={val}&kt=3&lastUrlQuery=%7B%22p%22:{i},%22pageSize%22:%2260%22,%22jl%22:%22530%22,%22kw%22:%22{val}%22,%22kt%22:%223%22%7D&at=54721ddd55fd4f8ca9f2080ab3dfb7ea&_v=0.64103108'
            # priority in a Request defaults to 0; higher values are scheduled first, and negative values are allowed
            yield scrapy.Request(url=url, callback=self.parseone)

def parseone(self, response):
    # The final round of requests; their responses are parsed and stored in the database
    res = json.loads(response.text)['data']['results']
    for i in res:
        url = 'https://jobs.zhaopin.com/' + i['number'] + '.htm'
        print(url)
        # Raise the priority so these requests jump the queue and reach the database-storage step sooner
        yield scrapy.Request(url=url, callback=self.parsetwo, priority=10)


def parsetwo(self, response):
    jobname = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[1]/h1/text()').extract_first()
    time = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[2]/div[1]/span/span/text()').extract_first()
    url = 'https://www.zhaopin.com/'
    salary = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[1]/div[1]/strong/text()').extract_first()
    station = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[2]/div[2]/span[1]/a/text()').extract_first()
    degree = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[2]/div[2]/span[3]/text()').extract_first()
    experience = response.xpath('/html/body/div[1]/div[3]/div[4]/div/ul/li[2]/div[2]/span[2]/text()').extract_first()

    desc = response.xpath("//div[@class='responsibility pos-common']//text()").getall()
    desc = ''.join(i.strip() for i in desc)

    item = LiepinItem()
    item['jobname'] = str(jobname)
    item['time'] = time
    item['url'] = url
    item['salary'] = salary
    item['station'] = station
    item['degree'] = degree
    item['experience'] = experience
    item['desc'] = desc
    return item
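An alternative to hand-setting priority on every Request is Scrapy's depth-based priority adjustment. The settings below are standard Scrapy options, but the sign convention is easy to get backwards, so verify against your Scrapy version's documentation before relying on it:

```python
# settings.py -- sketch: make deeper requests (detail pages) run earlier
# via depth-based priority, instead of passing priority=10 by hand.

# DepthMiddleware adjusts priority by: request.priority -= depth * DEPTH_PRIORITY
# A negative value therefore *raises* the priority of deeper requests,
# pushing detail pages ahead of listing pages still in the queue.
DEPTH_PRIORITY = -1
```

With this in place the yield scrapy.Request(..., priority=10) calls above become unnecessary, since every detail page is one level deeper than the listing page that produced it.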
