How to Implement a Scrapy Framework Yourself: Project Practice (Part 8)

Latest recommended article published on 2024-08-10 17:00:15

无敌的白金之星

Category: Crawler Learning. Tags: scrapy

Article link: https://blog.csdn.net/m0_38106113/article/details/81459378

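This article shows how to use the scrapy framework to build two crawler projects: Tencent Recruitment and Sina rolling news. In the Tencent Recruitment crawler, a missing User-Agent caused the run to fail, and adding a User-Agent in settings.py fixed it. For the Sina news crawler, the article gives the project path of sina.py and explains how to enable that spider in settings.py.

Summary generated automatically by CSDN.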
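
The User-Agent fix mentioned in the summary might look roughly like the sketch below. The setting name `DEFAULT_HEADERS` and the helper `build_headers` are hypothetical, since the real setting name used by the framework is not shown in this section, and the User-Agent string is only an example browser value.

```python
# settings.py (sketch). DEFAULT_HEADERS is an assumed name, not taken from the article.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36"
    ),
}


def build_headers(request_headers=None):
    """Hypothetical helper: merge per-request headers over the project defaults."""
    headers = dict(DEFAULT_HEADERS)        # copy so the defaults are never mutated
    headers.update(request_headers or {})  # per-request values take precedence
    return headers
```

With something like this in place, a request that sets no headers of its own would still carry a browser-like User-Agent, while an individual request could override it when needed.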

#Tencent Recruitment Spider Example
##1 Tencent Recruitment spider code

from scrapy_plus.core.spider import Spider
from scrapy_plus.http.request import Request


class TencentSpider(Spider):

    name = 'tencent'
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):  # parse the responses for start_urls
        print(response.url + '*****')
        tr_list = response.xpath('//*[@class="tablelist"]//tr')[1:-1]
        print(len(tr_list))

        for tr in tr_list:
            item = {}
            # Extract part of the data from the list page
            item['name'] = tr.xpath('./td[1]/a/text()')[0]
            item['address'] = tr.xpath('./td[4]/text()')[0]
            item['time'] = tr.xpath('./td[5]/text()')[0]
            # Build the detail page URL and send a request for it
            detail_url = 'https://hr.tencent.com/' + tr.xpath('./td[1]/a/@href')[0]
            print(detail_url)
            yield Request(
                detail_url,
                parse='parse_detail',
                meta=item  # meta takes a dict
            )
        # Pagination: follow the "next page" link unless it is the disabled placeholder
        next_href = response.xpath('//a[text()="下一页"]/@href')[0]
        print(next_href)
        next_url = 'https://hr.tencent.com/' + next_href
        if next_href != 'javascript:;':
            yield Request(next_url, parse='parse')

    def parse_detail(self, respo