Scraping the Tencent Careers Site with CrawlSpider

守云开见月明

Article link: https://blog.csdn.net/qq_16963597/article/details/84750391

1. First, create the Scrapy project:

scrapy startproject TencentSpider

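2. Open the project directory and, inside its spiders folder, run:

scrapy genspider -t crawl tencent tencent.com

This generates the spider file tencent.py from the crawl template. Note that the template name is lowercase.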

3. Configure the spider

4. Write the tencent.py spider:

The source code is as follows:

# -*- coding: utf-8 -*-
import scrapy
# Import the link extractor, which collects the links that match a given rule
from scrapy.linkextractors import LinkExtractor
# Import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
from TencentSpider.items import TencentspiderItem

class TencentSpider(CrawlSpider):
    name = 'tencent'
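    allowed_domains = ['hr.tencent.com']          # restrict the crawl to this domain
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']     # initial URL
    # Extract the pagination links from each list page, request them in turn,
    # keep following new ones, and pass every response to the callback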
    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each job posting is a table row whose class is "even" or "odd"
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            item = TencentspiderItem()
            # extract_first() returns None instead of raising IndexError
            # when a cell is empty (the job category cell sometimes is)
            # Job title
            item['positionname'] = each.xpath("./td[1]/a/text()").extract_first()
            # Link to the detail page
            item['positionlink'] = each.xpath("./td[1]/a/@href").extract_first()
            # Job category
            item['positionType'] = each.xpath("./td[2]/text()").extract_first()
            # Number of openings
            item['peopleNum'] = each.xpath("./td[3]/text()").extract_first()
            # Work location
            item['workLocation'] = each.xpath("./td[4]/text()").extract_first()
            # Publication date
            item['publishTime'] = each.xpath("./td[5]/text()").extract_first()

            yield item

The settings file is essentially the same as for any ordinary Scrapy project, so it is omitted here.
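To see which links the rule in the spider follows, it can help to try the `allow` pattern on its own. `LinkExtractor` applies each `allow` pattern as a regular-expression search against the extracted URL, so the sketch below approximates that step using only the standard `re` module. The URLs in the list are hypothetical examples made up for illustration, not copied from the live site, and the helper name `is_followed` is likewise invented here.

```python
import re

# Same pattern as LinkExtractor(allow=r'start=\d+') in the spider.
ALLOW = re.compile(r'start=\d+')

# Hypothetical hrefs, for illustration only.
candidate_urls = [
    'http://hr.tencent.com/position.php?&start=10#a',
    'http://hr.tencent.com/position.php?keywords=python&start=20#a',
    'http://hr.tencent.com/position_detail.php?id=12345',
    'http://hr.tencent.com/about.php',
]


def is_followed(url):
    """Return True if the allow pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


followed = [u for u in candidate_urls if is_followed(u)]
for url in followed:
    print(url)
```

Only the two list-page URLs containing `start=<number>` pass the filter. In the real spider, `allowed_domains` and Scrapy's duplicate filtering narrow the set further, which this sketch does not model.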
