Goal: scrape every page of Tencent's social recruitment listings (job title and link, job type, headcount, work location, publish date) and store the results in a JSON file.
Approach:
1. Create a new scrapy project
2. Define the stored field names and types in items.py
3. Write the spider file under the spiders folder
4. Set up the pipeline file
5. Configure settings.py
6. Test run
The actual workflow:
1. Create the scrapy project tencent
2. Define the stored field names and types in items.py
3. Write the spider file under the spiders folder
4. Set up the pipeline file
5. Configure settings.py
6. Test run
7. Result: the scraped data lands in tencent.json, 373 pages and 3733 records in total.
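The pipeline below writes tencent.json as one JSON object per line (JSON Lines), not a single JSON array. A minimal sketch of reading the records back, assuming that line-per-record format (the path is illustrative):

```python
import json

def load_jobs(path='tencent.json'):
    # Each non-empty line is one job record serialized by the pipeline
    jobs = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                jobs.append(json.loads(line))
    return jobs
```

This is also why a plain `json.load()` on the whole file would fail: the file is a sequence of objects, not a single document.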
The CrawlSpider version of the Tencent recruitment scraper
Create the project tencent_CrawlSpiders:
scrapy startproject tencent_CrawlSpiders
items.py:
# -*- coding: utf-8 -*-
import scrapy


class TencentCrawlspidersItem(scrapy.Item):
    # job title
    job_Title = scrapy.Field()
    # detail link
    job_Link = scrapy.Field()
    # job type
    job_Type = scrapy.Field()
    # headcount
    job_Number = scrapy.Field()
    # work location
    job_Location = scrapy.Field()
    # publish date
    job_PublicDate = scrapy.Field()
Create the spider file tencent_shezhao.py and restrict the crawl to the target domain:
scrapy genspider -t crawl tencent_shezhao hr.tencent.com
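The `allow='start=\d+'` pattern passed to the LinkExtractor below only has to match somewhere inside a candidate URL for the link to be followed. A quick standalone check of that regex against sample pagination URLs (the URLs are illustrative):

```python
import re

# The same pattern the LinkExtractor uses to pick pagination links;
# re.search matches anywhere in the URL, so "position.php?&start=10#a" qualifies
page_pattern = re.compile(r'start=\d+')

urls = [
    'https://hr.tencent.com/position.php?&start=0#a',
    'https://hr.tencent.com/position.php?&start=10#a',
    'https://hr.tencent.com/about.php',
]
matched = [u for u in urls if page_pattern.search(u)]
```

Only the two pagination URLs survive the filter; the about page is ignored.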
Spider file tencent_shezhao.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from tencent_CrawlSpiders.items import TencentCrawlspidersItem


class TencentShezhaoSpider(CrawlSpider):
    name = 'tencent_shezhao'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']

    # LinkExtractor matching the pagination links (raw string for the regex)
    page_link = LinkExtractor(allow=r'start=\d+')

    rules = (
        Rule(page_link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Job rows alternate between class="odd" and class="even"
        job_list = response.xpath('//tr[@class="odd"] | //tr[@class="even"]')
        for each in job_list:
            # Create a fresh item per row; reusing one item across iterations
            # would yield the same mutated object for every row
            item = TencentCrawlspidersItem()
            item['job_Title'] = each.xpath('./td[1]/a/text()').extract_first()
            item['job_Link'] = each.xpath('./td[1]/a/@href').extract_first()
            item['job_Type'] = each.xpath('./td[2]/text()').extract_first()
            item['job_Number'] = each.xpath('./td[3]/text()').extract_first()
            item['job_Location'] = each.xpath('./td[4]/text()').extract_first()
            item['job_PublicDate'] = each.xpath('./td[5]/text()').extract_first()
            yield item
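The row-extraction logic in parse_item can be exercised without running Scrapy. A minimal sketch using the standard library's ElementTree on a hand-written fragment (this HTML is a simplified stand-in for the real page, not its actual markup):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one page of the job table
html = """<table>
  <tr class="odd">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-05-01</td>
  </tr>
  <tr class="even">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td>Product</td><td>1</td><td>Beijing</td><td>2018-05-02</td>
  </tr>
</table>"""

root = ET.fromstring(html)
# Same selection idea as the spider: odd rows plus even rows
rows = root.findall('.//tr[@class="odd"]') + root.findall('.//tr[@class="even"]')

jobs = []
for tr in rows:
    tds = tr.findall('./td')
    jobs.append({
        'job_Title': tds[0].find('a').text,
        'job_Link': tds[0].find('a').get('href'),
        'job_Type': tds[1].text,
        'job_Number': tds[2].text,
        'job_Location': tds[3].text,
        'job_PublicDate': tds[4].text,
    })
```

Real pages are rarely well-formed XML, which is why Scrapy uses an HTML-tolerant parser; the sketch only mirrors the per-row field mapping.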
pipelines.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentCrawlspidersPipeline(object):
    def open_spider(self, spider):
        # Open the file once for the whole crawl instead of per item
        self.filename = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output;
        # write the str directly (concatenating encoded bytes with '\n'
        # would raise a TypeError on Python 3)
        jsontext = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.filename.write(jsontext)
        return item

    def close_spider(self, spider):
        self.filename.close()
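The `ensure_ascii=False` flag is what keeps the Chinese field values human-readable in tencent.json; with the default `ensure_ascii=True`, every non-ASCII character would be written as a `\uXXXX` escape. A quick illustration:

```python
import json

record = {'job_Title': '高级工程师', 'job_Location': '深圳'}

escaped = json.dumps(record)                       # default: \uXXXX escapes
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is
```

Both strings parse back to the same dict; the flag only affects how the file reads to a human.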
settings.py:
# -*- coding: utf-8 -*-
BOT_NAME = 'tencent_CrawlSpiders'

SPIDER_MODULES = ['tencent_CrawlSpiders.spiders']
NEWSPIDER_MODULE = 'tencent_CrawlSpiders.spiders'

# Log file and level: messages at INFO level or above go to the log file,
# so the terminal shows no output while the spider runs
LOG_FILE = 'tencent_CrawlSpiders.log'
LOG_LEVEL = 'INFO'

ITEM_PIPELINES = {
    'tencent_CrawlSpiders.pipelines.TencentCrawlspidersPipeline': 300,
}
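The 300 in ITEM_PIPELINES is a priority: Scrapy runs every enabled pipeline's process_item in ascending priority order, threading each item through the chain. A minimal sketch of that ordering logic (both pipeline classes here are hypothetical, not part of this project):

```python
# Hypothetical pipelines illustrating ITEM_PIPELINES ordering:
# lower priority numbers run earlier in the chain
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        return {k: v.strip() if isinstance(v, str) else v
                for k, v in item.items()}

class TagSourcePipeline:
    def process_item(self, item, spider):
        item['source'] = 'tencent'
        return item

pipelines_config = {StripWhitespacePipeline: 100, TagSourcePipeline: 300}

def run_pipelines(item, pipelines):
    # Mirror Scrapy's behavior: sort by priority, pass the item along the chain
    for pipeline_cls in sorted(pipelines, key=pipelines.get):
        item = pipeline_cls().process_item(item, spider=None)
    return item
```

With this ordering, whitespace is stripped before the tagging pipeline sees the item; swapping the priorities would reverse that.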