Scrapy Example 2: Tencent Careers

Latest recommended article published on 2021-10-14 19:06:32

asXt


Category: scrapy. Tags: python, json

Article link: https://blog.csdn.net/m0_48758529/article/details/108850938


Scrape job postings from Tencent Careers (https://careers.tencent.com/search.html) and save them to a JSON file.

Analyzing the page

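Right-click and view the page source: the main content of the page is not in the HTML but is loaded dynamically by a Query request, so we need to capture the network traffic.
Open the Network tab of the browser developer tools. The data for the first page comes from this URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1601278633129&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Once the data source is found, the rest is straightforward. Compare how this URL changes when you turn pages: only timestamp (a millisecond timestamp) and pageIndex (the page number) differ from page to page, so we can use format and a for loop to build the URL of every page.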
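The URL construction described above can be sketched on its own before it goes into a spider. This is a minimal sketch: the helper name `build_page_urls` and the constant `BASE_URL` are hypothetical names introduced for illustration, and the page range passed in is arbitrary.

```python
import time

# URL template copied from the request observed in the Network tab.
# The two {} placeholders are timestamp and pageIndex, the only parts that change.
BASE_URL = (
    'https://careers.tencent.com/tencentcareer/api/post/Query'
    '?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId='
    '&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10'
    '&language=zh-cn&area=cn'
)


def build_page_urls(first_page, last_page):
    """Return one API URL per page (hypothetical helper, not part of Scrapy)."""
    # The captured request uses a 13-digit value, i.e. milliseconds since the epoch
    timestamp = int(time.time() * 1000)
    return [BASE_URL.format(timestamp, index)
            for index in range(first_page, last_page + 1)]


urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

Each printed URL can be pasted into a browser to confirm that it returns the JSON for that page.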

Creating the Scrapy project

Open the Terminal in PyCharm and run scrapy startproject tencent
Enter the project directory: cd tencent

Writing the spider

Create the spider file: `scrapy genspider recruit tencent.com`

Writing the spider code

Open the newly created recruit.py
Modify start_urls

start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(int(time.time() * 1000), index) for index in range(1, 638)]

Write the parse method

def parse(self, response):
    """
    Parse the API response.
    :param response: the response returned for one page URL
    """
    result_data = json.loads(response.text)
    result = result_data['Data']['Posts']
    for temp in result:
        # Job title
        name = temp['RecruitPostName'].strip()
        # Location
        address = temp['LocationName'].strip()
        # Job responsibilities
        responsibility = temp['Responsibility'].strip()
        item = TencentItem()
        item['name'] = name
        item['duty'] = responsibility
        item['address'] = address
        yield item
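To check the field extraction without starting a crawl, the same logic can be run against a hand-written payload. This is a sketch under stated assumptions: `extract_posts` is a hypothetical helper, the sample values are made up, and only the key names (`Data`, `Posts`, `RecruitPostName`, `LocationName`, `Responsibility`) come from the parse method above. Unlike that method, the sketch falls back to an empty string when a field is missing or null.

```python
import json


def extract_posts(text):
    """Turn the API response body into a list of plain dicts (hypothetical helper)."""
    payload = json.loads(text)
    # Tolerate a missing or null 'Data' / 'Posts' instead of raising
    posts = (payload.get('Data') or {}).get('Posts') or []
    rows = []
    for post in posts:
        rows.append({
            'name': (post.get('RecruitPostName') or '').strip(),
            'address': (post.get('LocationName') or '').strip(),
            'duty': (post.get('Responsibility') or '').strip(),
        })
    return rows


# Made-up sample shaped like the response described in the article
sample = json.dumps({
    'Data': {
        'Posts': [
            {
                'RecruitPostName': ' Backend Engineer ',
                'LocationName': 'Shenzhen',
                'Responsibility': 'Build services.\n',
            },
            {'RecruitPostName': 'Designer', 'LocationName': None},
        ]
    }
})

rows = extract_posts(sample)
print(rows)
```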

The complete code is as follows

import json
import scrapy
import time
from ..items import TencentItem


class RecruitSpider(scrapy.Spider):
    name = 'recruit'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp={}&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(int(time.time() * 1000), index) for index in range(1, 638)]

    def parse(self, response):
        """
        Parse the API response.
        :param response: the response returned for one page URL
        """
        result_data = json.loads(response.text)
        result = result_data['Data']['Posts']
        for temp in result:
            # Job title
            name = temp['RecruitPostName'].strip()
            # Location
            address = temp['LocationName'].strip()
            # Job responsibilities
            responsibility = temp['Responsibility'].strip()
            item = TencentItem()
            item['name'] = name
            item['duty'] = responsibility
            item['address'] = address
            yield item

Writing items

import scrapy


class TencentItem(scrapy.Item):
    
    name = scrapy.Field()
    duty = scrapy.Field()
    address = scrapy.Field()

Writing pipelines

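import json


class TencentPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts,
        # instead of reopening it for every item
        self.fp = open('tencent.json', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False writes Chinese text as-is instead of \uXXXX escapes
        json.dump(dict(item), self.fp, ensure_ascii=False)
        self.fp.write('\n')  # one JSON object per line
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider(spider) when the spider finishes
        self.fp.close()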
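Because the pipeline appends one object per item, tencent.json ends up as a sequence of JSON objects rather than a single JSON document, so a plain `json.load` on the whole file will fail. A minimal sketch for reading such a file back with the standard library follows; `read_items` is a hypothetical helper, and it accepts objects written back to back with or without whitespace between them.

```python
import json


def read_items(text):
    """Parse consecutive JSON objects from one string (hypothetical helper)."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed object and the index where it ended
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)
    return items


# Made-up content in the shape the pipeline writes
content = '{"name": "A", "duty": "x", "address": "Shenzhen"}{"name": "B"}\n{"name": "C"}\n'
items = read_items(content)
print(len(items), items[0]['name'])
```

For the real file, read it with `open('tencent.json', encoding='utf-8').read()` and pass the result to `read_items`.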

Modifying settings

Find ITEM_PIPELINES in settings.py and uncomment it so that the pipeline is enabled.
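After uncommenting, the setting should look roughly like the excerpt below. This assumes the project was created with `scrapy startproject tencent` and that the pipeline class kept its generated name `TencentPipeline`; adjust the dotted path if yours differs.

```python
# settings.py (excerpt)
# Keys are dotted paths to pipeline classes; values are priorities in the
# 0-1000 range, and pipelines with lower values run first.
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
```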

Running the Scrapy project

scrapy crawl recruit

The JSON file is saved in the project directory.
If you have any questions, feel free to leave a comment!
