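Scraping job postings from Tencent's experienced-hire careers site with Scrapy

Requirements

The crawler should scrape the experienced-hire (社会招聘) job postings on the Tencent Careers site and save them to a database with these fields: job title, country, city, business group, job category, responsibilities, posting date, and detailed description.
Target site: Tencent Careers (腾讯招聘)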

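Page analysis

Open the target page in a browser and press F12 to start capturing network traffic.
[Screenshot: captured network requests]
The capture shows that the page talks to the backend via Ajax, and that rendering the current page uses two backend endpoints, GetMultiDictionary and Query.

  • GetMultiDictionary

     Returns the business groups shown on the left side of the page.
    
  • Query

     Returns the job postings shown on the right side.
    

The response of the Query endpoint is the one I need.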

URL: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1581438061100&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

result:{"Code":200,"Data":{"Count":4221,"Posts":[{"Id":0,"PostId":"1123176672162484224","RecruitPostId":47753,"RecruitPostName":"18302-新动作手游后台开发工程师(深圳)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"技术","Responsibility":"负责新动作手机游戏的服务器端系统开发工作;\n负责部分游戏服务器端的架构工作;\n负责服务器端部分的性能优化工作;\n根据需要可能会负责部分前端的功能开发工作;","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1123176672162484224","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1215570947008892928","RecruitPostId":56702,"RecruitPostName":"34975-高级海外游戏数据分析","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"产品","Responsibility":"负责天美旗下海外重点产品数据分析,提供调优建议,并为后续国际化产品设计沉淀认知;\n日常游戏运营数据监控及问题分析;\n针对海外市场洞察与用户反馈提供假设并分析验证;\n通过数据分析提供版本与运营活动优化建议;\n长期沉淀基于数据分析的国际化产品研发与运营经验。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1215570947008892928","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1212989681885515776","RecruitPostId":56503,"RecruitPostName":"18302-动作品类玩法策划专家(深圳)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"产品","Responsibility":"关注市场的热点和前沿动作产品,能够基于核心玩法输出高水准的产品分析报告并发现新的动作产品机会;\n参与攻坚产品立项、定位,以及前期玩法搭建;\n参与游戏核心玩法和整体架构的设计,并对其进行论证和优化;\n协同程序、美术等其他部门合作,推动游戏核心玩法的实现以及论证,达到最终的设计效果;\n参与研究用户和市场的偏好,探索动作品类前进的方向;\n关注用户反馈,准确地发现产品玩法问题并予以解决。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1212989681885515776","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1171396248633085952","RecruitPostId":53270,"RecruitPostName":"18302-国际IP-日语海外PM(深圳)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"产品","Responsibility":"负责项目资源的协调和组织,确保项目团队各干系人协同工作;\n负责项目计划的制定,跟踪和维护,确定项目按计划进行;\n负责组织项目各项评审会议及项目例会;\n协调项目资源配全,确保项目任务有序推进;\n及时发现并跟踪解决项目问题,有效管理项目风险。","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=117139
6248633085952","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207609468850802688","RecruitPostId":56061,"RecruitPostName":"30933-FFW-Lead Game Narrative Designer (Los Angeles)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"产品","Responsibility":"Build up a narrative team for an AAA game title and define high efficiency work flow for narrative design;\nContribute to the narrative development of game stories, lore, quests, etc;\nCollaborate with your team and game designers to create, iterate on outstanding experience of storytelling and narrative contents.\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207609468850802688","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207612736846958592","RecruitPostId":56063,"RecruitPostName":"30933-FFW-Storyboard Artist(Los Angeles)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"设计","Responsibility":"Create storyboard sequences from rough storyboard panels through finished storyboard sequences that serve narrative/ storytelling objectives;\nEnsure that the vision and style stay consistent throughout the show, staging, character acting in storyboard work. 
If needed, making drawing or text changes in description, dialog or numbering to offer clear description;\nGood understanding for perspective and knows how to utilize it to create space in the storyboard;\nGood understanding for timing and knows how to import images, audio files and make animatic using Storyboard Pro or other equivalent software.\nKnow how camera works and able to construct a good flow of it into the animatic.\nFollow production’s guidelines and properly document every iteration.\nThe ability to work well within a team environment.\nCollaborate with narrative director and production manager to setup goals and schedules for the storyboards.\nRegularly meet with Director, Producer, and other Storyboard Artists to review, execute and revise storyboards.\nOversee the implementation and provide with necessary feedback and solutions.\nAttend and contribute to relevant meetings and pitches as needed, specially script meetings and narrative meetings.\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207612736846958592","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1207612736465276928","RecruitPostId":56062,"RecruitPostName":"30933-FFW-Senior Character Concept Artist(Los Angeles)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"设计","Responsibility":"Develop character concepts and costume designs;\nIterate on game asset designs with our internal and external team using ideation sketches. 
paint overs, and deliver final design with great rendering for visual target.","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1207612736465276928","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1227208414266920960","RecruitPostId":57173,"RecruitPostName":"29912-内容运营","CountryName":"中国","LocationName":"北京","BGName":"PCG","ProductName":"微视","CategoryName":"内容","Responsibility":"1、监控全网新闻热点,组织内容生产,及时挖掘重点新闻,对新闻线索及时作出判断;\n2、结合热点进行选题策划,进行大事件运营;\n3、联动甲方媒体,进行对接,合作策划;\n\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1227208414266920960","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1204296465451585536","RecruitPostId":55727,"RecruitPostName":"30933-AOV商业化运营(深圳)","CountryName":"中国","LocationName":"深圳","BGName":"IEG","ProductName":"","CategoryName":"产品","Responsibility":"制定商业化运营规划,把控整体商业化节奏,确保收入目标达成;\n负责游戏内商业化系统规划及具体落地,跟进系统数据,整合玩家建议,并提出优化建议;\n负责收入方面运营数据和用户反馈的收集与分析,不断优化游戏商业化体系。\n","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=1204296465451585536","SourceID":1,"IsCollect":false,"IsValid":true},{"Id":0,"PostId":"1227248031292723200","RecruitPostId":57176,"RecruitPostName":"30359-灯塔-Java后台开发高级工程师","CountryName":"中国","LocationName":"深圳","BGName":"PCG","ProductName":"","CategoryName":"技术","Responsibility":"1、负责PCG大数据实时传输、存储、实时计算、即时查询计算等底层基础支撑框架的开发和运营;\n 2、负责PCG大数据产品底层OLAP分析引擎的开发和运营。","LastUpdateTime":"2020年02月11日","PostURL":"http://careers.tencent.com/jobdesc.html?postId=0","SourceID":1,"IsCollect":false,"IsValid":true}]}}    
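
To make the shape of this response easier to see, here is a minimal sketch that parses a trimmed copy of it with the standard json module and pulls out the fields the crawler stores. The sample string is shortened from the response above (one post, with the Responsibility text cut short), and extract_posts is a hypothetical helper name, not part of the original project.

```python
import json

# Trimmed copy of the Query response above: one post, Responsibility shortened.
SAMPLE = '''
{"Code": 200,
 "Data": {"Count": 4221,
          "Posts": [{"PostId": "1123176672162484224",
                     "RecruitPostName": "18302-新动作手游后台开发工程师(深圳)",
                     "CountryName": "中国",
                     "LocationName": "深圳",
                     "BGName": "IEG",
                     "CategoryName": "技术",
                     "Responsibility": "负责新动作手机游戏的服务器端系统开发工作;",
                     "LastUpdateTime": "2020年02月11日",
                     "PostURL": "http://careers.tencent.com/jobdesc.html?postId=1123176672162484224"}]}}
'''

# Keys of each post that the crawler keeps.
FIELDS = [
    "RecruitPostName", "CountryName", "LocationName", "BGName",
    "CategoryName", "Responsibility", "LastUpdateTime", "PostURL",
]


def extract_posts(body):
    """Parse a Query response body and keep only the wanted keys of each post."""
    data = json.loads(body)["Data"]
    return [{name: post[name] for name in FIELDS} for post in data["Posts"]]


posts = extract_posts(SAMPLE)
print(len(posts), posts[0]["LocationName"], posts[0]["BGName"])
```

The postings sit under Data.Posts, and Data.Count carries the total number of postings across all pages.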

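Item implementation

import scrapy


class TencentItem(scrapy.Item):
    # Job title
    RecruitPostName = scrapy.Field()
    # Country
    CountryName = scrapy.Field()
    # City / location
    LocationName = scrapy.Field()
    # Business group
    BGName = scrapy.Field()
    # Job category
    CategoryName = scrapy.Field()
    # Responsibilities
    Responsibility = scrapy.Field()
    # Posting date
    LastUpdateTime = scrapy.Field()
    # Link to the posting's detail page; the spider and the pipeline both use
    # this key, so it must be declared here
    PostURL = scrapy.Field()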

Pipeline implementation

import MySQLdb


class TencentPipeline(object):
    # Purpose: save item data to the MySQL table `tencentpostion`

    INSERT_SQL = (
        "INSERT INTO `tencentpostion` ("
        "`recruitPostName`, `countryName`, `locationName`, `bgName`, "
        "`categoryName`, `responsibility`, `lastUpdateTime`, `PostURL`"
        ") VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
    )

    def open_spider(self, spider):
        # Open one connection per crawl instead of one per item
        self.db = MySQLdb.connect(host="localhost", user="root", passwd="sa",
                                  db="spider", charset="utf8")
        print("Pipeline initialization complete")

    def process_item(self, item, spider):
        values = (
            item['RecruitPostName'],
            item['CountryName'],
            item['LocationName'],
            item['BGName'],
            item['CategoryName'],
            item['Responsibility'],
            item['LastUpdateTime'],
            item['PostURL'],
        )
        cursor = self.db.cursor()
        try:
            # Pass the values separately so the driver escapes them; formatting
            # them into the SQL string breaks on quotes and allows SQL injection
            cursor.execute(self.INSERT_SQL, values)
            self.db.commit()
        except MySQLdb.Error as e:
            self.db.rollback()
            print("insert failed: %s" % e)
        finally:
            cursor.close()
        return item

    def close_spider(self, spider):
        self.db.close()
        print("close spider")
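
Why the values are better passed separately from the SQL text is easiest to see with a value that contains a quote. The sketch below uses the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL, so it runs without a server; note that sqlite3 uses ? placeholders where MySQLdb uses %s. The two-column table and the sample row are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the MySQL table (assumption: only two columns here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tencentpostion (recruitPostName TEXT, responsibility TEXT)")

# A value with a single quote, which ends the SQL string literal early
# if it is formatted straight into the statement text.
row = ("Storyboard Artist", "Follow production's guidelines")

formatted = "INSERT INTO tencentpostion VALUES ('%s', '%s')" % row
try:
    conn.execute(formatted)
    formatted_ok = True
except sqlite3.Error:
    # The stray quote makes the statement invalid SQL.
    formatted_ok = False

# Parameterized form: the driver handles quoting of the values.
conn.execute("INSERT INTO tencentpostion VALUES (?, ?)", row)
conn.commit()

stored = conn.execute("SELECT responsibility FROM tencentpostion").fetchall()
print(formatted_ok, stored)
```

Only the parameterized insert succeeds, so the table ends up with exactly one row.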

Spider implementation

# -*- coding: utf-8 -*-
# Spider for the Tencent experienced-hire careers site
import json
import time

import scrapy

from WHNews.items import TencentItem


class TencentpostionSpider(scrapy.Spider):
    name = 'tencentPostion'
    allowed_domains = ['tencent.com']
    url = "https://careers.tencent.com/tencentcareer/api/post/Query?"
    # Page index, starting from 1
    offset = 1
    # The API takes a timestamp in milliseconds
    nowTime = time.time()
    timestamp = int(round(nowTime * 1000))
    # Every parameter must be separated by "&"; pageIndex goes last so the
    # page number can simply be appended
    url_suffix = "timestamp=" + str(timestamp) + "&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageSize=10&language=zh-cn&area=cn&pageIndex="
    url = url + url_suffix
    # Start URL
    start_urls = [url + str(offset)]

    def parse(self, response):
        jsonBody = json.loads(response.body)
        posts = jsonBody['Data']['Posts']
        for post in posts:
            modelItem = TencentItem()
            modelItem['RecruitPostName'] = post['RecruitPostName']
            modelItem['CountryName'] = post['CountryName']
            modelItem['LocationName'] = post['LocationName']
            modelItem['BGName'] = post['BGName']
            modelItem['CategoryName'] = post['CategoryName']
            modelItem['Responsibility'] = post['Responsibility']
            modelItem['LastUpdateTime'] = post['LastUpdateTime']
            modelItem['PostURL'] = post['PostURL']
            yield modelItem
        # After a page has been processed, request the next one:
        # increment self.offset by 1, build the new URL, and handle the
        # response with self.parse. 417 is the hard-coded last page index.
        if self.offset < 417:
            self.offset = self.offset + 1
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
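
The last page index (417) is hard-coded in the spider. One alternative, sketched here as an assumption rather than as the author's method, is to derive it from the Count field of the response and the pageSize parameter. page_count and next_page are hypothetical helper names.

```python
import math


def page_count(total, page_size):
    """Number of pages needed to list `total` posts at `page_size` per page."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total / page_size)


def next_page(current, total, page_size):
    """Return the next page index, or None when `current` is the last page."""
    if current < page_count(total, page_size):
        return current + 1
    return None


# Count taken from the sample response above, with pageSize=10.
print(page_count(4221, 10))
print(next_page(1, 4221, 10))
```

Inside parse, the total would come from jsonBody['Data']['Count'], and the spider would stop issuing requests once next_page returns None.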

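Notes:

  1. The backend returns a JSON string, so XPath is of no use here. Import the json package and parse the response body into a Python object:
jsonBody = json.loads(response.body)
  2. The request parameters must include a timestamp.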
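
As a sketch of the second note, the query string can be built with urllib.parse.urlencode, which inserts the & separators itself and so avoids concatenation mistakes. The parameter names and their order are copied from the URL captured earlier; build_query_url is a hypothetical helper, and whether the server actually validates the timestamp value is not something the capture shows.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_query_url(page_index, page_size=10, timestamp_ms=None):
    """Build a Query URL for one page; the timestamp defaults to now, in ms."""
    if timestamp_ms is None:
        timestamp_ms = int(round(time.time() * 1000))
    params = [
        ("timestamp", timestamp_ms),
        ("countryId", ""),
        ("cityId", ""),
        ("bgIds", ""),
        ("productId", ""),
        ("categoryId", ""),
        ("parentCategoryId", ""),
        ("attrId", ""),
        ("keyword", ""),
        ("pageIndex", page_index),
        ("pageSize", page_size),
        ("language", "zh-cn"),
        ("area", "cn"),
    ]
    return BASE_URL + "?" + urlencode(params)


# Fixed timestamp (the one from the captured URL) to make the output repeatable.
url = build_query_url(1, timestamp_ms=1581438061100)
print(url)
```

With the fixed timestamp, the result matches the captured URL character for character.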

Results

[Screenshot: scraped results]
