Scraping Tencent Careers job and salary listings to analyze demand for internet-industry positions

Latest recommended article published 2021-09-05 09:10:41

Author: 雨中漫步



Article link: https://blog.csdn.net/A_740449043/article/details/91354760


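The data used here comes from Tencent Careers and is for reference only. If it infringes on any rights, please contact me at 740449043@qq.com and I will delete it immediately.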

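This project is built with the Scrapy framework. First, open the Tencent Careers page in a browser and capture the request it sends; because search criteria differ, the captured URL differs too. Here I set the country to China and the city to Shenzhen and scrape Tencent's 2019 job postings. Data cleaning mainly relies on regular expressions, and the cleaned data is written to MySQL for storage.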
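The captured request is a plain query URL, so its parameters can be rebuilt programmatically instead of by string concatenation. The following is a minimal sketch using only the standard library; the parameter names and values are copied from the URL used by the spider in this post, while `build_query_url` is a hypothetical helper name and the reading of `cityId=1` as Shenzhen is an assumption based on the description above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint taken from the captured URL used by the spider in this post.
API = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_query_url(page_index, city_id="1", page_size=10):
    """Build the listing URL for one page (hypothetical helper).

    city_id="1" is assumed to select Shenzhen, as in the captured URL.
    """
    params = {
        "timestamp": "1559004660890",
        "countryId": "",
        "cityId": city_id,
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": "",
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return API + "?" + urlencode(params)


url = build_query_url(3)
# Split the URL back into its parameters to check what was built.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["pageIndex"], query["cityId"])
```

Keeping the parameters in a dictionary makes it easier to change the city or page size later without editing a long string by hand.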

First, declare the names of the fields to scrape in items.py; put simply, this just defines the variables.


# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Fields of the target data
class TCItem(scrapy.Item):
    Job = scrapy.Field()             # job title
    CountryName = scrapy.Field()     # country
    LocationName = scrapy.Field()    # city
    BGName = scrapy.Field()          # hiring department
    CategoryName = scrapy.Field()    # department category
    Responsibility = scrapy.Field()  # requirements
    LastUpdateTime = scrapy.Field()  # last update time

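Next comes the spider itself; a single project can contain several spiders. This one requests each page of listings and uses regular expressions to match the fields to scrape, such as the work location and the job requirements. Each record is yielded as soon as it is extracted, which improves efficiency. The spider then extracts the page number, builds the URL of the next page by concatenation, and stops recursing once it reaches the last page.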

# -*- coding: utf-8 -*-
import re

import scrapy

# Import the item class defined in items.py
from MyScrapy.items import TCItem


class TengxunzpspiderSpider(scrapy.Spider):
    name = 'TengXunSpider'  # spider name, used by `scrapy crawl`
    allowed_domains = ['careers.tencent.com']  # crawl scope: a bare domain, no trailing slash
    url_1 = r'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1559004660890&countryId=&cityId=1&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex='
    page = 1
    url_2 = '&pageSize=10&language=zh-cn&area=cn'
    start_urls = [url_1 + str(page) + url_2]  # start URL (first page)

    # Request headers: one dictionary entry per header
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763',
    }

    def parse(self, response):
        data = response.body.decode('utf-8')  # response content as text

        jobs = re.findall(r'"RecruitPostName":"(.*?)",', data)         # job title
        countries = re.findall(r'"CountryName":"(.*?)",', data)        # country
        locations = re.findall(r'"LocationName":"(.*?)",', data)       # city
        bg_names = re.findall(r'"BGName":"(.*?)",', data)              # hiring department
        categories = re.findall(r'"CategoryName":"(.*?)",', data)      # department category
        update_times = re.findall(r'"LastUpdateTime":"(.*?)",', data)  # last update time
        requirements = re.findall(r'"Responsibility":"(.*?)",', data)  # requirements

        # Build one item per posting and yield it as soon as it is ready
        for job, country, location, bg_name, category, update_time, requirement in zip(
                jobs, countries, locations, bg_names, categories, update_times, requirements):
            item = TCItem()
            item['Job'] = job
            item['CountryName'] = country
            item['LocationName'] = location
            item['BGName'] = bg_name
            item['CategoryName'] = category
            item['LastUpdateTime'] = update_time
            item['Responsibility'] = requirement
            yield item

        # Automatic pagination
        # 1. Extract the page number from the current URL
        current_page = int(re.search(r'pageIndex=(\d+)&', response.url).group(1))

        # 2. Build the URL of the next page and send the next request
        next_page = current_page + 1
        if next_page < 266:  # hard-coded page limit from the time of the crawl
            yield scrapy.Request(self.url_1 + str(next_page) + self.url_2,
                                 callback=self.parse, dont_filter=True, headers=self.headers)
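Since the endpoint returns JSON, the same fields could also be read with the `json` module, which avoids two weaknesses of the regular expressions above: a value containing `",` would be cut short, and a field missing from one posting would shift the parallel lists out of alignment. The sketch below is an alternative, not the method used in this post. The field names come from the regular expressions in the spider; the `Data`, `Posts` and `Count` nesting is an assumption about the response layout, and `extract_posts` and `last_page` are hypothetical helper names.

```python
import json
import math

# Field names taken from the regular expressions in the spider.
FIELDS = ("CountryName", "LocationName", "BGName",
          "CategoryName", "Responsibility", "LastUpdateTime")


def extract_posts(body):
    """Return (records, total_count) for one page of the response.

    Assumes a layout like {"Data": {"Count": n, "Posts": [...]}}.
    """
    payload = json.loads(body)
    data = payload.get("Data") or {}
    records = []
    for post in data.get("Posts") or []:
        record = {"Job": post.get("RecruitPostName")}
        for field in FIELDS:
            record[field] = post.get(field)  # None when a field is absent
        records.append(record)
    return records, data.get("Count", 0)


def last_page(total_count, page_size=10):
    """Number of pages needed to cover total_count postings."""
    return math.ceil(total_count / page_size)


# Made-up sample payload, used only to exercise the helpers.
sample = json.dumps({"Data": {"Count": 25, "Posts": [
    {"RecruitPostName": "Backend Engineer", "CountryName": "China",
     "LocationName": "Shenzhen", "BGName": "TEG", "CategoryName": "Tech",
     "Responsibility": "Build services", "LastUpdateTime": "2019-05-28"},
]}})
records, total = extract_posts(sample)
print(records[0]["Job"], last_page(total))
```

If the response does carry a total count, `last_page` could replace the hard-coded page limit in the spider.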

With the scraping and cleaning written, the next step is saving the data to the database, which means writing the storage code in the pipeline file pipelines.py.


# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


# Pipeline file: post-processes or stores each scraped item
class MyfirstscrapyPipeline(object):

    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(host='localhost', port=3306, user='*****', password='******',
                                       db='scrapyspider', charset='utf8')
        self.cursor = self.connect.cursor()

    # Called each time the pipeline receives an item (required);
    # it must return the item
    def process_item(self, item, spider):
        # Write the record to the database with a parameterized INSERT
        # (Responsibility is not stored in this table)
        self.cursor.execute(
            'insert into tengxunzp_6_5(Job,CountryName,LocationName,BGName,CategoryName,LastUpdateTime) '
            'values (%s,%s,%s,%s,%s,%s)',
            (
                item['Job'],
                item['CountryName'],
                item['LocationName'],
                item['BGName'],
                item['CategoryName'],
                item['LastUpdateTime'],
            )
        )
        self.connect.commit()
        return item

    # Called when the spider finishes
    def close_spider(self, spider):
        # Close the cursor and the database connection
        self.cursor.close()
        self.connect.close()
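As the comment at the top of the pipeline file notes, Scrapy only runs a pipeline that is listed in the ITEM_PIPELINES setting. A minimal settings.py fragment might look like the following; the module path assumes the project package is named MyScrapy (as in the spider's import) and that the class lives in pipelines.py, and the priority 300 is an arbitrary value in the conventional 0 to 1000 range.

```python
# settings.py (fragment)
# Enable the MySQL pipeline; lower numbers run earlier when several are listed.
ITEM_PIPELINES = {
    "MyScrapy.pipelines.MyfirstscrapyPipeline": 300,
}
```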

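Below is the data written to the database, covering only Tencent's 2019 job postings in Shenzhen:
[Figure: the scraped records in the MySQL table]
To recap: items.py declares the fields to scrape, the spider fetches and cleans the data and hands the results to the pipeline, and the pipeline writes them to the database. Because the scope was small from the start, I used neither proxy IPs nor a custom IP proxy pool. Features should follow actual needs; code does not have to be complicated, and doing its job well is what matters. Enough talk, it is time to go and analyze the data.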