Boss直聘数据采集及分析

最新推荐文章于 2024-07-10 17:24:58 发布

凉城的夜

最新推荐文章于 2024-07-10 17:24:58 发布

阅读量7.9k

点赞数 2

分类专栏： Python 爬虫文章标签： python selenium 数据分析

本文链接：https://blog.csdn.net/sinat_32651363/article/details/106426304

版权

Python 同时被 2 个专栏收录

16 篇文章 4 订阅

订阅专栏

爬虫

4 篇文章 2 订阅

订阅专栏

Boss直聘数据采集及分析

我主要采集了Boss web端西安5月Python招聘情况，后面会在代码注释中进行解释

采集中碰到的问题参考，也许你也会遇到

采集

问题点

为了绕过boss直聘网站对selenium的检测需要做以下初始化工作：

首先开启：chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"；这句话在你的谷歌浏览器可执行文件夹运行，会在你的C:\selenum\ 生成一大堆文件
其次代码中需要添加：chrome_options.add_experimental_option('debuggerAddress','127.0.0.1:9222')；开启谷歌浏览器代理，为了绕过boss对selenium的检测
chrome_driver = r"D:\chrome-selenium\chromedriver_win32\chromedriver.exe" # selenium驱动
driver = webdriver.Chrome(chrome_driver, chrome_options=chrome_options) # 将代理添加进来

注意谷歌浏览器版本要与你的chromedriver版本一致，否则无法启动。chromeDrive下载网址

#!/usr/bin/python3  
# encoding: utf-8  
""" 
@version: v1.0 
@author: W_H_J 
@license: Apache Licence  
@contact: 415900617@qq.com 
@software: PyCharm 
@file: Boss.py
@time: 2020/5/8 15:21
@describe: boss直聘数据抓取
chromdriver :https://npm.taobao.org/mirrors/chromedriver
selenium: D:\chrome-selenium\chromedriver_win32\chromedriver.exe
参考：https://blog.csdn.net/qq_35531549/article/details/89023525
首先开启：chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"
上面这句话在你的谷歌浏览器可执行文件夹运行，会在你的C:\selenum\ 生成一大堆文件
其次添加：chrome_options.add_experimental_option('debuggerAddress','127.0.0.1:9222')
上面这句话，相当于开启你的谷歌浏览器代理，为了绕过boss对selenium的检测
"""
import hashlib
import sys
import os
import time

import requests
from pyquery import PyQuery as pq
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
from config.MysqlContent import DBHelper

sys.path.append(os.path.abspath(os.path.dirname(__file__) + '/' + '..'))
sys.path.append("..")

UA = UserAgent()  # 获取fake_useragent中的浏览器请求头
DB = DBHelper()  # 我自己写的MySQL助手，如果不存入MySQL，可以忽略


def driver_chrome():
    """
        加载selenium浏览器
    :return:
    """
    try:

        agent = UA.random
        chrome_options = Options()
        chrome_options.add_experimental_option('debuggerAddress', '127.0.0.1:9222')  # 开启代理模式
        chrome_driver = r"D:\chrome-selenium\chromedriver_win32\chromedriver.exe"  # 加载自己本地驱动
        driver = webdriver.Chrome(chrome_driver, chrome_options=chrome_options)  # 开启selenium
        return driver, agent
    except Exception as e:
        print(e, "driver_chrome")


def get_cookies(url):
    """
    获取网站session
    :param url:
    :return:
    """
    session = {}
    try:
        driver, agent = driver_chrome()
        driver.get(url)
        cookies = driver.get_cookies()
        for i in cookies:
            session[i.get('name')] = i.get('value')
        str_session = ""
        for k, v in session.items():
            str_session += k + "=" + v + ";"
        return str_session, agent
    except Exception as e:
        print(e, "get_cookies")


def first_page(index):
    """
    抓取页面
    :param index: 翻页
    :return:
    """
    key_word = "python"  # 要检索的关键字，因为我采集python信息，所以写python
    url = "https://www.zhipin.com/c101110100/?query={}&page={}&ka=page-next".format(key_word, index)
    print("==============>", url)
    cookie, agent = get_cookies(url)
    # print("cookie:", cookie, agent)
    # cookie中有一部分为关键的加密信息，用selenium就是要获取此部分信息，否则无法抓取，不用selenium就需要去破解那部分加密js
    headers = {
        "Host": "www.zhipin.com",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9",
        "cache-control": "no-cache",
        "cookie": '__c=1588920748; __g=-; lastCity=100010000; __l=l=%2Fwww.zhipin.com%2Fjob_detail%2F9bef0dd287a0be061XF73Nm0FFE~.html&r=&friend_source=0&friend_source=0; __zp_seo_uuid__=f771bb8f-4df1-4f67-9a81-4e916fe7a01e; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1588920748,1588921100,1588922589; toUrl=https%3A%2F%2Fwww.zhipin.com%2F%2Fjob_detail%2Fa1d0d9b2bb9e31611HJ52Ni0Fls%7E.html; JSESSIONID=""; __a=51155662.1588920748..1588920748.27.1.27.27; Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a=1588925506; {}'.format(cookie),
        "pragma": "no-cache",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": agent
    }
    # print(headers['cookie'])
    html = requests.get(url, headers=headers)
    print(html)
    doc = pq(html.text)
    doc2 = doc("#main > div.job-box > div.job-list > ul > li")

    print("-" * 100)
    sql_insert = "INSERT INTO boss_index (title_id,href,title,price,red,know,company,company_desc,tag,info_desc,job_area) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    list_params = []
    for x in doc2.items():
        href = "https://www.zhipin.com" + x("div > div.info-primary > div.primary-wrapper > div > div.job-title > span.job-name > a").attr("href")
        title = x("div > div.info-primary > div.primary-wrapper > div > div.job-title > span.job-name > a").text()
        price = x("div > div.info-primary > div.primary-wrapper > div > div.job-limit.clearfix > span").text()
        if '实习' in title:
            try:
                x1, x2, know = x("div > div.info-primary > div.primary-wrapper > div > div.job-limit.clearfix > p").html().split('<em class="vline"/>')
                experience = x1 + ',' + x2
            except:
                experience, know = x("div > div.info-primary > div.primary-wrapper > div > div.job-limit.clearfix > p").html().split('<em class="vline"/>')
        else:
            experience, know = x("div > div.info-primary > div.primary-wrapper > div > div.job-limit.clearfix > p").html().split('<em class="vline"/>')
        company = x("div > div.info-primary > div.info-company > div > h3 > a").text()
        company_desc_temp = x("div > div.info-primary > div.info-company > div > p").html().split('<em class="vline"/>')
        company_desc = [pq(y).text() for y in company_desc_temp]
        tag_temp = x("div > div.info-append.clearfix > div.tags").items()
        tag = [y.text().split(" ") for y in tag_temp][0]
        info_desc = x("div > div.info-append.clearfix > div.info-desc").text()
        job_area = x("div > div.info-primary > div.primary-wrapper > div > div.job-title > span.job-area-wrapper > span").text()
        print("href:", href)
        print("title:", title)
        print("price:", price)
        print("red:", experience)
        print("know:", know)
        print("company:", company)
        print("company_desc：", company_desc)
        print("tag:", tag)
        print("info_desc：", info_desc)
        print("job_area:", job_area)
        print()
        list_params.append(
            [get_md5(href), href, title, price, experience, know, company, str(company_desc), str(tag), info_desc,
             job_area])
    print("全部数据：", list_params)
    # DB.insert_many(sql_insert, list_params)  # 数据入库


def get_md5(url):
    """
    由于hash不处理unicode编码的字符串（python3默认字符串是unicode）
        所以这里判断是否字符串，如果是则进行转码
        初始化md5、将url进行加密、然后返回加密字串
    """
    if isinstance(url, str):
        url = url.encode("utf-8")
    md = hashlib.md5()
    md.update(url)
    return md.hexdigest()


def run():
    for page in range(1, 11):
        first_page(page)
        time.sleep(10)


if __name__ == '__main__':
    run()

采集数据分析

以5月西安Python招聘情况为例，整个样本391条招聘数据

1. 西安整体python全职招聘情况：纯python开发只占68%，剩下就是讲师、实习及其他开发捎带着python开发了，看来西安整个python占比并不是很高。

2. 西安python招聘工作地点要求：其中西安那根柱子，主要是雁塔+高新，因此可以看出，整个招聘地点集中才西安的雁塔区和高新区比较多一点，想找工作的朋友就要在此区域考虑交通出行了。

工作地点

3. 整体的薪资范围及浮动区间：按照招聘岗位发布的数据看，10~15k的还占比较高，不过实际能给到这个数的其实可能还得除以2了，其实占比较为均匀的还是12K以下的，可能受限于西安这个整体大环境吧。

薪资范围

4. 学历要求占比：硕士、博士的就不用说了，因为西安做大数据的公司较少，所以硕士的比例稍微低一点，一般做大数据方向的或者自然科学的会要求硕士及以上，整个样本中可以看出还是得本科起步了，其中因为西安高校本来就很多，招聘公司又会要求985、211等学校的，因此大专学历的会越来越不好在该行业从事工作了。

学历要求

5. 工作经验要求：结合上面学历要求，工作经验不限的基本集中在大专范围或者实习生范围，其他最起码1~3年了，3~5年数据较多是因为西安的华为外包发不了大量招聘岗位，导致占比较高。

工作经验

6. 公司规模：要说西安没有大公司还真是，1000人以上的，基本都是外包公司，他们发布了大量的招聘岗位导致数据分布在这些区间，其实真实的数据分布在0~20人、20~99人这些小规模公司，如果此前在大公司工作，可能直接转小公司会不太适应，因为各种制度的混乱，各种流程的欠缺，公司规模不同，整个工作氛围和环境会有很大的差别，尤其是个人职业发展方向及发展空间。

公司规模

7. 公司福利排行：公司越大，福利越健全，TOP10 外包公司占比很大，TOP10 以外的公司简介中的福利看看就好，基本上都是招聘的时候复制粘贴过来的，西安五险一金不健全的公司很多，面试一定要确认清楚。

公司福利

8. 招聘公司词云分析：看到没，西安招聘python的主流公司都是外包公司，可以说这边公司招聘分为三个梯队：第一梯队（外包公司，且占大比例）；第二梯队（培训教育机构，是给小孩及其他培训的，不是北京那种让你交钱入职的那种哦）；第三梯队（其他各类小公司或者某大公司在西安的分部）

招聘公司

9. 说到了培训公司，看看培训公司福利：如果考虑转岗倒是可以瞅瞅培训类公司，至于五险一金，仅供参考，不过培训类讲师基本都没有晚上的时间和周末，其实和程序员加班来看，他们不用上早班，上夜班倒也是不错呢。

培训公司

10. 薪资比例、范围：比起北上广，基本都是12薪，小公司不给你想法设法克扣工资都不错了。

薪资范围

11. 具体岗位分析词云：因为西安整个大环境问题，纯Python开发比例其实较低，并且其他各项技能分布比较平均，没有特别突出的岗位要求，说白了一句话，开发要的基本技能基本都要会，LINUX和爬虫深受公司HR所喜爱。

岗位分析

12. Python爬虫岗位：因为近一两年爬虫招聘岗位较多，专门看了看，整个爬虫占总体招聘比例只有2.3%，其中专职爬虫又占整个纯python开发67.77%中的3.4%，因此，整体爬虫招聘是相当残酷，可以说Boss发布的爬虫岗位很少，或者说西安整体对爬虫岗位的需求并不是很大，看了看这些公司具体岗位要求，发现数据公司占比很低，几乎都是将爬虫列为可有可无的岗位，或者仅有短期招聘需求，大白话（干一段时间，公司要的数据采集完了，可能就不再需要这个岗位了，因此你的考虑自身入职公司的整体业务是干啥的了，防止还在试用，结果采集完公司所需数据，然后被炒鱿鱼了）

爬虫岗位

整体分析的维度差不多就上面几条了，最终结论，西安Python招聘岗位很少，其中爬虫岗位又少之又少且不稳定，而且整体大环境导致西安软件行业并不健全，社保福利不健全，纯技术方向发展方向又较为狭小，要么需要身兼数职（啥都会），要么需要高手中的高手。西安本土互联网行业没有大公司，剩下的都是小公司类型，他们只会站在业务角度，因此这里的开发可以说是为了养家糊口而开发，为了开发而开发，并没有特别出彩的技术方向，不像北上广可以深入技术路线，不断产出新的技术框架，新的技术产品，这里，只有业务，没有创新。

凉城的夜

关注

2
点赞
踩
31

收藏

觉得还不错? 一键收藏
6
评论
Boss直聘数据采集及分析

Boss直聘数据采集及分析我主要采集了Boss web端西安5月Python招聘情况，后面会在代码注释中进行解释采集中碰到的问题参考，也许你也会遇到采集问题点为了绕过boss直聘网站对selenium的检测需要做以下初始化工作：首先开启：chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"；这句话在你的谷歌浏览器可执行文件夹运行，会在你的C:\selenum...
复制链接

扫一扫

专栏目录