BOSS直聘 (Boss Zhipin) Scraper Project

This chapter documents my approach and the steps I took while building a scraper for BOSS直聘.

The scenario: on the BOSS直聘 site, search for the keyword "爬虫" (web crawler) with the city set to Guangzhou (广州).

1. Search the page's network traffic for key strings to locate the underlying API endpoint, then compare the same endpoint across pages. Below, the URLs for page 1 and page 3 are compared:

As the comparison above shows, for the same search keyword only the page parameter changes between requests. So once page 1 works, every "爬虫" listing can be scraped simply by changing the page value.

2. Scraping the first page requires the following five parameters:

  • scene: searching other keywords shows this value never changes, so it can be hardcoded to 1
  • query: the search keyword
  • city: the city code
  • page: the page number
  • pageSize: the number of listings per page
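As a quick sketch, these five parameters can be assembled into a query string exactly the way requests encodes them when you pass `params=`. The values below mirror the "爬虫"/Guangzhou example; 101280100 is Guangzhou's city code as seen in the captured referer later on.

```python
from urllib.parse import urlencode

# The five core search parameters; values mirror the "爬虫" / Guangzhou example
params = {
    'scene': 1,            # constant across all searches
    'query': '爬虫',        # search keyword
    'city': 101280100,     # Guangzhou's city code
    'page': 1,             # page number to fetch
    'pageSize': 30,        # listings per page
}

# requests performs this encoding internally when given params=
print(urlencode(params))
```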

3. The query value can be obtained from user input:

query = input("Enter a job keyword: ")

4. The city code has to be looked up from a separate endpoint, found with the browser's network search tool.

The implementation:

# Fetch the city-code mapping

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
}

response = requests.get(url='https://www.zhipin.com/wapi/zpCommon/data/cityGroup.json', headers=headers)

# Dictionary mapping city name -> city code
city_dict = {}

city_list = response.json()["zpData"]['cityGroup']

for group in city_list:

    # Cities are grouped by initial letter; walk each group's city list
    for item in group['cityList']:

        name = item['name']
        code = item['code']

        # Add to the dictionary
        city_dict[name] = code
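The loop above can be checked offline against a hypothetical fragment that mirrors the shape of cityGroup.json (the firstChar key is illustrative; the codes 101280100 and 101020100 match the Guangzhou and Shanghai codes seen in the captured requests):

```python
# Hypothetical fragment mirroring the structure of zpData.cityGroup
sample_groups = [
    {"firstChar": "G", "cityList": [{"name": "广州", "code": 101280100}]},
    {"firstChar": "S", "cityList": [{"name": "上海", "code": 101020100}]},
]

city_dict = {}
for group in sample_groups:
    for item in group["cityList"]:
        city_dict[item["name"]] = item["code"]

print(city_dict["广州"])  # 101280100
```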

5. Setting aside how the parameter values are obtained, first scrape page 1:

import requests

response = requests.get('https://www.zhipin.com/wapi/zpgeek/search/joblist.json', params=params, cookies=cookies, headers=headers)

json_data = response.json()

job_contents = json_data["zpData"]["jobList"]
for job_content in job_contents:
    jobName = job_content['jobName']              # job title
    salaryDesc = job_content['salaryDesc']        # salary
    skills = job_content['skills']                # required skills
    jobDegree = job_content['jobDegree']          # education requirement
    welfareList = job_content['welfareList']      # benefits
    jobExperience = job_content['jobExperience']  # experience requirement
    brandName = job_content['brandName']          # company name
    cityName = job_content['cityName']            # city
    bossName = job_content['bossName']            # recruiter name

    encryptJobId = job_content['encryptJobId']    # note 1: keep for later
    securityId = job_content['securityId']        # note 2: keep for later
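One defensive note: if a posting ever lacks one of these keys, direct indexing raises a KeyError. dict.get with a default degrades gracefully instead (the record below is a made-up example):

```python
# Hypothetical record missing the bossName field
job = {"jobName": "爬虫工程师", "salaryDesc": "15-25K"}

job_name = job.get("jobName", "")           # key present
boss_name = job.get("bossName", "unknown")  # key absent -> falls back to default
print(job_name, boss_name)  # 爬虫工程师 unknown
```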

6. The job listings now scrape successfully, but the JSON contains no link to each detail page, so the next goal is to construct the detail-page URL for every job.

6.1 Clicking into any job's detail page reveals a very long URL, as shown below:

6.2 Breaking that URL down (see the figure below), it is composed of:

the BOSS直聘 domain + a string + lid + securityId + sessionId — finding the last four parts yields the complete link.

6.3 The /job_detail/5cfd4e2a08e395f81nF53tm7EVJU.html? part can be traced by searching the page's sources for /job_detail, which leads to the code that assembles it:

href: "/job_detail/" + e.encryptJobId + ".html?securityId=" + e.securityId

The search procedure is shown below:

6.4 Logging the relevant data in the console shows the URL is assembled from the JSON's encryptJobId and securityId — exactly the two values set aside in step 5 as notes 1 and 2:

6.5 The last parameter, sessionId, is empty, so the full URL can now be built.
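Translating that JavaScript into Python (the two IDs below are made-up placeholders, not real values):

```python
BASE_URL = "https://www.zhipin.com"

def build_job_url(encrypt_job_id: str, security_id: str) -> str:
    # Mirrors the site's JS: "/job_detail/" + encryptJobId + ".html?securityId=" + securityId
    return f"{BASE_URL}/job_detail/{encrypt_job_id}.html?securityId={security_id}"

# Placeholder IDs for illustration
print(build_job_url("abc123", "sec456"))
# → https://www.zhipin.com/job_detail/abc123.html?securityId=sec456
```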

7. The code:

The user-input helpers and city-code lookup:

# Prompt the user for a search keyword
def user_input_query():
    query = input("Enter a job keyword: ")
    return query

# Prompt the user for a target city
def user_input_city():
    city = input("Enter a target city: ")
    return city

# Map a city name to its city code (None if unknown)
def get_code(city):
    city_code = city_dict.get(city)
    return city_code


while True:
    query = user_input_query()
    city = user_input_city()
    code = get_code(city)
    if code is None:
        print("City not found, please try again")
    else:
        print(f"Searching for {query} in {city}, please wait")
        break

Fetching the job listings:

# Fetch one page of JSON search results
def get_request(query,city_code,index):
    cookies = {
        'lastCity': '101281000',
        '__g': 'sem_bingpc',
        'Hm_lvt_194df3105ad7148dcf2b98a91b5e727a': '1733815947',
        'HMACCOUNT': 'C4603FF5FD89DD5B',
        'wt2': 'DQ2hpC-SxREB8wHy5UR_O4qaluW35qPJ9KVXv5QgM8eGApk5z0ypsu98w0SL25-oEvW5LpyAhS4PycAKyuPooYw~~',
        'wbg': '0',
        'zp_at': 'LyeHbf68ytEAiQMeRW6-LCn4NJwmyMxsRCsDEqZKm8Q~',
        'ab_guid': '2b362c14-998a-440d-bb02-05eef4dd9395',
        '__zp_seo_uuid__': '49f53280-4c27-4cbf-95ec-0408792918f9',
        '__l': 'r=https%3A%2F%2Fcn.bing.com%2F&l=%2Fwww.zhipin.com%2Fjob_detail%2Fd26a271e94045f4a1nZ73d-6EldR.html%3FsecurityId%3DuLxGTJgMPVWS1-e1rsVGxSKsL3FywIfBX8gEylrv1njeOrszuB8J7ffbmTYVN4BmMLnaoCQ~%26sessionId%3D&s=3&g=%2Fwww.zhipin.com%2Fsem%2F10.html%3F_ts%3D1733819062714%26sid%3Dsem_bingpc%26qudao%3Dbing_pc_H120003UY5%26plan%3DBOSS-%25E5%25BF%2585%25E5%25BA%2594-%25E5%2593%2581%25E7%2589%258C%26unit%3Dboss%25E7%259B%25B4%25E8%2581%2598-%25E7%25AC%25A6%25E5%258F%25B7%25E8%25AF%258D%26keyword%3Dboss%25E7%259B%25B4%25E8%2581%2598%25E7%25BD%2591%25E9%25A1%25B5%25E7%2589%2588%2528%26msclkid%3D7be2972fe9d11696e67af0781364e48b&friend_source=0&s=3&friend_source=0',
        'Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a': '1733830631',
        '__zp_stoken__': '63fafOTnDm8K0wrzCujwtCQgICA8%2FJDY5Mhg9OSU6O0A5OTw5Qjk5NBs0KcO4wrzEg3puX8OTQjskOTw8Ozk%2FPDs%2BFDk4xLXCtDg4J8KnwrvEgXhrXcOEW3cPfSXDpMK5D0XCuwrEgMODJynCgcK4QzQ8NlbCvjTCvgbCu0PCvgTCuzbCuTQ0NjYpOAxbDw44NEVHXgZGYUZUXFIET0pNJjY9PTvDo8K4w7YnOQQGBhMEDQ8PCg0LCQkJDg4MDAkODQ8PCg0zOcKcw4PEvMSgxLrCjsSYxabDjcKxxIXChsOTw4DEvcO%2Bw6fCrcOFw7TDpcKIw4zCssOiw73DucKXxIPCpcS5wo3Ds0nDhcKuw43CrMOdworDi8K4w7vChsKgwrbDrcK0wo7Dg8Khw4DDv8K6xIJcw7hMwpDCrsKwT8K9VsKIR8KnYsK4wrzCvMKpUU7CsEnDgMK7wprCtFTCuEVsXHbCrMK7woJ1fXbCvltycl1fYA0OP1hHDsOI',
        'bst': 'V2R98gEOP43VtiVtRuzBkQKSy27DrVxyg~|R98gEOP43VtiVtRuzBkQKSy27DrezCw~',
        '__c': '1733815946',
        '__a': '51952642.1728739453.1728742713.1733815946.50.3.45.45',
    }
    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'priority': 'u=1, i',
        'referer': 'https://www.zhipin.com/web/geek/job?query=%E7%88%AC%E8%99%AB&city=101280100',
        'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'token': 'yPy6F3g6gvwbyWll',
        'traceid': 'F-c504e2QZW7X1Ubx6',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
        'x-requested-with': 'XMLHttpRequest',
        'zp_token': 'V2R98gEOP43VtiVtRuzBkQKSy27DrVxyg~|R98gEOP43VtiVtRuzBkQKSy27DrezCw~',
    }
    params = {
        'scene': '1',
        'query': query,
        'city': city_code,
        'experience': '',
        'payType': '',
        'partTime': '',
        'degree': '',
        'industry': '',
        'scale': '',
        'stage': '',
        'position': '',
        'jobType': '',
        'salary': '',
        'multiBusinessDistrict': '',
        'multiSubway': '',
        'page': index,
        'pageSize': '30',
    }
    response = requests.get('https://www.zhipin.com/wapi/zpgeek/search/joblist.json', params=params, cookies=cookies, headers=headers)
    response.raise_for_status()
    json_data = response.json()
    return json_data


# Extract the job details and write them out
def get_data(json_data, f):
    job_contents = json_data["zpData"]["jobList"]
    for job_content in job_contents:
        jobName = job_content['jobName']              # job title
        salaryDesc = job_content['salaryDesc']        # salary
        skills = job_content['skills']                # required skills
        jobDegree = job_content['jobDegree']          # education requirement
        welfareList = job_content['welfareList']      # benefits
        jobExperience = job_content['jobExperience']  # experience requirement
        brandName = job_content['brandName']          # company name
        cityName = job_content['cityName']            # city
        bossName = job_content['bossName']            # recruiter name

        # Build the detail-page link (see step 6)
        encryptJobId = job_content['encryptJobId']    # note 1
        securityId = job_content['securityId']        # note 2
        url = "https://www.zhipin.com/job_detail/" + encryptJobId + ".html?securityId=" + securityId

        f.write(f"{jobName}, {salaryDesc}, {skills}, {jobDegree}, {jobExperience}, {brandName}, {cityName}, {bossName}, {welfareList}, {url}\n")
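A caveat about the comma-joined f.write above: fields such as the skills list or a job title can themselves contain commas, which corrupts the resulting CSV. Python's csv module quotes such fields automatically; a minimal sketch with made-up values:

```python
import csv
import io

# Made-up row; note the comma inside the first field
row = ["Crawler Engineer, Senior", "15-25K", "Python|Scrapy"]

buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue(), end="")
# → "Crawler Engineer, Senior",15-25K,Python|Scrapy
```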

The complete program:

import requests
from city import city_dict  # city name -> code mapping, built in step 4

# Fetch one page of JSON search results
def get_request(query,city_code,index):
    cookies = {
        'lastCity': '101281000',
        '__g': 'sem_bingpc',
        'Hm_lvt_194df3105ad7148dcf2b98a91b5e727a': '1733815947',
        'HMACCOUNT': 'C4603FF5FD89DD5B',
        'wt2': 'DQ2hpC-SxREB8wHy5UR_O4qaluW35qPJ9KVXv5QgM8eGApk5z0ypsu98w0SL25-oEvW5LpyAhS4PycAKyuPooYw~~',
        'wbg': '0',
        'zp_at': 'LyeHbf68ytEAiQMeRW6-LCn4NJwmyMxsRCsDEqZKm8Q~',
        'ab_guid': '2b362c14-998a-440d-bb02-05eef4dd9395',
        '__zp_seo_uuid__': '49f53280-4c27-4cbf-95ec-0408792918f9',
        '__l': 'r=https%3A%2F%2Fcn.bing.com%2F&l=%2Fwww.zhipin.com%2Fjob_detail%2Fd26a271e94045f4a1nZ73d-6EldR.html%3FsecurityId%3DuLxGTJgMPVWS1-e1rsVGxSKsL3FywIfBX8gEylrv1njeOrszuB8J7ffbmTYVN4BmMLnaoCQ~%26sessionId%3D&s=3&g=%2Fwww.zhipin.com%2Fsem%2F10.html%3F_ts%3D1733819062714%26sid%3Dsem_bingpc%26qudao%3Dbing_pc_H120003UY5%26plan%3DBOSS-%25E5%25BF%2585%25E5%25BA%2594-%25E5%2593%2581%25E7%2589%258C%26unit%3Dboss%25E7%259B%25B4%25E8%2581%2598-%25E7%25AC%25A6%25E5%258F%25B7%25E8%25AF%258D%26keyword%3Dboss%25E7%259B%25B4%25E8%2581%2598%25E7%25BD%2591%25E9%25A1%25B5%25E7%2589%2588%2528%26msclkid%3D7be2972fe9d11696e67af0781364e48b&friend_source=0&s=3&friend_source=0',
        'Hm_lpvt_194df3105ad7148dcf2b98a91b5e727a': '1733830631',
        '__zp_stoken__': '63fafOTzDmsK4w4LCujksBQYIDQ5DKjY8Mx9COTA7Nz45PD01PDk8NRc6KSXCvMOVwoJuWsOSNTUkPD1ANTk6PTdAFDw5xLnCujg9JsOZwrXDm3tqYcOKW8KCDsKBK8OkwrwOScK1CsO1w4IrJ8KBwr1COEI2Y8K%2FOMOABsK%2BQsOCCsK7Q8K4ODo2Qyg0ElsKDzQ6RVJfCkhhU1VgTARKS1EoNjg8N8OdwrjEgyY1CgYTEggTDwoLEQUJDAgSEAwJCBITDwoLES05wpnDgsOIxoDFgMKLxJnFqsOTwrHEoMKMxJbEgsOMwovEvsKWxLHDtMOXwrzDoMKPw5rCpsOjwrjDvFLDskbEuMKsw79KwonCp8OJwq3EgsKSw4pIw5%2FCtsOywqrClGLChMKfw63CtcSCWcO5UMKOwq7CpU7DgVjCiFLCpl7CtsK8wrnCqE1QwrBMw4HCt8KUwrRhwrlJclzCg8Ktwrd8dXh3w4JVcmdcY14NCz5UDXjDiQ%3D%3D',
        'bst': 'V2R98gEOP43VtiVtRuzBkQKSy27DrSwi8~|R98gEOP43VtiVtRuzBkQKSy27DrRxC8~',
        '__c': '1733815946',
        '__a': '51952642.1728739453.1728742713.1733815946.53.3.48.48',
    }

    headers = {
        'accept': 'application/json, text/plain, */*',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
        'priority': 'u=1, i',
        'referer': 'https://www.zhipin.com/web/geek/job?query=%E7%88%AC%E8%99%AB&city=101020100',
        'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'token': 'yPy6F3g6gvwbyWll',
        'traceid': 'F-dccebbJEbotxD75U',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
        'x-requested-with': 'XMLHttpRequest',
        'zp_token': 'V2R98gEOP43VtiVtRuzBkQKSy27DrSwi8~|R98gEOP43VtiVtRuzBkQKSy27DrRxC8~',
    }

    params = {
        'scene': '1',
        'query': query,
        'city': city_code,
        'experience': '',
        'payType': '',
        'partTime': '',
        'degree': '',
        'industry': '',
        'scale': '',
        'stage': '',
        'position': '',
        'jobType': '',
        'salary': '',
        'multiBusinessDistrict': '',
        'multiSubway': '',
        'page': index,
        'pageSize': '30',
    }

    response = requests.get('https://www.zhipin.com/wapi/zpgeek/search/joblist.json', params=params, cookies=cookies, headers=headers)
    response.raise_for_status()
    json_data = response.json()
    return json_data

# Prompt the user for a search keyword
def user_input_query():
    query = input("Enter a job keyword: ")
    return query

# Prompt the user for a target city
def user_input_city():
    city = input("Enter a target city: ")
    return city

# Map a city name to its city code (None if unknown)
# dict.get never raises, so no try/except is needed here
def get_code(city):
    city_code = city_dict.get(city)
    return city_code

# Extract the job details and write them out
def get_data(json_data, f):
    job_contents = json_data["zpData"]["jobList"]
    for job_content in job_contents:
        jobName = job_content['jobName']              # job title
        salaryDesc = job_content['salaryDesc']        # salary
        skills = job_content['skills']                # required skills
        jobDegree = job_content['jobDegree']          # education requirement
        welfareList = job_content['welfareList']      # benefits
        jobExperience = job_content['jobExperience']  # experience requirement
        brandName = job_content['brandName']          # company name
        cityName = job_content['cityName']            # city
        bossName = job_content['bossName']            # recruiter name

        # Build the detail-page link (see step 6)
        encryptJobId = job_content['encryptJobId']    # note 1
        securityId = job_content['securityId']        # note 2
        url = "https://www.zhipin.com/job_detail/" + encryptJobId + ".html?securityId=" + securityId

        f.write(f"{jobName}, {salaryDesc}, {skills}, {jobDegree}, {jobExperience}, {brandName}, {cityName}, {bossName}, {welfareList}, {url}\n")


if __name__ == '__main__':

    while True:
        query = user_input_query()
        city = user_input_city()
        code = get_code(city)
        if code is None:
            print("City not found, please try again")
        else:
            print(f"Searching for {query} in {city}, please wait")
            break

    with open(f"{city}_{query}_jobs.csv", mode='w', encoding='utf-8') as f:
        # Keep it modest: scrape just four pages
        for index in range(1, 5):
            json_data = get_request(query, code, index)
            get_data(json_data, f)
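If the page range is ever extended beyond four, a short random pause between requests is a sensible courtesy that also reduces the chance of being rate-limited (this helper is my own addition, not part of the site's API):

```python
import random
import time

def polite_pause(min_s=1.0, max_s=3.0):
    # Sleep for a random interval to avoid hammering the server
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Usage inside the page loop:
# for index in range(1, 5):
#     json_data = get_request(query, code, index)
#     get_data(json_data, f)
#     polite_pause()
```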


Results:

Note: observant readers have probably carried a question all the way here — wasn't the URL made of five parts? Where did lid go? I left lid out for two reasons:

(1) With or without lid, the link still reaches the detail page correctly.

(2) Setting a breakpoint on the function that includes lid, the program never pauses there — although the pseudo-encryption function for lid can still be dug out of the source code.

Conclusion: by analogy with notes 1 and 2, lid is presumably just the lid value found in the JSON.

That is my approach and workflow; I would welcome feedback from more experienced readers.
