Scraping Lagou Job Postings with Python (Single-Threaded, Latest) - CSDN Blog

Article link: https://blog.csdn.net/weixin_44549540/article/details/106462944

1. Task Description

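The task is to scrape the job postings for "会计" (accounting) positions published on Lagou (拉勾网). Most existing articles on the topic scrape the site with a single thread. Lagou has also been updated many times and now has anti-scraping measures that defeat naive scrapers: without a sleep interval the scraper gets no content, and a pause is needed even between individual postings.
As the screenshot in the original post shows, without a pause the scraper stops receiving data after the 10th posting. This article starts with a simple single-threaded implementation.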

2. Page Analysis

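First, log in to Lagou with your own phone number. Since the task is to scrape postings for "会计" (accounting) jobs, search for "会计" in the search bar. To scrape a different kind of position, replace "会计" in the code with your own keyword (including its URL-encoded form, %E4%BC%9A%E8%AE%A1, in the URLs). The first step in scraping a page is to analyze it: right-click the page and choose "Inspect" to open the browser's developer tools.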
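The developer tools list the captured requests. The one that carries the job data is "positionAjax.json?px=new&needAddtionalResult=false" (if you cannot find it, type position into the Filter box). Selecting it shows that, unlike many other sites, it uses the POST method.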
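The code therefore has to use requests.post() to fetch the data. Next, look at the response itself. On many sites, opening a request's URL directly takes you to the raw response, but doing that here does not return the data. This is presumably another anti-scraping measure by the site's developers. Open the "Preview" tab instead to see the response.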
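Each page contains 15 postings, and every posting is an object holding the fields we want. The response is structured like a dictionary, which makes things easy: parse the JSON text into a Python dict and read the fields we need by key.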
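To make the key-based lookup concrete, here is a minimal sketch that parses a heavily trimmed, made-up response body. The key names (content, positionResult, result, positionName, city, salary) are the ones used later in this article; the values are hypothetical and the real response has many more fields.

```python
import json

# Hypothetical, heavily trimmed response body; values are made up for illustration.
sample_text = """
{
  "content": {
    "positionResult": {
      "result": [
        {"positionName": "Accountant", "city": "Shanghai", "salary": "8k-12k"},
        {"positionName": "Senior Accountant", "city": "Beijing", "salary": "15k-20k"}
      ]
    }
  }
}
"""

data = json.loads(sample_text)                           # JSON text -> nested dicts and lists
postings = data["content"]["positionResult"]["result"]   # list with one dict per posting

first = postings[0]
print(first["positionName"], first["city"], first["salary"])
```

Each level of the path is an ordinary dict lookup, and result is a plain list, so it can be indexed or iterated directly.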

3. Writing the Code

3.1 Request Headers

The request headers do not need many fields:

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
    # Referer: the search-results page the Ajax request is issued from
    'Referer': 'https://www.lagou.com/jobs/list_%E4%BC%9A%E8%AE%A1/p-city_0?&cl=false&fromSearch=true'
               '&labelWords=&suginput=',
}

The headers above still lack cookies. Because the cookie string is very long, I obtain it programmatically instead of pasting it in:

s = requests.Session()
# Visit the search page first so that the session receives its cookies
s.get(urls, headers=header, timeout=3)
# Cookies set by the search page
cookie = s.cookies

Here urls is the URL of the search-results page currently open in the browser, not the URL of the positionAjax.json request.
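Both urls and the Referer header embed the search keyword in percent-encoded form, so switching keywords means re-encoding it. A minimal sketch using the standard library's urllib.parse.quote follows. The helper name build_search_url and its city parameter are illustrative and not part of the original code; the URL pattern is simply copied from the urls value used in this article.

```python
from urllib.parse import quote


def build_search_url(keyword, city="全国"):
    """Return the search-page URL for a keyword (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the text,
    # for example "会计" becomes "%E4%BC%9A%E8%AE%A1".
    return ("https://www.lagou.com/jobs/list_" + quote(keyword)
            + "?px=new&city=" + quote(city) + "#order")


urls = build_search_url("会计")
print(urls)
```

Calling build_search_url with another keyword yields the matching search-page URL without hand-encoding the Chinese characters.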

3.2 Pagination

Inspecting the site shows that the "会计" search returns only 30 pages of results. Pagination is implemented as follows:

    for number in range(30):
        print("Scraping page " + str(number + 1))
        payload = {
            'first': 'true',
            'pn': number,  # current page
            'kd': '会计'  # search keyword (type of job posting)
        }
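The payload can also be produced by a small helper. The sketch below is a variant built on assumptions, not the article's code: build_payload is a hypothetical name, and it assumes that the site numbers pages from 1 and that first should be 'true' only for the first page. The loop above instead passes number, which starts at 0, and always sends 'true'; this article does not verify which convention the site expects.

```python
def build_payload(page, keyword="会计"):
    """Form data for one results page; `page` is assumed to be 1-based."""
    return {
        "first": "true" if page == 1 else "false",  # assumption: marks the first page of a search
        "pn": page,                                  # page number
        "kd": keyword,                               # search keyword
    }


payloads = [build_payload(p) for p in range(1, 31)]  # pages 1 through 30
print(payloads[0])
print(payloads[-1])
```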

3.3 Scraping Each Page

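With pagination in place, the next step is to extract the fields we need from each page. First, parse the response body as JSON:

response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
data = json.loads(response)  # parse the JSON text into a dict

As noted earlier, each page holds 15 postings, so a loop handles them. The scraper must pause between postings; otherwise the site detects it and stops returning data. Note that the code below re-sends the POST request for every posting, so this pause is what spaces out the requests. I set it to 10 seconds:

time.sleep(10)

Because 10 seconds is fairly long, I tried 5 instead, but the scraper then failed after the 10th posting and got nothing more, so a longer pause is safer. The code is as follows: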

        for i in range(15):
            try:
                response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
                data = json.loads(response)  # parse the JSON text into a dict
                job = data['content']['positionResult']['result'][i]  # i-th posting on this page
                create_Time.append(job['createTime'])
                company_FullName.append(job['companyFullName'])
                company_Size.append(job['companySize'])
                industry_Field.append(job['industryField'])
                position_Name.append(job['positionName'])
                position_salary.append(job['salary'])
                job_city.append(job['city'])
                need_education.append(job['education'])
                need_workYear.append(job['workYear'])
            except IndexError:
                # The page holds fewer than 15 postings
                print('Problem: this page has no more postings')
                break
            print("-------- Posting " + str(number * 15 + i + 1) + " scraped --------")
            time.sleep(10)  # a pause at least this long is advisable; shorter ones get blocked
        print("Page " + str(number + 1) + " scraped")

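When I scraped e-commerce reviews earlier, I found that the number of items per page is not always fixed: some pages had 10 reviews, others 9 or some other number. A try block is therefore used here so that a short page does not crash the program. The lists that store the scraped fields are: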

create_Time = []  # time the posting was created
company_FullName = []  # company name
company_Size = []  # company size
industry_Field = []  # industry
position_Name = []  # job title
position_salary = []  # salary
job_city = []  # job location
need_education = []  # education requirement
need_workYear = []  # work experience requirement
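The loop in this section sends the same POST request again for every posting on a page. As an alternative sketch, one response per page is enough: parse it once and walk its result list, which also copes with pages that hold fewer than 15 postings without relying on IndexError. The names FIELDS and extract_rows and the sample data are illustrative, and whether the site tolerates shorter pauses with one request per page has not been tested here.

```python
FIELDS = ("createTime", "companyFullName", "companySize", "industryField",
          "positionName", "salary", "city", "education", "workYear")


def extract_rows(data):
    """Return one dict per posting from an already-parsed response."""
    results = data["content"]["positionResult"]["result"]
    # .get() leaves a missing field as None instead of raising KeyError
    return [{name: job.get(name) for name in FIELDS} for job in results]


# Hypothetical parsed response with two postings; the second one lacks most fields.
sample = {"content": {"positionResult": {"result": [
    {"createTime": "2020-05-31 10:00:00", "companyFullName": "Example Co.",
     "companySize": "50-150", "industryField": "Finance",
     "positionName": "Accountant", "salary": "8k-12k", "city": "Shanghai",
     "education": "Bachelor", "workYear": "1-3 years"},
    {"positionName": "Cashier", "city": "Beijing"},
]}}}

rows = extract_rows(sample)
print(len(rows), rows[1]["positionName"], rows[1]["salary"])
```

A list of row dicts like this can be passed straight to pd.DataFrame(rows), which avoids keeping nine parallel lists in sync.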

3.4 Building the Table

Once the data has been collected, save it to an Excel spreadsheet. The code is again very simple:

table = pd.DataFrame(
    {'create_Time': create_Time, 'company_FullName': company_FullName, 'company_Size': company_Size,
     'industry_Field': industry_Field, 'position_Name': position_Name, 'position_salary': position_salary,
     'job_city': job_city, 'need_education': need_education, 'need_workYear': need_workYear})
table.to_excel(file)  # file is the output path; writing .xlsx requires the openpyxl package
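If an Excel file is not required, the same columns can be written as CSV with only the standard library. This is a substitute for the pandas call above, not the article's method. The column lists here hold made-up values, and the output goes to an in-memory buffer so that the sketch runs anywhere.

```python
import csv
import io

# Hypothetical column data standing in for the lists filled by the scraper
columns = {
    "position_Name": ["Accountant", "Cashier"],
    "position_salary": ["8k-12k", "5k-7k"],
    "job_city": ["Shanghai", "Beijing"],
}

# For a real file, use open(path, "w", newline="", encoding="utf-8-sig") instead
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns.keys())            # header row
writer.writerows(zip(*columns.values()))   # zip turns the columns into one row per posting
csv_text = buffer.getvalue()
print(csv_text)
```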

4. Complete Code

import requests
import json
import pandas as pd
import time


def position_data(url, urls, file):
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
        # Referer: the search-results page the Ajax request is issued from
        'Referer': 'https://www.lagou.com/jobs/list_%E4%BC%9A%E8%AE%A1/p-city_0?&cl=false&fromSearch=true'
                   '&labelWords=&suginput=',
    }
    s = requests.Session()
    # Visit the search page first so that the session receives its cookies
    s.get(urls, headers=header, timeout=3)
    # Cookies set by the search page
    cookie = s.cookies
    # Fetch the postings
    start = time.time()
    for number in range(30):
        print("Scraping page " + str(number + 1))
        payload = {
            'first': 'true',
            'pn': number,  # current page
            'kd': '会计'  # search keyword (type of job posting)
        }
        for i in range(15):
            try:
                response = s.post(url, data=payload, headers=header, cookies=cookie, timeout=5).text
                data = json.loads(response)  # parse the JSON text into a dict
                job = data['content']['positionResult']['result'][i]  # i-th posting on this page
                create_Time.append(job['createTime'])
                company_FullName.append(job['companyFullName'])
                company_Size.append(job['companySize'])
                industry_Field.append(job['industryField'])
                position_Name.append(job['positionName'])
                position_salary.append(job['salary'])
                job_city.append(job['city'])
                need_education.append(job['education'])
                need_workYear.append(job['workYear'])
            except IndexError:
                # The page holds fewer than 15 postings
                print('Problem: this page has no more postings')
                break
            print("-------- Posting " + str(number * 15 + i + 1) + " scraped --------")
            time.sleep(10)  # a pause at least this long is advisable; shorter ones get blocked
        print("Page " + str(number + 1) + " scraped")
    end = time.time()
    print("Total time (seconds): " + str(end - start))
    table = pd.DataFrame(
        {'create_Time': create_Time, 'company_FullName': company_FullName, 'company_Size': company_Size,
         'industry_Field': industry_Field, 'position_Name': position_Name, 'position_salary': position_salary,
         'job_city': job_city, 'need_education': need_education, 'need_workYear': need_workYear})
    table.to_excel(file)
    print(table)


urls = 'https://www.lagou.com/jobs/list_%E4%BC%9A%E8%AE%A1?px=new&city=%E5%85%A8%E5%9B%BD#order'  # search-results page
url = 'https://www.lagou.com/jobs/positionAjax.json?px=new&needAddtionalResult=false'  # URL of the Ajax request
file = r'C:\Users\QSF\Desktop\拉勾网招聘信息.xlsx'
create_Time = []  # time the posting was created
company_FullName = []  # company name
company_Size = []  # company size
industry_Field = []  # industry
position_Name = []  # job title
position_salary = []  # salary
job_city = []  # job location
need_education = []  # education requirement
need_workYear = []  # work experience requirement
position_data(url, urls, file)  # run the scraper