Lagou scraping failed? Try this trick

Latest recommended article published on 2024-04-04 01:44:03

冰山_

Tags: Lagou, Python, web scraping

Article link: https://blog.csdn.net/weixin_46369953/article/details/107292060

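If you have ever scraped Lagou (拉勾网), you know it is not an easy site to scrape.

That is what you would expect from a recruitment site built specifically for people working in the internet industry...

So which anti-scraping mechanisms does Lagou use? One is a cookie check, and the other is a limit on how frequently a single IP may send requests.

In this scrape, the one I ran into was not the cookie check but the per-IP request frequency limit.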

Getting around the anti-scraping measures

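I picked the "Data Operations" (数据运营) role from Lagou's built-in job categories.

My first scraping attempt ran into this problem...

Looking at the response page that came back showed...

Even with request headers set, a verification page popped up every time the scraper reached the sixth record. Lagou had detected that requests from the same IP were arriving too frequently and triggered its verification mechanism, so the page I wanted could only be retrieved after entering the verification information.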
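
To notice this situation in code rather than by inspecting the response by hand, one option is to check whether the response still contains something a real detail page should have. The following is only a minimal sketch: it assumes that a genuine detail page contains the `job_request` element that the full scraper in this article looks for, and that the verification page does not. The helper name `looks_blocked` and both sample pages are made up for illustration.

```python
def looks_blocked(html, marker="job_request"):
    """Guess whether a detail-page response is actually a verification page.

    Assumption: a real detail page contains the marker string and a
    verification page does not. Adjust the marker if the site differs.
    """
    return marker not in html


# Hypothetical responses, for illustration only
normal_page = '<dd class="job_request"><h3>...</h3></dd>'
verification_page = "<html><body>Please complete the verification</body></html>"

checks = [looks_blocked(normal_page), looks_blocked(verification_page)]
```

With a check like this, the scraper could stop or wait longer as soon as it is blocked, instead of failing later while parsing.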

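For a per-IP frequency limit, routing requests through proxy IPs is the ideal countermeasure. You can also use the time module to lower the request frequency instead. The downside is that it is very slow, but it is a workable approach when there is not much data to scrape.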
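
I did not use proxies in this article, but for reference, here is a rough sketch of what rotating through a proxy pool could look like. The proxy addresses are placeholders, not working proxies, and `proxy_cycle` is a hypothetical helper name. The only library fact relied on is that `requests.get` accepts a `proxies` dict keyed by URL scheme.

```python
import itertools

# Placeholder addresses for illustration only; replace with proxies you control
PROXY_POOL = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]


def proxy_cycle(pool):
    """Yield requests-style proxies dicts, rotating through the pool forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


proxies = proxy_cycle(PROXY_POOL)
first = next(proxies)
second = next(proxies)
third = next(proxies)  # the pool has two entries, so this wraps back to the first
```

Each dict can then be passed along with a request, for example `requests.get(url, headers=headers, proxies=next(proxies), timeout=3)`, so that consecutive requests leave from different IPs.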

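Below, I lengthened the interval between requests to try to fetch every posting for the "Data Operations" role, and it ran without errors!

The method: use randint() from the random module to pick a random number of seconds, then use sleep() from the time module to pause the program for that long. Place the call right after each request to the site.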

import random
import time
time.sleep(random.randint(10, 15))  # pause for a random 10 to 15 seconds

A random pause of 10 to 15 seconds worked best for me: when I tried 5 to 10 seconds, the scraper was still detected.
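
If the pause is needed in several places, it can be wrapped in a small helper. This is just a sketch: `polite_pause` is a made-up name, and the `sleep` parameter exists only so the helper can be tried out without actually waiting.

```python
import random
import time


def polite_pause(low=10, high=15, sleep=time.sleep):
    """Pause for a random whole number of seconds between low and high, inclusive.

    Returns the number of seconds chosen, which is handy for logging.
    """
    seconds = random.randint(low, high)  # randint includes both endpoints
    sleep(seconds)
    return seconds


# Dry run with a stand-in sleep function; drop the argument to really pause
waited = polite_pause(sleep=lambda s: None)
```

In the scraper, calling `polite_pause()` right after each `requests.get(...)` would behave like the `time.sleep(random.randint(10, 15))` line above.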

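Scraping Lagou

For each posting, the scraper collects the company, industry, job title, experience requirement, education requirement, salary, and the job responsibilities and requirements, along with the link to the detail page.

The full code: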

import random
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Empty containers: one list per field, plus the dict that feeds the DataFrame
job_all = {}
company_content = []
industry_content = []
job_content = []
experience_content = []
education_content = []
salary_content = []
detail_content = []
url_content = []

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}

# Scrape listing pages 1 through 8
for i in range(1, 9):
    list_url = 'https://www.lagou.com/guangzhou-zhaopin/shujuyunying/{}/'.format(i)
    res = requests.get(list_url, headers=headers, timeout=3)
    res.encoding = res.apparent_encoding
    list_soup = BeautifulSoup(res.text, 'html.parser')
    links = list_soup.find_all('div', class_='p_top')  # blocks holding each posting's detail link on this page
    # Visit every detail link and extract the company, industry, job title,
    # experience requirement, education requirement, salary, and responsibilities and requirements
    for link in links:
        detail_url = link.find('a')['href']
        rep = requests.get(detail_url, headers=headers, timeout=3)
        time.sleep(random.randint(10, 15))  # pause after each request to get past the anti-scraping check
        rep.encoding = rep.apparent_encoding
        soup = BeautifulSoup(rep.text, 'html.parser')
        company = soup.find('em', class_='fl-cn').text.strip()  # company
        industry = soup.find('h4', class_='c_feature_name').text.strip()  # industry
        job = soup.find('h1', class_='name').text.strip()  # job title
        detail = soup.find('div', class_='job-detail').text.strip()  # responsibilities and requirements
        request = soup.find('dd', class_='job_request')  # block with experience, education and salary
        # Pull the fields out with a regular expression
        request_match = re.match('^<dd .*?<span class=.*?>(.*?) </span>.*?span>/(.*?) /</span.*?span>(.*?) /</span.*?span>(.*?) /</span.*?span>(.*?)</span.*?h3>', str(request), re.S)
        experience = request_match.group(3)  # experience requirement
        education = request_match.group(4)  # education requirement
        salary = request_match.group(1)  # salary
        # Append this posting's fields to the lists
        company_content.append(company)
        industry_content.append(industry)
        job_content.append(job)
        experience_content.append(experience)
        education_content.append(education)
        salary_content.append(salary)
        detail_content.append(detail)
        url_content.append(detail_url)

job_all['Company'] = company_content
job_all['Industry'] = industry_content
job_all['Position'] = job_content
job_all['Experience'] = experience_content
job_all['Education'] = education_content
job_all['Salary'] = salary_content
job_all['Responsibilities and requirements'] = detail_content
job_all['Detail link'] = url_content

columns = ['Company', 'Industry', 'Position', 'Experience', 'Education', 'Salary', 'Responsibilities and requirements', 'Detail link']
df = pd.DataFrame(job_all, columns=columns)
# Writing .xlsx needs an Excel writer such as openpyxl to be installed
df.to_excel('lagou_data_operations.xlsx')
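
The long regular expression applied to the `job_request` block is the most fragile part of the code above: if the page returned is not a real detail page, `re.match` returns None and the script crashes. As an alternative (not the method used in this article), here is a sketch that collects the text of each `<span>` and strips the slashes. The HTML fragment is a hypothetical sample shaped to fit what the article's regex expects, so the real page may differ, and `parse_job_request` is a made-up helper name. The positions of salary, experience and education follow the group numbers 1, 3 and 4 used above.

```python
import re

# Hypothetical fragment shaped like what the article's regex expects
SAMPLE = """<dd class="job_request">
<h3>
<span class="salary">8k-12k </span>
<span>/Guangzhou /</span>
<span>1-3 years /</span>
<span>Bachelor /</span>
<span>Full-time</span>
</h3>
</dd>"""


def parse_job_request(html):
    """Return (salary, experience, education), or None if fewer than five spans are found."""
    spans = re.findall(r"<span[^>]*>(.*?)</span>", html, re.S)
    fields = [s.strip(" /\n") for s in spans]  # drop surrounding spaces and slashes
    if len(fields) < 5:
        return None
    # Positions 1, 3 and 4 mirror groups 1, 3 and 4 of the original regex
    salary, _second, experience, education, _fifth = fields[:5]
    return salary, experience, education


parsed = parse_job_request(SAMPLE)
blocked = parse_job_request("<html>verification page</html>")
```

Returning None instead of raising lets the loop skip or retry a posting when the response is not what was expected.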

WeChat official account: 「Python编程小记」. I regularly post study notes there, and follows are welcome!
