这次要爬取拉勾网。拉勾网的反爬做得还是很不错的:因为目标网站是Ajax交互的,我一开始是直接分析json接口来爬取的,但是真的很麻烦,请求头一旦出点问题就会被识别出来。后续我就改了一下方法,用selenium来模拟浏览器去获取。
思路嘛大概就是 获取主页的源代码——从中获取详情页的url——在去解析 先围绕这三步来写
这里我们已经获取到了主页的源代码
from selenium import webdriver
import requests
from selenium.webdriver import ChromeOptions #这个包用来规避被检测的风险
from lxml import etree
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import re
class lagouSpitder(object):
    """Selenium-based spider for lagou.com Python job listings.

    Drives a real Chrome browser (instead of calling the Ajax/JSON API
    directly) so the site's anti-scraping checks see ordinary browser
    traffic. Two evasion measures are applied: Chrome's automation
    switches are disabled, and a CDP-injected script hides
    ``navigator.webdriver``.

    NOTE(review): the class name looks like a typo for ``lagouSpider``;
    kept as-is because callers reference this exact name.
    """

    # Strip the "controlled by automated test software" markers that
    # Selenium-launched Chrome normally exposes.
    option = webdriver.ChromeOptions()
    option.add_experimental_option('useAutomationExtension', False)
    option.add_experimental_option('excludeSwitches', ['enable-automation'])
    driver_path = r'驱动路径'  # placeholder: path to the chromedriver executable

    def __init__(self):
        # Launch Chrome with the anti-detection options configured above.
        self.driver = webdriver.Chrome(executable_path=lagouSpitder.driver_path,
                                       options=lagouSpitder.option)
        # Run this script on every new document so navigator.webdriver
        # reads as undefined — defeats a common headless/bot detection check.
        self.driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
            """
        })
        # First listing page for the "python" keyword, all cities.
        self.url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
        self.positions = []  # parsed job postings accumulate here
        self.source = None   # rendered HTML of the last fetched listing page

    def run(self):
        """Fetch the listing page and capture its rendered HTML.

        Navigates to ``self.url`` and stores the JS-rendered page source
        on ``self.source`` (the original discarded it in a dead local),
        also returning it so a later ``parse_list_page`` step can consume
        it directly. Only the first page is fetched for now.
        """
        self.driver.get(self.url)
        self.source = self.driver.page_source
        return self.source
if __name__ == '__main__':
    # Entry point: build the spider and kick off the crawl.
    lagouSpitder().run()
接下来获取详情页的url。定义一个函数parse_list_page,显得美观,可维护性也强。