2020-09-10

研途研学

Category: python+selenium crawlers. Tags: python, selenium

Article link: https://blog.csdn.net/qq_41855454/article/details/108511656

Scraping Lagou job listings automatically with Python + selenium

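Preface:

Python offers several ways to scrape a site, for example the requests library, BeautifulSoup, and selenium. selenium drives a real browser: once the crawler script is running, the program controls the browser for the whole crawl and no manual steps are needed.

Advantages of selenium: because it drives a real browser, a crawler that waits a reasonable interval between page loads is much less likely to be flagged as a bot, and in general anything you can see in the browser can be scraped with python+selenium.

Disadvantages of selenium: because it drives a real browser, it is usually the slowest of the common scraping approaches. Its success rate is high, though, so it is a good choice for beginners and for sites that are hard to scrape.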
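The "reasonable interval" mentioned above can be sketched as a randomized pause between page loads. This is only an illustration: the helper names `jittered_delay` and `polite_sleep` and the default values are assumptions, not part of the original crawler, which uses fixed `time.sleep` calls.

```python
import random
import time


def jittered_delay(base=2.0, jitter=1.0, rng=random):
    """Return a wait time in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)


def polite_sleep(base=2.0, jitter=1.0):
    """Sleep for a randomized interval between two page loads (hypothetical helper)."""
    delay = jittered_delay(base, jitter)
    time.sleep(delay)
    return delay
```

Calling `polite_sleep()` where the crawler currently calls `time.sleep(1)` would make the request rhythm less regular, at the cost of a slower crawl.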

Editor and browser: PyCharm, Google Chrome

Approach: collect the link of each job posting first, then follow each link and scrape the detail page.
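The two-stage approach can be tried offline before any browser is involved. The sketch below is an assumption-laden illustration: it uses the standard library's `html.parser` instead of the lxml XPath queries used in the article's code, the sample markup and `example.com` URLs are made up, and `fetch_detail` is a hypothetical callback standing in for the detail-page step. Only the `position_link` class name comes from the article's code.

```python
from html.parser import HTMLParser


class PositionLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose class list contains 'position_link'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "position_link" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_links(list_html):
    """Stage 1: pull the job links out of a list page."""
    parser = PositionLinkParser()
    parser.feed(list_html)
    return parser.links


def crawl(list_html, fetch_detail):
    """Stage 2: call fetch_detail (a hypothetical callback) on each collected link."""
    return [fetch_detail(link) for link in extract_links(list_html)]


# Made-up sample markup, not real Lagou HTML.
SAMPLE_LIST_HTML = """
<ul>
  <li><a class="position_link" href="https://example.com/jobs/1.html">Python engineer</a></li>
  <li><a class="other" href="https://example.com/about">About</a></li>
  <li><a class="position_link" href="https://example.com/jobs/2.html">Crawler engineer</a></li>
</ul>
"""
```

Keeping link collection separate from detail-page handling makes each stage easy to check on its own.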

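Page analysis:

1. Overview

Search for python-related jobs, press F12 to open the developer tools, and switch to the Elements panel. Click the element-picker arrow (target 3 in the figure); now, when you hover over the data you want to scrape on the page, the matching source is highlighted. Target 4 in the figure below is the link to the job's detail page, which is shown in the second figure; the detailed job information is scraped from that page.

Figure 1: Job list page

Figure 2: Detail page

2. Detail page analysis

In the figure below, the red and purple arrows on the page correspond to the red and purple arrows in the source. The hiring company and the position name come from the block marked by arrow 1; salary, city, experience requirement, and similar fields come from the block marked by arrow 2.

Figure 3: Source analysis of the detail page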
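The fields in the block marked by arrow 2 arrive as span texts padded with whitespace and "/" separators. A minimal sketch of the clean-up, assuming the spans appear in the order salary, city, experience, education, job type (the order the article's code relies on); the sample strings are made up, and `build_job_request` is a hypothetical helper. Note that the article's code only strips the salary, while this sketch applies the same regex to every field.

```python
import re

# Field order assumed from the article's code (job_request[0] .. job_request[4]).
FIELD_NAMES = ["salary", "work_place", "work_years", "education", "work_type"]


def clean_field(text):
    """Remove all whitespace and '/' separators from one span text."""
    return re.sub(r"[\s/]", "", text)


def build_job_request(span_texts):
    """Map the first five span texts to named, cleaned fields."""
    if len(span_texts) < len(FIELD_NAMES):
        raise ValueError("expected at least %d spans" % len(FIELD_NAMES))
    return {name: clean_field(text) for name, text in zip(FIELD_NAMES, span_texts)}


# Made-up sample values in the shape the page is assumed to produce.
sample_spans = ["15k-25k ", "/上海 /", "经验3-5年 /", "本科及以上 /", "全职"]
```

Checking the span count first avoids an `IndexError` when a page is missing one of the fields.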

Code (the comments explain each step):

from selenium import webdriver
from lxml import etree
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import re

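# Wrap the crawler in a class
class LagouSpider(object):

    def __init__(self):
        # Start the Chrome browser driver
        self.driver=webdriver.Chrome()
        # Lagou search URL for python jobs
        self.url='https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='
        # Scraped positions are collected here
        self.positions=[]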

    def run(self):
        # Send the request to the server
        self.driver.get(self.url)
        while True:
            # Explicit wait: block until the "next page" button located by XPath is present
            WebDriverWait(driver=self.driver, timeout=1000).until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='pager_container']/span[last()]"))
            )
            # Parse the current page before paging, so the first page is not skipped
            source = self.driver.page_source
            self.parse_list_page(source)
            # Get the "next page" button
            next_btn = self.driver.find_element(By.XPATH, "//div[@class='pager_container']/span[last()]")
            # Stop on the last page; otherwise click through to the next page
            if "pager_next_disabled" in next_btn.get_attribute("class"):
                break
            else:
                next_btn.click()
                # Give the next page a moment to load before reading it
                time.sleep(1)

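    # Extract the job links from a list page
    def parse_list_page(self,source):
        # Parse the page source into an HTML tree so elements can be located with XPath
        html = etree.HTML(source)
        # Get the link of each job posting
        links = html.xpath("//a[@class='position_link']/@href")
        for link in links:
            # Scrape the detail page behind each link (method defined below)
            self.request_detail_page(link)
            time.sleep(1)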

    # Open a job link and scrape its detail page
    def request_detail_page(self, url):
        # Open the link in a new window
        self.driver.execute_script("window.open('%s')"%url)
        # Switch to the new window through its handle
        self.driver.switch_to.window(self.driver.window_handles[1])
        time.sleep(3)
        source = self.driver.page_source
        # Extract the fields from the detail page (method defined below)
        self.parse_detail_page(source)
        # Close the detail window and switch back to the list window
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])

    # Parse the detail page and extract the job fields
    def parse_detail_page(self,source):
        html=etree.HTML(source)
        # Locate the data to scrape with XPath
        co_name = html.xpath("//h3[@class='fl']/em[@class='fl-cn']/text()")
        po_name = html.xpath("//h1[@class='name']/text()")
        job_request = html.xpath("//dd[@class='job_request']//span")

        # Continue only if the company, the title, and all five job_request spans were found
        if co_name and po_name and len(job_request) >= 5:
            company_name=co_name[0].strip()
            position_name=po_name[0].strip()
            salary = job_request[0].xpath('.//text()')[0].strip()
            work_place = job_request[1].xpath('.//text()')[0].strip()
            # Remove whitespace and the "/" separators
            work_place = re.sub(r"[\s/]", "", work_place)
            work_years = job_request[2].xpath('.//text()')[0].strip()
            work_years = re.sub(r"[\s/]", "", work_years)
            education = job_request[3].xpath('.//text()')[0].strip()
            education = re.sub(r"[\s/]", "", education)
            work_type = job_request[4].xpath('.//text()')[0].strip()
            work_type = re.sub(r"[\s/]", "", work_type)
            job_advantage = "".join(html.xpath("//dd[@class='job-advantage']//text()")).strip()
            job_desc = "".join(html.xpath(".//dd[@class='job_bt']//text()")).strip()

            position = {
                'company_name': company_name,
                'position_name': position_name,
                'salary': salary,
                'work_place': work_place,
                'work_years': work_years,
                'education': education,
                'work_type': work_type,
                'job_advantage': job_advantage,
                'job_desc': job_desc
                }
            self.positions.append(position)
            print(position)
            print('=='*30)


# Run the crawler
if __name__ == '__main__':
    spider=LagouSpider()
    spider.run()
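The crawler above only prints each position and keeps the dictionaries in `self.positions`. One possible way to persist them, sketched here as an assumption rather than as part of the original script, is to write the list to CSV with the standard library. The function name `write_positions` is hypothetical; the column names are the dictionary keys used in `parse_detail_page`.

```python
import csv

# Column order follows the keys of the position dict built in parse_detail_page.
FIELDS = [
    "company_name", "position_name", "salary", "work_place", "work_years",
    "education", "work_type", "job_advantage", "job_desc",
]


def write_positions(positions, fileobj):
    """Write a list of position dicts to an open text file as CSV; return the row count."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(positions)
    return len(positions)
```

After `spider.run()` finishes, this could be called as `write_positions(spider.positions, f)` inside `with open("lagou_positions.csv", "w", newline="", encoding="utf-8-sig") as f:`, where the file name is only an example.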
