A Brief Selenium + Lagou Scraping Example: A Beginner's Guide

Latest recommended article published on 2022-10-18 16:26:11

milk_and_bread


Article link: https://blog.csdn.net/milk_and_bread/article/details/98066296


Let's start by pasting the code first (ง •_•)ง

# encoding: utf-8

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from lxml import etree
import time
import re
import csv


class LagouSpider(object):
    # Scrapes Python job postings from lagou.com with Selenium


    def __init__(self):
        # Create the driver; no path is passed, so chromedriver must be discoverable (e.g. on PATH)
        self.driver = webdriver.Chrome()
        self.positions = []  # buffered job postings waiting to be written
        self.fp = open('lagou.csv', 'a', encoding='utf-8', newline='')
        self.writer = csv.DictWriter(self.fp,
                                     ['title', 'salary', 'city', 'work_years', 'education', 'company',
                                      'company_website', 'desc', 'origin_url'])
        self.writer.writeheader()

    def run(self):
        # Entry point: walk through every results page
        url = 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput='  # search results for "python"
        self.driver.get(url)  # load the first results page
        while True:  # loop until the last page has been parsed
            # Explicit wait: returns as soon as the next-page button is present
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located((By.XPATH, "//span[contains(@class,'pager_next')]"))
            )
            resource = self.driver.page_source
            self.parse_list_page(resource)
            next_btn = self.driver.find_element(By.XPATH, "//span[contains(@class,'pager_next')]")
            if "pager_next_disabled" in next_btn.get_attribute('class'):
                break
            next_btn.click()
            time.sleep(1)

    def parse_list_page(self, resource):
        html = etree.HTML(resource)
        links = html.xpath("//a[@class='position_link']/@href")
        for link in links:
            self.parse_detail_page(link)
            time.sleep(1)

    def parse_detail_page(self, url):

        self.driver.execute_script("window.open('" + url + "')")
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located((By.XPATH, "//dd[@class='job_bt']"))
        )
        resource = self.driver.page_source
        html = etree.HTML(resource)
        title = html.xpath("//span[@class='name']/text()")[0]
        company = "".join(html.xpath("//h2[@class='fl']/em/text()")).strip()
        job_request_span = html.xpath("//dd[@class='job_request']//span")
        salary = job_request_span[0].xpath(".//text()")[0]
        salary = salary.strip()
        city = job_request_span[1].xpath(".//text()")[0]
        city = re.sub(r"[/\s]", "", city)
        work_years = job_request_span[2].xpath(".//text()")[0]
        work_years = re.sub(r"[/\s]", "", work_years)
        education = job_request_span[3].xpath(".//text()")[0]
        education = re.sub(r"[/\s]", "", education)
        company_website = html.xpath("//ul[@class='c_feature']/li[last()]/a/@href")[0]
        position_desc = "".join(html.xpath("//dd[@class='job_bt']/div//text()"))
        position = {
            'title': title,
            'city': city,
            'salary': salary,
            'company': company,
            'company_website': company_website,
            'education': education,
            'work_years': work_years,
            'desc': position_desc,
            'origin_url': url
        }
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])
        self.write_position(position)

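    def write_position(self, position):
        # Buffer rows and write them to the CSV file in batches of 100
        self.positions.append(position)
        print(position)
        if len(self.positions) >= 100:
            self.flush()

    def flush(self):
        # Write any buffered rows to the file
        self.writer.writerows(self.positions)
        self.positions.clear()

    def close(self):
        # Write the remaining rows, then release the file and the browser
        self.flush()
        self.fp.close()
        self.driver.quit()


def main():
    spider = LagouSpider()
    try:
        spider.run()
    finally:
        spider.close()


if __name__ == '__main__':
    main()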

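Below is a brief walkthrough of the code, to help you get started with a Selenium project more quickly.

Note: the numbers below are line numbers in the code above.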

17-25: Initialization. This part works with a CSV file; if that is unfamiliar, see https://mp.csdn.net/postedit/97777771
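
As a rough illustration of the initialize-then-write-in-batches pattern, here is a minimal standard-library sketch. `BufferedCsvWriter`, `batch_size`, and the sample rows are hypothetical names and data chosen for this example; they are not part of the spider above.

```python
import csv
import io


class BufferedCsvWriter:
    """Collects dict rows and writes them to a CSV stream in batches."""

    def __init__(self, stream, fieldnames, batch_size=100):
        self.writer = csv.DictWriter(stream, fieldnames=fieldnames)
        self.writer.writeheader()
        self.batch_size = batch_size
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self):
        # Write whatever is buffered, including a final partial batch
        self.writer.writerows(self.rows)
        self.rows.clear()


def demo():
    stream = io.StringIO()  # stands in for an open file
    out = BufferedCsvWriter(stream, ['title', 'city'], batch_size=2)
    out.add({'title': 'Python developer', 'city': 'Beijing'})
    out.add({'title': 'Backend engineer', 'city': 'Shanghai'})  # triggers a flush
    out.add({'title': 'Data engineer', 'city': 'Shenzhen'})     # stays buffered
    out.flush()                                                  # final flush
    return stream.getvalue()
```

Note that every key in a row must appear in `fieldnames`; otherwise `csv.DictWriter` raises `ValueError` when the row is written. The final `flush()` matters too: without it, a last batch smaller than `batch_size` would never reach the file.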

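27-42: The overall design is: open the URL that contains the keyword python, locate the next-page control, simulate a click on it, and iterate through the pages.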

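33: This is an explicit wait. An explicit wait does not have to run for the full 10 seconds: it ends early as soon as the condition is satisfied, and if the time runs out without the condition being met it raises an exception (`TimeoutException`).

WebDriverWait(): pass in the current driver instance and set the timeout to 10 seconds. The until() method can be read as "until such-and-such holds", and the condition function EC.presence_of_element_located() can be read as "expect the located element, i.e. the control, to be present". Here the control is located by XPath.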
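
To make the "return early or raise on timeout" behaviour concrete, here is a small pure-Python sketch of a polling wait. It only illustrates the idea and is not Selenium's implementation; `wait_until`, `WaitTimeout`, and the `poll` parameter are hypothetical names for this example.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is still falsy after the timeout (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: return early instead of waiting out the timeout
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


def demo():
    calls = {'n': 0}

    def ready_on_third_call():
        # Simulates an element that only appears after a short delay
        calls['n'] += 1
        return 'element' if calls['n'] >= 3 else None

    found = wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
    return found, calls['n']
```

The return value of the condition is passed back to the caller, which mirrors how `until()` hands back the located element once it is present.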

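36-37: Get the page HTML and hand it to our own function `parse_list_page`, which processes the HTML content.

38: A common way to get an HTML element, here by XPath. There are also locators by class, id, and so on, which are not covered here. O(∩_∩)O

39: A neat trick: how do we know there is no next page? Only when the next-page control is greyed out, that is, when its class attribute contains `"pager_next_disabled"`.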
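
The stop condition can be sketched without a browser. The example below assumes the button's class attribute holds several class names (for instance `pager_next pager_next_disabled`), which is why a containment check is used instead of equality. `is_last_page`, `crawl_pages`, and the page data are hypothetical stand-ins for the live site.

```python
def is_last_page(class_attr):
    """Return True if the next-page button's class list marks it as disabled."""
    return 'pager_next_disabled' in class_attr.split()


def crawl_pages(pages):
    """Visit pages in order and stop after the one whose next button is disabled.

    `pages` stands in for the live site: a list of (page_name, next_button_class).
    """
    visited = []
    for name, next_class in pages:
        visited.append(name)          # "parse" the current page first
        if is_last_page(next_class):  # then decide whether to click next
            break
    return visited


def demo():
    pages = [
        ('page1', 'pager_next '),
        ('page2', 'pager_next '),
        ('page3', 'pager_next pager_next_disabled'),
        ('page4', 'pager_next '),  # never reached
    ]
    return crawl_pages(pages)
```

The order matters: the current page is parsed before the check, so the last page is still processed before the loop ends.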

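44-49: Uses XPath to extract the links to all job detail pages and passes each one to our own function `parse_detail_page`, which parses the detail page. Nothing more to say here.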
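
The link extraction can be tried on a small snippet without a browser. To stay dependency-free, this sketch swaps in the standard library's `xml.etree.ElementTree` for `lxml`. ElementTree supports only a subset of XPath and needs well-formed markup, so the `href` is read with `.get('href')` rather than selected with `/@href`. The sample markup and the `extract_links` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a results page
SAMPLE = """
<ul>
  <li><a class="position_link" href="https://example.com/jobs/1.html">Job 1</a></li>
  <li><a class="other" href="https://example.com/about.html">About</a></li>
  <li><a class="position_link" href="https://example.com/jobs/2.html">Job 2</a></li>
</ul>
"""


def extract_links(markup):
    """Return the href of every <a> whose class is exactly 'position_link'."""
    root = ET.fromstring(markup.strip())
    return [a.get('href') for a in root.findall(".//a[@class='position_link']")]
```

Each returned link would then be handed to the detail-page parser one at a time, as the loop in `parse_list_page` does.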

55-86: Parses the detail page at the given URL and extracts the data.
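
The field clean-up in this part is mostly `re.sub(r"[/\s]", "", text)`, which removes slashes and every whitespace character. A minimal sketch follows; the sample strings and the `clean_field` and `parse_job_request` helpers are hypothetical, and the span order (salary, city, work years, education) is assumed from the code above.

```python
import re


def clean_field(text):
    """Remove slashes and all whitespace, as the detail-page parser does."""
    return re.sub(r"[/\s]", "", text)


def parse_job_request(spans):
    """Map the first four span texts to named fields."""
    salary, city, work_years, education = spans[:4]
    return {
        'salary': salary.strip(),  # salary is only stripped, not regex-cleaned
        'city': clean_field(city),
        'work_years': clean_field(work_years),
        'education': clean_field(education),
    }


def demo():
    # Made-up raw texts in the "/value /" shape the regex is meant to clean
    return parse_job_request(['15k-25k ', '/Beijing /', '3-5 years /', 'Bachelor /'])
```

Because `\s` matches interior whitespace as well, a value such as `3-5 years` comes out as `3-5years`. That is harmless for text without internal spaces, but worth knowing if the fields ever contain them.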

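53: Selenium can execute JavaScript. Here `driver.execute_script` runs a JS command that opens a new window.

54: The previous line opened a new window, but has our driver moved to it? It has not. The driver is not that smart, so we have to tell it where to go. The first window is `driver.window_handles[0]` and the second is `driver.window_handles[1]`, and `driver.switch_to.window` performs the switch. Later on you may also need to switch into frames, not only between windows.

84-85: To keep things manageable we close the detail window, so that at most two windows are ever open, and then switch back to the first window.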
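
The open, switch, close, switch-back sequence can be wrapped in a context manager so the driver always returns to the list page, even if parsing fails. This is only a sketch: `detail_window` is a hypothetical helper, and it assumes the newly opened window is the last entry in `window_handles`, much as the original code assumes it is at index 1. It uses only driver attributes that Selenium provides (`current_window_handle`, `window_handles`, `execute_script`, `switch_to.window`, `close`).

```python
import json
from contextlib import contextmanager


@contextmanager
def detail_window(driver, url):
    """Open `url` in a new window, switch to it, and always return to the original."""
    original = driver.current_window_handle
    # json.dumps quotes the URL safely for use inside the JavaScript snippet
    driver.execute_script("window.open(" + json.dumps(url) + ")")
    # Assumption: the newest window is the last handle in the list
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        driver.close()                     # closes only the current (detail) window
        driver.switch_to.window(original)  # focus goes back to the list page
```

With a real driver this might be used as `with detail_window(driver, link): resource = driver.page_source`, with the parsing done inside the `with` block.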

The rest of the code deals with CSV handling; see https://mp.csdn.net/postedit/97777771

A simple practice exercise for beginners: https://mp.csdn.net/postedit/97920241