Scraping Lagou with Selenium ------ April 6, 2021

你很棒滴

Article link: https://blog.csdn.net/RayMand168/article/details/115467804

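Preface

I first tried scraping Lagou with requests, but that failed because I could not work out how to handle the site's cookies, so I switched to Selenium.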

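Required libraries

The code in this post uses the find_element_by_* helpers, which were removed in Selenium 4.3; on newer versions, call find_element(By.XPATH, ...) instead.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
import re
from selenium.webdriver.chrome.options import Options
# Chaojiying captcha-solving client (the chaojiying.py module from the service's Python sample)
from chaojiying import Chaojiying_Client
from selenium.webdriver.common.action_chains import ActionChains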

Logging in to Lagou

def login_lagou(url, chrome):
    chrome.get(url)
    # Close the pop-up dialog with its × button, then maximize the main window
    chrome.find_element_by_xpath('//button[@id="cboxClose"]').click()
    chrome.maximize_window()
    sleep(1)
    # Enter the account name and password
    chrome.find_element_by_xpath('//a[@data-lg-webtj-_address_id="1p4n"]').click()
    chrome.find_element_by_xpath('//input[@type="text"]').send_keys('account')
    chrome.find_element_by_xpath('//input[@type="password"]').send_keys('password', Keys.ENTER)
    sleep(1)
    # Use Chaojiying to solve the captcha shown at login
    code_el = chrome.find_element_by_xpath('//div[@class="geetest_table_box"]')
    chaojiying = Chaojiying_Client('account', 'password', 'software ID')
    # PostPic returns a dict; its 'pic_str' value is split on '|' into a list of 'x,y' strings
    dic_list = chaojiying.PostPic(code_el.screenshot_as_png, 9004)['pic_str'].split('|')
    sleep(1)
    for dic in dic_list:
        x, y = (int(v) for v in dic.split(','))
        # print(x, y)
        # Click each returned point, offset relative to the captcha element
        ActionChains(chrome).move_to_element_with_offset(code_el, x, y).click().perform()
    sleep(3)
    chrome.find_element_by_class_name('geetest_commit_tip').click()

    # Search for the keyword ('python爬虫' means "Python crawler")
    chrome.find_element_by_xpath('//input[@id="search_input"]').send_keys('python爬虫', Keys.ENTER)
    sleep(2)
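
The coordinate handling in the login step can be pulled out into a small helper, which makes it easier to check without opening a browser. This is a minimal sketch, assuming Chaojiying returns `pic_str` in the form `x1,y1|x2,y2`, which is what the `split` calls above imply. The name `parse_click_points` is hypothetical and not part of the Chaojiying client.

```python
def parse_click_points(pic_str):
    """Turn 'x1,y1|x2,y2' into [(x1, y1), (x2, y2)].

    Assumes the format implied by the login code; empty or malformed
    segments are skipped instead of raising.
    """
    points = []
    for segment in pic_str.split('|'):
        parts = segment.split(',')
        if len(parts) != 2:
            continue  # not an 'x,y' pair
        try:
            points.append((int(parts[0]), int(parts[1])))
        except ValueError:
            continue  # non-numeric coordinate
    return points
```

Inside `login_lagou`, the loop could then iterate over `parse_click_points(result['pic_str'])`, so a malformed reply from the service produces no clicks rather than an exception in the middle of the login.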

Getting the basic company information on each page

def get_one_page(chrome):
    # Catch exceptions so that a failure does not crash the whole program
    try:
        hrefs_list = chrome.find_elements_by_xpath('//a[@class="position_link"]')
        sleep(1)
        for href in hrefs_list:
            link = href.get_attribute('href')
            # href.find_element_by_tag_name('span').click()
            href.find_element_by_xpath('./h3').click()
            # Switch to the newly opened detail window and wait 1 second before reading it
            chrome.switch_to.window(chrome.window_handles[-1])
            sleep(1)
            company = chrome.find_element_by_xpath('//h4[@class="company"]').text
            name = chrome.find_element_by_xpath('//h1[@class="name"]').text
            salary = chrome.find_element_by_xpath('//h3/span[@class="salary"]').text
            jing_yan = chrome.find_element_by_xpath('//dd[@class="job_request"]//span[3]').text.strip('/')
            full_job = chrome.find_element_by_xpath('//dd[@class="job_request"]//span[5]').text.strip('/')
            job_advantage = chrome.find_element_by_xpath('//dd[@class="job-advantage"]').text
            job_detail = chrome.find_element_by_xpath('//div[@class="job-detail"]').text
            # Inventing file names is tedious, so the numeric job ID from the URL is used;
            # swap in your own naming scheme if you prefer
            title = re.search(r'(\d+)\.html', link).group(1)

            # The '公司名称' ("company name") directory must already exist
            with open(f'公司名称/{title}', mode='w', encoding='utf-8') as file:
                data = '\n'.join([company, link, name, salary, jing_yan,
                                  full_job, job_advantage, job_detail])
                file.write(data)
            print(company + ' scraped')
            sleep(2)
            # Close the detail window and return to the result list
            chrome.close()
            chrome.switch_to.window(chrome.window_handles[0])
    except Exception as e:
        print(e)
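
The file naming and writing step above can be sketched as two standalone helpers. This is an illustration under assumptions, not the author's code: the function names `job_id_from_url` and `save_job`, the `.txt` extension, and the sample URL in the usage note are all hypothetical. The regular expression escapes the dot so that only a literal `.html` matches, and the output folder is created on first use.

```python
import os
import re


def job_id_from_url(link):
    """Return the digits directly before '.html' in a job URL, or None."""
    match = re.search(r'(\d+)\.html', link)
    return match.group(1) if match else None


def save_job(out_dir, job_id, fields):
    """Write one job's fields, one per line, to out_dir/<job_id>.txt."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, job_id + '.txt')
    with open(path, mode='w', encoding='utf-8') as file:
        file.write('\n'.join(fields))
    return path
```

For a link shaped like `https://www.lagou.com/jobs/1234567.html`, `job_id_from_url` returns `'1234567'`. Returning `None` for an unexpected URL lets the caller skip that job instead of failing on `.group(1)`.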

Turning pages

def find_more_page(chrome):
    chrome.find_element_by_xpath('//span[@action="next"]').click()
    # Sleep generously here; otherwise the captcha keeps coming back, which gets very annoying
    sleep(4)
    get_one_page(chrome)
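
One possible variation on the fixed pause is to add a random extra delay so that page turns are not perfectly regular. Whether this reduces how often Lagou shows the captcha is an assumption, not something tested in this post. The helper name `polite_sleep` and its default values are hypothetical.

```python
import random
import time


def polite_sleep(base=4.0, jitter=2.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used. The `sleep` argument is injectable
    so the helper can be exercised without actually waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` in place of `sleep(4)` in `find_more_page` keeps the minimum wait of four seconds and adds up to two more.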

Main program

if __name__ == '__main__':
    opt = Options()
    # Headless mode (left disabled here)
    # opt.add_argument('--headless')
    # opt.add_argument('--disable-gpu')
    # Standard flag for evading automation detection
    opt.add_argument('--disable-blink-features=AutomationControlled')

    i = 1
    chrome = webdriver.Chrome(options=opt)
    url = 'https://www.lagou.com'
    login_lagou(url, chrome)
    get_one_page(chrome)
    while i < 5:
        print('Page %d finished' % i)
        find_more_page(chrome)
        i += 1
        sleep(3)
    # quit() closes every window and ends the driver session
    chrome.quit()
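
The page loop in the main program can also be written as a small driver function that takes the two actions as callables, which makes the stopping logic testable without a browser. This is a sketch under assumptions: `crawl_pages` is a hypothetical name, and it expects scraping and clicking "next" to be separate steps, unlike `find_more_page`, which does both.

```python
def crawl_pages(scrape_page, go_to_next_page, max_pages):
    """Scrape up to max_pages pages and return how many were scraped.

    scrape_page() handles the current page. go_to_next_page() moves on;
    returning False or raising means there is no next page.
    """
    done = 0
    while done < max_pages:
        scrape_page()
        done += 1
        if done == max_pages:
            break  # do not click "next" after the last wanted page
        try:
            if go_to_next_page() is False:
                break
        except Exception as exc:  # e.g. the "next" button was not found
            print('Stopping after page %d: %s' % (done, exc))
            break
    return done
```

With Selenium, the first argument could be `lambda: get_one_page(chrome)` and the second a function that only clicks the "next" button and sleeps.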

Notes

If you know a better way to do any of this, feel free to share it in the comments.
