Web Scraping Practice Notes (3)

Latest recommended article published 2024-07-24 23:24:10

造血干细胞

Category: Web Scraping Practice. Tags: web scraping, python, chrome

Article link: https://blog.csdn.net/qq_43557618/article/details/123959058

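Python Web Scraping Practice Notes (3)

Scraping job postings from Lagou (lagou.com) and saving each posting's job requirements to a .txt file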

References:
Video: 路飞学城IT on Bilibili
Blog: https://blog.csdn.net/weixin_46739549/article/details/123388215

First, import the libraries we will need

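from selenium.webdriver import Chrome  # Selenium is a browser automation tool; Chrome launches and drives the browser
import time
from selenium.webdriver.common.keys import Keys  # keyboard keys, e.g. Keys.ENTER
from lxml import etree  # parses HTML so it can be queried with XPath
import json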

Open the site automatically and go to the search results page

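# Create the browser
web = Chrome()
# Open Lagou in the browser
web.get("https://www.lagou.com")

# Click the × on the homepage pop-up ad (it could also be located directly by id).
# find_element_by_xpath was removed in Selenium 4.3, so call find_element("xpath", ...);
# the string "xpath" is the value of By.XPATH.
web.find_element("xpath", '//*[@id="cboxClose"]').click()
# Pause briefly; otherwise the next step runs too soon
time.sleep(1)
# Find the search box, type "python", then press Enter
web.find_element("xpath", '//*[@id="search_input"]').send_keys('python', Keys.ENTER)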
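
The fixed one-second sleep above is a guess at how long the page needs. As a rough alternative, here is a minimal sketch of a polling wait that returns as soon as a condition holds. The helper name wait_until and its parameters are my own (hypothetical), not something from the video or the referenced blog, and it uses only the standard library.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Stand-in condition for illustration: becomes true on the third call
calls = {"count": 0}


def ready_on_third_call():
    calls["count"] += 1
    return calls["count"] >= 3


wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
print(calls["count"])  # prints 3
```

With a real browser, the condition could be something like lambda: web.find_elements("xpath", '//*[@id="search_input"]'), since find_elements returns an empty (falsy) list while the element is absent. Selenium also ships its own WebDriverWait in selenium.webdriver.support.ui for the same purpose.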

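Because of Lagou's anti-scraping measures, the approach from the video no longer works on Lagou's current page, so I followed the blog post listed in the references instead. Rather than locating the job links in the rendered page, it reads the JSON that the page embeds in its __NEXT_DATA__ script tag and builds each detail URL from the positionId.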

# Code from the original video
# alist = web.find_elements_by_xpath('//*[@id="jobList"]/div[1]/div[1]/div[1]/div[1]/div[1]/a')
# Collect the detail-page link of every posting on the current page
url_temp = "https://www.lagou.com/wn/jobs/{}.html"
html_str = web.page_source
html = etree.HTML(html_str)
json_str = html.xpath("//script[@id='__NEXT_DATA__']/text()")[0]
json_dict = json.loads(json_str)
list = []  # note: this name shadows the built-in list; it is kept because the download loop uses it
results = json_dict["props"]["pageProps"]["initData"]["content"]["positionResult"]["result"]
for job in results:  # a results page normally holds 15 postings; iterating avoids an IndexError if there are fewer
    # print(url_temp.format(job["positionId"]))
    list.append(url_temp.format(job["positionId"]))
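
To try the extraction step without opening a browser, here is a small sketch that runs the same idea on a made-up page. It swaps lxml for the standard-library html.parser so it runs without extra installs. The names NextDataParser and extract_job_urls and the two-posting sample page are hypothetical; only the JSON path and the URL template come from the code above.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_job_urls(page_source, url_template="https://www.lagou.com/wn/jobs/{}.html"):
    """Return one detail-page URL per posting found in the embedded JSON."""
    parser = NextDataParser()
    parser.feed(page_source)
    parser.close()
    if not parser.chunks:
        raise ValueError("no __NEXT_DATA__ script tag found in the page")
    data = json.loads("".join(parser.chunks))
    results = data["props"]["pageProps"]["initData"]["content"]["positionResult"]["result"]
    return [url_template.format(job["positionId"]) for job in results]


# Made-up page with two postings, for illustration only
sample_page = """<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"initData": {"content": {"positionResult": {"result": [
  {"positionId": 101}, {"positionId": 202}
]}}}}}}
</script>
</body></html>"""

print(extract_job_urls(sample_page))
# prints ['https://www.lagou.com/wn/jobs/101.html', 'https://www.lagou.com/wn/jobs/202.html']
```

In the real script, web.page_source would take the place of sample_page. If Lagou changes the JSON layout, the key lookup fails with a KeyError, which at least makes the breakage easy to spot.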

With the 15 URLs from the current page collected, open them one by one and scrape the job requirements

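import os

os.makedirs("data", exist_ok=True)  # make sure the output folder exists before writing
n = 1
for a in list:
    # Clicking the link this way did not navigate to the new page
    # web.execute_script('arguments[0].click()',a)
    temp = 'window.open("{}")'.format(a)
    web.execute_script(temp)
    # Switch windows: window_handles holds every open tab, and [-1] is the last one
    web.switch_to.window(web.window_handles[-1])
    # Extract the content ("xpath" is the value of By.XPATH; find_element_by_xpath was removed in Selenium 4.3)
    text = web.find_element("xpath", '//*[@id="job_detail"]/dd[2]').text
    # Save it to a file; UTF-8 is set explicitly so Chinese text is written correctly on any platform
    with open("data/requirement%s.txt" % n, mode='w', encoding='utf-8') as f:
        f.write(text)

    # Close the current tab
    web.close()
    # Switch back to the first (search results) tab
    web.switch_to.window(web.window_handles[0])
    time.sleep(1)
    n = n + 1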
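
The file-saving part of the loop can be tried on its own, without a browser. Below is a minimal sketch, assuming the scraped texts are already in a Python list. The function name save_requirements and the sample strings are hypothetical; the requirementN.txt naming follows the loop above.

```python
import tempfile
from pathlib import Path


def save_requirements(texts, out_dir="data"):
    """Write each text to out_dir/requirement<N>.txt (N starts at 1) and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    saved = []
    for n, text in enumerate(texts, start=1):
        target = out_path / f"requirement{n}.txt"
        target.write_text(text, encoding="utf-8")  # explicit UTF-8 for Chinese text
        saved.append(target)
    return saved


# Demo in a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    paths = save_requirements(["岗位要求 A", "Requirement B"], Path(tmp) / "data")
    print([p.name for p in paths])  # prints ['requirement1.txt', 'requirement2.txt']
```

Creating the folder up front and fixing the encoding are the two things most likely to go wrong in the original loop: open() raises FileNotFoundError when data/ does not exist, and the platform default encoding may not be able to represent Chinese characters.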