Scraping job listings from 国际人才网 (job001.cn) with Python requests and saving them to a CSV file

I've recently been working on a project that needs job listing data for analysis, but I couldn't find a suitable dataset anywhere, so I decided to scrape one myself. After trying several job sites, I finally found one that places no restrictions on crawlers, so I'm sharing the approach here. (I'm a beginner and taught myself scraping; the code isn't pretty, but it runs.)

1. Open the site and search for the jobs you want to scrape

(Here we search for “工程师” (engineer) as an example.)

2. Use F12 to find the User-Agent and Cookie, then request the pages

(Using the Edge browser as an example.)

1. Import the libraries

import requests
import os
from lxml import etree
import csv

2. Use a for loop to request the pages and save them to an html folder under the current directory. Saving the pages locally keeps the request count down, so the site is less likely to ban our IP.

Paging through the results shows that only the pageNo=1 part of the URL changes, so this parameter is the page number.

The paging buttons show that the maximum page number is 300.

Code for requesting the pages (the Cookie is omitted here; fill in your own):

# Folder for the downloaded pages
folder_path = './html'
# Create the folder if it does not exist yet
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

# Request pages 1 to 300 of the job listings
for i in range(1,301):
    # The request URL; pageNo={i} is the page-number parameter found earlier
    url = "https://www.job001.cn/jobs?keyType=0&keyWord=&jobTypeId=&jobType=%E8%81%8C%E4%BD%8D%E7%B1%BB%E5%9E%8B&industry=&industryname=%E8%A1%8C%E4%B8%9A%E7%B1%BB%E5%9E%8B&workId=&workPlace=&salary=&salaryType=&entType=&experience=&education=&entSize=&benefits=&reftime=&workTypeId=&sortField=&pageNo={i}&curItem=&searchType=1".format(i=i)
    # Request headers that imitate a normal browser (add your own "Cookie" entry here)
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0"
    }
    # Request the page
    response = requests.get(url, headers=header).text
    # Write the page content to a local file ("with" closes the file automatically)
    with open("./html/job{i}.txt".format(i=i), "w", encoding="utf-8") as file:
        file.write(response)
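If the site does start throttling you despite the pages being saved locally, one simple precaution (my own suggestion, not part of the original code) is to pause briefly between requests:

import time

for i in range(1, 301):
    # ... request and save page i as above ...
    time.sleep(1)  # wait one second between requests to go easy on the server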

3. Use F12 to find the XPath of each field and extract the listings

Locate the first job listing.

Copy its full XPath.

Comparing the XPath of the first listing with the second, the index of the fourth div is what changes, and the first listing on each page starts at div[2], so this is the part we vary in the loop.

Locate the other fields in the same way.

When locating the salary, work experience and education requirement, the text turns out not to be wrapped in its own tag. We can split the extracted text on \n and pick out the experience and education values we want.

For example:

diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
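As a quick illustration of the split("\n") trick (the string below is made up; the real layout of the text node depends on the page):

# Made-up text node: several values separated by newlines and indentation
raw = "\n        面议\n        3-5年\n    "
parts = raw.split("\n")          # ['', '        面议', '        3-5年', '    ']
print(parts[2].strip())          # '3-5年' -- the same indexing idea as split("\n")[2] above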

4. Loop over every page and collect all the fields we need

Partial code:

# Loop over all 300 saved pages
for j in range(1,301):
    filename = f"./html/job{j}.txt"
    with open(filename, "r", encoding="utf-8") as file:
        content = file.read()
        tree = etree.HTML(content)
    # 20 listings per page; the div index runs from 2 to 21
    for i in range(2,22):
        try:
            zhiwei = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/a/text()")[0]
            diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
            xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
            jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
            xueli = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[2].strip()
            gongsi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dt/a/text()")[0].split("\n")[1].strip()
            ziben = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[1]/text()")[0].strip()
            renshu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[2]/text()")[0].strip()
            hangye = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[3]/text()")[0].strip()
        except IndexError:
            # Some listings are missing fields; skip them
            continue

Because the listings on this site are not all structured identically, the program errors out on some of them. For example, in the screenshot below the company has no headcount or industry information, and the program would stop right there. The code therefore wraps the extraction in try/except so that one or two bad listings don't abort the whole run.
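As an alternative to skipping the whole listing, missing fields could fall back to an empty string. The helper below is my own sketch of that idea (safe_xpath is not part of the original code) and assumes the XPath ends in text():

def safe_xpath(tree, xpath, default=""):
    # Return the first matching text node stripped of whitespace, or the default when nothing matches
    result = tree.xpath(xpath)
    return result[0].strip() if result else default

# Hypothetical usage: a missing industry field becomes "" instead of raising IndexError
# hangye = safe_xpath(tree, "/html/body/div[17]/.../dd[1]/span[3]/text()")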

5. Print the fields to the console for testing, then write them to a CSV file

Partial code:

        print(zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye)
        print(j, i)
        print("========================")
        row = [zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye]
        rows.append(row)

with open('BigDataJobs.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)
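One optional tweak, if you plan to open the CSV directly in Excel: a plain utf-8 file without a BOM may display the Chinese text as garbled characters there, so writing with utf-8-sig is a common workaround:

# Same write as above, but utf-8-sig adds a BOM so Excel detects the encoding
with open('BigDataJobs.csv', 'w', newline='', encoding='utf-8-sig') as file:
    writer = csv.writer(file)
    writer.writerows(rows)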

Sample console output while running:

The resulting CSV file:

The final file still contains a small amount of dirty data, which can be cleaned up afterwards by other means.
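As one possible cleaning step (this assumes pandas is installed and is not part of the original script), duplicates and rows with missing values could be dropped like this:

import pandas as pd

# Minimal cleaning sketch: drop exact duplicates and rows with empty fields
df = pd.read_csv('BigDataJobs.csv')
df = df.drop_duplicates().dropna()
df.to_csv('BigDataJobs_clean.csv', index=False, encoding='utf-8-sig')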

6. Complete code

import requests
import os
from lxml import etree
import csv

# Folder for the downloaded pages
folder_path = './html'
# Create the folder if it does not exist yet
if not os.path.exists(folder_path):
    os.makedirs(folder_path)

# Request pages 1 to 300 of the job listings
for i in range(1,301):
    # pageNo={i} in the URL below is the page-number parameter found earlier
    url = "https://www.job001.cn/jobs?keyType=0&keyWord=&jobTypeId=&jobType=%E8%81%8C%E4%BD%8D%E7%B1%BB%E5%9E%8B&industry=&industryname=%E8%A1%8C%E4%B8%9A%E7%B1%BB%E5%9E%8B&workId=&workPlace=&salary=&salaryType=&entType=&experience=&education=&entSize=&benefits=&reftime=&workTypeId=&sortField=&pageNo={i}&curItem=&searchType=1".format(i=i)
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0",
        "Cookie":"SITE=index; JSESSIONID=CFCD49CD7022AF3FB18DD40C09BF7A3F; __qc_wId=782; __qc__k=; _gid=GA1.2.1883802686.1720405462; Qs_lvt_167992=1720405461; Hm_lvt_e3b0ab92511ce1f46960bed57f65f532=1720405462; HMACCOUNT=0BA76AA9CA29E336; _ga_2WSGNKF5RD=GS1.1.1720405461.1.1.1720405667.0.0.0; _ga=GA1.1.1003479441.1720405462; Hm_lpvt_e3b0ab92511ce1f46960bed57f65f532=1720405667; Qs_pv_167992=4511810421623466000%2C2170381492033848000%2C4333446900271874600%2C1578508563396953600%2C387265381149986940; mediav=%7B%22eid%22%3A%22205165%22%2C%22ep%22%3A%22%22%2C%22vid%22%3A%22I-sb%5B3HRSU9T%5BY%25J(7%3C-%22%2C%22ctn%22%3A%22%22%2C%22vvid%22%3A%22I-sb%5B3HRSU9T%5BY%25J(7%3C-%22%2C%22_mvnf%22%3A1%2C%22_mvctn%22%3A0%2C%22_mvck%22%3A0%2C%22_refnf%22%3A1%7D; tfstk=f7uq1yAlWELqK5ogU8aaLJkmslzY5ypC_Vw_sfcgG-2clZ6i4xkJ6R4gm0yZU8h_hhefM0HItZgXDP_akPUMdpTBRIhYWPAQ0EeuHbc8s_01XZlxMPC-LeVS9jIwo4gooPDgraV31PX0SoAzr5PUSZqGogXue54gIAqiZbVbgOV0INVoTIcmsaPQmBX0RTTubSr4LNwP7NR_goyni8JHK48T0Jc0UNvCRXbuQ83HF3l-BmDbwA8eUPH-jAVZETv_DfmrUSkXLOeskjgawj5dJjqa300041biZryrpck23iemcYrLZufwJ7MQEj3m4CBoivw4o7qW-Bluj0gYvq9luymKGriiBHs8ufm0zgzGB7vxvVnVIGr0w7yBaQv5DA2NVBFcqGITqyFzdINfXGE0w7yBaQSOXuVLaJObG"
    }
    response = requests.get(url, headers=header).text

    # Write the page content to a local file ("with" closes the file automatically)
    with open("./html/job{i}.txt".format(i=i), "w", encoding="utf-8") as file:
        file.write(response)

# CSV header row (position, region, salary, experience, education, company, capital type, headcount, industry)
title = ['职位', '地区', '薪资', '经验','学历','公司','资本类型','企业人数','行业']
rows = [title]

# Read back each saved page and extract every listing
for j in range(1,301):
    filename = f"./html/job{j}.txt"
    with open(filename, "r", encoding="utf-8") as file:
        content = file.read()
        tree = etree.HTML(content)
    # 20 listings per page; the div index runs from 2 to 21
    for i in range(2,22):
        try:
            zhiwei = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/a/text()")[0]
            diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
            xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
            jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
            xueli = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[2].strip()
            gongsi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dt/a/text()")[0].split("\n")[1].strip()
            ziben = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[1]/text()")[0].strip()
            renshu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[2]/text()")[0].strip()
            hangye = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[3]/text()")[0].strip()
        except IndexError:
            # Some listings are missing fields (e.g. no headcount or industry); skip them
            continue
        print(zhiwei,diqu,xinzi,jingyan,xueli,gongsi,ziben,renshu,hangye)
        print(j,i)
        print("========================")
        row = [zhiwei, diqu, xinzi, jingyan, xueli, gongsi,ziben,renshu,hangye]
        rows.append(row)

with open('BigDataJobs.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)