I recently needed job-posting data for a project, but after searching around I could not find a suitable dataset, so I decided to scrape it myself. After trying several job sites I finally found one that does not restrict crawlers, and I am sharing the approach here. (I am a beginner who taught myself scraping — the code is not elegant, but it runs.)
Step 1: Open the site and search for the jobs you want to scrape
(This walkthrough uses a search for “工程师” (engineer) as the example.)
Step 2: Use F12 to find the User-Agent and Cookie, then request the pages
(Microsoft Edge is used as the example browser.)
1. Import the libraries
import requests
import os
from lxml import etree
import csv
2. Request each page in a for loop and save it into an html folder under the current directory. Saving the pages locally keeps the request count down, so the site is less likely to ban your IP.
Paging through the results shows that only the pageNo=1 part of the URL changes, so that parameter is the page number.
The pager buttons show that there are at most 300 pages.
Code for requesting the pages (the Cookie is omitted here; fill in your own):
# Folder for the saved pages
folder_path = './html'
# Create the folder if it does not already exist
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
# Loop over pages 1 to 300 of the job listings
for i in range(1, 301):
    # Request URL; pageNo is the part that changes between pages
    url = "https://www.job001.cn/jobs?keyType=0&keyWord=&jobTypeId=&jobType=%E8%81%8C%E4%BD%8D%E7%B1%BB%E5%9E%8B&industry=&industryname=%E8%A1%8C%E4%B8%9A%E7%B1%BB%E5%9E%8B&workId=&workPlace=&salary=&salaryType=&entType=&experience=&education=&entSize=&benefits=&reftime=&workTypeId=&sortField=&pageNo={i}&curItem=&searchType=1".format(i=i)
    # Request headers that disguise the script as a normal browser
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0"
    }
    # Request the page
    response = requests.get(url, headers=header).text
    # Write the page content to a file; `with` closes it automatically
    with open("./html/job{i}.txt".format(i=i), "w", encoding="utf-8") as file:
        file.write(response)
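As a side note, the very long query string above can be assembled programmatically instead of being edited by hand. A minimal sketch using only the standard library (only the parameters that matter for paging are shown; on this site the remaining ones default to empty):

```python
from urllib.parse import urlencode

BASE_URL = "https://www.job001.cn/jobs"

def build_page_url(page_no):
    """Build the listing URL for a given page; pageNo is the only
    query parameter that changes from page to page."""
    params = {
        "keyType": 0,
        "keyWord": "",       # empty keyword, as in the URL above
        "pageNo": page_no,   # the paging parameter
        "searchType": 1,
    }
    return BASE_URL + "?" + urlencode(params)

print(build_page_url(5))
```

This keeps the loop body readable and makes it obvious which parameter drives the pagination.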
Step 3: Use F12 to find each field's XPath and extract the job details
Locate the first job listing on the page
Copy its full XPath
Comparing the XPaths of the first and second listings shows that only the fourth div index differs, and the first listing on every page starts at div[2], so that index is the one to vary in the loop.
Locate the other fields the same way.
When locating salary, work experience, and education, the text turns out not to sit inside its own tag. We can split the extracted text on \n to pull out the experience and education values we want.
Example:
diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
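To see why the split("\n")[n] indexing works, here is the same trick applied to a made-up text node shaped like the ones on the page (the sample string below is hypothetical, not copied from the site):

```python
def nth_field(text, n):
    """Return the n-th newline-separated field of a text node, stripped."""
    return text.split("\n")[n].strip()

# Hypothetical salary text node: a leading newline, then a label line,
# then the value, so index 2 holds the part we want.
sample = "\n  月薪  \n  8千-1.2万/月  \n"
print(nth_field(sample, 2))  # 8千-1.2万/月
```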
Step 4: Loop over every page and collect all the job fields
Partial code:
# Loop over all 300 saved pages
for j in range(1, 301):
    filename = f"./html/job{j}.txt"
    with open(filename, "r", encoding="utf-8") as file:
        content = file.read()
    tree = etree.HTML(content)
    # Listings on each page sit in div[2] through div[21] (20 per page)
    for i in range(2, 22):
        try:
            zhiwei = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/a/text()")[0]
            diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
            xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
            jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
            xueli = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[2].strip()
            gongsi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dt/a/text()")[0].split("\n")[1].strip()
            ziben = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[1]/text()")[0].strip()
            renshu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[2]/text()")[0].strip()
            hangye = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[3]/text()")[0].strip()
        except IndexError:
            # Some listings are missing fields (e.g. headcount or industry); skip them
            continue
Because the listings on this site are not perfectly uniform, some records raise errors while the script runs. For example, the company in the screenshot below has no headcount or industry information; without protection the program would stop right there, so the code wraps the extraction in a try/except block to keep one or two bad records from killing the whole run.
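An alternative to skipping the whole record is to substitute a placeholder for just the missing field. The helper below is a sketch of that idea (the function name and the "未知" placeholder are my own, not from the original code); it takes the list that tree.xpath(...) returns:

```python
def first_or_default(nodes, default="未知"):
    """Return the first matched text, stripped of whitespace,
    or a default when the XPath matched nothing."""
    return nodes[0].strip() if nodes else default

# Usage inside the loop (XPath abbreviated):
# hangye = first_or_default(tree.xpath(".../dd[1]/span[3]/text()"))
print(first_or_default(["  制造业  "]))  # 制造业
print(first_or_default([]))              # 未知
```

This way a listing with no industry info still lands in the CSV instead of being dropped.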
Step 5: Print the records to the console for testing, then write them to a CSV file
Partial code:
        # Inside the inner loop, after the try/except block:
        print(zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye)
        print(j, i)
        print("========================")
        row = [zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye]
        rows.append(row)

# After both loops finish, write all collected rows to the CSV file
with open('BigDataJobs.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)
What a run looks like:
The output file:
The final file still contains a small amount of dirty data, which can be cleaned up separately afterwards.
Step 6: Complete code
import requests
import os
from lxml import etree
import csv

# Folder for the saved pages
folder_path = './html'
# Create the folder if it does not already exist
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
# Loop over pages 1 to 300 of the job listings
for i in range(1, 301):
    url = "https://www.job001.cn/jobs?keyType=0&keyWord=&jobTypeId=&jobType=%E8%81%8C%E4%BD%8D%E7%B1%BB%E5%9E%8B&industry=&industryname=%E8%A1%8C%E4%B8%9A%E7%B1%BB%E5%9E%8B&workId=&workPlace=&salary=&salaryType=&entType=&experience=&education=&entSize=&benefits=&reftime=&workTypeId=&sortField=&pageNo={i}&curItem=&searchType=1".format(i=i)
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Edg/126.0.0.0",
        "Cookie": "SITE=index; JSESSIONID=CFCD49CD7022AF3FB18DD40C09BF7A3F; __qc_wId=782; __qc__k=; _gid=GA1.2.1883802686.1720405462; Qs_lvt_167992=1720405461; Hm_lvt_e3b0ab92511ce1f46960bed57f65f532=1720405462; HMACCOUNT=0BA76AA9CA29E336; _ga_2WSGNKF5RD=GS1.1.1720405461.1.1.1720405667.0.0.0; _ga=GA1.1.1003479441.1720405462; Hm_lpvt_e3b0ab92511ce1f46960bed57f65f532=1720405667; Qs_pv_167992=4511810421623466000%2C2170381492033848000%2C4333446900271874600%2C1578508563396953600%2C387265381149986940; mediav=%7B%22eid%22%3A%22205165%22%2C%22ep%22%3A%22%22%2C%22vid%22%3A%22I-sb%5B3HRSU9T%5BY%25J(7%3C-%22%2C%22ctn%22%3A%22%22%2C%22vvid%22%3A%22I-sb%5B3HRSU9T%5BY%25J(7%3C-%22%2C%22_mvnf%22%3A1%2C%22_mvctn%22%3A0%2C%22_mvck%22%3A0%2C%22_refnf%22%3A1%7D; tfstk=f7uq1yAlWELqK5ogU8aaLJkmslzY5ypC_Vw_sfcgG-2clZ6i4xkJ6R4gm0yZU8h_hhefM0HItZgXDP_akPUMdpTBRIhYWPAQ0EeuHbc8s_01XZlxMPC-LeVS9jIwo4gooPDgraV31PX0SoAzr5PUSZqGogXue54gIAqiZbVbgOV0INVoTIcmsaPQmBX0RTTubSr4LNwP7NR_goyni8JHK48T0Jc0UNvCRXbuQ83HF3l-BmDbwA8eUPH-jAVZETv_DfmrUSkXLOeskjgawj5dJjqa300041biZryrpck23iemcYrLZufwJ7MQEj3m4CBoivw4o7qW-Bluj0gYvq9luymKGriiBHs8ufm0zgzGB7vxvVnVIGr0w7yBaQv5DA2NVBFcqGITqyFzdINfXGE0w7yBaQSOXuVLaJObG"
    }
    response = requests.get(url, headers=header).text
    # Write the page content to a file; `with` closes it automatically
    with open("./html/job{i}.txt".format(i=i), "w", encoding="utf-8") as file:
        file.write(response)

# Column headers for the CSV file
title = ['职位', '地区', '薪资', '经验', '学历', '公司', '资本类型', '企业人数', '行业']
rows = [title]
# Read back the saved page files and extract the fields
for j in range(1, 301):
    filename = f"./html/job{j}.txt"
    with open(filename, "r", encoding="utf-8") as file:
        content = file.read()
    tree = etree.HTML(content)
    # Listings on each page sit in div[2] through div[21] (20 per page)
    for i in range(2, 22):
        try:
            zhiwei = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/a/text()")[0]
            diqu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dt/div[1]/span[1]/span/text()")[0]
            xinzi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/span[1]/text()")[0].split("\n")[2].strip()
            jingyan = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[1].strip()
            xueli = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[1]/dl/dd[1]/text()")[2].strip()
            gongsi = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dt/a/text()")[0].split("\n")[1].strip()
            ziben = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[1]/text()")[0].strip()
            renshu = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[2]/text()")[0].strip()
            hangye = tree.xpath("/html/body/div[17]/div[1]/div[1]/div[" + str(i) + "]/div[1]/div[2]/dl/dd[1]/span[3]/text()")[0].strip()
        except IndexError:
            # Some listings are missing fields; skip them
            continue
        print(zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye)
        print(j, i)
        print("========================")
        row = [zhiwei, diqu, xinzi, jingyan, xueli, gongsi, ziben, renshu, hangye]
        rows.append(row)

# Write all collected rows to the CSV file
with open('BigDataJobs.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(rows)