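In the end, it all comes down to finding a job: a good one, with a higher salary.
Here we scrape 51job (前程无忧) to filter Python positions nationwide and look at basic information such as the salary on offer and whether the work location suits you.
URL: https://search.51job.com
We use the requests library to fetch the raw text of each page.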
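1. Installing requests
Press Windows+R and enter cmd to open a terminal.
In the terminal, type the command below and press Enter. The script also imports lxml, which can be installed the same way.
pip install requests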
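To confirm that the install worked, the import path can be checked from Python. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of the original script.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# The two third-party packages the scraper imports.
for name in ("requests", "lxml"):
    print(name, "installed" if is_installed(name) else "missing")
```

If either package is reported as missing, run pip install for it before continuing.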
2. Scraping the data
# Import the libraries
import csv
import json
import requests
from lxml import etree
Next we build the list of URLs. Comparing the addresses of successive result pages shows that only the page number changes, so it can be filled in with format:
url = ["https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{0}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=".format(i) for i in range(1,822)]
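The same pattern can be wrapped in a small helper so that the keyword and page range are easy to change. This is only a sketch: `build_page_urls`, `BASE`, and the parameter names are hypothetical, and the query string is left out for readability, on the unverified assumption that the site accepts the URL without it.

```python
# Hypothetical helper: the names here are illustrative, not from the original script.
BASE = "https://search.51job.com/list/000000,000000,0000,00,9,99,{keyword},2,{page}.html"


def build_page_urls(keyword, first_page, last_page):
    """Return one search URL per page, from first_page to last_page inclusive."""
    # range() excludes its upper bound, hence last_page + 1
    return [BASE.format(keyword=keyword, page=page)
            for page in range(first_page, last_page + 1)]


urls = build_page_urls("python", 1, 3)
for u in urls:
    print(u)
```

Note that range(1,822) in the list above covers pages 1 through 821, because the upper bound of range is exclusive.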
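3. Parsing the page text
# URL is one entry of the url list and number is the page counter (both are set in the complete code)
try:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    }
    response = requests.get(URL, headers=headers)  # fetch the page
    t = response.text  # response body decoded to a str
    print(response, ' ', 'Fetched page:', number)
    bs = etree.HTML(t)
    # The third text/javascript script element holds the job data; take its text
    c = bs.xpath("//script[@type='text/javascript']")[2].text
    # Skip the 29 characters in front of the data and decode the rest into a dict
    # (json.loads is used instead of eval so that downloaded text is never executed)
    op = json.loads(c[29:])['engine_search_result']
except Exception as error:
    print(error)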
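The fixed offset 29 in the slice above depends on the exact length of the text in front of the data. A more tolerant sketch is to cut from the first opening brace to the last closing brace before decoding. The sample string below is made up for illustration, and the variable name `window.__SEARCH_RESULT__` in it is an assumption about what the page's script looks like; `extract_search_result` is a hypothetical helper.

```python
import json

# Made-up stand-in for the script text stored in c; the real text is far longer.
SAMPLE_SCRIPT = (
    '\r\nwindow.__SEARCH_RESULT__ = '
    '{"engine_search_result": [{"job_name": "Python Developer", '
    '"company_name": "Example Co"}]}'
)


def extract_search_result(script_text):
    """Decode the JSON object embedded in a script and return its job list."""
    start = script_text.index("{")      # first brace opens the object
    end = script_text.rindex("}") + 1   # last brace closes it
    return json.loads(script_text[start:end])["engine_search_result"]


jobs = extract_search_result(SAMPLE_SCRIPT)
print(jobs[0]["job_name"], "at", jobs[0]["company_name"])
```

Both index and rindex raise ValueError when no brace is found, so a page without the expected data fails loudly instead of producing a wrong slice.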
4. Complete code
import csv
import json
import requests
from lxml import etree

# One search URL per result page (pages 1 to 821)
url = ["https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{0}.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=".format(i) for i in range(1,822)]
f_car = open(r'H:\on.csv', 'a+', newline='')  # file that stores the scraped records
# company name, work location, company type, job title, salary, benefits, requirements, industry
header = ['company_name', 'workarea_text', 'companytype_text', 'job_name', 'providesalary_text', 'jobwelf', 'attribute_text', 'companyind_text']
writer_car = csv.writer(f_car)
writer_car.writerow(header)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
}
number = 1  # page counter, used only in the progress message
for URL in url:
    try:
        response = requests.get(URL, headers=headers)
        t = response.text
        print(response, ' ', 'Fetched page:', number)
        bs = etree.HTML(t)
        c = bs.xpath("//script[@type='text/javascript']")[2].text
        # json.loads replaces eval so that downloaded text is never executed
        op = json.loads(c[29:])['engine_search_result']
        for job in op:
            # one CSV row per job, with columns in the same order as header
            writer_car.writerow([job[key] for key in header])
    except Exception as error:
        print(error)
    number += 1
f_car.close()  # close the file
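The row-writing step can also be sketched with csv.DictWriter, which picks the columns by name and tolerates missing or extra keys. This is an illustrative variant, not the original method: `write_jobs` is a hypothetical helper, the sample record is made up, and treating attribute_text as a list is an assumption about the shape of the data.

```python
import csv
import io

FIELDS = ['company_name', 'workarea_text', 'companytype_text', 'job_name',
          'providesalary_text', 'jobwelf', 'attribute_text', 'companyind_text']


def write_jobs(jobs, out):
    """Write job dicts as CSV rows, keeping only FIELDS and blanking missing keys."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, restval='', extrasaction='ignore')
    writer.writeheader()
    for job in jobs:
        row = dict(job)
        # Assumed shape: attribute_text may be a list, so join it into one cell
        if isinstance(row.get('attribute_text'), list):
            row['attribute_text'] = '|'.join(row['attribute_text'])
        writer.writerow(row)


# Made-up record; an in-memory buffer stands in for the output file
sample = [{'company_name': 'Example Co', 'job_name': 'Python Developer',
           'attribute_text': ['Shanghai', '3 years'], 'unused_key': 1}]
buffer = io.StringIO()
write_jobs(sample, buffer)
print(buffer.getvalue())
```

For a real run, a file opened with a with statement and newline='' can be passed in place of the buffer, which also guarantees the file is closed if an error occurs.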
Result: the crawl collected 41,000 records.
Dataset link (51job dataset): https://download.csdn.net/download/qq_44936246/15405121