Introduction
Hello! This is a hands-on exercise from my study of web scraping. If you want to learn how to scrape with Python, read through this article carefully to get a feel for Python scraping.
Libraries Used
The libraries used are requests, re, csv, and time.
Code
import requests, re, csv, time

# regex 1: pull the securityId / lid pair of every job from the list response
re1 = re.compile(r'"securityId":"(?P<sid>.*?)".*?"lid":"(?P<lid>.*?)"', re.S)
# regex 2: pull the job details from the job-card response
re2 = re.compile(r'"jobName":"(?P<jobname>.*?)","postDescription":"(?P<post>.*?)","encryptJobId.*?salaryDesc":"(?P<salary>.*?)",.*?jobLabels":(?P<joblabels>.*?),"address":"(?P<address>.*?)",.*?bossTitle":"(?P<bosstitle>.*?)",.*?brandName":"(?P<brandname>.*?)"', re.S)

headers = {
    "cookie": "<the first cookie>",  # see the Analysis section below
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
    "referer": "https://www.zhipin.com/web/geek/job?query=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&city=101200100"
}
headers1 = {
    "cookie": "<the second cookie>",  # see the Analysis section below
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}

# newline="" keeps csv.writer from inserting blank rows on Windows
f = open("boss1.csv", mode="w", encoding="utf-8", newline="")
csvwriter = csv.writer(f)
# optional header row:
# csvwriter.writerow(["jobname", "post", "salary", "joblabels", "address", "bosstitle", "brandname"])
for it in range(1, 6):  # the site's page parameter starts at 1
    url = f"https://www.zhipin.com/wapi/zpgeek/search/joblist.json?scene=1&query=%E4%BA%A7%E5%93%81%E7%BB%8F%E7%90%86&city=100010000&experience=&payType=&partTime=&degree=&industry=100001&scale=&stage=&position=&jobType=&salary=&multiBusinessDistrict=&multiSubway=&page={it}&pageSize=30"
    request1 = requests.get(url, headers=headers)
    re11 = re1.finditer(request1.text)
    time.sleep(1)  # throttle requests so the site does not block us
    for i in re11:
        sid = i.group("sid")
        lid = i.group("lid")
        url2 = f"https://www.zhipin.com/wapi/zpgeek/job/card.json?securityId={sid}&lid={lid}&sessionId="
        request2 = requests.get(url2, headers=headers1)
        re22 = re2.finditer(request2.text)
        for a in re22:
            jobname = a.group("jobname")
            post = a.group("post")
            salary = a.group("salary")
            joblabels = a.group("joblabels")
            address = a.group("address")
            bosstitle = a.group("bosstitle")
            brandname = a.group("brandname")
            csvwriter.writerow([jobname, post, salary, joblabels, address, bosstitle, brandname])
f.close()
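Since both endpoints return JSON, an alternative to the regular expressions above is to parse the responses with Python's json module. This is only a sketch: the "zpData"/"jobList" nesting is an assumption inferred from the field names the regexes capture, so confirm the real structure in the browser's Network panel before relying on it.

```python
import json

# Assumed response shape -- "zpData" / "jobList" are hypothetical keys;
# only the inner field names (securityId, lid, jobName, salaryDesc) come
# from the regexes in the script above.
sample = ('{"zpData": {"jobList": ['
          '{"securityId": "abc", "lid": "1", '
          '"jobName": "Product Manager", "salaryDesc": "15-25K"}]}}')

data = json.loads(sample)
rows = [(job["securityId"], job["lid"], job["jobName"], job["salaryDesc"])
        for job in data["zpData"]["jobList"]]
print(rows)
```

Parsing JSON directly is less brittle than regex matching: reordered keys or escaped quotes inside a job description will silently break the regexes but not json.loads.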
Analysis
The cookies must be captured after logging in, using the browser's built-in developer tools (press F12 to open them).
In the Network panel, filter the requests for joblist.
The first cookie is:
Then filter for card.
If no card request shows up, do not refresh the page; click into any job posting and it should appear after about 10 seconds.
The second cookie is:
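To avoid pasting the cookie strings directly into the script, one option (not part of the original code) is to read them from environment variables. The variable names below are hypothetical; pick your own.

```python
import os

# Hypothetical variable names; setdefault only supplies placeholders
# so the sketch runs -- in practice you export the real cookies instead.
os.environ.setdefault("BOSS_COOKIE_1", "placeholder-cookie-1")
os.environ.setdefault("BOSS_COOKIE_2", "placeholder-cookie-2")

headers = {
    "cookie": os.environ["BOSS_COOKIE_1"],
    "user-agent": "Mozilla/5.0",
}
headers1 = {
    "cookie": os.environ["BOSS_COOKIE_2"],
    "user-agent": "Mozilla/5.0",
}
print(headers["cookie"], headers1["cookie"])
```

This keeps the session cookies out of the source file, which matters if the script is ever shared or committed to version control.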
Output File
The output is a CSV file, which can be opened in Excel and saved as an .xlsx file if needed.
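The file can also be read back with the same csv module. A minimal sketch, using an in-memory sample row in the same seven-column layout rather than the real boss1.csv:

```python
import csv
import io

# Simulate a file like boss1.csv containing one scraped row
sample = io.StringIO("PM,desc,15-25K,labels,Wuhan,HR,ExampleCo\r\n")
rows = list(csv.reader(sample))
print(rows)
```

To read the actual output, replace the StringIO object with open("boss1.csv", encoding="utf-8", newline="").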