It's the end of the year, and I've been studying data analysis for almost four months. To land a data-analysis job as soon as possible, I plan to scrape all the relevant postings from BOSS直聘 and analyze them, which will also be a good test of what I've learned recently. I'm new to Python and my code is messy, so please bear with me. First, my apologies to the BOSS直聘 servers for the extra load; second, may a good job come my way soon! Since I'm in the Pearl River Delta, this time I plan to scrape postings in Guangzhou and Shenzhen related to data analysis, data mining, and data operations.
As a beginner I ran into countless pitfalls along the way; I've recorded each one to lay a foundation for future improvement.
import requests
import os
import re
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
First, set up the file path. This article was written in a Jupyter notebook; you can also skip the path setup and download the files from the notebook later if needed.
pd.set_option('display.max_colwidth',500)  # show long text columns in full
folder_name=r'E:\share tragedy\liepin'  # raw string avoids backslash-escape surprises on Windows
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
Data scraping
BOSS直聘 uses a very strict anti-crawling mechanism: the login page has a drag-to-verify captcha, and when you scrape too fast the site can demand a captcha at any moment, which often interrupts the crawl.
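One way to reduce the chance of hitting the captcha is to pause between requests and retry on failure. Below is a minimal polite-request helper; it is my own sketch, not part of the original script, and the wait intervals are arbitrary:

```python
import random
import time

import requests

def polite_get(url, headers=None, retries=3, min_wait=5.0, max_wait=15.0):
    """Fetch a URL with a random pause before every attempt and simple retries."""
    for _attempt in range(retries):
        time.sleep(random.uniform(min_wait, max_wait))  # pause before each try
        try:
            resp = requests.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            pass  # network error or timeout: fall through and retry
    return None  # the caller must handle a failed fetch
```

Each `requests.get` call in the scraping code below could be routed through a helper like this if captchas keep interrupting the run.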
url="https://www.zhipin.com/?sid=sem_pz_bdpc_dasou_title"
df_job=pd.DataFrame(columns=['job_num','job_title'])  # a single list of names; a nested list would create a MultiIndex
# A dict cannot hold duplicate 'http' keys -- only the last entry would survive.
# Keep the candidate proxies in a list instead (note: they are never actually
# passed to requests.get in this script).
proxy_pool=[
    'http://210.22.176.146:32153','http://211.152.33.24:48749','http://175.165.128.214:1133','http://36.110.14.66:50519']
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'cookie':'_uab_collina=154410324775207731129147; lastCity=101010100; t=vPAu9pVZjhJw4CEs; wt=vPAu9pVZjhJw4CEs; sid=sem_pz_bdpc_dasou_title; JSESSIONID=""; __c=1544875350; __g=sem_pz_bdpc_dasou_title; __l=l=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title&r=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dboss%25E7%259B%25B4%25E8%2581%2598%26rsv_spt%3D1%26rsv_iqid%3D0xa2eaa2390016f56e%26issp%3D1%26f%3D8%26rsv_bp%3D0%26rsv_idx%3D2%26ie%3Dutf-8%26tn%3D90066238_hao_pg%26rsv_enter%3D1%26rsv_t%3D161dvC%252FaDWi%252Fh%252B1%252F7Li2Ji8FrSldZ4PCYkrVrBo1BpjThzGjwIzfr1jHvtvsMEXU2CM3GntQ&g=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title; Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1544103248,1544688540,1544757018,1544875351; 356'}
response=requests.get(url,headers=headers)
soup=BeautifulSoup(response.content,'lxml')
job=soup.find_all('div','text')  # each <div class="text"> groups a set of category links
for i in range(len(job)):
    job_num_all=job[i].contents
    for j in range(1,len(job_num_all)):
        if j%2==1:  # odd positions are the <a> tags; even ones are whitespace nodes
            job_num=job_num_all[j]['href'][12:19]  # slice the category code out of the href
            job_title=job_num_all[j].text
            # DataFrame.append was removed in pandas 2.0; use pd.concat there
            df_job=df_job.append(pd.DataFrame({
                'job_num':[job_num],'job_title':[job_title]}),ignore_index=True)
df_job.to_csv('E:/share tragedy/liepin/df_job.csv')
data={
'city_num':[100010000,101010100,101020100,101280100,101280600,101210100,101030100,101110100,101190400,101200100,101230200,101250100,101270100,101180100,101040100],
'city_name':['全国','北京','上海','广州','深圳','杭州','天津','西安','苏州','武汉','厦门','长沙','成都','郑州','重庆']}
df_city=pd.DataFrame(data)
df_city.to_csv('E:/share tragedy/liepin/df_city.csv')
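The city table also makes it easy to translate the numeric codes back to readable names. A tiny self-contained sketch (rebuilding just two rows of the table above for illustration):

```python
import pandas as pd

# Two rows of the city table from above, enough to demonstrate the lookup.
data = {
    'city_num': [101280100, 101280600],
    'city_name': ['广州', '深圳'],
}
df_city = pd.DataFrame(data)

# Turn the two columns into a code -> name dict for quick lookups.
city_map = dict(zip(df_city['city_num'], df_city['city_name']))
print(city_map[101280600])  # prints 深圳
```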
Opening the generated df_job and df_city tables, I found three position codes related to data analysis and two related to data mining; data operations and business data analysis are also relevant. These categories clearly overlap a lot, so duplicates will need to be removed later. Also, BOSS直聘 shows 30 postings per page with at most 10 pages, and each posting has a unique 27-character code (something like 'As2434Hsg-hsdbhdir34iygjhl-') made up of letters, digits, and hyphens. Once you have the code, you have the posting. So the next step is to scrape the codes, construct each posting's URL from its code, and then crawl the postings one by one.
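The URL construction described above can be sketched as follows. The format check is my own assumption based on the description (letters, digits, and hyphens); the sample code string is the illustrative one from the text:

```python
import re

BASE = "https://www.zhipin.com/job_detail/"
CODE_RE = re.compile(r'^[A-Za-z0-9-]+$')  # assumed format: letters, digits, hyphens

def job_url(code):
    """Build a posting's detail URL from its unique code."""
    if not CODE_RE.match(code):
        raise ValueError('unexpected job-code format: %r' % code)
    return BASE + code

print(job_url('As2434Hsg-hsdbhdir34iygjhl-'))
```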
def get_job_sign(job_city,job_title,a=11):
    df_ma=pd.DataFrame(columns=['p_new'])
    if len(job_city)+len(job_title)>6:
        # Python 3 print(): too many city/title combinations risk a captcha interruption
        print('Too many requests -- the crawl may be interrupted by a captcha. Keep the number of cities plus titles small!')
    # Defined once, outside the loops. A dict cannot hold duplicate 'http' keys,
    # so the candidate proxies live in a list (they are not actually passed to
    # requests.get in this script).
    proxy_pool=[
        'http://210.22.176.146:32153','http://211.152.33.24:48749','http://175.165.128.214:1133','http://36.110.14.66:50519']
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
        'cookie':'_uab_collina=154410324775207731129147; lastCity=101010100; t=vPAu9pVZjhJw4CEs; wt=vPAu9pVZjhJw4CEs; sid=sem_pz_bdpc_dasou_title; JSESSIONID=""; __c=1544875350;__g=sem_pz_bdpc_dasou_title; __l=l=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title&r=https%3A%2F%2Fwww.baidu.com%2Fs%3Fwd%3Dboss%25E7%259B%25B4%25E8%2581%2598%26rsv_spt%3D1%26rsv_iqid%3D0xa2eaa2390016f56e%26issp%3D1%26f%3D8%26rsv_bp%3D0%26rsv_idx%3D2%26ie%3Dutf-8%26tn%3D90066238_hao_pg%26rsv_enter%3D1%26rsv_t%3D161dvC%252FaDWi%252Fh%252B1%252F7Li2Ji8FrSldZ4PCYkrVrBo1BpjThzGjwIzfr1jHvtvsMEXU2CM3GntQ&g=%2Fwww.zhipin.com%2F%3Fsid%3Dsem_pz_bdpc_dasou_title;Hm_lvt_194df3105ad7148dcf2b98a91b5e727a=1544103248,1544688540,1544757018,1544875351; 356'}
    for j in job_city:
        for k in job_title:
            for i in range(1,a):  # at most 10 result pages per search
                urls='https://www.zhipin.com/c'+str(j)+'-'+k+'/?page='+str(i)+'&ka=page-'+str(i)
                response=requests.get(urls,headers=headers)
                page=response.text  # re.findall needs str, not bytes, in Python 3
                data_jid=re.findall(r'data-jid="[a-zA-Z0-9.+-]+~"',page,re.M)
                for l in data_jid:  # findall returns a (possibly empty) list, never None
                    l=l[10:-2]  # strip the leading data-jid=" and the trailing ~"
                    df_ma=df_ma.append(pd.DataFrame({
                        'p_new':[l]}),ignore_index=True)
    return df_ma
Now scrape all postings for data analyst, data mining, data operations, and business data analysis in Guangzhou and Shenzhen. Here Guangzhou and Shenzhen are scraped separately and then merged; if you scrape everything in one run, use time.sleep to pause about 30 seconds between requests.
df=get_job_sign([101280600],['p100104','p100509','p100511','p120301','p130103','p260102','p140108'])  # 101280600 is Shenzhen
df=df.drop_duplicates('p_new',keep='first')  # the job categories overlap, so drop duplicate codes
df['number']=range(1,len(df)+1)
df['urls']="https://www.zhipin.com/job_detail/"  # base path; append each code to get the full URL
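Continuing the idea above with dummy codes: concatenating the base path with each scraped code yields one full detail-page URL per posting, which is what the per-posting crawl will use.

```python
import pandas as pd

# Dummy codes standing in for the scraped p_new values.
df_demo = pd.DataFrame({'p_new': ['abc123-XYZ', 'def456-UVW']})

# String concatenation is vectorized across the whole column.
df_demo['full_url'] = "https://www.zhipin.com/job_detail/" + df_demo['p_new']
print(df_demo['full_url'].tolist())
```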