Development Platform and Environment
Windows 10 Education
Python 3.7
IntelliJ IDEA 2018.2.1 / PyCharm
Google Chrome
Data cleaning and analysis modules: pandas, numpy
Visualization module: pyecharts
A follow-up post will cover a Flask-based visualization project (Python, MySQL, ECharts, JavaScript).
Part 1: Data Collection
Job postings are collected with a web crawler. The fields collected are:
company name, position, position highlights, ID, company size, city, education, work experience, company type, company website, job posting URL, number, city ID
The project was developed between 2019-10-10 and 2019-10-16; since the site changes over time, the crawler code may no longer work.
import requests
from lxml import etree
import re
import json
import csv
import time
header = {
'Accept': 'application/json, text/plain, */*',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3732.400 QQBrowser/10.5.3819.400',
"cookie":"x-zp-client-id=e2f8492a-39c6-44f1-f181-3408dfc4c651; urlfrom2=121114583; adfcid2=www.baidu.com; adfbid2=0; sts_deviceid=1"
"6d66515ef32a9-00a0ecf38d6864-34564a75-2073600-16d66515ef5900; sou_experiment=capi; sensorsdata2015jssdkcross=%7B%22distin"
"ct_id%22%3A%2216d66515f058fe-0a7bf2d03b44ab-34564a75-2073600-16d66515f062a6%22%2C%22%24device_id%22%3A%2216d66515f058fe-0a7"
"bf2d03b44ab-34564a75-2073600-16d66515f062a6%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E"
"%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search"
"_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D; acw_tc=276082061571053"
"5124757507e7f855599045d70c3a3baead7cb13244f9ce1; dywea=95841923.3929760379540693000.1569379672.1569379672.1571054618.2; dywez"
"=95841923.1571054618.2.2.dywecsr=jobs.zhaopin.com|dyweccn=(referral)|dywecmd=referral|dywectr=undefined|dywecct=/cc224037312"
"j00240379404.htm; Hm_lvt_38ba284938d5eddca645bb5e02a02006=1569379672,1571054618; __utma=269921210.106900723.1569379672.156937"
"9672.1571054618.2; __utmz=269921210.1571054618.2.2.utmcsr=jobs.zhaopin.com|utmccn=(referral)|utmcmd=referral|utmcct=/CC2240373"
"12J00240379404.htm; LastCity%5Fid=749; ZP_OLD_FLAG=false; POSSPORTLOGIN=0; CANCELALL=0; LastCity=%E9%95%BF%E6%A0%AA%E6%BD%AD; "
"sts_sg=1; sts_chnlsid=Unknown; zp_src_url=http%3A%2F%2Fjobs.zhaopin.com%2FCC879864350J00334868004.htm; jobRiskWarning=true; acw"
"_sc__v2=5da57cb5b3223856c3fb768be55c39bec99b9b33; ZL_REPORT_GLOBAL={%22jobs%22:{%22recommandActionidShare%22:%22f4ec2b1a-bbe2-41"
"ba-b0fc-14c426ffd63b-job%22%2C%22funczoneShare%22:%22dtl_best_for_you%22}}; sts_sid=16dce6f31656d-0cee0282bd8b1b-34564a75-2073600-16dce6f31666cf; sts_evtseq=2"
}
def get_context(number):
    # detail API for similar positions, plus the posting's HTML page
    url = "https://fe-api.zhaopin.com/c/i/similar-positions?number=" + number
    urll = 'https://jobs.zhaopin.com/' + number + '.htm'
    html = requests.get(url=url, headers=header)
    # print(html.json()['data']['data']['list'])
    companyName, companyNumber, companySize, salary60, workCity, education, \
        workingExp, property, companyUrl, positionURL, name, welfareLabel, \
        number, cityId, cityDistrict, applyType, score, tag = ("",) * 18
    try:
        for i in html.json()['data']['data']['list']:
            companyName = i['companyName']      # company name
            companyNumber = i['companyNumber']  # company ID
            companySize = i['companySize']      # company size
            salary60 = i['salary60']            # salary
            workCity = i['workCity']            # city
            education = i['education']          # education requirement
            workingExp = i['workingExp']        # work experience
            property = i['property']            # company ownership (note: shadows the built-in `property`)
            companyUrl = i['companyUrl']        # company website
            positionURL = i['positionURL']      # job posting URL
            name = i['name']                    # position name
            # welfareLabel = i['welfareLabel']  # benefits
            number = i['number']                # posting number
            cityId = i['cityId']                # city ID
            cityDistrict = i['cityDistrict']    # city district
            applyType = i['applyType']          # company type
            score = i['score']                  # company score
            tag = []                            # benefit tags
            for j in i['welfareLabel']:
                tag.append(j['value'])
            tag = "/".join(tag)
    except:
        pass
    html = requests.get(url=urll, headers=header)
    html_xpath = etree.HTML(html.text)
    # miaosu = re.findall('<div class="describtion__detail-content">(.*?)</div></div><div class="job-address clearfix">', html.text)
    # extract all text of the description node, including its child tags
    miaosu = html_xpath.xpath('string(//*[@class="describtion__detail-content"])')
    print("----------------------" + miaosu)
    miaosu = ''.join(miaosu)
    # time.sleep(1)
    fp = open('智联招聘_大数据.csv', 'a', newline='')
    write = csv.writer(fp)
    row = (companyName, name, tag, companyNumber, companySize, salary60, workCity,
           education, workingExp, property, companyUrl, positionURL, name, number,
           cityId, cityDistrict, applyType, score, miaosu)
    write.writerow(row)
    print('Writing job data for city ' + workCity + ' ---------- ' + name)
    fp.close()
# crawl one search-keyword / city combination
def get_url(city):
    key = '大数据'  # search keyword
    url = 'https://fe-api.zhaopin.com/c/i/sou?pageSize=4000&cityId=' + city + '&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1' \
          '&kw=' + key + '&kt=3&lastUrlQuery=%7B%22pageSize%22:%2260%22,%22jl%22:%22489%22,%22kw%22:%22%E5%A4%A7%E6%95%B0%E6%8D%AE%22,%22kt%22:%223%22%7D'
    html = requests.get(url=url, headers=header)
    try:
        for i in html.json()['data']['results']:
            print("-----------" + i['number'])
            get_context(i['number'])  # start crawling this posting's details
    except:
        pass
# fetch the province list embedded in a search page's __INITIAL_STATE__ JSON, then crawl every province
url = 'https://sou.zhaopin.com/?jl=852&sf=0&st=0&kw=%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B8%88&kt=3'
html = requests.get(url=url, headers=header).text
data = re.findall('<script>__INITIAL_STATE__=(.*?)</script>', html)
datas = json.loads(data[0])
try:
    for i in datas["basic"]["dict"]["location"]["province"]:
        get_url(i["code"])
except:
    pass
The collected data is saved in CSV format.
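One thing to watch: get_context() appends data rows but never writes a header, while the pandas code below selects columns by name ('ID', '公司类型'). A minimal sketch of writing a header row once before the crawl; the 19 Chinese column names here are assumptions chosen to match the row tuple in get_context():

import csv
import os

CSV_FILE = '智联招聘_大数据.csv'
# assumed column names, one per element of the row tuple written in get_context()
HEADER = ['公司名称', '职位', '职位亮点', 'ID', '规模', '薪水', '城市', '学历', '工作经验',
          '公司性质', '公司网站', '求职网址', '职位名称', '编号', '城市ID', '城市区域',
          '公司类型', '分数', '描述']

if not os.path.exists(CSV_FILE):  # only write the header for a fresh file
    with open(CSV_FILE, 'w', newline='') as fp:
        csv.writer(fp).writerow(HEADER)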
For a simple example of reading MySQL data from Python and visualizing it with matplotlib (the more involved option), see the project linked in the original post.
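As a related sketch (not the code behind that link), one way to push the crawled CSV into MySQL is pandas plus SQLAlchemy; the connection string, the database name zhaopin, and the table name jobs are all assumptions:

import pandas as pd
from sqlalchemy import create_engine

# hypothetical credentials and database; replace with your own
engine = create_engine('mysql+pymysql://root:password@localhost:3306/zhaopin?charset=utf8mb4')

df = pd.read_csv('智联招聘_大数据.csv', engine='python').dropna()
# load into an assumed table named `jobs`, replacing it on each run
df.to_sql('jobs', engine, if_exists='replace', index=False)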
Overview: big data positions are spread across cities nationwide, with intricate links between positions and companies. Company types vary, each company has its own culture, and each places different constraints on applicants. Salaries also vary with experience, so finding a suitable position means weighing the requirements a recruiter lists (experience, education, and so on), while applicants in turn look at a company's ownership and type. Below we analyze the companies that posted these openings.
Basic Analysis of Big Data Positions
1. Count the number of postings per company type
The dataset is small, so to save development time pandas is used for analysis and pyecharts for visualization. Alternatively, the data could be loaded into MySQL and visualized with ECharts; for the back end I usually use Flask or Node.js (either works). The next project will be a Flask-based visualization project.
# count postings per company type
import pandas as pd
from pyecharts import Bar, Pie

# # show all columns
# pd.set_option('display.max_columns', None)
# # show all rows
# pd.set_option('display.max_rows', None)
# # widen value display to 100 characters (default is 50)
# pd.set_option('max_colwidth', 100)

# use the python engine, drop rows with any empty field,
# and de-duplicate on the ID column, keeping the first occurrence
data = pd.read_csv('../File/智联招聘_数据分析师.csv', engine='python').dropna().drop_duplicates(subset='ID', keep='first')
# group by company type, count, and sort in descending order
company = data[['ID', '公司类型']].groupby(by='公司类型', as_index=False).count().sort_values(by='ID', ascending=False)
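With the counts in hand, a minimal pyecharts 0.5.x-style sketch renders them with the Bar class imported above (the chart title and output filename are assumptions):

# render the per-type counts as a bar chart
types = list(company['公司类型'])
counts = list(company['ID'])

bar = Bar('大数据岗位 - 公司类型数量')  # chart title (assumed wording)
bar.add('postings', types, counts)      # series name, x-axis labels, y-axis values
bar.render('company_type_bar.html')     # writes a standalone HTML file (assumed filename)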