Our school arranged a 10-day training course with a training organization for us juniors. I studied Python, and the final project was a web crawler, so I'm writing this blog post to document it. Since we will be job-hunting soon, I decided to crawl a recruitment site, and settled on Boss直聘.
The crawler works in three steps: 1. analyze the URL; 2. fetch the page content; 3. save the data locally.
1. Analyze the URL
Here is the address of one results page: https://www.zhipin.com/c101130100/d_203/?query=Java&page=1&ka=page-1
In this URL, the city I chose is Urumqi, the education level is bachelor's degree, the search keyword is Java, and it's the first page. The pattern is easy to see: each city has its own code on this site, namely the digits after the c in c101130100. After a little inspection I collected the codes of some popular cities and saved them in a dictionary so they can be looked up later:
citycode = {"北京":"101010100","上海":"101020100","天津":"101030100","重庆":"101040100",
"哈尔滨":"101050100","长春": "101060100","沈阳":"101070100","呼和浩特":"101080100","石家庄":"101090100",
"太原": "101100100","西安": "101110100","济南": "101120100","乌鲁木齐":"101130100","西宁":"101150100",
"兰州":"101160100","银川":"101170100","郑州":"101180100","南京":"101190100","武汉": "101200100",
"杭州":"101210100","合肥":"101220100","福州":"101230100","南昌":"101240100","长沙":"101250100",
"贵阳":"101260100","成都":"101270100","广州":"101280100","昆明":"101290100","南宁":"101300100",
"海口":"101310100","台湾":"101341100","拉萨":"101140100","香港":"101320300","澳门":"101330100"}
d_203 stands for a bachelor's degree, and query is followed by the job keyword. A nice touch is that the site accepts raw Chinese in the query string, so searching for "web前端" is simply "query=web前端". page, naturally, is the page number. With the URL pattern in hand, we can fetch the page content.
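To double-check the pattern, the URL can be assembled from its parts; a minimal sketch (build_url is a hypothetical helper, not part of the final class):

```python
from urllib import parse

def build_url(city_code, query, page):
    """Assemble a Boss直聘 search URL from a city code, keyword and page number."""
    # Percent-encode the keyword to be safe, even though the site
    # also accepts raw Chinese in the query string.
    qs = parse.urlencode({"query": query})
    return ("https://www.zhipin.com/c%s/d_203/?%s&page=%d&ka=page-%d"
            % (city_code, qs, page, page))

print(build_url("101130100", "Java", 1))
# → https://www.zhipin.com/c101130100/d_203/?query=Java&page=1&ka=page-1
```

The result matches the sample URL above exactly.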
2. Fetch the page content
First, initialize a few variables. For the User-Agent you can find one online, or better, collect several and pick one at random, which simulates a real visitor more convincingly:
def __init__(self):
    self.baseurl = "https://www.zhipin.com/c"
    self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1"}
    self.name = ""
    self.city = ""
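If you do collect several User-Agent strings, picking one at random per request is a one-liner; a small sketch (the UA strings below are just examples):

```python
import random

# A small pool of example User-Agent strings; collect more from your own browser.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
]

def random_headers():
    """Build a headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

You could then pass random_headers() wherever self.headers is used below.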
First fetch the whole page and inspect the result; uncomment the print statement below to see it:
# Fetch a page
def getPage(self, url):
    req = urllib.request.Request(url, headers=self.headers)
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")
    #print(html)
    self.parsePage(html)
The HTML block we want to extract data from looks like this:
<li>
<div class="job-primary">
<div class="info-primary">
<h3 class="name">
<a href="/job_detail/54739b6331ca63141HZ40t64FVY~.html" data-jid="54739b6331ca63141HZ40t64FVY~" data-itemid="30" data-lid="1C47rkiCuwE.search" data-jobid="32293554" data-index="29" ka="search_list_30" target="_blank">
<div class="job-title">软件系统架构师</div>
<span class="red">20-40K</span>
<div class="info-detail"></div>
</a>
</h3>
<p>珠海 <em class="vline"></em>5-10年<em class="vline"></em>本科</p>
</div>
<div class="info-company">
<div class="company-text">
<h3 class="name"><a href="/gongsi/b6db35d416a35e390nB93tW5.html" ka="search_list_company_30_custompage" target="_blank">格力电器</a></h3>
<p>其他行业<em class="vline"></em>已上市<em class="vline"></em>10000人以上</p>
</div>
</div>
<div class="info-publis">
<h3 class="name"><img src="https://img.bosszhipin.com/beijin/mcs/useravatar/20181128/2ee9b8f7399de940e03d9e769553ec644b1b62e500ea64347fc43909b5ba1421_s.jpg?x-oss-process=image/resize,w_40,limit_0" />蔡女士<em class="vline"></em>综合人事</h3>
<p></p>
</div>
<a href="javascript:;" data-url="/wapi/zpgeek/friend/add.json?jobId=54739b6331ca63141HZ40t64FVY~&lid=1C47rkiCuwE.search" redirect-url="/geek/new/index/chat?id=62e0f2f6d250b9eb03192tm5EFE~" class="btn btn-startchat">立即沟通
</a>
</div>
</li>
I won't go into how the regular expression is written; it just takes careful observation and is fairly straightforward. The parsing code is as follows:
# Parse a page
def parsePage(self, html):
    p = re.compile(r'<div class="job-primary">.*?<div class="job-title">(.*?)</div>.*?<span class="red">(.*?)</span>.*?<em class="vline"></em>(.*?)<em class="vline">.*?<h3 class="name">.*?target="_blank">(.*?)</a></h3>', re.S)
    rList = p.findall(html)
    if rList:
        #print(rList)
        self.writePage(rList)
Likewise, you can print the parsed data to check it; the result is a list containing multiple tuples.
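To see what the pattern actually captures, it can be run against a trimmed-down copy of the sample listing above:

```python
import re

# A condensed version of the sample <li> shown earlier.
sample = '''
<div class="job-primary">
  <div class="info-primary">
    <h3 class="name"><a target="_blank">
      <div class="job-title">软件系统架构师</div>
      <span class="red">20-40K</span>
    </a></h3>
    <p>珠海 <em class="vline"></em>5-10年<em class="vline"></em>本科</p>
  </div>
  <div class="company-text">
    <h3 class="name"><a href="/gongsi/x.html" target="_blank">格力电器</a></h3>
  </div>
</div>
'''

p = re.compile(r'<div class="job-primary">.*?<div class="job-title">(.*?)</div>'
               r'.*?<span class="red">(.*?)</span>'
               r'.*?<em class="vline"></em>(.*?)<em class="vline">'
               r'.*?<h3 class="name">.*?target="_blank">(.*?)</a></h3>', re.S)

print(p.findall(sample))
# → [('软件系统架构师', '20-40K', '5-10年', '格力电器')]
```

Each match yields a 4-tuple of job title, salary, experience and company name.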
3. Save the data locally
All that remains is saving the extracted data to a local file. Pick whatever format you like; I chose CSV. The code is as follows:
# Save the data
def writePage(self, List):
    f = open(self.city + "_" + self.name + ".csv", "a", newline="", encoding="utf-8")
    write = csv.writer(f)
    # The file is opened in append mode, so write the header
    # row only while the file is still empty
    if f.tell() == 0:
        write.writerow(["职位名称", "薪酬", "工作经验", "公司名称"])
    for rTuple in List:
        write.writerow([rTuple[0], rTuple[1], rTuple[2], rTuple[3]])
    f.close()
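As a sanity check on the CSV layout, here is a quick write-and-read roundtrip using the same header row (demo.csv is just a scratch file name):

```python
import csv
import os

rows = [("软件系统架构师", "20-40K", "5-10年", "格力电器")]
path = "demo.csv"

# Write a header plus one data row, the same layout writePage produces.
with open(path, "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["职位名称", "薪酬", "工作经验", "公司名称"])
    w.writerows(rows)

# Read it back to verify the contents.
with open(path, newline="", encoding="utf-8") as f:
    data = list(csv.reader(f))
print(data)

os.remove(path)
```

The file should open cleanly in Excel or any spreadsheet tool that understands UTF-8.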
The main routine looks like this:
# Main routine
def workOn(self):
    citycode = {"北京":"101010100","上海":"101020100","天津":"101030100","重庆":"101040100",
                "哈尔滨":"101050100","长春":"101060100","沈阳":"101070100","呼和浩特":"101080100","石家庄":"101090100",
                "太原":"101100100","西安":"101110100","济南":"101120100","乌鲁木齐":"101130100","西宁":"101150100",
                "兰州":"101160100","银川":"101170100","郑州":"101180100","南京":"101190100","武汉":"101200100",
                "杭州":"101210100","合肥":"101220100","福州":"101230100","南昌":"101240100","长沙":"101250100",
                "贵阳":"101260100","成都":"101270100","广州":"101280100","昆明":"101290100","南宁":"101300100",
                "海口":"101310100","台湾":"101341100","拉萨":"101140100","香港":"101320300","澳门":"101330100"}
    self.city = input("请输入您要搜索的城市:")
    self.name = input("请输入需要搜索的职位:")
    city = citycode[self.city]
    # Percent-encode the search keyword
    query = urllib.parse.urlencode({"query": self.name})
    print("爬取开始")
    for i in range(1, 4):
        url = self.baseurl + city + "/d_203/?" + query + "&page=%d&ka=page-%d" % (i, i)
        print("爬取第%d页" % i)
        self.getPage(url)
        time.sleep(0.5)
    print("爬取完毕")
I set the crawl to 3 pages; you can adjust that yourself.
The data obtained looks roughly like this:
This is my first crawler, so there is still much to improve. I hope we can all learn and exchange ideas together.