Today I scraped Zhaopin (智联招聘) for the requirements of Python job postings. There is not much to explain about this code; it works like the previous one, and the additions are covered below. It can grab the job details inside every posting listed on the search-results page.
from bs4 import BeautifulSoup
import requests

url2 = 'http://sou.zhaopin.com/jobs/searchresult.ashx?kw=python&sm=0&p=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Host': 'sou.zhaopin.com',
}
web_data = requests.get(url2, headers=headers)
# print(web_data.text)
soup = BeautifulSoup(web_data.text, 'html.parser')
for i in soup.select('a'):
    # if i['href'][:24] == 'http://jobs.zhaopin.com/':
    #     print(i['href'])
    try:
        if i['href'].startswith('http://jobs.zhaopin.com/'):
            info = requests.get(i['href'])
            infosoup = BeautifulSoup(info.text, 'html.parser')
            for a in infosoup.select('.tab-inner-cont'):
                try:
                    print(a.text)
                except KeyError:
                    pass
    except KeyError:
        pass
Problems I ran into while writing this code.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Host': 'sou.zhaopin.com',
}
When scraping Zhaopin's search page I could fetch the source without trouble, but requests for the individual job pages always came back wrong. It turned out to be anti-scraping, so I had to add request headers that spoof a browser: that is the `headers` dict above.
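As a minimal sketch of the idea, you can confirm the headers actually get attached without sending any network traffic, by preparing the request locally (this example only builds the request; the URL is the same search endpoint as above):

```python
import requests

# Build the same anti-scraping headers as in the crawler above.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'Host': 'sou.zhaopin.com',
}

# prepare() assembles the request object locally; nothing is sent.
req = requests.Request('GET',
                       'http://sou.zhaopin.com/jobs/searchresult.ashx?kw=python&sm=0&p=1',
                       headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'][:11])  # Mozilla/5.0
```

Without the User-Agent, the server sees the default `python-requests/...` agent string, which is what triggers the block.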
try:
    if i['href'].startswith('http://jobs.zhaopin.com/'):
        info = requests.get(i['href'])
        infosoup = BeautifulSoup(info.text, 'html.parser')
        for a in infosoup.select('.tab-inner-cont'):
            try:
                print(a.text)
            except KeyError:
                pass
except KeyError:
    pass
This part kept raising a KeyError: some `<a>` tags have no `href` attribute, so `i['href']` fails. After searching for an answer, the fix was to add exception handling, the `try: ... except KeyError: pass` shown above.
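A small self-contained sketch of the same KeyError (the HTML snippet here is made up for illustration): an anchor tag without an `href` raises KeyError on subscript access, and the `except` clause skips it.

```python
from bs4 import BeautifulSoup

# A named anchor with no href, followed by a normal link.
soup = BeautifulSoup(
    '<a name="top">anchor</a><a href="http://jobs.zhaopin.com/1.htm">job</a>',
    'html.parser')

links = []
for a in soup.select('a'):
    try:
        links.append(a['href'])   # raises KeyError on the first tag
    except KeyError:
        pass                      # skip tags that have no href

print(links)  # ['http://jobs.zhaopin.com/1.htm']
```

An alternative is `a.get('href')`, which returns `None` instead of raising, so no try/except is needed.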
The code above selects by CSS class, so not everything it scrapes is the Python job requirements. See below.
from lxml import html
import requests

# page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
# tree = html.fromstring(page.content)
# # This will create a list of buyers:
# buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# # This will create a list of prices:
# prices = tree.xpath('//span[@class="item-price"]/text()')
# print('Buyers: ', buyers)
# print('Prices: ', prices)

page = requests.get("http://jobs.zhaopin.com/450575810250022.htm?ssidkey=y&ss=201&ff=03&sg=d382e8f6a66b4c9e800b41c98de68d55&so=1&uid=689899307")
tree = html.fromstring(page.content)
content = tree.xpath('//div[@class="tab-inner-cont"]/p/text()')
print(content)
Tested, this code extracts only the Python job-requirements section, because it uses an XPath query. But the XPath I copied from the browser kept returning empty results, so I took the sample code from the web (the commented-out part above) and adapted it into the XPath below.
//div[@class="tab-inner-cont"]/p/text()
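To see what this XPath does without hitting the live site, here is a sketch run against a made-up HTML fragment that mimics the job page's markup; the query picks only `<p>` text inside `div.tab-inner-cont` and ignores everything else:

```python
from lxml import html

# Hypothetical fragment imitating the job-description markup.
snippet = '''
<div class="tab-inner-cont">
  <p>Familiar with Python</p>
  <p>Experience with crawlers</p>
</div>
<div class="other"><p>ignored</p></div>
'''

tree = html.fromstring(snippet)
content = tree.xpath('//div[@class="tab-inner-cont"]/p/text()')
print(content)  # ['Familiar with Python', 'Experience with crawlers']
```

The `/p/text()` step only returns text nodes that are direct children of the `<p>` elements, which is why surrounding whitespace and the other div are filtered out.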