This article uses a proxy pool to crawl Python job listings from Lagou (拉勾网). There are two key points. First, build a proxy pool to avoid getting an IP banned for making frequent requests from the same address. Second, find the real source of the job data: Lagou renders its listings dynamically, so the data is delivered as JSON rather than embedded in the page HTML. You need to find the URL that returns the JSON, fetch it, and then parse out the fields you need.
1. Building the proxy pool. You can download a ready-made proxy pool from GitHub; the one I used is https://github.com/Germey/CookiesPool. You need to install Redis, configure and start its service, and install the Python libraries flask, aiohttp, requests, redis-py, pyquery, and so on.
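Assuming the pool exposes its default Flask endpoint at http://127.0.0.1:5000/get and returns a single `ip:port` string per request (this matches the crawler code below, but the exact route depends on how you configured the pool), fetching a proxy and wrapping it into the `proxies` mapping that requests expects can be sketched as:

```python
import requests

# Assumed default endpoint of the locally running proxy pool
PROXY_POOL_URL = 'http://127.0.0.1:5000/get'


def fetch_proxy(pool_url=PROXY_POOL_URL):
    """Ask the proxy pool for one proxy; returns 'ip:port' or None."""
    try:
        resp = requests.get(pool_url, timeout=5)
        if resp.status_code == 200:
            return resp.text.strip()
    except requests.exceptions.ConnectionError:
        pass
    return None


def build_proxies(proxy):
    """Turn an 'ip:port' string into the proxies dict requests expects."""
    if not proxy:
        return None
    return {'http': 'http://' + proxy, 'https': 'https://' + proxy}
```

Keeping the pool behind a small helper like this makes it easy to swap in a different pool implementation later.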
2. Locating the job data:
Open Lagou, search for "python", press F12 to open the developer tools, click the Network tab, and check XHR to see the dynamically loaded requests. Inspect the Preview pane of each request until you find the one that contains the job listings.
The request URL and the form data can then be found under the Headers tab.
The pn field in the form data controls pagination; incrementing it loads the next page of results.
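In the Preview pane the listings are nested under content → positionResult → result. As a sketch, here is how that structure can be walked with a hand-made sample that mimics the real response (the field names are the ones visible in Preview; the sample values are made up):

```python
# Minimal hand-made sample mimicking the structure of Lagou's JSON response
sample = {
    'content': {
        'positionResult': {
            'result': [
                {'positionName': 'Python工程师', 'salary': '10k-20k',
                 'education': '本科', 'workYear': '3-5年',
                 'companyFullName': '某科技有限公司'},
            ]
        }
    }
}


def extract_positions(json_data):
    """Walk content -> positionResult -> result and keep the wanted fields."""
    result = (json_data.get('content', {})
                       .get('positionResult', {})
                       .get('result', []))
    return [{'positionName': p.get('positionName'),
             'salary': p.get('salary'),
             'education': p.get('education'),
             'workYear': p.get('workYear'),
             'companyFullName': p.get('companyFullName')}
            for p in result]


print(extract_positions(sample))
```

Using `.get(..., {})` at each level keeps the parser from raising when a blocked request returns a payload without the expected keys.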
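Since pn drives the pagination, each page is fetched by sending the same form data with an incremented pn. A small helper can build it per page (first, pn, and kd are the fields shown under Headers; treating page 1 as 'first': 'true' is how the site's own requests look, but the exact values are a sketch):

```python
def form_data_for_page(page, keyword='python'):
    """Build the Ajax form data for one page of search results."""
    return {
        'first': 'true' if page == 1 else 'false',  # the site flags only page 1
        'pn': page,        # page number: incrementing it loads the next page
        'kd': keyword,     # the search keyword
    }


# Form data for the first three pages
pages = [form_data_for_page(p) for p in range(1, 4)]
```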
The code is as follows:
import requests
import time

# Local proxy pool API; returns one proxy (ip:port) per request
PROXY_POOL_URL = 'http://127.0.0.1:5000/get'

# Ajax endpoint that returns the job listings as JSON (city=武汉, URL-encoded)
url = "https://www.lagou.com/jobs/positionAjax.json?city=%E6%AD%A6%E6%B1%89&needAddtionalResult=false"

headers = {
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?city=%E5%85%A8%E5%9B%BD&cl=false&fromSearch=true&labelWords=sug&suginput=python',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': None,
    'X-Requested-With': 'XMLHttpRequest'
}

proxy = None
MAX_COUNT = 5


def get_proxy():
    """Fetch a proxy address (ip:port) from the local proxy pool."""
    try:
        response = requests.get(PROXY_POOL_URL)
        if response.status_code == 200:
            return response.text
        return None
    except requests.exceptions.ConnectionError:
        return None


def get_html(url, page, count=1):
    print('Crawling', url)
    print('Trying Count', count)
    global proxy
    if count >= MAX_COUNT:
        print('Tried Too Many Counts')
        return None
    # pn selects the page of results; kd is the search keyword
    data = {'first': 'false', 'pn': page, 'kd': 'python'}
    try:
        if proxy:
            proxies = {'http': 'http://' + proxy}
            response = requests.post(url, data=data, allow_redirects=False,
                                     headers=headers, proxies=proxies)
        else:
            response = requests.post(url, data=data, allow_redirects=False,
                                     headers=headers)
        json_data = response.json()
        if 'content' in json_data:
            positions = json_data.get('content').get('positionResult').get('result')
            for position in positions:
                item = {
                    'education': position.get('education'),
                    'workYear': position.get('workYear'),
                    'salary': position.get('salary'),
                    'positionName': position.get('positionName'),
                    'companyFullName': position.get('companyFullName')
                }
                print(item)
        else:
            # Request was blocked: switch to a proxy and retry
            print('Requested too many times, need a proxy')
            proxy = get_proxy()
            if proxy:
                print('Using Proxy', proxy)
                return get_html(url, page, count + 1)
            else:
                print('Get Proxy Failed')
                return None
    except requests.exceptions.ConnectionError as e:
        print('Error Occurred', e.args)
        proxy = get_proxy()
        count += 1
        return get_html(url, page, count)


def main():
    for page in range(1, 20):
        print('Page {}'.format(page))
        get_html(url, page)
        time.sleep(3)


if __name__ == '__main__':
    main()