First, a caveat: this code does not run perfectly (a runnable version appears in the next post). The cookies issued by Lagou carry a time component and expire quickly, so after scraping a few entries the script can no longer extract data and raises IndexError: list index out of range, because the site stops returning job information and indexing the now-empty result lists fails. The complete code is as follows:
import requests
from lxml import etree
import re
import time

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/73.0.3683.103 Safari/537.36",
    'Referer': "https://www.lagou.com/jobs/list_python/p-city_213?px=default",
}

# Request the listing page first so the session picks up its cookies
urls = 'https://www.lagou.com/jobs/list_python/p-city_213?&cl=false&fromSearch=true&labelWords=&suginput='
s = requests.Session()
s.get(urls, headers=headers)
cookie = s.cookies
def requests_list_page():
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    data = {
        'first': 'false',
        'pn': 1,
        'kd': 'python'
    }
    # Fetch this many pages of job listings
    for x in range(1, 10):
        data['pn'] = x
        response = requests.post(url, data=data, headers=headers, cookies=cookie)
        # If the response body is JSON, .json() parses it into Python objects
        result = response.json()
        # Extract each positionId from the listing data
        positions = result['content']['positionResult']['result']
        for position in positions:
            position_id = position['positionId']
            position_url = 'https://www.lagou.com/jobs/%s.html?show=b160ab59c5c34f988f8c950e9430f969' % position_id
            parse_position_detail(position_url)
# Parse the details of a single job posting
def parse_position_detail(url):
    response = requests.get(url, headers=headers, cookies=cookie)
    text = response.text
    html = etree.HTML(text)
    # position_name = html.xpath("//h2[@class='name']/text()")[0]
    position_name = html.xpath("//div[@class='job-name']/@title")[0]
    # company_name = html.xpath("//em[@class='fl-cn']/text()")[0].strip()
    company_name = html.xpath("//img[@class='b2']/@alt")[0]
    job_request_spans = html.xpath("//dd[@class='job_request']//span")
    salary = job_request_spans[0].xpath('.//text()')[0].strip()
    # Use a regex to strip the whitespace and "/" separators around each field
    city = re.sub(r"[\s/]", "", job_request_spans[1].xpath('.//text()')[0].strip())
    work_years = re.sub(r"[\s/]", "", job_request_spans[2].xpath('.//text()')[0].strip())
    education = re.sub(r"[\s/]", "", job_request_spans[3].xpath('.//text()')[0].strip())
    # Use a regex to remove the \xa0 (&nbsp;) characters left over from the HTML
    desc = "".join(html.xpath("//dd[@class='job_bt']/div/p/text()"))
    desc = re.sub(r"[\xa0]", "", desc)
    position = {
        "name": position_name,
        "company_name": company_name,
        "salary": salary,
        "city": city,
        "work_years": work_years,
        "education": education,
        "desc": desc
    }
    print(position)
def main():
    requests_list_page()

if __name__ == '__main__':
    main()
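The IndexError described at the top comes from indexing xpath results with [0] after the cookie has expired, at which point Lagou no longer returns a real detail page and the result lists are empty. One defensive pattern is to take the first element only when the list is non-empty. This is a minimal sketch, not part of the original script, and `first_text` is a hypothetical helper name:

```python
def first_text(nodes, default=""):
    # xpath() always returns a list; it is empty when the response
    # is not the expected detail page (e.g. after the session cookie
    # expires). Falling back to a default avoids IndexError.
    return nodes[0].strip() if nodes else default

print(first_text(["15k-25k "]))        # -> "15k-25k"
print(first_text([], default="N/A"))   # -> "N/A"
```

With such a helper, an expired cookie yields rows of default values instead of crashing the whole crawl, which makes the expiry easier to detect and retry.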
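The three re.sub calls in parse_position_detail all apply the same pattern. As a standalone illustration of what that regex removes (this snippet is mine, not from the original post), `clean_field` strips every whitespace character and every "/" separator that Lagou places around the city, experience, and education fields:

```python
import re

def clean_field(raw):
    # \s matches any whitespace (including full-width page padding);
    # "/" is the separator Lagou prints between the fields.
    return re.sub(r"[\s/]", "", raw)

print(clean_field(" 广州 /"))       # -> "广州"
print(clean_field("经验3-5年 /"))   # -> "经验3-5年"
```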