Scraping Lagou, Part 1: Using the requests and lxml Libraries

     First, a caveat: this code does not run reliably (a fully working version is in the next post). The cookie that Lagou issues embeds a time element, so it expires quickly; after a few records have been scraped, requests stop returning data and the script fails with IndexError: list index out of range, because indexing into the now-empty result list overruns it. A sketch of one way to soften that failure comes first, then the full code.
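One defensive pattern, sketched here as my own assumption rather than something the original post implements, is to re-request the listing page whenever the Ajax response stops carrying position data, so the session picks up a fresh cookie instead of crashing on an empty list:

import requests

LIST_URL = ('https://www.lagou.com/jobs/list_python/p-city_213'
            '?&cl=false&fromSearch=true&labelWords=&suginput=')

def refresh_cookie(session, headers):
    # Hit the listing page again so the server issues a new cookie;
    # returns the refreshed cookie jar for later Ajax requests.
    session.get(LIST_URL, headers=headers)
    return session.cookies

def extract_positions(result):
    # Guarded extraction: with an expired cookie the JSON payload loses
    # its position data, so return [] instead of raising.
    try:
        return result['content']['positionResult']['result'] or []
    except (KeyError, TypeError):
        return []

The main loop could then call refresh_cookie() and retry a page whenever extract_positions() comes back empty.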

import requests
from lxml import etree
import re

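# Browser-like request headers. The Referer points back at Lagou's own
# listing page, which the site appears to check before serving results.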
headers = {
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/73.0.3683.103 Safari/537.36",
        'Referer': "https://www.lagou.com/jobs/list_python/p-city_213?px=default",
    }
# Visit the listing page once so the session picks up the cookies the Ajax endpoint checks
list_url = 'https://www.lagou.com/jobs/list_python/p-city_213?&cl=false&fromSearch=true&labelWords=&suginput='
s = requests.Session()
s.get(list_url, headers=headers)
cookie = s.cookies



def requests_list_page():
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    data = {
        'first': 'false',
        'pn': 1,
        'kd': 'python'
    }
    # Fetch the first nine pages of listings ('pn' is the page number)
    for x in range(1, 10):
        data['pn'] = x
        response = requests.post(url, data=data, headers=headers, cookies=cookie)
        # The endpoint returns JSON, so response.json() parses it into a dict
        result = response.json()
        # Collect the positionId of every job on this listing page
        positions = result['content']['positionResult']['result']
        for position in positions:
            position_id = position['positionId']
            position_url = 'https://www.lagou.com/jobs/%s.html?show=b160ab59c5c34f988f8c950e9430f969' % position_id
            parse_position_detail(position_url)


# Parse the detail page of a single job posting
def parse_position_detail(url):
    response = requests.get(url, headers=headers, cookies=cookie)
    text = response.text
    html = etree.HTML(text)

    # Older selector, kept for reference:
    # position_name = html.xpath("//h2[@class='name']/text()")[0]
    # xpath() returns a list, so take the first match
    position_name = html.xpath("//div[@class='job-name']/@title")[0]

    # company_name = html.xpath("//em[@class='fl-cn']/text()")[0].strip()
    company_name = html.xpath("//img[@class='b2']/@alt")[0]

    job_request_spans = html.xpath("//dd[@class='job_request']//span")
    salary = job_request_spans[0].xpath('.//text()')[0].strip()
    # Use a regex to strip the slashes (and stray whitespace) around each value
    city = re.sub(r"[\s/]", "", job_request_spans[1].xpath('.//text()')[0].strip())
    work_years = re.sub(r"[\s/]", "", job_request_spans[2].xpath('.//text()')[0].strip())
    education = re.sub(r"[\s/]", "", job_request_spans[3].xpath('.//text()')[0].strip())
    # Use a regex to strip the \xa0 characters left by &nbsp; entities in the HTML
    desc = "".join(html.xpath("//dd[@class='job_bt']/div/p/text()"))
    desc = re.sub(r"[\xa0]", "", desc)
    position = {
        "name": position_name,
        "company_name": company_name,
        "salary": salary,
        "city": city,
        "work_years": work_years,
        "education": education,
        "desc": desc
    }
    print(position)


def main():
    requests_list_page()


if __name__ == '__main__':
    main()
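The script only prints each position dict. To keep the results, one small extension (my addition, not part of the original post) is to collect the dicts into a list and write them to CSV, reusing the keys that parse_position_detail() builds:

import csv

FIELDNAMES = ['name', 'company_name', 'salary', 'city',
              'work_years', 'education', 'desc']

def save_positions(positions, path='lagou_python.csv'):
    # 'positions' is a list of the dicts built in parse_position_detail();
    # utf-8-sig keeps the Chinese text readable when the file is opened in Excel.
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(positions)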

 
