First, a caveat: this code does not run perfectly (a runnable version appears in the next post). The cookies issued by Lagou carry a time component and expire quickly, so after scraping a few entries the script can no longer extract data and raises IndexError: list index out of range, because the site stops returning job information and indexing the now-empty result lists fails. The complete code is as follows:
import requests
from lxml import etree
import re
import time

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/73.0.3683.103 Safari/537.36",
    'Referer': "https://www.lagou.com/jobs/list_python/p-city_213?px=default",
}

# Request the listing page first so the session picks up its cookies
urls = 'https://www.lagou.com/jobs/list_python/p-city_213?&cl=false&fromSearch=true&labelWords=&suginput='
s = requests.Session()
s.get(urls, headers=headers)
cookie = s.cookies
def requests_list_page():
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%B9%BF%E5%B7%9E&needAddtionalResult=false'
    data = {
        'first': 'false',
        'pn': 1,
        'kd': 'python'
    }
    # Fetch this many pages of job listings
    for x in range(1, 10):
        data['pn'] = x
        response = requests.post(url, data=data, headers=headers, cookies=cookie)
        # If the response body is JSON, .json() parses it into Python objects
        result = response.json()
        # Extract each positionId from the listing data
        positions = result['content']['positionResult']['result']
        for position in positions:
            position_id = position['positionId']
            position_url = 'https://www.lagou.com/jobs/%s.html?show=b160ab59c5c34f988f8c950e9430f969' % position_id
            parse_position_detail(position_url)
# Parse the details of a single job posting
def parse_position_detail(url):
    response = requests.get(url, headers=headers, cookies=cookie)
    text = response.text
    html = etree.HTML(text)
    # position_name = html.xpath("//h2[@class='name']/text()")[0]
    position_name = html.xpath("//div[@class='job-name']/@title")[0]
    # company_name = html.xpath("//em[@class='fl-cn']/text()")[0].strip()
    company_name = html.xpath("//img[@class='b2']/@alt")[0]
    job_request_spans = html.xpath("//dd[@class='job_request']//span")
    salary = job_request_spans[0].xpath('.//text()')[0].strip()
    # Use a regex to strip the whitespace and "/" separators around each field
    city = re.sub(r"[\s/]", "", job_request_spans[1].xpath('.//text()')[0].strip())
    work_years = re.sub(r"[\s/]", "", job_request_spans[2].xpath('.//text()')[0].strip())
    education = re.sub(r"[\s/]", "", job_request_spans[3].xpath('.//text()')[0].strip())
    # Use a regex to remove the \xa0 (&nbsp;) characters left over from the HTML
    desc = "".join(html.xpath("//dd[@class='job_bt']/div/p/text()"))
    desc = re.sub(r"[\xa0]", "", desc)
    position = {
        "name": position_name,
        "company_name": company_name,
        "salary": salary,
        "city": city,
        "work_years": work_years,
        "education": education,
        "desc": desc
    }
    print(position)
def main():
    requests_list_page()

if __name__ == '__main__':
    main()
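The IndexError described at the top comes from indexing xpath results with [0] after the cookie has expired, at which point Lagou no longer returns a real detail page and the result lists are empty. One defensive pattern is to take the first element only when the list is non-empty. This is a minimal sketch, not part of the original script, and `first_text` is a hypothetical helper name:

```python
def first_text(nodes, default=""):
    # xpath() always returns a list; it is empty when the response
    # is not the expected detail page (e.g. after the session cookie
    # expires). Falling back to a default avoids IndexError.
    return nodes[0].strip() if nodes else default

print(first_text(["15k-25k "]))        # -> "15k-25k"
print(first_text([], default="N/A"))   # -> "N/A"
```

With such a helper, an expired cookie yields rows of default values instead of crashing the whole crawl, which makes the expiry easier to detect and retry.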
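The three re.sub calls in parse_position_detail all apply the same pattern. As a standalone illustration of what that regex removes (this snippet is mine, not from the original post), `clean_field` strips every whitespace character and every "/" separator that Lagou places around the city, experience, and education fields:

```python
import re

def clean_field(raw):
    # \s matches any whitespace (including full-width page padding);
    # "/" is the separator Lagou prints between the fields.
    return re.sub(r"[\s/]", "", raw)

print(clean_field(" 广州 /"))       # -> "广州"
print(clean_field("经验3-5年 /"))   # -> "经验3-5年"
```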