Building a Job-Listing Crawler with Python (Part 2): The Lagou Crawler

小邪kkk

Category: Python Learning. Tags: python, crawler

Article link: https://blog.csdn.net/qq_39482996/article/details/80874464

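To build a web crawler for Lagou (拉勾网), first analyze the site's structure, then design the functions needed to implement it.

Inspecting the search results page shows that the job listings are not present in the page's HTML source, so we turn to its AJAX requests.

The analysis shows that Lagou returns the job listings to the front end via AJAX and then renders them into the page with JavaScript.

Moreover, the AJAX request for job listings is a POST request, so we need to set the corresponding data payload, which is sent as the body of the HTTP request.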

    data = {
        'first': 'true',
        'pn': i,          # page number, passed in as an argument
        'kd': 'python'    # search keyword
    }
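To make the role of the data dict concrete, here is a minimal sketch of what such a dict looks like once it is form-encoded into a POST body. requests performs an equivalent encoding internally when given data=; the standard-library urlencode is used here only to make the result visible. The helper name build_form_data is made up for illustration and is not part of the original crawler.

```python
from urllib.parse import urlencode


def build_form_data(page, keyword='python'):
    """Return the form fields for one results page (hypothetical helper)."""
    return {
        'first': 'true',
        'pn': page,      # page number
        'kd': keyword,   # search keyword
    }


# Encode the fields the way a form-encoded POST body is written on the wire.
body = urlencode(build_form_data(2))
print(body)  # first=true&pn=2&kd=python
```

Wrapping the dict in a function keeps the page number and keyword as parameters, so the same code can serve any search term.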

Next, set the request headers to match what Lagou's backend expects:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'Upgrade-Insecure-Requests': '1',
    'X-Anit-Forge-Code': '0',
    'X-Requested-With': 'XMLHttpRequest'
}

With url, headers, and data in place, we can send the AJAX request and process the response. The main function designed for this is:

def main(i):
    keyword = 'python'
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    # Form fields sent in the POST body; i is the page number.
    data = {
        'first': 'true',
        'pn': i,
        'kd': keyword
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36',
        'Host': 'www.lagou.com',
        'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
        'Upgrade-Insecure-Requests': '1',
        'X-Anit-Forge-Code': '0',
        'X-Requested-With': 'XMLHttpRequest'
    }
    res = requests.post(url, data=data, headers=headers)
    json_result = res.json()
    processing_result(json_result)
    time.sleep(5)  # pause between requests to avoid hitting the site too fast
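The main function assumes every request succeeds and that the reply always contains job data. As a hedged sketch of a more defensive variant, the helper below checks the HTTP status and the presence of the 'content' key before returning. The names fetch_page and LagouResponseError are hypothetical, and treating a missing 'content' key as a failed or blocked request is an assumption, not documented Lagou behavior. The session argument can be a requests.Session() or any object with a requests-style post() method.

```python
class LagouResponseError(Exception):
    """Raised when the reply does not contain job data (hypothetical name)."""


def fetch_page(session, url, data, headers, timeout=10):
    """POST one search request and return the decoded JSON.

    `session` is anything with a requests-style .post() method,
    for example requests.Session().
    """
    res = session.post(url, data=data, headers=headers, timeout=timeout)
    res.raise_for_status()        # surface HTTP 4xx/5xx errors as exceptions
    payload = res.json()
    if 'content' not in payload:  # assumed sign of a failed or blocked request
        raise LagouResponseError(payload)
    return payload
```

With requests installed, a call might look like fetch_page(requests.Session(), url, data, headers). Passing the session in also makes the function easy to exercise with a stub object instead of the live site.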

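Here, main calls the processing_result() function, which extracts the data from the JSON and saves it to the database.

Next, analyze the structure of the JSON returned by the AJAX request.

The data lives mainly in items = res['content']['positionResult']['result'].

What people care about most is the company name, position title, job type, work experience, education, city, and salary, so we build a key-value structure with these fields and store it in the database.

For this we define processing_result(res):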

def processing_result(res):
    items=res['content']['positionResult']['result']
    for item in items:
        position = {
            'companyName':item['companyShortName'],
            'positionName':item['positionName'],
            'jobNature':item['jobNature'],
            'workYear':item['workYear'],
            'education':item['education'],
            'city':item['city'],
            'salary':item['salary']
        }
        sava_to_mongo(position)
        print(position)

This function calls the sava_to_mongo() function from the previous post.
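The extraction step can also be written as a pure function that is separate from storage, which makes it easy to try on sample data. The sketch below uses the same field names as processing_result(); the sample payload is invented for illustration, and using .get() so that a missing key yields None instead of a KeyError is a design choice of this sketch, not something the original code does.

```python
# Maps output keys to the keys used in Lagou's JSON (taken from processing_result).
FIELDS = {
    'companyName': 'companyShortName',
    'positionName': 'positionName',
    'jobNature': 'jobNature',
    'workYear': 'workYear',
    'education': 'education',
    'city': 'city',
    'salary': 'salary',
}


def extract_position(item):
    """Pick the fields of interest from one job entry; missing keys become None."""
    return {out_key: item.get(src_key) for out_key, src_key in FIELDS.items()}


def extract_positions(res):
    """Walk res['content']['positionResult']['result'] without raising on gaps."""
    result = res.get('content', {}).get('positionResult', {}).get('result', [])
    return [extract_position(item) for item in result]


# Invented sample payload with the same nesting as the real response.
sample = {
    'content': {
        'positionResult': {
            'result': [
                {'companyShortName': 'ExampleCo', 'positionName': 'Python Developer',
                 'city': 'Beijing', 'salary': '15k-25k'},
            ]
        }
    }
}

positions = extract_positions(sample)
print(positions[0]['companyName'])  # ExampleCo
```

Each returned dict can then be passed to sava_to_mongo() exactly as in the original loop.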

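At this point the crawler itself is complete.

To call the main function for several pages at once, we use a process pool from the multiprocessing module (note that Pool starts worker processes, not threads). Example: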

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, range(0, 10))  # call main once for each page number 0-9
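Because crawling is dominated by waiting on the network, a thread pool is a common alternative to a process pool. The sketch below is not the approach used in this article; it swaps in concurrent.futures.ThreadPoolExecutor from the standard library. The function name crawl_pages and the choice to collect errors per page are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(fetch, pages, max_workers=4):
    """Run fetch(page) for every page in a thread pool.

    Returns two dicts: {page: result} for successes and {page: exception}
    for failures, so one bad page does not abort the whole run.
    """
    results, errors = {}, {}

    def safe_fetch(page):
        try:
            return page, fetch(page), None
        except Exception as exc:  # record the failure and keep going
            return page, None, exc

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for page, value, exc in executor.map(safe_fetch, pages):
            if exc is None:
                results[page] = value
            else:
                errors[page] = exc
    return results, errors
```

With the crawler above, a call such as crawl_pages(main, range(0, 10)) would run main for each page. The time.sleep(5) inside main still applies per call, so lowering max_workers is the simplest way to reduce the request rate.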

That concludes the article.

Appendix: full source code

import requests
import json
import time
from multiprocessing import Pool
from mongoSitting import *

def open_sitting_file(i):
    # Read the i-th User-Agent string from the local settings file.
    with open('sitting.json', encoding='utf-8') as f:
        sitting = json.load(f)
        return sitting['user-agent'][i]



def processing_result(res):
    items=res['content']['positionResult']['result']
    for item in items:
        position = {
            'companyName':item['companyShortName'],
            'positionName':item['positionName'],
            'jobNature':item['jobNature'],
            'workYear':item['workYear'],
            'education':item['education'],
            'city':item['city'],
            'salary':item['salary']
        }
        sava_to_mongo(position)
        print(position)




def main(i):
    keyword = 'python'
    url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
    data = {
        'first': 'true',
        'pn': i,
        'kd': keyword
    }
    headers = {
     'User-Agent': open_sitting_file(i),
     'Host': 'www.lagou.com',
     'Referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
     'Upgrade-Insecure-Requests': '1',
     'X-Anit-Forge-Code': '0',
     'X-Requested-With': 'XMLHttpRequest'
    }
    res = requests.post(url,data=data,headers=headers)
    json_result=res.json()
    processing_result(json_result)
    time.sleep(5)

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, range(0, 10))
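The appendix reads its User-Agent strings from sitting.json through open_sitting_file(i), but the file itself is not shown. Judging from the expression sitting['user-agent'][i], it presumably holds a list under the key 'user-agent'. The sketch below shows one layout that would fit, with placeholder values, plus a hypothetical loader load_user_agent that wraps the index so the list does not need exactly one entry per page.

```python
import json

# One possible sitting.json layout (placeholder values, not the author's file):
# {
#     "user-agent": [
#         "Mozilla/5.0 (placeholder browser string 1)",
#         "Mozilla/5.0 (placeholder browser string 2)"
#     ]
# }


def load_user_agent(path, i):
    """Return a User-Agent string for page i, cycling through the list."""
    with open(path, encoding='utf-8') as f:
        agents = json.load(f)['user-agent']
    return agents[i % len(agents)]  # wrap around instead of raising IndexError
```

With open_sitting_file as written, pages 0 through 9 require at least ten entries in the list; the modulo in this sketch removes that constraint.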
