Crawling Boss Zhipin job listings with the pyspider framework

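Requirements

1. Iterate over all job categories on the home page.
2. Open each job category's listing page and scrape, by region: job title, monthly salary, required years of experience, required education, hiring company, industry, funding round, headcount (company size), and publication time.
3. Open each job's detail page and scrape the skill tags for that position.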
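
The fields named in steps 2 and 3 can be pictured as one record per job posting. The following is only a sketch of that shape: the class name `JobRecord`, the helper `to_document`, and every field name are hypothetical and do not come from the original project.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class JobRecord:
    """One scraped job posting; all field names are illustrative assumptions."""
    category: str        # job category from the home page (step 1)
    region: str          # region used when scraping the listing (step 2)
    title: str           # job title
    salary: str          # monthly salary, kept as raw text
    experience: str      # required years of experience
    education: str       # required education level
    company: str         # hiring company
    industry: str        # industry the company belongs to
    funding_round: str   # funding round
    company_size: str    # headcount (company size)
    published: str       # publication time, kept as raw text
    skill_tags: List[str] = field(default_factory=list)  # from the detail page (step 3)


def to_document(record: JobRecord) -> dict:
    """Convert a record into a plain dict, a shape suitable for a document store."""
    return asdict(record)
```

Keeping salary and publication time as raw text defers any parsing decisions until the actual page format is known.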

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-06 10:40:07
# Project: boss_recruit

from pyspider.libs.base_handler import *
import re
import datetime
from pymongo import MongoClient

# Connect to the offline database
DB_IP = '10.15.4.126'
DB_PORT = 28018

client = MongoClient(host=DB_IP, port=DB_PORT)

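# The account lives in the admin database: connect, authenticate, then switch databases.
# Note: Database.authenticate() was removed in PyMongo 4.0; this call requires PyMongo 3.x.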
db_auth = client.admin
db_auth.authenticate("xyzhang", "niub-food*2018")

DB_NAME = 'research'
DB_COL = 'boss_recruit'
db = client[DB_NAME]
col = db[DB_COL]



class Handler(BaseHandler):
    crawl_config = {
        "headers": {
            "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
        },
        #"proxy": "http://10.15.100.94:6666"
    }

    url = 'https://www.zhipin.com/?ka=header-home'


    def format_date(self, date):
        return datetime.datetime.strptime(date, '%Y%m%d')


    @every(minutes=24 * 60)
    def on_start(self):
        # get_proxy() is not defined in this excerpt; it is assumed to return a proxy address
        print(get_proxy())
        self.crawl(self.url, callback=self.index_page, proxy=get_proxy())

    @config(age=60)
    def index_page(self, response):
        page = response.etree
        base_url = 'https://www.zhipin.com'

        # List of all industries
        vocation_list = page.xpath("//div[@class='job-menu']//div[@class='menu-sub']/ul/li")

        for each in vocation_list:
            belong = each.xpath("./h4/text()")[0]

            detail_list = each.xpath("./div[@class='text']/a")
            print(belong)
            for detail in detail_list:
                detail_title = detail.xpath("./text()")[0]
                detail_url = base_url + detail.xpath("./@href")[0]

                # save = {"belonging": [belong, detail_title]}
                save = {"belonging": detail_title}

                print(detail_title, detail_url)

                self.crawl(detail_url, callback=self.det
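
To see what the nested loops in `index_page` do without running pyspider, the traversal can be approximated with the standard library. This is a rough sketch under several assumptions: `xml.etree.ElementTree` stands in for the lxml tree that `response.etree` provides (it supports only a subset of XPath, so `.text` and `.get("href")` replace `text()` and `@href`), the function name `list_categories` is hypothetical, and the sample markup and its links are made up rather than taken from the site.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://www.zhipin.com"

# Hand-written fragment that mimics the structure the XPath expressions expect.
# It is NOT real markup from the site; the hrefs are placeholders.
SAMPLE = """
<div class="job-menu">
  <div class="menu-sub">
    <ul>
      <li>
        <h4>Backend</h4>
        <div class="text">
          <a href="/example/java/">Java</a>
          <a href="/example/python/">Python</a>
        </div>
      </li>
    </ul>
  </div>
</div>
"""


def list_categories(markup):
    """Return (group, category title, absolute URL) for every category link."""
    root = ET.fromstring(markup)
    results = []
    # Outer loop: one <li> per category group, as in vocation_list.
    for item in root.findall(".//div[@class='menu-sub']/ul/li"):
        belong = item.find("h4").text
        # Inner loop: one <a> per category inside the group, as in detail_list.
        for link in item.findall("./div[@class='text']/a"):
            detail_url = urljoin(BASE_URL, link.get("href"))
            results.append((belong, link.text, detail_url))
    return results


for group, title, url in list_categories(SAMPLE):
    print(group, title, url)
```

Using `urljoin` rather than string concatenation also handles hrefs that are already absolute, which plain `base_url + href` would not.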