Scraping Lagou job listings with a Python crawler

Author: 风雨雾凇

Categories: crawler, python, scrapy. Tags: python, crawler, scrapy

Article link: https://blog.csdn.net/qq_33850908/article/details/79120203

Results first

Word cloud of job-posting keywords
Word cloud of company keywords

Code (git repository): https://github.com/fengyuwusong/lagou-scrapy

Goal

Scrape Lagou's job postings for Java engineers and turn them into word clouds.
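A word cloud is driven by how often each keyword occurs, so the step between scraping and drawing is a frequency count. The following is a minimal sketch of that count, not the code from the repository. It assumes each scraped record is a dict whose `tag` field holds a list of strings; the function name `count_keywords` and the sample records are made up for illustration.

```python
from collections import Counter


def count_keywords(items, field="tag"):
    """Count how often each keyword appears in `field` across all items.

    Assumption: every item is a dict and item[field], when present,
    is a list of keyword strings.
    """
    counts = Counter()
    for item in items:
        for keyword in item.get(field, []):
            keyword = keyword.strip()
            if keyword:  # skip empty strings left over from scraping
                counts[keyword] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample records, only to show the call.
    sample = [
        {"tag": ["Java", "Spring", "MySQL"]},
        {"tag": ["Java", "Redis"]},
    ]
    for keyword, count in count_keywords(sample).most_common():
        print(keyword, count)
```

The resulting mapping of keyword to count is the kind of input a word cloud library accepts, for example `WordCloud.generate_from_frequencies` in the `wordcloud` package.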

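Studying the target site

Opening Lagou shows that the target URL is https://www.lagou.com/zhaopin/Java/2/?filterOption=2 . Paging through the results shows that filterOption=2 corresponds to the page number, so all listings can be scraped by iterating from the first page to the last.
The data we can scrape:
job title, posting date, salary, minimum requirements, job tags, company name, company type, company address, company keywords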
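The page-by-page traversal described above can be sketched as a small URL generator. This is an illustration under stated assumptions, not the author's spider. It assumes that both the path segment and `filterOption` carry the page number, as in the sample URL, and that the total page count is already known. The names `BASE_URL` and `build_page_urls` are hypothetical.

```python
# Assumption: the page number appears both in the path and in filterOption,
# mirroring https://www.lagou.com/zhaopin/Java/2/?filterOption=2
BASE_URL = "https://www.lagou.com/zhaopin/Java/{page}/?filterOption={page}"


def build_page_urls(total_pages):
    """Return one listing URL per page, from page 1 to total_pages inclusive."""
    if total_pages < 1:
        return []
    return [BASE_URL.format(page=page) for page in range(1, total_pages + 1)]


if __name__ == "__main__":
    # Print the URLs for the first three pages.
    for url in build_page_urls(3):
        print(url)
```

In a Scrapy spider, a list like this could serve as `start_urls`, or each URL could be yielded as a request from `start_requests`.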

Starting the scrapy project:

For details, see my previous article: http://blog.csdn.net/qq_33850908/article/details/79063271

Writing the code:

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LagouItem(scrapy.Item):
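    # One field per piece of data scraped from a job posting.
    name = scrapy.Field()         # job title
    day = scrapy.Field()          # posting date
    salary = scrapy.Field()       # salary
    require = scrapy.Field()      # minimum requirements
    tag = scrapy.Field()          # job tags
    keyWord = scrapy.Field()      # company keywords
    companyName = scrapy.Field()  # company name
    companyType = scrapy.Field()  # company type
    location = scrapy.Field()     # company address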

There is not much to say here: it simply lists the data to be scraped.
middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
from scrapy import signals
import unit.userAgents as userAgents
from unit.proxyMysql import sqlHelper


class LagouSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it
