Results first
- Word cloud of job-posting keywords
- Word cloud of company keywords
Code repository: https://github.com/fengyuwusong/lagou-scrapy
Goal
Scrape Lagou's job listings for Java engineers and turn them into word-cloud images.
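The word-cloud step at the end only needs a mapping from keyword to count. As a minimal stand-alone sketch with made-up sample data (the real keywords come from the crawl described below; a word-cloud library such as `wordcloud` can render directly from a frequency mapping via `generate_from_frequencies`):

```python
from collections import Counter

# Hypothetical sample of keyword lists scraped from job postings;
# in the real project these come out of the spider.
scraped_keywords = [
    ["Java", "Spring", "MySQL"],
    ["Java", "Redis", "Spring"],
    ["Java", "MySQL"],
]

# Flatten all keyword lists and count occurrences.
freq = Counter(kw for posting in scraped_keywords for kw in posting)

print(freq.most_common(1))  # "Java" appears in every sample posting
```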
Studying the target site
Opening Lagou, we find the target URL is https://www.lagou.com/zhaopin/Java/2/?filterOption=2. Flipping through the pages shows that filterOption=2 corresponds to the page number, so all listings can be crawled by iterating over the total number of pages.
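Since the page number appears both in the URL path and in filterOption, the full set of listing URLs can be generated up front. A sketch (TOTAL_PAGES = 30 is an assumed placeholder; the real spider would read the page count from the site):

```python
# Assumed total page count; the actual value should be parsed from the site.
TOTAL_PAGES = 30
BASE = "https://www.lagou.com/zhaopin/Java/{page}/?filterOption={page}"

# One listing URL per results page, ready to use as a spider's start_urls.
start_urls = [BASE.format(page=p) for p in range(1, TOTAL_PAGES + 1)]

print(start_urls[0])   # first page
print(start_urls[-1])  # last page
```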
The data we can scrape includes:
job title, posting date, salary, minimum requirements, job tags, company name, company type, company address, and company keywords.
Starting the Scrapy project:
For the details, see my previous article: http://blog.csdn.net/qq_33850908/article/details/79063271
Writing the code:
items.py:
```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LagouItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()         # job title
    day = scrapy.Field()          # posting date
    salary = scrapy.Field()
    require = scrapy.Field()      # minimum requirements
    tag = scrapy.Field()          # job tags
    keyWord = scrapy.Field()      # company keywords
    companyName = scrapy.Field()
    companyType = scrapy.Field()
    location = scrapy.Field()     # company address
```
Nothing much to explain here: it just lists the fields we want to scrape.
middlewares.py:
```python
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

import random

from scrapy import signals

import unit.userAgents as userAgents
from unit.proxyMysql import sqlHelper


class LagouSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r
```
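The middleware imports `unit.userAgents` and `unit.proxyMysql`, which are the author's own helper modules (they live in the repository linked above). As a guess at the shape of the user-agent helper, assuming it simply holds a pool of User-Agent strings for a downloader middleware to rotate through per request:

```python
import random

# Assumed stand-in for unit/userAgents.py: a small pool of real
# desktop User-Agent strings (the actual module may differ).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",
]


def random_user_agent():
    """Pick one User-Agent at random, as a rotation middleware would
    before setting request.headers['User-Agent']."""
    return random.choice(USER_AGENTS)


print(random_user_agent() in USER_AGENTS)
```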