Requirements:
1. Fields to crawl: job title, salary, company, location, work experience, education requirement, job description (responsibilities), and job requirements (skills).
2. Data storage: store the crawled data in a MongoDB database.
3. Data analysis and visualization:
(1) Analyze the average, maximum, and minimum salary of positions such as "数据分析" (data analysis), "大数据开发工程师" (big data development engineer), and "数据采集" (data collection), and present the results as a bar chart;
(2) Count these big-data-related positions in Chengdu, Beijing, Shanghai, Guangzhou, and Shenzhen, and present the counts as a pie chart;
(3) Analyze the salary level (average, maximum, and minimum) of big-data-related positions requiring 1-3 years of work experience, and present the results as a bar chart.
4. Word cloud. (Sketches for the analysis and visualization steps follow the spider code below.)
Basic project structure:
scrapy startproject qianchen01
cd qianchen01
scrapy genspider -t crawl qianchen qianchen.com
(Note: -t crawl generates a CrawlSpider template; the spider shown below replaces it with a plain scrapy.Spider.)
Basic configuration:
items.py
import scrapy

class Qianchen01Item(scrapy.Item):
    position = scrapy.Field()     # job title
    salary = scrapy.Field()       # salary
    company = scrapy.Field()      # company name
    where = scrapy.Field()        # location
    job_require = scrapy.Field()  # job description / requirements
    experience = scrapy.Field()   # work experience
    education = scrapy.Field()    # education requirement
pipelines.py (MongoDB connection)
from pymongo import MongoClient

class Qianchen01Pipeline(object):
    # Connect to MongoDB and pick the database and collection when the spider
    # opens; this could equally be done in __init__.
    def open_spider(self, spider):
        self.client = MongoClient('localhost', 27017)
        self.db = self.client.QCa_db
        self.collection = self.db.qianchen_collection

    def process_item(self, item, spider):
        # Convert the Item to a plain dict and insert it as one document
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()  # release the connection when the spider closes
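A quick way to confirm that the pipeline is actually writing documents (database and collection names as defined above):

from pymongo import MongoClient

db = MongoClient('localhost', 27017).QCa_db
print(db.qianchen_collection.count_documents({}))  # number of stored jobs
print(db.qianchen_collection.find_one())           # one sample document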
settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4168.2 Safari/537.36'
ROBOTSTXT_OBEY = False  # do not let robots.txt block the crawl
DOWNLOAD_DELAY = 1      # wait one second between requests
ITEM_PIPELINES = {
    'qianchen01.pipelines.Qianchen01Pipeline': 300,  # enable the MongoDB pipeline
}
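A common variant (not what the pipeline above does) is to keep the connection parameters in settings.py and read them through Scrapy's from_crawler hook; a minimal sketch, where the setting names MONGO_URI and MONGO_DB are my own choices:

# settings.py additions (hypothetical names)
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DB = 'QCa_db'

# pipelines.py variant reading those settings
from pymongo import MongoClient

class Qianchen01Pipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this with the running crawler, which exposes settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        self.db.qianchen_collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()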
qianchen.py
# -*- coding: utf-8 -*-
import scrapy
from qianchen01.items import Qianchen01Item

class QianchenSpider(scrapy.Spider):
    name = 'qianchen'
    allowed_domains = ['51job.com']
    start_urls = ['https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        joblist = response.xpath("//div[@id='resultList']/div[@class='el']")
        for job in joblist:
            item = Qianchen01Item()
            item["position"] = job.xpath("./p/span/a/@title").extract_first()            # job title
            item["salary"] = job.xpath("./span[@class='t4']/text()").extract_first()     # salary
            item["company"] = job.xpath("./span[@class='t2']/a/@title").extract_first()  # company name
            item["where"] = job.xpath("./span[@class='t3']/text()").extract_first()      # location
            # Follow the detail page for the description, experience and education fields
            detail_url = job.xpath("./p/span/a/@href").extract_first()
            if detail_url:
                yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                     meta={"item": item})
        # Follow the "next page" link until there is none
        next_url = response.xpath("//div[@class='p_in']//li[@class='bk'][2]/a/@href").extract_first()
        if not next_url:
            return
        yield scrapy.Request(url=next_url, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        item["job_require"] = response.xpath("//div[@class='bmsg job_msg inbox']/p/text()").extract()  # job description / requirements
        info = response.xpath("//div[@class='tHeader tHjob']/div/div/p[2]/text()").extract()
        # The header line lists location / experience / education / ...; guard short lists
        item["experience"] = info[1] if len(info) > 1 else None  # work experience
        item["education"] = info[2] if len(info) > 2 else None   # education requirement
        yield item  # without this the pipeline never receives the item