Scraping 51job (前程无忧) and Yingjiesheng (应届生) Data with Scrapy, Plus Analysis

weixin_44701462

Article link: https://blog.csdn.net/weixin_44701462/article/details/107174675

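This article uses the Python Scrapy framework to scrape job postings from 51job (前程无忧) and Yingjiesheng (应届生), stores the data in MongoDB, then cleans and analyzes it, covering average salary, the distribution of positions, and the relationship between work experience and salary. The results are finally presented through data visualization.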

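I. Overall Requirements
Write a crawler in Python that scrapes data from recruitment websites and stores it in a MongoDB database. Clean the stored data, analyze it, and then visualize the results of the analysis.
II. Environment
PyCharm, MongoDB, Python 3.6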
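One cleaning step this kind of analysis usually needs is turning salary text into numbers. The following is only a minimal sketch, assuming salary strings of the form `1-1.5万/月`, `6-8千/月` or `12-24万/年`. That format, and the helper name `monthly_salary_range`, are assumptions for illustration and are not taken from the original article.

```python
import re

# Assumed units: 千 = thousand yuan, 万 = ten thousand yuan.
_UNIT = {"千": 1_000, "万": 10_000}
# Assumed periods: 月 = per month, 年 = per year.
_MONTHS = {"月": 1, "年": 12}
_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|万)/(月|年)$")


def monthly_salary_range(text):
    """Return (low, high) in yuan per month, or None if the text is not recognized."""
    if not text:
        return None
    match = _PATTERN.match(text.strip())
    if match is None:
        return None  # e.g. negotiable salaries or daily/hourly rates
    low, high, unit, period = match.groups()
    factor = _UNIT[unit] / _MONTHS[period]
    return float(low) * factor, float(high) * factor


print(monthly_salary_range("1-1.5万/月"))  # → (10000.0, 15000.0)
print(monthly_salary_range("面议"))        # → None
```

Returning `None` for unrecognized text keeps such rows visible, so they can be counted or dropped explicitly before computing averages.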
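III. Fields to Scrape
1. Required fields: position name, salary level, hiring company, work location, work experience, education requirement, job content (responsibilities), and job requirements (skills).
(1) Create a new project: scrapy startproject pawuyijob
(2) Generate a spider file: scrapy genspider wuyi wuyi.com
The resulting structure:
[Figure: generated project directory structure]
(3) Edit settings.py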

BOT_NAME = 'pawuyijob'

SPIDER_MODULES = ['pawuyijob.spiders']
NEWSPIDER_MODULE = 'pawuyijob.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
DOWNLOAD_DELAY = 0.5
ITEM_PIPELINES = {
    'pawuyijob.pipelines.PawuyijobPipeline': 300,
}
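The `ITEM_PIPELINES` entry above points at `PawuyijobPipeline` in pipelines.py, which is where the items get written to MongoDB. The original pipeline code is not shown here, so the following is only a minimal sketch, assuming the `pymongo` package and a local MongoDB instance. The connection URI, database name and collection name are hypothetical defaults.

```python
class PawuyijobPipeline:
    """Sketch of a pipeline that stores every scraped item in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="pawuyijob", collection_name="wuyi"):
        # All three defaults are assumptions; adjust them to your setup.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the module loads even where pymongo is absent.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Insert a plain-dict copy so the item itself is left unchanged.
        self.collection.insert_one(dict(item))
        return item
```

Scrapy calls `open_spider` once when the crawl starts, `process_item` for every yielded item, and `close_spider` once at the end, so the connection is opened and closed exactly once per run.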

(4) Write items.py
The code is as follows:

import scrapy


class PawuyijobItem(scrapy.Item):
    # Fields collected for each job posting
    work_place = scrapy.Field()  # work location
    company_name = scrapy.Field()  # company name
    position_name = scrapy.Field()  # position name
    company_info = scrapy.Field()  # company information
    work_salary = scrapy.Field()  # salary
    release_date = scrapy.Field()  # release date
    job_require = scrapy.Field()  # job description and requirements
    contact_way = scrapy.Field()  # contact information
    education = scrapy.Field()  # education requirement
    work_experience = scrapy.Field()  # work experience

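(5) Write the spider file
The most important thing is getting the XPath right. On the search results page, every row of data clearly sits inside the same kind of tag, so we can loop over those nodes.
[Figure: screenshot of the tag that holds each row]
We can also press Ctrl+F in the browser's developer tools to check whether the XPath matches anything.

(6) URL of the detail page

URL of the next page:

The spider code is as follows: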
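The same check can be done offline before running the crawl. The sketch below swaps in the standard library's `xml.etree.ElementTree`, not Scrapy's selectors, and runs the row XPath against a small hand-written snippet. The `SAMPLE` markup is hypothetical, modeled only on the class names used in the spider. ElementTree supports a limited XPath subset without `@attr` or `text()` steps, so attributes and text are read with `.get()` and `findtext()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup that imitates one row of the search results page.
SAMPLE = """
<html><body>
<div id="resultList">
  <div class="el">
    <p><span><a title="Data Analyst" href="https://example.com/job/1.html">Data Analyst</a></span></p>
    <span class="t2"><a title="Example Co.">Example Co.</a></span>
    <span class="t3">Shanghai</span>
    <span class="t4">1-1.5万/月</span>
    <span class="t5">07-06</span>
  </div>
</div>
</body></html>
"""


def extract_rows(markup):
    """Apply the row-level XPath and pull out the list-page fields."""
    root = ET.fromstring(markup)
    rows = []
    for node in root.findall(".//div[@id='resultList']/div[@class='el']"):
        link = node.find("./p/span/a")
        company = node.find("./span[@class='t2']/a")
        rows.append({
            "position_name": link.get("title") if link is not None else None,
            "detail_url": link.get("href") if link is not None else None,
            "company_name": company.get("title") if company is not None else None,
            "work_place": node.findtext("./span[@class='t3']"),
            "work_salary": node.findtext("./span[@class='t4']"),
            "release_date": node.findtext("./span[@class='t5']"),
        })
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

If the row expression matches nothing, the function returns an empty list, which is the same symptom a wrong XPath produces in the real spider.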

# -*- coding: utf-8 -*-
import scrapy


from pawuyijob.items import PawuyijobItem

class WuyiSpider(scrapy.Spider):
    name = 'wuyi'
    allowed_domains = ['51job.com']
    start_urls =['https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        # Each job posting is one <div class="el"> inside the result list
        node_list = response.xpath("//div[@id='resultList']/div[@class='el']")
        for node in node_list:
            item = PawuyijobItem()
            # Position name
            item["position_name"] = node.xpath("./p/span/a/@title").extract_first()
            # Company name
            item["company_name"] = node.xpath("./span[@class='t2']/a/@title").extract_first()
            # Work location
            item["work_place"] = node.xpath("./span[@class='t3']/text()").extract_first()
            # Salary
            item["work_salary"] = node.xpath("./span[@class='t4']/text()").extract_first()
            # Release date
            item["release_date"] = node.xpath("./span[@class='t5']/text()").extract_first()
            # Detail page URL; skip the row if it has none
            detail_url = node.xpath("./p/span/a/@href").extract_first()
            if not detail_url:
                continue
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

        # Once the loop ends, the current page is done, so move on to the next page
        next_url = response.xpath("//div[@class='p_in']//li[@class='bk'][2]/a/@href").extract_first()
        # Stop when there is no next-page URL
        if not next_url:
            return

        yield scrapy.Request(url=next_url, callback=self.parse)

    def parse_detail(self, response):
        item = response.meta["item"]
        # Job description and requirements
        item["job_require"] = response.xpath("//div[@class='bmsg job_msg inbox']/p/text()").extract()
        # Contact information
        item["contact_way"] = response.xpath("//div[@class='bmsg inbox']/a/te
