Scraping Recruitment Data with Python 3 and the Scrapy Framework

Project requirements:

A recruitment site publishes the company's job openings. Opening the homepage at https://careers.tencent.com/ shows a search box, as in the screenshot below:

[Figure: search box on the Tencent Careers homepage]

Entering a job title in the search box leads to the page shown below, which lists the matching positions and has page-selection buttons at the bottom.

[Figure: search results page with pagination controls]

Clicking into one of the positions opens a detail page like the one below, which includes the job title, location, posting time, responsibilities, and requirements.

[Figure: job detail page]

The requirements are as follows:

  1. Set up a Scrapy project for Tencent recruitment
  2. Enter the job title to search for, then scrape every position in the search results, including the job title, location, job category, requirements, responsibilities, and posting date
  3. Store the scraped data in a MySQL database and a CSV file
  4. Build a word cloud from the job requirements

Final run results:

[Figure: final run results]

Project steps:

  1. Set up the Tencent recruitment Scrapy project
  2. Capture the network traffic, analyze the page structure, and work out the scraping approach and strategy
  3. Define the fields to scrape in items.py
  4. Write the main spider logic to scrape the data
  5. Adjust settings.py
  6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
  7. Write the word cloud code and render the word cloud

1. Set up the Tencent recruitment Scrapy project

(1) Install Scrapy

pip install scrapy
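
The later steps also rely on several third-party libraries that appear in the code below: pymysql for the MySQL pipeline, and pandas, jieba, wordcloud, Pillow, and matplotlib for the word cloud. If they are not installed yet, one way to install them is:

pip install pymysql pandas jieba wordcloud pillow matplotlib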

(2) Create the Scrapy project

 scrapy startproject Tencent
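
This generates the standard Scrapy project skeleton, roughly as sketched below (the exact listing may vary slightly between Scrapy versions):

Tencent/
├── scrapy.cfg            # deployment configuration
└── Tencent/
    ├── __init__.py
    ├── items.py          # item (field) definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── __init__.py   # tencent.py will be generated here in step (4)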

(3) After the project is created, switch into the project directory

cd Tencent 

(4) Generate the spider (this creates spiders/tencent.py, limited to the careers.tencent.com domain)

scrapy genspider tencent  careers.tencent.com 

(5) Run the spider

 scrapy crawl tencent 

Alternatively, create a run.py launcher file at the same level as the spiders folder:

# -*- coding:utf-8 -*-
 
from scrapy import cmdline
 
cmdline.execute("scrapy crawl tencent".split())

2. Capture the traffic, analyze the page structure, and work out the scraping approach and strategy

[Figure: captured requests for the search results and job detail pages]
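
The capture shows that the search results and job detail pages load their data from two JSON APIs rather than from the page HTML: each results page calls https://careers.tencent.com/tencentcareer/api/post/Query, and each job detail page calls https://careers.tencent.com/tencentcareer/api/post/ByPostId. The Query response carries the total hit count under Data -> Count and the list of jobs, each with a PostId, under Data -> Posts; the ByPostId response carries the detail fields used later (RecruitPostName, LocationName, Requirement, Responsibility). The strategy is therefore: page through the Query API, collect the PostId values, and request ByPostId for every job. The small standard-library script below is an optional sketch for confirming that structure before writing the spider; it reuses the two API URLs the spider is built on, assumes the API still responds the same way as when the traffic was captured, and uses the search keyword "python" purely as an example:

# probe.py - quick sanity check of the two JSON APIs used by the spider
import json
import urllib.parse
import urllib.request

keyword = urllib.parse.quote("python")  # example keyword
query_url = ("https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1632547113170"
             "&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId="
             "&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn").format(keyword, 1)

with urllib.request.urlopen(query_url) as resp:
    data = json.loads(resp.read().decode("utf-8"))["Data"]

print("Total hits:", data["Count"])         # drives the page loop in the spider
first_post_id = data["Posts"][0]["PostId"]  # drives the detail requests
print("First PostId:", first_post_id)

detail_url = ("https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1632546677601"
              "&postId={}&language=zh-cn").format(first_post_id)
with urllib.request.urlopen(detail_url) as resp:
    job = json.loads(resp.read().decode("utf-8"))["Data"]

# The four fields the spider will store
print(job["RecruitPostName"], job["LocationName"])
print(job["Requirement"])
print(job["Responsibility"])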

3. Define the fields to scrape in items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass
    # Define the data fields to scrape
    job_name = scrapy.Field()
    job_location = scrapy.Field()
    job_requirement = scrapy.Field()
    job_responsibility = scrapy.Field()
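
Note that the requirements list in the introduction also mentions the job category and posting date, while the code above only declares four fields. If the extra two are wanted, two more fields could be added here and filled in the spider's parse_job callback from the corresponding keys of the detail JSON; the field names below are illustrative, not part of the original code, and the matching JSON keys would need to be confirmed from the captured ByPostId response:

# Hypothetical additional fields (illustrative names only)
# job_category = scrapy.Field()      # job category
# job_publish_time = scrapy.Field()  # posting date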

4. Write the main spider logic in tencent.py to scrape the data

import scrapy
import urllib.parse
import json
import math
from ..items import TencentItem

class TencentSpider(scrapy.Spider):
    # Request the first-level page (search API) to collect each job's post_id,
    # then build the second-level URL from the post_id to reach the job detail page.
    # The total hit count ("Count") is read first so that every results page can be requested.
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']

    # Prompt for the search keyword once, when the spider class is loaded
    job = input("请输入你要搜索的工作岗位:")
    # URL-encode the keyword
    encode_job = urllib.parse.quote(job)
    # First-level URL (search API)
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1632547113170&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # Second-level URL (job detail API)
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1632546677601&postId={}&language=zh-cn"
    # Override start_urls with the first results page
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # Parse the response body into a Python dict
        json_dic = json.loads(response.text)
        # "Data" -> "Count" holds the total number of matching jobs; cast it to int
        job_count = int(json_dic["Data"]["Count"])
        # Derive the number of result pages (10 jobs per page); ceil() rounds up
        total_pages = math.ceil(job_count / 10)
        # Build the URL of every results page
        for page in range(1, total_pages + 1):
            # First-level (search results) page URL
            one_url = self.one_url.format(self.encode_job, page)
            # Request the first-level page to collect every job's post_id.
            # yield hands the request to the scheduler, which queues it
            # (see https://blog.csdn.net/mahaokun/article/details/120471305 for how yield works here);
            # callback routes the response to the custom parse_post_ids method.
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    # Callback for the first-level (search results) requests
    def parse_post_ids(self, response):
        # "Data" -> "Posts" in the response JSON is the list of job dicts
        posts_list = json.loads(response.text)["Data"]["Posts"]
        for p in posts_list:
            post_id = p["PostId"]
            # Build the second-level (job detail) URL
            two_url = self.two_url.format(post_id)
            # Hand the request to the scheduler
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    # Callback for the second-level (job detail) requests
    def parse_job(self, response):
        # Parse the job detail JSON and fill the item fields
        item = TencentItem()
        job = json.loads(response.text)["Data"]
        item['job_name'] = job["RecruitPostName"]
        item['job_location'] = job["LocationName"]
        item['job_requirement'] = job["Requirement"]
        item['job_responsibility'] = job["Responsibility"]

        yield item
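
Because the input() call sits at class level, the keyword prompt appears as soon as the spider module is loaded, before the crawl itself starts. A typical run (via run.py, or scrapy crawl tencent from inside the project, assuming the pipelines from step 6 are configured) looks like:

python run.py
# the console first shows the prompt defined in the spider: 请输入你要搜索的工作岗位:
# after a keyword is entered, Scrapy starts requesting the search and detail pages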

5. Adjust settings.py

(1) Whether to obey robots.txt; the default True means obey, and it is usually changed to False
     ROBOTSTXT_OBEY = False
(2) Maximum number of concurrent requests; the default is 16
     CONCURRENT_REQUESTS = 1
(3) Download delay, similar to time.sleep(2)
     DOWNLOAD_DELAY = 2
(4) Default request headers, with a User-Agent added
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
(5) Item pipelines; the number (e.g. 300) is the priority, and the lower the number the earlier the pipeline runs
    ITEM_PIPELINES = {
       'Tencent.pipelines.TencentPipeline': 300,
       'Tencent.pipelines.TencentMysqlPipeline': 200
    }
# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'

# Obey robots.txt rules
# Whether to obey robots.txt; the default True means obey, changed to False here
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests (default 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay, similar to time.sleep(2)
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Override the default request headers
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Item pipelines: the number is the priority; lower numbers run first
ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 300,
   'Tencent.pipelines.TencentMysqlPipeline': 200
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql

class TencentPipeline:
    # Data-handling logic: simply print each item as it passes through
    def process_item(self, item, spider):
        print(item)
        return item

# Pipeline that stores items in MySQL
class TencentMysqlPipeline:
    # Runs once when the spider opens; used to connect to the database
    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root", password="root", database="tencent", port=3306, charset="utf8")
        # Create a cursor for executing SQL statements
        self.cursor = self.db.cursor()
        print("开始爬虫")

    def process_item(self, item, spider):
        sql_insert = "insert into tencent(name,location,requirement,responsibility) values(%s,%s,%s,%s)"
        data = [
            item["job_name"],
            item["job_location"],
            item["job_requirement"],
            item["job_responsibility"]
        ]
        self.cursor.execute(sql_insert, data)
        self.db.commit()
        return item

    # Runs once when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print("退出爬虫")

7. Write the word cloud code and render the word cloud

Create a word_cloud folder under the Tencent project, copy the yinwu.jpg and STHUPO.TTF assets into it, and create a wc.py file:

import numpy as np
import pandas as pd
# jieba: Chinese word segmentation
import jieba
# WordCloud builds the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp


# Read the CSV file with pandas; the result is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")
# Take the job_responsibility column and join it into a single string
job_responsibility = df["job_responsibility"].values
job_responsibility_str = "".join(job_responsibility)
# Segment the text with jieba and turn the generator into a list
jieba_split = list(jieba.cut(job_responsibility_str))

text = " ".join(jieba_split)
# Read the word cloud mask image and convert it to a NumPy array
mask = Image.open("yinwu.jpg")
mask = np.array(mask)
# Build the WordCloud object
# mask: shape template; stopwords: words to filter out;
# collocations=False: drop repeated bigrams; background_color: background color
stopwords = ["的", "和", "技", "品"]

wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
# Generate the word cloud
word_image = wc.generate(text)
# Recolor the cloud from the mask image
image_color = ImageColorGenerator(mask)
wc.recolor(color_func=image_color)
# Show the word cloud
mp.imshow(word_image)
# Hide the axes
mp.axis("off")
# Display the figure
mp.show()

The resulting word cloud:

To build word clouds for several columns in batch, the code above can be reworked into a get_cloud_img function:

import numpy as np
import pandas as pd
# jieba: Chinese word segmentation
import jieba
# WordCloud builds the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp


# Read the CSV file with pandas; the result is a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")

def get_cloud_img(data, label):
    # Join the column values into a single string
    job_responsibility_str = "".join(data)
    # Segment the text with jieba and turn the generator into a list
    jieba_split = list(jieba.cut(job_responsibility_str))

    text = " ".join(jieba_split)
    # Read the word cloud mask image and convert it to a NumPy array
    mask = Image.open("yinwu.jpg")
    mask = np.array(mask)
    # Build the WordCloud object
    # mask: shape template; stopwords: words to filter out;
    # collocations=False: drop repeated bigrams; background_color: background color
    stopwords = ["的", "和", "技", "品"]

    wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
    # Generate the word cloud
    word_image = wc.generate(text)
    # Recolor the cloud from the mask image
    image_color = ImageColorGenerator(mask)
    wc.recolor(color_func=image_color)
    # Show the word cloud
    mp.imshow(word_image)
    # Hide the axes
    mp.axis("off")
    # Save the figure under the given label
    mp.savefig("%s.png" % label)
    # Display the figure
    mp.show()

# Generate one word cloud per column (labels matched to the columns they describe)
get_cloud_img(df["job_responsibility"].values, "腾讯招聘-职责词云图")
get_cloud_img(df["job_requirement"].values, "腾讯招聘-需求词云图")
