Project requirements:
A job site hosts the positions that companies post. Opening the home page
https://careers.tencent.com/
shows a search box, as pictured below:
Entering a job title in the search box leads to a results page listing the matching postings, with page-selection buttons at the bottom, as pictured below.
Clicking one of the postings opens a detail page showing the job title, location, posting date, responsibilities, requirements, and so on, as pictured below.
The task is as follows:
- Set up a Scrapy project for Tencent Careers
- Take the job title to scrape as input, then scrape every posting in the search results: job title, location, job category, requirements, responsibilities, and posting date
- Store the scraped data in a MySQL database and a CSV file
- Generate a word cloud from the job requirements
Final result:
Project steps:
- Set up the Tencent Careers Scrapy project
- Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
- Define the data fields to scrape in items.py
- Write the spider's main logic to scrape the data
- Edit settings.py
- Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
- Write the word-cloud code and output the word cloud
1. Set up the Tencent Careers Scrapy project
(1) Install Scrapy
pip install scrapy
(2) Create a Scrapy project
scrapy startproject Tencent
(3) Change into the project directory
cd Tencent
(4) Generate the spider
scrapy genspider tencent careers.tencent.com
(5) Run the spider
scrapy crawl tencent
Alternatively, create a run.py launcher at the same level as the spiders folder:
# -*- coding:utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl tencent".split())
2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
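The search endpoint returns JSON rather than HTML, so no DOM parsing is needed. A small sketch with a hand-written sample payload (field names taken from the spider code below; the values themselves are made up) shows the two lookups the spider performs on each search response:

```python
import json
import math

# Hand-written sample shaped like the Query API's response (values are made up)
sample = json.dumps({
    "Code": 200,
    "Data": {
        "Count": 23,
        "Posts": [
            {"PostId": "1001", "RecruitPostName": "Backend Engineer"},
            {"PostId": "1002", "RecruitPostName": "Data Analyst"},
        ]
    }
})

data = json.loads(sample)["Data"]
# Total matches -> number of 10-per-page result pages to request
total_pages = math.ceil(int(data["Count"]) / 10)
# Each posting's PostId feeds the ByPostId detail URL
post_ids = [p["PostId"] for p in data["Posts"]]
print(total_pages, post_ids)  # 3 ['1001', '1002']
```

"Count" drives pagination of the level-1 search URL, and each "PostId" is substituted into the level-2 detail URL.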
3. Define the data fields to scrape in items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Data fields to scrape
    job_name = scrapy.Field()
    job_location = scrapy.Field()
    job_requirement = scrapy.Field()
    job_responsibility = scrapy.Field()
4. Write the spider's main logic in tencent.py to scrape the data
import scrapy
import urllib.parse
import json
import math
from ..items import TencentItem

class TencentSpider(scrapy.Spider):
    # Request the level-1 (search) pages to collect each posting's post_id,
    # then build the level-2 (detail) URL from that id to fetch the job details.
    # The first response's "Count" field gives the total number of matches,
    # from which we derive every level-1 page URL for the search.
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    job = input("Enter the job title to search for: ")
    # URL-encode the keyword
    encode_job = urllib.parse.quote(job)
    # Level-1 URL (search API)
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1632547113170&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # Level-2 URL (job detail API)
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1632546677601&postId={}&language=zh-cn"
    # Override start_urls with the first search page
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # Parse the response body into a Python dict
        json_dic = json.loads(response.text)
        # "Data" -> "Count" holds the total number of matching postings
        job_count = int(json_dic["Data"]["Count"])
        # 10 results per page; math.ceil() rounds the last partial page up
        total_pages = math.ceil(job_count / 10)
        # Build the URL for every result page
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # yield hands the request to the scheduler's queue without ending
            # the function (see https://blog.csdn.net/mahaokun/article/details/120471305);
            # parse_post_ids is the callback that handles the response
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    # Callback for level-1 (search) responses: collect every posting's post_id
    def parse_post_ids(self, response):
        # "Data" -> "Posts" is the list of postings on this page
        posts_list = json.loads(response.text)["Data"]["Posts"]
        for p in posts_list:
            post_id = p["PostId"]
            # Build the level-2 URL and hand it to the scheduler
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    # Callback for level-2 (detail) responses: extract the job fields
    def parse_job(self, response):
        item = TencentItem()
        job = json.loads(response.text)["Data"]
        item['job_name'] = job["RecruitPostName"]
        item['job_location'] = job["LocationName"]
        item['job_requirement'] = job["Requirement"]
        item['job_responsibility'] = job["Responsibility"]
        yield item
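Two small computations in the spider are worth checking in isolation: urllib.parse.quote percent-encodes the Chinese keyword so it can be embedded in one_url, and math.ceil turns the match count into a page count:

```python
import math
import urllib.parse

# Percent-encode a Chinese keyword, as the spider does before building one_url
keyword = "数据"
encoded = urllib.parse.quote(keyword)
print(encoded)  # %E6%95%B0%E6%8D%AE

# 10 results per page; ceil rounds a partial last page up
for count in (0, 10, 23):
    print(count, math.ceil(count / 10))
```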
5. Edit settings.py
(1) Whether to obey robots.txt; the default is True (obey), usually changed to False:
ROBOTSTXT_OBEY = False
(2) Maximum number of concurrent requests; the default is 16:
CONCURRENT_REQUESTS = 1
(3) Download delay, similar to time.sleep(2):
DOWNLOAD_DELAY = 2
(4) Default request headers, adding a User-Agent:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
(5) Item pipelines; the number is the priority, and lower numbers run first:
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}
# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Tencent'
SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
# Obey robots.txt rules
# Whether to obey robots.txt; the default is True, usually changed to False
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests; the default is 16
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay, similar to time.sleep(2)
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Pipelines; 300 is the priority value: the lower the number, the higher the priority
ITEM_PIPELINES = {
'Tencent.pipelines.TencentPipeline': 300,
'Tencent.pipelines.TencentMysqlPipeline': 200
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql

class TencentPipeline:
    # Default pipeline: just print each item
    def process_item(self, item, spider):
        print(item)
        return item

# Pipeline that writes items to MySQL
class TencentMysqlPipeline:
    # Called once when the spider opens; connect to the database here
    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root", password="root",
                                  database="tencent", port=3306, charset="utf8")
        # Cursor object used to execute SQL statements
        self.cursor = self.db.cursor()
        print("Spider started")

    def process_item(self, item, spider):
        sql_insert = "insert into tencent(name,location,requirement,responsibility) values(%s,%s,%s,%s)"
        data = [
            item["job_name"],
            item["job_location"],
            item["job_requirement"],
            item["job_responsibility"]
        ]
        self.cursor.execute(sql_insert, data)
        self.db.commit()
        return item

    # Called once when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print("Spider closed")
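Neither pipeline above actually writes the CSV file that the word-cloud step later reads, so one more pipeline is needed. A minimal sketch, assuming a hypothetical class name TencentCsvPipeline (it would also need its own ITEM_PIPELINES entry in settings.py), using only the stdlib csv module:

```python
import csv

class TencentCsvPipeline:
    # Hypothetical CSV pipeline: open the file once when the spider starts
    def open_spider(self, spider):
        self.file = open("tencent.csv", "w", newline="", encoding="utf-8-sig")
        self.writer = csv.writer(self.file)
        # Header row matches the column names the word-cloud step reads
        self.writer.writerow(["job_name", "job_location",
                              "job_requirement", "job_responsibility"])

    # Append one row per scraped item
    def process_item(self, item, spider):
        self.writer.writerow([
            item["job_name"],
            item["job_location"],
            item["job_requirement"],
            item["job_responsibility"],
        ])
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
```

The utf-8-sig encoding adds a BOM so Excel displays the Chinese text correctly; plain utf-8 works fine for pandas.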
7. Write the word-cloud code and output the word cloud
Create a word_cloud folder under the Tencent project, copy the yingwu.jpg and STHUPO.TTF assets into it, and create a wc.py file:
import numpy as np
import pandas as pd
# jieba tokenizes the Chinese text
import jieba
# WordCloud renders the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# Read the CSV into a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")
# Get the job_responsibility column and join it into one string
job_responsibility = df["job_responsibility"].values
job_responsibility_str = "".join(job_responsibility)
# Tokenize with jieba, then rejoin with spaces
jieba_split = list(jieba.cut(job_responsibility_str))
text = " ".join(jieba_split)
# Load the word-cloud mask template and convert it to a NumPy array
mask = Image.open("yingwu.jpg")
mask = np.array(mask)
# Build the WordCloud object
# mask: shape template; stopwords: words to drop; collocations=False removes repeated phrases; background_color: background
stopwords = ["的", "和", "技", "品"]
wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
# Generate the word cloud
word_image = wc.generate(text)
# Recolor the cloud from the mask image's colors
image_color = ImageColorGenerator(mask)
wc.recolor(color_func=image_color)
# Display the word cloud
mp.imshow(word_image)
# Hide the axes
mp.axis("off")
# Show the figure
mp.show()
The result looks like this:
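WordCloud sizes each word by its frequency in the space-separated text. A stdlib-only stand-in (no jieba or wordcloud required) shows the frequency counting that drives the sizing, with a naive whitespace split standing in for jieba's segmentation:

```python
from collections import Counter

# Sample space-separated tokens, standing in for jieba.cut output
text = "python 爬虫 数据 分析 python 数据 的"
stopwords = {"的"}

# Count token frequencies, skipping stopwords; this is essentially
# what WordCloud.generate does internally before laying out the image
freq = Counter(tok for tok in text.split() if tok not in stopwords)
print(freq.most_common(2))  # highest-frequency tokens first
```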
To build word clouds from several columns in batch, rework the code above into a get_cloud_img function:
import numpy as np
import pandas as pd
# jieba tokenizes the Chinese text
import jieba
# WordCloud renders the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# Read the CSV into a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")

def get_cloud_img(data, label):
    # Join the column's values into one string
    job_str = "".join(data)
    # Tokenize with jieba, then rejoin with spaces
    text = " ".join(jieba.cut(job_str))
    # Load the word-cloud mask template and convert it to a NumPy array
    mask = np.array(Image.open("yingwu.jpg"))
    # mask: shape template; stopwords: words to drop; collocations=False removes repeated phrases
    stopwords = ["的", "和", "技", "品"]
    wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
    # Generate the word cloud
    word_image = wc.generate(text)
    # Recolor the cloud from the mask image's colors
    wc.recolor(color_func=ImageColorGenerator(mask))
    # Display the word cloud
    mp.imshow(word_image)
    # Hide the axes
    mp.axis("off")
    # Save the figure under the given label
    mp.savefig("%s.png" % label)
    # Show the figure
    mp.show()

# Generate one cloud per column (note: the responsibilities data gets the
# responsibilities label and the requirements data the requirements label)
get_cloud_img(df["job_responsibility"].values, "Tencent Jobs - Responsibilities Word Cloud")
get_cloud_img(df["job_requirement"].values, "Tencent Jobs - Requirements Word Cloud")