Project requirements:
A job site hosts the positions that companies post. Opening the home page
https://careers.tencent.com/
shows a search box, as pictured below:
Entering a job title in the search box leads to a results page listing the matching postings, with page-selection buttons at the bottom, as pictured below.
Clicking one of the postings opens a detail page showing the job title, location, posting date, responsibilities, requirements, and so on, as pictured below.
The task is as follows:
- Set up a Scrapy project for Tencent Careers
- Take the job title to scrape as input, then scrape every posting in the search results: job title, location, job category, requirements, responsibilities, and posting date
- Store the scraped data in a MySQL database and a CSV file
- Generate a word cloud from the job requirements
Final result:
Project steps:
- Set up the Tencent Careers Scrapy project
- Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
- Define the data fields to scrape in items.py
- Write the spider's main logic to scrape the data
- Edit settings.py
- Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
- Write the word-cloud code and output the word cloud
1. Set up the Tencent Careers Scrapy project
(1) Install Scrapy
pip install scrapy
(2) Create a Scrapy project
scrapy startproject Tencent
(3) Change into the project directory
cd Tencent
(4) Generate the spider
scrapy genspider tencent careers.tencent.com
(5) Run the spider
scrapy crawl tencent
Alternatively, create a run.py launcher at the same level as the spiders folder:
# -*- coding:utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl tencent".split())
2. Capture the network requests, analyze the page structure, and work out the scraping approach and strategy
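The search endpoint returns JSON rather than HTML, so no DOM parsing is needed. A small sketch with a hand-written sample payload (field names taken from the spider code below; the values themselves are made up) shows the two lookups the spider performs on each search response:

```python
import json
import math

# Hand-written sample shaped like the Query API's response (values are made up)
sample = json.dumps({
    "Code": 200,
    "Data": {
        "Count": 23,
        "Posts": [
            {"PostId": "1001", "RecruitPostName": "Backend Engineer"},
            {"PostId": "1002", "RecruitPostName": "Data Analyst"},
        ]
    }
})

data = json.loads(sample)["Data"]
# Total matches -> number of 10-per-page result pages to request
total_pages = math.ceil(int(data["Count"]) / 10)
# Each posting's PostId feeds the ByPostId detail URL
post_ids = [p["PostId"] for p in data["Posts"]]
print(total_pages, post_ids)  # 3 ['1001', '1002']
```

"Count" drives pagination of the level-1 search URL, and each "PostId" is substituted into the level-2 detail URL.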
3. Define the data fields to scrape in items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Data fields to scrape
    job_name = scrapy.Field()
    job_location = scrapy.Field()
    job_requirement = scrapy.Field()
    job_responsibility = scrapy.Field()
4. Write the spider's main logic in tencent.py to scrape the data
import scrapy
import urllib.parse
import json
import math
from ..items import TencentItem

class TencentSpider(scrapy.Spider):
    # Request the level-1 (search) pages to collect each posting's post_id,
    # then build the level-2 (detail) URL from that id to fetch the job details.
    # The first response's "Count" field gives the total number of matches,
    # from which we derive every level-1 page URL for the search.
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # start_urls = ['http://careers.tencent.com/']
    job = input("Enter the job title to search for: ")
    # URL-encode the keyword
    encode_job = urllib.parse.quote(job)
    # Level-1 URL (search API)
    one_url = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1632547113170&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn"
    # Level-2 URL (job detail API)
    two_url = "https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1632546677601&postId={}&language=zh-cn"
    # Override start_urls with the first search page
    start_urls = [one_url.format(encode_job, 1)]

    def parse(self, response):
        # Parse the response body into a Python dict
        json_dic = json.loads(response.text)
        # "Data" -> "Count" holds the total number of matching postings
        job_count = int(json_dic["Data"]["Count"])
        # 10 results per page; math.ceil() rounds the last partial page up
        total_pages = math.ceil(job_count / 10)
        # Build the URL for every result page
        for page in range(1, total_pages + 1):
            one_url = self.one_url.format(self.encode_job, page)
            # yield hands the request to the scheduler's queue without ending
            # the function (see https://blog.csdn.net/mahaokun/article/details/120471305);
            # parse_post_ids is the callback that handles the response
            yield scrapy.Request(url=one_url, callback=self.parse_post_ids)

    # Callback for level-1 (search) responses: collect every posting's post_id
    def parse_post_ids(self, response):
        # "Data" -> "Posts" is the list of postings on this page
        posts_list = json.loads(response.text)["Data"]["Posts"]
        for p in posts_list:
            post_id = p["PostId"]
            # Build the level-2 URL and hand it to the scheduler
            two_url = self.two_url.format(post_id)
            yield scrapy.Request(url=two_url, callback=self.parse_job)

    # Callback for level-2 (detail) responses: extract the job fields
    def parse_job(self, response):
        item = TencentItem()
        job = json.loads(response.text)["Data"]
        item['job_name'] = job["RecruitPostName"]
        item['job_location'] = job["LocationName"]
        item['job_requirement'] = job["Requirement"]
        item['job_responsibility'] = job["Responsibility"]
        yield item
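Two small computations in the spider are worth checking in isolation: urllib.parse.quote percent-encodes the Chinese keyword so it can be embedded in one_url, and math.ceil turns the match count into a page count:

```python
import math
import urllib.parse

# Percent-encode a Chinese keyword, as the spider does before building one_url
keyword = "数据"
encoded = urllib.parse.quote(keyword)
print(encoded)  # %E6%95%B0%E6%8D%AE

# 10 results per page; ceil rounds a partial last page up
for count in (0, 10, 23):
    print(count, math.ceil(count / 10))
```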
5. Edit settings.py
(1) Whether to obey robots.txt; the default is True (obey), usually changed to False:
ROBOTSTXT_OBEY = False
(2) Maximum number of concurrent requests; the default is 16:
CONCURRENT_REQUESTS = 1
(3) Download delay, similar to time.sleep(2):
DOWNLOAD_DELAY = 2
(4) Default request headers, adding a User-Agent:
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
(5) Item pipelines; the number is the priority, and lower numbers run first:
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    'Tencent.pipelines.TencentMysqlPipeline': 200
}
# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Tencent'
SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the User-Agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
# Obey robots.txt rules
# Whether to obey robots.txt; the default is True, usually changed to False
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# Maximum number of concurrent requests; the default is 16
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay, similar to time.sleep(2)
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Pipelines; 300 is the priority value: the lower the number, the higher the priority
ITEM_PIPELINES = {
'Tencent.pipelines.TencentPipeline': 300,
'Tencent.pipelines.TencentMysqlPipeline': 200
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
6. Write the pipeline file pipelines.py to store the data in MySQL and a CSV file
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql

class TencentPipeline:
    # Default pipeline: just print each item
    def process_item(self, item, spider):
        print(item)
        return item

# Pipeline that writes items to MySQL
class TencentMysqlPipeline:
    # Called once when the spider opens; connect to the database here
    def open_spider(self, spider):
        self.db = pymysql.connect(host="localhost", user="root", password="root",
                                  database="tencent", port=3306, charset="utf8")
        # Cursor object used to execute SQL statements
        self.cursor = self.db.cursor()
        print("Spider started")

    def process_item(self, item, spider):
        sql_insert = "insert into tencent(name,location,requirement,responsibility) values(%s,%s,%s,%s)"
        data = [
            item["job_name"],
            item["job_location"],
            item["job_requirement"],
            item["job_responsibility"]
        ]
        self.cursor.execute(sql_insert, data)
        self.db.commit()
        return item

    # Called once when the spider closes
    def close_spider(self, spider):
        self.cursor.close()
        self.db.close()
        print("Spider closed")
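Neither pipeline above actually writes the CSV file that the word-cloud step later reads, so one more pipeline is needed. A minimal sketch, assuming a hypothetical class name TencentCsvPipeline (it would also need its own ITEM_PIPELINES entry in settings.py), using only the stdlib csv module:

```python
import csv

class TencentCsvPipeline:
    # Hypothetical CSV pipeline: open the file once when the spider starts
    def open_spider(self, spider):
        self.file = open("tencent.csv", "w", newline="", encoding="utf-8-sig")
        self.writer = csv.writer(self.file)
        # Header row matches the column names the word-cloud step reads
        self.writer.writerow(["job_name", "job_location",
                              "job_requirement", "job_responsibility"])

    # Append one row per scraped item
    def process_item(self, item, spider):
        self.writer.writerow([
            item["job_name"],
            item["job_location"],
            item["job_requirement"],
            item["job_responsibility"],
        ])
        return item

    # Close the file when the spider finishes
    def close_spider(self, spider):
        self.file.close()
```

The utf-8-sig encoding adds a BOM so Excel displays the Chinese text correctly; plain utf-8 works fine for pandas.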
7. Write the word-cloud code and output the word cloud
Create a word_cloud folder under the Tencent project, copy the yingwu.jpg and STHUPO.TTF assets into it, and create a wc.py file:
import numpy as np
import pandas as pd
# jieba tokenizes the Chinese text
import jieba
# WordCloud renders the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# Read the CSV into a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")
# Get the job_responsibility column and join it into one string
job_responsibility = df["job_responsibility"].values
job_responsibility_str = "".join(job_responsibility)
# Tokenize with jieba, then rejoin with spaces
jieba_split = list(jieba.cut(job_responsibility_str))
text = " ".join(jieba_split)
# Load the word-cloud mask template and convert it to a NumPy array
mask = Image.open("yingwu.jpg")
mask = np.array(mask)
# Build the WordCloud object
# mask: shape template; stopwords: words to drop; collocations=False removes repeated phrases; background_color: background
stopwords = ["的", "和", "技", "品"]
wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
# Generate the word cloud
word_image = wc.generate(text)
# Recolor the cloud from the mask image's colors
image_color = ImageColorGenerator(mask)
wc.recolor(color_func=image_color)
# Display the word cloud
mp.imshow(word_image)
# Hide the axes
mp.axis("off")
# Show the figure
mp.show()
The result looks like this:
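WordCloud sizes each word by its frequency in the space-separated text. A stdlib-only stand-in (no jieba or wordcloud required) shows the frequency counting that drives the sizing, with a naive whitespace split standing in for jieba's segmentation:

```python
from collections import Counter

# Sample space-separated tokens, standing in for jieba.cut output
text = "python 爬虫 数据 分析 python 数据 的"
stopwords = {"的"}

# Count token frequencies, skipping stopwords; this is essentially
# what WordCloud.generate does internally before laying out the image
freq = Counter(tok for tok in text.split() if tok not in stopwords)
print(freq.most_common(2))  # highest-frequency tokens first
```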
To build word clouds from several columns in batch, rework the code above into a get_cloud_img function:
import numpy as np
import pandas as pd
# jieba tokenizes the Chinese text
import jieba
# WordCloud renders the cloud; ImageColorGenerator colors it from an image
from wordcloud import WordCloud, ImageColorGenerator
# PIL reads the mask image
from PIL import Image
import matplotlib.pyplot as mp

# Read the CSV into a DataFrame
df = pd.read_csv("../Tencent/tencent.csv", engine="python")

def get_cloud_img(data, label):
    # Join the column's values into one string
    job_str = "".join(data)
    # Tokenize with jieba, then rejoin with spaces
    text = " ".join(jieba.cut(job_str))
    # Load the word-cloud mask template and convert it to a NumPy array
    mask = np.array(Image.open("yingwu.jpg"))
    # mask: shape template; stopwords: words to drop; collocations=False removes repeated phrases
    stopwords = ["的", "和", "技", "品"]
    wc = WordCloud(font_path="STHUPO.TTF", mask=mask, stopwords=stopwords, collocations=False, background_color="white")
    # Generate the word cloud
    word_image = wc.generate(text)
    # Recolor the cloud from the mask image's colors
    wc.recolor(color_func=ImageColorGenerator(mask))
    # Display the word cloud
    mp.imshow(word_image)
    # Hide the axes
    mp.axis("off")
    # Save the figure under the given label
    mp.savefig("%s.png" % label)
    # Show the figure
    mp.show()

# Generate one cloud per column (note: the responsibilities data gets the
# responsibilities label and the requirements data the requirements label)
get_cloud_img(df["job_responsibility"].values, "Tencent Jobs - Responsibilities Word Cloud")
get_cloud_img(df["job_requirement"].values, "Tencent Jobs - Requirements Word Cloud")