Advanced Python with Scrapy: Scraping Tencent Recruitment Data
- Goal: use Scrapy to scrape the job-detail data from Tencent's careers site
1. Create the Scrapy project
scrapy startproject qqSpider
cd qqSpider
scrapy genspider hr tencent.com
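These commands generate Scrapy's standard project skeleton:
qqSpider/
├── scrapy.cfg
└── qqSpider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── hr.py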
2. Analyze the pages
- Target: https://careers.tencent.com/
1. Find the initial URL
- Clicking "All Positions" jumps to https://careers.tencent.com/search.html
- The HTML source contains no job data at all
- Filtering the Network panel down to XHR reveals a request whose name starts with Query?
- Its Preview tab shows exactly the data we want, returned as JSON (shape sketched below)
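Based only on the fields the spider reads later (everything else omitted), the response looks roughly like:
{
    "Data": {
        "Posts": [
            {"PostId": "...", "ProductName": "...", "CategoryName": "..."}
        ]
    }
}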
2. Determine the initial URL
- The full request URL is: https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1598527234342&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
- Strip the parameters that don't affect the result, keeping https://careers.tencent.com/tencentcareer/api/post/Query?&pageIndex={}&pageSize=10
- Change the value after pageIndex= to page through the results
- Change the value after pageSize= to control how many jobs appear per page; the default is 10 per page (see the quick check after this list)
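A quick way to confirm the trimmed URL still works outside Scrapy; a minimal sketch using the requests library (not part of this project, install it separately):
import requests

url = 'https://careers.tencent.com/tencentcareer/api/post/Query?&pageIndex={}&pageSize=10'
data = requests.get(url.format(1)).json()
# Print the name of every job on page 1
for job in data['Data']['Posts']:
    print(job['ProductName'])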
3. Find the detail-page URL
- Open any job posting, e.g. https://careers.tencent.com/jobdesc.html?postId=1298927654904274944
- Same situation as the list page: the page source contains no job information
- The XHR filter shows two requests: one returns this job's own information, the other returns all six related jobs shown on the page
4. Determine the detail-page URL
- Pick the JSON response that returns this job's own data, the ByPostId endpoint (the other one, https://careers.tencent.com/tencentcareer/api/post/ByRelated?timestamp=1598527677534&postId=1298927654904274944&num=7&language=zh-cn, returns the related jobs)
- Strip timestamp and the other parameters that don't affect the result, keeping: https://careers.tencent.com/tencentcareer/api/post/ByPostId?&postId={}
- Substitute a different value after postId= to address a different job (checked below)
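The same sanity check for the detail endpoint (again a requests sketch; the postId is the one from the example job above):
import requests

url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?&postId={}'
data = requests.get(url.format('1298927654904274944')).json()
# Responsibility and Requirement are the two fields the spider stores later
print(data['Data']['Responsibility'])
print(data['Data']['Requirement'])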
3. Example program
1. hr.py
- Notes:
- 1. Right-click the qqSpider directory and mark it as Source Root (the folder turns blue); the classes below can then be imported directly
- 2. meta={'item': item} passes data along to the function named in callback
- 3. callback is how scrapy.Request() hands its response to the next function; inside the class, reference it as self.method_name
- 4. XHR data is usually JSON; convert it to a dict with json.loads(), and remember to import the json module
import scrapy
import json
# Too cumbersome; the import below is not recommended
# from qqSpider.qqSpider.items import QqspiderItem
# Right-click the qqSpider directory and mark it as Source Root (folder turns blue); then this import works
from items import QqspiderItem


# Goal: scrape the detail pages of Tencent's job postings
class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    # List-page URL: https://careers.tencent.com/tencentcareer/api/post/Query?&pageIndex=1&pageSize=10
    page_one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?&pageIndex={}&pageSize=10'
    # Detail-page URL: https://careers.tencent.com/tencentcareer/api/post/ByPostId?&postId=1298869166773641216
    page_detail_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?&postId={}'
    start_urls = [page_one_url.format(1)]
    # print(start_urls)
    def parse(self, response):
        # The start_urls request for page 1 lands here; parse it directly, then
        # schedule pages 2-10 (re-requesting page 1 would be dropped by Scrapy's
        # duplicate filter, since the start request already used that URL)
        yield from self.parse_one(response)
        for page in range(2, 11):
            url = self.page_one_url.format(page)
            yield scrapy.Request(
                url=url,
                # callback names the function that receives this response
                callback=self.parse_one
            )
    def parse_one(self, response):
        # The response body is JSON, so convert it to a dict
        data = json.loads(response.text)
        # print(data)
        for job in data['Data']['Posts']:
            # print(job)
            # item = {}
            item = QqspiderItem()
            item['job_name'] = job['ProductName']
            item['job_type'] = job['CategoryName']
            post_id = job['PostId']
            # Build the detail-page URL
            detail_url = self.page_detail_url.format(post_id)
            # print(detail_url)
            yield scrapy.Request(
                url=detail_url,
                # meta carries this item over to the callback
                meta={'item': item},
                # the detail-page response goes to parse_two
                callback=self.parse_two
            )
    def parse_two(self, response):
        # Two ways to read what meta carried over
        # item = response.meta['item']
        item = response.meta.get('item')
        # print(item)
        data = json.loads(response.text)
        # print(data)
        item['job_rep'] = data['Data']['Responsibility']
        item['job_req'] = data['Data']['Requirement']
        # print(item)
        # Note: Scrapy does not print items in the order you might expect
        yield item
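Because Scrapy downloads pages concurrently, items reach the pipeline in whatever order responses come back, not in list order; that is the unusual print order the last comment refers to, and it is normal.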
2. settings.py
Note: set LOG_LEVEL, enable ITEM_PIPELINES, and comment out ROBOTSTXT_OBEY
BOT_NAME = 'qqSpider'

SPIDER_MODULES = ['qqSpider.spiders']
NEWSPIDER_MODULE = 'qqSpider.spiders'

# Commented out as noted above (the generated template sets it to True)
# ROBOTSTXT_OBEY = True

# Show only warnings and above, so the scraped output is easy to read
LOG_LEVEL = 'WARNING'

ITEM_PIPELINES = {
    'qqSpider.pipelines.QqspiderPipeline': 300,
}
3. items.py
- items.py declares the item field names used by the spider; if hr.py assigns to a misspelled key, it fails immediately (demonstrated after the code)
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class QqspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # If hr.py assigns to a key that is not declared here, a KeyError is raised at once
    job_name = scrapy.Field()
    job_type = scrapy.Field()
    job_rep = scrapy.Field()
    job_req = scrapy.Field()
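A quick demonstration of that safety net, runnable in a Python shell (the misspelled key job_nmae is deliberate):
from items import QqspiderItem

item = QqspiderItem()
item['job_name'] = 'backend engineer'  # declared field: fine
item['job_nmae'] = 'oops'              # undeclared field: raises
# KeyError: 'QqspiderItem does not support field: job_nmae'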
4. pipelines.py
- Notes:
- 1. You can tell where an item came from with spider.name == ... or isinstance(item, QqspiderItem)
- 2. Don't run right after wiring this up; first check that the spider actually yields its data
- 3. Check that the pipeline is enabled in settings
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from items import QqspiderItem


class QqspiderPipeline:
    def process_item(self, item, spider):
        print('-' * 50)
        # Identify the data source: by item class...
        if isinstance(item, QqspiderItem):
            print('this item comes from QqspiderItem')
        # ...or by spider name
        if spider.name == 'hr':
            print('this item comes from the hr spider')
        print('-' * 50)
        # Notes:
        # 1. Before running, make sure the spider yields its items
        # 2. Make sure the pipeline is enabled in settings
        return item
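The pipeline above only prints. As a sketch of the usual next step (the file name jobs.jsonl and the class name are my own choices, not part of the original project), a second pipeline that persists each item as one JSON line could look like:
import json


class QqspiderSavePipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.f = open('jobs.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # dict(item) works because scrapy.Item behaves like a dict
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.f.close()

To activate it, add 'qqSpider.pipelines.QqspiderSavePipeline': 400 to ITEM_PIPELINES so it runs after the printing pipeline.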
5. start.py
#!/usr/bin/python
# Filename: start.py
# Date : 2020/08/27
# Author : --king--
# Ctrl+Alt+L auto-formats the file (PyCharm)
from scrapy import cmdline

# cmdline.execute(['scrapy', 'crawl', 'hr'])
cmdline.execute('scrapy crawl hr'.split())
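Run the project with python start.py; it is equivalent to typing scrapy crawl hr in a terminal, but lets you launch and debug the spider from inside the IDE.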