Scraping Tencent Careers job postings with Python (requests and Scrapy versions), saved to JSON

Latest recommended article published 2023-11-22 13:19:58

俞泰鑫


Categories: spider, #python, #multithreading. Tags: python, spider

Article link: https://blog.csdn.net/god_yutaixin/article/details/103166185


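Requirements

Scrape each job's title, responsibilities, requirements, publish time, and location.

Workflow

First-level page (the list of jobs returned by a search). Goal: extract the links to the second-level pages.
1. The first-level page links to the second-level pages (the individual job postings). Check whether it is static or dynamic by searching the page source for a word that is visible on the rendered page. The word is not in the source, so the page is dynamic and its data arrives as JSON.
2. Capture the traffic: in the browser's developer tools open Network, use the Preview tab to find the request that carries the first-level page data, then read the backend interface URL from Headers --> General --> Request URL.
3. Analyze the interface and its Query String Parameters to find how the URL changes from page to page.
  1. Interface URL
  https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1574243208260&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
  2. Query String Parameters
  pageIndex: 1
  keyword: python
  pageSize: 10
  3. Variables found
  pageIndex: starts at 1 for the first page and increases by 1 per page
  keyword: the keyword you searched for
  4. Rewrite the first-level interface URL as a template
  https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1574243208260&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn
4. Open the first-level JSON interface in the address bar and locate the second-level page links in it. The response is JSON, so read the values by key rather than with XPath (the code in this article reads PostId).
  http://careers.tencent.com/jobdesc.html?postId=1123175129107402752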
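As an alternative to filling the `{}` placeholders by hand, the query string can be assembled from a dict, which also takes care of percent-encoding a Chinese keyword. This is only a sketch: the helper name `build_list_url` is made up for illustration, and treating `timestamp` as the current time in milliseconds is an assumption based on how the captured value looks, not something the interface documents.

```python
import time
from urllib.parse import urlencode

# Endpoint taken from the captured request above.
LIST_API = 'https://careers.tencent.com/tencentcareer/api/post/Query'


def build_list_url(keyword, page_index, page_size=10):
    """Hypothetical helper: build one first-level (list) URL."""
    params = {
        # Assumption: the captured value looks like milliseconds since the epoch.
        'timestamp': int(time.time() * 1000),
        # Empty filters are kept so the URL matches the captured one.
        'countryId': '', 'cityId': '', 'bgIds': '', 'productId': '',
        'categoryId': '', 'parentCategoryId': '', 'attrId': '',
        'keyword': keyword,          # urlencode percent-encodes non-ASCII text
        'pageIndex': page_index,     # starts at 1
        'pageSize': page_size,
        'language': 'zh-cn',
        'area': 'cn',
    }
    return LIST_API + '?' + urlencode(params)


if __name__ == '__main__':
    print(build_list_url('python', 1))
```

With this approach the keyword does not need a separate `parse.quote` call, because `urlencode` encodes every value itself.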
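Second-level page. Goal: extract the job title, responsibilities, and requirements (plus location and update time).
1. This page is dynamic as well.
2. Capture the traffic again and find the request that carries the required data.
3. Analyze the interface and its Query String Parameters to find the URL pattern.
  1. Interface URL
    https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1574243615482&postId=1176675067816316928&language=zh-cn
  2. Query String Parameters
    postId: 1176675067816316928 (each job has its own postId)
  3. The only variable is postId, which can be extracted from the first-level JSON.
  4. Rewrite the second-level interface URL as a template
    https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1574243615482&postId={}&language=zh-cn
4. Open that JSON response and find the keys for the required fields:
  1. RecruitPostName: job title
  2. Responsibility: responsibilities
  3. Requirement: requirements
  4. LocationName: location
  5. LastUpdateTime: update time
5. Preprocess the data and hand it to the pipeline to be saved.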
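The key lookup in step 4 can be tried offline before any request is sent. The sketch below uses a made-up, heavily trimmed payload (`SAMPLE`) that only mimics the `Data` object described above; real responses carry more keys and the values here are invented. `FIELD_MAP` and `extract_job` are illustrative names, not part of the original code.

```python
import json

# Hypothetical sample: shaped like the detail interface, values invented.
SAMPLE = '''
{"Data": {"RecruitPostName": "Python Engineer",
          "Responsibility": "Build crawlers",
          "Requirement": "Knows requests and Scrapy",
          "LocationName": "Shenzhen",
          "LastUpdateTime": "2019-11-20"}}
'''

# Output field name -> key in the JSON "Data" object.
FIELD_MAP = {
    'name': 'RecruitPostName',
    'duty': 'Responsibility',
    'requ': 'Requirement',
    'city': 'LocationName',
    'time': 'LastUpdateTime',
}


def extract_job(text):
    """Parse one detail response and keep only the required fields."""
    data = json.loads(text).get('Data') or {}
    # .get() returns None for a missing key instead of raising KeyError.
    return {field: data.get(key) for field, key in FIELD_MAP.items()}


if __name__ == '__main__':
    print(extract_job(SAMPLE))
```

Using `.get()` means a posting with a missing key produces a `None` value rather than aborting the whole run, which is one possible preprocessing choice for step 5.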
About queues
The first-level and second-level pages each get their own queue for holding URLs.
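The two-queue idea can be sketched without any network access. In the sketch below a plain dict stands in for the first-level responses, and `run_two_stage` is a hypothetical name; stage-one threads drain the first queue and feed the second, while stage-two threads wait on the second queue with a timeout and stop once nothing new arrives.

```python
from queue import Queue, Empty
from threading import Thread, Lock


def run_two_stage(list_pages, workers=3, wait=0.5):
    """list_pages maps a first-level URL to the post ids found on that page."""
    one_q, two_q = Queue(), Queue()
    results, lock = [], Lock()
    for url in list_pages:
        one_q.put(url)

    def stage_one():
        while True:
            try:
                url = one_q.get(block=False)   # never blocks on an empty queue
            except Empty:
                return
            # Stands in for "request the list page and parse its JSON".
            for post_id in list_pages[url]:
                two_q.put('detail?postId={}'.format(post_id))

    def stage_two():
        while True:
            try:
                url = two_q.get(timeout=wait)  # wait briefly for stage one
            except Empty:
                return
            with lock:                         # results is shared between threads
                results.append(url)

    threads = [Thread(target=stage_one) for _ in range(workers)]
    threads += [Thread(target=stage_two) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == '__main__':
    print(sorted(run_two_stage({'page1': [1, 2], 'page2': [3]})))
```

The timeout is a simple stop condition, not a guarantee: if a first-level request took longer than the wait, stage-two threads could exit early.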

requests version

import json
import time
from queue import Queue, Empty
from threading import Thread, Lock
from urllib import parse

import requests
from fake_useragent import UserAgent  # third-party package

class TencentJobSpider:
    def __init__(self):
        self.one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1574243208260&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
        self.two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1574243615482&postId={}&language=zh-cn'
        # Queue of first-level (list) page URLs
        self.one_q = Queue()
        # Queue of second-level (detail) page URLs
        self.two_q = Queue()
        self.lock = Lock()

        # Number of jobs scraped
        self.i = 0

        # Output file. Mode 'w' keeps the file a single valid JSON array;
        # appending a second array on a later run would make it invalid JSON.
        self.f = open('tencent.json', 'w', encoding='utf-8')
        self.item_list = []

    # Helper: request a URL and return the response body as text
    # (HTML for a static page, a JSON string for these interfaces)
    def get_html(self, url):
        headers = {'User-Agent': UserAgent().random}
        html = requests.get(url=url, headers=headers, timeout=10).text
        return html

    # Build the first-level URLs (from keyword and page index) and put them in the queue
    def url_in(self):
        keyword = input('Enter a job category: ')
        keyword = parse.quote(keyword)

        # Total number of list pages for this keyword, e.g. for Python jobs
        total = self.get_total(keyword)

        for index in range(1, total + 1):
            one_url = self.one_url.format(keyword, index)
            # Enqueue the URL
            self.one_q.put(one_url)

    # Work out how many list pages a keyword has
    def get_total(self, keyword):
        # The JSON holds the total number of matching jobs (Count),
        # and the query string fixes the page size at 10
        url = self.one_url.format(keyword, 1)
        html = json.loads(self.get_html(url))
        count = int(html['Data']['Count'])
        if count % 10 == 0:
            total = count // 10
        else:
            total = count // 10 + 1
        return total

    # Thread target for first-level pages: extract each PostId,
    # build the second-level URL, and put it in self.two_q
    def parse_one_page(self):
        while True:
            # A non-blocking get avoids the race where another thread takes
            # the last URL between an empty() check and a blocking get()
            try:
                one_url = self.one_q.get(block=False)
            except Empty:
                # The first-level queue is drained, so this thread is done
                break
            html = json.loads(self.get_html(one_url))
            # The path Data -> Posts -> PostId follows the interface's JSON layout
            for job in html['Data']['Posts']:
                post_id = job['PostId']
                two_url = self.two_url.format(post_id)
                # Second-level threads can only work once URLs arrive here
                self.two_q.put(two_url)

    # Thread target for second-level pages: scrape title, responsibilities,
    # requirements, time, and location
    def parse_two_page(self):
        while True:
            # get() blocks by default; with a timeout it raises queue.Empty
            # after 3 seconds without a new URL, which ends this thread
            try:
                two_url = self.two_q.get(timeout=3)
            except Empty:
                break
            html = json.loads(self.get_html(two_url))
            # Keys confirmed by inspecting the interface's JSON in the browser
            item = {}
            item['name'] = html['Data']['RecruitPostName']
            item['city'] = html['Data']['LocationName']
            item['duty'] = html['Data']['Responsibility']
            item['requ'] = html['Data']['Requirement']
            item['time'] = html['Data']['LastUpdateTime']
            print(item)

            # The counter and the list are shared, so update them under the lock
            with self.lock:
                self.i += 1
                # Collect the item for the final JSON dump
                self.item_list.append(item)

    def run(self):
        # Enqueue the first-level URLs first
        self.url_in()

        one_list = []
        two_list = []
        for i in range(3):
            t = Thread(target=self.parse_one_page)
            one_list.append(t)
            t.start()
        for i in range(5):
            t = Thread(target=self.parse_two_page)
            two_list.append(t)
            t.start()

        # Wait for every thread to finish
        for one in one_list:
            one.join()
        for two in two_list:
            two.join()

        print('Count:', self.i)

        # Write everything to the JSON file
        json.dump(self.item_list, self.f, ensure_ascii=False)
        self.f.close()

if __name__ == '__main__':
    begin = time.time()
    spider = TencentJobSpider()
    spider.run()
    end = time.time()
    print('Elapsed time: %.2f' % (end - begin))
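The if/else in `get_total` is a ceiling division: the number of pages is the job count divided by the page size, rounded up. A compact way to write the same arithmetic is sketched below; `total_pages` is an illustrative name, and the page size of 10 comes from the `pageSize` parameter seen in the captured request.

```python
def total_pages(count, page_size=10):
    """Number of pages needed to show `count` jobs, `page_size` per page."""
    # Integer ceiling division: adding page_size - 1 rounds any remainder up.
    return (count + page_size - 1) // page_size


if __name__ == '__main__':
    for count in (0, 9, 10, 11, 95):
        print(count, total_pages(count))
```

Staying in integer arithmetic avoids float rounding, which only matters for very large counts but costs nothing here.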

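Scrapy version

In items.py, define the structure of the scraped data: name, category, responsibilities, requirements, location, and time.

import scrapy

class TencentItem(scrapy.Item):
    job_name = scrapy.Field()
    job_type = scrapy.Field()
    job_duty = scrapy.Field()
    job_require = scrapy.Field()
    job_address = scrapy.Field()
    job_time = scrapy.Field()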

The spider file, Tencent.py

import scrapy
from urllib import parse    # used to percent-encode the query string
import requests
import json
from ..items import TencentItem

class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # Job list (first-level) URL
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1574243208260&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    # Job detail (second-level) URL
    two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1574243615482&postId={}&language=zh-cn'
    # The query string must be encoded because it cannot contain raw Chinese characters
    keyword = input('Enter a job: ')
    keyword = parse.quote(keyword)

    # 1. Build the first-level URLs and hand them to the scheduler
    def start_requests(self):
        # Total number of list pages for this job keyword
        total = self.get_total()
        for index in range(1, total + 1):
            url = self.one_url.format(self.keyword, index)
            # Hand the request to the scheduler's queue
            yield scrapy.Request(url=url,
                                 callback=self.parse    # parser to call when the response arrives
                                 )

    # 1. Helper for the first-level URLs: the job count is in the JSON response
    def get_total(self):
        # Request the first list page for this keyword
        url = self.one_url.format(self.keyword, 1)
        # Send the request and decode the JSON response (fill in a real User-Agent)
        html = requests.get(url=url, headers={'User-Agent': ''}).json()
        count = int(html['Data']['Count'])
        if count % 10 == 0:
            total = count // 10
        else:
            total = count // 10 + 1
        return total

    # 2. Parse a first-level page: extract each PostId (needed by the second-level page),
    #    build the second-level URL, and hand it to the scheduler
    def parse(self, response):
        # response.text is a JSON string; convert it to Python objects
        html = json.loads(response.text)
        # Extract the PostId values; one page holds 10 jobs
        for job in html['Data']['Posts']:
            post_id = job['PostId']
            url = self.two_url.format(post_id)
            # Hand the second-level link to the scheduler to fetch
            yield scrapy.Request(url=url,
                                 callback=self.parse_two_page
                                 )

    # 2. Parse a second-level response and pass the cleaned data to the pipeline
    def parse_two_page(self, response):    # each response is one job posting
        # Decode the JSON data into Python objects
        html = json.loads(response.text)
        item = TencentItem()
        item['job_name'] = html['Data']['RecruitPostName']
        item['job_type'] = html['Data']['CategoryName']
        item['job_duty'] = html['Data']['Responsibility']
        item['job_require'] = html['Data']['Requirement']
        item['job_address'] = html['Data']['LocationName']
        item['job_time'] = html['Data']['LastUpdateTime']

        # Hand the item to the pipeline
        yield item

The pipeline file (pipelines.py)

class TencentPipeline(object):
    def process_item(self, item, spider):
        # Only prints each item; saving is left to the feed export in run.py
        print(dict(item))
        return item
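The pipeline above only prints each item. If the goal is a JSON file written by the pipeline itself, one possible sketch is shown below. The class name `TencentJsonPipeline` and the output file name are made up for illustration; `open_spider`, `process_item`, and `close_spider` are the standard Scrapy pipeline hooks.

```python
import json


class TencentJsonPipeline:
    """Hypothetical pipeline: collect items and write one JSON array at the end."""

    def __init__(self, path='tencent_job.json'):
        self.path = path      # assumed output file name
        self.items = []

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.items = []

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: dump everything in one go.
        with open(self.path, 'w', encoding='utf-8') as f:
            json.dump(self.items, f, ensure_ascii=False, indent=2)
```

To try it, the class would have to be registered in ITEM_PIPELINES next to or instead of TencentPipeline, for example as 'Tencent.pipelines.TencentJsonPipeline', assuming it lives in the same pipelines.py.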

settings.py

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
		'Accept':'...',
		'Accept-Language':'...',
		'User-Agent':'...'
		}

# Enable the item pipeline that receives the scraped data
ITEM_PIPELINES = {'Tencent.pipelines.TencentPipeline':300,}

Create run.py in the project's root directory

from scrapy import cmdline

cmdline.execute('scrapy crawl tencent -o tencent_job.csv'.split())    # persist the data as a CSV file
