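Goal: use the Scrapy framework to crawl job postings from Tencent's social recruitment site.
Inspecting the page structure shows that the job data sits directly in the HTML tags, where you can also see where each field is located and what it contains.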
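To make the row layout concrete, here is a minimal sketch that pulls the same fields out of a small table. The HTML fragment is hand-written and only an assumption inferred from the XPath expressions the spider uses; it is not a copy of the real page. It also uses the standard library's `xml.etree.ElementTree` instead of Scrapy selectors so that it runs on its own, and `parse_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment (an assumption): one job per <tr class="even|odd">,
# with name/link, category, headcount, location and date in the <td> cells.
SAMPLE = """
<table>
  <tr class="h"><td>Position</td><td>Category</td><td>Headcount</td><td>Location</td><td>Published</td></tr>
  <tr class="even">
    <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
    <td>Tech</td><td>2</td><td>Shenzhen</td><td>2018-08-18</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Product Manager</a></td>
    <td></td><td>1</td><td>Beijing</td><td>2018-08-18</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row, skipping rows that are not 'even'/'odd'."""
    root = ET.fromstring(html)
    rows = []
    for tr in root.iter("tr"):
        if tr.get("class") not in ("even", "odd"):
            continue
        cells = tr.findall("td")
        link = cells[0].find("a")
        rows.append({
            "position_name": link.text,
            "position_link": link.get("href"),
            # An empty cell has text None, so fall back to an empty string.
            "position_Type": cells[1].text or "",
            "people_Num": cells[2].text,
            "work_Location": cells[3].text,
            "publish_Time": cells[4].text,
        })
    return rows


rows = parse_rows(SAMPLE)
```

The second sample row has an empty category cell on purpose, since that is the case the spider has to guard against.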
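Use Scrapy to crawl the data and store it in a local file. Three files need to be written: items.py, a custom spider file, and pipelines.py, which handles data storage.
items.py: defines the fields of the scraped data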
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Position name
    position_name = scrapy.Field()
    # Link to the detail page
    position_link = scrapy.Field()
    # Position category
    position_Type = scrapy.Field()
    # Number of openings
    people_Num = scrapy.Field()
    # Work location
    work_Location = scrapy.Field()
    # Publish date
    publish_Time = scrapy.Field()
tencent_spider.py: defines the requests and the parsing logic
# -*- coding: utf-8 -*-
import scrapy

from Tencent.items import TencentItem


class TencentSpiderSpider(scrapy.Spider):
    # Spider name
    name = 'tencent_spider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    # Initial request list
    start_urls = [
        url + str(offset)
    ]

    # Parse callback
    def parse(self, response):
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # Create the item object
            item = TencentItem()
            # Position name
            item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
            # Detail link
            item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
            try:
                # Position category; some rows leave this cell empty,
                # which raises an IndexError
                item['position_Type'] = each.xpath("./td[2]/text()").extract()[0]
            except IndexError:
                item['position_Type'] = ''
            # Number of openings
            item['people_Num'] = each.xpath("./td[3]/text()").extract()[0]
            # Work location
            item['work_Location'] = each.xpath("./td[4]/text()").extract()[0]
            # Publish date
            item['publish_Time'] = each.xpath("./td[5]/text()").extract()[0]
            yield item
        # Once a page has been parsed, advance offset to move to the next page
        if self.offset < 1000:
            self.offset += 10
            # Build the next request URL; parse is the callback that handles its response
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
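The pagination logic can be checked without running a crawl. The sketch below reproduces the sequence of listing URLs that the offset loop produces and shows one way to turn the relative `position_link` values into absolute URLs. `page_urls` and `absolute_link` are hypothetical helper names, and resolving the links is an optional extra that the spider above does not do.

```python
from urllib.parse import urljoin

BASE_URL = "http://hr.tencent.com/position.php?&start="


def page_urls(last_offset=1000, step=10):
    """Yield the listing URLs the spider visits: start=0, 10, ..., last_offset."""
    for offset in range(0, last_offset + step, step):
        yield BASE_URL + str(offset)


def absolute_link(page_url, href):
    """Resolve a relative detail link against the page it was found on."""
    return urljoin(page_url, href)


urls = list(page_urls())
detail = absolute_link(urls[0], "position_detail.php?id=43501&keywords=&tid=0&lid=0")
```

Inside a Scrapy callback, `response.urljoin(href)` does the same resolution against the URL of the current response.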
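pipelines.py: serializes the items and stores them in tencent.json. Because the data contains Chinese text, open() must be given an explicit encoding and json.dumps() must be called with ensure_ascii=False.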
# -*- coding: utf-8 -*-
import json

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TencentPipeline(object):
    """
    Purpose: save item data
    """
    # Running line number written in front of each record
    count = 1

    def open_spider(self, spider):
        # Despite its name, self.filename holds the open file object
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        text = str(self.count) + ":" + json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        self.count += 1
        return item

    def close_spider(self, spider):
        self.filename.close()
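The effect of `ensure_ascii=False` is easy to see in isolation. This small sketch serializes the same record both ways; the record itself is made up for illustration.

```python
import json

# Illustrative record (not taken from a real crawl)
record = {"work_Location": "深圳"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)

# With ensure_ascii=False the Chinese text is kept as-is, so the output
# file must be opened with an encoding that can represent it (UTF-8 here)
readable = json.dumps(record, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same dictionary; the difference only matters for how readable the stored file is.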
In settings.py, uncomment the request header settings and the ITEM_PIPELINES setting.
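As a rough guide, the uncommented part of settings.py might look like the fragment below. The header values and the priority 300 are the defaults that Scrapy's project template generates, and the pipeline path assumes the project package is named `Tencent`, as the spider's import suggests; adjust both to your own project.

```python
# settings.py (fragment; values are template defaults, shown as an assumption)

# Default headers sent with every request
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Enable the pipeline; lower numbers run earlier when several are configured
ITEM_PIPELINES = {
    "Tencent.pipelines.TencentPipeline": 300,
}
```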
Run: scrapy crawl tencent_spider
Sample output:
1:{"position_name": "SA-腾讯社交广告金融行业运营经理(微信广告 深圳)", "position_link": "position_detail.php?id=43501&keywords=&tid=0&lid=0", "position_Type": "产品/项目类", "people_Num": "2", "work_Location": "深圳", "publish_Time": "2018-08-18"}
2:{"position_name": "WXG07-113 广州研发部后台安全策略工程师(广州)", "position_link": "position_detail.php?id=43505&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "广州", "publish_Time": "2018-08-18"}
3:{"position_name": "OMG192-渠道运营管理(北京&上海)", "position_link": "position_detail.php?id=43494&keywords=&tid=0&lid=0", "position_Type": "市场类", "people_Num": "1", "work_Location": "上海", "publish_Time": "2018-08-18"}
4:{"position_name": "MIG08-IOS开发工程师(腾讯WiFi管家)", "position_link": "position_detail.php?id=43496&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "2", "work_Location": "广州", "publish_Time": "2018-08-18"}
5:{"position_name": "22851-腾讯视频电视剧内容与剧本评估(北京)", "position_link": "position_detail.php?id=43497&keywords=&tid=0&lid=0", "position_Type": "内容编辑类", "people_Num": "1", "work_Location": "北京", "publish_Time": "2018-08-18"}
6:{"position_name": "17759-产品分析经理(北京)", "position_link": "position_detail.php?id=43500&keywords=&tid=0&lid=0", "position_Type": "市场类", "people_Num": "1", "work_Location": "北京", "publish_Time": "2018-08-18"}
7:{"position_name": "25929-互娱沙盒类手游服务端开发(深圳)", "position_link": "position_detail.php?id=43487&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "2", "work_Location": "深圳", "publish_Time": "2018-08-18"}
8:{"position_name": "25929-沙盒游戏运营WEB开发工程师(深圳)", "position_link": "position_detail.php?id=43489&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "深圳", "publish_Time": "2018-08-18"}
9:{"position_name": "25929-互娱沙盒类手游客户端开发(深圳)", "position_link": "position_detail.php?id=43490&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "4", "work_Location": "深圳", "publish_Time": "2018-08-18"}
10:{"position_name": "25929-web前端开发工程师(上海)", "position_link": "position_detail.php?id=43491&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "上海", "publish_Time": "2018-08-18"}
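Note that because each line starts with a counter and a colon, tencent.json is not a valid JSON document as a whole. A minimal sketch for reading it back, assuming the line format shown above, is to split off the counter and decode the rest. `parse_line` and `load_records` are hypothetical helper names.

```python
import json


def parse_line(line):
    """Split 'N:{...}' into the line number and the decoded record."""
    number, payload = line.split(":", 1)
    return int(number), json.loads(payload)


def load_records(path="tencent.json"):
    """Read every non-empty line of the output file written by the pipeline."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


sample = '10:{"position_name": "25929-web前端开发工程师(上海)", "people_Num": "1"}\n'
number, record = parse_line(sample)
```

Splitting only on the first colon matters, because the JSON payload itself contains colons.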