[Python Web Scraping Self-Study] (Scrapy Example)----Scraping Tencent Social Recruitment Job Listings

Goal: use the Scrapy framework to scrape job listings from Tencent's social recruitment site.

Inspect the page structure first: the job data sits inside the HTML table markup, so you can see where each field lives and what details it contains.
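If you want to sanity-check the XPath before writing the spider, the Scrapy shell is handy. A minimal sketch (the URL and the row selector below are the same ones the spider uses later; start=0 is simply the first page):

scrapy shell "http://hr.tencent.com/position.php?&start=0"
>>> rows = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
>>> rows[0].xpath("./td[1]/a/text()").extract_first()   # should print the first position name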

To scrape the data with Scrapy and save it to a local file, three files need to be written: items.py, a custom spider file, and pipelines.py, which handles storage. The project skeleton can be generated with Scrapy's own commands, as shown below.
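For reference, the commands below create the project and the spider stub; the project name Tencent and the spider name tencent_spider match the import path and the name attribute used in the code that follows:

scrapy startproject Tencent
cd Tencent
scrapy genspider tencent_spider tencent.com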

items.py: define the fields for the data to be scraped

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # position name
    position_name = scrapy.Field()
    # detail page link
    position_link = scrapy.Field()
    # position category
    position_Type = scrapy.Field()
    # number of openings
    people_Num = scrapy.Field()
    # work location
    work_Location = scrapy.Field()
    # publish date
    publish_Time = scrapy.Field()

tencent_spider.py: define the requests and the parsing logic

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem

class TencentSpiderSpider(scrapy.Spider):
    # spider name used by "scrapy crawl"
    name = 'tencent_spider'
    # domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    url = "http://hr.tencent.com/position.php?&start="
    offset = 0
    # initial request list
    start_urls = [
        url + str(offset)
    ]

    # parse callback
    def parse(self, response):
        # each job posting is a table row with class 'even' or 'odd'
        for each in response.xpath("//tr[@class='even'] | //tr[@class='odd']"):
            # initialize an item for this row
            item = TencentItem()
            # position name
            item['position_name'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail page link
            item['position_link'] = each.xpath("./td[1]/a/@href").extract()[0]
            try:
                # position category; the field is empty for some rows, which raises IndexError
                item['position_Type'] = each.xpath("./td[2]/text()").extract()[0]
            except IndexError:
                item['position_Type'] = ''
            # number of openings
            item['people_Num'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['work_Location'] = each.xpath("./td[4]/text()").extract()[0]
            # publish date
            item['publish_Time'] = each.xpath("./td[5]/text()").extract()[0]

            yield item

        # pagination: after finishing a page, bump the offset and request the next page
        if self.offset < 1000:
            self.offset += 10
            # build the next page URL and reuse this callback for the response
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)

pipelines.py: process and store the items, writing them to tencent.json. Because the data contains Chinese, pass ensure_ascii=False to json.dumps() and an explicit encoding to open().

# -*- coding: utf-8 -*-
import json
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TencentPipeline(object):
    """Save scraped items to tencent.json, one numbered line per item."""

    count = 1

    def open_spider(self, spider):
        # open the output file once when the spider starts
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps the Chinese text readable in the output file
        text = str(self.count) + ":" + json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text)
        self.count += 1
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.filename.close()

In settings.py, uncomment the request header settings and register the pipeline in ITEM_PIPELINES; see the sketch below.
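A minimal sketch of the relevant settings.py entries. The pipeline path follows from the project name Tencent and the TencentPipeline class above; the priority 300 and the header values are just reasonable example choices:

# settings.py (excerpt)
# Send a browser-like User-Agent so the request is not treated as an obvious bot
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
# Register the pipeline so process_item() actually runs
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}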

Run: scrapy crawl tencent_spider

Sample output:

1:{"position_name": "SA-腾讯社交广告金融行业运营经理(微信广告 深圳)", "position_link": "position_detail.php?id=43501&keywords=&tid=0&lid=0", "position_Type": "产品/项目类", "people_Num": "2", "work_Location": "深圳", "publish_Time": "2018-08-18"}
2:{"position_name": "WXG07-113 广州研发部后台安全策略工程师(广州)", "position_link": "position_detail.php?id=43505&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "广州", "publish_Time": "2018-08-18"}
3:{"position_name": "OMG192-渠道运营管理(北京&上海)", "position_link": "position_detail.php?id=43494&keywords=&tid=0&lid=0", "position_Type": "市场类", "people_Num": "1", "work_Location": "上海", "publish_Time": "2018-08-18"}
4:{"position_name": "MIG08-IOS开发工程师(腾讯WiFi管家)", "position_link": "position_detail.php?id=43496&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "2", "work_Location": "广州", "publish_Time": "2018-08-18"}
5:{"position_name": "22851-腾讯视频电视剧内容与剧本评估(北京)", "position_link": "position_detail.php?id=43497&keywords=&tid=0&lid=0", "position_Type": "内容编辑类", "people_Num": "1", "work_Location": "北京", "publish_Time": "2018-08-18"}
6:{"position_name": "17759-产品分析经理(北京)", "position_link": "position_detail.php?id=43500&keywords=&tid=0&lid=0", "position_Type": "市场类", "people_Num": "1", "work_Location": "北京", "publish_Time": "2018-08-18"}
7:{"position_name": "25929-互娱沙盒类手游服务端开发(深圳)", "position_link": "position_detail.php?id=43487&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "2", "work_Location": "深圳", "publish_Time": "2018-08-18"}
8:{"position_name": "25929-沙盒游戏运营WEB开发工程师(深圳)", "position_link": "position_detail.php?id=43489&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "深圳", "publish_Time": "2018-08-18"}
9:{"position_name": "25929-互娱沙盒类手游客户端开发(深圳)", "position_link": "position_detail.php?id=43490&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "4", "work_Location": "深圳", "publish_Time": "2018-08-18"}
10:{"position_name": "25929-web前端开发工程师(上海)", "position_link": "position_detail.php?id=43491&keywords=&tid=0&lid=0", "position_Type": "技术类", "people_Num": "1", "work_Location": "上海", "publish_Time": "2018-08-18"}
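To read the saved file back later, split each line on the first colon to drop the counter prefix and parse the remainder as JSON. A minimal sketch, assuming the file was written by the pipeline above:

import json

with open("tencent.json", encoding="utf-8") as f:
    for line in f:
        seq, payload = line.split(":", 1)   # "N:" counter prefix, then the JSON object
        item = json.loads(payload)
        print(seq, item["position_name"], item["work_Location"])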

 
