Scraping Tencent Job Postings with Scrapy

大大枫free
Categories: Hands-on Crawler Cases, Python Crawlers

Article link: https://blog.csdn.net/qq_37397335/article/details/106722715

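After more than a week of being busy I finally have some breathing room. I had been wanting to write something the whole time without knowing what, and then a friend asked me to help scrape some job postings. So today I took the Scrapy framework for a stroll through Tencent's careers site, and I also tried a multithreaded version for comparison. I have to say, Scrapy is really fast. Enough small talk, on to the good stuff!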

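Goal: from the Tencent job postings, collect the name (job_name), category (job_type), responsibilities (job_duty), requirements (job_require), location (job_address), and time (job_time).

The data must be stored in a MySQL database and in a CSV file.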

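Step 1: Determine the URL and the target

【1】URL: search Baidu for Tencent careers, then browse the job openings
【2】Target: scrape the following information for each position
a> Job name
b> Job location
c> Job category (technical, sales, ...)
d> Publication time
e> Responsibilities
f> Requirements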

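Step 2: Requirements and analysis

【1】Viewing the page source shows that all of the required data is loaded dynamically
【2】Capture the network packets with F12 and analyze them
【3】Data to scrape from the first-level page: postid
【4】Data to scrape from the second-level page: name + location + category + time + responsibilities + requirements
【5】Store the results in MySQL and CSV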

Open Baidu in Chrome and search for the official Tencent careers site: https://careers.tencent.com/
OK, the URL is found.

1. Create the project and the spider file

    scrapy startproject Tencent
    cd Tencent
    scrapy genspider tencent careers.tencent.com

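Enter a job keyword by hand, for example: python
Right-clicking and viewing the page source shows that the source of the Tencent careers home page does not contain the second-level page URLs we want, which also tells us the page is loaded dynamically.

So we capture packets instead: press F12, open Network, and analyze what comes back. From the first-level page, the data we need to grab is: postid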

JSON address of the first-level page:

"""pageIndex changes from page to page; timestamp was not verified"""
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1563912271089&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn
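The list URL above can also be assembled from its parameters instead of being formatted by hand. This is only a minimal sketch: `build_list_url` is a hypothetical helper name, the parameter names are copied from the captured URL, and it assumes the API accepts a fresh millisecond timestamp (the note above says the timestamp was not verified).

```python
import time
from urllib.parse import urlencode

LIST_API = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_list_url(keyword, page_index, page_size=10, timestamp=None):
    """Build the first-level (job list) URL for one page of results."""
    if timestamp is None:
        # Assumption: any current millisecond timestamp is accepted.
        timestamp = int(time.time() * 1000)
    params = {
        "timestamp": timestamp,
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": keyword,        # urlencode percent-encodes non-ASCII keywords
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return LIST_API + "?" + urlencode(params)


print(build_list_url("python", 1, timestamp=1563912271089))
```

Because `urlencode` handles the escaping, the keyword does not need a separate `parse.quote` call in this variant.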

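Next, open a second-level page. Packet capture shows the following.

Second-level page address:

"""postId changes from posting to posting; it can be obtained from the first-level page"""
https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1563912374645&postId={}&language=zh-cn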
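The hand-off from the first-level page to the second-level page can be tried out without any network access. The sketch below uses a made-up, trimmed-down list response (the real one carries many more fields); only the `Data` -> `Posts` -> `PostId` path is taken from the spider code in this article, and `detail_urls` is a hypothetical helper name.

```python
import json

DETAIL_API = ("https://careers.tencent.com/tencentcareer/api/post/ByPostId"
              "?timestamp=1563912374645&postId={}&language=zh-cn")

# Hypothetical sample; the PostId values are invented for illustration.
SAMPLE_LIST_RESPONSE = """
{"Data": {"Count": 2, "Posts": [
    {"PostId": "1001"},
    {"PostId": "1002"}
]}}
"""


def detail_urls(list_response_text):
    """Turn one first-level JSON response into second-level (detail) URLs."""
    data = json.loads(list_response_text)
    # Treat a missing or null Posts list as "no results" instead of failing.
    posts = data["Data"].get("Posts") or []
    return [DETAIL_API.format(post["PostId"]) for post in posts]


for url in detail_urls(SAMPLE_LIST_RESPONSE):
    print(url)
```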

With the analysis done, we can start writing code. My personal habit with Scrapy is to define the data structure to be scraped first, in items.py

import scrapy

class TencentItem(scrapy.Item):
    # name + category + responsibilities + requirements + location + time
    job_name = scrapy.Field()
    job_type = scrapy.Field()
    job_duty = scrapy.Field()
    job_require = scrapy.Field()
    job_address = scrapy.Field()
    job_time = scrapy.Field()
    # URL of the individual posting
    job_url = scrapy.Field()
    post_id = scrapy.Field()

Once the data structure is defined, it is time to write the spider itself, in tencent.py under the spiders directory

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse
import requests
import json
from ..items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['careers.tencent.com']
    # Commonly used values
    one_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1566266592644&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword={}&pageIndex={}&pageSize=10&language=zh-cn&area=cn'
    two_url = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1566266695175&postId={}&language=zh-cn'
    headers = {'User-Agent': 'Mozilla/5.0'}
    keyword = input('Enter a job keyword: ')
    keyword = parse.quote(keyword)

    # Override start_requests()
    def start_requests(self):
        total = self.get_total()
        # Build the URL of every first-level (list) page and hand it to the scheduler
        for index in range(1, total + 1):
            url = self.one_url.format(self.keyword, index)
            yield scrapy.Request(url=url, callback=self.parse_one_page)

    # Get the total number of pages
    def get_total(self):
        url = self.one_url.format(self.keyword, 1)
        html = requests.get(url=url, headers=self.headers).json()
        count = html['Data']['Count']
        total = count//10 if count%10==0 else count//10 + 1

        return total

    def parse_one_page(self, response):
        html = json.loads(response.text)
        for one in html['Data']['Posts']:
            # A detail URL has to be queued next, so create the item first
            item = TencentItem()
            item['post_id'] = one['PostId']
            item['job_url'] = self.two_url.format(item['post_id'])
            # Pass the item along in meta and hand the detail request to the scheduler
            yield scrapy.Request(url=item['job_url'], meta={'item': item}, callback=self.detail_page)

    def detail_page(self, response):
        """Second-level page: parse the job detail data"""
        item = response.meta['item']
        # Convert the response body into Python data types
        html = json.loads(response.text)
        # name + category + responsibilities + requirements + location + time
        item['job_name'] = html['Data']['RecruitPostName']
        item['job_type'] = html['Data']['CategoryName']
        item['job_duty'] = html['Data']['Responsibility']
        item['job_require'] = html['Data']['Requirement']
        item['job_address'] = html['Data']['LocationName']
        item['job_time'] = html['Data']['LastUpdateTime']

        # One complete record is now extracted; no more requests to schedule,
        # so hand the item over to the pipelines
        yield item
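The page-count arithmetic inside get_total() is a ceiling division. A small stand-alone sketch of the same calculation (the name `total_pages` is hypothetical, and the page size of 10 matches the pageSize=10 in the list URL):

```python
def total_pages(count, page_size=10):
    """Number of pages needed to show `count` results, `page_size` per page."""
    # Integer ceiling division: adding page_size - 1 rounds any remainder up.
    return (count + page_size - 1) // page_size


for count in (0, 9, 10, 11, 95):
    print(count, total_pages(count))
```

This gives the same result as the conditional expression used in the spider for every non-negative count; it is just a more compact way to write it.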

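The problems I ran into along the way were debugged with scrapy shell. PS: running into problems is nothing to be afraid of. Debug bit by bit, search Baidu when you are stuck, and there is always a way through.
Since the data goes into a database, the database and table have to be created in advance: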

create database tencentdb charset utf8;
use tencentdb;
create table tencenttab(
job_name varchar(500),
job_type varchar(200),
job_duty varchar(5000),
job_require varchar(5000),
job_address varchar(100),
job_time varchar(100)
)charset=utf8;
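To try the insert logic before pointing it at MySQL, the same six-column table can be recreated in memory. Note that this sketch swaps in Python's built-in sqlite3 module purely as a stand-in, so the placeholder style is `?` rather than the `%s` that PyMySQL uses, and all columns are plain TEXT. The helper names `create_table` and `insert_job` are hypothetical.

```python
import sqlite3

COLUMNS = ("job_name", "job_type", "job_duty",
           "job_require", "job_address", "job_time")


def create_table(conn):
    """Create a simplified copy of tencenttab (all columns as TEXT)."""
    column_sql = ", ".join(f"{name} TEXT" for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS tencenttab ({column_sql})")


def insert_job(conn, job):
    """Insert one job record using a parameterized statement."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO tencenttab ({', '.join(COLUMNS)}) VALUES ({placeholders})"
    # Values are passed separately from the SQL text, so quotes in the
    # job description cannot break the statement.
    conn.execute(sql, [job[name] for name in COLUMNS])
    conn.commit()


conn = sqlite3.connect(":memory:")
create_table(conn)
# Invented sample record, for illustration only.
insert_job(conn, {
    "job_name": "Example engineer",
    "job_type": "Technology",
    "job_duty": "It's an example duty",
    "job_require": "Example requirement",
    "job_address": "Shenzhen",
    "job_time": "2020-06-12",
})
print(conn.execute("SELECT job_name, job_address FROM tencenttab").fetchall())
```

Naming the columns explicitly in the INSERT keeps the statement working even if the table later gains extra columns.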

With the database and table created, it is time to get to work in the pipeline file, pipelines.py!

class TencentPipeline(object):
    def process_item(self, item, spider):
        return item

import pymysql
import csv
from .settings import *

class TencentMysqlPipeline(object):
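    def open_spider(self, spider):
        """Connect to the database once, when the spider starts"""
        # Keyword arguments are used because newer PyMySQL releases
        # no longer accept positional connection arguments
        self.db = pymysql.connect(host=MYSQL_HOST, user=MYSQL_USER, password=MYSQL_PWD,
                                  database=MYSQL_DB, charset=CHARSET)
        self.cursor = self.db.cursor()
        # newline='' prevents blank lines between rows in the CSV on Windows
        self.f = open('tencent.csv', 'w', newline='', encoding='utf-8')
        self.writer = csv.writer(self.f)
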

    def process_item(self,item,spider):
        ins='insert into tencenttab values(%s,%s,%s,%s,%s,%s)'
        job_li = [
            item['job_name'],
            item['job_type'],
            item['job_duty'],
            item['job_require'],
            item['job_address'],
            item['job_time']
        ]
        self.cursor.execute(ins,job_li)
        self.db.commit()
        self.writer.writerrow(job_li)

        return item

    def close_spider(self, spider):
        """Disconnect from the database once, when the spider finishes"""
        self.cursor.close()
        self.db.close()
        self.f.close()
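The CSV half of the pipeline can be checked on its own. The sketch below writes to an in-memory buffer instead of a file and adds a header row, which the pipeline above does not write; `rows_to_csv` is a hypothetical helper and the sample record is invented. It is meant to show that csv.writer takes care of quoting when a field such as job_duty contains commas or line breaks.

```python
import csv
import io

FIELDS = ["job_name", "job_type", "job_duty",
          "job_require", "job_address", "job_time"]


def rows_to_csv(items):
    """Serialize item-like dicts to CSV text, header row first."""
    buffer = io.StringIO(newline="")   # let the csv module control line endings
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    for item in items:
        writer.writerow([item[field] for field in FIELDS])
    return buffer.getvalue()


sample = {
    "job_name": "Example engineer",
    "job_type": "Technology",
    "job_duty": "Line 1\nLine 2, with a comma",
    "job_require": "Example requirement",
    "job_address": "Shenzhen",
    "job_time": "2020-06-12",
}
print(rows_to_csv([sample]))
```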

The second-to-last step is to configure settings.py

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 0.5
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0',
}
ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 300,
   'Tencent.pipelines.TencentMysqlPipeline': 301,
}
# MySQL-related variables
MYSQL_HOST = 'localhost'
MYSQL_USER = 'xxxx'  # your database user name
MYSQL_PWD = 'xxxxx'  # your MySQL password
MYSQL_DB = 'tencentdb'
CHARSET = 'utf8'

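In a terminal, go into Tencent -> spiders under the Tencent project folder, type scrapy crawl tencent, and press Enter. That's it; the scraped data shows up in the console.
Alternatively, create run.py in the same directory as scrapy.cfg, put the following code in it, and run run.py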

from scrapy import cmdline

cmdline.execute('scrapy crawl tencent'.split())

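Scrapy also provides a simpler way to save the extracted data to CSV and JSON files

【1】Save to a CSV file
    scrapy crawl xxxx -o xxxx.csv
 
【2】Save to a JSON file
    scrapy crawl xxxx -o xxxx.json

【3】Note: set the export encoding in settings.py (this mainly matters for JSON files)
    FEED_EXPORT_ENCODING = 'utf-8'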
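Why the encoding setting matters mostly for JSON can be seen with the standard json module, used here only as an analogy for what the feed exporter does: without an explicit encoding, non-ASCII characters are written as \uXXXX escapes, which is still valid JSON but hard to read. The sample values are invented.

```python
import json

item = {"job_name": "后台开发工程师", "job_address": "深圳"}

escaped = json.dumps(item)                       # non-ASCII becomes \uXXXX
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)
```

Both strings decode to the same data; only the readability of the file differs.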

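Summary: writing a crawler is sometimes quite simple. Be clear about the requirements first, lay out the approach, then follow it step by step through trial and error, solving problems as they come up.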
