Scraping the Tencent Careers site with Scrapy

佳辰辰辰辰

Published 2023-11-12 10:20:02


Tags: scrapy

Article link: https://blog.csdn.net/2301_78179333/article/details/134357220


Scraping the listing page
Scraping the detail page
Saving the data with MySQL

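Scraping the listing page

Our first goal is to scrape the listing (first-level) page of the Tencent Careers site (careers.tencent.com).
From it we want each job's title, its link, and the keywords shown under the title.
Open PyCharm and run the following commands:
1. Create the project
scrapy startproject <project name>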

2. Define the target. Our target is the three fields mentioned above.
Model them in items.py.
The code is as follows:

import scrapy


class TencentItem(scrapy.Item):
    # Data model for the listing (first-level) page
    title = scrapy.Field()
    link = scrapy.Field()
    detail = scrapy.Field()

3. Create the spider
scrapy genspider <spider name> <allowed domain>
In my case: scrapy genspider tencent careers.tencent.com
Then we fill in the generated spider file.

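4. Analyze the site
The site loads its data asynchronously. You can tell by turning the page and watching the browser's reload indicator at the top left: if it does not spin, the data is loaded asynchronously; if it does, the page is loaded synchronously.

Right-click and choose Inspect -> Network -> Fetch/XHR, then find the request whose response contains the three fields we need. You can paste the response into an online JSON viewer to inspect it.

Once you have found the request, copy its URL: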
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1699753777180&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=1&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

Yes, it is rather long.
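A long URL like this is easier to work with when it is assembled from its parameters. The sketch below rebuilds it with the standard library. The function name `build_list_url` is my own, and the idea that changing `pageIndex` returns the next page of results is an assumption based on the parameter name, not something verified here.

```python
from urllib.parse import urlencode

LIST_API = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_list_url(page_index, page_size=10, timestamp=1699753777180):
    """Rebuild the listing URL; build_list_url is a hypothetical helper name."""
    params = {
        "timestamp": timestamp,
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "40001001,40001002,40001003,40001004,40001005,40001006",
        "parentCategoryId": "",
        "attrId": 1,
        "keyword": "",
        "pageIndex": page_index,   # assumed to select the page of results
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    # safe="," keeps the commas in categoryId unescaped, as in the captured URL
    return f"{LIST_API}?{urlencode(params, safe=',')}"


first_page = build_list_url(1)
second_page = build_list_url(2)
```

With a helper like this, `start_urls` could be a list comprehension over the page numbers you want instead of one hard-coded string.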

In Python, parse the response with jsonpath:

    def parse(self, response):
        data = response.json()
        title = jsonpath(data, '$..RecruitPostName')
        link = jsonpath(data, '$..PostURL')
        detail = [jsonpath(data, '$..BGName'), jsonpath(data, '$..CategoryName'),
                  jsonpath(data, '$..RequireWorkYearsName'),
                  jsonpath(data, '$..LastUpdateTime')]
        # zip(*detail) unpacks the four lists and groups the values at the
        # same position into one tuple per posting
        zip_detail = list(zip(*detail))

        # zip_detail is now a list of tuples, one tuple per posting
        for titles, links, details in zip(title, link, zip_detail):
            item = TencentItem()
            item['title'] = titles
            item['link'] = links
            # Join the details tuple into a single string
            myDetails = ' '.join(details)
            item['detail'] = myDetails
            # Hand the item to the engine; a function containing yield is a generator
            yield item
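To make the `zip(*detail)` step concrete, here is a small standalone sketch. The values are made up and only stand in for the four lists that jsonpath returns.

```python
# Hypothetical sample values, one entry per job posting
bg_names = ["CSIG", "IEG"]
categories = ["Technology", "Design"]
work_years = ["3+ years", "5+ years"]
updated = ["2023-11-10", "2023-11-11"]

detail = [bg_names, categories, work_years, updated]

# zip(*detail) is the same as zip(bg_names, categories, work_years, updated):
# it turns four "column" lists into one tuple per posting
zip_detail = list(zip(*detail))

# Join each tuple into the single string stored in item['detail']
detail_strings = [" ".join(row) for row in zip_detail]
```

One thing to keep in mind: zip stops at the shortest input, so if a field were missing from one posting the lists would be different lengths and the columns could drift out of alignment.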

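That completes the parsing of the listing page. Next we follow each listing entry to its detail page and scrape the job responsibilities and job requirements.

Scraping the detail page

First, add the new fields in items.py: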

class TencentItem(scrapy.Item):
    # Data model for the listing (first-level) page
    title = scrapy.Field()
    link = scrapy.Field()
    detail = scrapy.Field()

    # Detail (second-level) page: job responsibilities and job requirements
    responsibility = scrapy.Field()
    requirements = scrapy.Field()

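Next, analyze the site again. The job responsibilities and requirements also have to be parsed out of a JSON response, so follow the same steps as before to find the corresponding URL.
It turns out to be the following request: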

https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1699754448671&postId=1604656191269511168&language=zh-cn
Looking at this URL, the response is the same without the timestamp and language parameters, so postId is the only part we need to change dynamically.
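As a small sketch of that observation, the detail URL can be built from the postId alone. The helper name `build_detail_url` is hypothetical.

```python
from urllib.parse import urlencode

DETAIL_API = "https://careers.tencent.com/tencentcareer/api/post/ByPostId"


def build_detail_url(post_id):
    """Build the detail-page API URL from a postId (hypothetical helper)."""
    # urlencode converts the value to text, so a str or an int both work
    return f"{DETAIL_API}?{urlencode({'postId': post_id})}"


detail_url = build_detail_url("1604656191269511168")
```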

Now parse it in the spider file.
The complete code is as follows:

import scrapy
from jsonpath import jsonpath
from exam.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = "tencent"
    allowed_domains = ["careers.tencent.com"]
    start_urls = ["https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1699698376347&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=1&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn"]

    # Parse the listing page
    def parse(self, response):
        data = response.json()
        title = jsonpath(data, '$..RecruitPostName')
        link = jsonpath(data, '$..PostURL')
        detail = [jsonpath(data, '$..BGName'), jsonpath(data, '$..CategoryName'),
                  jsonpath(data, '$..RequireWorkYearsName'),
                  jsonpath(data, '$..LastUpdateTime')]
        post_ids = jsonpath(data, '$..PostId')

        # Fixed part of the URL that returns the detail data as JSON
        baseUrl = 'https://careers.tencent.com/tencentcareer/api/post/ByPostId?postId='

        # zip(*detail) unpacks the four lists and groups the values at the
        # same position into one tuple per posting
        zip_detail = list(zip(*detail))

        # zip_detail is now a list of tuples, one tuple per posting
        for titles, links, details, post_id in zip(title, link, zip_detail, post_ids):
            item = TencentItem()
            item['title'] = titles
            item['link'] = links
            # Join the details tuple into a single string
            myDetails = ' '.join(details)
            item['detail'] = myDetails

            # Do not yield the item yet. To request the detail page, wrap its URL
            # in a scrapy.Request (capital R), name the callback that will parse
            # the response, and pass the half-filled item along in meta.
            # str() guards against the id arriving as a number.
            yield scrapy.Request(url=baseUrl + str(post_id), callback=self.parseData, meta={"items": item})

    # Parse the detail page
    def parseData(self, response):
        mess = response.json()
        # jsonpath gives a falsy result when nothing matches, so fall back to []
        responsibility = jsonpath(mess, '$..Responsibility') or []
        requirements = jsonpath(mess, '$..Requirement') or []
        item = response.meta['items']
        item['responsibility'] = ''.join(responsibility)
        item['requirements'] = ''.join(requirements)
        yield item
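The `$..Name` expressions used in the spider are recursive-descent lookups: they collect every value stored under that key, at any depth, in document order. The sketch below imitates that behavior with plain Python so you can see what comes back. `find_all` is my own helper, not part of the jsonpath package, and the shape of `sample` is a made-up stand-in for the real response.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth, in document order."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the key may also appear deeper inside the value
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_all(child, key))
    return found


# Hypothetical response shape, for illustration only
sample = {
    "Code": 200,
    "Data": {
        "Posts": [
            {"RecruitPostName": "Backend Engineer", "PostId": "1"},
            {"RecruitPostName": "UI Designer", "PostId": "2"},
        ]
    },
}

names = find_all(sample, "RecruitPostName")
missing = find_all(sample, "NoSuchKey")
```

Because each field is collected independently, the spider relies on every posting having every field. If that ever stops being true, looping over the list of postings and reading each field from the same dict would be the safer approach.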

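Done! Now for the last step: saving the data.

Saving the data with MySQL

Two steps:
Define a pipeline in pipelines.py to process the data
Register and enable the pipeline in settings.py

The code is as follows: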

import logging

import pymysql


class TencentPipelineSQL:
    def __init__(self):
        self.db = pymysql.connect(user='root', password='wang', database='py_sql', charset='utf8')
        self.c = self.db.cursor()

    def process_item(self, item, spider):
        try:
            sql = 'insert into tencent (title,link,detail,responsibility,requirements) values (%s,%s,%s,%s,%s)'
            self.c.execute(sql, [item['title'], item['link'], item['detail'], item['responsibility'], item['requirements']])
            self.db.commit()
        except Exception as e:
            # Undo the failed insert, then report the reason
            self.db.rollback()
            print('+++++++', e)
            logging.error(f'Failed to store item: {e}')
        return item

    def close_spider(self, spider):
        # Release the cursor and the connection when the spider finishes
        self.c.close()
        self.db.close()
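The pipeline assumes that a `tencent` table with those five columns already exists. Here is a rough sketch of the table and the parameterized insert, using the standard library's sqlite3 as a stand-in so it runs without a MySQL server. The column types and the sample item are my assumptions. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`, and in MySQL you would probably pick VARCHAR/TEXT types to suit your data.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL database used above
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tencent (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        link TEXT,
        detail TEXT,
        responsibility TEXT,
        requirements TEXT
    )
    """
)

# Made-up item, shaped like the one the spider yields
item = {
    "title": "Backend Engineer",
    "link": "https://careers.tencent.com/",
    "detail": "CSIG Technology 3+ years 2023-11-10",
    "responsibility": "Build services",
    "requirements": "3+ years of Python",
}

columns = ("title", "link", "detail", "responsibility", "requirements")
placeholders = ", ".join("?" for _ in columns)
sql = f"INSERT INTO tencent ({', '.join(columns)}) VALUES ({placeholders})"

# Values are passed separately from the SQL text, never formatted into it
conn.execute(sql, [item[c] for c in columns])
conn.commit()

rows = conn.execute("SELECT title, requirements FROM tencent").fetchall()
conn.close()
```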

Enable the pipeline in settings.py (my pipelines.py defines two classes, so I register two entries).
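The settings entry would look roughly like the fragment below. The `exam` package name comes from the import in the spider file. The priority value 300 is an arbitrary choice of mine (Scrapy expects a number from 0 to 1000, and lower numbers run first). Only the MySQL pipeline is listed here because it is the only class shown in this post.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # "<project>.pipelines.<class name>": priority
    "exam.pipelines.TencentPipelineSQL": 300,
}
```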

Create a launcher file for the spider, start.py:

# Start the spider!
from scrapy import cmdline
cmdline.execute(['scrapy','crawl','tencent','--nolog'])

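Right-click the file and run it; there is no need to type the command in a terminal.

Then check the database to see the result.
And that's it, we're done.

I hope this article helps you learn web scraping. If anything is poorly written, criticism and suggestions are welcome.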
