A Simple Python Scrapy Crawler for Tencent's Experienced-Hire Job Listings

Latest recommended article published on 2024-04-06 03:53:10

Sorry,Hey

Category: Python. Tags: Python, crawler, scrapy

Article link: https://blog.csdn.net/weixin_43440600/article/details/89429554


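I am a beginner, so please bear with any mistakes.

I. Target site: https://hr.tencent.com/position.php?&start=10#a
II. What to scrape
1. Job title: positionName
2. Link behind the job title: positionLink
3. Job category: positionType
4. Headcount: peopleNumber
5. Location: workLocation
6. Publish date: publishTime
III. Preparation: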

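Create a new Scrapy project. I named mine tencent. Command: scrapy startproject tencent
If you are a beginner like me, just drag the whole project into PyCharm and edit it there.
On the site, press F12 to open the developer tools, press Ctrl+Shift+C, then click the element you want to inspect, for example the job title in the first row.

You can now see the source of that element.

We can see that every element we are looking for sits inside the two kinds of rows shown below.

Here I use the posting "广研UI设计师" as the example.

We can see its job title, the corresponding URL, category, headcount, location, and publish date. Click each piece of content we want to scrape in turn, then right-click -> Copy -> Copy XPath (I use XPath here; CSS selectors and other methods also work). That gives an expression like this: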

//*[@id="position"]/div[1]/table/tbody/tr[2]/td[1]/a

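Open the Chrome extension XPath Helper and paste the expression. It shows the job title we were looking for, and in the same way we can locate every piece of data we want to scrape.
Taking the XPath //tr[@class='even' or @class='odd'] as the row node, we get the following relative XPath paths:
1. Job title: positionName: ./td[1]/a/text()
2. Link behind the job title: positionLink: ./td[1]/a/@href
3. Job category: positionType: ./td[2]/text()
4. Headcount: peopleNumber: ./td[3]/text()
5. Location: workLocation: ./td[4]/text()
6. Publish date: publishTime: ./td[5]/text()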
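
The row-relative paths above can be tried out without running a crawl. The following is only a sketch using Python's standard-library ElementTree, which supports a subset of XPath: the "or" predicate is replaced by a plain Python check, and text() / @href are replaced by .text and .get("href"). The markup is a hypothetical, simplified stand-in for the listing table; only the job title comes from the article, and every other value (hrefs, categories, dates) is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the listing table.
# Only the title "广研UI设计师" comes from the article; the rest is invented.
SAMPLE = """
<table>
  <tr class="h"><td>Title</td><td>Category</td><td>Headcount</td><td>Location</td><td>Published</td></tr>
  <tr class="even">
    <td><a href="position_detail.php?id=1">广研UI设计师</a></td>
    <td>Design</td><td>1</td><td>Guangzhou</td><td>2019-04-20</td>
  </tr>
  <tr class="odd">
    <td><a href="position_detail.php?id=2">Example Engineer</a></td>
    <td></td><td>2</td><td>Shenzhen</td><td>2019-04-20</td>
  </tr>
</table>
"""


def parse_rows(markup):
    """Return one dict per data row, mirroring the six fields of the article."""
    root = ET.fromstring(markup)
    rows = []
    for tr in root.iter("tr"):
        # Stand-in for the predicate [@class='even' or @class='odd']
        if tr.get("class") not in ("even", "odd"):
            continue
        link = tr.find("./td[1]/a")  # same relative path as in the list above
        rows.append({
            "positionName": link.text,
            "positionLink": link.get("href"),
            # An empty <td> has text None, so fall back to an empty string
            "positionType": tr.find("./td[2]").text or "",
            "peopleNumber": tr.find("./td[3]").text,
            "workLocation": tr.find("./td[4]").text,
            "publishTime": tr.find("./td[5]").text,
        })
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Note that the header row is skipped because its class is neither even nor odd, which is the reason for anchoring on those two classes in the first place.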

IV. Writing the Scrapy code

Edit items.py and define the matching TencentItem class:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    positionName = scrapy.Field()
    positionLink = scrapy.Field()
    positionType = scrapy.Field()
    peopleNumber = scrapy.Field()
    workLocation = scrapy.Field()
    publishTime = scrapy.Field()

Create and write tencent.py in the spiders directory. The genspider command takes both a spider name and a domain (scrapy genspider <name> <domain>), and Scrapy refuses a spider name identical to the project name, so with a project called tencent you may have to generate the file under another name and rename it, or simply create it by hand.

import scrapy
from tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = "Tencent"  # spider name, used by "scrapy crawl"
    start_urls = [  # URLs the crawl starts from
        "https://hr.tencent.com/position.php?&start=10#a"
    ]

    def parse(self, response):
        # Every job posting is a table row with class "even" or "odd"
        node_list = response.xpath("//tr[@class='even' or @class='odd']")

        for node in node_list:
            item = TencentItem()  # item class defined in items.py
            item["positionName"] = node.xpath("./td[1]/a/text()").get()  # job title
            item["positionLink"] = node.xpath("./td[1]/a/@href").get()  # link behind the title
            # Some postings have no category, so fall back to an empty string
            item["positionType"] = node.xpath("./td[2]/text()").get(default="")
            item["peopleNumber"] = node.xpath("./td[3]/text()").get()  # headcount
            item["workLocation"] = node.xpath("./td[4]/text()").get()  # location
            item["publishTime"] = node.xpath("./td[5]/text()").get()  # publish date

            yield item  # hand the item to the pipeline

        # On the last page the "next" link carries class "noactive"; otherwise follow it
        if not response.xpath("//a[@class='noactive' and @id='next']"):
            next_url = response.xpath("//*[@id='next']/@href").get()
            if next_url:
                # urljoin resolves the relative href against the current page URL
                yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
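
The last step of the spider turns the href of the "next" link into a full URL. As a rough illustration, the standard-library urljoin resolves a relative href against the page it was found on, and it tolerates hrefs with or without a leading slash, which plain string concatenation does not. The href values below are assumptions about what the site returned, not values taken from the article.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a (possibly relative) href against the URL of the current page."""
    return urljoin(current_url, href)


current = "https://hr.tencent.com/position.php?&start=10#a"

# Hypothetical href values; the real page may have used a different form.
relative = next_page_url(current, "position.php?&start=20#a")
rooted = next_page_url(current, "/position.php?&start=20#a")

print(relative)
print(rooted)
```

Both calls produce the same absolute URL, so the spider does not need to know which form the site uses.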

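Write pipelines.py (the item pipeline; here I write the data to a JSON file, one item per line)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentPipeline:

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file as UTF-8 text
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.f.close()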

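I have not explained this code in detail (there are parts I do not fully understand myself, haha). If anything is unclear, contact me and we can work it out together.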
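
For readers puzzling over the pipeline, the part most likely to confuse is ensure_ascii=False together with UTF-8 encoding. A small standalone sketch of what that combination does is shown here; the item is a plain dict standing in for a TencentItem, and the headcount value is made up.

```python
import json

# Plain dict standing in for a scraped item; "1" is an invented value.
item = {"positionName": "广研UI设计师", "peopleNumber": "1"}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(item, ensure_ascii=False)

# One JSON object per line, encoded as UTF-8 bytes for a file opened in binary mode
line = (readable + "\n").encode("utf-8")

print(escaped)
print(readable)
```

Both forms decode to the same data; the second is simply easier to read when you open the file in an editor.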

Enable the pipeline in settings.py (that is, uncomment the ITEM_PIPELINES setting).

You can also set a User-Agent and an Accept header here.
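
As a rough guide, the relevant part of settings.py might look like the fragment below. The pipeline path assumes the project is named tencent and the class lives in pipelines.py as shown earlier; the User-Agent string and header values are example values only, not the ones the author used.

```python
# settings.py (fragment)

# Enable the pipeline; the number (0-1000) sets the order when several pipelines run
ITEM_PIPELINES = {
    "tencent.pipelines.TencentPipeline": 300,
}

# Example browser User-Agent; substitute whatever your own browser sends
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"
)

# Headers sent with every request unless a request overrides them
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}
```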

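OK. In the terminal panel at the bottom, enter the run command scrapy crawl Tencent, where Tencent is the spider name you defined.

The data scrolls past; wait for the crawl to finish.

When it is done, the data is in the .json file the pipeline created (it is unformatted, so it looks a bit ugly, haha).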
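
Because the pipeline writes one JSON object per line, the output file as a whole is not a single JSON document, so it has to be read line by line. A minimal sketch of loading it back and pretty-printing it follows; to stay self-contained it writes a one-line demo file to a temporary directory first, and the function name load_items is introduced here purely for illustration.

```python
import json
import os
import tempfile


def load_items(path):
    """Read a file with one JSON object per line and return a list of dicts."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                items.append(json.loads(line))
    return items


# Demo: write a one-line file in the same format, then read it back
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tencent.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"positionName": "广研UI设计师"}, ensure_ascii=False) + "\n")
    demo_items = load_items(demo_path)

# indent=2 gives the tidy layout the raw file lacks
print(json.dumps(demo_items, ensure_ascii=False, indent=2))
```

To use it on the real output, call load_items("tencent.json") from the directory where the crawl was run.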

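V. Summary (pay attention!)
The odds of everything working on the first try are small, and you will probably see all sorts of errors. I hit plenty of them myself because of my environment, packages, a missing symbol in the code, the data format, and so on. So do not be discouraged when something breaks: search for the error (Baidu). If it gets too frustrating, take a break and try again later; it may just work, hahaha. If you have questions, leave a comment.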
