Scraping Tencent Job Postings with Python: A Scrapy Walkthrough

Latest recommended article published on 2022-06-27 23:21:17

陌小

Article link: https://blog.csdn.net/weixin_44356081/article/details/109141103


Prerequisites: install the required packages first.

pip install scrapy


pip install Twisted


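Preface

In earlier crawler work we used requests to fetch web pages and parse out the data we wanted. When the amount of data to crawl grows large, an asynchronous framework such as Scrapy is a better fit. This article walks through a practical Scrapy example: crawling job postings from Tencent's careers site.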

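1. Writing the Tenxun.py spider file

This is the core file: the crawling logic is designed here. The full source is published below with explanations. First, create a Scrapy project with the following command: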

scrapy startproject demoTenXun

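Once the project has been created, add a new file named Tenxun.py in the spiders folder under demoTenXun. Inspecting the careers page with the browser developer tools (F12) shows that the job listings are rendered through Ajax: the page loads its data from a JSON API, so the spider can request that API directly instead of parsing HTML.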

import json

import scrapy

from demoTenXun.items import DemotenxunItem


class TenXunSpider(scrapy.Spider):
    name = 'Tenxun'    # spider name; this is the name passed to `scrapy crawl`
    allowed_domains = ['careers.tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    offer = 1    # index of the page currently being crawled

    def parse(self, response):
        # The API returns JSON, so load the response body with json
        datas = json.loads(response.text)
        for data in datas['Data']['Posts']:
            # Create one item per job posting
            item = DemotenxunItem()
            item['RecruitPostName'] = data['RecruitPostName']
            item['Responsibility'] = data['Responsibility']
            item['LastUpdateTime'] = data['LastUpdateTime']
            item['LocationName'] = data['LocationName']
            yield item
        self.offer += 1
        # Stop once page 109 has been requested
        if self.offer <= 109:
            # URL of the next page
            next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(self.offer)
            yield scrapy.Request(next_url, callback=self.parse)
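
The spider hard-codes the query string and relies on the response containing a list under `Data` and then `Posts`. As a minimal sketch of those two steps outside Scrapy, the snippet below builds the page URL from a parameter dictionary with `urllib.parse.urlencode` and pulls the four fields out of a response body. The helper names `build_url` and `extract_posts` and the sample payload are hypothetical illustrations; only the parameter names and the `Data`/`Posts` keys come from the spider code.

```python
import json
from urllib.parse import urlencode

API = 'https://careers.tencent.com/tencentcareer/api/post/Query'
FIELDS = ('RecruitPostName', 'Responsibility', 'LastUpdateTime', 'LocationName')


def build_url(page, keyword='python', page_size=10):
    """Build the query URL for one result page (hypothetical helper)."""
    params = {
        'timestamp': 1602982179339,  # value copied from the URL in the spider
        'countryId': '', 'cityId': '', 'bgIds': '', 'productId': '',
        'categoryId': '', 'parentCategoryId': '', 'attrId': '',
        'keyword': keyword,
        'pageIndex': page,
        'pageSize': page_size,
        'language': 'zh-cn',
        'area': 'cn',
    }
    return API + '?' + urlencode(params)


def extract_posts(body):
    """Return one dict per posting, keeping only the four fields of interest."""
    payload = json.loads(body)
    return [{name: post[name] for name in FIELDS}
            for post in payload['Data']['Posts']]


# Made-up sample body with the shape the spider expects
SAMPLE = json.dumps({'Data': {'Posts': [{
    'RecruitPostName': 'Python Engineer',
    'Responsibility': 'Build crawlers',
    'LastUpdateTime': '2020-10-18',
    'LocationName': 'Shenzhen',
    'PostId': '0',  # extra keys are ignored
}]}})

print(build_url(2))
print(extract_posts(SAMPLE))
```

Building the URL from a dictionary keeps the page index and keyword in one place, which makes it easier to change the search term than editing two long string literals.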

2. Defining the item fields in items.py

The code is as follows (example):

import scrapy


class DemotenxunItem(scrapy.Item):
    # define the fields for your item here like:
    RecruitPostName = scrapy.Field()    # job title
    Responsibility = scrapy.Field()     # job responsibilities
    LastUpdateTime = scrapy.Field()     # date the posting was last updated
    LocationName = scrapy.Field()       # job location

3. Saving the data in pipelines.py

The code is as follows (example):

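import codecs
import json


class DemotenxunPipeline:
    def __init__(self):
        # Opened in append mode, so repeated runs add to the same file
        self.file = codecs.open('tensun.csv', 'a', encoding='GBK')

    def process_item(self, item, spider):
        # Write each item as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider(spider) when the spider finishes
        self.file.close()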
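
The pipeline in this section writes one JSON object per line, even though the output file carries a `.csv` extension. If a true CSV file is wanted, one possible variant uses the standard `csv` module. This is a sketch rather than the author's code: the class name `CsvPipeline`, the file name `jobs.csv`, and the `utf-8-sig` encoding (chosen here so that spreadsheet programs can detect the encoding) are assumptions. `open_spider` and `close_spider` are the hooks Scrapy calls on a pipeline when the spider starts and finishes.

```python
import csv

FIELDS = ['RecruitPostName', 'Responsibility', 'LastUpdateTime', 'LocationName']


class CsvPipeline:
    """Hypothetical pipeline that writes each item as one CSV row."""

    def __init__(self, path='jobs.csv'):
        self.path = path

    def open_spider(self, spider):
        # newline='' lets the csv module control line endings itself
        self.file = open(self.path, 'w', newline='', encoding='utf-8-sig')
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()
```

To try it, the class would be added to pipelines.py and registered in `ITEM_PIPELINES` under the path `demoTenXun.pipelines.CsvPipeline`.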

4. Configuring the crawler in settings.py

A few settings need to be changed:

# Uncomment this block
ITEM_PIPELINES = {
   'demoTenXun.pipelines.DemotenxunPipeline': 300,
}
# Add your request headers here
DEFAULT_REQUEST_HEADERS = {
  # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0'
}
# Change this to False
ROBOTSTXT_OBEY = False
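
Because `ROBOTSTXT_OBEY` is switched off and the spider requests over a hundred pages, it may be worth throttling the crawl. The optional additions to settings.py below are a sketch: `DOWNLOAD_DELAY`, `CONCURRENT_REQUESTS_PER_DOMAIN`, and `AUTOTHROTTLE_ENABLED` are standard Scrapy settings, but the values are assumptions chosen for illustration and are not part of the original project.

```python
# Optional politeness settings (illustrative values, not from the original project)
DOWNLOAD_DELAY = 1                    # base delay in seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True           # let Scrapy adapt the delay to server latency
```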

5. Running the spider

The command format is:

scrapy crawl <spider name>    (the value of name = '...' in Tenxun.py)

For this project, the command is:

scrapy crawl Tenxun
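
After the crawl finishes, the pipeline from section 3 leaves a file named `tensun.csv` in which every line is a JSON object encoded as GBK. A minimal sketch for loading it back into Python follows. The function name `load_jobs` is hypothetical, and the demo writes a small made-up record to a temporary directory so that the snippet runs without a crawl.

```python
import json
import os
import tempfile


def load_jobs(path, encoding='GBK'):
    """Read a JSON-lines file written by the pipeline into a list of dicts."""
    jobs = []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                jobs.append(json.loads(line))
    return jobs


# Demo with a made-up record, written the same way the pipeline writes items
demo_path = os.path.join(tempfile.mkdtemp(), 'tensun.csv')
with open(demo_path, 'a', encoding='GBK') as f:
    record = {'RecruitPostName': 'Python工程师', 'LocationName': '深圳'}
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

jobs = load_jobs(demo_path)
print(len(jobs), jobs[0]['LocationName'])
```

Since the pipeline opens the file in append mode, running the spider several times produces duplicate records in the same file; deleting the file before each run avoids that.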

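Summary

Note: that is all for today. This article only briefly introduced how to use Scrapy, which offers many more features for crawling data quickly and conveniently.

Life is short, I use Python.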
