Building a Tencent Careers Job Crawler with Scrapy by Constructing Requests

like吃果果

Article link: https://blog.csdn.net/baidu_38263564/article/details/85401133

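Scrapy is a widely used crawling framework. To use it, you first create a project, then write the spider, and finally process the scraped data and store it in a database.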

First, here is what the data looks like once the crawler has stored it in the database:

I. Creating the project

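Scrapy has to be installed before you can create a project. Installation is not covered here; this article assumes Scrapy and MongoDB are already installed.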

1. Create the project and the spider

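# Create the Scrapy project
scrapy startproject tencent
# Enter the project directory
cd tencent
# Create the spider: hr is the spider name, tencent.com is the allowed domain
# (genspider creates a spider; crawl is the command that runs one)
scrapy genspider hr tencent.com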

2. Open the newly created project in PyCharm, as shown in the screenshot:

The .py file highlighted in red in the screenshot is the newly created spider file; double-click it to edit.

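3. Open the file and change start_urls to the URL whose response actually contains the data you want to scrape. The code below shows the file after this change.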

import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    # URL of the first request
    start_urls = ['https://hr.tencent.com/position.php']

4. Write the parse method, which extracts the data you want

    def parse(self, response):
        # Every row of the job table except the first (header) and last (pager) row
        tr_list = response.xpath('//table[@class="tablelist"]/tr')[1:-1]
        for tr in tr_list:
            item = {}
            item['class'] = tr.xpath('.//td[2]/text()').extract_first()
            item['title'] = tr.xpath('.//td[1]/a/text()').extract_first()
            item['position'] = tr.xpath('.//td[4]/text()').extract_first()
            item['publish_date'] = tr.xpath('.//td[5]/text()').extract_first()
            # Hand the result over to the pipelines
            yield item

5. Build the next-page URL and request to implement pagination

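        # Build the URL of the next page
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # Skip when the link is missing or is only the "javascript:;" placeholder
        if next_url and next_url != "javascript:;":
            # Resolve the relative href against the current page URL
            next_url = response.urljoin(next_url)
            # Construct the request; parse() handles the next page as well
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )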
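The pagination rule used in this step (follow the href unless it is missing or is the `javascript:;` placeholder) can be isolated into a small function. This is a sketch using only the standard library; `build_next_url` is a hypothetical helper name, and the sample href in the usage lines is an invented value, not one taken from the site.

```python
from urllib.parse import urljoin


def build_next_url(page_url, href):
    """Return the absolute next-page URL, or None when there is no next page."""
    # No link was found, or the link is only a javascript: placeholder
    if not href or href.strip().lower().startswith("javascript:"):
        return None
    # Resolve a relative href against the URL of the page it came from
    return urljoin(page_url, href)


if __name__ == "__main__":
    base = "https://hr.tencent.com/position.php"
    print(build_next_url(base, "position.php?&start=10#a"))  # invented sample href
    print(build_next_url(base, "javascript:;"))
```

Resolving against the current page URL, rather than concatenating with a fixed prefix, also covers hrefs that are already absolute or that start with a slash.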

That completes the spider. Next, write the pipeline that stores the data in the database.

6. Process the data and store it in the database

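from pymongo import MongoClient
# Create a MongoClient (connects to localhost:27017 by default)
client = MongoClient()
# Get the 'hr' collection in the 'tencent' database
collection = client['tencent']['hr']

class HrPipeline(object):
    def process_item(self, item, spider):
        # Printed only to make the result easy to inspect; not needed in a real project
        print(item)
        # Insert the record; insert_one replaces the insert() method removed in PyMongo 4.
        # A copy is inserted because insert_one adds an "_id" key to the dict it receives.
        collection.insert_one(dict(item))
        return item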
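The pipeline above opens its MongoDB connection at import time. As an alternative, here is a hedged sketch that ties the connection to the spider's lifetime through Scrapy's `open_spider` and `close_spider` pipeline hooks. The class name `MongoHrPipeline`, the constructor parameters, and the default URI are assumptions made for illustration; they do not come from the original project.

```python
class MongoHrPipeline:
    """Sketch: open the MongoDB connection when the spider starts, close it at the end."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="tencent", coll_name="hr"):
        # Assumed defaults: a local MongoDB and the same database and
        # collection names as the pipeline above
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.coll_name = coll_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymongo installed
        from pymongo import MongoClient
        self.client = MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.coll_name]

    def close_spider(self, spider):
        # Release the connection once the crawl has finished
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # insert_one adds an "_id" key to the dict it is given, so insert a copy
        self.collection.insert_one(dict(item))
        return item
```

Keeping the collection on the instance also makes the pipeline easy to exercise with a stand-in object in place of a real database.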

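Once the data is stored, the crawler is finished. In this article the pipeline only writes items to the database; if you need anything else, such as cleaning or deduplication, you can add it in the pipelines as well.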
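One step the walkthrough does not show: Scrapy only calls a pipeline that is listed in `ITEM_PIPELINES` in the project's settings.py. A minimal fragment, assuming the `HrPipeline` class lives in tencent/pipelines.py (the module path is inferred from the project name and is not stated in the original):

```python
# settings.py (fragment)
# The key is the import path of the pipeline class (assumed here);
# the value is its order, where lower numbers run first.
ITEM_PIPELINES = {
    "tencent.pipelines.HrPipeline": 300,
}
```

With the pipeline enabled, the crawl is started from the project directory with `scrapy crawl tencent`, where `tencent` is the `name` attribute of the spider class.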

Complete spider code:

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    # URL of the first request
    start_urls = ['https://hr.tencent.com/position.php']

    def parse(self, response):
        # Every row of the job table except the first (header) and last (pager) row
        tr_list = response.xpath('//table[@class="tablelist"]/tr')[1:-1]
        for tr in tr_list:
            item = {}
            item['class'] = tr.xpath('.//td[2]/text()').extract_first()
            item['title'] = tr.xpath('.//td[1]/a/text()').extract_first()
            item['position'] = tr.xpath('.//td[4]/text()').extract_first()
            item['publish_date'] = tr.xpath('.//td[5]/text()').extract_first()
            # Hand the result over to the pipelines
            yield item
        # Build the URL of the next page
        next_url = response.xpath("//a[@id='next']/@href").extract_first()
        # Skip when the link is missing or is only the "javascript:;" placeholder
        if next_url and next_url != "javascript:;":
            # Resolve the relative href against the current page URL
            next_url = response.urljoin(next_url)
            # Construct the request; parse() handles the next page as well
            yield scrapy.Request(
                next_url,
                callback=self.parse
            )
