Hands-on crawler development with the Scrapy framework: scraping BOSS Zhipin job listings [Advanced crawling]

Latest recommended article published 2024-07-16 16:00:11

脱氨垃圾

Category: Python crawlers. Tags: scrapy, advanced, Header, cookie, redirect (302)

Article link: https://blog.csdn.net/qq_36109528/article/details/100107602

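This article shows how to develop a crawler with the Scrapy framework. It walks through the whole process: creating the project, defining the fields to scrape, handling request headers and cookies, parsing paginated data, and storing the results in MongoDB. The finished code is hosted on GitHub.

Summary generated automatically by CSDN.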

Project on GitHub

https://github.com/liuhf-jlu/scrapy-BOSS-

Scraping task

Date: August 28, 2019

Content to scrape: Python job postings in Beijing listed on BOSS Zhipin

Link: https://www.zhipin.com

Create the project

# Create the project
scrapy startproject BJ

Create the spider

# Change into the project directory
cd BJ
# Create the spider: scrapy genspider [spider name] [domain to crawl]
scrapy genspider boss_zhipin 'zhipin.com'

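scrapy.cfg       project configuration file
items.py         item templates, used to structure the scraped data
pipelines.py     data processing
settings.py      settings file
middlewares.py   project middlewares
spiders          directory that holds the spiders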
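
The listing above does not show nesting: `scrapy startproject BJ` puts scrapy.cfg in the outer BJ directory and the Python modules in an inner package with the same name. A minimal sketch for checking that layout with the standard library follows. The helper name `missing_project_files` and the `EXPECTED` list are hypothetical and not part of Scrapy.

```python
from pathlib import Path

# Paths relative to the project root. "{package}" is the inner package
# directory, which startproject names after the project (here: BJ).
# EXPECTED and missing_project_files are illustrative names, not Scrapy APIs.
EXPECTED = [
    'scrapy.cfg',
    '{package}/items.py',
    '{package}/pipelines.py',
    '{package}/settings.py',
    '{package}/middlewares.py',
    '{package}/spiders',
]


def missing_project_files(root, package='BJ'):
    """Return the expected project paths that do not exist under root."""
    root = Path(root)
    expected = [entry.format(package=package) for entry in EXPECTED]
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == '__main__':
    # Run from the directory that contains scrapy.cfg.
    print(missing_project_files('.'))
```

An empty list means every file and directory from the listing is in place.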

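Define the crawler requirements and design the spider code

1. Define the entry URL, start_urls

Start page

Page 1

Page 2

Page 3

Link behind the "next page" button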

https://www.zhipin.com/c101010100/?query=python&page={}&ka=page-next

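The links above show how the URL changes from one page to the next, which gives the pagination pattern. Set the spider's start_urls to the link of the first page.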

start_urls = ['https://www.zhipin.com/c101010100/?query=python&page=1&ka=page-1']
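
The pagination pattern can be sketched with the standard library alone: fill the page number into the "next page" template, or read the current page number out of a URL and add one. The helper names `page_url` and `next_page_url` are hypothetical, and the URL format is the one observed in 2019, so it may no longer match the live site.

```python
from urllib.parse import parse_qs, urlsplit

# Template copied from the "next page" link above; {} is the page number.
PAGE_URL = 'https://www.zhipin.com/c101010100/?query=python&page={}&ka=page-next'


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return PAGE_URL.format(page)


def next_page_url(current_url):
    """Read the page parameter from current_url and build the next page's URL."""
    query = parse_qs(urlsplit(current_url).query)
    current_page = int(query['page'][0])
    return page_url(current_page + 1)


if __name__ == '__main__':
    start = 'https://www.zhipin.com/c101010100/?query=python&page=1&ka=page-1'
    print(next_page_url(start))
    # prints https://www.zhipin.com/c101010100/?query=python&page=2&ka=page-next
```

Inside a spider, a URL built this way would typically be handed to a new request from the parse callback. Following the "next page" link found in the response is an alternative that does not depend on guessing the last page number.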

2. In items.py, define the fields to scrape (more can be added later)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BjItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    job_title = scrapy.Field()  # job title
    compensation = scrapy.Field()  # salary
    company = scrapy.Field()  # company
    address = scrapy.Field()  # address
    seniority = scrapy.Field()  # years of experience
    education = scrapy.Field()  # education level
    company_type = scrapy.Field()  # company type
    company_finance &
