Python3 Scrapy Crawler Framework (Scrapy/scrapy-redis)

Written by Luzhuo; please retain this notice when reposting.
Original: https://blog.csdn.net/Rozol/article/details/80010173

Scrapy

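Scrapy is written in Python and is mainly used to crawl website data; requests for links it has already crawled are filtered out automatically.
It is built on the Twisted asynchronous networking framework.
Official site: https://scrapy.org/
Documentation: https://docs.scrapy.org/en/latest/
Chinese documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
Install: pip install Scrapy
Packages installed: PyDispatcher-2.0.5 Scrapy-1.5.0 asn1crypto-0.24.0 cffi-1.11.5 cryptography-2.2.2 cssselect-1.0.3 parsel-1.4.0 pyOpenSSL-17.5.0 pyasn1-0.4.2 pyasn1-modules-0.2.1 pycparser-2.18 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0
Other dependency (Windows only): pywin32-223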
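
The automatic filtering of already-crawled links can be pictured as a set of request fingerprints. The following is only a toy sketch of that idea using the standard library: `fingerprint` and `SeenFilter` are hypothetical names, the URLs are shortened for illustration, and Scrapy's own duplicate filter is more thorough than hashing the method and raw URL.

```python
import hashlib


def fingerprint(method, url):
    """Return a stable hash for a request (toy version: method + URL only)."""
    return hashlib.sha1(f"{method.upper()} {url}".encode("utf-8")).hexdigest()


class SeenFilter:
    """Hypothetical helper that remembers fingerprints and reports repeats."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self._seen:
            return True  # already crawled: the caller should drop the request
        self._seen.add(fp)
        return False


seen = SeenFilter()
urls = [
    "http://tieba.baidu.com/f/index/forumpark?pn=1",
    "http://tieba.baidu.com/f/index/forumpark?pn=2",
    "http://tieba.baidu.com/f/index/forumpark?pn=1",  # repeat of the first URL
]
# Keep only the URLs that have not been seen before
to_crawl = [u for u in urls if not seen.request_seen("GET", u)]
print(to_crawl)
```

In Scrapy itself, a single request can opt out of this filtering by passing `dont_filter=True` to `scrapy.Request`.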

Common commands

Create a project: scrapy startproject mySpider
Spiders
- Create a spider:
  - scrapy genspider tieba tieba.baidu.com
  - scrapy genspider -t crawl tieba tieba.baidu.com
- Run a spider: scrapy crawl tieba
Distributed spiders (scrapy-redis):
- Run a spider: scrapy runspider tieba.py
- Push a start URL (in redis-cli): lpush tieba:start_urls http://tieba.baidu.com/f/index/xxx

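Architecture

Scrapy Engine: coordinates communication among the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: receives requests from the engine, queues them, and hands them back when the engine asks for the next one.
Downloader: downloads every request the engine sends and returns the response to the engine, which passes it to the Spider.
Spider: extracts the data needed for the Item fields from each response and submits follow-up URLs to the engine, which passes them to the Scheduler.
Item Pipeline: processes and stores the Items produced by the Spider.
Downloader Middlewares: components that extend the download step.
Spider Middlewares: components that extend the communication between the engine and the Spider.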
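
The division of labour among these components can be sketched as a small loop. This is a synchronous toy model and not how Scrapy is implemented: the real engine runs asynchronously on Twisted, and the middlewares are left out. `download`, `spider_parse`, `run_engine`, and the `pages` dict are all hypothetical stand-ins.

```python
from collections import deque


def download(url, pages):
    """Stand-in for the Downloader: look the 'page' up in a dict."""
    return {"url": url, "body": pages[url]}


def spider_parse(response):
    """Stand-in for the Spider: yield items (dicts) and follow-up URLs (str)."""
    yield {"title": response["body"]["title"]}
    for link in response["body"]["links"]:
        yield link


def run_engine(start_urls, pages):
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # URLs that were already queued
    stored = []                    # stand-in for the Item Pipeline's storage
    while scheduler:
        url = scheduler.popleft()               # engine asks the scheduler
        response = download(url, pages)         # engine -> downloader -> engine
        for result in spider_parse(response):   # engine -> spider
            if isinstance(result, dict):
                stored.append(result)           # item -> pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)        # new request -> scheduler
    return stored


pages = {
    "/a": {"title": "A", "links": ["/b", "/c"]},
    "/b": {"title": "B", "links": ["/a"]},
    "/c": {"title": "C", "links": []},
}
items = run_engine(["/a"], pages)
print(items)
```

Note how the link from `/b` back to `/a` is dropped because `/a` was already queued, which mirrors the automatic filtering of crawled links.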

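Project layout

scrapy.cfg: project configuration file
mySpider: the project package
mySpider/items.py: item definitions (the data to extract)
mySpider/pipelines.py: item pipelines
mySpider/settings.py: project settings
mySpider/spiders: directory that holds the spider code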

Using Scrapy

Steps

Create the project
- scrapy startproject mySpider

Define the data to extract (items.py)

import scrapy

class TiebaItem(scrapy.Item):
    # Fields to store
    # Tieba (forum) name
    name = scrapy.Field()
    # Description
    summary = scrapy.Field()
    # Total number of members
    person_sum = scrapy.Field()
    # Total number of posts
    text_sum = scrapy.Field()

Create the spider

cd into the project directory (mySpider)
scrapy genspider tieba tieba.baidu.com
The spider file is created as the module mySpider.spiders.tieba

Write the spider

# -*- coding: utf-8 -*-
import scrapy
from mySpider.items import TiebaItem

class TiebaSpider(scrapy.Spider):
    name = 'tieba'  # spider name
    allowed_domains = ['tieba.baidu.com']  # domains the spider may crawl

    page = 1
    page_max = 2  # 30
    url = 'http://tieba.baidu.com/f/index/forumpark?cn=%E5%86%85%E5%9C%B0%E6%98%8E%E6%98%9F&ci=0&pcn=%E5%A8%B1%E4%B9%90%E6%98%8E%E6%98%9F&pci=0&ct=1&st=new&pn='

    start_urls = [url + str(page)]  # first URL to crawl

    # Handle each downloaded response
    def parse(self, response):
        # Scrapy's built-in selectors:
        # .css('title::text') / .re(r'Quotes.*') / .xpath('//title')
        tieba_list = response.xpath('//div[@class="ba_content"]')  # root node of each entry

        for tieba in tieba_list:
            # Extract the required fields from the page
            name = tieba.xpath('./p[@class="ba_name"]/text()').extract_first()  # .extract_first() returns the first match as a string
            summary = tieba.xpath('./p[@class="ba_desc"]/text()').extract_first()
            person_sum = tieba.xpath('./p/span[@class="ba_m_num"]/text()').extract_first()
            text_sum = tieba.xpath('./p/span[@class="ba_p_num"]/text()').extract_first()

            item = TiebaItem()
            item['name'] = name
            item['summary'] = summary
            item['person_sum'] = person_sum
            item['text_sum'] = text_sum

            # Hand the item to the pipelines
            yield item

        # Request the next page only while pages remain
        if self.page < self.page_max:
            self.page += 1
            # Hand the new request to the scheduler
            # yield scrapy.Request(next_page, callback=self.parse)
            yield response.follow(self.url + str(self.page), callback=self.parse)  # the callback can be a new method or parse itself

Create the pipeline and store the items (pipelines.py)

Edit settings.py

# Item pipelines (lower values run first)
ITEM_PIPELINES = {
    # 'mySpider.pipelines.MyspiderPipeline': 300,
    'mySpider.pipelines.TiebaPipeline': 100,
}

Write the pipeline code

import json

class TiebaPipeline(object):
    # Initialization: open the output file
    def __init__(self):
        self.file = open('tieba.json', 'w', encoding='utf-8')

    # Called when the spider is opened
    def open_spider(self, spider):
        pass

    # Required method: processes each item
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(jsontext)
        return item

    # Called when the spider is closed
    def close_spider(self, spider):
        self.file.close()
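
One possible variation, sketched here and not taken from the original, is to open the file in `open_spider` so that acquiring and releasing it are paired in the two lifecycle hooks. `JsonLinesPipeline` and its `path` parameter are hypothetical names, and the pipeline is driven by hand with a plain dict and `spider=None` only to show the call order Scrapy would follow.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical variant of TiebaPipeline that opens its file in open_spider."""

    def __init__(self, path="tieba.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: acquire the file here
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # One JSON object per line; ensure_ascii=False keeps Chinese text readable
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes: release the file
        self.file.close()


# Drive the pipeline by hand, in the order Scrapy calls the hooks
path = os.path.join(tempfile.mkdtemp(), "tieba.json")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "贴吧", "person_sum": "100"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```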

Run the spider
- Start
  - scrapy crawl tieba

Request and Response

# -*- coding: utf-8 -*-
import scrapy


class TiebaSpider(scrapy.Spider):
    name = 'reqp'
    allowed_domains = ['www.baidu.com']
    # start_urls = ['http://www.baidu.com/1']

    # By default the URLs in start_urls are fetched with GET. Override this method and comment out start_urls to choose the request type of the first request yourself.
    def start_request
