A Brief Guide to Using the Scrapy Crawling Framework

村西那条弯弯的河流

Article link: https://blog.csdn.net/weixin_41267342/article/details/107442754

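All examples in this article use Scrapy 2.2.0 and Python 3.7, with PyCharm as the development tool. The learning material comes from Bilibili.

The project code for this article is on Baidu Netdisk. Link: https://pan.baidu.com/s/1jP6ONSD7paXkesNRppO2kw
Extraction code: 7hao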

I. Introduction to Scrapy

1. Architecture of the Scrapy framework

[Figure: Scrapy architecture diagram]

2. What each component does

(1) Engine (Scrapy Engine)
Handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

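(2) Scheduler
Accepts the requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is crawled next and filters out duplicate URLs.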
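The idea of a URL priority queue with duplicate filtering can be pictured with a small standalone sketch. This is a toy model written for illustration, not Scrapy's actual scheduler; the class name ToyScheduler and the URL http://itcast.cn/teachers are made up for the example.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy model of a scheduler: a priority queue of URLs with de-duplication."""

    def __init__(self):
        self._heap = []                    # entries are (-priority, order, url)
        self._seen = set()                 # URLs that were already enqueued
        self._order = itertools.count()    # tie-breaker: first in, first out

    def enqueue(self, url, priority=0):
        """Queue a URL; return False if it is a duplicate and was dropped."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop high values first.
        heapq.heappush(self._heap, (-priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("http://itcast.cn/")
scheduler.enqueue("http://itcast.cn/teachers", priority=10)  # hypothetical URL
scheduler.enqueue("http://itcast.cn/")                       # duplicate, dropped
first = scheduler.next_url()
second = scheduler.next_url()
third = scheduler.next_url()
```

The higher-priority URL comes out first, the duplicate is never queued a second time, and an empty queue yields None.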

(3) Downloader
Downloads web page content and returns it to the engine. The downloader is built on Twisted, an efficient asynchronous networking framework.

(4) Spiders
Spiders are classes written by the developer. They parse responses and either extract items or issue new requests.

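(5) Item Pipelines
Process items after they have been extracted. Typical tasks are cleaning, validation, and persistence (for example, storing the items in a database).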
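(6) Spider Middlewares
Hooks that sit between the engine and the spiders. They process the responses passed into a spider and the items and requests that come out of it.

(7) Downloader Middlewares
Hooks that sit between the engine and the downloader. They process the requests passed from the engine to the downloader and the responses passed from the downloader back to the engine. You can use a downloader middleware to:
    ① process a request just before it is sent to the Downloader (i.e. right before Scrapy sends the request to the website);
    ② change a received response before passing it to a spider;
    ③ send a new Request instead of passing the received response to a spider;
    ④ pass a response to a spider without fetching a web page;
    ⑤ silently drop some requests.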

II. Using Scrapy

1. Installation:

pip install scrapy (or Scrapy; the first letter can be upper or lower case)

2. Generate the Scrapy project directory with a command.

scrapy startproject HeimaTeacher

Here scrapy startproject is fixed, and HeimaTeacher is the project directory name, which you choose yourself.

[Figure: running the command]

[Figure: result of running the command]
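The files that scrapy startproject generates can be summarized as data. The sketch below lists the layout expected for a project named HeimaTeacher and reports anything missing; the helper name missing_project_files is made up for this example, and the list reflects the standard project template rather than anything specific to this article.

```python
from pathlib import Path

# Files that `scrapy startproject HeimaTeacher` is expected to generate,
# relative to the outer HeimaTeacher directory.
EXPECTED_FILES = [
    "scrapy.cfg",
    "HeimaTeacher/__init__.py",
    "HeimaTeacher/items.py",
    "HeimaTeacher/middlewares.py",
    "HeimaTeacher/pipelines.py",
    "HeimaTeacher/settings.py",
    "HeimaTeacher/spiders/__init__.py",
]


def missing_project_files(project_root):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Run from the directory that contains the generated HeimaTeacher folder.
    print(missing_project_files("HeimaTeacher") or "project layout looks complete")
```

Spider files created later with scrapy genspider are placed in the inner spiders directory.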

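3. Create the spider file:

In the HeimaTeacher directory, run: scrapy genspider heimateacher "itcast.cn"

Here scrapy genspider is fixed syntax. heimateacher is the spider name (not the project name): it is required, you choose it yourself, it is used when running the crawl, and it can be changed in the generated file afterwards. "itcast.cn" is the domain to crawl: the command requires it, but in the generated file it can be changed or removed.

4. The important files in the project:

(1) heimateacher.py: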

import scrapy

from ..items import HeimateacherItem

class HeimateacherSpider(scrapy.Spider):
    name = 'heimateacher'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        teacher_list = response.xpath("//div[@class='main_bot']")
        # items = []
        for teacher in teacher_list:
            item = HeimateacherItem()
            teacher_name = teacher.xpath("h2/text()").extract()[0]
            info = teacher.xpath("h2/span/text()").extract()[0]
            le = len(teacher.xpath("h3/span/text()"))
            if le>0:
                des1 = teacher.xpath("h3/span/text()").extract()[0]
            else:
                des1 = ""
            if le > 1:
                des2 = teacher.xpath("h3/span/text()")[1].extract()
            else:
                des2 = ""
            if len(teacher.xpath("p/text()")) > 0:
                result_des = teacher.xpath("p/text()").extract()[0].strip()
            else:
                result_des = ""
            if len(teacher.xpath("p/span/text()")) > 0:
                result = teacher.xpath("p/span/text()").extract()[0].strip()
            else:
                result = ""
            item["name"] = teacher_name
            item["info"] = info
            item["des1"] = des1
            item["des2"] = des2
            item["result_des"] = result_des
            item["result"] = result
            yield item

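① This file is created by the command scrapy genspider heimateacher "itcast.cn".

② name is the spider's name and is used to start the crawl. allowed_domains lists the domains the spider may crawl and is optional. start_urls holds the URLs to start crawling from; it is a list, so several URLs can be given.

③ parse contains the logic that parses the downloaded data. Data in the response can be extracted in three ways: XPath, CSS selectors, or regular expressions. Choose whichever you prefer.

④ HeimateacherItem is the class defined in items.py; it declares the fields to be scraped.

⑤ response.xpath() returns a list of selector objects. Call .extract() to turn them into strings, and use an index to pick a particular one.

⑥ Return results with yield rather than return. return would end parse at the first item, whereas yield hands one item to the engine and then resumes the for loop. To use return, you would have to create a list before the loop, append each item inside the loop, and return the list at the end, which holds everything in memory at once when the data set is large.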

(2) items.py:

# Define here the models for your scraped items

# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class HeimateacherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Fields needed by the crawl
    name = scrapy.Field()
    level = scrapy.Field()
    info = scrapy.Field()
    des1 = scrapy.Field()
    des2 = scrapy.Field()
    result_des = scrapy.Field()
    result = scrapy.Field()

① The fields defined in this file must match the keys assigned in heimateacher.py; they are what gets saved.

(3) pipelines.py:

import json

class HeimateacherPipeline:

    def __init__(self):
        self.file = open("pip_json_data.json", "wb")

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.file.close()

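① This file processes the scraped data.

② __init__(self) sets up initial state and runs only once, for example to open the output file. close_spider runs when the crawl ends and releases resources, for example by closing the file.

③ The item argument of process_item is the data yielded by parse; the pipeline receives each yielded item in turn.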
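As a variant of the pipeline above, the file can also be opened in open_spider, the hook Scrapy calls when the spider starts, so that opening and closing are symmetric. This is a sketch, not a replacement the original project uses; the class name JsonLinesPipeline and the path parameter are made up for the example.

```python
import json


class JsonLinesPipeline:
    """Variant of HeimateacherPipeline that opens its file in open_spider."""

    def __init__(self, path="pip_json_data.json"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts; text mode handles the encoding.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once per yielded item; write one JSON document per line.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline, if any

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Returning the item from process_item matters: a pipeline that returns nothing passes None to every pipeline configured after it.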

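(4) settings.py:

This file is Scrapy's configuration file.

① ROBOTSTXT_OBEY controls whether the crawler obeys each site's robots.txt. With ROBOTSTXT_OBEY = True, Scrapy skips every URL that robots.txt disallows, so much of a site's content may not be crawled; this example changes it to ROBOTSTXT_OBEY = False.

② ITEM_PIPELINES lists the classes from pipelines.py. Several can be configured; the number after each class sets the execution order, and classes with smaller numbers run first.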
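The two settings could look like the fragment below. The dotted path assumes the project package is named HeimaTeacher, as created earlier; DedupPipeline is a hypothetical second class added only to show ordering, and run_order is a helper variable for the illustration, not a Scrapy setting.

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    "HeimaTeacher.pipelines.HeimateacherPipeline": 300,
    # Hypothetical second pipeline, shown only to illustrate ordering.
    "HeimaTeacher.pipelines.DedupPipeline": 200,
}

# Items pass through the pipelines in ascending order of these numbers.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

With these numbers, an item would go through DedupPipeline first and HeimateacherPipeline second.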

(5) middlewares.py:

This file can be used to set browser request headers and proxy IPs.
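A downloader middleware that sets a request header and a proxy might look like the sketch below. The class name, the User-Agent strings, and the proxy address are placeholders, not values from the original project; process_request and the proxy key in request.meta are the parts that Scrapy itself defines.

```python
import random


class HeaderAndProxyMiddleware:
    """Downloader middleware sketch: set a User-Agent and, optionally, a proxy."""

    # Placeholder values; replace them with real ones.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0",
    ]
    PROXY = None  # e.g. "http://127.0.0.1:8888"

    def process_request(self, request, spider):
        # Pick a User-Agent for this request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        if self.PROXY:
            # Scrapy's built-in HttpProxyMiddleware reads this meta key.
            request.meta["proxy"] = self.PROXY
        return None  # None tells Scrapy to continue handling the request
```

To take effect, the class has to be registered in settings.py, for example with DOWNLOADER_MIDDLEWARES = {"HeimaTeacher.middlewares.HeaderAndProxyMiddleware": 543}, where the dotted path again assumes the package name HeimaTeacher.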

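5. Running the project:

Option 1: run from the command line. Change into the HeimaTeacher directory and run: scrapy crawl heimateacher

Option 2: create a launcher file run.py in the HeimaTeacher directory and put the spider name in it. The code of run.py is as follows: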

from scrapy.cmdline import execute

import sys
import os

sys.path.append(os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "heimateacher"])  # heimateacher is your own spider's name, i.e. the name attribute set in the spider class

Running run.py produces the scraped data. The data file pip_json_data.json is written to the directory the crawl is run from.
