Python Web Scraping in Practice | (20) A Scrapy Starter Example


CoreJT


Column: Python3 Web Scraping from Theory to Practice (Base); tags: Python Web Scraping in Practice, Scrapy

Article link: https://blog.csdn.net/sdu_hao/article/details/97132429


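In this post we build a beginner-level crawler with the Scrapy framework.

Create a Scrapy project from the command line

First, open a terminal in your PyCharm projects directory and run scrapy startproject <project name> (for example, scrapy startproject ScrapyExample) to generate the crawler project. Scrapy creates the project skeleton and a few starter files automatically.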
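For reference, the layout produced by scrapy startproject ScrapyExample typically looks like the listing below. It follows the standard Scrapy project template; the exact contents may differ slightly between Scrapy versions.

```text
ScrapyExample/
    scrapy.cfg            # deploy configuration
    ScrapyExample/        # the project's Python package
        __init__.py
        items.py          # Item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # Item Pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```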

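Create a Spider from the command line

A Spider is a class you define; Scrapy uses it to fetch pages and parse the results. It must subclass scrapy.Spider and define the spider's name, its start requests, and a method that parses the crawled responses.

Change into the spiders directory generated earlier and run the following command:

Command: scrapy genspider <spider name> <site domain>

Example: scrapy genspider quotes quotes.toscrape.com

This creates a .py file named after the spider (here quotes.py) in the spiders directory.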

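Create the Item

An Item is a container for scraped data. To create one, subclass scrapy.Item and declare each field as a scrapy.Field.

First, take a look at the site we are going to crawl by opening http://quotes.toscrape.com/.

The site is a collection of famous quotes. Each entry has three parts: the quote text, the author, and a list of tags.

Next, edit items.py (initially it contains only a skeleton class) and define the fields we need: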

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QuoteItem(scrapy.Item):  # default class name is <ProjectName>Item; it can be renamed, e.g. to QuoteItem
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()    # quote text
    author = scrapy.Field()  # author
    tags = scrapy.Field()    # tags

Edit the spider's parse method (used to parse the response)

Parse the content of the response variable with CSS or XPath selectors and assign the results to the Item's fields. quotes.py:

# -*- coding: utf-8 -*-
import scrapy
from ScrapyExample.items import QuoteItem  # import QuoteItem so the spider can fill it


class QuotesSpider(scrapy.Spider):  # custom spider class; subclasses scrapy.Spider
    name = 'quotes'     # spider name
    allowed_domains = ['quotes.toscrape.com']   # domains the spider may crawl
    start_urls = ['http://quotes.toscrape.com/']  # URLs the crawl starts from

    def parse(self, response):  # parsing / extraction rules
        '''
        <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">

            <a class="tag" href="/tag/change/page/1/">change</a>

            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>

            <a class="tag" href="/tag/thinking/page/1/">thinking</a>

            <a class="tag" href="/tag/world/page/1/">world</a>

        </div>
    </div>
        '''
        quotes = response.css('.quote')  # all quote <div> nodes on the current page
        for quote in quotes:
            item = QuoteItem()
            # '.text' is a CSS class selector and '::text' selects the node's text content.
            # The result is a list; extract_first() returns its first element.
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()  # the whole list
            yield item

        # URL of the next page to crawl
        '''
        <li class="next">
                <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
            </li>
        '''
        next_page = response.css('.next a::attr(href)').extract_first()
        if next_page is not None:  # the last page has no "next" link
            url = response.urljoin(next_page)
            # When the request completes, the engine passes the response to the callback for further parsing
            yield scrapy.Request(url=url, callback=self.parse)
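
The next-page step depends on resolving a relative href such as /page/2/ against the URL of the current page. Below is a minimal standard-library sketch of that resolution, assuming response.urljoin behaves like urllib.parse.urljoin applied to the response URL (which holds when the page declares no <base> tag). next_page_url is a hypothetical helper introduced for illustration; it is not part of Scrapy.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Return the absolute URL of the next page, or None on the last page.

    Hypothetical helper for illustration; in the spider, response.urljoin
    does the joining.
    """
    if href is None:  # no "next" link was found on the page
        return None
    return urljoin(current_url, href)


print(next_page_url('http://quotes.toscrape.com/page/1/', '/page/2/'))
print(next_page_url('http://quotes.toscrape.com/page/10/', None))
```

Returning None when no link is found gives the spider a clear signal to stop paginating instead of building a request from an empty href.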

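Run from the command line

Run the following commands inside the project, for example from the spiders directory.

scrapy crawl <spider name>

Runs the spider and prints the results. Example: scrapy crawl quotes

scrapy crawl <spider name> -o <file name>

Runs the spider and saves the results to a file (JSON, CSV, XML, etc.). Example: scrapy crawl quotes -o output.json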
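
Assuming the JSON export is a single array with one object per Item (text, author, tags), it can be post-processed with the standard library. In the sketch below, the sample string stands in for the contents of output.json; its values are illustrative placeholders, not real crawl output.

```python
import json
from collections import Counter

# Illustrative stand-in for the contents of output.json (not real crawl output).
sample = '''
[
  {"text": "Quote A", "author": "Author One", "tags": ["life", "change"]},
  {"text": "Quote B", "author": "Author Two", "tags": ["life"]},
  {"text": "Quote C", "author": "Author One", "tags": []}
]
'''

items = json.loads(sample)

# Count quotes per author and occurrences of each tag.
quotes_per_author = Counter(item["author"] for item in items)
tag_counts = Counter(tag for item in items for tag in item["tags"])

print(quotes_per_author.most_common(1))
print(tag_counts["life"])
```

With a real export, replace json.loads(sample) with json.load applied to the opened output file.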

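Going further

Using Item Pipelines

For more involved processing, such as saving results to a MongoDB database or filtering out unwanted Items, define an Item Pipeline. Every Item the spider yields is sent to the Item Pipeline automatically. Item Pipelines are commonly used to:

1) Clean HTML data

2) Validate scraped data and check the scraped fields

3) Detect and drop duplicates

4) Save the scraped results to a database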
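
As an illustration of point 3, a duplicate filter can be sketched as a class with a process_item method that remembers what it has already seen. This is a hypothetical example, not code from the project: DuplicatesPipeline is an invented name, and DropItem is a local stand-in for scrapy.exceptions.DropItem so that the sketch runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class DuplicatesPipeline:
    """Drop any item whose quote text has already been seen (hypothetical example)."""

    def __init__(self):
        self.seen_texts = set()

    def process_item(self, item, spider):
        text = item['text']
        if text in self.seen_texts:
            # Raising DropItem tells Scrapy to discard the item.
            raise DropItem('Duplicate item: %r' % text)
        self.seen_texts.add(text)
        return item


pipeline = DuplicatesPipeline()
first = pipeline.process_item({'text': 'Quote A', 'author': 'Author One'}, spider=None)
try:
    pipeline.process_item({'text': 'Quote A', 'author': 'Author One'}, spider=None)
    dropped = False
except DropItem:
    dropped = True
print(first['text'], dropped)
```

In a real project the class would import DropItem from scrapy.exceptions and be registered in ITEM_PIPELINES like any other pipeline.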

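Implementing an Item Pipeline (edit pipelines.py)

Define a class that implements process_item(). The method must either return a dict or Item object containing the data, or raise a DropItem exception. It takes two parameters: item, the Item most recently yielded by the Spider, and spider, the Spider instance. Once a pipeline is enabled, Scrapy calls its process_item() method automatically for each Item.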

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.exceptions import DropItem

class ScrapyexamplePipeline(object):
    def process_item(self, item, spider):
        return item

# Item-processing pipeline: truncates text longer than 50 characters and drops Items with no text
class TextPipeline(object):

    def __init__(self):
        self.limit = 50

    # This method is required and must accept both item and spider.
    # Any other methods are optional.
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem is an exception, so it must be raised rather than returned
            raise DropItem('Missing Text')

# Storage pipeline: saves the data to a MongoDB database
class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # Read mongo_uri and mongo_db from settings.py; you need to define both settings there yourself
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    # Connect to the server and open the database
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # This method is required and must accept both item and spider
    def process_item(self, item, spider):
        name = item.__class__.__name__
        # Insert the Item into the collection as a dict; insert_one() replaces
        # the old insert(), which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    # Close the connection
    def close_spider(self, spider):
        self.client.close()
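
The truncation rule in TextPipeline can be checked in isolation, without Scrapy or MongoDB. Here is a minimal sketch, where truncate_text is a hypothetical helper that mirrors the slicing logic above:

```python
def truncate_text(text, limit=50):
    """Keep the first `limit` characters, strip trailing whitespace,
    and append '...' if the text was longer than `limit`."""
    if len(text) > limit:
        return text[:limit].rstrip() + '...'
    return text


short = truncate_text('A short quote.')
long_result = truncate_text('x' * 49 + ' ' + 'y' * 20)
print(short)
print(len(long_result))
```

Because rstrip() runs before the ellipsis is appended, a cut that lands on a space does not leave a gap before the '...'.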

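Remember to register the pipelines in settings.py so that Scrapy knows about them (add the code below to settings.py). The number is each pipeline's priority: lower values run first, so TextPipeline runs before MongoPipeline: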

ITEM_PIPELINES = {
    'ScrapyExample.pipelines.TextPipeline': 300,
    'ScrapyExample.pipelines.MongoPipeline': 400,
}
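
MongoPipeline reads MONGO_URI and MONGO_DB from the settings, so both also need to be defined in settings.py. The values below are assumptions for a MongoDB server running on the local machine on the default port; the database name is an arbitrary placeholder.

```python
# settings.py (assumed values; adjust to your own MongoDB deployment)
MONGO_URI = 'mongodb://localhost:27017'  # assumption: local server, default port
MONGO_DB = 'scrapy_example'              # assumption: placeholder database name
```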

Run scrapy crawl quotes again and the data is stored in MongoDB; from there it can be exported to files in various formats.


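Changing the User-Agent

By default, requests sent by Scrapy use the User-Agent Scrapy/1.6.0 (+http://scrapy.org); the version number reflects the installed Scrapy release.

It is set by Scrapy's built-in UserAgentMiddleware, which applies the value of the USER_AGENT setting as the User-Agent header of outgoing requests.

There are two ways to change it:

1) Change the USER_AGENT setting in settings.py (recommended)

2) Set the header in the process_request() method of a Downloader Middleware

For the second approach, add the following class to middlewares.py: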

import random


class RandomUserAgentDownloaderMiddleware(object):
    def __init__(self):
        # Pool of browser User-Agent strings to choose from
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.5959.400 SLBrowser/10.0.3544.400',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
        ]

    def process_request(self, request, spider):
        # Pick a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.user_agents)

    def process_response(self, request, response, spider):
        # Demonstration only: overwrites the status code of every response
        response.status = 200
        return response
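
The header-rotation logic can be exercised without running a crawl. In the sketch below, FakeRequest is a hypothetical stand-in for scrapy.Request that carries only a headers dict, and set_random_user_agent is an invented function that mirrors the body of process_request; neither is part of Scrapy.

```python
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134',
]


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only carries a headers dict."""

    def __init__(self):
        self.headers = {}


def set_random_user_agent(request, user_agents=USER_AGENTS):
    """Overwrite the User-Agent header with a randomly chosen entry."""
    request.headers['User-Agent'] = random.choice(user_agents)


requests = [FakeRequest() for _ in range(20)]
for request in requests:
    set_random_user_agent(request)

# Collect the distinct User-Agent values that were actually assigned.
chosen = {request.headers['User-Agent'] for request in requests}
print(len(chosen))
```

Every request ends up with one of the listed strings; which one is picked varies from run to run.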

First approach:

Add the following to settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

Second approach:

Register the RandomUserAgentDownloaderMiddleware class, added to middlewares.py earlier, in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'ScrapyExample.middlewares.RandomUserAgentDownloaderMiddleware': 543,
}

Then run scrapy crawl quotes again.

