Python Study Notes: The Scrapy Crawling Framework (Part 17)

1. Installation

Run the following command to install Scrapy:

pip install scrapy

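Note:
Scrapy is written in pure Python and depends on a few key Python packages (among others):

  • lxml, an efficient XML and HTML parser
  • parsel, an HTML/XML data extraction library written on top of lxml
  • w3lib, a multi-purpose helper for dealing with URLs and web page encodings
  • twisted, an asynchronous networking framework
  • cryptography and pyOpenSSL, to deal with various network-level security needs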
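
After installing, it can be useful to confirm that these packages are actually importable. The following is a minimal sketch using only the standard library. The helper name check_dependencies is made up for illustration, and the import names in the list (for example OpenSSL for pyOpenSSL) are assumptions about how each package is imported, not something Scrapy provides for this purpose.

```python
from importlib.util import find_spec

# Import names of Scrapy and the key dependencies listed above.
# Assumption: pyOpenSSL is imported under the name "OpenSSL".
MODULES = ["scrapy", "lxml", "parsel", "w3lib", "twisted", "cryptography", "OpenSSL"]


def check_dependencies(modules=MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(f"{name:<14}{'ok' if available else 'MISSING'}")
```

Any entry reported as MISSING suggests the installation did not complete in the Python environment you are running.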

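2. Scrapy Tutorial

This tutorial walks you through the following tasks:

  • Creating a new Scrapy project
  • Writing a spider to crawl a site and extract data
  • Exporting the scraped data using the command line
  • Changing the spider to recursively follow links
  • Using spider arguments

2.1 Creating a new Scrapy project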

scrapy startproject tutorial

This creates a tutorial directory with the following contents:

scrapy.cfg            # deploy configuration file

tutorial/             # project's Python module, you'll import your code from here
    __init__.py

    items.py          # project items definition file

    middlewares.py    # project middlewares file

    pipelines.py      # project pipelines file

    settings.py       # project settings file

    spiders/          # a directory where you'll later put your spiders
        __init__.py
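
To double-check that the project was generated as expected, the layout above can be compared against the file system. This is only a sketch: the helper missing_entries is hypothetical, and the list of expected paths is copied from the listing above rather than taken from Scrapy itself.

```python
from pathlib import Path

# Entries that "scrapy startproject tutorial" is expected to create,
# relative to the project root (copied from the listing above).
EXPECTED = [
    "scrapy.cfg",
    "tutorial/__init__.py",
    "tutorial/items.py",
    "tutorial/middlewares.py",
    "tutorial/pipelines.py",
    "tutorial/settings.py",
    "tutorial/spiders/__init__.py",
]


def missing_entries(project_root):
    """Return the expected entries that do not exist under project_root."""
    root = Path(project_root)
    return [entry for entry in EXPECTED if not (root / entry).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains scrapy.cfg.
    print(missing_entries(".") or "project layout looks complete")
```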

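2.2 Defining a spider class

A spider is a class that you define and that Scrapy uses to scrape information from a website (or a group of websites). It must subclass Spider and define the initial requests to make, optionally how to follow links in the pages, and how to parse the downloaded page content to extract data.

Below is the code for our first spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory of the project: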

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

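Our spider subclasses scrapy.Spider and defines some attributes and methods:

  • name: identifies the spider. It must be unique within a project, that is, you cannot give the same name to different spiders.
  • start_requests(): must return an iterable of requests (you can return a list of requests or write a generator function) from which the spider will begin to crawl. Subsequent requests are generated successively from these initial requests.
  • parse(): the method that is called to handle the response downloaded for each request. The response parameter is an instance of TextResponse, which holds the page content and offers further helpful methods for handling it.
  • The parse() method usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests (Request) from them.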
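
The way parse() derives the output filename from the URL can be tried in isolation. The sketch below needs no Scrapy at all; page_filename is a hypothetical helper that simply repeats the two lines from parse().

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # "http://quotes.toscrape.com/page/1/".split("/") ends with
    # [..., "page", "1", ""], so index -2 is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

Note that this relies on the trailing slash: for a URL such as http://quotes.toscrape.com/page/1 the same expression would pick "page" instead of the page number.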

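2.3 Running the spider

Go to the project's top-level directory (tutorial, the one containing scrapy.cfg) and run the following command: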

scrapy crawl quotes

This command runs the spider named quotes that we just added, which sends some requests to the quotes.toscrape.com domain. You will get output similar to the following:

2020-05-22 22:38:06 [scrapy.core.engine] INFO: Spider opened
2020-05-22 22:38:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-22 22:38:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-22 22:38:08 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-22 22:38:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-22 22:38:09 [quotes] DEBUG: Saved file quotes-2.html
2020-05-22 22:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-22 22:38:09 [quotes] DEBUG: Saved file quotes-1.html
2020-05-22 22:38:09 [scrapy.core.engine] INFO: Closing spider (finished)
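
When the log gets longer, a quick tally of the "Crawled (status)" lines helps to see whether the requests succeeded. The following sketch parses log text with a regular expression. The pattern is inferred from the sample output above and is an assumption, not a documented log format, and count_statuses is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... DEBUG: Crawled (200) <GET http://...> (referer: None)
# but not the periodic "Crawled 0 pages (at 0 pages/min)" statistics line.
CRAWLED = re.compile(r"Crawled \((\d{3})\) <(\w+) (\S+)>")


def count_statuses(log_lines):
    """Count HTTP status codes found in 'Crawled (...)' log lines."""
    counts = Counter()
    for line in log_lines:
        match = CRAWLED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    "2020-05-22 22:38:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)",
    "2020-05-22 22:38:08 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)",
    "2020-05-22 22:38:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)",
    "2020-05-22 22:38:09 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)",
]
print(dict(count_statuses(sample)))
```

In the sample, the single 404 belongs to the robots.txt request, not to one of the two pages requested by the spider.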

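Now check the files in the current directory. Two new files have been created, quotes-1.html and quotes-2.html, containing the content of the respective URLs, as our parse method instructs.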

2.4 Extracting data

First, run the following command:

scrapy shell "http://quotes.toscrape.com/page/1/"

Using the shell, you can try selecting elements with CSS selectors on the response object:

In [4]: response.css('title::text').getall()
Out[4]: ['Quotes to Scrape']

In [5]: response.css('title')
Out[5]: [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

In [6]: response.css('title').getall()
Out[6]: ['<title>Quotes to Scrape</title>']

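The result of calling .getall() is a list: a selector may return more than one result, so we extract them all. When you know you only want the first result, as in this case, you can do the following: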

>>> response.css('title::text').get()
'Quotes to Scrape'

Besides the getall() and get() methods, you can also use the re() method to extract data with regular expressions:

In [7]: response.css('title::text').re(r'Quotes.*')
Out[7]: ['Quotes to Scrape']

In [8]: response.css('title::text').re(r'Q\w+')
Out[8]: ['Quotes']

In [9]: response.css('title::text').re(r'(\w+) to (\w+)')
Out[9]: ['Quotes', 'Scrape']
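
The three results above follow one rule: without capture groups, re() returns each whole match; with capture groups, it returns the captured groups flattened into a single list. The sketch below imitates that rule with the standard re module on a plain string. selector_re is a hypothetical helper written for illustration, not parsel's implementation.

```python
import re


def selector_re(pattern, text):
    """Roughly mimic Selector.re() on a single text node."""
    results = []
    for match in re.findall(pattern, text):
        if isinstance(match, tuple):
            # Several capture groups: findall yields one tuple per match,
            # which is flattened into the result list.
            results.extend(match)
        else:
            results.append(match)
    return results


title = "Quotes to Scrape"
print(selector_re(r"Quotes.*", title))        # ['Quotes to Scrape']
print(selector_re(r"Q\w+", title))            # ['Quotes']
print(selector_re(r"(\w+) to (\w+)", title))  # ['Quotes', 'Scrape']
```
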
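
The examples that follow assume that quote holds the selector for a single quote on the page, for instance the first one, obtained with response.css("div.quote")[0]:

In [13]: author = quote.css("small.author::text").get()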

In [14]: author
Out[14]: 'Albert Einstein'

Example code:

In [15]: tags = quote.css("div.tags a.tag::text").getall()

In [16]: tags
Out[16]: ['change', 'deep-thoughts', 'thinking', 'world']

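2.5 Extracting data in our spider

So far the spider does not extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.

A Scrapy spider typically generates many dictionaries containing the data extracted from the page. To do that, we use the yield Python keyword in the callback. Note that this version also replaces start_requests() with the start_urls class attribute: the default start_requests() builds the initial requests from this list and uses parse() as their callback. Change the contents of quotes_spider.py as follows: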

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
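
If yield is unfamiliar: it turns parse() into a generator, so the caller can pull items out of it one at a time. The sketch below shows the same pattern without Scrapy. fake_parse and the tuple-based rows are made up for illustration and stand in for the selector calls; the sample values are taken from quotes that appear in this post.

```python
def fake_parse(rows):
    """Yield one dict per row, like parse() yields one dict per quote."""
    for text, author, tags in rows:
        yield {
            "text": text,
            "author": author,
            "tags": tags.split(","),
        }


# Hypothetical rows standing in for the div.quote elements of a page.
rows = [
    ("I like nonsense, it wakes up the brain cells.", "Dr. Seuss", "fantasy"),
    ("I may not have gone where I intended to go.", "Douglas Adams", "life,navigation"),
]

items = fake_parse(rows)  # nothing has run yet: this is a generator object
first = next(items)       # runs the body up to the first yield
print(first["author"])    # Dr. Seuss
for item in items:        # continues after the first yield
    print(item["author"], item["tags"])
```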

If you run this spider, it will output the extracted data along with the log:

2020-05-22 23:30:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2020-05-22 23:30:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2020-05-22 23:30:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2020-05-22 23:30:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2020-05-22 23:30:16 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}

2.6 Storing the scraped data

The simplest way to store the scraped data is to use Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

This generates a quotes.json file containing all the scraped items, serialized as JSON. Its contents are:

[
{"text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe", "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"]},
{"text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling", "tags": ["courage", "friends"]},
{"text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein", "tags": ["simplicity", "understand"]},
{"text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley", "tags": ["love"]},
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams", "tags": ["life", "navigation"]},
{"text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel", "tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"]},
{"text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche", "tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"]},
{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]},
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]}
]
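
The exported file can be read back with the standard json module. The sketch below uses two entries copied from the output above as an inline sample so that it runs without the file; quotes_per_author is a hypothetical helper, and loading from quotes.json itself is only indicated in a comment.

```python
import json
from collections import Counter


def quotes_per_author(items):
    """Count how many scraped quotes each author has."""
    return Counter(item["author"] for item in items)


# Two entries copied from the output above. To read the real file instead:
#   with open("quotes.json", encoding="utf-8") as f:
#       items = json.load(f)
sample = r'''[
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]}
]'''

items = json.loads(sample)
print(items[1]["author"])        # the \u00e9 escape is decoded to a real character
print(dict(quotes_per_author(items)))
```

The \u201c and \u201d sequences in the file are ordinary JSON escapes for the curly quotation marks around each quote; json.loads turns them back into the original characters.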