Trying Out Scrapy (Part 1): Setting Up the Scrapy Environment

I have long wanted to learn Python, but reading books is slow going for me: within ten minutes I inevitably reach for my phone, and by the time I put it down I have lost my place. So instead I am going to learn the language through the well-known crawler framework Scrapy, recording the process in a series of short posts. Comments and corrections are welcome.

I. Environment Setup

I originally intended to do all this on Windows 7, but installing Scrapy there kept failing. Searching online suggested the errors were tied to the installed Visual Studio version: Python 2.7 expects VS 2008, while I had VS 2012, which would have meant uninstalling and reinstalling. Rather than fight with that, I switched to CentOS 7.

1. Install pip

1) First check whether the python-pip package is present by running yum install python-pip directly:

(Figure 1)

2) If the python-pip package is not available, run yum -y install epel-release first:

(Figure 2)

3) Once that succeeds, run yum install python-pip again:

(Figure 3)
(Figure 4)

4) Upgrade the freshly installed pip with pip install --upgrade pip:

(Figure 5)
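To confirm the upgrade took effect, you can check pip's version (the exact number printed will depend on your system):

$ pip -V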

2. Install Scrapy

On CentOS 7, installing Scrapy comes down to the single command below, with none of the Visual Studio version headaches from Windows 7:

$ pip install scrapy

With that, our basic Scrapy environment is in place. On CentOS 7 the whole process is very simple, whereas on Windows 7 I struggled for half a day before giving up.
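As a final sanity check (a minimal sketch; the version printed will vary with your install), Scrapy should now be importable from Python:

import scrapy
print(scrapy.__version__)  # prints the installed Scrapy version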

II. Running an Example

Let's test the new environment with the example from the official Scrapy documentation. In the working directory, create a file named quotes_spider.py with the following code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Each quote on the page sits in a <div class="quote"> block.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "Next" pagination link, if any, and parse it the same way.
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Then run the spider with the following command:

scrapy runspider quotes_spider.py -o quotes.json
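Note that scrapy runspider runs a standalone spider file without needing a full Scrapy project, and -o tells it where to export the scraped items. With Scrapy releases of this era, -o appends to an existing file rather than overwriting it, so a second run would leave quotes.json no longer valid JSON; remove the old file before re-running:

$ rm -f quotes.json && scrapy runspider quotes_spider.py -o quotes.json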

If there are no errors, the results will be written to quotes.json, with contents like the following:

[
{"text": "The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.", "author": "Jane Austen"},
{"text": "A day without sunshine is like, you know, night.", "author": "Steve Martin"},
{"text": "Anyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.", "author": "Garrison Keillor"},
{"text": "Beauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.", "author": "Jim Henson"},
{"text": "All you need is love. But a little chocolate now and then doesn't hurt.", "author": "Charles M. Schulz"},
{"text": "Remember, we're madly in love, so it's all right to kiss me anytime you feel like it.", "author": "Suzanne Collins"},
{"text": "Some people never go crazy. What truly horrible lives they must lead.", "author": "Charles Bukowski"},
{"text": "The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.", "author": "Terry Pratchett"},
{"text": "Think left and think right and think low and think high. Oh, the thinks you can think up if only you try!", "author": "Dr. Seuss"},
{"text": "The reason I talk to myself is because I’m the only one whose answers I accept.", "author": "George Carlin"},
{"text": "I am free of all prejudice. I hate everyone equally.", "author": "W.C. Fields"},
{"text": "A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.", "author": "Jane Austen"}
]
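To consume the exported data from Python, here is a minimal sketch (assuming a single fresh run, so quotes.json holds one valid JSON array):

import json

# Load the list of quote items the spider exported.
with open('quotes.json') as f:
    quotes = json.load(f)

print(len(quotes))          # 12 items for the humor tag
print(quotes[0]['author'])  # Jane Austen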

As you can see, implementing a simple crawler on top of the Scrapy framework is quite easy; a dozen or so lines of code are enough. I won't explain the code in detail here, since it comes straight from the Scrapy documentation. Later posts will dig further into Scrapy.
