[Part 1] Writing Your First Scrapy Spider - CSDN Blog

Article link: https://blog.csdn.net/qq_35350265/article/details/103927970

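0. Create a project folder

1. In cmd, change into the project folder and create a new Scrapy project

2. Analyze the page to crawl

3. Create the book_spider.py spider in the scrapytest project

4. Configuration file

5. Write the scrapy_start.py launcher script

6. Results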

0. Create a project folder

1. In cmd, change into the project folder and create a new Scrapy project

scrapy startproject scrapytest

2. Analyze the page to crawl

http://books.toscrape.com/

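(1) Data

Each book's information is wrapped in an <article class="product_pod"> element.

h3>a denotes an a element that is a direct child of an h3 element.

The book title is in the title attribute of the h3>a element inside the article, for example:

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

The book price is in the text of the <p class="price_color"> element inside the article, for example:

<p class="price_color">£51.77</p>

In the browser's developer tools, you can right-click an HTML element and use the context menu to copy its information.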

(2) Links

The URL of the next button is in the href attribute of the ul.pager>li.next>a element. It is a relative URL, for example:

<a href="catalogue/page-2.html">next</a>

Here the dot in ul.pager means "match a ul element whose class is pager".
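Because the href is relative, it has to be resolved against the URL of the page it was found on before it can be requested. A minimal sketch of that resolution using only the standard library follows; next_page_url is a hypothetical helper name, and the "page-3.html" href is an assumed example of what a later page might contain.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a relative next-page href against the current page URL."""
    return urljoin(page_url, href)


# From the start page, the href includes the catalogue/ prefix.
first = next_page_url("http://books.toscrape.com/", "catalogue/page-2.html")

# Assumed example: from inside catalogue/, the href is relative to that folder.
second = next_page_url(
    "http://books.toscrape.com/catalogue/page-2.html", "page-3.html"
)

print(first)
print(second)
```

The same relative href can resolve to different absolute URLs depending on the current page, which is why the resolution should always use the URL of the response being parsed.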

3. Create the book_spider.py spider in the scrapytest project

book_spider.py

import scrapy

class BooksSpider(scrapy.Spider):
    # Unique identifier of the spider, defined as a class attribute
    name = "books"

    # Starting points of the crawl; there can be several, here there is only one
    start_urls = ["http://books.toscrape.com/"]
    
    def parse(self, response):
        # Extract data
        # Each book's information is in an <article class="product_pod"> element;
        # use css() to find all such article elements and iterate over them
        for book in response.css("article.product_pod"):
            # The title is in the title attribute of the article>h3>a element,
            # e.g. <a href="catalogue/a-light-in-the-attic_1000/index.html"
            # title="A Light in the Attic">A Light in the ...</a>
            name = book.xpath("./h3/a/@title").extract_first()

            # The price is in the text of the <p class="price_color"> element,
            # e.g. <p class="price_color">£51.77</p>
            price = book.css('p.price_color::text').extract_first()
            # Build the item from the extracted data
            yield {
                "name" : name,
                "price" : price,
            }
            
        # Extract links
        # The next-page URL is in the ul.pager>li.next>a element,
        # e.g. <li class="next"><a href="catalogue/page-2.html">next</a></li>
        next_url = response.css("ul.pager li.next a::attr(href)").extract_first()
        if next_url:
            # If a next-page URL was found, resolve it to an absolute URL and build a new Request
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)

4. Configuration file

settings.py

BOT_NAME = 'scrapytest'

SPIDER_MODULES = ['scrapytest.spiders']
NEWSPIDER_MODULE = 'scrapytest.spiders'
ROBOTSTXT_OBEY = True
# Export the scraped items to a CSV file
FEED_URI="./%(name)s_%(time)s.csv"
FEED_FORMAT="csv"
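In Scrapy 2.1 and later, FEED_URI and FEED_FORMAT are deprecated in favor of a single FEEDS dictionary. Assuming such a version is installed, the two export lines above could be written roughly as follows; this is a sketch of an equivalent setting, not part of the original project.

```python
# settings.py fragment (assumes Scrapy 2.1 or newer)
# Each key is an output URI; %(name)s and %(time)s are filled in by Scrapy
# with the spider name and the crawl timestamp.
FEEDS = {
    "./%(name)s_%(time)s.csv": {
        "format": "csv",
    },
}
```

On older Scrapy versions the FEED_URI and FEED_FORMAT pair shown above remains the way to configure the export.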

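5. Write the scrapy_start.py launcher script

The command-line form for running a spider is:

scrapy crawl <spider name>

With the script below, each run only requires executing scrapy_start.py; there is no need to type the run command on the command line.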

from scrapy import cmdline
 
cmdline.execute("scrapy crawl books".split())