scrapy - My first Scrapy spider

山河锦绣放眼好风光

Category: Notes. Tags: python

Article link: https://blog.csdn.net/weixin_47249161/article/details/114338805

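import scrapy
from scrapy.linkextractors import LinkExtractor  # LinkExtractor describes which links to pull from a page; extract_links() returns the matching links
from ..items import BaiduyueduItem  # the custom Item class defined in items.py, used to hold the scraped data
# ".." is a relative import from the parent package: items.py sits one level above the spiders folder that contains this file


class BaiduyueduSpider(scrapy.Spider):
    name = 'baiduyuedu'  # spider name
    start_urls = ['https://yuedu.baidu.com/rank/hotsale?pn=0']  # page where the crawl starts

    def parse(self, response):
        # Parses listing pages 1 through n. Each book link is handed to parse_book(),
        # which stores the book's details in a BaiduyueduItem (the fields that
        # appear in the final exported file).
        le = LinkExtractor(restrict_css='a.al.title-link')  # extraction rule
        for link in le.extract_links(response):  # extract links from the response according to the rule in le
            yield scrapy.Request(link.url, callback=self.parse_book)  # request the book page and parse it with parse_book
        url = response.css('div.pager a.next::attr(href)').extract_first()  # link to the next page
        if url:  # only if a next page exists
            url = response.urljoin(url)  # resolve the relative link against the URL of the current response
            yield scrapy.Request(url, callback=self.parse)  # request the next page and parse it with this same method

    # Parse the detail page of a single book.
    # This must be a method of the class (not nested inside parse) so that self.parse_book resolves.
    def parse_book(self, response):
        item = BaiduyueduItem()  # create the item that stores the fields
        sel = response.css('div.content-block')  # narrow the search to the main content area
        item['name'] = sel.css('h1.book-title::attr(title)').extract_first()  # first match goes into the name field
        item['rating'] = sel.css('div.doc-info-score span.doc-info-score-value::text').extract_first()
        item['authors'] = sel.css('ul li.doc-info-author a::text').extract_first()

        yield item  # hand the item to the engine, which passes it on to the pipelines and exporters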
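
The pagination step in parse relies on turning a relative "next page" link into an absolute URL. As a rough illustration of that idea, the sketch below uses the standard library's urljoin on the start URL from the spider. The helper name next_page_url and the two href values are assumptions made up for the example; the real links on the site may look different.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Return the absolute URL of the next page, or None when no link was found."""
    if not href:
        # Mirrors the `if url:` check in the spider: no next link, stop paginating.
        return None
    return urljoin(page_url, href)


current = "https://yuedu.baidu.com/rank/hotsale?pn=0"

# Hypothetical href values, used only to show how relative links are resolved.
print(next_page_url(current, "/rank/hotsale?pn=20"))  # path-absolute link
print(next_page_url(current, "?pn=20"))               # query-only link
print(next_page_url(current, None))                   # last page: no link
```

Both hypothetical links resolve to https://yuedu.baidu.com/rank/hotsale?pn=20, and the missing link yields None, which is the case where the spider stops requesting further pages.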

The code in items.py is:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduyueduItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    rating = scrapy.Field()
    authors = scrapy.Field()
    # publisher = scrapy.Field()
    # tags = scrapy.Field()
    # price = scrapy.Field()

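Note: if you choose not to follow the robots.txt convention (the "gentlemen's agreement"), you can add the following line to the project's settings file (settings.py). The Baidu Yuedu site allows requests with this User-Agent to crawl the content we target.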

USER_AGENT = 'Baiduspider'
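
To see why the User-Agent name matters, here is a minimal sketch using Python's standard urllib.robotparser rather than Scrapy's own robots.txt handling. The robots.txt text below is a made-up sample, not the actual file served by yuedu.baidu.com, and is_allowed and "MyCrawler/1.0" are hypothetical names chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, NOT copied from the real site):
# one named crawler is allowed everywhere, every other crawler is blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: Baiduspider
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=SAMPLE_ROBOTS_TXT):
    """Check whether the given user agent may fetch the URL under the sample rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://yuedu.baidu.com/rank/hotsale?pn=0"
print(is_allowed("Baiduspider", url))    # matches the named entry
print(is_allowed("MyCrawler/1.0", url))  # falls through to the "*" entry
```

Under these sample rules the first check prints True and the second prints False, which is the kind of difference the USER_AGENT setting above takes advantage of.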
