A Scrapy Rules Example

For the common list-page + detail-page pattern, Rules let you constrain which links get crawled: a CrawlSpider collects every hyperlink on the start_urls pages and then filters them against the conditions you define.
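As an illustration of those filtering options, a LinkExtractor can be narrowed with allow/deny URL regexes or restricted to one region of the page. A minimal sketch (the patterns below are illustrative placeholders, not the ones used in this case):

```
from scrapy.linkextractors import LinkExtractor

# Illustrative filters: keep detail-page URLs, drop pagination URLs,
# and only consider links inside the main content area.
extractor = LinkExtractor(
    allow=(r"/subject/\d+/",),                  # URL must match one of these regexes
    deny=(r"\?start=\d+",),                     # URL must not match any of these
    restrict_xpaths=("//div[@id='content']",),  # only extract links from this region
)
# extractor.extract_links(response) returns the Link objects that survive the filters.
```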

Example: scrape the title, price, and rating of new books on Douban.

The core spider code:

```
import re

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example'
    # allowed_domains = ['example.com']
    start_urls = ['https://book.douban.com/']

    rules = (
        # From the homepage, extract only links to new-book detail pages
        # (marked by the icn=index-latestbook-subject query string), hand
        # each one to parse_item, and do not crawl any deeper.
        Rule(
            LinkExtractor(allow=(r"https://book\.douban\.com/subject/\d+/\?icn=index-latestbook-subject",)),
            callback='parse_item',
            follow=False,
        ),
    )

    def parse_item(self, response):
        self.logger.info(f"Crawled URL: {response.url}")
        html = response.text

        # Title: the <span> inside the page's <h1>.
        title = response.xpath("//h1/span/text()").get()

        # Price: the text between the "定价:" label and the following <br/>;
        # normalize it so it always ends with "元".
        price = re.findall(r'<span class="pl">定价:</span>(.*?)<br/>', html)
        if price:
            price = price[0].strip()
            if "元" not in price:
                price += "元"
        else:
            price = "未知"

        # Rating: the <strong property="v:average"> element holds the score.
        rating_num = response.xpath(
            "//strong[@property='v:average']/text()"
        ).get(default="").strip()

        print(title, price, rating_num)
```
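The spider above is self-contained, so it can be run without creating a full Scrapy project via scrapy runspider (the filename here is just an illustration):

```
scrapy runspider example_spider.py
```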

Result: one `title price rating` line on the console for each crawled book detail page.

Below is a simple example that uses Scrapy and MongoDB together: it crawls the Douban Movie Top 250 and stores the data in a MongoDB database.

1. Install Scrapy and pymongo:

```
pip install scrapy pymongo
```

2. Create a Scrapy project:

```
scrapy startproject douban
```

3. Configure MongoDB in the `settings.py` file:

```
MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'douban'
MONGODB_COLLECTION = 'movies'
```

4. Define the fields to scrape in `items.py`:

```
import scrapy

class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    director = scrapy.Field()
    actors = scrapy.Field()
    year = scrapy.Field()
    country = scrapy.Field()
    category = scrapy.Field()
```

5. Define the spider in a file named `douban_spider.py`:

```
from douban.items import DoubanItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DoubanSpider(CrawlSpider):
    name = 'douban'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    rules = (
        # Movie detail pages: parse them for data.
        Rule(LinkExtractor(allow=(r'subject/\d+/$',)), callback='parse_item'),
        # Pagination links: follow them so all 250 entries are reached.
        Rule(LinkExtractor(allow=(r'top250\?start=\d+',)), follow=True),
    )

    def parse_item(self, response):
        item = DoubanItem()
        item['title'] = response.css('h1 span::text').get()
        item['rating'] = response.css('strong.rating_num::text').get()
        item['director'] = response.css('a[rel="v:directedBy"]::text').get()
        item['actors'] = response.css('a[rel="v:starring"]::text').getall()
        item['year'] = response.css('span.year::text').get()
        item['country'] = response.css('span[property="v:initialReleaseDate"]::text').re_first(r'(\S+)\s+\(\S+\)')
        item['category'] = response.css('span[property="v:genre"]::text').getall()
        yield item
```

6. Run the spider:

```
scrapy crawl douban
```

7. Inspect the data in MongoDB:

```
> use douban
> db.movies.find().pretty()
```
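One gap worth flagging: step 3 only defines settings, and nothing reads them unless an item pipeline is enabled, so as written the scraped items are never actually stored in MongoDB. A minimal pipeline sketch that closes the gap, assuming it lives in the project's `pipelines.py` (the class name `MongoPipeline` is an assumption, not from the original walkthrough):

```
import pymongo

class MongoPipeline:
    def open_spider(self, spider):
        # Connect using the values configured in settings.py (step 3).
        self.client = pymongo.MongoClient(
            spider.settings.get('MONGODB_HOST'),
            spider.settings.get('MONGODB_PORT'),
        )
        db = self.client[spider.settings.get('MONGODB_DBNAME')]
        self.collection = db[spider.settings.get('MONGODB_COLLECTION')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # One MongoDB document per scraped movie.
        self.collection.insert_one(dict(item))
        return item
```

It then has to be enabled in `settings.py`:

```
ITEM_PIPELINES = {'douban.pipelines.MongoPipeline': 300}
```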