Python Web Scraping: scrapy_crawlspider on dushu.com (读书网)

错过人间飞鸿

Last modified 2023-10-11 19:05:52

Category: Python Web Scraping. Tags: python, web scraping, scrapy

First published 2023-08-19 16:20:41

Article link: https://blog.csdn.net/m0_63757342/article/details/132381128

Create the CrawlSpider spider file:

scrapy genspider -t crawl <spider_name> <domain_to_crawl>

scrapy genspider -t crawl read https://www.dushu.com/book/1206.html

LinkExtractor: the link extractor tells the spider which links to pull out of each crawled page. A Request object is generated automatically for every extracted link.

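from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# ScarpyReadbook41Item is defined in the project's items.py; import it from there
# (the module path depends on the project name, which this post does not show).


class ReadSpider(CrawlSpider):
    name = "read"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/book/1206_1.html"]
    # The LinkExtractor tells the spider which links to extract from each crawled page;
    # a Request object is generated automatically for every extracted link.
    rules = (Rule(LinkExtractor(allow=r"/book/1206_\d+\.html"), callback="parse_item", follow=False),)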

    def parse_item(self, response):
        # The book title is the cover image's alt text; the image URL is read
        # from data-original rather than src.
        name_list = response.xpath('//div[@class="book-info"]//img/@alt')
        src_list = response.xpath('//div[@class="book-info"]//img/@data-original')

        # Pair each title with its cover URL and yield one item per book.
        for name_sel, src_sel in zip(name_list, src_list):
            book = ScarpyReadbook41Item(name=name_sel.get(), src=src_sel.get())
            yield book
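
To make the effect of the allow pattern concrete, here is a rough standard-library approximation of what a link extractor does: collect every href on a page, resolve it against the page URL, and keep only the URLs that match the regex. This is a sketch and not Scrapy's implementation. HrefCollector, extract_links and the sample markup are made up for illustration; the markup is not copied from the real site.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, allow):
    """Return de-duplicated absolute URLs that match the allow regex."""
    collector = HrefCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    links = []
    for href in collector.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if pattern.search(url) and url not in links:
            links.append(url)
    return links


# Hypothetical pagination markup, assumed for this example only.
SAMPLE_HTML = """
<div class="pages">
  <a href="/book/1206_2.html">2</a>
  <a href="/book/1206_3.html">3</a>
  <a href="/book/1207_1.html">another category</a>
  <a href="/about.html">about</a>
</div>
"""

links = extract_links(
    SAMPLE_HTML,
    "https://www.dushu.com/book/1206_1.html",
    r"/book/1206_\d+\.html",
)
print(links)
```

Only the two /book/1206_N.html links survive the filter. In the spider, each surviving link becomes a Request whose response is handed to parse_item.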

Enable the pipeline (through ITEM_PIPELINES in settings.py) and write the items to a file:

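class ScarpyReadbook41Pipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.fp = open('books.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every yielded item. str(item) writes the item's text
        # representation, so the file is not strict JSON.
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the output file.
        self.fp.close()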
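
If the output needs to be machine-readable, one possible variant is to serialize each item with the json module and write one object per line (JSON Lines). The sketch below is an assumption-laden alternative, not the pipeline used in this post: JsonLinesPipeline, item_to_line and the books.jsonl file name are hypothetical names.

```python
import json


def item_to_line(item):
    """Serialize one item (a scrapy.Item or a plain dict) as a single JSON line."""
    # ensure_ascii=False keeps Chinese book titles readable in the file.
    return json.dumps(dict(item), ensure_ascii=False)


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="books.jsonl"):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.fp.write(item_to_line(item) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        self.fp.close()
```

Each line of the resulting file can then be loaded back with json.loads.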

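After running the spider, the first page's data turned out to be missing.
The rule's pattern /book/1206_\d+\.html only matches URLs of the form /book/1206_N.html, which /book/1206.html is not. Add _1 to the URL in start_urls; otherwise the first page's data is not read.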

start_urls = ["https://www.dushu.com/book/1206_1.html"]
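
A quick way to see the difference between the two start URLs is to test them against the rule's regex with the re module. This is only a sanity check of the pattern, under the assumption that the regex is searched against the full URL; matches_rule is a hypothetical helper name.

```python
import re

# Same pattern as the allow argument of the spider's Rule.
ALLOW = re.compile(r"/book/1206_\d+\.html")


def matches_rule(url):
    """Return True if the URL contains a match for the allow pattern."""
    return ALLOW.search(url) is not None


for url in (
    "https://www.dushu.com/book/1206.html",
    "https://www.dushu.com/book/1206_1.html",
):
    print(url, matches_rule(url))
```

The URL without the _1 suffix does not match, while the one with the suffix does.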
