import scrapy
from urllib import request
class bookSpider(scrapy.Spider):
    """Crawl a Qidian novel chapter by chapter, appending each chapter's
    text to a .txt file named after the chapter title.

    Starting from ``start_urls``, each response is parsed for the chapter
    body and the "next chapter" link; the spider follows that link until
    the last chapter, where the link is absent and the crawl ends.
    """
    name = "bookSpider"
    start_urls = ['https://read.qidian.com/chapter/sMwmRYRKF1KLTMDvzUJZaQ2/eGngSvaVqnlOBDFlr9quQA2']

    def parse(self, response):
        """Save the chapter text of *response* and follow the next-chapter link.

        Uses ``extract_first()`` instead of ``extract()[0]`` so that a
        missing node (e.g. no next link on the final chapter) yields
        ``None`` instead of raising IndexError.
        """
        divs = response.xpath('//*[@class="read-content j_readContent"]')
        # Concatenate the stripped text of every <p> in the chapter body.
        # join() is linear; repeated `+=` in a loop would be quadratic.
        zhangjie = "".join(p.extract().strip() for p in divs.xpath('.//p/text()'))
        # Chapter title becomes the output file name. extract_first()
        # returns None when the node is missing, so guard before using it.
        chaptername = response.xpath('//*[@class="j_chapterName"]/text()').extract_first()
        if chaptername:
            fileName = chaptername + ".txt"
            # Explicit encoding so Chinese text is written correctly on any
            # platform; `with` closes the file — no manual close() needed.
            with open(fileName, "a", encoding="utf-8") as f:
                f.write(zhangjie)
                f.write('\n')
        # Follow the next chapter automatically. On the last chapter the
        # xpath matches nothing, extract_first() returns None, and the
        # crawl terminates cleanly instead of raising IndexError.
        next_url = response.xpath('//*[@id="j_chapterNext"]/@href').extract_first()
        if next_url is not None:
            # response.urljoin resolves relative links against response.url,
            # replacing the non-public urllib.request.urljoin re-export.
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
补充:由于在最后一章会找不到next_url的xpath,所以在最后一章爬下来之后,会出现
next_url = response.xpath('//*[@id="j_chapterNext"]/@href').extract()[0]
IndexError: list index out of range
为了避免 xpath 结果为空时的 IndexError,将上方的代码改为
chaptername = response.xpath('//*[@class="j_chapterName"]/text()').extract_first()
next_url = response.xpath('//*[@id="j_chapterNext"]/@href').extract_first()
extract_first() 在结果为空时返回 None 而不是抛出异常,因此后面的 if next_url is not None 判断能让爬虫在最后一章正常终止。注意 chaptername 为 None 时也要先判空再拼接 ".txt",否则会报 TypeError。