Scraping a Novel with Scrapy

LAN_KINGDOM

Last modified 2024-07-13 15:28:05


Tags: python scrapy

First published 2024-07-13 15:23:01

Article link: https://blog.csdn.net/qq_39000057/article/details/140399976


Create a Scrapy project
Write the main spider code
Configure the settings file
Write the pipelines code
Target site: 顶点小说_顶点中文免费阅读网 (ddyueshu.com)
Crawl target

1. Create a Scrapy project

scrapy startproject dingdianxiaoshuo

Change into the project directory

cd <project path>

Create a CrawlSpider

scrapy genspider -t crawl <spider name> <allowed domain>

scrapy genspider -t crawl xs ddyueshu.com

Open the project in a VS Code window

code .

2. Write the xs.py file

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XsSpider(CrawlSpider):
    """Crawl a novel's chapters with Scrapy's CrawlSpider class."""

    name = "xs"
    allowed_domains = ["ddyueshu.com"]
    # Start on the novel's index page, not on the first chapter itself.
    start_urls = ["https://www.ddyueshu.com/1_1651/"]

    rules = (
        # Link to the first chapter, found on the index page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='box_con']/div/dl/dd[7]/a"), callback="parse_item", follow=True),
        # "Next chapter" link, found on every chapter page.
        Rule(LinkExtractor(restrict_xpaths="//div[@class='bottem2']/a[3]"), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        # The title is a single string; the content is a list of text nodes.
        title = response.xpath("//div[@class='bookname']/h1/text()").get()
        content = response.xpath("//div[@id='content']/text()").getall()

        yield {
            "title": title,
            "content": content,
        }
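To see why the title comes back as one string while the content comes back as a list, here is a minimal sketch that uses the standard library's ElementTree as a stand-in for Scrapy's selectors. The HTML snippet is a made-up assumption that only mimics the structure implied by the XPath expressions above; the real page may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified chapter page (an assumption, not the real markup).
html = """<html><body>
<div class="bookname"><h1>Chapter 1</h1></div>
<div id="content">first line<br/>second line<br/>third line</div>
</body></html>"""

root = ET.fromstring(html)

# One <h1> element, so one string.
title = root.find(".//div[@class='bookname']/h1").text

# The <br/> tags split the body into several text nodes, so we get a list.
content = list(root.find(".//div[@id='content']").itertext())

print(title)
print(content)
```

Each text node separated by a `<br/>` becomes its own list entry, which is why the content has to be joined into a single string before it is written to a file.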

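The two Rules locate, by XPath, the link to the novel's first chapter and the next-chapter link, respectively. A CrawlSpider only runs a rule's callback on pages reached through links that the rule extracts; the responses for start_urls are handed to parse_start_url, which returns nothing by default. So if the crawl starts on the first chapter's own page (第一章外门弟子_修罗武神_玄幻小说_顶点小说 (ddyueshu.com)), the first chapter's data is lost. Start one level up instead, on the novel's index page (修罗武神最新章节_修罗武神无弹窗全文阅读_顶点小说 (ddyueshu.com)).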
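The effect of the start page can be shown with a toy model. This is not Scrapy code, only a plain-Python illustration; the page names and the link table are hypothetical. As in a CrawlSpider, a page counts as parsed only when it was reached through an extracted link.

```python
# Toy link table (hypothetical): each page maps to the pages its rules would extract.
PAGES = {
    "index": ["ch1"],   # index page links to the first chapter
    "ch1": ["ch2"],     # each chapter links to the next one
    "ch2": ["ch3"],
    "ch3": [],
}


def crawl(start):
    """Return the pages that would reach the callback when starting at `start`."""
    parsed = []
    seen = {start}
    queue = [start]
    while queue:
        page = queue.pop(0)
        for target in PAGES[page]:
            if target not in seen:
                seen.add(target)
                parsed.append(target)  # the callback fires for extracted links only
                queue.append(target)
    return parsed


from_index = crawl("index")
from_ch1 = crawl("ch1")
print(from_index)
print(from_ch1)
```

Starting from the index page, all three chapters are parsed; starting from the first chapter, that chapter is skipped because no rule ever extracted a link to it. Another option, not used in this article, is to override CrawlSpider's parse_start_url hook so the start page is parsed as well.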

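3. Configure the settings.py file

Add a User-Agent, comment out the robots.txt setting, wait 2 seconds between page requests, and enable the item pipeline.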
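The four changes might look roughly like this in settings.py. The User-Agent string is only a placeholder assumption (use whatever your browser sends), and the priority 300 is the value the project template suggests; the pipeline path follows from the project and class names used in this article.

```python
# settings.py (excerpt)

# Placeholder User-Agent; an assumption, replace it with a real browser string.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

# Commented out, so Scrapy falls back to its built-in default (False).
# ROBOTSTXT_OBEY = True

# Wait 2 seconds between requests.
DOWNLOAD_DELAY = 2

# Enable the pipeline defined in pipelines.py.
ITEM_PIPELINES = {
    "dingdianxiaoshuo.pipelines.DingdianxiaoshuoPipeline": 300,
}
```

Writing `ROBOTSTXT_OBEY = False` explicitly has the same effect as commenting the line out and makes the intent easier to read.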

4. Write the pipelines.py file

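import re


class DingdianxiaoshuoPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts.
        self.file = open('xiaoshuo.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # The title is None if the XPath matched nothing; write an empty line then.
        # The trailing newline keeps the title on its own line.
        self.file.write((item['title'] or '') + '\n')
        # item['content'] is a list, so join it into one string before writing,
        # and turn runs of tabs into line breaks.
        self.file.write(re.sub(r'\t+', '\n', ''.join(item['content'])) + '\n\n\n')
        self.file.flush()
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes.
        self.file.close()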
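The join-and-substitute step in process_item can be tried on its own. A minimal sketch follows; the helper name clean_content and the sample list are hypothetical and exist only for this illustration.

```python
import re


def clean_content(parts):
    """Join the extracted text nodes and turn runs of tabs into line breaks."""
    return re.sub(r"\t+", "\n", "".join(parts))


# Hypothetical sample of what item['content'] might hold.
sample = ["    First paragraph.\t\t", "    Second paragraph."]
cleaned = clean_content(sample)
print(repr(cleaned))
```

The list is concatenated first, so the substitution sees one string and each run of tabs collapses into a single line break, however many tabs it contains.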

Result