Crawling a million-word novel with Scrapy

Latest recommended article published on 2022-04-20 13:31:33

pjiang000

Category: Web crawlers

Article link: https://blog.csdn.net/weixin_44412864/article/details/89790070

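# Worked example: crawling a novel
Step 1: create a Scrapy project [on the command line, run scrapy startproject demo]
Step 2: change into the project directory and create a spider [scrapy genspider nss zhuaji.org]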
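①: the nss file
Step 3: in the spiders directory, open nss.py:
I: it contains a class NssSpider(scrapy.Spider) with three class attributes [name, allowed_domains, start_urls]
II: it has a parse method that parses the response [response.xpath('...').extract_first() for the first match, or .extract() for all matches, is enough]
III: yield the scraped data as a dict {}
IV: extract the URL of the next page
V: request that URL with parse as the callback again [yield scrapy.Request(next_url, callback=self.parse)] [note: parse has no (), because the method itself is passed, not its result]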
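The flow in steps III to V can be tried out without Scrapy. The following is only a toy sketch: PAGES, Request and crawl are hypothetical stand-ins for the website, scrapy.Request and the Scrapy engine, and the toy parse receives a URL where a real spider receives a response object. It shows why the callback is written without parentheses: the function is stored in the request and called later by the engine loop.

```python
# Toy stand-in for a website: each page has a title and a link to the next page.
PAGES = {
    "/1.html": {"title": "Chapter 1", "next": "/2.html"},
    "/2.html": {"title": "Chapter 2", "next": "/3.html"},
    "/3.html": {"title": "Chapter 3", "next": None},
}


class Request:
    """Hypothetical stand-in for scrapy.Request: a URL plus a function to call later."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback  # stored here, not called


def parse(url):
    page = PAGES[url]
    yield {"title": page["title"]}               # step III: yield the item as a dict
    next_url = page["next"]                      # step IV: find the next URL
    if next_url is not None:
        yield Request(next_url, callback=parse)  # step V: pass parse itself, no ()


def crawl(start_url):
    """Toy engine loop: collects dicts and follows Requests via their callback."""
    items = []
    pending = [Request(start_url, callback=parse)]
    while pending:
        request = pending.pop(0)
        for result in request.callback(request.url):
            if isinstance(result, Request):
                pending.append(result)
            else:
                items.append(result)
    return items


titles = [item["title"] for item in crawl("/1.html")]
print(titles)
```

Writing callback=parse() instead would run parse immediately and hand the engine a generator rather than a function it can call once the next page has been fetched.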
②: the main file [the launcher script, built on the execute function] (the file name can be anything)
[Standard code]:

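from scrapy.cmdline import execute

# Run the spider; the last word must match NssSpider.name ('nss')
execute("scrapy crawl nss".split())
# Or, equivalently, pass the argument list directly.
# Use only one of the two forms: execute() exits the process when the crawl ends.
# execute(["scrapy", "crawl", "nss"])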

③: the pipelines file
Step 1: create and open a file
Step 2: write each item to it
Step 3: close the file

④: the settings file
USER_AGENT: must be overridden with your own value
ROBOTSTXT_OBEY: False
ITEM_PIPELINES: enable it (it is commented out by default [Ctrl+/ toggles the comment])
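As an illustration, the three settings might look like this in settings.py. This is a sketch under assumptions: the User-Agent string is only an example value, and the dotted path "demo.pipelines.Demo3Pipeline" assumes the project package is called demo and the pipeline class is called Demo3Pipeline; it has to match the names in your own project.

```python
# settings.py (fragment)

# Example browser-like User-Agent; any current browser string can be used instead.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Do not fetch and apply robots.txt rules before crawling.
ROBOTSTXT_OBEY = False

# Uncommented so that items are sent to the pipeline.
# Assumed path: <project package>.pipelines.<pipeline class>
ITEM_PIPELINES = {
    "demo.pipelines.Demo3Pipeline": 300,
}
```

The number is the pipeline's order value: when several pipelines are enabled, items pass through them from the lowest number to the highest.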

The full code for crawling the novel follows:
I: the nss file

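import scrapy


class NssSpider(scrapy.Spider):
    name = 'nss'  # used on the command line: scrapy crawl nss
    # allowed_domains = ['zhuaji.org']
    start_urls = ['https://www.zhuaji.org/read/785/320784.html']

    def parse(self, response):
        # Chapter title: first matching text node (None if nothing matches)
        title = response.xpath('//div[@class="title"]/h1/text()').extract_first()
        # Chapter body: join every text node inside the content div
        content = ''.join(response.xpath('//div[@id="content"]/text()').extract())

        yield {
            "title": title,
            "content": content,
        }

        # The 4th link in the pager is taken to be the "next page" link
        next_url = response.xpath('//div[@class="page"]/a[4]/@href').extract_first()
        if next_url:  # stop when no next link is found
            # urljoin resolves a relative href against the current page URL
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)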
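How the next-page href turns into a full URL can be checked with the standard library alone. The sketch below uses urllib.parse.urljoin; the next-chapter file name 320785.html is a made-up value for illustration, not taken from the site. Prefixing the site root, as in 'https://www.zhuaji.org{0}'.format(next_url), only gives a valid URL for the first of the three href shapes, while joining against the current page URL handles all of them.

```python
from urllib.parse import urljoin

page_url = "https://www.zhuaji.org/read/785/320784.html"

# Root-relative href: the shape that plain prefixing with the site root assumes
root_relative = urljoin(page_url, "/read/785/320785.html")

# Page-relative href: resolved against the directory of the current page
page_relative = urljoin(page_url, "320785.html")

# Absolute href: returned unchanged
absolute = urljoin(page_url, "https://www.zhuaji.org/read/785/320785.html")

print(root_relative)
print(page_relative)
print(absolute)
```

Inside a spider, response.urljoin(next_url) does the same resolution against the URL of the page being parsed.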

II: the pipelines file:

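class Demo3Pipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: create and open the output file
        self.file = open("xs.txt", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields: write title and content
        title = item["title"] or ""  # extract_first() may have returned None
        content = item["content"]
        info = title + "\n" + content + "\n"
        self.file.write(info)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        # Scrapy looks for the name close_spider; a method named close_file is never called.
        self.file.close()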
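The order in which the pipeline methods run can be seen without starting a crawl by calling them by hand. This is a minimal sketch, not the project's pipeline: NovelPipeline is a hypothetical variant that takes the output path as an argument so the example can write to a temporary directory, and spider=None stands in for the spider object that Scrapy would pass. The hook names open_spider, process_item and close_spider are the ones Scrapy looks for.

```python
import os
import tempfile


class NovelPipeline:
    """Hypothetical variant of the pipeline with a configurable output path."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Step 1: create and open the file
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Step 2: write one item; a missing title is written as an empty line
        self.file.write((item["title"] or "") + "\n" + item["content"] + "\n")
        return item

    def close_spider(self, spider):
        # Step 3: close the file
        self.file.close()


out_path = os.path.join(tempfile.mkdtemp(), "xs.txt")
pipeline = NovelPipeline(out_path)

# Call the hooks in the order Scrapy would: open, once per item, close
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Chapter 1", "content": "First page."}, spider=None)
pipeline.process_item({"title": None, "content": "Untitled page."}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Because the file is opened in "w" mode, each run starts from an empty xs.txt.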

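A problem encountered while crawling the novel: once the output file grows past 2.56 MB, it is cut off automatically.
Fix: in the bin directory of the PyCharm installation, open the idea.properties file
and set the idea.max.intellisense.filesize parameter (a size in kilobytes) to 99999.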