Scrapy in Practice (Biquge)

重生之我要奋发图强

Last modified 2024-12-23 12:09:12

Tags: scrapy, web crawler

First published 2024-06-03 01:00:03

Article link: https://blog.csdn.net/weixin_51885096/article/details/139399761

1. Project overview

Today we learn the Scrapy framework by using it to crawl a novel from Biquge.

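2. Results

3. Implementation steps

1. Install the scrapy third-party package

pip install scrapy

2. Create the Scrapy project (novel_spider is the project name)

scrapy startproject novel_spider

3. Enter the project directory

cd novel_spider

4. Create the spider (biquge is the spider name, bigee.cc is the domain to crawl)

scrapy genspider biquge bigee.cc

5. Edit biquge.py

start_urls holds the page the crawl starts from, here the first chapter of a novel (you can browse Biquge and pick a novel yourself).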

import scrapy

class BiqugeSpider(scrapy.Spider):
    name = "biquge"
    allowed_domains = ["bigee.cc"]
    start_urls = ["https://www.bigee.cc/book/6909/1.html"]

    def parse(self, response):
        # Chapter title
        title = response.xpath("//div[@class='content']//h1[@class='wap_none']/text()").get()
        # Chapter body: a list of text fragments
        content = response.xpath("//div[@class='Readarea ReadAjax_content']/text()").getall()
        # Link to the next chapter
        next_url = response.xpath("//div[@class='Readpage pagedown']//a[3]/@href").get()

        # Send the result to the pipeline
        yield {
            'title': title,
            'content': content,
        }
        # Request the next chapter and parse it with this same method.
        # response.follow resolves relative links; the guard stops the crawl
        # cleanly when no next-chapter link is found.
        if next_url:
            yield response.follow(next_url, callback=self.parse)
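The next-chapter href has to be turned into a full URL before it can be requested. The following is a minimal sketch of that resolution using only the standard library; the two href values are made up for illustration, since the real site may return either form.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (taken from start_urls above).
page_url = "https://www.bigee.cc/book/6909/1.html"

# Hypothetical href values; the real site may use either form.
root_relative = "/book/6909/2.html"
page_relative = "2.html"

# Plain string concatenation produces a doubled slash for a root-relative href.
naive = "https://www.bigee.cc/" + root_relative

# urljoin resolves both forms against the current page URL.
resolved_root = urljoin(page_url, root_relative)
resolved_page = urljoin(page_url, page_relative)

print(naive)
print(resolved_root)
print(resolved_page)
```

Inside a spider, response.urljoin(href) and response.follow(href, callback=...) perform this resolution against the URL of the response, so the domain does not need to be hard-coded.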

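XPath tutorial (ultra-simplified version):

Hands-on demo (install the XPath Helper browser extension beforehand)

XPath selects nodes by their position in the page's element tree and by attributes such as class; it is a different syntax from CSS selectors. By comparing the expressions with the three regions marked in the three screenshots above, simple XPath extraction is easy to pick up. Note that /text() returns the text content of a node, while /@href returns the value of its href attribute, i.e. its link.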

6. The item pipeline (pipelines.py)

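class NovelSpiderPipeline:
    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.file = open('novel.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the title (empty if none was found), then the joined content
        self.file.write((item['title'] or '') + '\n')
        self.file.write(''.join(item['content']) + '\n\n\n')
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()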

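In Python a string cannot be concatenated with a list, and content is a list of text fragments, so join is used to merge it into a single string before writing. Finally, remember to uncomment ITEM_PIPELINES in settings.py so that the pipeline is enabled.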
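The point about join can be checked in isolation. A minimal sketch with made-up values standing in for one scraped item:

```python
title = "Chapter 1"
# Stand-in for the list of text fragments produced by the content XPath.
content = ["First paragraph.", "Second paragraph."]

error = None
try:
    line = title + content  # str + list is not defined in Python
except TypeError as exc:
    error = str(exc)

# join merges the fragments into one string that can be written to a file.
text = "".join(content)

print(error)
print(text)
```

Using "\n".join(content) instead would keep each fragment on its own line in novel.txt, which may read better depending on how the site splits paragraphs.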

7. Run the project

Create a begin.py file in the project root (the directory that contains scrapy.cfg)

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'biquge'])

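Run begin.py to start the crawl; it is equivalent to running scrapy crawl biquge from the project directory.

4. Summary

1. Master the basic architecture of the Scrapy framework;

2. Study XPath in depth and practice it;

3. Store the data in a MySQL database through pipelines.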
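For the third point, a database pipeline has the same three methods as the file pipeline in step 6. The sketch below deliberately swaps MySQL for the standard library's sqlite3 so that it runs without a database server; the class name SqliteNovelPipeline, the table name chapters, and the db_path parameter are all hypothetical and do not come from the original project.

```python
import sqlite3


class SqliteNovelPipeline:
    """Hypothetical pipeline: stores chapters in SQLite as a stand-in for MySQL."""

    def __init__(self, db_path="novel.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        # Connect once per crawl and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS chapters (title TEXT, content TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query; the content fragments are joined into one string.
        self.conn.execute(
            "INSERT INTO chapters (title, content) VALUES (?, ?)",
            (item["title"], "".join(item["content"])),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

With a MySQL driver such as pymysql, the connection call would change and the placeholders are typically %s rather than ?; the overall shape of the pipeline should stay the same. As with the file pipeline, the class has to be listed in ITEM_PIPELINES in settings.py before Scrapy will call it.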