Target URL: https://www.zwdu.com/book/7846/195393.html
1. Create the Scrapy project
Run the following on the command line:
scrapy startproject xiaoshuo
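For reference, the command should generate a project skeleton roughly like this (the default Scrapy layout):
xiaoshuo/
    scrapy.cfg
    xiaoshuo/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py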
2. Generate the spider file
Scrapy does not allow a spider to share the project's name, so the spider is named zww here (this is also the name used by the crawl command in step 7):
scrapy genspider zww zwdu.com
3. Edit the generated spider and point start_urls at the first chapter:
start_urls = ['https://www.zwdu.com/book/7846/195393.html']
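Put together, the generated spider (xiaoshuo/spiders/zww.py) should look roughly like this after the edit; the parse method is filled in at step 5:

import scrapy

class ZwwSpider(scrapy.Spider):
    name = 'zww'  # this is the name passed to `scrapy crawl`
    allowed_domains = ['zwdu.com']
    start_urls = ['https://www.zwdu.com/book/7846/195393.html']

    def parse(self, response):
        pass  # see step 5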
4. Edit settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'xiaoshuo.pipelines.XiaoshuoPipeline': 300,
}  # enable the pipeline
5. Parse the page
Extract the chapter title, the chapter content, and the URL of the next chapter.
The code is as follows:
def parse(self, response):
    title = response.xpath('//h1/text()').extract_first()
    # extract() returns a list of text nodes, so join them into a single string
    content = ''.join(response.xpath('//div[@id="content"]/text()').extract())
    # yield pushes the item to the pipeline
    yield {
        'title': title,
        'content': content
    }
    next_url = response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()
    # keep following the "next chapter" link as long as it points to a .html page
    if next_url and next_url.find('.html') != -1:
        # response.urljoin() turns the relative link into an absolute URL,
        # so there is no need to build 'https://www.zwdu.com/book/7846/{}' by hand
        yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
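To verify the XPath expressions before running the full crawl, scrapy shell can be used interactively from the project directory (so the USER_AGENT setting is picked up); for example:

scrapy shell 'https://www.zwdu.com/book/7846/195393.html'
>>> response.xpath('//h1/text()').extract_first()
>>> ''.join(response.xpath('//div[@id="content"]/text()').extract())[:100]
>>> response.xpath('//div[@class="bottem2"]/a[3]/@href').extract_first()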
6. Configure the pipeline to save the novel to a file
class XiaoshuoPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('wldf.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        title = item['title']
        content = item['content']
        info = title + '\n' + content + '\n'
        self.file.write(info)
        self.file.flush()
        return item

    def close_spider(self, spider):
        # close the file when the spider finishes
        self.file.close()
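Because parse() yields only one follow-up Request per page (the next chapter), the chapters are fetched one after another, so they end up in wldf.txt in reading order even though Scrapy itself is asynchronous.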
7. Create a main file to launch the spider
The code is as follows:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'zww'])
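Running this file (for example with python main.py from the project root) is equivalent to typing scrapy crawl zww on the command line, which makes it easy to start the spider from an IDE.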