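Day 9 of learning Python: scraping the novel 《爱情公寓》 (iPartment), collecting chapter titles, page URLs, and chapter text, and saving each chapter to a text file named after it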

Article link: https://blog.csdn.net/qq_41138009/article/details/105520789

Scraping the novel 《爱情公寓》

Day 9 of learning Python (2020.04.14)

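The goal: scrape the chapter titles, page URLs, and chapter text of 《爱情公寓》, then save each chapter to a text file named after the chapter.
Time to test what I have learned over the past few days. Without further ado, read on.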

Site being scraped (https://www.kanunu8.com/book2/10923/)

Novel table of contents
Chapter content

The code is as follows

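# coding: utf-8
import os  # needed to create the output folder
import re

import requests

BASE_URL = 'https://www.kanunu8.com/book2/10923/'
OUT_DIR = '爱情公寓1'

# Decode explicitly as GBK, ignoring bad bytes. Encoding problems are easy to hit here; they have bitten me several times.
html = requests.get(BASE_URL).content.decode('gbk', 'ignore')
# Chapter titles: link text of anchors whose href is 1942 plus two more characters, then .html
title = re.findall(r'<a href="1942..\.html">(.*?)</a></td>', html, re.S)
# Chapter page names, without the .html suffix
title_url = re.findall(r'<td width="25%"><a href="(.*?)\.html"', html, re.S)
# Turn the relative page names into absolute URLs
title_url = [BASE_URL + name + '.html' for name in title_url]
for url in title_url:
    print(url)

os.makedirs(OUT_DIR, exist_ok=True)  # create the folder 爱情公寓1 if it does not exist yet

# Single-threaded crawl: create one text file per chapter, named after the chapter, and write the chapter text into it.
# zip() stops at the shorter list, so a length mismatch between the two regex results cannot raise IndexError.
for name, url in zip(title, title_url):
    content_html = requests.get(url).content.decode('gbk', 'ignore')  # fetch the chapter page
    match = re.search(r'<p>(.*?)</p>', content_html, re.S)  # locate the chapter body
    if match is None:  # skip pages that have no <p>...</p> block instead of crashing
        print('no content found:', url)
        continue
    content = match.group(1)
    content = content.replace('&nbsp;&nbsp;&nbsp;&nbsp;', '')
    content = content.replace('<br />', '')
    with open(os.path.join(OUT_DIR, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write(content)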
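
The script above collects titles and URLs with two separate `re.findall` calls, uses the raw title as the file name, and deletes `<br />` tags outright. The sketch below shows one possible way to tighten those three steps without touching the network. The helper names `parse_index`, `safe_filename`, and `clean_content` are hypothetical, and `SAMPLE_INDEX` is made-up markup modeled on the patterns in the script, not the site's real HTML, so the combined pattern may need adjusting against the actual page.

```python
import html
import re

BASE_URL = 'https://www.kanunu8.com/book2/10923/'

# Hypothetical sample of the index markup, modeled on the regex patterns used in the script.
SAMPLE_INDEX = (
    '<tr><td width="25%"><a href="194201.html">Chapter 1: A/B?</a></td>'
    '<td width="25%"><a href="194202.html">Chapter 2</a></td></tr>'
)

# One pattern captures the page name and the link text together,
# so each title stays paired with its own URL.
LINK_RE = re.compile(r'<td width="25%"><a href="(.*?)\.html">(.*?)</a></td>', re.S)


def parse_index(index_html, base_url=BASE_URL):
    """Return a list of (title, absolute_url) pairs found in the index page."""
    return [(title.strip(), base_url + name + '.html')
            for name, title in LINK_RE.findall(index_html)]


def safe_filename(title):
    """Replace characters that Windows rejects in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()


def clean_content(fragment):
    """Turn <br> tags into newlines and decode entities such as &nbsp;."""
    text = re.sub(r'<br\s*/?>', '\n', fragment)
    text = html.unescape(text).replace('\xa0', ' ')
    return text.strip()


if __name__ == '__main__':
    for chapter_title, chapter_url in parse_index(SAMPLE_INDEX):
        print(safe_filename(chapter_title) + '.txt', chapter_url)
```

Converting `<br />` to a newline rather than deleting it is a design choice that keeps the paragraph breaks in the saved text file; dropping the tag, as the script does, joins all paragraphs into one line.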