# -*- coding: utf-8 -*-
"""Scrape the novel 《元尊》 from shuquge.com and save every chapter to a text file.

NOTE(review): this script was recovered from a mangled paste — curly quotes
were normalized back to ASCII, and the HTML fragments inside the regexes
(stripped in transit) were reconstructed from the surrounding code.  Verify
the patterns against the live page source before relying on them.
"""
import re

import requests  # third-party: HTTP client used for all page downloads

# Index page listing every chapter of the novel.
INDEX_URL = 'http://www.shuquge.com/txt/5809/index.html'
# Base URL that the chapter hrefs on the index page are relative to.
BASE_URL = 'http://www.shuquge.com/txt/5809/'
# Novel title; also used as the output file name.
TITLE = '元尊'


def _fetch_html(url):
    """Download *url* and return its text decoded as UTF-8.

    The site serves UTF-8 (the original comment noted gbk as an alternative
    for some mirrors — TODO confirm which encoding the target actually uses).
    """
    response = requests.get(url)
    response.encoding = 'utf-8'
    return response.text


def _parse_chapter_list(index_html):
    """Return (href, title) pairs for every chapter link on the index page.

    NOTE(review): the original regexes lost their HTML tags when the source
    was pasted; the patterns below assume the usual shuquge markup
    (<dd><a href="...">title</a></dd>) — confirm against the real page.
    """
    # Narrow to the block between the "《元尊》正文" heading and the
    # "元尊相关推荐" footer, mirroring the anchors in the original search.
    body = re.search(r'《元尊》正文(.*?)元尊相关推荐', index_html, re.S).group(1)
    # Capture each link's relative href and its chapter title; the optional
    # non-capturing "正文" prefix matches the original (正文 )? group.
    return re.findall(r'<a\s+href="(.*?)">(?:正文\s*)?(.*?)</a>', body)


def _clean_content(raw):
    """Strip markup and whitespace entities from a raw chapter body."""
    # NOTE(review): the two blank-looking .replace() targets in the original
    # were almost certainly '&nbsp;' entities and full-width spaces whose
    # text was eaten by the paste — reconstructed here; verify the output.
    raw = raw.replace('&nbsp;', '')
    raw = raw.replace('<br/>', '')
    raw = raw.replace('\u3000', '')  # full-width space used for indentation
    return raw


def main():
    """Download every chapter and append it to '<TITLE>.txt'."""
    index_html = _fetch_html(INDEX_URL)
    chapters = _parse_chapter_list(index_html)
    print(chapters)
    # 'with' guarantees the file is closed even if a request fails mid-run
    # (the original opened the file and never closed it).
    with open('%s.txt' % TITLE, 'w', encoding='utf-8') as out:
        for href, chapter_title in chapters:
            chapter_url = BASE_URL + href
            print(chapter_url, chapter_title)
            chapter_html = _fetch_html(chapter_url)
            # group(1) keeps only the text BETWEEN the two markers; the
            # original indexed [0], which also included the surrounding
            # boilerplate.  The trailing '.' is escaped (it was a literal
            # dot in "All Rights Reserved.").
            match = re.search(
                r'<body id="wrapper">\s*(.*?)All Rights Reserved\.',
                chapter_html, re.S)
            if match is None:
                # Skip pages whose layout does not match instead of crashing
                # with a TypeError on None, as the original would have.
                continue
            content = _clean_content(match.group(1))
            print(content)
            # Newlines added so titles are not glued directly to body text.
            out.write(chapter_title + '\n')
            out.write(content + '\n')


if __name__ == '__main__':
    main()