Xin Deng Lu (心灯录) Crawler

影修

Categories: Python, Notes. Tags: crawler, python, programming languages

Article link: https://blog.csdn.net/qq_30803523/article/details/121202061

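This code implements a simple Python crawler that fetches the chapters of Xin Deng Lu (《心灯录》) from 'http://www.daode.org/rdbook/xdl/index.html' and saves each chapter as a UTF-8 encoded TXT file. The crawler first creates the target directory, then fetches each chapter's title and content page by page, parses the HTML with BeautifulSoup, filters out the unwanted parts, and finally writes each chapter's content to its own TXT file.

Summary generated by CSDN using AI.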

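import os

import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
	# Create the output directory if it does not exist yet
	os.makedirs('D:/心灯录', exist_ok=True)
	url = 'http://www.daode.org/rdbook/xdl/index.html'
	# Send a browser-like User-Agent
	headers = {
		'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
	}
	# Fetch the index page
	response1 = requests.get(url=url, headers=headers)
	# Decode as gb18030 to avoid garbled text
	response1.encoding = 'gb18030'
	# Get the page body
	page_text1 = response1.text
	# Parse the HTML into a tag tree
	soup1 = BeautifulSoup(page_text1, 'lxml')
	# Keep only the <a> tags
	a_list = soup1.select('a')
	# The first few links come back as None, so skip the first five
	for a in a_list[5:]:
		# Chapter title, used later as the TXT file name
		chapter_title = a.string
		href = a.get('href')
		# Guard against any remaining link without a title or an href
		if chapter_title is None or href is None:
			continue
		# Drop the leading '../' prefix. Note that strip()/lstrip() take a set
		# of characters, not a literal prefix, so only the leading side is stripped.
		add_name = href.lstrip('./')
		# Base address + extracted page address
		chapter_url = 'http://www.daode.org/rdbook/xdl/' + add_name
		# Fetch the chapter HTML
		chapter_response = requests.get(url=chapter_url, headers=headers)
		# Fix the encoding again
		chapter_response.encoding = 'gb18030'
		chapter_text = chapter_response.text
		# Parse the chapter page
		chapter_soup = BeautifulSoup(chapter_text, 'lxml')
		# The body text is in <p class="style15">; skip the page if it is missing
		chapter_p = chapter_soup.find('p', class_='style15')
		if chapter_p is None:
			chapter_response.close()
			continue
		chapter_content = chapter_p.text
		# Write the file: directory + chapter title + .txt
		# (the with block closes the file, so no explicit close() is needed)
		with open('D:/心灯录/' + chapter_title + '.txt', 'w', encoding='utf-8') as fp:
			fp.write(chapter_content)
		chapter_response.close()
		print(chapter_title, 'Body length: ' + str(len(chapter_content)), 'Download complete!')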
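
The script builds each chapter URL by string concatenation and uses the chapter title directly as a file name. As a possible hardening, here is a minimal sketch of two helpers that use only the standard library. The names `resolve_chapter_url` and `safe_filename` are my own and are not part of the original script, and I have not checked them against the actual link layout of the site, so treat this as an illustration rather than a drop-in replacement.

```python
import re
from urllib.parse import urljoin

# Index page used by the script; relative hrefs are resolved against it.
INDEX_URL = 'http://www.daode.org/rdbook/xdl/index.html'


def resolve_chapter_url(href, base=INDEX_URL):
    """Resolve a (possibly relative) href against the index page URL.

    urljoin handles '../' segments itself, so no manual prefix
    stripping is needed.
    """
    return urljoin(base, href)


def safe_filename(title, replacement='_'):
    """Replace characters that Windows does not allow in file names.

    Falls back to 'untitled' (a hypothetical default) if nothing is left.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    return cleaned or 'untitled'
```

In the loop, `chapter_url = resolve_chapter_url(a['href'])` and `open('D:/心灯录/' + safe_filename(chapter_title) + '.txt', ...)` would be the places to use them. One thing to keep in mind: `urljoin` resolves `'../'` segments relative to the index page, which can produce a different URL than concatenation does, so it is worth printing a few resolved URLs before running a full download.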
