Parsing HTML with Python and saving the result

qq76211822

Last modified 2024-07-27 16:52:39


First published 2024-07-27 16:33:27

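Copyright notice: this is an original article by the author, licensed under CC 4.0 BY-SA. Include a link to the original source and this notice when reposting.

Article link: https://blog.csdn.net/sz76211822/article/details/140737501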


# Requires: pip install beautifulsoup4 chardet requests

import os
import re
import chardet
import requests
from bs4 import BeautifulSoup

def IsArrayEmpty(items):
    # True when the list is empty (or None)
    return not items

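def DownloadFile(path, url, strFilePath):
    # Download url and save it to strFilePath; path is only used in the error report
    try:
        # A timeout keeps an unresponsive server from blocking the script forever
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        with open(strFilePath, "wb") as file:
            file.write(response.content)
    except Exception as e:
        print('---------------------------------------------')
        print('File path:', path)
        print('Download failed:', url)
        print('Reason:', str(e))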
		
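def ReadFile(strFilePath):  # Read a file and decode it with the detected encoding
    with open(strFilePath, 'rb') as f:
        varContent = f.read()
    # chardet reports None when it cannot guess (e.g. an empty file); fall back to UTF-8
    encoding = chardet.detect(varContent)['encoding'] or 'utf-8'
    return varContent.decode(encoding, errors='replace')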

def ParseFile(path, strReplace):
    strData = ReadFile(path)

    # Parse the HTML
    soup = BeautifulSoup(strData, 'html.parser')

    # Find all <img> tags; nothing to rewrite if there are none
    strContent = soup.find_all('img')
    if IsArrayEmpty(strContent):
        return

    for varIndex in strContent:
        # Skip <img> tags that have no src attribute
        href = varIndex.get('src')
        if not href:
            continue

        # File name: everything after the last "/"
        nPos = href.rfind("/")
        imageName = href[nPos + 1:]

        # Point src at the new location
        varIndex['src'] = strReplace + imageName

    # Write the modified document back once, after all tags are updated
    with open(path, 'w', encoding='utf-8') as file:
        file.write(soup.prettify())
			

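def enumerate_folder(strFolder, strReplace):
    # os.walk already descends into subfolders, so only the files need handling.
    # Note: every file under strFolder is parsed, not only .html files.
    for root, dirs, files in os.walk(strFolder):
        for file in files:
            path = os.path.join(root, file)
            ParseFile(path, strReplace)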
			
enumerate_folder("./", "http://*.*.*.*:4004/test/")
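The file-name step in ParseFile keeps everything after the last "/" in src, so a query string or fragment (for example "logo.png?v=3") would be carried into the rewritten URL. Below is a minimal sketch of one way to strip those first, using only the standard library. The helper name rewrite_src and the example prefix are hypothetical and not part of the script above.

```python
import posixpath
from urllib.parse import urlsplit


def rewrite_src(src, prefix):
    """Return prefix + the bare file name found in src (hypothetical helper)."""
    # urlsplit separates the path from any ?query or #fragment
    path = urlsplit(src).path
    # posixpath.basename keeps only the part after the last "/"
    name = posixpath.basename(path)
    # No file name (e.g. src ends with "/"): leave src unchanged
    if not name:
        return src
    return prefix + name


# Example prefix is a placeholder, not the address used in the script
print(rewrite_src("images/a/logo.png?v=3", "http://example.invalid/test/"))
```

Inside ParseFile this could stand in for the rfind-based lines, assuming query strings should not survive the rewrite. Whether that is desirable depends on how the images are served.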
