html.replace() in Python: Using the .replace() method on a string read from an HTML file containing Unicode...

Latest recommended article published on 2023-12-17 16:37:11

cholejoan

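Article tags: html.replace() in Python

Copyright notice: This is an original article by the blogger, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_32309375/article/details/113649224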

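I want to read an .html file as raw text and replace every instance of a substring that contains Unicode characters with another substring. Suppose the file mm03.html contains only one line of text: «test»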

I want to read mm03.html, load its raw text into a string, and then call replace so that the output looks like this:

^{pr2}$

On my first attempt, I wrote the following code:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

…expecting it to print the original line listed above first, and then the second line. Instead, it printed the first line twice.
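One possible reading of the "printed twice" result is that replace() found nothing to replace because the bytes in the file and the bytes of the literal in the script differ. The sketch below illustrates that idea in Python 3 (the question itself uses Python 2.7). It assumes the file was saved as Windows-1252, which is a guess based on the byte 0xab reported in the tracebacks further down, and that the script was saved as UTF-8, as its coding line declares. The variable names are hypothetical.

```python
# Assumption: mm03.html is cp1252-encoded, so « is the single byte 0xab.
file_bytes = u"\u00abtest\u00bb".encode("cp1252")

# Assumption: the script is UTF-8, so the literal "«test»" is stored
# as the byte sequence 0xc2 0xab ... 0xc2 0xbb.
literal_bytes = u"\u00abtest\u00bb".encode("utf-8")

# The two byte sequences differ, so a byte-level replace matches nothing
# and returns the data unchanged.
unchanged = file_bytes.replace(literal_bytes, b"TEST")

print(file_bytes != literal_bytes)
print(unchanged == file_bytes)
```

If that assumption holds, codecs.open() called without an encoding argument hands back undecoded bytes, and the comparison is made byte by byte rather than character by character.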

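OK. So it might be a Unicode decoding problem, right? Maybe, but when I modified the code following Unicode-related advice found on this site, various shades of the problem persisted. Furthermore, explicitly defining htmlBase as:

htmlBase = """«test»"""

…leads me to believe that there is something about reading HTML files in Python that I don't know. I tried opening mm03.html in 'w' mode, but that didn't seem to work, and it destroyed the original file. It wouldn't make much sense for a string read from a read-only file to be read-only itself, but I could be wrong.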
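On the 'w' mode point: opening a file for writing truncates it immediately, which would account for the lost contents. The file mode also has no bearing on the string, since str.replace() always returns a new string. Below is a minimal Python 3 sketch of reading in 'r' mode and writing the result to a separate file. The temporary directory, the file names scratch.html and mm03_out.html, and the cp1252 encoding are all assumptions made for illustration.

```python
import io
import os
import tempfile

# Hypothetical stand-ins for mm03.html, written with assumed cp1252 bytes.
work_dir = tempfile.mkdtemp()
src_path = os.path.join(work_dir, "mm03.html")
scratch_path = os.path.join(work_dir, "scratch.html")
out_path = os.path.join(work_dir, "mm03_out.html")
for path in (src_path, scratch_path):
    with io.open(path, "wb") as f:
        f.write(b"\xabtest\xbb")

# Opening in 'w' mode empties the file before anything can be read from it.
io.open(scratch_path, "w").close()
truncated_size = os.path.getsize(scratch_path)

# Safer pattern: read in 'r' mode, then write the modified text elsewhere.
with io.open(src_path, "r", encoding="cp1252") as f:
    text = f.read()
with io.open(out_path, "w", encoding="cp1252") as f:
    f.write(text.replace(u"\u00abtest\u00bb", u"TEST"))

print(truncated_size)
```

Writing to a second file keeps the original intact while the replacement logic is still being debugged.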

Below are several script/output pairs that I have worked through.

Adding an unquoted 'u' prefix to the string to be replaced:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

½test╗
Traceback (most recent call last):
  File "test2.py", line 6, in <module>
    htmlFill = htmlFill.replace(u"┬½test┬╗","TEST")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
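The 'ascii' codec in this traceback comes from Python 2 itself: when a byte string is combined with a unicode argument, Python 2 first decodes the byte string implicitly with the default ASCII codec. The sketch below performs that hidden step explicitly in Python 3, where such implicit conversion no longer exists. The sample bytes are an assumption about what the file contains.

```python
# Assumed file contents: « and » stored as the single bytes 0xab and 0xbb.
raw = b"\xabtest\xbb"

try:
    # The step Python 2 performs silently before str.replace(unicode, ...).
    raw.decode("ascii")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # ASCII only covers bytes 0x00-0x7f, so 0xab cannot be decoded.
    outcome = exc.reason

print(outcome)
```

Any byte above 0x7f triggers the same failure, whatever encoding the file actually uses.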

Applying .decode('utf-8') to the string returned by .read():

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
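The "invalid start byte" message hints that the file may not be UTF-8 at all: 0xab is how « is stored in Latin-1 and Windows-1252, whereas UTF-8 stores it as the two bytes 0xc2 0xab. The following is a minimal Python 3 sketch, assuming the file is cp1252 (the question never states the actual encoding). The helper replace_in_file, the constant ASSUMED_ENCODING, and the temporary stand-in for mm03.html are hypothetical names introduced for illustration.

```python
import io
import os
import tempfile

# Assumption: the file was saved as Windows-1252, where « is the byte 0xab.
ASSUMED_ENCODING = "cp1252"


def replace_in_file(path, old, new, encoding=ASSUMED_ENCODING):
    """Read path as text in the given encoding and return it with old replaced by new."""
    with io.open(path, "r", encoding=encoding) as f:
        text = f.read()
    return text.replace(old, new)


# Temporary stand-in for mm03.html containing the assumed bytes.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "mm03.html")
with io.open(sample_path, "wb") as f:
    f.write(b"\xabtest\xbb")

result = replace_in_file(sample_path, u"\u00abtest\u00bb", u"TEST")
print(result)
```

Passing the encoding when the file is opened means the text is decoded once, on read, so both arguments to replace() are already character strings. If the file turns out to use a different encoding, only ASSUMED_ENCODING would need to change.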

Applying .encode('utf-8') to the string returned by .read():

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)

Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte

Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
