Removing HTML formatting from a text file: using Python to strip HTML formatting such as "&gt;" from a text file read with csv.read

段山河

Published 2021-06-09 10:17:55


See the code here (it targets Python 2, where htmlentitydefs and unichr are available):

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    # hexadecimal form, e.g. &#x3E;
                    return unichr(int(text[3:-1], 16))
                else:
                    # decimal form, e.g. &#62;
                    return unichr(int(text[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # match &name;, &#NNN; and &#xHHH; and hand each match to fixup
    return re.sub(r"&#?\w+;", fixup, text)
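The function above depends on htmlentitydefs and unichr, which exist only in Python 2. On Python 3, a minimal sketch of the same idea could look like the following, assuming the standard library's html.unescape is an acceptable substitute. The wrapper name unescape_text and the sample string are made up for illustration, and html.unescape follows HTML5 rules, so its handling of malformed references may differ slightly from the regex-based version.

```python
import html


def unescape_text(text):
    """Replace HTML character references and named entities with plain characters."""
    # html.unescape handles named entities (&gt;), decimal references (&#60;)
    # and hexadecimal references (&#x3E;); unknown names are left unchanged.
    return html.unescape(text)


if __name__ == "__main__":
    sample = "a &gt; b &amp;&amp; c &#60; d &#x3E; e"
    print(unescape_text(sample))
```

An unrecognised entity such as "&zzzz;" is returned as is, which matches the "leave as is" fallback of the original function.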

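Of course, this only handles HTML entities. The text may contain other semicolons that will confuse the CSV parser, but I assume you already know that.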
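To illustrate the semicolon problem, here is a rough sketch that unescapes each line before handing it to the csv module. It assumes a semicolon-delimited file, Python 3, and that any remaining literal semicolons sit inside quoted fields. The function name parse_semicolon_csv and the sample data are hypothetical.

```python
import csv
import html
import io


def parse_semicolon_csv(raw_text):
    """Unescape HTML entities line by line, then parse as semicolon-delimited CSV."""
    # Unescaping first removes the semicolons that belong to entities such as
    # "&gt;", so the parser does not mistake them for field separators.
    cleaned_lines = (html.unescape(line) for line in io.StringIO(raw_text))
    return list(csv.reader(cleaned_lines, delimiter=";"))


if __name__ == "__main__":
    raw = 'id;comment\n1;a &gt; b\n2;"x; y"\n3;R&amp;D\n'
    for row in parse_semicolon_csv(raw):
        print(row)
```

Parsing the same data without unescaping would split "a &gt; b" into two fields at the entity's semicolon. A semicolon that is neither part of an entity nor inside a quoted field would still be treated as a delimiter, which is the caveat raised above.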

