【Summary】Decoding HTML entities in Python + converting name entities to code point entities + converting code point ...

Latest recommended article published on 2024-03-01 15:57:09

weixin_39914868

【Decoding HTML entities in Python】

When working in Python, you sometimes need to process HTML code.

That HTML code sometimes contains so-called entities.

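HTML entities fall broadly into two classes:

name entity: an entity identified by a name, written as &xxx;. For example, &copy; stands for the copyright sign: ©. Note: such (special) characters often cannot be displayed correctly in encodings such as GBK. If the UTF-8 bytes of the Unicode character © are printed to the Windows cmd console (GBK by default), you see "漏" instead of '©'. Conversely, if you encode the Unicode "©" as UTF-8 and write it through logging to a UTF-8 encoded file, the "©" is displayed correctly.

code point entity: an entity identified by the Unicode value of the character, known as the Unicode code point (also written code point or codepoint). It is written as &#xxx;, where xxx is a number, either decimal or hexadecimal (prefixed with x). For example, &copy; == &#169; == &#xA9; == &#xa9; all denote the character '©'.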
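The equivalence of the three spellings, and the GBK display problem, can be checked directly. The sketch below assumes Python 3, where the entity tables live in html.entities; the variable names are only illustrative.

```python
from html.entities import name2codepoint

# Three spellings of the same character.
by_name = chr(name2codepoint["copy"])   # &copy;
by_decimal = chr(int("169"))            # &#169;
by_hex = chr(int("A9", 16))             # &#xA9; (hex digits are case-insensitive)

print(by_name == by_decimal == by_hex)  # all three are U+00A9

# The UTF-8 bytes of the copyright sign, misread as GBK, turn into an
# unrelated Chinese character.
utf8_bytes = by_name.encode("utf-8")
misread = utf8_bytes.decode("gbk")
print(repr(utf8_bytes), misread)
```

The last line shows why the character looks wrong in a GBK console: the two UTF-8 bytes happen to form one valid GBK character.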

Here, to convert HTML entities, whether name entities or code point entities, into the corresponding special characters, I consulted several references and put together the function below for everyone's convenience:

# -*- coding: utf-8 -*-
import re

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs is only available between Python 2.3 and Python 2.7.
import htmlentitydefs

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char
    (and then encode to decodedEncoding encoding char if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; or &#xA9; to unicode '©', then encode to decodedEncoding if decodedEncoding is not empty
    Note:
    Some special char can NOT show in some encoding, such as © can NOT show in GBK
    Related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.entitydefs:
    # a dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    #   'zwnj': '&#8204;',
    #   'aring': '\xe5',
    #   'gt': '>',
    #   'yen': '\xa5',
    # htmlentitydefs.name2codepoint:
    # a dictionary that maps HTML entity names to the Unicode codepoints
    #   'aring': 229,
    #   'gt': 62,
    #   'sup': 8835,
    #   'Ntilde': 209,
    # htmlentitydefs.codepoint2name:
    # a dictionary that maps Unicode codepoints to HTML entity names
    #   8704: 'forall',
    #   8194: 'ensp',
    #   8195: 'emsp',
    #   8709: 'empty',

    # http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    def decodeName(matched):
        entityName = matched.group("entityName")
        if entityName in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[entityName])
        return matched.group(0)  # leave unknown names unchanged instead of raising KeyError

    # 1. name entities, e.g. &copy;
    decodedEntityName = re.sub(r'&(?P<entityName>[a-zA-Z]{2,10});', decodeName, origHtml)
    # 2. decimal code point entities, e.g. &#169;
    decodedCodepointInt = re.sub(r'&#(?P<codePointInt>\d{2,5});', lambda matched: unichr(int(matched.group("codePointInt"))), decodedEntityName)
    # 3. hex code point entities, e.g. &#xA9;
    decodedCodepointHex = re.sub(r'&#x(?P<codePointHex>[a-fA-F\d]{2,5});', lambda matched: unichr(int(matched.group("codePointHex"), 16)), decodedCodepointInt)

    decodedHtml = decodedCodepointHex
    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
    return decodedHtml
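Since htmlentitydefs only exists in Python 2, here is a rough Python 3 sketch of the same three-step idea. It is an adaptation, not the original function: the name decode_html_entity is hypothetical, the name pattern is widened to allow digits (so names such as sup2 also match), and the optional re-encoding step is left out.

```python
import re
from html.entities import name2codepoint


def decode_html_entity(orig_html):
    """Decode name, decimal and hex entities into characters (Python 3 sketch)."""

    def replace_name(m):
        name = m.group("name")
        # Unknown names are left untouched rather than raising KeyError.
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return m.group(0)

    # Step 1: name entities such as &copy;
    decoded = re.sub(r"&(?P<name>[a-zA-Z][a-zA-Z0-9]{1,9});", replace_name, orig_html)
    # Step 2: decimal code point entities such as &#169;
    decoded = re.sub(r"&#(?P<dec>\d{2,5});",
                     lambda m: chr(int(m.group("dec"))), decoded)
    # Step 3: hex code point entities such as &#xA9;
    decoded = re.sub(r"&#[xX](?P<hex>[a-fA-F\d]{2,5});",
                     lambda m: chr(int(m.group("hex"), 16)), decoded)
    return decoded


print(decode_html_entity("&copy; &#169; &#xA9; &#xa9;"))
```

In Python 3 the result is already a str, so any encoding to bytes (for example for a GBK console) would be a separate, explicit step by the caller.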

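【Converting between name entities and code point entities】

To convert a name entity into a code point entity, for example from &aring; to &#229;,

or to convert a code point entity into a name entity, for example from &#8194; to &ensp;,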

you can use the corresponding functions below, which I put together:

import re

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs is only available between Python 2.3 and Python 2.7.
import htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into entity code point
    eg: from &aring; to &#229;
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.name2codepoint:
    #   'aring': 229,
    #   'gt': 62,
    #   'sup': 8835,
    #   'Ntilde': 209,
    # nameToCodepointDict built below:
    #   "&aring;": "&#229;",
    #   "&gt;": "&#62;",
    #   "&sup;": "&#8835;",
    #   "&Ntilde;": "&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint

    # "&aring;" -> "&#229;"
    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict.keys():
        htmlWithCodepoint = re.compile(key).sub(nameToCodepointDict[key], htmlWithCodepoint)
    return htmlWithCodepoint
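The same conversion can also be done in a single pass with one regular expression and a callback, instead of one substitution per table entry. This is a Python 3 sketch under that assumption; the function name html_entity_name_to_codepoint is hypothetical and html.entities is the Python 3 home of the name2codepoint table.

```python
import re
from html.entities import name2codepoint


def html_entity_name_to_codepoint(html_with_names):
    """Rewrite &name; as &#NNN; for every name known to html.entities."""

    def to_codepoint(m):
        name = m.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        # Not a known entity name: keep the original text.
        return m.group(0)

    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", to_codepoint, html_with_names)


print(html_entity_name_to_codepoint("&aring; &gt; &sup; &Ntilde;"))
```

One pass over the text avoids compiling a separate pattern for each of the entity names, at the cost of a slightly more involved callback.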

#------------------------------------------------------------------------------
def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's entity code point into entity name
    eg: from &#8194; to &ensp;
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.codepoint2name:
    #   8704: 'forall',
    #   8194: 'ensp',
    #   8195: 'emsp',
    #   8709: 'empty',
    # codepointToNameDict built below:
    #   "&#8704;": "&forall;",
    #   "&#8194;": "&ensp;",
    #   "&#8195;": "&emsp;",
    #   "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName

    # "&#8194;" -> "&ensp;"
    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict.keys():
        htmlWithEntityName = re.compile(key).sub(codepointToNameDict[key], htmlWithEntityName)
    return htmlWithEntityName
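The reverse direction can be sketched the same way in Python 3. Again this is only an illustration: the function name html_entity_codepoint_to_name is hypothetical, and unlike the dictionary-based version it parses the number, so a zero-padded form such as &#0160; is also recognized.

```python
import re
from html.entities import codepoint2name


def html_entity_codepoint_to_name(html_with_codepoints):
    """Rewrite decimal &#NNN; as &name; when html.entities knows a name for it."""

    def to_name(m):
        codepoint = int(m.group(1))
        if codepoint in codepoint2name:
            return "&" + codepoint2name[codepoint] + ";"
        # No entity name for this code point: keep the original text.
        return m.group(0)

    return re.sub(r"&#(\d+);", to_name, html_with_codepoints)


print(html_entity_codepoint_to_name("&#8704; &#8194; &#8195; &#8709;"))
```

Code points without an entity name, such as &#65; for the letter A, pass through unchanged.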

Tips:

1. For more about HTML entities, see:

2. For more of the Python functions I have collected and summarized, see:

3. To decode HTML entities you can also follow what is summarized in appendix 1 and use unescape() from HTMLParser:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'

If you are interested, try it out yourself.
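The session above is Python 2. In Python 3 the equivalent call is html.unescape() from the standard library (available since Python 3.4); a minimal sketch of the same experiment:

```python
import html

# Name, decimal and hex forms all decode to the same string.
for text in ("&copy; 2010", "&#169; 2010", "&#xA9; 2010"):
    s = html.unescape(text)
    print(ascii(s), s)
```

ascii() is used here only to make the escaped form of the result visible, similar to the u'\xa9 2010' repr shown in the Python 2 session.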

【References】
