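【Decoding HTML entities in Python】
When working in Python, you sometimes need to process HTML code.
HTML code, in turn, sometimes contains so-called entities.
Broadly speaking, HTML entities fall into two categories:
name entity: an entity identified by a name, written as &xxx;. For example, &copy; stands for the small copyright sign: ©. Note: this kind of (special) character often cannot be displayed properly in encodings such as GBK. So if you try to print the Unicode character © in the Windows cmd console (GBK by default), you will only see "漏" rather than '©'. Correspondingly, if you encode the Unicode "©" as UTF-8 and write it through logging to a (UTF-8 encoded) file, the "©" shows up correctly.
code point entity: an entity identified by the Unicode value of the special character, known as the Unicode code point == code point == codepoint. It is written as &#xxx;, where xxx is a number, either decimal or (prefixed with x) hexadecimal. For the example above, &copy; == &#169; == &#xa9; == &#xA9; all refer to the special character '©'.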
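As a quick check of the equivalence described above, here is a minimal sketch. It assumes Python 3 (where htmlentitydefs lives on as html.entities and html.unescape is available), and the variable names are only illustrative:

```python
import html
from html.entities import name2codepoint

# The four spellings of the copyright sign mentioned above.
forms = ["&copy;", "&#169;", "&#xa9;", "&#xA9;"]
decoded = [html.unescape(form) for form in forms]

# Every form should decode to the same single character, U+00A9.
same_char = all(ch == "\u00a9" for ch in decoded)

# The entity name 'copy' maps to code point 169 (0xA9).
copy_codepoint = name2codepoint["copy"]

# GBK has no mapping for U+00A9, so encoding to GBK fails,
# while encoding to UTF-8 works.
try:
    "\u00a9".encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False
utf8_bytes = "\u00a9".encode("utf-8")

print(decoded, same_char, copy_codepoint, gbk_ok, utf8_bytes)
```

The failing GBK encode is the same limitation that makes the character unreadable in a GBK console.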
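Here, to convert HTML entities, whether name entities or code point entities, into the corresponding special characters, I consulted some references and put together the function below for everyone to use:
import re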
# Note: the htmlentitydefs module was renamed to html.entities in Python 3.0,
# so htmlentitydefs is only available from Python 2.3 through Python 2.7.
import htmlentitydefs

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char
    (and then encode to decodedEncoding if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; or &#xA9; to unicode '©',
        then encode to decodedEncoding if decodedEncoding is not empty
    Note:
        Some special chars can NOT be shown in some encodings, such as © in GBK
    Related knowledge:
        http://www.htmlhelp.com/reference/html40/entities/latin1.html
        http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    decodedHtml = ""
    # htmlentitydefs.entitydefs:
    # a dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    #   'zwnj': '&#8204;',
    #   'aring': '\xe5',
    #   'gt': '>',
    #   'yen': '\xa5',
    #logging.debug("htmlentitydefs.entitydefs=%s", htmlentitydefs.entitydefs)
    # htmlentitydefs.name2codepoint:
    # a dictionary that maps HTML entity names to the Unicode codepoints
    #   'aring': 229,
    #   'gt': 62,
    #   'sup': 8835,
    #   'Ntilde': 209,
    #logging.debug("htmlentitydefs.name2codepoint=%s", htmlentitydefs.name2codepoint)
    # htmlentitydefs.codepoint2name:
    # a dictionary that maps Unicode codepoints to HTML entity names
    #   8704: 'forall',
    #   8194: 'ensp',
    #   8195: 'emsp',
    #   8709: 'empty',
    #logging.debug("htmlentitydefs.codepoint2name=%s", htmlentitydefs.codepoint2name)
    #http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    def _nameToChar(matched):
        entityName = matched.group("entityName")
        if entityName in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[entityName])
        # unknown entity name: keep the original text instead of raising KeyError
        return matched.group(0)
    # step 1: name entity, eg &copy; (names may contain digits, eg &sup2;)
    decodedEntityName = re.sub(r'&(?P<entityName>[a-zA-Z][a-zA-Z0-9]{1,9});', _nameToChar, origHtml)
    #print "type(decodedEntityName)=",type(decodedEntityName)
    # step 2: decimal code point entity, eg &#169;
    decodedCodepointInt = re.sub(r'&#(?P<codePointInt>\d{2,5});', lambda matched: unichr(int(matched.group("codePointInt"))), decodedEntityName)
    #print "decodedCodepointInt=",decodedCodepointInt
    # step 3: hex code point entity, eg &#xa9; or &#xA9;
    decodedCodepointHex = re.sub(r'&#x(?P<codePointHex>[a-fA-F\d]{2,5});', lambda matched: unichr(int(matched.group("codePointHex"), 16)), decodedCodepointInt)
    #print "decodedCodepointHex=",decodedCodepointHex
    #logging.info("origHtml=%s", origHtml)
    decodedHtml = decodedCodepointHex
    #logging.info("decodedHtml=%s", decodedHtml)
    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
        #print "after encode into decodedEncoding=%s, decodedHtml=%s"%(decodedEncoding, decodedHtml)
    return decodedHtml
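For readers on Python 3, where htmlentitydefs and unichr no longer exist, the same idea can be sketched roughly as follows. This is not the function above but an assumed port: the names decode_html_entity, _ENTITY_RE and _replace_entity are made up for illustration, and it handles all three entity forms in a single pass so that text such as &amp;#169; is not decoded twice.

```python
import re
from html.entities import name2codepoint

# One pattern for all three forms: &#169; / &#xa9; / &copy;
_ENTITY_RE = re.compile(
    r"&(?:#(?P<dec>[0-9]+)|#[xX](?P<hex>[0-9a-fA-F]+)|(?P<name>[a-zA-Z][a-zA-Z0-9]*));"
)

def _replace_entity(match):
    """Return the character for one matched entity, or the entity unchanged."""
    try:
        if match.group("dec") is not None:
            return chr(int(match.group("dec")))
        if match.group("hex") is not None:
            return chr(int(match.group("hex"), 16))
        return chr(name2codepoint[match.group("name")])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave the text as it was.
        return match.group(0)

def decode_html_entity(orig_html, decoded_encoding=""):
    """Decode entities to str; return bytes if decoded_encoding is given."""
    decoded = _ENTITY_RE.sub(_replace_entity, orig_html)
    if decoded_encoding:
        return decoded.encode(decoded_encoding, "ignore")
    return decoded

print(decode_html_entity("&copy; &#169; &#xa9; &#xA9;"))
```

Unknown names such as &bogus; are left untouched here rather than raising an exception, which is a design choice of this sketch.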
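【Converting between name entities and code point entities】
If you want to convert a name entity into a code point entity, for example from &nbsp; to &#160;,
or convert a code point entity into a name entity, for example from &#160; to &nbsp;,
you can use the corresponding functions below, which I put together:
# Note: the htmlentitydefs module was renamed to html.entities in Python 3.0,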
# so htmlentitydefs is only available from Python 2.3 through Python 2.7.
import htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into (decimal) entity code point
    eg: from &nbsp; to &#160;
    related knowledge:
        http://www.htmlhelp.com/reference/html40/entities/latin1.html
        http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.name2codepoint looks like:
    #   'aring': 229,
    #   'gt': 62,
    #   'sup': 8835,
    #   'Ntilde': 209,
    # the lookup table built from it looks like:
    #   "&aring;": "&#229;",
    #   "&gt;": "&#62;",
    #   "&sup;": "&#8835;",
    #   "&Ntilde;": "&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint
    # "&aring;" -> "&#229;"
    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict:
        # the keys are literal strings, so plain replace() is enough (no regex needed)
        htmlWithCodepoint = htmlWithCodepoint.replace(key, nameToCodepointDict[key])
    return htmlWithCodepoint
#------------------------------------------------------------------------------
def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's (decimal) entity code point into entity name
    eg: from &#160; to &nbsp;
    related knowledge:
        http://www.htmlhelp.com/reference/html40/entities/latin1.html
        http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.codepoint2name looks like:
    #   8704: 'forall',
    #   8194: 'ensp',
    #   8195: 'emsp',
    #   8709: 'empty',
    # the lookup table built from it looks like:
    #   "&#8704;": "&forall;",
    #   "&#8194;": "&ensp;",
    #   "&#8195;": "&emsp;",
    #   "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName
    # "&#8194;" -> "&ensp;"
    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict:
        # the keys are literal strings, so plain replace() is enough (no regex needed)
        htmlWithEntityName = htmlWithEntityName.replace(key, codepointToNameDict[key])
    return htmlWithEntityName
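The two conversions can also be written with one regular expression pass each instead of one replacement per table entry. The following is only a sketch under the assumption of Python 3 (html.entities); the function names entity_name_to_codepoint and entity_codepoint_to_name are hypothetical and not part of any library:

```python
import re
from html.entities import codepoint2name, name2codepoint

def entity_name_to_codepoint(text):
    """Rewrite known name entities as decimal code point entities."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return match.group(0)  # unknown name: keep as is
    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", repl, text)

def entity_codepoint_to_name(text):
    """Rewrite decimal code point entities as name entities where a name exists."""
    def repl(match):
        codepoint = int(match.group(1))
        if codepoint in codepoint2name:
            return "&%s;" % codepoint2name[codepoint]
        return match.group(0)  # no name for this code point: keep as is
    return re.sub(r"&#([0-9]+);", repl, text)

print(entity_name_to_codepoint("a&nbsp;b &aring;"))
print(entity_codepoint_to_name("&#160;&#8704;"))
```

As with the functions above, only decimal code point entities are converted back to names; hexadecimal ones are left alone.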
Tips:
1. To learn more about HTML entities, see:
2. For more of the Python functions I have collected and summarized, see:
3. Of course, to decode HTML entities you can also use unescape() from HTMLParser (Python 2), as summarized in reference 1:
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
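The session above is Python 2. On Python 3.4 and later the standard library offers html.unescape() for the same job (HTMLParser.unescape() was removed in Python 3.9), so an equivalent sketch, assuming Python 3, could look like this:

```python
import html

# Name, decimal and hexadecimal forms of the same entity.
s1 = html.unescape("&copy; 2010")
s2 = html.unescape("&#169; 2010")
s3 = html.unescape("&#xA9; 2010")

print(s1)  # © 2010
```

All three calls return an ordinary str, so any later encoding to bytes (UTF-8, GBK and so on) is a separate step.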
If you are interested, go ahead and try it out yourself.
【References】