Python decode function: 【Notes】Decoding HTML entities in Python + converting name entities to code point entities + converting code point ...

【Decoding HTML entities in Python】

When working with Python, you sometimes need to process HTML code.

HTML code, in turn, sometimes contains so-called entities.

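HTML entities fall broadly into two categories:

name entity: an entity identified by name, written as &xxx;. For example, &copy; stands for the copyright sign: ©. Note: such (special) characters often cannot be displayed correctly in encodings such as GBK. So if you output the Unicode character © to the Windows cmd console (GBK by default), you see "漏" rather than "©"; that is what the UTF-8 bytes of © (0xC2 0xA9) look like when read as GBK. Conversely, if you encode the Unicode "©" as UTF-8 and write it through logging to a (UTF-8 encoded) file, "©" displays correctly.

code point entity: an entity identified by the Unicode value of the character, known as the Unicode code point (also written code point or codepoint). It is written as &#xxx;, where xxx is a number, either decimal or hexadecimal (prefixed with x). For example, &copy; == &#169; == &#xa9; == &#xA9; all refer to the character '©'.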
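As a small illustration of the two categories, the sketch below shows how the name "copy" and the numbers 169 / 0xa9 refer to the same character, and reproduces the "漏" display issue. It assumes Python 3 (where the module is called html.entities); the variable names are only illustrative.

```python
from html.entities import name2codepoint

# Name entity "&copy;": the name "copy" maps to a Unicode code point.
code_point = name2codepoint["copy"]

# The decimal and hexadecimal code point entities spell the same number.
decimal_entity = "&#%d;" % code_point
hex_entity = "&#x%x;" % code_point

# The character that all of these forms stand for.
char = chr(code_point)

# The console issue: the UTF-8 bytes of the character, misread as GBK.
misread = char.encode("utf-8").decode("gbk")
```

Here decimal_entity is "&#169;", hex_entity is "&#xa9;", char is "©" and misread is "漏".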

To convert HTML entities, whether name entities or code point entities, into the characters they stand for, I went through some references and put together the function below for everyone's convenience:

import re
# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs is only available between Python 2.3 and Python 2.7
import htmlentitydefs

def decodeHtmlEntity(origHtml, decodedEncoding=""):
    """Decode html entity (name/decimal code point/hex code point) into unicode char
    (and then encode to decodedEncoding encoding char if decodedEncoding is not empty)
    eg: from &copy; or &#169; or &#xa9; or &#xA9; to unicode '©',
    then encode to decodedEncoding if decodedEncoding is not empty
    Note:
    Some special char can NOT show in some encoding, such as © can NOT show in GBK
    Related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    decodedHtml = ""

    # htmlentitydefs.entitydefs:
    # a dictionary mapping XHTML 1.0 entity definitions to their replacement text in ISO Latin-1
    # 'zwnj': '&#8204;',
    # 'aring': '\xe5',
    # 'gt': '>',
    # 'yen': '\xa5',
    #logging.debug("htmlentitydefs.entitydefs=%s", htmlentitydefs.entitydefs)

    # htmlentitydefs.name2codepoint:
    # a dictionary that maps HTML entity names to the Unicode codepoints
    # 'aring': 229,
    # 'gt': 62,
    # 'sup': 8835,
    # 'Ntilde': 209,
    #logging.debug("htmlentitydefs.name2codepoint=%s", htmlentitydefs.name2codepoint)

    # htmlentitydefs.codepoint2name:
    # a dictionary that maps Unicode codepoints to HTML entity names
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',
    #logging.debug("htmlentitydefs.codepoint2name=%s", htmlentitydefs.codepoint2name)

    def decodeName(matched):
        name = matched.group("entityName")
        # leave unknown names (e.g. &foo;) untouched instead of raising KeyError
        if name in htmlentitydefs.name2codepoint:
            return unichr(htmlentitydefs.name2codepoint[name])
        return matched.group(0)

    #http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/
    # 1. name entity, e.g. &copy;
    decodedEntityName = re.sub(r'&(?P<entityName>[a-zA-Z]{2,10});', decodeName, origHtml)
    #print "type(decodedEntityName)=",type(decodedEntityName)
    # 2. decimal code point entity, e.g. &#169;
    decodedCodepointInt = re.sub(r'&#(?P<codePointInt>\d{2,5});', lambda matched: unichr(int(matched.group("codePointInt"))), decodedEntityName)
    #print "decodedCodepointInt=",decodedCodepointInt
    # 3. hex code point entity, e.g. &#xa9;
    decodedCodepointHex = re.sub(r'&#x(?P<codePointHex>[a-fA-F\d]{2,5});', lambda matched: unichr(int(matched.group("codePointHex"), 16)), decodedCodepointInt)
    #print "decodedCodepointHex=",decodedCodepointHex

    #logging.info("origHtml=%s", origHtml)
    decodedHtml = decodedCodepointHex
    #logging.info("decodedHtml=%s", decodedHtml)

    if decodedEncoding:
        # note: here decodedHtml is unicode
        decodedHtml = decodedHtml.encode(decodedEncoding, 'ignore')
        #print "after encode into decodedEncoding=%s, decodedHtml=%s"%(decodedEncoding, decodedHtml)

    return decodedHtml
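Since htmlentitydefs only exists in Python 2, here is a rough sketch of how the same three-step approach might look on Python 3, using html.entities and chr() in place of unichr(). The name decode_html_entity is hypothetical, and the choice to leave unknown entity names untouched is an assumption of this sketch rather than part of the original function.

```python
import re
from html.entities import name2codepoint


def decode_html_entity(orig_html, decoded_encoding=""):
    """Python 3 sketch (hypothetical name): decode name, decimal and hex entities."""

    def decode_name(matched):
        name = matched.group("entityName")
        # Assumption of this sketch: unknown names are left as they are.
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return matched.group(0)

    # 1. name entities such as &copy;
    decoded = re.sub(r"&(?P<entityName>[a-zA-Z]{2,10});", decode_name, orig_html)
    # 2. decimal code point entities such as &#169;
    decoded = re.sub(r"&#(?P<codePointInt>\d{2,5});",
                     lambda m: chr(int(m.group("codePointInt"))), decoded)
    # 3. hexadecimal code point entities such as &#xa9;
    decoded = re.sub(r"&#x(?P<codePointHex>[a-fA-F\d]{2,5});",
                     lambda m: chr(int(m.group("codePointHex"), 16)), decoded)

    if decoded_encoding:
        # str -> bytes; characters the target encoding lacks are dropped
        return decoded.encode(decoded_encoding, "ignore")
    return decoded


result = decode_html_entity("&copy; &#169; &#xa9; &unknown;")
as_utf8 = decode_html_entity("&copy;", "utf-8")
```

With these inputs, result is "© © © &unknown;" and as_utf8 is the byte string b'\xc2\xa9'.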

【Converting between name entities and code point entities】

To convert a name entity into a code point entity, for example from &nbsp; to &#160;

or to convert a code point entity into a name entity, for example from &#160; to &nbsp;

you can use the corresponding functions below, which I put together:

# Note: the htmlentitydefs module has been renamed to html.entities in Python 3.0,
# so htmlentitydefs is only available between Python 2.3 and Python 2.7
import htmlentitydefs

#------------------------------------------------------------------------------
def htmlEntityNameToCodepoint(htmlWithEntityName):
    """Convert html's entity name into entity code point
    eg: from &nbsp; to &#160;
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.name2codepoint looks like:
    # 'aring': 229,
    # 'gt': 62,
    # 'sup': 8835,
    # 'Ntilde': 209,
    # from it, build a dict like:
    # "&aring;": "&#229;",
    # "&gt;": "&#62;",
    # "&sup;": "&#8835;",
    # "&Ntilde;": "&#209;",
    nameToCodepointDict = {}
    for eachName in htmlentitydefs.name2codepoint:
        fullName = "&" + eachName + ";"
        fullCodepoint = "&#" + str(htmlentitydefs.name2codepoint[eachName]) + ";"
        nameToCodepointDict[fullName] = fullCodepoint

    # "&aring;" -> "&#229;"
    htmlWithCodepoint = htmlWithEntityName
    for key in nameToCodepointDict:
        # the keys are literal text, so plain string replacement is enough
        htmlWithCodepoint = htmlWithCodepoint.replace(key, nameToCodepointDict[key])

    return htmlWithCodepoint

#------------------------------------------------------------------------------

def htmlEntityCodepointToName(htmlWithCodepoint):
    """Convert html's entity code point into entity name
    eg: from &#160; to &nbsp;
    related knowledge:
    http://www.htmlhelp.com/reference/html40/entities/latin1.html
    http://www.htmlhelp.com/reference/html40/entities/special.html
    """
    # htmlentitydefs.codepoint2name looks like:
    # 8704: 'forall',
    # 8194: 'ensp',
    # 8195: 'emsp',
    # 8709: 'empty',
    # from it, build a dict like:
    # "&#8704;": "&forall;",
    # "&#8194;": "&ensp;",
    # "&#8195;": "&emsp;",
    # "&#8709;": "&empty;",
    codepointToNameDict = {}
    for eachCodepoint in htmlentitydefs.codepoint2name:
        fullCodepoint = "&#" + str(eachCodepoint) + ";"
        fullName = "&" + htmlentitydefs.codepoint2name[eachCodepoint] + ";"
        codepointToNameDict[fullCodepoint] = fullName

    # "&#160;" -> "&nbsp;"
    htmlWithEntityName = htmlWithCodepoint
    for key in codepointToNameDict:
        # the keys are literal text, so plain string replacement is enough
        htmlWithEntityName = htmlWithEntityName.replace(key, codepointToNameDict[key])

    return htmlWithEntityName
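For Python 3, the two conversions could be sketched roughly as follows. Instead of looping over every known entity, this version makes one regex pass and looks each match up in html.entities. The function names and the regex patterns are assumptions of this sketch, not taken from the functions above, and only decimal code point entities are handled.

```python
import re
from html.entities import codepoint2name, name2codepoint


def entity_name_to_codepoint(html_text):
    """Rewrite known &name; entities as decimal &#N; entities (sketch)."""
    def to_codepoint(matched):
        name = matched.group(1)
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]
        return matched.group(0)  # unknown name: keep the text unchanged
    return re.sub(r"&([a-zA-Z][a-zA-Z0-9]*);", to_codepoint, html_text)


def entity_codepoint_to_name(html_text):
    """Rewrite decimal &#N; entities as &name; where a name exists (sketch)."""
    def to_name(matched):
        codepoint = int(matched.group(1))
        if codepoint in codepoint2name:
            return "&%s;" % codepoint2name[codepoint]
        return matched.group(0)  # no name for this code point: keep it
    return re.sub(r"&#(\d+);", to_name, html_text)


with_codepoints = entity_name_to_codepoint("a&nbsp;b &copy; &bogus;")
with_names = entity_codepoint_to_name("&#160;&#8704;&#12345;")
```

Here with_codepoints is "a&#160;b &#169; &bogus;" and with_names is "&nbsp;&forall;&#12345;".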

Tips:

1. To learn more about HTML entities, see:

2. For more of the Python functions I have collected and summarized, see:

3. Of course, to decode HTML entities you can also follow what is summarized in Appendix 1 and use unescape() from HTMLParser:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> s = h.unescape('&copy; 2010')
>>> s
u'\xa9 2010'
>>> print s
© 2010
>>> s = h.unescape('&#169; 2010')
>>> s
u'\xa9 2010'
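The session above is Python 2. On Python 3.4 and later the standard library offers html.unescape() as a module-level function, and the HTMLParser.unescape() method was removed in Python 3.9, so a rough Python 3 equivalent might look like this:

```python
import html

# Name entity and decimal code point entity decode to the same text.
s = html.unescape("&copy; 2010")
t = html.unescape("&#169; 2010")
```

Both s and t hold the string '\xa9 2010', which prints as "© 2010".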

If you are interested, go ahead and experiment with it yourself.

【References】
