HTML entity encoding的解析

关于浏览器安全涉及的内容:http://code.google.com/p/browsersec

本文转自转:http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding

更多的可以了解,HTML中关于字符解析的部分:http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html

HTML entity encodingHTML entities. The purpose of this scheme is to make it possible to safely render certain reserved HTML characters (e.g., < > &) within documents, as well as to carry high bit characters safely over 7-bit media. The scheme nominally permits three types of notation:

HTML features a special encoding scheme called

  • One of predefined, named entities, in the format of &<name>; - for example &lt; for <, &gt; for >, &rarr; for , etc,
  • Decimal entities, &#<nn>;, with a number corresponding to the desired Unicode character value - for example &#60; for <, &#8594; for ,
  • Hexadecimal entities, &#x<nn>;, likewise - for example &#x3c; for <, &#x2192; for .

In every browser, HTML entities are decoded only in parameter values and stray text between tags. Entities have no effect on how the general structure of a document is understood, and no special meaning in sections such as <SCRIPT>. The ability to understand and parse the syntax is still critical to properly understanding the value of a particular HTML parameter, however. For example, as hinted in one of the earlier sections, <A HREF="javascript&#09;:alert(1)"> may need to be parsed as an absolute reference to javascript<TAB>:alert(1), as opposed to a link to something called javascript& with a local URL hash string part of #09;alert(1).

Unfortunately, various browsers follow different parsing rules to these HTML entity notations; all rendering engines recognize entities with no proper ; terminator, and all permit entities with excessively long, zero-padded notation, but with various thresholds:

Test description MSIE6 MSIE7 MSIE8 FF2 FF3 Safari Opera Chrome Android
Maximum length of a correctly terminated decimal entity 7 7 7 8* 8* 8*
Maximum length of an incorrectly terminated decimal entity 7 7 7 8* 8* 8*
Maximum length of a correctly terminated hex entity 6 6 6 8* 8* 8*
Maximum length of an incorrectly terminated hex entity 0 0 0 8* 8* 8*
Characters permitted in entity names (excluding A-Z a-z 0-9) none none none - . - . none none none none

* Entries one byte longer than this limit still get parsed, but incorrectly; for example, &#000000065; becomes a sequence of three characters, /x06 5 ;. Two characters and more do not get parsed at all - &#0000000065; is displayed literally).

An interesting piece of trivia is that, as per HTML entity encoding requirements, links such as:

http://example.com/?p1=v1&p2=v2

Should be technically always encoded in HTML parameters (but not in JavaScript code) as:

<a href="http://example.com/?p1=v1&amp;p2=v2">Click here</a>

In practice, however, the convention is almost never followed by web developers, and browsers compensate for it by treating invalid HTML entities as literal &-containing strings.

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值