Parsing of HTML entity encoding

Latest recommended article published on 2022-04-17 21:44:50

buptisc_txy

Article tags: encoding html scheme javascript parsing parameters

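For browser security topics in general, see: http://code.google.com/p/browsersec

This article is reposted from: http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding

For more detail, see the part of the HTML specification that covers parsing: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html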

HTML features a special encoding scheme called HTML entities. The purpose of this scheme is to make it possible to safely render certain reserved HTML characters (e.g., `< > &`) within documents, as well as to carry high-bit characters safely over 7-bit media. The scheme nominally permits three types of notation:

One of the predefined, named entities, in the format &<name>; (for example, &lt; for <, &gt; for >, &rarr; for →, etc.),

Decimal entities, &#<nn>;, with a number corresponding to the desired Unicode character value (for example, &#60; for <, &#8594; for →),

Hexadecimal entities, &#x<nn>;, likewise (for example, &#x3c; for <, &#x2192; for →).
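As a quick illustration of the three notations, here is a minimal sketch using `html.unescape` from Python's standard library. It is not a browser, so it only shows that the named, decimal, and hexadecimal forms resolve to the same characters; the helper name `decode_all` and the `NOTATIONS` table are made up for this example.

```python
import html

# Named, decimal, and hexadecimal notations for the same two characters.
NOTATIONS = {
    "<": ["&lt;", "&#60;", "&#x3c;"],
    "→": ["&rarr;", "&#8594;", "&#x2192;"],
}

def decode_all(entities):
    """Decode each entity string with the standard-library decoder."""
    return [html.unescape(entity) for entity in entities]

if __name__ == "__main__":
    for expected, entities in NOTATIONS.items():
        decoded = decode_all(entities)
        # Every notation should decode to the same single character.
        print(repr(expected), entities, all(d == expected for d in decoded))
```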

In every browser, HTML entities are decoded only in parameter values and stray text between tags. Entities have no effect on how the general structure of a document is understood, and have no special meaning in sections such as <SCRIPT>. The ability to understand and parse the syntax is still critical to properly understanding the value of a particular HTML parameter, however. For example, as hinted in one of the earlier sections, <A HREF="javascript&#09;:alert(1)"> may need to be parsed as an absolute reference to javascript<TAB>:alert(1), as opposed to a link to something called javascript& with a local URL hash string part of #09;:alert(1).
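The decoding scope described above can be observed with Python's `html.parser.HTMLParser`, which also decodes entities in attribute values and ordinary text but leaves the contents of <SCRIPT> untouched. This is a sketch under the assumption that the standard-library parser is a reasonable stand-in; it does not reproduce the quirks of any particular browser. The class `EntityProbe`, the function `probe`, and the `SAMPLE` markup are hypothetical names introduced here.

```python
from html.parser import HTMLParser

class EntityProbe(HTMLParser):
    """Records href values and text nodes as the parser reports them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []       # decoded href attribute values
        self.data = []        # (enclosing tag, text) pairs
        self._current = None  # most recently opened tag

    def handle_starttag(self, tag, attrs):
        self._current = tag
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)

    def handle_data(self, data):
        self.data.append((self._current, data))

def probe(markup):
    parser = EntityProbe()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser

SAMPLE = (
    '<a href="javascript&#09;:alert(1)">x &lt; y</a>'
    '<script>var s = "&lt;";</script>'
)

if __name__ == "__main__":
    result = probe(SAMPLE)
    print(result.hrefs)  # the &#09; in the attribute arrives as a tab
    print(result.data)   # link text is decoded, script text is not
```

The point of the sketch is that any filter inspecting the raw attribute string sees `javascript&#09;`, while the consumer of the parsed value sees `javascript`, a tab, and then `:alert(1)`.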

Unfortunately, browsers apply different parsing rules to these HTML entity notations. All rendering engines recognize entities that lack the proper ; terminator, and all permit entities with excessively long, zero-padded notation, but with varying thresholds:

Test description	MSIE6	MSIE7	MSIE8	FF2	FF3	Safari	Opera	Chrome	Android
Maximum length of a correctly terminated decimal entity	7	7	7	∞	∞	8^*	∞	8^*	8^*
Maximum length of an incorrectly terminated decimal entity	7	7	7	∞	∞	8^*	∞	8^*	8^*
Maximum length of a correctly terminated hex entity	6	6	6	∞	∞	8^*	∞	8^*	8^*
Maximum length of an incorrectly terminated hex entity	0	0	0	∞	∞	8^*	∞	8^*	8^*
Characters permitted in entity names (excluding `A-Z a-z 0-9`)	none	none	none	- .	- .	none	none	none	none

^* Entries one byte longer than this limit still get parsed, but incorrectly; for example, &#000000065; becomes a sequence of three characters: \x06, 5, and ;. Entries two or more characters over the limit do not get parsed at all (&#0000000065; is displayed literally).
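The footnote's description of the 8-digit limit can be modelled in a few lines. The sketch below is only a toy model of the behavior reported in the table for Safari, Chrome, and Android, not code from any browser; the function name `decode_decimal_entity_legacy` and the constant `DIGIT_LIMIT` are hypothetical, and values outside the Unicode range are deliberately not modelled.

```python
import re

DIGIT_LIMIT = 8  # threshold reported in the table for Safari, Chrome, Android

def decode_decimal_entity_legacy(entity, limit=DIGIT_LIMIT):
    """Toy model of the footnoted behavior for one terminated decimal entity."""
    match = re.fullmatch(r"&#([0-9]+);", entity)
    if match is None:
        return entity  # not a terminated decimal entity: leave as-is
    digits = match.group(1)
    overflow = len(digits) - limit
    if overflow >= 2:
        return entity  # two or more digits over the limit: shown literally
    consumed = digits[:limit] if overflow == 1 else digits
    code = int(consumed)
    if code > 0x10FFFF:
        return entity  # out-of-range values are not modelled here
    # One digit over the limit: the first `limit` digits form the character,
    # and the leftover digit plus the terminator remain as plain text.
    leftover = digits[limit:] + ";" if overflow == 1 else ""
    return chr(code) + leftover

if __name__ == "__main__":
    for sample in ["&#65;", "&#00000065;", "&#000000065;", "&#0000000065;"]:
        print(sample, repr(decode_decimal_entity_legacy(sample)))
```

For comparison, Python's own `html.unescape` applies no such digit limit, which is in line with the ∞ entries in the table for engines without a threshold.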

An interesting piece of trivia is that, as per HTML entity encoding requirements, links such as:

http://example.com/?p1=v1&p2=v2

should technically always be encoded in HTML parameters (but not in JavaScript code) as:

<a href="http://example.com/?p1=v1&amp;p2=v2">Click here</a>

In practice, however, the convention is almost never followed by web developers, and browsers compensate for it by treating invalid HTML entities as literal &-containing strings.
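For completeness, here is a small sketch of producing the correctly encoded form with Python's `html.escape`, and of the lenient fallback in which an unencoded `&` that does not start a known entity is left as a literal. The helper `href_attribute` and the variable names are assumptions made for this example, and `html.unescape` stands in for a browser's decoder only approximately.

```python
import html

def href_attribute(url):
    """Build an anchor whose href value is entity-encoded for HTML."""
    return '<a href="{}">Click here</a>'.format(html.escape(url, quote=True))

URL = "http://example.com/?p1=v1&p2=v2"
ENCODED = href_attribute(URL)

if __name__ == "__main__":
    print(ENCODED)
    # Decoding the properly encoded value recovers the original URL.
    print(html.unescape("http://example.com/?p1=v1&amp;p2=v2") == URL)
    # The unencoded URL also survives decoding, because "&p2" is not an entity.
    print(html.unescape(URL) == URL)
```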