On browser security topics: http://code.google.com/p/browsersec
This article is reposted from: http://code.google.com/p/browsersec/wiki/Part1#HTML_entity_encoding
For more detail, see the section of the HTML specification on character parsing: http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html
HTML entity encoding
HTML features a special encoding scheme called HTML entities. The purpose of this scheme is to make it possible to safely render certain reserved HTML characters (e.g., < > &) within documents, as well as to carry high bit characters safely over 7-bit media. The scheme nominally permits three types of notation:
- One of the predefined, named entities, in the format &<name>; for example, &lt; for <, &gt; for >, &rarr; for →, etc.,
- Decimal entities, &#<nn>;, with a number corresponding to the desired Unicode character value; for example, &#60; for <, &#8594; for →,
- Hexadecimal entities, &#x<nn>;, likewise; for example, &#x3c; for <, &#x2192; for →.
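The three notations above can be checked against each other with a short sketch. It uses `html.unescape` from the Python standard library, which follows HTML5 decoding rules and is not a model of any browser in the table further down; the `samples` and `decoded` names are only illustrative.

```python
import html

# The same three characters (<, >, →) written in each notation.
samples = {
    "named": "&lt; &gt; &rarr;",
    "decimal": "&#60; &#62; &#8594;",
    "hex": "&#x3c; &#x3e; &#x2192;",
}

# html.unescape decodes named, decimal and hexadecimal references alike.
decoded = {kind: html.unescape(text) for kind, text in samples.items()}

for kind, text in decoded.items():
    print(kind, "->", text)
```

All three inputs should decode to the same string, which is the point of the scheme: the notation is a transport detail, not part of the character's identity.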
In every browser, HTML entities are decoded only in parameter values and stray text between tags. Entities have no effect on how the general structure of a document is understood, and no special meaning in sections such as <SCRIPT>. The ability to understand and parse the syntax is still critical to properly understanding the value of a particular HTML parameter, however. For example, as hinted in one of the earlier sections, <A HREF="javascript&#09;:alert(1)"> may need to be parsed as an absolute reference to javascript<TAB>:alert(1), as opposed to a link to something called javascript& with a local URL hash string part of #09;:alert(1).
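As a rough illustration of where decoding happens, the sketch below feeds a small document to Python's `html.parser.HTMLParser`. This is a stand-in for a browser parser, not a reproduction of one, and the `Collector` class and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Records attribute values and text per element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.attrs = {}
        self.text = {}
        self._open = None

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive with character references already decoded.
        self._open = tag
        self.attrs[tag] = dict(attrs)

    def handle_endtag(self, tag):
        self._open = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than overwrite.
        if self._open is not None:
            self.text[self._open] = self.text.get(self._open, "") + data


doc = (
    '<A HREF="javascript&#09;:alert(1)">link</A>'
    '<SCRIPT>var s = "&lt;";</SCRIPT>'
    '<P>&lt;</P>'
)

collector = Collector()
collector.feed(doc)
collector.close()

href = collector.attrs["a"]["href"]
script_text = collector.text["script"]
paragraph_text = collector.text["p"]

print(repr(href))
print(repr(script_text))
print(repr(paragraph_text))
```

The attribute value and the paragraph text come back decoded, while the script body is passed through untouched. Any filter that inspects the raw HREF string rather than the decoded value would miss the embedded tab.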
Unfortunately, browsers apply different parsing rules to these HTML entity notations. All rendering engines recognize entities with no proper ; terminator, and all permit entities with excessively long, zero-padded notation, but with various thresholds:
Test description | MSIE6 | MSIE7 | MSIE8 | FF2 | FF3 | Safari | Opera | Chrome | Android |
Maximum length of a correctly terminated decimal entity | 7 | 7 | 7 | ∞ | ∞ | 8* | ∞ | 8* | 8* |
Maximum length of an incorrectly terminated decimal entity | 7 | 7 | 7 | ∞ | ∞ | 8* | ∞ | 8* | 8* |
Maximum length of a correctly terminated hex entity | 6 | 6 | 6 | ∞ | ∞ | 8* | ∞ | 8* | 8* |
Maximum length of an incorrectly terminated hex entity | 0 | 0 | 0 | ∞ | ∞ | 8* | ∞ | 8* | 8* |
Characters permitted in entity names (excluding A-Z a-z 0-9) | none | none | none | - . | - . | none | none | none | none |
* Entities one character longer than this limit still get parsed, but incorrectly; for example, &#000000065; becomes a sequence of three characters, \x06 5 ;. Entities two or more characters over the limit do not get parsed at all (&#0000000065; is displayed literally).
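The effect of a digit limit can be approximated with a toy decoder. This is a deliberately simplified sketch, not any browser's algorithm: `decode_decimal` and its `max_digits` parameter are hypothetical, only decimal entities are handled, and the off-by-one misparse described in the footnote is not modeled (over-long entities are simply left literal).

```python
import re

# Decimal entity with an optional ";" terminator.
_DECIMAL = re.compile(r"&#([0-9]+);?")


def decode_decimal(text, max_digits=None):
    """Decode decimal entities, leaving over-long ones untouched.

    max_digits=None models the "no limit" column; an integer models a
    parser that gives up on entities with more digits than that.
    """

    def repl(match):
        digits = match.group(1)
        if max_digits is not None and len(digits) > max_digits:
            return match.group(0)  # too long: keep the literal text
        code = int(digits)
        if code > 0x10FFFF:
            return match.group(0)  # outside the Unicode range
        return chr(code)

    return _DECIMAL.sub(repl, text)


print(decode_decimal("&#65;"))                        # terminated
print(decode_decimal("&#65"))                         # missing terminator
print(decode_decimal("&#0000000065;"))                # zero-padded, no limit
print(decode_decimal("&#0000000065;", max_digits=7))  # over a 7-digit limit
```

The same input therefore decodes to a character under one threshold and stays literal under another, which is why a sanitizer cannot assume that its own decoder agrees with the browser's.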
An interesting piece of trivia is that, as per HTML entity encoding requirements, links such as:
http://example.com/?p1=v1&p2=v2
Should technically always be encoded in HTML parameters (but not in JavaScript code) as:
<a href="http://example.com/?p1=v1&amp;p2=v2">Click here</a>
In practice, however, the convention is almost never followed by web developers, and browsers compensate for it by treating invalid HTML entities as literal &-containing strings.
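A minimal sketch of both halves of this point, using Python's standard `html` module as a stand-in for a template layer and an HTML5-style decoder (the variable names are illustrative only):

```python
import html

url = "http://example.com/?p1=v1&p2=v2"

# Correct approach: entity-encode the URL before placing it in an attribute.
encoded = html.escape(url, quote=True)
anchor = '<a href="{}">Click here</a>'.format(encoded)
print(anchor)

# Decoding the encoded form recovers the original URL.
round_trip = html.unescape(encoded)

# Lenient behavior: "&p2" is not a known entity name, so decoding the
# unencoded URL leaves the ampersand in place.
lenient = html.unescape(url)
```

The lenient path only works because p2 does not happen to match an entity name. A parameter named after a real entity could be decoded by a tolerant parser, which is one practical reason to encode the ampersand anyway.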