Character Encoding in Web Pages (HTML Unicode Entity Encoding)

Latest recommended article published on 2023-04-18 11:04:05

weixin_30367543

Original link: http://www.cnblogs.com/zccee/archive/2012/02/04/2338515.html

1. Encoding conversion (to Unicode)

(The code below was collected from the internet.)

JavaScript version

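<script>
      // Convert each UTF-16 code unit of the string to a \uXXXX escape.
      var test = "你好abc";
      var str = "";
      for (var i = 0; i < test.length; i++)
      {
       var temp = test.charCodeAt(i).toString(16);
       // Left-pad the hex value with zeros to four digits.
       str += "\\u" + new Array(5 - temp.length).join("0") + temp;
      }
      document.write(str); // writes \u4f60\u597d\u0061\u0062\u0063
</script>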
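
The loop above can also be wrapped in a reusable function. The following is only a minimal sketch: the name `toUnicodeEscapes` is hypothetical and not from the original article, and it assumes an environment that supports `String.prototype.padStart` (ES2017). Like the original loop, it works on UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two escapes (a surrogate pair).

```javascript
// Hypothetical helper (not from the original article): escape every
// UTF-16 code unit of a string as \uXXXX.
function toUnicodeEscapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    // charCodeAt returns the code unit (0..65535) at index i.
    const hex = text.charCodeAt(i).toString(16);
    // Left-pad to four hex digits, e.g. "61" becomes "0061".
    out += "\\u" + hex.padStart(4, "0");
  }
  return out;
}

console.log(toUnicodeEscapes("你好abc")); // \u4f60\u597d\u0061\u0062\u0063
```

Returning the string instead of calling `document.write` lets the same function run both in a browser and in Node.js.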

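VBScript version

Function Unicode(str1)
      ' Convert each character of str1 to a \uXXXX escape.
      Dim str, temp, i
      str = ""
      For i = 1 To Len(str1)
       temp = Hex(AscW(Mid(str1, i, 1)))
       ' Left-pad the hex value with zeros to four digits.
       If Len(temp) < 5 Then temp = Right("0000" & temp, 4)
       str = str & "\u" & temp
      Next
      Unicode = str
End Function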

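Function htmlentities(str)
      ' Replace every non-ASCII character with a decimal numeric entity.
      Dim i, char, code
      For i = 1 To Len(str)
          char = Mid(str, i, 1)
          code = AscW(char)
          ' AscW returns a signed 16-bit value, so characters at or above
          ' U+8000 come back negative; map them to the unsigned range.
          If code < 0 Then code = code + 65536
          If code > 127 Then
              htmlentities = htmlentities & "&#" & code & ";"
          Else
              htmlentities = htmlentities & char
          End If
      Next
End Function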
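
For comparison, the same idea might look like this in JavaScript. This is a sketch rather than part of the original article: `toHtmlEntities` is a hypothetical name, and the code iterates by code point (via `for...of` and `codePointAt`) so that characters outside the Basic Multilingual Plane yield a single numeric entity instead of two surrogate halves.

```javascript
// Hypothetical helper (not from the original article): replace every
// non-ASCII character with a decimal numeric character reference.
function toHtmlEntities(text) {
  let out = "";
  // for...of walks the string by code point, not by UTF-16 code unit.
  for (const ch of text) {
    const code = ch.codePointAt(0);
    out += code > 127 ? "&#" + code + ";" : ch;
  }
  return out;
}

console.log(toHtmlEntities("你好abc")); // &#20320;&#22909;abc
```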

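ColdFusion version

function nochaoscode(str)
{
      // Replace every non-ASCII character with a decimal numeric entity.
      var new_str = "";
      var i = 0;
      for(i=1; i lte len(str); i=i+1){
          if(asc(mid(str,i,1)) lt 128){
              new_str = new_str & mid(str,i,1);
          }else{
              // "##" is an escaped "#" inside a ColdFusion string.
              new_str = new_str & "&##" & asc(mid(str,i,1)) & ";";
          }
      }
      return new_str;
}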

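Appendix:

In PHP, the mbstring function mb_convert_encoding can perform this conversion in both directions. Note that the HTML-ENTITIES encoding is deprecated as of PHP 8.2. For example:

mb_convert_encoding("你好", "HTML-ENTITIES", "gb2312"); // output: &#20320;&#22909;
mb_convert_encoding("&#20320;&#22909;", "gb2312", "HTML-ENTITIES"); // output: 你好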
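
The reverse direction (numeric entities back to characters) can be sketched in JavaScript as well. The function name `decodeNumericEntities` is hypothetical; the sketch handles only decimal (`&#20320;`) and hexadecimal (`&#x4f60;`) numeric references and leaves named entities such as `&amp;` untouched.

```javascript
// Hypothetical helper (not from the original article): decode decimal
// and hexadecimal numeric character references back into characters.
function decodeNumericEntities(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const code = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // String.fromCodePoint throws above U+10FFFF, so leave such input as-is.
    return code <= 0x10ffff ? String.fromCodePoint(code) : match;
  });
}

console.log(decodeNumericEntities("&#20320;&#22909;abc")); // 你好abc
```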

To convert an entire page, just add these three lines at the top of the PHP file:

mb_internal_encoding("gb2312"); // gb2312 here is your site's original encoding
mb_http_output("HTML-ENTITIES");
ob_start('mb_output_handler');

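If the mbstring extension is not enabled, see these two articles on coolcode.cn:
在任意字符集下正常显示网页的方法 (How to display web pages correctly under any character set)
在任意字符集下正常显示网页的方法（续） (How to display web pages correctly under any character set, continued)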

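2. HTML entities

HTML 4.01 supports the ISO 8859-1 (Latin-1) character set.

Tip: entity names are case-sensitive.

Note: the same symbol can be referenced either by its entity name or by its entity number. Entity names are easier to remember, but not every browser is guaranteed to recognize all of them; entity numbers do not have that problem, but they are hard to remember.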

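Entity names for some ASCII characters

Display	Description	Entity name	Entity number
"	quotation mark	&quot;	&#34;
'	apostrophe	&apos; (not supported in IE)	&#39;
&	ampersand	&amp;	&#38;
<	less-than	&lt;	&#60;
>	greater-than	&gt;	&#62;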
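
The five characters in the table above are the ones that usually need escaping before text is placed into HTML. A minimal sketch of such an escaper follows; `escapeHtml` and `ASCII_ENTITIES` are hypothetical names, and the apostrophe is written as the entity number `&#39;` because, as the table notes, `&apos;` does not work in IE.

```javascript
// Hypothetical lookup table (not from the original article) mapping the
// five special ASCII characters to their entity references.
const ASCII_ENTITIES = {
  '"': "&quot;",
  "'": "&#39;", // entity number, since &apos; is not recognized by IE
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
};

// Replace each special character with its entity in a single pass, so
// the "&" of an entity that was just produced is never escaped again.
function escapeHtml(text) {
  return text.replace(/["'&<>]/g, (ch) => ASCII_ENTITIES[ch]);
}

console.log(escapeHtml('<a href="x">Tom & Jerry\'s</a>'));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
```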

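ISO 8859-1 symbol entities

Display	Description	Entity name	Entity number
	non-breaking space	&nbsp;	&#160;
¡	inverted exclamation mark	&iexcl;	&#161;
¤	currency	&curren;	&#164;
¢	cent	&cent;	&#162;
£	pound	&pound;	&#163;
¥	yen	&yen;	&#165;
¦	broken vertical bar	&brvbar;	&#166;
§	section	&sect;	&#167;
¨	spacing diaeresis	&uml;	&#168;
©	copyright	&copy;	&#169;
ª	feminine ordinal indicator	&ordf;	&#170;
«	angle quotation mark (left)	&laquo;	&#171;
¬	negation	&not;	&#172;
	soft hyphen	&shy;	&#173;
®	registered trademark	&reg;	&#174;
™	trademark	&trade;	&#8482;
¯	spacing macron	&macr;	&#175;
°	degree	&deg;	&#176;
±	plus-or-minus	&plusmn;	&#177;
²	superscript 2	&sup2;	&#178;
³	superscript 3	&sup3;	&#179;
´	spacing acute	&acute;	&#180;
µ	micro	&micro;	&#181;
¶	paragraph	&para;	&#182;
·	middle dot	&middot;	&#183;
¸	spacing cedilla	&cedil;	&#184;
¹	superscript 1	&sup1;	&#185;
º	masculine ordinal indicator	&ordm;	&#186;
»	angle quotation mark (right)	&raquo;	&#187;
¼	fraction 1/4	&frac14;	&#188;
½	fraction 1/2	&frac12;	&#189;
¾	fraction 3/4	&frac34;	&#190;
¿	inverted question mark	&iquest;	&#191;
×	multiplication	&times;	&#215;
÷	division	&divide;	&#247;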

ISO 8859-1 character entities

Display	Description	Entity name	Entity number
À	capital a, grave accent	&Agrave;	&#192;
Á	capital a, acute accent	&Aacute;	&#193;
Â	capital a, circumflex accent	&Acirc;	&#194;
Ã	capital a, tilde	&Atilde;	&#195;
Ä	capital a, umlaut mark	&Auml;	&#196;
Å	capital a, ring	&Aring;	&#197;
Æ	capital ae	&AElig;	&#198;
Ç	capital c, cedilla	&Ccedil;	&#199;
È	capital e, grave accent	&Egrave;	&#200;
É	capital e, acute accent	&Eacute;	&#201;
Ê	capital e, circumflex accent	&Ecirc;	&#202;
Ë	capital e, umlaut mark	&Euml;	&#203;
Ì	capital i, grave accent	&Igrave;	&#204;
Í	capital i, acute accent	&Iacute;	&#205;
Î	capital i, circumflex accent	&Icirc;	&#206;
Ï	capital i, umlaut mark	&Iuml;	&#207;
Ð	capital eth, Icelandic	&ETH;	&#208;
Ñ	capital n, tilde	&Ntilde;	&#209;
Ò	capital o, grave accent	&Ograve;	&#210;
Ó	capital o, acute accent	&Oacute;	&#211;
Ô	capital o, circumflex accent	&Ocirc;	&#212;
Õ	capital o, tilde	&Otilde;	&#213;
Ö	capital o, umlaut mark	&Ouml;	&#214;
Ø	capital o, slash	&Oslash;	&#216;
Ù	capital u, grave accent	&Ugrave;	&#217;
Ú	capital u, acute accent	&Uacute;	&#218;
Û	capital u, circumflex accent	&Ucirc;	&#219;
Ü	capital u, umlaut mark	&Uuml;	&#220;
Ý	capital y, acute accent	&Yacute;	&#221;
Þ	capital THORN, Icelandic	&THORN;	&#222;
ß	small sharp s, German	&szlig;	&#223;
à	small a, grave accent	&agrave;	&#224;
á	small a, acute accent	&aacute;	&#225;
â	small a, circumflex accent	&acirc;	&#226;
ã	small a, tilde	&atilde;	&#227;
ä	small a, umlaut mark	&auml;	&#228;
å	small a, ring	&aring;	&#229;
æ	small ae	&aelig;	&#230;
ç	small c, cedilla	&ccedil;	&#231;
è	small e, grave accent	&egrave;	&#232;
é	small e, acute accent	&eacute;	&#233;
ê	small e, circumflex accent	&ecirc;	&#234;
ë	small e, umlaut mark	&euml;	&#235;
ì	small i, grave accent	&igrave;	&#236;
í	small i, acute accent	&iacute;	&#237;
î	small i, circumflex accent	&icirc;	&#238;
ï	small i, umlaut mark	&iuml;	&#239;
ð	small eth, Icelandic	&eth;	&#240;
ñ	small n, tilde	&ntilde;	&#241;
ò	small o, grave accent	&ograve;	&#242;
ó	small o, acute accent	&oacute;	&#243;
ô	small o, circumflex accent	&ocirc;	&#244;
õ	small o, tilde	&otilde;	&#245;
ö	small o, umlaut mark	&ouml;	&#246;
ø	small o, slash	&oslash;	&#248;
ù	small u, grave accent	&ugrave;	&#249;
ú	small u, acute accent	&uacute;	&#250;
û	small u, circumflex accent	&ucirc;	&#251;
ü	small u, umlaut mark	&uuml;	&#252;
ý	small y, acute accent	&yacute;	&#253;
þ	small thorn, Icelandic	&thorn;	&#254;
ÿ	small y, umlaut mark	&yuml;	&#255;

Other entities supported by HTML

Display	Description	Entity name	Entity number
Œ	capital ligature OE	&OElig;	&#338;
œ	small ligature oe	&oelig;	&#339;
Š	capital S with caron	&Scaron;	&#352;
š	small s with caron	&scaron;	&#353;
Ÿ	capital Y with diaeresis	&Yuml;	&#376;
ˆ	modifier letter circumflex accent	&circ;	&#710;
˜	small tilde	&tilde;	&#732;
	en space	&ensp;	&#8194;
	em space	&emsp;	&#8195;
	thin space	&thinsp;	&#8201;
	zero width non-joiner	&zwnj;	&#8204;
	zero width joiner	&zwj;	&#8205;
	left-to-right mark	&lrm;	&#8206;
	right-to-left mark	&rlm;	&#8207;
–	en dash	&ndash;	&#8211;
—	em dash	&mdash;	&#8212;
‘	left single quotation mark	&lsquo;	&#8216;
’	right single quotation mark	&rsquo;	&#8217;
‚	single low-9 quotation mark	&sbquo;	&#8218;
“	left double quotation mark	&ldquo;	&#8220;
”	right double quotation mark	&rdquo;	&#8221;
„	double low-9 quotation mark	&bdquo;	&#8222;
†	dagger	&dagger;	&#8224;
‡	double dagger	&Dagger;	&#8225;
…	horizontal ellipsis	&hellip;	&#8230;
‰	per mille	&permil;	&#8240;
‹	single left-pointing angle quotation	&lsaquo;	&#8249;
›	single right-pointing angle quotation	&rsaquo;	&#8250;
€	euro	&euro;	&#8364;
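
Since entity numbers are the more widely recognized form, named entities can be rewritten as numbers before output. The sketch below is illustrative only: `namedToNumeric` and `ENTITY_NUMBERS` are hypothetical names, and the lookup table holds just a handful of rows from the tables above, so a real table would need every name you expect to meet. Names that are not in the table are left unchanged.

```javascript
// Hypothetical, deliberately incomplete lookup table (not from the
// original article): a few entity names and their entity numbers.
const ENTITY_NUMBERS = {
  nbsp: 160,
  copy: 169,
  reg: 174,
  mdash: 8212,
  hellip: 8230,
  euro: 8364,
};

// Rewrite known named entities as decimal numeric entities.
function namedToNumeric(html) {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) =>
    // Entity names are case-sensitive, so the lookup is exact.
    Object.prototype.hasOwnProperty.call(ENTITY_NUMBERS, name)
      ? "&#" + ENTITY_NUMBERS[name] + ";"
      : match
  );
}

console.log(namedToNumeric("&copy; 2012 &euro;5 &unknown;"));
// &#169; 2012 &#8364;5 &unknown;
```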