Can XML use non-Latin characters?

java169

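Tags: xml character encoding numbers google rest

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/java169/article/details/2464145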

Yes, the XML Specification explicitly says XML uses ISO 10646, the international standard 31-bit character repertoire which covers most human (and some non-human) languages. This is currently congruent with Unicode and is planned to be a superset of Unicode.

The spec says (2.2): "All XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646...".

UTF-8 is an encoding of Unicode into sequences of 8-bit bytes: the first 128 code points are encoded exactly as in ASCII, and the rest of Unicode is encoded as sequences of between 2 and 4 bytes (the original definition allowed up to 6 bytes to cover the full 31-bit space; RFC 3629 restricts UTF-8 to 4 bytes, which is enough to reach U+10FFFF). UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other unaccented languages using the Latin alphabet. Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII).

UTF-16 is an encoding of Unicode into 16-bit code units: characters in the Basic Multilingual Plane take one unit, and characters in the supplementary planes take a pair of units (a surrogate pair). UTF-16 is incompatible with ASCII because it uses at least two 8-bit bytes per character.

"...the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are [...] in the discussion of character encodings." The XML Specification explains how to specify in your XML file which coded character set you are using.

Use of UCS-4 can only legally be specified in SGML or XML when the WebSGML Adaptations to ISO 8879 are implemented: this enables numbers longer than eight digits to be used in the SGML Declaration.

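The byte-level claims about UTF-8, UTF-16 and ISO 8859-1 can be checked with a short Python sketch. This is only an illustration using Python's built-in codecs, not something taken from the XML Specification; the helper name `encoded_lengths` and the sample characters are assumptions chosen for the example.

```python
# Hypothetical helper: byte counts of one character in UTF-8 and UTF-16.
# "utf-16-be" is used so that no byte order mark is added to the count.
def encoded_lengths(ch):
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# Sample characters (illustrative choices), written as escapes so the
# source file itself stays pure ASCII.
samples = [
    ("\u0041", "ASCII letter A"),
    ("\u00e9", "Latin-1 letter e-acute"),
    ("\u4e2d", "CJK ideograph"),
    ("\U0001F600", "supplementary-plane character"),
]

for ch, label in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X} ({label}): UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# ASCII text is byte-for-byte identical in UTF-8 ...
assert "XML".encode("utf-8") == "XML".encode("ascii")
# ... but not in UTF-16, where every ASCII character occupies two bytes.
assert "XML".encode("utf-16-be") == b"\x00X\x00M\x00L"

# Above code point 127 the same character has different bytes in
# ISO 8859-1 and UTF-8, which is why the two cannot be mixed.
assert "\u00e9".encode("latin-1") == b"\xe9"
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
```

A file saved as ISO 8859-1 but read as UTF-8 will therefore fail (or show the wrong characters) as soon as it contains anything outside ASCII.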
"Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string": so no matter which character set you personally use, you can still refer to specific individual characters from elsewhere in the encoded repertoire by using &#dddd; (decimal character code) or &#xHHHH; (hexadecimal character code; the x must be lowercase, but the hexadecimal digits may be in either case).

The terminology can get confusing, as can the numbers: see the ISO 10646 Concept Dictionary. Rick Jelliffe has XML-ized the ISO character entity sets. Mike Brown's encoding information at http://skew.org/xml/tutorial/ is a very useful explanation of the need for correct encoding. There is an excellent online database of glyphs and characters in many encodings from the Estonian Language Institute server at http://www.eki.ee/letter/.
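
As a minimal sketch of how character references behave in practice, the following Python example parses the same text twice with the standard library's `xml.etree.ElementTree`: once from a pure-ASCII document that uses numeric references, and once from a UTF-8 document that stores the characters directly. The sample documents and variable names are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A document stored as pure ASCII bytes. The non-ASCII characters are written
# as a decimal reference (&#233; is U+00E9) and a hexadecimal reference
# (&#x4E2D; is U+4E2D).
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?><p>caf&#233; &#x4E2D;</p>'
text_from_refs = ET.fromstring(ascii_doc).text

# The same content with the characters stored directly as UTF-8 bytes.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9 \u4e2d</p>'.encode("utf-8")
text_from_utf8 = ET.fromstring(utf8_doc).text

# Both documents yield the same character data after parsing.
assert text_from_refs == text_from_utf8 == "caf\u00e9 \u4e2d"

# Hexadecimal digits may be lowercase as well as uppercase.
assert ET.fromstring(b"<p>&#x4e2d;</p>").text == "\u4e2d"

# Going the other way: serializing to an ASCII-only encoding makes
# ElementTree write the non-ASCII characters as numeric references.
element = ET.fromstring(utf8_doc)
ascii_bytes = ET.tostring(element, encoding="us-ascii")
print(ascii_bytes.decode("ascii"))
```

The references are resolved by the parser, so application code sees ordinary characters regardless of which form the file used.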
