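For a short answer: I would recommend UTF-16, for the sake of simplicity. Java and C# use UTF-16 strings, and Python 3.0 moved to Unicode-only strings, precisely to keep things simple.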
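I would always expect wchar_t to be 16 or 32 bits wide, and many platforms support that. Indeed, APIs like wcrtomb() do not allow an implementation to support shift states for wchar_t* strings; since UTF-8 does not need them, it can be used, while other encodings are ruled out.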
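To check the 16-or-32-bit expectation on a given machine, a quick sketch using Python's standard ctypes module is enough. The variable name wchar_bits is only illustrative; typically this reports 16 on Windows and 32 on Linux and macOS.

```python
import ctypes

# ctypes.c_wchar maps to the platform's C wchar_t type,
# so its size tells us how wide wchar_t is here.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8

print(f"wchar_t is {wchar_bits} bits wide on this platform")
```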
Now for the question about XML.
File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how to retrieve the values? I guess, a XML node may contain UTF-16 text, and then I have to work with it somehow.
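I am not sure, but I do not think so.
Mixing two encodings in the same file is asking for trouble and data corruption.
Encoding a file in UTF-16 is usually a bad choice, because most programs rely on ASCII being usable everywhere.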
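A small sketch of why ASCII-oriented tools struggle with UTF-16 files: the same ASCII-only text is byte-identical in ASCII and UTF-8, but in UTF-16 every character is padded with a NUL byte, so a byte-level search for a tag no longer matches. The sample string and variable names are made up for illustration.

```python
text = "<a>hi</a>"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no BOM

# ASCII-only text is byte-for-byte the same in ASCII and UTF-8.
same_as_ascii = utf8_bytes == text.encode("ascii")

# In UTF-16 each of these characters takes two bytes, one of them NUL.
nul_count = utf16_bytes.count(b"\x00")

# A naive byte search for the tag works on UTF-8 but not on UTF-16.
found_in_utf8 = b"<a>" in utf8_bytes
found_in_utf16 = b"<a>" in utf16_bytes

print(same_as_ascii, nul_count, found_in_utf8, found_in_utf16)
```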
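The point is this: an XML file may use any single encoding, possibly even UTF-16, but then the initial encoding declaration must be in UTF-16 as well, and so must the tags. The problem I see with UTF-16 is: how is a parser supposed to read the initial declaration reliably? The answer comes from the XML specification, §4.3.3: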
In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
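When reading that passage, note that an XML file is itself an entity, called the document entity. In general, an entity is a storage unit of the document. From the specification as a whole, I would say that only one encoding declaration is allowed per entity, and I would convert all entities to UTF-16 when reading them, for easier handling.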
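As a rough sketch of the rule quoted above, an entity's encoding can be guessed from its first bytes: a Byte Order Mark if present, otherwise the encoding declaration, otherwise UTF-8. The function names sniff_xml_encoding and decode_xml_entity are hypothetical, not part of any library, and the sketch makes simplifying assumptions: it ignores UTF-32 and EBCDIC, ignores any external transport information (HTTP or MIME), and decodes into a Python str rather than UTF-16. A real XML parser already does all of this for you.

```python
import codecs
import re

# Byte Order Marks this sketch recognizes (UTF-32 is deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Encoding declaration as it appears in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of one XML entity from its leading bytes."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # "<?" encoded as UTF-16 without a BOM shows up with interleaved NULs.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Neither a BOM nor a declaration: the entity must be UTF-8.
    return "utf-8"


def decode_xml_entity(data: bytes) -> str:
    """Decode one entity to a str, dropping a leading BOM if present."""
    text = data.decode(sniff_xml_encoding(data))
    return text[1:] if text.startswith("\ufeff") else text
```

Each entity is sniffed and decoded separately, which matches the reading that an entity carries at most one encoding declaration; once everything is decoded to a single in-memory representation, the original encodings no longer matter.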
Blog: