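An in-memory XML document built with StringBuilder loads into XmlDocument.LoadXml without any problem when it is converted with ToString(); in that case the encoding is the default Unicode, UTF-16. But if the XML is written to a MemoryStream with the writer's Encoding property set to Encoding.UTF8, and the result is then converted to a string with Encoding.UTF8.GetString(stream.ToArray());, loading it into XmlDocument fails with an unrecognized-character error. Replace the encoding with new UTF8Encoding() and the problem goes away. Why?
The secret lies in the encoding's preamble, the identifying bytes placed in front of the data. Inspecting the bytes of the converted result shows it: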
System.Text.Encoding.UTF8.GetBytes(ss);
In the failing case, three extra bytes, EF BB BF, appear at the front. According to MSDN, they are there to identify the encoding:
Optionally, the Encoding provides a preamble which is an array of bytes that can be prefixed to the sequence of bytes resulting from the encoding process. If the preamble contains a byte order mark (In Unicode, code point U+FEFF), it helps the decoder determine the byte order and the transformation format or UTF. The Unicode byte order mark is serialized as follows (in hexadecimal):
- UTF-8: EF BB BF
- UTF-16 big-endian byte order: FE FF
- UTF-16 little-endian byte order: FF FE
- UTF-32 big-endian byte order: 00 00 FE FF
- UTF-32 little-endian byte order: FF FE 00 00
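The byte sequences listed above can be checked directly from .NET. The following is a minimal sketch, not part of the original post; the class name PreambleDemo and the helper Hex are made up for illustration. It prints the preamble that each built-in encoding reports through GetPreamble().

```csharp
using System;
using System.Text;

public static class PreambleDemo
{
    // Hypothetical helper: formats bytes as space-separated hex, e.g. "EF BB BF".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        // Each line prints the byte order mark the encoding would place
        // in front of its output when a writer asks for the preamble.
        Console.WriteLine("UTF-8:     " + Hex(Encoding.UTF8.GetPreamble()));
        Console.WriteLine("UTF-16 BE: " + Hex(Encoding.BigEndianUnicode.GetPreamble()));
        Console.WriteLine("UTF-16 LE: " + Hex(Encoding.Unicode.GetPreamble()));
        Console.WriteLine("UTF-32 BE: " + Hex(new UTF32Encoding(true, true).GetPreamble()));
        Console.WriteLine("UTF-32 LE: " + Hex(Encoding.UTF32.GetPreamble()));
    }
}
```

The values printed should match the MSDN list, one encoding per line.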
Then why does new UTF8Encoding() work without any problem?
To quote an explanation from an English-speaking developer:
The hex value you see sets the byte ordering mark of the text. If you are using UTF8, should be 3 characters long and of value 0xEFBBBF. You can actually see it by calling Encoding.UTF8.GetPreamble().
I searched more thoroughly, and there is actually a difference between the two calls you were making:
Encoding.UTF8 returns a new instance of UTF8Encoding(true ), so you get an encoder that use the preamble of UTF8 for all encoding operation.
When you called UTF8Encoding(), the default is to call UTF8Encoding(false ), which does not use the preamble of UTF8 for encoding operation. (the preamble will then be an empty byte array)
So when you used Encoding.UTF8, the preamble was emitted, rendering your data invalid.
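The difference described in the quote can be reproduced with a short experiment. This is only a sketch under assumptions: the original post does not show its writer code, so XmlWriter with XmlWriterSettings.Encoding is used here as one plausible way of "setting the Encoding", and the names BomRoundTrip, WriteAndDecode and TryLoad are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomRoundTrip
{
    // Writes a tiny document to a MemoryStream using the given encoding,
    // then decodes the raw bytes with Encoding.UTF8.GetString, as in the post.
    public static string WriteAndDecode(Encoding encoding)
    {
        var settings = new XmlWriterSettings { Encoding = encoding };
        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                writer.WriteStartElement("root");
                writer.WriteString("hello");
                writer.WriteEndElement();
            } // disposing the writer flushes its output into the stream
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    // Returns true if XmlDocument.LoadXml accepts the string.
    public static bool TryLoad(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        string withBom = WriteAndDecode(Encoding.UTF8);          // preamble emitted
        string withoutBom = WriteAndDecode(new UTF8Encoding());  // no preamble

        Console.WriteLine("Starts with U+FEFF: " + (withBom[0] == '\uFEFF'));
        Console.WriteLine("Loads with BOM:     " + TryLoad(withBom));
        Console.WriteLine("Loads without BOM:  " + TryLoad(withoutBom));
    }
}
```

If the behavior matches the post, only the string produced with new UTF8Encoding() should load; the other one carries a leading U+FEFF character left behind by the decoded preamble.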
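XmlDocument clearly cannot handle these leading bytes, so it throws. More precisely, GetBytes and GetString never add or remove the preamble themselves: the writer emits it by calling GetPreamble(), and GetString then decodes the three bytes into a single U+FEFF character at the start of the string, which LoadXml does not accept.
Personally, I think it is better not to add the prefix inside a program or in memory; just keep the data bare.
When saving to a file, where internationalization concerns such as byte-order detection come into play, the prefix can be added, but it is best to strip it when the data is read back into memory.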
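One way to drop the prefix when reading data back into memory is to decode through a StreamReader, which detects and consumes a leading byte order mark. The sketch below is an illustration only; the class name BomStripping and the method DecodeSkippingBom are hypothetical, and the sample bytes are built by hand to stand in for a file that was saved with a BOM.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BomStripping
{
    // Decodes bytes through a StreamReader with BOM detection enabled,
    // so a leading preamble is consumed instead of ending up in the string.
    public static string DecodeSkippingBom(byte[] bytes)
    {
        using (var reader = new StreamReader(new MemoryStream(bytes), Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Simulate file content saved with a UTF-8 BOM: preamble + encoded XML.
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes("<root>hello</root>");
        byte[] bytes = new byte[preamble.Length + body.Length];
        preamble.CopyTo(bytes, 0);
        body.CopyTo(bytes, preamble.Length);

        string xml = DecodeSkippingBom(bytes);

        var doc = new XmlDocument();
        doc.LoadXml(xml); // the string no longer starts with U+FEFF
        Console.WriteLine(doc.DocumentElement.InnerText);
    }
}
```

When the bytes are already in a stream, passing that stream to XmlDocument.Load is another option worth considering, since the XML reader works out the encoding from the byte order mark itself.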