Java 8 UTF-16 isn't default charset but UTF-8

Java and Unicode Encoding

License: CC 4.0 BY-SA

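This article explores how Java uses the UTF-16 encoding to handle Unicode characters, and discusses converting between different encodings in practice and the problems that can arise.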

Let's back up a bit…

Java's text datatypes use the UTF-16 character encoding of the Unicode character set. (As do VB4/5/6/A/Script, JavaScript, .NET, ….) You can see this in the various operations of the string API: indexing, length, ….
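To make the UTF-16 point concrete, here is a minimal sketch showing that length and indexing count UTF-16 code units rather than characters. The class and method names are made up for illustration.

```java
class Utf16CodeUnits {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the same string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00C8";           // È, a single code unit
        String astral = "\uD83D\uDE00";  // U+1F600, stored as a surrogate pair

        System.out.println(codeUnits(bmp) + " code unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(astral) + " code unit(s), " + codePoints(astral) + " code point(s)");

        // charAt indexes code units, so index 0 is only the high surrogate.
        System.out.println(Character.isHighSurrogate(astral.charAt(0)));
    }
}
```

For text outside the Basic Multilingual Plane, the two counts differ, which is why code that treats a char as a character can go wrong.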

Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "Extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.
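As a small illustration of naming the encoding explicitly, the sketch below converts the same string to bytes with two named charsets and decodes it back. The helper names are hypothetical; the charset constants come from java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NamedEncodings {
    // Encode text to bytes using an explicitly named charset.
    static byte[] encode(String text, Charset cs) {
        return text.getBytes(cs);
    }

    // Decode bytes to text using an explicitly named charset.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String text = "\u00C8"; // È

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);        // two bytes
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1); // one byte
        System.out.println(utf8.length + " byte(s) in UTF-8, " + latin1.length + " byte(s) in ISO-8859-1");

        // Decoding with the charset used for encoding round-trips the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));

        // Decoding with a different charset silently produces different text.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```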

Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program understands that input from the user is in that character encoding, or that output should be. This century, users dealing in text files prefer to use a specific encoding and communicate the files unchanged across systems. They don't appreciate lossy conversions and therefore don't have any use for this concept. From a program's point of view, it is never what you want unless it is exactly what you want.
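A quick way to see what Java picked up is to ask for the default charset and compare it with an explicit one. This is only a sketch with a hypothetical helper name. The result depends on the JVM and the system settings; from Java 18 onward the default is UTF-8 unless configured otherwise.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DefaultCharsetCheck {
    // True when encoding with the platform default gives the same bytes
    // as encoding with the explicitly named charset.
    static boolean matchesDefault(String text, Charset explicit) {
        return Arrays.equals(text.getBytes(), text.getBytes(explicit));
    }

    public static void main(String[] args) {
        // Whatever the JVM picked up from the environment at startup.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // The answer varies by machine, which is the reason to name the charset.
        System.out.println("Same as UTF-8 for this text: "
                + matchesDefault("\u00C8", StandardCharsets.UTF_8));
    }
}
```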

Where a conversion would be lossy, you have the choice of a replacement character (such as '?'), omitting it, or throwing an exception.
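The three choices map onto java.nio.charset.CodingErrorAction when a CharsetEncoder is used directly. The sketch below is one way to wire that up; the class name, the method name, and the decision to rethrow as an unchecked exception are assumptions made for illustration.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LossyConversion {
    // Encode to US-ASCII, handling characters that cannot be encoded
    // according to the given action: REPLACE, IGNORE, or REPORT.
    static byte[] toAscii(String text, CodingErrorAction action) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(action)
                .onMalformedInput(action);
        try {
            ByteBuffer out = encoder.encode(CharBuffer.wrap(text));
            byte[] bytes = new byte[out.remaining()];
            out.get(bytes);
            return bytes;
        } catch (CharacterCodingException e) {
            // Only reachable with REPORT; rethrown unchecked for this sketch.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00C8 bien"; // È is not in ASCII

        System.out.println(new String(toAscii(text, CodingErrorAction.REPLACE), StandardCharsets.US_ASCII));
        System.out.println(new String(toAscii(text, CodingErrorAction.IGNORE), StandardCharsets.US_ASCII));
        try {
            toAscii(text, CodingErrorAction.REPORT);
        } catch (UncheckedIOException e) {
            System.out.println("Rejected: " + e.getCause());
        }
    }
}
```

Note that String.getBytes(Charset) always replaces silently, so the encoder is needed when an exception is the preferred outcome.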

A character encoding is a map between a codepoint (integer) of a character set and one or more code units, according to the definition of the encoding. A code unit has a fixed size, and the number of code units needed for a codepoint might vary by codepoint.
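The varying number of code units per codepoint can be observed directly. In this sketch (helper names are hypothetical), a UTF-16 code unit is 16 bits and a UTF-8 code unit is 8 bits, so the UTF-8 count equals the encoded byte length.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitCounts {
    // UTF-16 code units needed for one codepoint: 1 or 2.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // UTF-8 code units (bytes) needed for one codepoint: 1 to 4.
    static int utf8Units(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xC8, 0x20AC, 0x1F600}; // A, È, €, an emoji
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-16 unit(s), %d UTF-8 unit(s)%n",
                    cp, utf16Units(cp), utf8Units(cp));
        }
    }
}
```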

In libraries, it is not generally useful to have an array of code units, so they take the further step of converting to/from an array of bytes. Java byte values do range from -128 to 127; however, that's the Java interpretation as two's complement 8-bit integers. As the bytes are understood to be encoding text, the values would be interpreted according to the rules of the character encoding.
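The signed view of bytes is only a matter of presentation. The sketch below (hypothetical helper names) prints the same encoded bytes both as Java's signed values and as the unsigned hexadecimal values that encoding tables use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class UnsignedBytes {
    // Reinterpret a signed Java byte as its unsigned value, 0 to 255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Render bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", unsigned(b)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = "\u00C8".getBytes(StandardCharsets.UTF_8); // È

        System.out.println(Arrays.toString(bytes)); // signed view: negative numbers
        System.out.println(hex(bytes));             // unsigned view of the same bits
    }
}
```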

Because some Unicode encodings have code units more than one byte long, byte order becomes important. So, at the byte array level, there is UTF-16 Big Endian and UTF-16 Little Endian. When communicating a text file or stream, you would send the bytes as well as share knowledge of the encoding. This "metadata" is required for understanding: UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata at the beginning of the file or stream to indicate the byte order. It is called the byte-order mark (BOM). So, the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as with UTF-8. So, if the understanding is that the bytes are text encoded with any Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relevant.
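The byte order and BOM behavior can be seen with Java's built-in charsets: encoding with UTF-16 emits a big-endian BOM, while UTF-16BE and UTF-16LE emit none. The detectBom helper below is a hypothetical, simplified sketch; it ignores UTF-32, whose little-endian BOM starts with the same two bytes as the UTF-16LE one.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // Return the encoding implied by a leading BOM, or null if none is found.
    // Simplified: UTF-32 BOMs are not handled here.
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "A";

        byte[] withBom = text.getBytes(StandardCharsets.UTF_16);     // BOM plus big-endian data
        byte[] bigEndian = text.getBytes(StandardCharsets.UTF_16BE); // no BOM
        byte[] littleEndian = text.getBytes(StandardCharsets.UTF_16LE); // no BOM

        System.out.println(withBom.length + " bytes, BOM says " + detectBom(withBom));
        System.out.println(bigEndian.length + " bytes, BOM says " + detectBom(bigEndian));
        System.out.println(littleEndian.length + " bytes, BOM says " + detectBom(littleEndian));
    }
}
```

Without a BOM, the reader has to know the byte order from the external metadata, which is what the UTF-16BE and UTF-16LE names convey.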

1) You are seeing the BOM in some of your Unicode encoding outputs.

2) È is not in the ASCII character set. What would you want to happen in this case? I often prefer an exception.

3) The system you were using, for your account, at the time of your tests, may have had UTF-8 as the default character encoding. Is that important to the way you want to encode, and have encoded, your text files on that system?