Unicode in Java

This article briefly covers some points about Unicode in Java. It is a personal study summary, so corrections and suggestions are welcome.

Terminology

UTF-8

UTF-8 is a variable-width character encoding that can encode every valid Unicode code point using one to four code units. Each code unit is 1 byte.

UTF-16

UTF-16 is a variable-width character encoding that can encode every valid Unicode code point using one or two code units. Each code unit is 2 bytes.

UTF-32

UTF-32 is a fixed-width character encoding that encodes every valid Unicode code point using exactly one code unit. Each code unit is 4 bytes.
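To make the code unit counts concrete, here is a minimal sketch that encodes three sample characters and counts the resulting bytes. The class name EncodingWidths and the helper byteLength are made up for this illustration. UTF-16BE is used rather than plain UTF-16 so that no byte order mark is added to the output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original article.
class EncodingWidths {

    // Number of bytes needed to encode the text in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u0041",       // U+0041, Latin capital letter A
            "\u4E2D",       // U+4E2D, a Chinese character
            "\uD835\uDFD8"  // U+1D7D8, written as a surrogate pair
        };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

In UTF-32 each of the three samples would take 4 bytes, since every code point is one 4-byte code unit.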

character

A character is the smallest unit of text that has semantic value, e.g. the English letter “A” or the Chinese character “中”.

character set

A character set is a collection of characters that may be used by one or more languages.

coded character set

A coded character set is a character set in which each character is assigned a unique number, called its code point.

code point

A code point is a value that can be used in a coded character set.

In Java, a code point is stored in a 32-bit int: the lower 21 bits hold the code point value and the upper 11 bits are 0.

A Unicode character is written as “U+” followed by its code point in hexadecimal, e.g. U+0041 is the English capital letter “A” and U+1D7D8 is the mathematical double-struck digit zero “𝟘”.

Code points range from U+0000 to U+10FFFF.
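As a small illustration of code points held in int variables, the sketch below formats code points in the “U+” notation and checks them against the valid range. The class CodePointDemo and the helper toUPlus are hypothetical names used only here.

```java
// Hypothetical demo class, not part of the original article.
class CodePointDemo {

    // Formats a code point in the usual "U+XXXX" notation (at least 4 hex digits).
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int letterA = 0x0041;
        int doubleStruckZero = 0x1D7D8;

        System.out.println(toUPlus(letterA));          // U+0041
        System.out.println(toUPlus(doubleStruckZero)); // U+1D7D8

        // Valid code points lie between 0 and 0x10FFFF inclusive.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false

        // Every valid code point fits in 21 bits, so the upper 11 bits are zero.
        System.out.println((doubleStruckZero >>> 21) == 0); // true
    }
}
```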

code unit

A code unit is the fixed-size sequence of bits that an encoding uses as its basic unit, e.g. 8 bits in UTF-8 and 16 bits in UTF-16.

code plane

A code plane is a group of 65536 code points.

There are 17 code planes in Unicode.

The first one is called the Basic Multilingual Plane, a.k.a. BMP, which contains the code points from U+0000 to U+FFFF.

The other 16 are the supplementary planes. Their characters are called supplementary characters and have code points from U+10000 to U+10FFFF.
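Because each plane holds 65536 (0x10000) code points, the plane index of a code point is simply its value shifted right by 16 bits. The sketch below assumes a hypothetical helper planeOf; the two Character methods it is compared with are part of the standard API.

```java
// Hypothetical demo class, not part of the original article.
class PlaneDemo {

    // Plane index (0 to 16) of a valid code point.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("invalid code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(planeOf(0x0041));   // 0, the BMP
        System.out.println(planeOf(0x1D7D8));  // 1
        System.out.println(planeOf(0x10FFFF)); // 16, the last plane

        // The JDK offers equivalent range checks.
        System.out.println(Character.isBmpCodePoint(0x0041));            // true
        System.out.println(Character.isSupplementaryCodePoint(0x1D7D8)); // true
    }
}
```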

basic multilingual plane

See code plane.

supplementary character

See code plane.

surrogate

In Java, text held in char and String values is encoded as UTF-16.

A char is 16 bits wide, so it has only 65536 possible values and can represent only characters in the BMP.

To represent a supplementary character, two consecutive char values are used.
The first one is called the high surrogate. It comes from a range of 1024 BMP code points reserved for this purpose, from U+D800 to U+DBFF, i.e. 55296 to 56319 in decimal.
The second one is called the low surrogate. It comes from another range of 1024 reserved BMP code points, from U+DC00 to U+DFFF, i.e. 56320 to 57343 in decimal.

So the BMP code points are divided into four ranges:

U+0000 to U+D7FF    BMP characters
U+D800 to U+DBFF    high surrogates
U+DC00 to U+DFFF    low surrogates
U+E000 to U+FFFF    BMP characters

E.g. U+1D7D8 is represented as \uD835\uDFD8.
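The following sketch shows what this looks like from inside a Java String: the single character U+1D7D8 occupies two char values, one from each surrogate range. The class name SurrogatePairDemo and the constant ZERO are hypothetical; the methods called are standard String and Character APIs.

```java
// Hypothetical demo class, not part of the original article.
class SurrogatePairDemo {

    // U+1D7D8 written as its high and low surrogate.
    static final String ZERO = "\uD835\uDFD8";

    public static void main(String[] args) {
        System.out.println(ZERO.length());                          // 2 chars
        System.out.println(ZERO.codePointCount(0, ZERO.length()));  // 1 code point

        System.out.println(Character.isHighSurrogate(ZERO.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(ZERO.charAt(1)));  // true

        // codePointAt combines the pair into the supplementary code point.
        System.out.println(Integer.toHexString(ZERO.codePointAt(0)));  // 1d7d8
    }
}
```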

How to convert between supplementary character and surrogates

java.lang.Character already provides conversion methods. The algorithm here is for reference only.

“cp” means the code point of a supplementary character, “hs” means high surrogate, “ls” means low surrogate.

get surrogates from code point:

hs = ((cp - 0x10000) >> 10) + 0xD800

ls = ((cp - 0x10000) & 0x03FF) + 0xDC00

The inner parentheses are required: in Java, + binds tighter than >> and &.

E.g.

hs = ((0x1D7D8 - 0x10000) >> 10) + 0xD800 = 0xD835

ls = ((0x1D7D8 - 0x10000) & 0x03FF) + 0xDC00 = 0xDFD8

get code point from surrogates:

cp = (((hs - 0xD800) << 10) | (ls - 0xDC00)) + 0x10000

E.g.

cp = (((0xD835 - 0xD800) << 10) | (0xDFD8 - 0xDC00)) + 0x10000 = 0x1D7D8
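The formulas can be written out in Java and cross-checked against the methods that java.lang.Character provides. This is only a sketch: the class SurrogateMath and its method names are made up for the comparison, and the hand-written methods assume their input is already a valid supplementary code point or a valid surrogate pair.

```java
// Hypothetical demo class, not part of the original article.
class SurrogateMath {

    // High surrogate of a supplementary code point (input is not validated).
    static char highSurrogate(int cp) {
        return (char) (((cp - 0x10000) >> 10) + 0xD800);
    }

    // Low surrogate of a supplementary code point (input is not validated).
    static char lowSurrogate(int cp) {
        return (char) (((cp - 0x10000) & 0x03FF) + 0xDC00);
    }

    // Code point of a surrogate pair (input is not validated).
    static int toCodePoint(char hs, char ls) {
        return (((hs - 0xD800) << 10) | (ls - 0xDC00)) + 0x10000;
    }

    public static void main(String[] args) {
        int cp = 0x1D7D8;
        char hs = highSurrogate(cp);
        char ls = lowSurrogate(cp);
        System.out.printf("hs = 0x%04X, ls = 0x%04X%n", (int) hs, (int) ls);
        System.out.printf("cp = 0x%X%n", toCodePoint(hs, ls));

        // The built-in methods give the same results.
        System.out.println(hs == Character.highSurrogate(cp));
        System.out.println(ls == Character.lowSurrogate(cp));
        System.out.println(toCodePoint(hs, ls) == Character.toCodePoint(hs, ls));
    }
}
```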

How to represent supplementary character in Java

A supplementary character is represented by two surrogate code units, and each of them is stored in one Java char.

java.lang.Character is the wrapper class of the char data type. Each java.lang.Character instance wraps one char.

In other words, a single java.lang.Character instance can’t represent a supplementary character.

But java.lang.Character provides many static methods that handle supplementary characters.

These methods are divided into two categories:

  1. methods that take an int code point, and therefore support supplementary characters.
  2. methods that take a char, and therefore support only BMP characters.

In most cases, the same functionality is implemented by two methods: one supports supplementary characters, and one doesn’t.

E.g.

public static int toUpperCase(int codePoint);

public static char toUpperCase(char ch);
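A short sketch of how the two overloads behave differently for a supplementary character. The sample character U+10428 (a lowercase Deseret letter whose uppercase form is U+10400) is chosen for this illustration and does not appear in the original text; the class name OverloadDemo is also hypothetical.

```java
// Hypothetical demo class, not part of the original article.
class OverloadDemo {

    public static void main(String[] args) {
        int small = 0x10428; // a lowercase Deseret letter in plane 1

        // int overload: sees the whole code point and maps it to upper case.
        int upper = Character.toUpperCase(small);
        System.out.printf("U+%X -> U+%X%n", small, upper);

        // char overload: sees one surrogate at a time, so nothing changes.
        char[] pair = Character.toChars(small);
        System.out.println(Character.toUpperCase(pair[0]) == pair[0]);
        System.out.println(Character.toUpperCase(pair[1]) == pair[1]);

        // The same split applies to classification methods such as isDigit.
        System.out.println(Character.isDigit(0x1D7D8));  // int overload
        System.out.println(Character.isDigit('\uD835')); // char overload on a lone surrogate
    }
}
```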

In Java source, everything except comments, identifiers, and the contents of char and String literals is formed only from ASCII characters.

Even where non-ASCII characters are allowed, it’s better to write supplementary characters as ASCII escapes, i.e. \uXXXX\uXXXX, because whether a character is displayed correctly depends on whether a suitable font is installed on the system. A character that displays correctly on your system may show up as garbage on someone else’s, while ASCII characters are displayed correctly practically everywhere.

Unicode escape

As discussed above, a Unicode character is stored as one or two chars, and each char can be written in source code as \uXXXX, where XXXX is four hexadecimal digits. This format is called a Unicode escape. With it, every Unicode character can be written using only ASCII characters.
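As an illustration of writing any string with ASCII characters only, the sketch below turns each non-ASCII char into its Unicode escape. The class UnicodeEscaper and its escape method are hypothetical helpers written for this example, not an existing JDK API.

```java
// Hypothetical demo class, not part of the original article.
class UnicodeEscaper {

    // Keeps ASCII chars as they are and replaces every other char
    // (including each surrogate of a pair) with its backslash-u form.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A BMP character needs one escape, a supplementary character needs two.
        System.out.println(escape("A\u4E2D"));
        System.out.println(escape("\uD835\uDFD8"));
    }
}
```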

Translating Java source code into tokens for the syntactic grammar takes three steps:

  1. A translation of Unicode escapes in the raw stream of Unicode characters to the corresponding Unicode character.
  2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators.
  3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of input elements which, after white space and comments are discarded, comprise the tokens that are the terminal symbols of the syntactic grammar.
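One visible consequence of step 1 running first is that a Unicode escape works anywhere in the source, not only inside literals. The sketch below uses an escape inside an identifier; the class and method names are hypothetical, and this style is shown only to illustrate the translation order, not as a recommendation.

```java
// Hypothetical demo class, not part of the original article.
class EscapeTranslationDemo {

    static int demo() {
        // The escape on the next line is translated to the letter 'a' in step 1,
        // so the declaration introduces an ordinary variable named abc.
        int \u0061bc = 42;
        return abc;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 42
    }
}
```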

So some Unicode characters can’t safely be written as \uXXXX, because the escape is translated before the compiler looks at the structure of the code, and the result may not be what you expect.

E.g. suppose you want a String that contains a line break, so that its text spans two lines:

String str = "line 1\u000Dline 2";

\u000D means carriage return, a.k.a. CR.

But in translation step 1, the code is translated to:

String str = "line 1
line 2";

In step 2, the CR is recognized as a line terminator, so the literal is split across two source lines.

In step 3, compilation fails with an error such as: String literal is not properly closed by a double-quote.

How to solve this?

Use an escape sequence instead of a Unicode escape. Escape sequences are processed only inside char and String literals, after the three steps above:

\b    backspace (BS)          \u0008
\t    horizontal tab (HT)     \u0009
\n    linefeed (LF)           \u000a
\f    form feed (FF)          \u000c
\r    carriage return (CR)    \u000d
\"    double quote            \u0022
\'    single quote            \u0027
\\    backslash               \u005c

\0 to \377 are octal escapes for the values \u0000 to \u00ff. They exist for compatibility with C and are not recommended in Java.

Your case should be like this:

String str = "line 1\nline 2";
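The sketch below checks that escape sequences produce the expected character values without disturbing the source line structure. The class name EscapeSequenceDemo and the constant TWO_LINES are hypothetical names for this illustration.

```java
// Hypothetical demo class, not part of the original article.
class EscapeSequenceDemo {

    // The escape sequence is resolved inside the literal, so the source line stays intact.
    static final String TWO_LINES = "line 1\nline 2";

    public static void main(String[] args) {
        System.out.println(TWO_LINES.split("\n").length); // 2

        // Each escape sequence stands for a single char value.
        System.out.println((int) '\b');   // 8
        System.out.println((int) '\t');   // 9
        System.out.println((int) '\n');   // 10
        System.out.println((int) '\r');   // 13
        System.out.println((int) '\\');   // 92

        // Largest octal escape: octal 377 is 255, i.e. 0xFF.
        System.out.println((int) '\377'); // 255
    }
}
```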

What is the difference between \uXXXX and \uuXXXX

Java allows one or more u characters in a Unicode escape, and the number of u characters makes no difference to the compiled program. The difference matters only when source code is converted between Unicode and ASCII-only form.

Java source can mix Unicode characters and Unicode escapes. The Java Language Specification describes a standard way to transform such a source file into pure ASCII, so that ASCII-based tools can process it, and to transform it back later. After such a round trip, how do we know whether a given place originally held a Unicode character or a Unicode escape?

To solve this problem, when converting to ASCII, each non-ASCII character is converted to \uXXXX with a single u, and each existing Unicode escape \uXXXX is converted to \uuXXXX, i.e. one more u is added. When converting back, an escape with a single u is converted to the Unicode character directly; otherwise one u is removed and the rest is kept as a Unicode escape.
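A minimal check that the number of u characters does not change the value the compiler sees. The class and constant names are hypothetical.

```java
// Hypothetical demo class, not part of the original article.
class MultipleUDemo {

    static final String ONE_U = "\u0041";
    static final String TWO_U = "\uu0041";
    static final String THREE_U = "\uuu0041";

    public static void main(String[] args) {
        // All three literals denote the one-character string "A".
        System.out.println(ONE_U.equals(TWO_U));   // true
        System.out.println(ONE_U.equals(THREE_U)); // true
        System.out.println(TWO_U);                 // A
    }
}
```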

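Constants in java.lang.Character

There are many constants defined in the java.lang.Character class. They fall roughly into three groups:

  • character types (not covered in this article)
  • character directionality (not covered in this article)
  • boundary constants

SIZE and BYTES give the number of bits and the number of bytes used by a char.

MIN_RADIX and MAX_RADIX define the bounds of the radix. The radix is used to check whether a character is a valid digit in a given base, and to convert between a character and its numeric value. E.g. 0 to 9 are valid digits in decimal, and A to F are also valid digits in hexadecimal.

MIN_VALUE and MAX_VALUE define the bounds of a char value, i.e. the code point range of the BMP.

MIN_CODE_POINT and MAX_CODE_POINT define the bounds of Unicode code points.

MIN_SUPPLEMENTARY_CODE_POINT defines the smallest code point of a supplementary character.

MIN_HIGH_SURROGATE, MAX_HIGH_SURROGATE, MIN_LOW_SURROGATE, MAX_LOW_SURROGATE, MIN_SURROGATE and MAX_SURROGATE define the bounds of the surrogates.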
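The values of these constants can be printed directly. The sketch below also shows the radix in use through Character.digit; the class CharacterConstantsDemo and the helper isValidDigit are hypothetical names for this illustration.

```java
// Hypothetical demo class, not part of the original article.
class CharacterConstantsDemo {

    // True if c is a valid digit in the given radix (Character.digit returns -1 otherwise).
    static boolean isValidDigit(char c, int radix) {
        return Character.digit(c, radix) != -1;
    }

    public static void main(String[] args) {
        System.out.printf("SIZE = %d bits, BYTES = %d%n", Character.SIZE, Character.BYTES);
        System.out.printf("radix: %d to %d%n", Character.MIN_RADIX, Character.MAX_RADIX);
        System.out.printf("char: U+%04X to U+%04X%n",
                (int) Character.MIN_VALUE, (int) Character.MAX_VALUE);
        System.out.printf("code point: U+%04X to U+%04X%n",
                Character.MIN_CODE_POINT, Character.MAX_CODE_POINT);
        System.out.printf("first supplementary: U+%04X%n",
                Character.MIN_SUPPLEMENTARY_CODE_POINT);
        System.out.printf("high surrogates: U+%04X to U+%04X%n",
                (int) Character.MIN_HIGH_SURROGATE, (int) Character.MAX_HIGH_SURROGATE);
        System.out.printf("low surrogates: U+%04X to U+%04X%n",
                (int) Character.MIN_LOW_SURROGATE, (int) Character.MAX_LOW_SURROGATE);

        System.out.println(isValidDigit('F', 16)); // true
        System.out.println(isValidDigit('F', 10)); // false
    }
}
```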

Frequently used methods

get number of chars of a code point

Given a code point, compute how many chars it takes.

You need to validate the given code point yourself.

public class Test {

    private static void testCharCount(int codePoint) {
        // Validate the code point before invoking Character.charCount.
        // charCount does not validate its argument, e.g. it returns 1 for -100.
        if (Character.isValidCodePoint(codePoint)) {
            System.out.println(Character.charCount(codePoint));
        } else {
            System.out.println("invalid code point");
        }
    }

    public static void main(String[] args) {
        int bmpCodePoint = 0x0041; // A
        int supplementaryCodePoint = 0x1D7D8; // mathematical double-struck digit zero
        int invalidCodePoint = -100;

        // 1
        testCharCount(bmpCodePoint);
        // 2
        testCharCount(supplementaryCodePoint);
        // invalid code point
        testCharCount(invalidCodePoint);
    }
}

get code point from surrogates

Convert a surrogate pair to a code point.

You need to validate the given surrogates yourself.

public class Test {

    private static void testToCodePoint1(char highSurrogate, char lowSurrogate) {
        // Character.toCodePoint does not validate the surrogates,
        // so check them first. Otherwise the output is meaningless.
        if (Character.isSurrogatePair(highSurrogate, lowSurrogate)) {
            System.out.println(Character.toCodePoint(highSurrogate, lowSurrogate));
        } else {
            System.out.println("they are not in pair");
        }
    }

    // This is the same as above
    private static void testToCodePoint2(char highSurrogate, char lowSurrogate) {
        if (Character.isHighSurrogate(highSurrogate) && Character.isLowSurrogate(lowSurrogate)) {
            System.out.println(Character.toCodePoint(highSurrogate, lowSurrogate));
        } else {
            System.out.println("they are not in pair");
        }
    }

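    public static void main(String[] args) {
        // Sample input is an assumption: the surrogate pair of U+1D7D8 used earlier.
        // 120792, i.e. 0x1D7D8
        testToCodePoint1('\uD835', '\uDFD8');
        // they are not in pair
        testToCodePoint2('\uDFD8', '\uD835');
    }
}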