Understanding the char Type in Java

Excerpted from Core Java, 11th edition

To fully understand the char type, you have to know about the Unicode encoding scheme. Unicode was invented to overcome the limitations of traditional character encoding schemes. Before Unicode, there were many different standards: ASCII in the United States, ISO 8859-1 for Western European languages, KOI-8 for Russian, GB18030 and BIG-5 for Chinese, and so on. This caused two problems. First, a particular code value corresponds to different letters in the different encoding schemes. Second, the encodings for languages with large character sets have variable length: Some common characters are encoded as single bytes, others require two or more bytes.

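Unicode was created to solve two problems that arose when each language had its own character encoding. First, the same code value could map to different characters in different encodings. Second, the encodings for large character sets are variable-length: a character may take one byte, or two or more bytes.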

Unicode was designed to solve these problems. When the unification effort started in the 1980s, a fixed 2-byte code was more than sufficient to encode all characters used in all languages in the world, with room to spare for future expansion—or so everyone thought at the time. In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance over other programming languages that used 8-bit characters.

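In the 1980s, a fixed 2-byte code (65,536 values) was believed to be enough to represent the characters of every language in the world, with room left over for expansion. The characters in Unicode 1.0 took up less than half of that code space. Java was designed around 16-bit Unicode characters, which was an advantage over languages that used 8-bit characters.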

Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily due to the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Now, the 16-bit char type is insufficient to describe all Unicode characters.

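Unicode outgrew the 16-bit code space after all, mainly because of the very large set of ideographs used for Chinese, Japanese, and Korean. A 16-bit char can no longer describe every Unicode character.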

We need a bit of terminology to explain how this problem is resolved in Java, beginning with Java 5. A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the Latin letter A. Unicode has code points that are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the “classic” Unicode characters with code points U+0000 to U+FFFF. Sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters.

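Some terminology is needed to explain how Java has handled this since Java 5. A code point is a code value associated with a character in an encoding scheme. In the Unicode standard, code points are written in hexadecimal with a U+ prefix (for example, U+0041). Unicode groups its code points into 17 code planes. The first, the basic multilingual plane, holds the "classic" Unicode characters (U+0000 to U+FFFF). The other 16 planes (U+10000 to U+10FFFF) hold the supplementary characters.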
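To make the plane layout concrete, here is a minimal sketch. The class name CodePointDemo and the helpers planeOf and format are made up for illustration; only the java.lang.Character and String.format calls are standard library APIs. It assumes a plane index can be read off by dividing the code point by 0x10000, since each plane holds 65,536 code points.

```java
// Illustrative sketch only; CodePointDemo and its methods are hypothetical names.
class CodePointDemo {

    // Each plane holds 0x10000 (65,536) code points, so the plane index
    // is the code point shifted right by 16 bits.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Writes a code point in U+ notation with at least four hex digits.
    static String format(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF };
        for (int cp : samples) {
            System.out.println(format(cp)
                    + " plane=" + planeOf(cp)
                    + " supplementary=" + Character.isSupplementaryCodePoint(cp));
        }
    }
}
```

Code points are held in an int here rather than a char, because a char cannot hold values above U+FFFF.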

The UTF-16 encoding represents all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values, called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into a range of 2048 unused values of the basic multilingual plane, called the surrogates area (U+D800 to U+DBFF for the first code unit, U+DC00 to U+DFFF for the second code unit). This is rather clever, because you can immediately tell whether a code unit encodes a single character or it is the first or second part of a supplementary character.

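Each character in the basic multilingual plane is represented by a single 16-bit value, called a code unit. Supplementary characters are encoded as pairs of consecutive code units. Each value in such a pair falls within a range of 2048 otherwise unused values of the basic multilingual plane, called the surrogates area: U+D800 to U+DBFF for the first code unit and U+DC00 to U+DFFF for the second.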
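The arithmetic behind a surrogate pair can be sketched as follows. SurrogateDemo, encode, and decode are hypothetical names used only for this example; in real code the standard Character.toChars and Character.toCodePoint methods do the same job. The sketch assumes the usual UTF-16 rule: subtract 0x10000 from the code point, then put the upper 10 bits of the result into the first code unit and the lower 10 bits into the second.

```java
// Illustrative sketch only; SurrogateDemo and its methods are hypothetical names.
class SurrogateDemo {

    // Splits a supplementary code point into its two UTF-16 code units.
    static char[] encode(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;                 // a 20-bit value
        char first = (char) (0xD800 + (offset >>> 10));   // upper 10 bits
        char second = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] { first, second };
    }

    // Recombines a surrogate pair into the original code point.
    static int decode(char first, char second) {
        if (!Character.isHighSurrogate(first) || !Character.isLowSurrogate(second)) {
            throw new IllegalArgumentException("Not a valid surrogate pair");
        }
        return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
    }

    public static void main(String[] args) {
        char[] units = encode(0x1F600);
        System.out.printf("U+1F600 -> %04X %04X%n", (int) units[0], (int) units[1]);
        System.out.printf("decoded -> U+%04X%n", decode(units[0], units[1]));
    }
}
```

Because the two surrogate ranges do not overlap with each other or with ordinary characters, Character.isHighSurrogate and Character.isLowSurrogate can classify any single code unit without looking at its neighbors.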

In Java, the char type describes a code unit in the UTF-16 encoding. Our strong recommendation is not to use the char type in your programs unless you are actually manipulating UTF-16 code units. You are almost always better off treating strings as abstract data types.

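In Java, the char type describes a single code unit, so avoid char unless you really need to manipulate UTF-16 code units. In most cases it is better to work with String.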
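A short sketch of why this advice matters: once a string contains a supplementary character, counting or indexing by char no longer matches counting or indexing by character. StringCodePointsDemo and its helper methods are hypothetical names; String.length, String.codePointCount, and String.codePoints (available since Java 8) are standard methods.

```java
// Illustrative sketch only; StringCodePointsDemo and its methods are hypothetical names.
class StringCodePointsDemo {

    // Number of Unicode code points, as opposed to UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // All code points of the string, with surrogate pairs already combined.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // 'a', then U+1F600 written as its surrogate pair, then 'b'.
        String s = "a\uD83D\uDE00b";

        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + countCodePoints(s));

        // charAt(1) yields only the first half of the pair, not a full character.
        System.out.println("charAt(1) is a surrogate: " + Character.isSurrogate(s.charAt(1)));

        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Treating the string as a sequence of int code points avoids splitting a surrogate pair in the middle, which is the kind of mistake that char-based loops tend to make.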
