To understand the char type, you have to know about the Unicode encoding scheme. Unicode was invented to overcome the limitations of traditional character encoding schemes. Before Unicode, there were many different standards: ASCII
in the United States, ISO 8859-1 for Western European languages, KOI-8 for Russian, GB18030 and BIG-5 for Chinese, and so on. This caused two problems. First, a particular code value corresponds to different letters in the various encoding schemes. Second, the encodings for languages with large character sets have variable length: some common characters are encoded as single bytes, others require two or more bytes.
Unicode was designed to solve these problems. When the unification effort started in the 1980s, a fixed-width 2-byte code was more than sufficient to encode all characters used in all languages in the world, with room to spare for
future expansion—or so everyone thought at the time. In 1991, Unicode 1.0 was released, using slightly less than half of the available 65,536 code values. Java was designed from the ground up to use 16-bit Unicode characters, which was a major advance over
other programming languages that used 8-bit characters.
Unfortunately, over time, the inevitable happened. Unicode grew beyond 65,536 characters, primarily due to the addition of a very large set of ideographs used for Chinese, Japanese, and Korean. Now, the 16-bit char type is insufficient
to describe all Unicode characters.
We need a bit of terminology to explain how this problem is resolved in Java, beginning with JDK 5.0. A code point is a code value that is associated with a character in an encoding scheme. In the Unicode standard, code points are
written in hexadecimal and prefixed with U+, such as U+0041 for the code point of the letter A. Unicode has code points that are grouped into 17 code planes. The first code plane, called the basic multilingual plane, consists of the "classic" Unicode characters
with code points U+0000 to U+FFFF. Sixteen additional planes, with code points U+10000 to U+10FFFF, hold the supplementary characters.
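These distinctions are visible in Java's Character class. The following sketch (the class name is just a placeholder for illustration) checks the two kinds of code points with methods that have been available since JDK 5.0:

public class CodePointDemo {
    public static void main(String[] args) {
        int letterA = 0x0041;        // U+0041, a "classic" character in the basic multilingual plane
        int supplementary = 0x1D56B; // a code point in one of the sixteen additional planes

        // A character in the basic multilingual plane fits into a single 16-bit code unit
        System.out.println(Character.isSupplementaryCodePoint(letterA)); // false
        System.out.println(Character.charCount(letterA));                // 1

        // A supplementary character (U+10000 to U+10FFFF) needs two code units
        System.out.println(Character.isSupplementaryCodePoint(supplementary)); // true
        System.out.println(Character.charCount(supplementary));                // 2
    }
}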
The UTF-16 encoding is a method of representing all Unicode code points in a variable-length code. The characters in the basic multilingual plane are represented as 16-bit values, called code units. The supplementary characters are encoded as consecutive pairs of code units. Each of the values in such an encoding pair falls into an unused range of 2,048 values of the basic multilingual plane, called the surrogates area (U+D800 to U+DBFF for the first code unit, U+DC00 to U+DFFF for the second code unit). This is rather clever, because you can immediately tell whether a code unit encodes a single character or whether it is the first or second part of a supplementary character. For example, the mathematical symbol for the set of integers has code point U+1D56B and is encoded by the two code units U+D835 and U+DD6B. (See http://en.wikipedia.org/wiki/UTF-16 for a description of the encoding algorithm.)
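As a quick illustration of this example (again with a placeholder class name), the Character class can carry out the conversion between a code point and its surrogate pair; the printed code units match the pair given above:

public class Utf16Demo {
    public static void main(String[] args) {
        int codePoint = 0x1D56B; // the example code point from the text

        // Encode: a supplementary code point becomes a pair of code units
        char[] units = Character.toChars(codePoint);
        System.out.printf("U+%04X U+%04X%n", (int) units[0], (int) units[1]);
        // prints: U+D835 U+DD6B

        // Each code unit reveals its role on its own, without further context
        System.out.println(Character.isHighSurrogate(units[0])); // true (first code unit)
        System.out.println(Character.isLowSurrogate(units[1]));  // true (second code unit)

        // Decode: combine the pair back into the original code point
        int decoded = Character.toCodePoint(units[0], units[1]);
        System.out.printf("U+%X%n", decoded); // prints: U+1D56B
    }
}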