Unicode and UTF-8 Encoding

路行-

Article link: https://blog.csdn.net/u013409186/article/details/100674292

Getting a few concepts straight

What is a character?

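Encyclopedia definition: a character is a glyph-like unit or symbol, including letters, digits, operators, punctuation marks and other symbols, as well as some functional (control) symbols.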

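My own understanding: a character is the smallest meaningful symbol in a language, such as the letters and punctuation of English, the ideographs and punctuation of Chinese, and the symbols of other languages. Put simply, it is a single letter or a single standalone character of some language.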

What is character encoding? Why do we need it?

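Character encoding maps each character of a character set to an object in some specified set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses), so that text can be stored in a computer and transmitted over a communication network.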

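Computers only work with 0s and 1s, so a character must be converted to 0s and 1s before it can be processed. Computers group 8 bits into one byte, so the largest integer a single byte can hold is 255 (binary 11111111 = decimal 255). The ASCII table uses only the values 0 to 127 (7 bits) to represent uppercase and lowercase English letters, digits, some symbols and control characters; for example, uppercase A is 65 and lowercase z is 122. The values 128 to 255 are not part of ASCII and were assigned differently by the various extended 8-bit encodings.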
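As a small illustration of the code values mentioned above, here is a minimal Java sketch. The class name AsciiDemo and the helper asciiCode are made up for this example; the sketch assumes that ASCII covers only the codes 0 to 127.

```java
class AsciiDemo {
    // Hypothetical helper: returns the ASCII code of c, or -1 if c is outside 0..127.
    static int asciiCode(char c) {
        return c <= 127 ? c : -1;
    }

    public static void main(String[] args) {
        System.out.println(asciiCode('A'));               // 65
        System.out.println(asciiCode('z'));               // 122
        System.out.println(Integer.toBinaryString(255));  // 11111111, the largest value of one byte
        System.out.println(asciiCode('\u4E2D'));          // -1, a Chinese character is not ASCII
    }
}
```

A char in Java widens to int without a cast, which is why the helper can return the character directly as its numeric code.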

What is Unicode? Why does Unicode exist?

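Unicode is a character coding standard, defined by an international organization (the Unicode Consortium), that is designed to accommodate the scripts and symbols of all the world's languages. It unifies every language into one set of code assignments, which removes the garbled text caused by conflicting regional encodings, as long as the writer and the reader also agree on how those codes are stored as bytes. A Unicode character is usually written as "U+" followed by a group of hexadecimal digits.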

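The original Unicode standard represented every character with a fixed 16 bits, but the standard has since changed to allow characters that need more than 16 bits. The current range of Unicode code points is U+0000 to U+10FFFF. The code points from U+0000 to U+FFFF are known as the BMP (Basic Multilingual Plane), and characters above U+FFFF are called supplementary characters.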
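The ranges above can be checked with the code point helpers on java.lang.Character. This is only a sketch: the class name CodePointDemo and the helper toUPlus are invented for illustration, and the two sample code points (U+4E2D and U+1F600) are arbitrary choices.

```java
class CodePointDemo {
    // Hypothetical helper: formats a code point in "U+XXXX" notation (at least 4 hex digits).
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int han = 0x4E2D;     // a CJK ideograph inside the BMP
        int emoji = 0x1F600;  // a code point above U+FFFF

        System.out.println(toUPlus(han) + " in BMP: " + Character.isBmpCodePoint(han));
        System.out.println(toUPlus(emoji) + " supplementary: " + Character.isSupplementaryCodePoint(emoji));

        // The upper bound of the code space is U+10FFFF.
        System.out.println(toUPlus(Character.MAX_CODE_POINT));
        System.out.println(Character.isValidCodePoint(0x110000)); // false, outside the code space
    }
}
```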

How are Unicode and UTF-8 related? If we have Unicode, why do we still need UTF-8?

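Unicode can be thought of as a character set, and UTF-8 as a character encoding that maps Unicode characters to concrete bytes that can be stored. UTF-8 uses 1 to 4 bytes to represent each Unicode character and is backward-compatible with ASCII.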
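To make the "1 to 4 bytes" rule concrete, the following sketch encodes a single code point to UTF-8 by hand and compares the result with the JDK's own encoder. The class Utf8Sketch and its encode method are hypothetical names written for this article, not part of any library, and the four sample code points are arbitrary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Hypothetical helper: encodes one Unicode scalar value to UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp < 0x80) {            // 1 byte:  0xxxxxxx (identical to ASCII)
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 'A', 0xE9, 0x4E2D, 0x1F600 }) {
            byte[] manual = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.println(String.format("U+%04X", cp) + " -> " + manual.length
                    + " byte(s), matches JDK: " + Arrays.equals(manual, jdk));
        }
    }
}
```

The first branch is why UTF-8 is backward-compatible with ASCII: every code point below U+0080 is stored as the same single byte that ASCII would use.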

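Unicode is a character set: it collects the world's characters and assigns each one a fixed value (called a code point; the code points outside the surrogate range are the Unicode scalar values), so that character values never collide across languages (without this unification, characters from different languages could be assigned the same value). This mapping identifies a Unicode character by a number, but it does not say how that number is stored in a computer, for example how many bytes represent one character. That is the job of a character encoding: mapping Unicode characters to concrete storage.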

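A character set needs a concrete value to identify each character, which for Unicode is the code point, and a character encoding maps those values to storable bytes. Unicode has several such encodings, including UTF-8, UTF-16 and UTF-32, so the same Unicode character (the same code point) can map to a different number of bytes and to different byte values depending on the encoding.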
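The point that one code point yields different bytes under different encodings can be seen directly with String.getBytes. The sketch below is illustrative only: EncodingCompare and hex are made-up names, and UTF-32 is left out to keep the example short.

```java
import java.nio.charset.StandardCharsets;

class EncodingCompare {
    // Hypothetical helper: renders bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String han = "\u4E2D";                                // code point U+4E2D
        String emoji = new String(Character.toChars(0x1F600)); // code point U+1F600

        System.out.println(hex(han.getBytes(StandardCharsets.UTF_8)));      // E4 B8 AD
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_16BE)));   // 4E 2D
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8)));    // F0 9F 98 80
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_16BE))); // D8 3D DE 00
    }
}
```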

Character encoding in Java

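The Java 8 platform uses version 6.2 of the Unicode standard and represents text as UTF-16, so a char in Java occupies 2 bytes. A character in the BMP is represented by a single char. A supplementary character is represented by a pair of char values: the first comes from the high-surrogate range (\uD800-\uDBFF) and the second from the low-surrogate range (\uDC00-\uDFFF).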
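The surrogate pair for a supplementary character can be computed with a little arithmetic. The sketch below does it by hand and checks the result against Character.toChars; the class SurrogateDemo and the helper toSurrogates are hypothetical names, and U+1F600 is just a sample code point.

```java
import java.util.Arrays;

class SurrogateDemo {
    // Hypothetical helper: splits a supplementary code point into its UTF-16 surrogate pair.
    static char[] toSurrogates(int cp) {
        if (!Character.isSupplementaryCodePoint(cp)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = cp - 0x10000;                      // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] manual = toSurrogates(cp);
        System.out.println(String.format("%04X %04X", (int) manual[0], (int) manual[1])); // D83D DE00
        System.out.println(Arrays.equals(manual, Character.toChars(cp)));                 // true

        String s = new String(manual);
        System.out.println(s.length());                       // 2 (char count, i.e. UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (actual characters)
    }
}
```

This is why String.length() can overstate the number of characters in text that contains supplementary characters; codePointCount gives the code point count instead.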

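An int can represent every Unicode character, including supplementary characters: the lower 21 bits hold the Unicode code point and the upper 11 bits must be zero. In code, an int holding a Unicode code point can therefore stand for a supplementary character.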

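Methods in Character that take a char parameter do not support supplementary characters; they treat char values from the surrogate ranges as undefined characters.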

Methods in Character that take an int parameter support all Unicode characters.
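The difference between the char and int overloads shows up as soon as a string contains a supplementary character. The sketch below uses U+2F81A, the sample code point from the Character Javadoc; the class CharVsIntDemo and both counting helpers are hypothetical names for this example.

```java
class CharVsIntDemo {
    // Counts letters one char at a time; each surrogate is seen as an undefined character.
    static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    // Counts letters one code point at a time, using the int overload of isLetter.
    static int countLettersByCodePoint(String s) {
        return (int) s.codePoints().filter(cp -> Character.isLetter(cp)).count();
    }

    public static void main(String[] args) {
        // "a" followed by the supplementary ideograph U+2F81A (stored as two chars).
        String s = "a" + new String(Character.toChars(0x2F81A));

        System.out.println(Character.isLetter(0x2F81A));   // true  (int overload)
        System.out.println(Character.isLetter('\uD87E'));  // false (char overload sees a lone surrogate)
        System.out.println(countLettersByChar(s));         // 1
        System.out.println(countLettersByCodePoint(s));    // 2
    }
}
```

Iterating with String.codePoints() (available since Java 8) is one straightforward way to make sure the int overloads are the ones being called.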

What is the Character class in Java for?

This is the question that originally prompted this article...

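The Character class represents a character and is the wrapper class for the primitive type char. A Character object contains a single field of type char, and the class provides a large number of static methods for determining a character's category (lowercase letter, digit, and so on) and for converting characters from uppercase to lowercase and vice versa.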
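A few of those static methods in action, as a rough sketch. The class CharacterMethodsDemo and the helper describe are invented for illustration and cover only a handful of the categories that Character can report.

```java
class CharacterMethodsDemo {
    // Hypothetical helper: names the category of a char using Character's static methods.
    static String describe(char c) {
        if (Character.isDigit(c)) return "digit";
        if (Character.isLowerCase(c)) return "lowercase letter";
        if (Character.isUpperCase(c)) return "uppercase letter";
        if (Character.isWhitespace(c)) return "whitespace";
        return "other";
    }

    public static void main(String[] args) {
        Character boxed = 'a';                // autoboxing: char -> Character wrapper
        System.out.println(describe(boxed));  // lowercase letter (unboxed back to char)
        System.out.println(describe('7'));    // digit

        System.out.println(Character.toUpperCase('a'));      // A
        System.out.println(Character.toLowerCase('Z'));      // z
        System.out.println(Character.getNumericValue('7'));  // 7
    }
}
```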

While reading the documentation, I found a few concepts worth learning:

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF inclusive.

Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF

Unicode Character Database. A collection of files providing normative and informative Unicode character properties and mappings

High-Surrogate Code Point. A Unicode code point in the range U+D800 to U+DBFF.

High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate.

Low-Surrogate Code Point. A Unicode code point in the range U+DC00 to U+DFFF.

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00 to DFFF, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate.

Unicode Encoding Form. A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32.

Unicode Encoding Scheme. A specified byte serialization for a Unicode encoding form, including the specification of the handling of a byte order mark (BOM), if allowed.

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.

UTF-8. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages.

UTF-8 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length

UTF-8 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself.

UTF-16. A multibyte encoding for text that represents each Unicode character with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal form of Unicode in many programming languages, such as Java, C#, and JavaScript, and in many operating systems.

UTF-16 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair

UTF-16 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

UTF-32. A multibyte encoding for text that represents each Unicode character with 4 bytes; it is not backward-compatible with ASCII.

UTF-32 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.

UTF-32 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in either big-endian or little-endian formats.

Big-endian. A computer architecture that stores multiple-byte numerical values with the most significant byte (MSB) values first.

Little-endian. A computer architecture that stores multiple-byte numerical values with the least significant byte (LSB) values first.
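The encoding scheme, byte order and BOM entries above can be observed with the UTF-16 charsets that ship with the JDK. This is a sketch under the assumption, stated in the java.nio.charset.Charset documentation, that the plain UTF-16 charset encodes in big-endian order and writes a big-endian byte order mark; ByteOrderDemo and hex are made-up names.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Hypothetical helper: renders bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4E2D"; // one UTF-16 code unit: 0x4E2D

        // Big-endian: most significant byte first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        // Little-endian: least significant byte first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        // Plain UTF-16: a BOM (FE FF) is written first, then big-endian data.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D
    }
}
```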
