Unicode and UTF-8 Encoding

Getting a few concepts straight

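What is a character?

Encyclopedia definition: a character is a glyph-like unit or symbol, including letters, digits, operators, punctuation marks and other symbols, as well as some functional symbols.

My understanding: a character is the smallest meaningful symbol in a language, such as an English letter or punctuation mark, a Chinese character or punctuation mark, or a symbol from some other language. Put simply, it is one letter or one standalone character of a language.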

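What is character encoding, and why do we need it?

Character encoding maps the characters of a character set onto objects from a specified set (for example bit patterns, sequences of natural numbers, octets, or electrical pulses), so that text can be stored in a computer and transmitted over a communication network.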

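Because a computer only works with 0s and 1s, characters must be converted to 0s and 1s before they can be processed. Computers group 8 bits into one byte, so the largest integer a single byte can hold is 255 (binary 11111111 = decimal 255). ASCII uses only the values 0 to 127 (7 bits) to represent upper- and lower-case English letters, digits and some symbols; for example, the code of uppercase A is 65 and the code of lowercase z is 122.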
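To make the ASCII numbers above concrete, here is a minimal Java sketch. The class name AsciiDemo and its helper methods are made up for illustration; the only library behavior relied on is that casting a char to int yields its numeric code.

```java
class AsciiDemo {
    // Returns the numeric code of a character. For characters in the
    // ASCII range this is the ASCII code.
    static int codeOf(char c) {
        return (int) c;
    }

    // True when the value fits in 7-bit ASCII (0..127).
    static boolean isAscii(int code) {
        return code >= 0 && code <= 0x7F;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));    // 65
        System.out.println(codeOf('z'));    // 122
        System.out.println(isAscii('A'));   // true
        System.out.println(isAscii(0xE9));  // false: needs more than 7 bits
    }
}
```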

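What is Unicode, and why does it exist?

Unicode is a character coding scheme, maintained by an international organization (the Unicode Consortium), that aims to accommodate all of the world's scripts and symbols. It unifies all languages into a single set of codes, which removes the garbled text caused by conflicting per-language code tables. A Unicode character is usually written as "U+" followed by a group of hexadecimal digits.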

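The original Unicode standard defined characters as fixed 16-bit values, but the standard has since changed to allow characters that need more than 16 bits. Unicode code points now range from U+0000 to U+10FFFF. The characters from U+0000 to U+FFFF are known as the BMP (Basic Multilingual Plane); characters above U+FFFF are called supplementary characters.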
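The ranges above can be checked from Java with a few static methods of java.lang.Character. This is only a sketch: the class name PlaneDemo and the helpers notation and plane are made up for illustration.

```java
class PlaneDemo {
    // Formats a code point in the conventional "U+" hexadecimal notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Classifies a value as BMP, supplementary, or outside the codespace.
    static String plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "invalid";
        }
        return Character.isBmpCodePoint(codePoint) ? "BMP" : "supplementary";
    }

    public static void main(String[] args) {
        int han = 0x4E2D;    // a CJK ideograph inside the BMP
        int emoji = 0x1F600; // an emoji above U+FFFF

        System.out.println(notation(han) + " " + plane(han));     // U+4E2D BMP
        System.out.println(notation(emoji) + " " + plane(emoji)); // U+1F600 supplementary
        System.out.println(notation(Character.MAX_CODE_POINT));   // U+10FFFF
        System.out.println(plane(0x110000));                      // invalid
    }
}
```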

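How are Unicode and UTF-8 related? Why is UTF-8 needed when we already have Unicode?

Unicode can be thought of as a character set and UTF-8 as a character encoding that maps Unicode characters to concrete, storable bytes. UTF-8 uses 1 to 4 bytes for each Unicode character and is backward-compatible with ASCII.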

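Unicode is a character set: it collects the characters of the world and assigns each one a fixed value (called a code point; code points outside the surrogate ranges are Unicode scalar values), so that character values never collide (without this unification, characters from different languages could end up with the same value). This mapping lets a value identify a Unicode character, but it says nothing about how that value is stored in a computer, for example how many bytes represent one character. That is the job of a character encoding: mapping Unicode characters to concrete storage.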

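A character set needs a concrete value to identify each character, which for Unicode is the code point, and a character encoding maps those values to concrete, storable bytes. The Unicode character set has several encodings, such as UTF-8, UTF-16 and UTF-32, so the same Unicode character (the same code point) may be mapped to a different number of bytes with different byte values.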
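As a small illustration of "same code point, different bytes", the sketch below encodes two characters with the standard charsets that ship with Java. The class name EncodingDemo and the hex helper are made up for illustration; the sample characters are arbitrary choices.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encodes the text with the given charset and renders the bytes as hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String han = "\u4E2D";                                   // U+4E2D, in the BMP
        String emoji = new String(Character.toChars(0x1F600));   // U+1F600, supplementary

        System.out.println(hex(han, StandardCharsets.UTF_8));      // E4 B8 AD
        System.out.println(hex(han, StandardCharsets.UTF_16BE));   // 4E 2D
        System.out.println(hex(emoji, StandardCharsets.UTF_8));    // F0 9F 98 80
        System.out.println(hex(emoji, StandardCharsets.UTF_16BE)); // D8 3D DE 00
    }
}
```

The code point stays the same in each pair of lines; only the byte sequence chosen by the encoding changes.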

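Character encoding in Java?

The Java 8 platform uses version 6.2 of the Unicode standard and represents characters in UTF-16, so a char occupies 2 bytes in Java. A character in the BMP is represented by a single char; a supplementary character is represented by a pair of char values, the first from the high-surrogate range (\uD800-\uDBFF) and the second from the low-surrogate range (\uDC00-\uDFFF).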
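The following sketch shows what a surrogate pair looks like from inside a Java String. The class name SurrogateDemo and the helper fromCodePoint are made up for illustration; U+1F600 is just a convenient supplementary character.

```java
class SurrogateDemo {
    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1F600); // one supplementary character

        System.out.println(s.length());                            // 2 (char units)
        System.out.println(s.codePointCount(0, s.length()));       // 1 (code point)
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
        System.out.println(s.codePointAt(0) == 0x1F600);            // true

        String bmp = fromCodePoint(0x4E2D); // a BMP character needs one char
        System.out.println(bmp.length());   // 1
    }
}
```

Note that String.length() counts char units, not characters, which is why code that may see supplementary characters usually iterates by code point.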

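An int can represent every Unicode character, including supplementary characters: the lower 21 bits hold the Unicode code point and the upper 11 bits must be 0. In code, an int holding a Unicode code point can be used to represent a supplementary character.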

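Methods of Character that take a char parameter do not support supplementary characters; they treat char values from the surrogate ranges as undefined characters.

Methods of Character that take an int parameter support all Unicode characters.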
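The difference between the char and int overloads shows up as soon as a string contains a supplementary character. The sketch below counts letters both ways; the class name OverloadDemo and its two helper methods are made up for illustration, and U+20000 (a CJK ideograph outside the BMP) is an arbitrary test character.

```java
class OverloadDemo {
    // Counts letters by looking at each char unit on its own.
    // The halves of a surrogate pair are never reported as letters.
    static int lettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts letters by code point, stepping over both halves of a pair.
    static int lettersByCodePoint(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            if (Character.isLetter(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        // "a" followed by U+20000, a supplementary CJK ideograph
        String s = "a" + new String(Character.toChars(0x20000));

        System.out.println(lettersByChar(s));      // 1
        System.out.println(lettersByCodePoint(s)); // 2
    }
}
```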

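What is the Character class in Java for?

This is the question that started this article...

The Character class represents a character; it is the wrapper class for the primitive type char. A Character object contains a single field of type char, and the class provides a large number of static methods for determining a character's category (lowercase letter, digit, etc.) and for converting characters from uppercase to lowercase and vice versa.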
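A few of those static methods in action, as a rough sketch. The class name CharacterDemo and the describe helper are made up for illustration; the Character calls themselves are standard.

```java
class CharacterDemo {
    // Puts a char into a coarse category using Character's static checks.
    static String describe(char c) {
        if (Character.isDigit(c)) {
            return "digit";
        }
        if (Character.isLowerCase(c)) {
            return "lowercase letter";
        }
        if (Character.isUpperCase(c)) {
            return "uppercase letter";
        }
        return "other";
    }

    public static void main(String[] args) {
        Character boxed = Character.valueOf('a'); // wrapper object around a char
        char raw = boxed.charValue();             // back to the primitive

        System.out.println(describe(raw));                  // lowercase letter
        System.out.println(describe('7'));                  // digit
        System.out.println(describe('Q'));                  // uppercase letter
        System.out.println(Character.toUpperCase(raw));     // A
        System.out.println(Character.toLowerCase('Q'));     // q
        System.out.println(Character.getNumericValue('7')); // 7
    }
}
```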

 

Some concepts from the documentation that seem worth learning:

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF and E000 to 10FFFF inclusive.

Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF

Unicode Character Database. A collection of files providing normative and informative Unicode character properties and mappings

High-Surrogate Code Point. A Unicode code point in the range U+D800 to U+DBFF.

High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate.

Low-Surrogate Code Point. A Unicode code point in the range U+DC00 to U+DFFF.

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00 to DFFF, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate.

Unicode Encoding Form. A character encoding form that assigns each Unicode scalar value to a unique code unit sequence. The Unicode Standard defines three Unicode encoding forms: UTF-8, UTF-16, and UTF-32.

Unicode Encoding Scheme. A specified byte serialization for a Unicode encoding form, including the specification of the handling of a byte order mark (BOM), if allowed.

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. 

UTF-8. A multibyte encoding for text that represents each Unicode character with 1 to 4 bytes, and which is backward-compatible with ASCII. UTF-8 is the predominant form of Unicode in web pages. 

UTF-8 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value to an unsigned byte sequence of one to four bytes in length
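To see where the "one to four bytes" comes from, here is a hand-written sketch of the UTF-8 bit layout, compared against Java's built-in encoder. The class name Utf8Sketch and the encode method are made up for illustration; in real code String.getBytes(StandardCharsets.UTF_8) is the thing to use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    // Encodes one Unicode scalar value into 1 to 4 UTF-8 bytes.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value");
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp <= 0x7FF) {           // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        if (cp <= 0xFFFF) {          // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
        return new byte[] {          // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))};
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F600}; // 1, 2, 3 and 4 bytes
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            // Prints the byte count and whether the sketch matches the library.
            System.out.println(encode(cp).length + " "
                    + Arrays.equals(encode(cp), expected));
        }
    }
}
```

The one-byte branch is what makes UTF-8 backward-compatible with ASCII: any value up to 0x7F is stored as that single byte unchanged.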

UTF-8 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-8 code unit sequence in exactly the same order as the code unit sequence itself. 

UTF-16. A multibyte encoding for text that represents each Unicode character with 2 or 4 bytes; it is not backward-compatible with ASCII. It is the internal form of Unicode in many programming languages, such as Java, C#, and JavaScript, and in many operating systems. 

UTF-16 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair
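The surrogate-pair assignment for U+10000..U+10FFFF is plain arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below spells that out and checks it against the library methods Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint. The class name SurrogateMath and its methods are made up for illustration.

```java
class SurrogateMath {
    // Leading (high) surrogate: top 10 bits of (cp - 0x10000), offset by 0xD800.
    static char high(int cp) {
        return (char) (0xD800 + ((cp - 0x10000) >> 10));
    }

    // Trailing (low) surrogate: bottom 10 bits of (cp - 0x10000), offset by 0xDC00.
    static char low(int cp) {
        return (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
    }

    // Inverse operation: rebuild the code point from a surrogate pair.
    static int combine(char hi, char lo) {
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;

        System.out.println(String.format("%04X", (int) high(cp))); // D83D
        System.out.println(String.format("%04X", (int) low(cp)));  // DE00
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
        System.out.println(combine(high(cp), low(cp)) == cp);        // true
        System.out.println(
            Character.toCodePoint(high(cp), low(cp)) == cp);         // true
    }
}
```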

UTF-16 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-16 code unit sequence as a byte sequence in either big-endian or little-endian format.

UTF-32. A multibyte encoding for text that represents each Unicode character with 4 bytes; it is not backward-compatible with ASCII.

UTF-32 Encoding Form. The Unicode encoding form that assigns each Unicode scalar value to a single unsigned 32-bit code unit with the same numeric value as the Unicode scalar value.

UTF-32 Encoding Scheme. The Unicode encoding scheme that serializes a UTF-32 code unit sequence as a byte sequence in either big-endian or little-endian formats.

Big-endian. A computer architecture that stores multiple-byte numerical values with the most significant byte (MSB) values first.

Little-endian. A computer architecture that stores multiple-byte numerical values with the least significant byte (LSB) values first.
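Byte order and the BOM are easiest to see by serializing the same 16-bit code unit with Java's three UTF-16 charsets. This is a small sketch; the class name EndianDemo and the hex helper are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EndianDemo {
    // Renders a byte array as space-separated hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String han = "\u4E2D"; // a single code unit, 0x4E2D

        // Most significant byte first.
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        // Least significant byte first.
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        // Java's plain UTF-16 encoder writes a big-endian BOM, then big-endian data.
        System.out.println(hex(han.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D
    }
}
```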
