Character Encoding

azhu422

Article link: https://blog.csdn.net/azhu422/article/details/5841104

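First, two concepts need to be kept apart: the character set and the character encoding.

Unicode is a character set. In a character set, each character has only a number, also called its code point. UTF-16 is one encoding of the Unicode character set, and UTF-8 is another encoding of the same set.

Reference: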

http://hi.baidu.com/%B0%AE%D0%C4%CD%AC%C3%CB_%B3%C2%F6%CE/blog/item/31bf18a2306cc5a7cbefd0c8.html

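Where there is an encoding, there is a code table.

The ASCII table uses only the low 7 bits of a byte.

Extending into the highest bit (8th bit set to 1) gave rise to many more encodings.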

all encodings with a code unit size greater than 8 bits are assumed to use an ASCII-compatible low-order byte

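GB2312 is a Chinese national standard; GBK is an encoding compatible with GB2312 that was developed by a group of vendors.

A GB2312 code table is available online: http://www.knowsky.com/resource/gb2312tbl.htm

There is also a description of GBK with a detailed code table: http://www.jsjzx.net/studypc/write/007.htm

In real development, a workable approach is to write a byte-by-byte check against the code table: test whether the first byte is a legal first byte of a GB code, and whether the second byte is a legal second byte.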
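The byte-by-byte check described above could be sketched roughly as follows. This is a minimal sketch, not a full code-table lookup: it only tests the GBK byte ranges (lead byte 0x81 to 0xFE, trail byte 0x40 to 0xFE except 0x7F), so some unassigned pairs inside those ranges will still pass. The class and method names are made up for illustration.

```java
class GbByteCheck {
    /** True if b is in the GBK lead-byte range 0x81..0xFE. */
    public static boolean isGbkLead(int b) {
        return b >= 0x81 && b <= 0xFE;
    }

    /** True if b is in the GBK trail-byte range 0x40..0xFE, excluding 0x7F. */
    public static boolean isGbkTrail(int b) {
        return b >= 0x40 && b <= 0xFE && b != 0x7F;
    }

    /** ASCII bytes stand alone; every other byte must start a legal lead/trail pair. */
    public static boolean looksLikeGbk(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b < 0x80) {          // plain ASCII
                i++;
                continue;
            }
            if (!isGbkLead(b) || i + 1 >= data.length) return false;
            if (!isGbkTrail(data[i + 1] & 0xFF)) return false;
            i += 2;                  // consumed one two-byte character
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zhong = { (byte) 0xD6, (byte) 0xD0 };   // GB code D6 D0
        System.out.println(looksLikeGbk(zhong));
    }
}
```

For a GB2312-only check, the ranges would be narrowed to 0xA1..0xF7 for the first byte and 0xA1..0xFE for the second.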

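The other commonly used encodings are UTF-8 and UTF-16, two different encodings of the Unicode character set.

The correspondence between Unicode code points and UTF-8 is: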

U-00000000 - U-0000007F: 0xxxxxxx

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

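Note that for multi-byte sequences, the number of 1 bits before the first 0 in the first byte equals the number of bytes in the sequence; a single-byte (ASCII) sequence simply starts with 0. The current UTF-8 definition (RFC 3629) stops at U+10FFFF, so the 5-byte and 6-byte forms in the last two rows are no longer valid.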
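To make the bit patterns in the table concrete, here is a small hand-written encoder for code points up to U+10FFFF, plus a helper that reads the sequence length from the first byte. It is only an illustration of the table (Java's own encoder is what should be used in practice), it does not reject surrogate code points, and the names `Utf8Manual`, `encode` and `sequenceLength` are invented for this sketch.

```java
class Utf8Manual {
    /** Encodes one code point (0..0x10FFFF) using the patterns from the table. */
    public static byte[] encode(int cp) {
        if (cp < 0x80) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    /** Sequence length implied by a first byte; -1 for a continuation byte (10xxxxxx). */
    public static int sequenceLength(byte first) {
        int b = first & 0xFF;
        int ones = 0;
        while ((b & 0x80) != 0) {              // count 1 bits before the first 0
            ones++;
            b = (b << 1) & 0xFF;
        }
        if (ones == 0) return 1;               // ASCII
        if (ones == 1) return -1;              // not a valid first byte
        return ones;
    }

    public static void main(String[] args) {
        for (byte b : encode(0x4E2D)) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```

For U+4E2D the three output bytes are E4 B8 AD, and the first byte 0xE4 (1110 0100) carries three leading 1 bits.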

Converting between character sets

Conversion is more convenient in Java, but a few points need to be understood thoroughly.

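The JDK documentation states that a String represents a string in UTF-16 format. A char is a fixed-width 16-bit entity, that is, one UTF-16 code unit, and characters above U+FFFF take two chars (a surrogate pair). So no matter how a string was decoded or will be encoded, as long as it is referenced as a String it behaves as a UTF-16 encoded sequence in memory.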
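The difference between 16-bit code units and code points can be observed directly. The following is a small sketch using only standard `String` and `Character` methods; the class name `Utf16Units` and its two helper methods are illustrative.

```java
class Utf16Units {
    /** Number of 16-bit UTF-16 code units (what String.length() reports). */
    public static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u4E2D";                                   // U+4E2D, inside the BMP
        String supplementary = new String(Character.toChars(0x1F600)); // U+1F600, outside the BMP
        System.out.println(codeUnits(bmp) + " unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " unit(s), "
                + codePoints(supplementary) + " code point(s)");
    }
}
```

A character outside the Basic Multilingual Plane occupies two chars, so `length()` and the code point count differ for it.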

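So str.getBytes("UTF-8") means: take str, a UTF-16 sequence, read it one 16-bit code unit (two bytes) at a time, or two code units for a surrogate pair, map them to a Unicode code point, and return a byte array holding the UTF-8 encoding of each code point. str.getBytes("GBK") works the same way, but GBK covers fewer characters than Unicode, so characters it cannot represent are replaced (typically with '?') and data is lost.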
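A short sketch of the encoding direction, printing the bytes that `getBytes` produces for different target charsets. The helper names `hex`, `encodeHex` and `canEncode` are invented here; US-ASCII stands in for a charset with a smaller repertoire so that the example only depends on `StandardCharsets`. `Charset.forName("GBK")` can be passed instead if the runtime provides it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    /** Formats a byte array as space-separated upper-case hex. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes s into the target charset and returns the bytes as hex. */
    public static String encodeHex(String s, Charset target) {
        return hex(s.getBytes(target));
    }

    /** True if every character of s has a representation in the target charset. */
    public static boolean canEncode(String s, Charset target) {
        return target.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String s = "\u4E2D";
        System.out.println(encodeHex(s, StandardCharsets.UTF_8));
        System.out.println(encodeHex(s, StandardCharsets.UTF_16BE));
        // Lossy case: the character is not in US-ASCII, so a replacement byte is written.
        System.out.println(canEncode(s, StandardCharsets.US_ASCII));
        System.out.println(encodeHex(s, StandardCharsets.US_ASCII));
    }
}
```

Checking `canEncode` before calling `getBytes` is one way to notice the lossy case, since `getBytes(Charset)` itself substitutes silently.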

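Likewise, str = new String(bytes, "UTF-8") means: read the binary data in bytes byte by byte according to the UTF-8 scheme (counting the 1 bits before the first 0 in each lead byte); each time a Unicode code point is recognized, convert it to the UTF-16 encoding of that code point. The resulting String is again a UTF-16 encoded sequence.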
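The decoding direction can be sketched in two variants: the lenient `new String(bytes, charset)` form, which substitutes U+FFFD for malformed input, and a strict form built on `CharsetDecoder` that reports the error instead. The class and method names are illustrative, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

class DecodeDemo {
    /** Lenient decoding: malformed sequences become the replacement character U+FFFD. */
    public static String decodeLenient(byte[] bytes, Charset source) {
        return new String(bytes, source);
    }

    /** Strict decoding: returns empty if the bytes are not valid in the source charset. */
    public static Optional<String> decodeStrict(byte[] bytes, Charset source) {
        try {
            return Optional.of(source.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] ok = { (byte) 0xE4, (byte) 0xB8, (byte) 0xAD };
        byte[] truncated = { (byte) 0xE4, (byte) 0xB8 };
        System.out.println(decodeStrict(ok, StandardCharsets.UTF_8).isPresent());
        System.out.println(decodeStrict(truncated, StandardCharsets.UTF_8).isPresent());
    }
}
```

The strict variant is the safer choice when the source encoding is only a guess, because a wrong guess then fails loudly instead of producing replacement characters.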

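Which encodings Java can convert depends on what the JVM platform and the operating system support. Some third-party libraries also provide support, such as Apache Commons Lang or Apache Commons Codec.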
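Which charsets a particular JVM offers can be queried at run time. The snippet below is a minimal sketch using `Charset.isSupported` and `Charset.availableCharsets`; the class name `CharsetList` and the wrapper method are illustrative.

```java
import java.nio.charset.Charset;

class CharsetList {
    /** True if this JVM can provide a charset with the given name. */
    public static boolean supported(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        // The result for extended charsets such as GBK depends on the runtime.
        System.out.println("GBK supported: " + supported("GBK"));
        // Prints the canonical names of every charset this JVM knows about.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```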

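C/C++ is different: a C string is just a char array, and each char is (typically) 8 bits. There is no uniform in-memory representation, so each conversion has to first identify the source encoding, then look up the code points byte by byte, and then find the byte sequence for each code point in the target encoding. A third-party library for this is iconv.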

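Encoding detection

Read the first four bytes of the input stream.

1. If the first four bytes contain a BOM (Byte Order Mark), the encoding can be identified directly: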

BOM Bytes Encoding

EF BB BF UTF-8

FF FE 00 00 UTF-32 (little-endian)

00 00 FE FF UTF-32 (big-endian)

FF FE UTF-16 (little-endian)

FE FF UTF-16 (big-endian)

0E FE FF SCSU

2B 2F 76 UTF-7

DD 73 66 73 UTF-EBCDIC

FB EE 28 BOCU-1
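The BOM table can be turned into a small lookup. The sketch below checks the longer UTF-32 signatures before the UTF-16 ones, because FF FE 00 00 also begins with the UTF-16 little-endian mark FF FE; the UTF-8 signature is EF BB BF. The returned strings are plain labels (several of them, such as SCSU or BOCU-1, are not charsets a JVM is guaranteed to provide), and `BomSniffer` is a made-up name.

```java
import java.util.Optional;

class BomSniffer {
    /** Returns a label for the BOM at the start of head, or empty if none matches. */
    public static Optional<String> detect(byte[] head) {
        // Four-byte signatures first, so UTF-32LE is not mistaken for UTF-16LE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(head, 0xDD, 0x73, 0x66, 0x73)) return Optional.of("UTF-EBCDIC");
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(head, 0x0E, 0xFE, 0xFF)) return Optional.of("SCSU");
        if (startsWith(head, 0x2B, 0x2F, 0x76)) return Optional.of("UTF-7");
        if (startsWith(head, 0xFB, 0xEE, 0x28)) return Optional.of("BOCU-1");
        if (startsWith(head, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        if (startsWith(head, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        return Optional.empty();
    }

    /** Compares the first bytes of data, as unsigned values, against prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x3C };
        System.out.println(detect(head).orElse("no BOM"));
    }
}
```

One ambiguity remains by design: a UTF-16LE stream whose first character after the BOM is U+0000 also starts with FF FE 00 00 and would be reported as UTF-32LE.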

2. If the first four bytes are not a BOM, check them for the following special patterns:

"00 00..." : If the stream starts with two zero bytes, the default 32-bit big-endian encoding UTF-32BE is used.

"00 XX..." : If the stream starts with a single zero byte, the default 16-bit big-endian encoding UTF-16BE is used.

"XX ?? 00 00..." : If the third and fourth bytes of the stream are zero, the default 32-bit little-endian encoding UTF-32LE is used.

"XX 00..." or "XX ?? XX 00..." : If the second or fourth byte of the stream is zero, the default 16-bit little-endian encoding UTF-16LE is used.

"XX XX 00 XX..." : If the third byte of the stream is zero, the default 16-bit big-endian encoding UTF-16BE is used (assumes the first character is > U+00FF).

"4C XX XX XX..." : If the first four bytes are consistent with the EBCDIC encoding of an XML declaration ("<?xm") or a document type declaration ("<!DO"), or any other string starting with the EBCDIC character '<' followed by three non-ASCII characters (8th bit set), which is consistent with EBCDIC alphanumeric characters, the default EBCDIC-compatible encoding Cp037 is used.

"XX XX XX XX..." : Otherwise, if all of the first four bytes of the stream are non-zero, the default 8-bit ASCII-compatible encoding ISO-8859-1 is used. (GBK/GB2312) If Chinese text may be involved, also check against the GB code table whether these bytes form legal GB codes.

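The zero-byte patterns listed above can be applied in the same order as a chain of checks. This is a rough sketch under a few assumptions: exactly the first four bytes are inspected, the EBCDIC rule is reduced to the byte sequence 4C 6F A7 94 (the EBCDIC form of "<?xm") plus the generic "4C followed by three bytes with the 8th bit set" case, and the GB code-table check for Chinese text is left out. `ZeroByteHeuristic` and `guess` are made-up names.

```java
class ZeroByteHeuristic {
    /** Guesses a default encoding from the first four bytes of a stream without a BOM. */
    public static String guess(byte[] h) {
        if (h.length < 4) {
            throw new IllegalArgumentException("need at least four bytes");
        }
        boolean z0 = h[0] == 0, z1 = h[1] == 0, z2 = h[2] == 0, z3 = h[3] == 0;

        if (z0 && z1) return "UTF-32BE";       // 00 00 ...
        if (z0) return "UTF-16BE";             // 00 XX ...
        if (z2 && z3) return "UTF-32LE";       // XX ?? 00 00
        if (z1 || z3) return "UTF-16LE";       // XX 00 ... or XX ?? XX 00
        if (z2) return "UTF-16BE";             // XX XX 00 XX

        if ((h[0] & 0xFF) == 0x4C) {           // EBCDIC '<'
            boolean xmlDecl = (h[1] & 0xFF) == 0x6F
                    && (h[2] & 0xFF) == 0xA7
                    && (h[3] & 0xFF) == 0x94;  // "?xm" in EBCDIC
            boolean highBits = (h[1] & 0x80) != 0
                    && (h[2] & 0x80) != 0
                    && (h[3] & 0x80) != 0;     // three non-ASCII bytes
            if (xmlDecl || highBits) return "Cp037";
        }
        return "ISO-8859-1";                   // all four bytes non-zero
    }

    public static void main(String[] args) {
        byte[] utf16le = { 0x3C, 0x00, 0x3F, 0x00 };   // "<?" in UTF-16LE
        System.out.println(guess(utf16le));
    }
}
```

The result is only a default to start decoding with; text that does not begin with ASCII-range characters can defeat these patterns.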
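3. Detecting UTF-8 without a BOM

A detection algorithm has been proposed online: following the UTF-8 encoding scheme, count the leading 1 bits of each lead byte and check that the matching number of continuation bytes follows.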

http://mail.nl.linux.org/linux-utf8/1999-09/msg00110.html
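A structural check in the spirit of that idea might look like the sketch below. It is not the algorithm from the linked post, only an illustration under stated limits: it accepts 1-byte to 4-byte sequences, verifies that each lead byte is followed by the right number of 10xxxxxx bytes, and does not reject overlong forms or surrogate code points. `Utf8Validator` is an invented name.

```java
class Utf8Validator {
    /** True if data consists only of structurally well-formed UTF-8 sequences. */
    public static boolean isValidUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int following;                               // continuation bytes expected
            if (b < 0x80) following = 0;                 // 0xxxxxxx
            else if ((b & 0xE0) == 0xC0) following = 1;  // 110xxxxx
            else if ((b & 0xF0) == 0xE0) following = 2;  // 1110xxxx
            else if ((b & 0xF8) == 0xF0) following = 3;  // 11110xxx
            else return false;                           // stray 10xxxxxx or 5/6-byte lead

            if (i + following >= data.length) return false;   // sequence is cut off
            for (int k = 1; k <= following; k++) {
                if ((data[i + k] & 0xC0) != 0x80) return false; // must be 10xxxxxx
            }
            i += following + 1;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xE4, (byte) 0xB8, (byte) 0xAD };  // U+4E2D in UTF-8
        byte[] gbk = { (byte) 0xD6, (byte) 0xD0 };                // the same character as a GB code
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(gbk));
    }
}
```

Pure ASCII input passes this check trivially, so a positive result is only meaningful evidence for UTF-8 when at least one multi-byte sequence was seen.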

Compiled by Zhu Liang from online sources.
