Converting GBK to UTF-8 Encoding

大罗罗的马拉松

Category: Java Basics

Article link: https://blog.csdn.net/lxqluo/article/details/51567892

Handles English, special characters, and Chinese whose UTF-8 encoding takes 1, 2, or 3 bytes.

http://www.cnblogs.com/chenwenbiao/archive/2011/08/11/2134503.html UTF-8 encoding rules

http://www.jianshu.com/p/07b578adfbf8 Introduction to UTF-8

http://www.blogjava.net/pengpenglin/archive/2010/02/22/313669.html Implementation in plain code

http://www.iteye.com/problems/94345 Bit operations

http://tool.lu/hexconvert/ Online converter between number bases

http://baike.baidu.com/link?url=X8kyAipcbMOtbpsMb0o-Zs20OSpKepuYFNfEdCvmezlCSKY-a4wUGuvQ9dZOa90zVCHKVq0cddRH3q7O3mBLyq Hexadecimal

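The spread of the Internet created a strong demand for a single, unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways of implementing Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, depending on the symbol.

The UTF-8 encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, UTF-8 is therefore identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all 1, bit n+1 is 0, and the first two bits of each following byte are 10. All remaining bits hold the symbol's Unicode code point.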

The table below summarizes the encoding rules; the letter x marks the bits available for the code point.

Unicode range       | UTF-8 encoding
(hexadecimal)       | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
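As a minimal sketch of how the table can be read in code, the hypothetical helper `utf8Length` below (not part of the original article) maps a code point to the number of bytes its row calls for, and `main` compares that with what the JDK's own UTF-8 encoder produces for a few sample code points.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {

    /** Returns the number of UTF-8 bytes the table assigns to a Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;    // 0xxxxxxx
        if (codePoint <= 0x7FF) return 2;   // 110xxxxx 10xxxxxx
        if (codePoint <= 0xFFFF) return 3;  // 1110xxxx 10xxxxxx 10xxxxxx
        return 4;                           // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    }

    public static void main(String[] args) {
        // Sample code points chosen for illustration: A, the middle dot, a CJK character, an emoji
        int[] samples = {0x41, 0xB7, 0x4E25, 0x1F600};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int jdk = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: table says %d, JDK says %d%n", cp, utf8Length(cp), jdk);
        }
    }
}
```

The surrogate range D800-DFFF is not treated specially here; those values are not valid characters on their own.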

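Below, the Chinese character 严 serves as an example of how to produce a UTF-8 encoding.

The Unicode code point of 严 is 4E25 (binary 100111000100101). According to the table, 4E25 falls in the third row (0000 0800-0000 FFFF), so its UTF-8 encoding needs three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx.

Then, starting from the last binary digit of 严, fill the x positions of the format from back to front, and set any leftover x positions to 0. This gives the UTF-8 encoding 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.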
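The same filling-in can be done with shifts and masks. The sketch below is an illustration only; the class and method names (`YanUtf8Sketch`, `encodeThreeBytes`, `toHex`) are made up for this example. It splits the 16 bits of a code point into groups of 4, 6, and 6 and adds the marker bits.

```java
class YanUtf8Sketch {

    /** Encodes a code point in the range U+0800..U+FFFF as three UTF-8 bytes. */
    static byte[] encodeThreeBytes(int cp) {
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),          // 1110xxxx: top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (cp & 0x3F))          // 10xxxxxx: low 6 bits
        };
    }

    /** Formats bytes as upper-case hexadecimal, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encodeThreeBytes(0x4E25))); // prints E4B8A5
    }
}
```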

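Converting the character "·" (the name separator; with the Sogou input method it is typed with the key to the left of the digit 1), which takes 2 bytes in UTF-8:

Note that once text is held in a Java String, each char is a UTF-16 code unit, so the number obtained in step ① is the character's Unicode code point rather than its GBK byte value.

① Get the binary code of each character

int m = (int) c[i]; // for "·" this gives decimal 183, which is 1011 0111 in binary

② Pad that binary string to the number of x positions in a 2-byte sequence

The 0080-07FF row of the table has 11 x positions, so pad with leading zeros to 11 bits:

000 1011 0111

③ Insert 110 at the start of the string and 10 at index 8 (before the 9th character), which gives 2 bytes

11000010 10110111

④ Convert each of the 2 bytes to hexadecimal to get the final UTF-8 encoding: C2 B7.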

// Step 2: "·" takes 2 bytes in UTF-8, so the array holds 2 bytes
byte[] fullByte = new byte[2];
// The binary string from step ③, with the marker bits already inserted
String tot = "1100001010110111";
String s1 = tot.substring(0, 8);
String s2 = tot.substring(8, 16);
// Step 3-6: parse each 8-bit string as a base-2 integer and narrow it to a byte
byte b0 = Integer.valueOf(s1, 2).byteValue();
byte b1 = Integer.valueOf(s2, 2).byteValue();
// Step 3-7: store the 2 bytes in order
fullByte[0] = b0;
fullByte[1] = b1;
return fullByte;
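Steps ① to ④ can also be run end to end instead of hard-coding the binary string. The following is a sketch under that assumption; `TwoByteStepsSketch` and `encodeTwoBytes` are names invented for this example, and the method accepts only code points in the 2-byte range.

```java
class TwoByteStepsSketch {

    /** Encodes a code point in U+0080..U+07FF by padding to 11 bits and inserting marker bits. */
    static byte[] encodeTwoBytes(int cp) {
        if (cp < 0x80 || cp > 0x7FF) {
            throw new IllegalArgumentException("outside the 2-byte range: " + cp);
        }
        String bits = Integer.toBinaryString(cp);   // step 1: e.g. 10110111 for 183
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < 11; i++) {  // step 2: pad with leading zeros to 11 bits
            sb.append('0');
        }
        sb.append(bits);                            // e.g. 00010110111
        sb.insert(0, "110");                        // step 3: marker bits of the first byte
        sb.insert(8, "10");                         // step 3: marker bits of the second byte
        return new byte[] {                         // step 4: parse each 8-bit group
            (byte) Integer.parseInt(sb.substring(0, 8), 2),
            (byte) Integer.parseInt(sb.substring(8, 16), 2)
        };
    }

    public static void main(String[] args) {
        byte[] b = encodeTwoBytes(183);
        System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // prints C2 B7
    }
}
```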

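Below is a complete example that handles Chinese, English, and special symbols taking 1 to 3 bytes (the code borrows from online resources; only the part that handles 2-byte characters is newly added):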

package example.encoding;

import java.nio.charset.StandardCharsets;

public class GBK2UTF8 {

    /**
     * The main method.
     *
     * @param args the arguments
     */
    public static void main(String[] args) {
        System.out.println(String.valueOf(183));
        System.out.println((char) 1500);
        GBK2UTF8 convert = new GBK2UTF8();
        byte[] fullByte = convert.gbk2utf8("木ל图ڥ木西·阿拉ÁÉΣ");
        String fullStr = new String(fullByte, StandardCharsets.UTF_8);
        System.out.println("string from GBK to UTF-8 byte:" + fullStr);
        byte[] fullByte2 = getUTF8BytesFromGBKString("木ל图ڥ木西·阿拉ÁÉΣ");
        String fullStr2 = new String(fullByte2, StandardCharsets.UTF_8);
        System.out.println("2 - string from GBK to UTF-8 byte:" + fullStr2);
    }
    /**
     * Encodes a string as UTF-8 by building binary strings and inserting the marker bits.
     * Chars are handled one at a time, so surrogate pairs (code points above U+FFFF) are not supported.
     *
     * @param chinese the text to encode
     * @return the UTF-8 bytes
     */
    public byte[] gbk2utf8(String chinese) {
        // Step 1: get the char array; one Chinese character corresponds to one c[i]
        char[] c = chinese.toCharArray();
        // Step 2: a char needs at most 3 UTF-8 bytes, so allocate 3 bytes per char
        byte[] fullByte = new byte[3 * c.length];
        int k = 0;
        // Step 3: convert each char to its UTF-8 bytes
        for (int i = 0; i < c.length; i++) {
            // Step 3-1: turn the char's code into a binary string
            int m = (int) c[i];
            String word = Integer.toBinaryString(m);
            System.out.println(word);
            if (m < 128) {
                // 1 byte: ASCII is copied unchanged
                fullByte[k++] = (byte) m;
            } else if (m <= 0x7FF) {
                // 2 bytes, e.g. the name separator · (binary 11000010 10110111)
                // UCS-2 (hex)    UTF-8 byte stream (binary)
                // 0000 - 007F    0xxxxxxx
                // 0080 - 07FF    110xxxxx 10xxxxxx
                // 0800 - FFFF    1110xxxx 10xxxxxx 10xxxxxx
                StringBuilder sb = new StringBuilder();
                // Step 3-2: the second row has 11 x's, so pad to 11 bits
                int len = 11 - word.length();
                for (int j = 0; j < len; j++) {
                    sb.append("0");
                }
                // Step 3-3: the padded binary code, e.g. 00010110111
                sb.append(word);
                System.out.println(sb.toString());
                // Step 3-4: the key step. The first byte starts with 110 and the second
                // with 10, so insert those marker bits. The length grows from 11 to 16.
                sb.insert(0, "110");
                sb.insert(8, "10");
                System.out.println(sb.toString()); // 00010110111 becomes 1100001010110111
                // Step 3-5: cut the string into 2 bytes
                String s1 = sb.substring(0, 8);
                String s2 = sb.substring(8, 16);
                // Step 3-6 and 3-7: parse each group as base 2 and store the bytes in order
                fullByte[k++] = Integer.valueOf(s1, 2).byteValue();
                fullByte[k++] = Integer.valueOf(s2, 2).byteValue();
            } else {
                // 3 bytes: Chinese characters
                // Step 3-2: pad the binary value to 16 bits (2 bytes)
                StringBuilder sb = new StringBuilder();
                int len = 16 - word.length();
                for (int j = 0; j < len; j++) {
                    sb.append("0");
                }
                // Step 3-3: the padded binary code, e.g. 1000 0010 0111 1010
                sb.append(word);
                // Step 3-4: the key step. The first byte starts with 1110 and the second
                // and third with 10. The length grows from 16 to 16+4+2+2=24.
                sb.insert(0, "1110");
                sb.insert(8, "10");
                sb.insert(16, "10");
                System.out.println(sb.toString());
                // Step 3-5: cut the string into 3 bytes
                String s1 = sb.substring(0, 8);
                String s2 = sb.substring(8, 16);
                String s3 = sb.substring(16);
                // Step 3-6 and 3-7: parse each group as base 2 and store the bytes in order
                fullByte[k++] = Integer.valueOf(s1, 2).byteValue();
                fullByte[k++] = Integer.valueOf(s2, 2).byteValue();
                fullByte[k++] = Integer.valueOf(s3, 2).byteValue();
            }
            // Step 3-8: go on to the next char
        }
        // Trim the array if fewer than 3 bytes per char were used
        if (k < fullByte.length) {
            byte[] tmp = new byte[k];
            System.arraycopy(fullByte, 0, tmp, 0, k);
            return tmp;
        }
        return fullByte;
    }

    /** Same conversion with bit operations; surrogate pairs are not handled. */
    public static byte[] getUTF8BytesFromGBKString(String gbkStr) {
        int n = gbkStr.length();
        byte[] utfBytes = new byte[3 * n];
        int k = 0;
        for (int i = 0; i < n; i++) {
            int m = gbkStr.charAt(i);
            if (m < 128) {
                utfBytes[k++] = (byte) m;
                continue;
            } else if (m <= 0x7FF) {
                // UCS-2 (hex)    UTF-8 byte stream (binary)
                // 0000 - 007F    0xxxxxxx
                // 0080 - 07FF    110xxxxx 10xxxxxx
                // 0800 - FFFF    1110xxxx 10xxxxxx 10xxxxxx
                // · in binary is 1011 0111
                // 0xc0 = 11000000
                // 0x80 = 10000000
                // 0x3f = 111111
                utfBytes[k++] = (byte) (0xc0 | (m >> 6)); // drop the low 6 bits; for · (1011 0111) only the leading 10 remains
                utfBytes[k++] = (byte) (0x80 | (m & 0x3f)); // keep the low 6 bits
                continue;
            }
            // 3 bytes: top 4 bits, middle 6 bits, low 6 bits
            utfBytes[k++] = (byte) (0xe0 | (m >> 12));
            utfBytes[k++] = (byte) (0x80 | ((m >> 6) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        }
        // Trim the array if fewer than 3 bytes per char were used
        if (k < utfBytes.length) {
            byte[] tmp = new byte[k];
            System.arraycopy(utfBytes, 0, tmp, 0, k);
            return tmp;
        }
        return utfBytes;
    }
}
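The methods above start from a Java String, which already holds Unicode. When the input is raw GBK bytes (for example, read from a file), the bytes have to be decoded first. The sketch below shows one way to do that with the JDK's charset support. It assumes the runtime ships the GBK charset (standard JDK builds normally do, but `Charset.forName` throws `UnsupportedCharsetException` otherwise), and the class name `GbkBytesToUtf8Sketch` is made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkBytesToUtf8Sketch {

    /** Decodes GBK bytes into a String, then re-encodes that String as UTF-8. */
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        // Assumption: the GBK charset is available in this Java runtime
        String decoded = new String(gbkBytes, Charset.forName("GBK"));
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gbk = {(byte) 0xD1, (byte) 0xCF}; // the GBK bytes of 严
        StringBuilder hex = new StringBuilder();
        for (byte b : gbkToUtf8(gbk)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        System.out.println(hex); // prints E4B8A5
    }
}
```

Leaving both directions to the JDK's encoders also covers 4-byte characters and malformed input, which the hand-written methods do not handle.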
