Chinese Character Issues with String

yy6060


    import java.io.UnsupportedEncodingException;

    public class StringEncoding {
        public static void main(String[] args) {
            try {
                // Part 1: two Chinese characters.
                String str = "中国";
                System.out.println("第一部分-中国------------------------------------------");
                // Number of chars (UTF-16 code units), independent of any byte encoding.
                System.out.println("str.length()：" + str.length());
                // Byte count in the platform default charset.
                System.out.println("str.getBytes().length:" + str.getBytes().length);
                // Decode the default-charset bytes as UTF-8, then count chars and bytes.
                System.out.println("new String(str.getBytes(), 'UTF-8').length():" + new String(str.getBytes(), "UTF-8").length());
                System.out.println("new String(str.getBytes(), 'UTF-8').getBytes().length:" + new String(str.getBytes(), "UTF-8").getBytes().length);
                System.out.println("--------------------------------------------------");

                // Part 2: the same checks on mixed ASCII and Chinese text.
                String str2 = "abc中国";
                System.out.println("第二部分-abc中国------------------------------------------");
                System.out.println("str2.length():" + str2.length());
                System.out.println("str2.getBytes().length:" + str2.getBytes().length);
                System.out.println("new String(str2.getBytes(), 'UTF-8').length():" + new String(str2.getBytes(), "UTF-8").length());
                System.out.println("new String(str2.getBytes(), 'UTF-8').getBytes().length:" + new String(str2.getBytes(), "UTF-8").getBytes().length);
                System.out.println("--------------------------------------------------");
            } catch (UnsupportedEncodingException e) {
                // Not expected: every Java platform is required to support UTF-8.
                e.printStackTrace();
            }
        }
    }

Output:

    第一部分-中国------------------------------------------
    str.length()：2
    str.getBytes().length:6
    new String(str.getBytes(), 'UTF-8').length():2
    new String(str.getBytes(), 'UTF-8').getBytes().length:6
    --------------------------------------------------
    第二部分-abc中国------------------------------------------
    str2.length():5
    str2.getBytes().length:9
    new String(str2.getBytes(), 'UTF-8').length():5
    new String(str2.getBytes(), 'UTF-8').getBytes().length:9
    --------------------------------------------------

Question: why does it produce this result?

Preliminary answer:

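String str = "中国";
System.out.println(str.length()); // Nothing surprising so far (2); the key part is below.
System.out.println(str.getBytes().length); // getBytes() without an argument encodes with the platform default charset. This walkthrough assumes that default is GBK, where one Chinese character takes 2 bytes, giving 4. The output listed above (6 bytes) is what a UTF-8 default produces instead.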
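The dependence of the byte count on the charset can be checked directly by naming the charset instead of relying on the default. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the runtime provides the GBK charset (standard JDK builds do, but it is not one of the charsets every Java platform must support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthByCharset {
    // Number of bytes the text occupies when encoded in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "\u4e2d\u56fd" is the string 中国, written with escapes so the
        // example does not depend on the encoding of the source file.
        String str = "\u4e2d\u56fd";
        // Assumption: GBK is available on this runtime.
        Charset gbk = Charset.forName("GBK");

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("GBK bytes:   " + encodedLength(str, gbk));                     // 4
        System.out.println("UTF-8 bytes: " + encodedLength(str, StandardCharsets.UTF_8)); // 6
    }
}
```

The first line printed shows which of the two cases an argument-less getBytes() falls into on the machine at hand.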

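System.out.println(new String(str.getBytes(),"UTF-8").length()); // This decodes the GBK bytes as if they were UTF-8. Keep in mind that a Chinese character normally takes 3 bytes in UTF-8, while an ASCII character takes 1 byte.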

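// The decoder scans the bytes from left to right. The leading bits of each byte determine how long the sequence should be: a byte that matches the 1-byte ASCII pattern is decoded on its own, a byte that opens a 2-byte pattern is decoded together with the next byte if that byte is a valid continuation (10xxxxxx), and so on:
// * 0xxxxxxx (00-7f) ASCII
// * 110xxxxx 10xxxxxx (c0-df)(80-bf) characters between ASCII and Chinese characters (U+0080 to U+07FF)
// * 1110xxxx 10xxxxxx 10xxxxxx (e0-ef)(80-bf)(80-bf) Chinese characters
// * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (f0-f7)(80-bf)(80-bf)(80-bf)
// * 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (f8-fb)(80-bf)(80-bf)(80-bf)(80-bf)
// * 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (fc-fd)(80-bf)(80-bf)(80-bf)(80-bf)(80-bf)
// Note: the 5- and 6-byte forms come from the original UTF-8 design. Current UTF-8 (RFC 3629) stops at 4 bytes, so a lead byte in f8-fd is always invalid.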
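The role of the lead byte in the table can be sketched as a small classifier. This is an illustration, not how any particular decoder is implemented: the class and method names are hypothetical, and the ranges follow current UTF-8 (at most 4 bytes), so they are slightly narrower than the table above.

```java
class Utf8LeadByte {
    // Returns the sequence length announced by a lead byte, or 0 if the byte
    // cannot start a sequence (a continuation byte, or a value never valid
    // as a lead byte in current UTF-8).
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                       // Java bytes are signed; mask to 0..255
        if (v <= 0x7F) return 1;                // 0xxxxxxx
        if (v >= 0xC2 && v <= 0xDF) return 2;   // 110xxxxx (c0 and c1 would be overlong)
        if (v >= 0xE0 && v <= 0xEF) return 3;   // 1110xxxx
        if (v >= 0xF0 && v <= 0xF4) return 4;   // 11110xxx (f5 and above exceed U+10FFFF)
        return 0;                               // 10xxxxxx or an invalid lead byte
    }

    public static void main(String[] args) {
        // 'a' followed by the four GBK bytes of 中国.
        int[] samples = {0x61, 0xD6, 0xD0, 0xB9, 0xFA};
        for (int s : samples) {
            System.out.printf("%02x -> %d%n", s, sequenceLength((byte) s));
        }
        // prints: 61 -> 1, d6 -> 2, d0 -> 2, b9 -> 0, fa -> 0
    }
}
```

A return value of 2 only says that the byte asks for one continuation byte; whether the next byte actually is one has to be checked separately.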

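// Compare the UTF-8 table above with the bytes returned by getBytes(): in binary they are 11010110 11010000 10111001 11111010 (you can print them yourself; just take care when converting the negative byte values).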
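Printing the bits takes a little care because Java bytes are signed. The following sketch (class and method names are illustrative) masks each byte with 0xFF before formatting; the byte values are hard-coded as the GBK encoding of 中国 so the example does not depend on the default charset.

```java
class BinaryDump {
    // Formats each byte as 8 binary digits, separated by spaces.
    static String toBinary(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // b & 0xFF turns a negative byte such as -42 into 214 (0xD6).
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros to a fixed width of 8 digits.
            sb.append("00000000".substring(bits.length())).append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xB9, (byte) 0xFA};
        System.out.println(toBinary(gbkBytes));
        // prints: 11010110 11010000 10111001 11111010
    }
}
```

Without the mask, Integer.toBinaryString would sign-extend a negative byte and print 32 digits.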

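// The first byte cannot be combined with any of the following bytes into a UTF-8 sequence: a byte starting with 110 only fits the second row above, but the next byte is not of the form 10xxxxxx. The first byte therefore has to be decoded on its own, and since it is not valid by itself it becomes one replacement character.

// The second and third bytes, 11010000 10111001, fit the second row of the table exactly, so the two are decoded together as one UTF-8 character (U+0439).

// Finally, 11111010 is decoded on its own; it cannot start a valid sequence either, so it also yields one character.

// So decoding as UTF-8 yields 3 characters, and the length is 3.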
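The walkthrough can be reproduced with hard-coded bytes, which removes the dependence on the default charset. This is a sketch with illustrative names; it uses the String(byte[], Charset) constructor, which is documented to replace malformed input with the charset's replacement string (U+FFFD for UTF-8).

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    // Decodes arbitrary bytes as UTF-8; malformed input becomes U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The GBK encoding of 中国: d6 d0 b9 fa.
        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xB9, (byte) 0xFA};
        String decoded = decodeAsUtf8(gbkBytes);

        System.out.println("length: " + decoded.length()); // length: 3
        for (int i = 0; i < decoded.length(); i++) {
            System.out.printf("U+%04X%n", (int) decoded.charAt(i));
        }
        // prints U+FFFD, U+0439, U+FFFD:
        // d6 alone is malformed, d0 b9 is a valid 2-byte sequence, fa is malformed.
    }
}
```

Prefixing the array with the bytes of "abc" (61 62 63) gives a decoded length of 6 by the same reasoning.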

System.out.println(new String(str.getBytes(),"UTF-8").getBytes().length);

String str2 ="abc中国";

System.out.println(str2.length());

System.out.println(str2.getBytes().length);

System.out.println(new String(str2.getBytes(),"UTF-8").length());

//01100001 01100010 01100011 11010110 11010000 10111001 11111010

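// Decoding from left to right as before, the first three bytes clearly match the first range and decode to a, b and c; the remaining bytes follow the analysis above, so the length is 6.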

// By the way, to avoid garbled text, encode with getBytes("utf-8") so that the bytes match the charset used for decoding.
System.out.println(new String(str2.getBytes(),"UTF-8").getBytes().length);
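A minimal sketch of that advice, with illustrative names: encoding and decoding with the same explicitly named charset round-trips the text regardless of the platform default. It uses StandardCharsets.UTF_8 rather than the string "utf-8", which also avoids the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encodes to UTF-8 and decodes with the same charset.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "abc" followed by 中国, written with escapes so the example
        // does not depend on the encoding of the source file.
        String str2 = "abc\u4e2d\u56fd";
        String copy = roundTripUtf8(str2);

        System.out.println(copy.equals(str2));                               // true
        System.out.println(copy.length());                                   // 5
        System.out.println(copy.getBytes(StandardCharsets.UTF_8).length);    // 9
    }
}
```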
