As the title says: understanding Java character encoding.
First, some code to aid understanding:
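import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.Arrays;

// The statements below belong inside a method declared with "throws IOException".
// Single-byte charset: every byte value maps to exactly one character
Charset iso88591 = Charset.forName("iso-8859-1");
// Double-byte charset for Chinese characters (ASCII stays one byte)
Charset gbk = Charset.forName("gbk");
// Variable-length charset
Charset utf8 = Charset.forName("utf-8");
// Original string
String src = "中文";
String a = "中";
String c = "文";
// Concatenated at run time, so this is a new String object, not the literal "中文"
String other = a + c;
// Byte array encoded as UTF-8
byte[] utf8Bytes = src.getBytes(utf8);
// String wrongly decoded as iso-8859-1 (garbled)
String wrongStr = new String(utf8Bytes, iso88591);
// String wrongly decoded as gbk (still garbled)
String wrongStr2 = new String(utf8Bytes, gbk);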
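// == compares object references, not contents: prints false
System.out.println(src == other);
// Encode to UTF-16 ("unicode" is a Java alias for UTF-16) and decode again
String b = new String(src.getBytes("unicode"), "unicode");
System.out.println(b);
// Prints the array's type and hash, not its contents; use Arrays.toString for the bytes
System.out.println(utf8Bytes);
if (src == b) {
    System.out.println("src==b");
} else {
    // This branch runs: b has the same contents as src but is a different object
    System.out.println("src!=b");
    System.out.println(src + " " + b);
    // getBytes() without an argument encodes with the platform default charset
    System.out.println(Arrays.toString(src.getBytes()));
    System.out.println(Arrays.toString(b.getBytes()));
}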
System.out.println("wrongStr-iso88591-decoding = " + wrongStr + " len=" + wrongStr.length());
System.out.println("wrongStr-gbk-decoding = " + wrongStr2 + " len=" + wrongStr2.length());
System.out.println("original-utf8-bytes = " + Arrays.toString(utf8Bytes));
System.out.println("original-gbk-bytes = " + Arrays.toString(src.getBytes("GBK")));
System.out.println("original-unicode-bytes = " + Arrays.toString(src.getBytes("unicode")));
// iso-8859-1 cannot represent Chinese characters, so each one is encoded as '?' (63)
System.out.println("original-iso-8859-1-bytes = " + Arrays.toString(src.getBytes("iso-8859-1")));
// Recover the UTF-8 byte array from the string wrongly decoded as iso-8859-1: reversible
byte[] resumeBytes = wrongStr.getBytes(iso88591);
String rightStr = new String(resumeBytes, utf8);
// Recover the UTF-8 byte array from the string wrongly decoded as gbk: not reliably reversible,
// because any byte sequence that is invalid in gbk is replaced with U+FFFD while decoding
byte[] resumeBytes2 = wrongStr2.getBytes(gbk);
String rightStr2 = new String(resumeBytes2, utf8);
System.out.println("resume-iso88591-utf8-bytes = " + Arrays.toString(resumeBytes));
System.out.println("resume-gbk-utf8-bytes = " + Arrays.toString(resumeBytes2));
System.out.println(rightStr);
System.out.println(rightStr2);
String gbkfile = "gbk.txt";
String utf8file = "utf8.txt";
File gbkF = new File(gbkfile);
File utf8F = new File(utf8file);
// No charset is given, so InputStreamReader decodes with the platform default charset
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(utf8F)));
String data = null;
while ((data = br.readLine()) != null) {
    System.out.println(Arrays.toString(data.getBytes("gbk")));
    System.out.println(Arrays.toString(data.getBytes("utf8")));
    System.out.println(data);
    // Undo a wrong gbk decoding: re-encode as gbk, then decode as UTF-8
    System.out.println(new String(data.getBytes("gbk"), "utf8"));
}
br.close();
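The "reversible" and "not reversible" comments in the listing can be checked directly. The following is only a minimal sketch: the class name RoundTripDemo and the helper survivesRoundTrip are made up for illustration. It decodes bytes with the wrong charset, encodes them back with the same charset, and compares the result with the input.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Hypothetical helper: decode the bytes with `wrong`, encode the resulting
    // string with `wrong` again, and report whether the bytes are unchanged.
    public static boolean survivesRoundTrip(byte[] original, Charset wrong) {
        String decoded = new String(original, wrong);
        byte[] restored = decoded.getBytes(wrong);
        return Arrays.equals(original, restored);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF
        byte[] allBytes = new byte[256];
        for (int i = 0; i < allBytes.length; i++) {
            allBytes[i] = (byte) i;
        }
        // ISO-8859-1 maps each byte to its own character, so nothing is lost
        System.out.println(survivesRoundTrip(allBytes, StandardCharsets.ISO_8859_1)); // true

        // UTF-8 bytes of the two characters from the listing, written with
        // escapes so the encoding of this source file does not matter
        byte[] utf8Bytes = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println(survivesRoundTrip(utf8Bytes, StandardCharsets.ISO_8859_1)); // true
        // GBK replaces byte pairs it cannot decode with U+FFFD,
        // so whether this survives depends on the input bytes
        System.out.println(survivesRoundTrip(utf8Bytes, Charset.forName("GBK")));
    }
}
```

The GBK line is deliberately left without an expected value: it is the case the listing calls not reversible, and the outcome depends on which byte pairs happen to be valid in GBK.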
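The Chinese characters in a file are encoded in the file's own charset.
readLine decodes with the platform default charset (GBK on Chinese Windows; since JDK 18 the default is UTF-8 on every platform unless overridden). It looks up the characters that the file's bytes represent in that charset, then stores those characters in Java memory as Unicode (UTF-16 code units).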
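To see which charset readLine is actually using on a given machine, something like the sketch below may help. The class name DefaultCharsetDemo and the helper readLineWithDefault are invented for this example; the helper reads from an in-memory byte array instead of a real file, but builds the InputStreamReader without a charset in the same way as the listing.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    // Hypothetical helper: read one line through an InputStreamReader that
    // is given no charset and therefore decodes with the platform default.
    public static String readLineWithDefault(byte[] fileBytes) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes)))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The charset every "no charset given" API falls back to
        System.out.println("default charset = " + Charset.defaultCharset());

        // Bytes as they would appear in a GBK-encoded file
        byte[] gbkBytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        // Readable only if the default charset happens to be GBK
        System.out.println(readLineWithDefault(gbkBytes));
    }
}
```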
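getBytes(charset) first takes the characters that the in-memory Unicode data represents, then looks up the bytes for those characters in the target charset (the argument).
new String(bytes, charset) looks up the characters that the given bytes represent in the specified charset, then stores those characters in Java memory as Unicode.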
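The two directions can be put side by side in a short sketch. The class name EncodeDecodeDemo and the helper encodedLength are assumptions made for this example; the charsets are the ones used in the listing, with StandardCharsets.UTF_16 standing in for the "unicode" alias.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodeDecodeDemo {
    // Hypothetical helper: how many bytes the text occupies in a charset
    public static int encodedLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // The two characters from the listing, written with escapes so the
        // encoding of this source file does not matter
        String text = "\u4e2d\u6587";

        // Encoding: same characters, different bytes per charset
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8));
        // [-28, -72, -83, -26, -106, -121]
        System.out.println(Arrays.toString(text.getBytes(Charset.forName("GBK"))));
        // [-42, -48, -50, -60]
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.UTF_16)));
        // [-2, -1, 78, 45, 101, -121]  (a 2-byte byte order mark, then 2 bytes per character)
        System.out.println(Arrays.toString(text.getBytes(StandardCharsets.ISO_8859_1)));
        // [63, 63]  (unmappable characters become '?')

        // Decoding with the charset that produced the bytes restores the text
        System.out.println(text.equals(new String(utf8, StandardCharsets.UTF_8))); // true
    }
}
```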
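So on Windows, where the default is typically GBK, readLine reads a GBK file without problems but garbles a UTF-8 file, and the garbled text then has to be repaired with new String(data.getBytes("gbk"), "utf8"). That repair only works when the wrong decoding lost no bytes; passing the file's actual charset to InputStreamReader avoids the garbling in the first place.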
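Reading with an explicit charset might look like the sketch below. The class name ExplicitCharsetRead and the helper writeThenRead are hypothetical; the helper writes the text to a temporary file in one charset and reads the first line back with another, so matching and mismatched cases can be compared without touching the platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {
    // Hypothetical helper: write `text` to a temp file encoded as `writeCs`,
    // then read the first line back, decoding explicitly with `readCs`.
    public static String writeThenRead(String text, Charset writeCs, Charset readCs) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(file, text.getBytes(writeCs));
                // The charset is passed explicitly instead of relying on the default
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(Files.newInputStream(file), readCs))) {
                    return br.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587";
        Charset gbk = Charset.forName("GBK");
        // Reader charset matches the file: the text comes back intact
        System.out.println(text.equals(writeThenRead(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8))); // true
        System.out.println(text.equals(writeThenRead(text, gbk, gbk))); // true
        // UTF-8 file read as GBK: garbled
        System.out.println(text.equals(writeThenRead(text, StandardCharsets.UTF_8, gbk))); // false
    }
}
```

With the charset stated at the reader, no repair step is needed afterwards, and the result no longer depends on the machine the program runs on.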
Take particular care when reading files and when receiving parameters passed in from other programs.