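In Java, String.getBytes(String charset) conceptually walks through the characters of the string and encodes them one at a time into the target charset. Strictly speaking, the unit of encoding is the code point rather than the char: a supplementary character stored as a surrogate pair (two chars) is encoded as a single unit.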
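To make the char versus code point distinction concrete, here is a minimal sketch. The class name SurrogatePairDemo and the helper hex are hypothetical names introduced only for this illustration; it compares a BMP character (U+6C49) with a supplementary character (U+1F600) under UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original test program.
public class SurrogatePairDemo {

    // Formats bytes as space-separated lowercase hex, e.g. "e6 b1 89".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String han = "\u6c49";          // U+6C49: one char, one code point
        String emoji = "\uD83D\uDE00";  // U+1F600: two chars, one code point

        System.out.println(han.length() + " char(s) -> "
                + hex(han.getBytes(StandardCharsets.UTF_8)));
        System.out.println(emoji.length() + " char(s), "
                + emoji.codePointCount(0, emoji.length()) + " code point(s) -> "
                + hex(emoji.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The surrogate pair comes out as one 4-byte UTF-8 sequence (f0 9f 98 80), not as two separately encoded 3-byte sequences, which is why "one char at a time" is only an approximation.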
import java.io.UnsupportedEncodingException;

public class CharsetTest {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s3 = "\u0061";
        String s4 = "\u6c49";
        System.out.println(s3);
        System.out.println(s4 + "\n");

        System.out.println("test string.getChars(...):");
        String s = "你好lkf&*";
        printChars(s);
        System.out.println();

        System.out.println("test string.getBytes(charset):\n");
        String s1 = "汉";
        String s2 = "a";
        // This source file itself is saved as UTF-8.
        System.out.println("Encoding of \"" + s1 + "\":");
        printEncoding(s1, null);
        System.out.println("-------------------------");
        System.out.println("Encoding of \"" + s2 + "\":");
        printEncoding(s2, null);
        System.out.println("\nBOM: byte order mark; fe ff means big-endian, ff fe means little-endian");
    }

    // Encodes s1 with each charset name and prints the bytes as hex.
    // Passing null for encodings uses the default list below.
    public static void printEncoding(String s1, String[] encodings) {
        String[] encodes = encodings == null
                ? new String[]{"utf-8", "utf-16", "utf-16le", "utf-16be", "iso-8859-1",
                        "us-ascii", "gbk", "gb2312", "gb18030", "unicode"}
                : encodings;
        for (String encode : encodes) {
            try {
                System.out.print(encode + ":");
                byte[] bytes = s1.getBytes(encode);
                System.out.println(toHexString(bytes));
            } catch (UnsupportedEncodingException e) {
                // The charset name is not supported by this JVM.
                e.printStackTrace();
            }
        }
    }

    // Prints each char (UTF-16 code unit) of s on its own line.
    public static void printChars(String s) {
        char[] chars = new char[s.length()];
        s.getChars(0, s.length(), chars, 0);
        for (char aChar : chars) {
            System.out.println(aChar);
        }
    }

    // Formats bytes as "0x(aa bb cc)".
    public static StringBuilder toHexString(byte[] bytes) {
        StringBuilder b = new StringBuilder("0x(");
        for (int i = 0; i < bytes.length; i++) {
            b.append(Character.forDigit((bytes[i] >> 4) & 0xF, 16)); // high nibble
            b.append(Character.forDigit(bytes[i] & 0xF, 16));        // low nibble
            if (i < bytes.length - 1) {
                b.append(" ");
            }
        }
        b.append(")");
        return b;
    }
}
Output:

a
汉

test string.getChars(...):
你
好
l
k
f
&
*

test string.getBytes(charset):

Encoding of "汉":
utf-8:0x(e6 b1 89)
utf-16:0x(fe ff 6c 49)
utf-16le:0x(49 6c)
utf-16be:0x(6c 49)
iso-8859-1:0x(3f)
us-ascii:0x(3f) // 0x3f is '?': the character cannot be encoded in this charset (same for iso-8859-1)
gbk:0x(ba ba)
gb2312:0x(ba ba)
gb18030:0x(ba ba)
unicode:0x(fe ff 6c 49) // fe ff is the big-endian BOM; "unicode" yields the same bytes as utf-16 here
-------------------------
Encoding of "a":
utf-8:0x(61)
utf-16:0x(fe ff 00 61)
utf-16le:0x(61 00)
utf-16be:0x(00 61)
iso-8859-1:0x(61)
us-ascii:0x(61)
gbk:0x(61)
gb2312:0x(61)
gb18030:0x(61)
unicode:0x(fe ff 00 61)

BOM: byte order mark; fe ff means big-endian, ff fe means little-endian
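The BOM matters mostly when decoding. As a rough sketch (the class name BomDemo and the helper decodeUtf16 are hypothetical, introduced only for this example), Java's "UTF-16" charset reads a leading BOM to choose the byte order, so both byte layouts of U+6C49 decode to the same character:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original test program.
public class BomDemo {

    // Decodes with the "UTF-16" charset, which consumes a leading BOM
    // and uses it to pick big-endian or little-endian byte order.
    public static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndian = {(byte) 0xFE, (byte) 0xFF, 0x6C, 0x49};    // BOM fe ff + 6c 49
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x49, 0x6C}; // BOM ff fe + 49 6c

        System.out.println(decodeUtf16(bigEndian).equals("\u6c49"));
        System.out.println(decodeUtf16(littleEndian).equals("\u6c49"));

        // The explicit-endianness charsets write no BOM when encoding,
        // matching the 2-byte utf-16le / utf-16be results in the output.
        System.out.println("\u6c49".getBytes(StandardCharsets.UTF_16LE).length);
    }
}
```

This is also why utf-16 and unicode produce 4 bytes in the output while utf-16le and utf-16be produce 2: the first two write a BOM so that a decoder can recover the byte order, and the latter two rely on the charset name to carry that information.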
The above is test code I wrote myself.
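Because getBytes silently substitutes 0x3f ('?') for characters a charset cannot represent, it can be useful to check first. A minimal sketch, assuming only the standard java.nio.charset API (the class name CanEncodeDemo and its helper canEncode are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original test program.
public class CanEncodeDemo {

    // Returns true if every character of s can be represented in cs.
    public static boolean canEncode(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String han = "\u6c49"; // U+6C49

        System.out.println("utf-8:      " + canEncode(han, StandardCharsets.UTF_8));
        System.out.println("iso-8859-1: " + canEncode(han, StandardCharsets.ISO_8859_1));
        System.out.println("us-ascii:   " + canEncode(han, StandardCharsets.US_ASCII));
    }
}
```

A false result here corresponds to the 0x(3f) lines in the output: the byte is a replacement, not an encoding of the original character, so decoding it gives back '?' rather than the original text.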
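Reposted reference: 谈谈Unicode编码,简要解释UCS、UTF、BMP、BOM等名词 (a discussion of Unicode encoding with brief explanations of UCS, UTF, BMP, BOM, and related terms) http://www.fmddlmyy.cn/text6.html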