简介
实际项目开发中,我们有时候可能需要将字符串转换成字节数组,而转化字节数组跟编码格式有关,不同的编码格式转化的字节数组不一样。下面列举了java支持的几种编码格式:
- US-ASCII Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
- ISO-8859-1 ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
- UTF-8 Eight-bit UCS Transformation Format
- UTF-16BE Sixteen-bit UCS Transformation Format, big-endian byte order
- UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
- UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark
本文重点讲解的是 UTF-16 编码格式字节数组的转化。UTF-16 顾名思义,就是用两个字节表示一个字符。那么用两个字节表示必然存在字节序的问题,即大端小端的问题。下面就来讲讲 UTF-16BE、UTF-16LE、UTF-16 三者之间的区别吧。
UTF-16BE,其后缀是 BE 即 big-endian,大端的意思。大端就是将高位的字节放在低地址表示。
UTF-16LE,其后缀是 LE 即 little-endian,小端的意思。小端就是将高位的字节放在高地址表示。
UTF-16,没有指定后缀,即不知道其是大小端,所以其开始的两个字节表示该字节数组是大端还是小端。即FE FF表示大端,FF FE表示小端。
代码测试
TestUTF_16.java
public static void main(String[] args) {
String test = "代码";
try {
//UTF-16BE编码格式 大端形式编码
byte[] bytesUTF16BE = test.getBytes("UTF-16BE");
printHex("UTF-16BE", bytesUTF16BE);
//UTF-16LE编码格式 小端形式编码
byte[] bytesUTF16LE = test.getBytes("UTF-16LE");
printHex("UTF-16LE", bytesUTF16LE);
//UTF-16
byte[] bytesUTF16 = test.getBytes("UTF-16");
printHex("UTF-16", bytesUTF16);
//大端编码格式的字节数组转化
String strUTF16BE = new String(bytesUTF16BE, "UTF-16BE");
String strUTF16LE = new String(bytesUTF16BE, "UTF-16LE");
String strUTF16 = new String(bytesUTF16BE, "UTF-16");
System.out.println("strUTF16BE:"+strUTF16BE);
System.out.println("strUTF16LE:"+strUTF16LE);
System.out.println("strUTF16:"+strUTF16);
//小端编码格式的字节数组转化
strUTF16BE = new String(bytesUTF16LE, "UTF-16BE");
strUTF16LE = new String(bytesUTF16LE, "UTF-16LE");
strUTF16 = new String(bytesUTF16LE, "UTF-16");
System.out.println("strUTF16BE:"+strUTF16BE);
System.out.println("strUTF16LE:"+strUTF16LE);
System.out.println("strUTF16:"+strUTF16);
//自带大小端编码格式的字节数组转化
strUTF16BE = new String(bytesUTF16, "UTF-16BE");
strUTF16LE = new String(bytesUTF16, "UTF-16LE");
strUTF16 = new String(bytesUTF16, "UTF-16");
System.out.println("strUTF16BE:"+strUTF16BE);
System.out.println("strUTF16LE:"+strUTF16LE);
System.out.println("strUTF16:"+strUTF16);
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
}
private static void printHex(String type, byte[] data) {
StringBuilder builder = new StringBuilder();
builder.append("type:").append(type).append(": ");
int temp = 0;
for (int i = 0; i < data.length; i++) {
temp = data[i] & 0xFF;
builder.append("0x").append(Integer.toHexString(temp).toUpperCase()).append(" ");
}
System.out.println(builder.toString());
}
从上面的测试结果可以看出来,指定大小端编码格式的,转化为字节数组时不会带FE FF或者FF FE。带有FE FF或者FF FE的字节数组可以转化成指定大小端编码格式的字符串。
源码
看看String源码中对getBytes的实现
/**
* Returns a new byte array containing the characters of this string encoded using the named charset.
*
* <p>The behavior when this string cannot be represented in the named charset
* is unspecified. Use {@link java.nio.charset.CharsetEncoder} for more control.
*
* @throws UnsupportedEncodingException if the charset is not supported
*/
public byte[] getBytes(String charsetName) throws UnsupportedEncodingException {
return getBytes(Charset.forNameUEE(charsetName));
}
/**
* Returns a new byte array containing the characters of this string encoded using the
* given charset.
*
* <p>The behavior when this string cannot be represented in the given charset
* is to replace malformed input and unmappable characters with the charset's default
* replacement byte array. Use {@link java.nio.charset.CharsetEncoder} for more control.
*
* @since 1.6
*/
public byte[] getBytes(Charset charset) {
String canonicalCharsetName = charset.name();
if (canonicalCharsetName.equals("UTF-8")) {
return Charsets.toUtf8Bytes(value, offset, count);
} else if (canonicalCharsetName.equals("ISO-8859-1")) {
return Charsets.toIsoLatin1Bytes(value, offset, count);
} else if (canonicalCharsetName.equals("US-ASCII")) {
return Charsets.toAsciiBytes(value, offset, count);
} else if (canonicalCharsetName.equals("UTF-16BE")) {
return Charsets.toBigEndianUtf16Bytes(value, offset, count);
} else {
CharBuffer chars = CharBuffer.wrap(this.value, this.offset, this.count);
ByteBuffer buffer = charset.encode(chars.asReadOnlyBuffer());
byte[] bytes = new byte[buffer.limit()];
buffer.get(bytes);
return bytes;
}
}