UTF-16BE、UTF-16LE、UTF-16 三者之间的区别

最新推荐文章于 2025-03-07 09:36:03 发布

QQxiaoqiang1573

最新推荐文章于 2025-03-07 09:36:03 发布

阅读量5.4w

点赞数 19

分类专栏： android之路

本文链接：https://blog.csdn.net/QQxiaoqiang1573/article/details/84937863

版权

android之路专栏收录该内容

63 篇文章

订阅专栏

简介

实际项目开发中，我们有时候可能需要将字符串转换成字节数组，而转化字节数组跟编码格式有关，不同的编码格式转化的字节数组不一样。下面列举了java支持的几种编码格式：

US-ASCII Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the Unicode character set
ISO-8859-1 ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
UTF-8 Eight-bit UCS Transformation Format
UTF-16BE Sixteen-bit UCS Transformation Format, big-endian byte order
UTF-16LE Sixteen-bit UCS Transformation Format, little-endian byte order
UTF-16 Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark

本文重点讲解的是 UTF-16 编码格式字节数组的转化。UTF-16 顾名思义，就是用两个字节表示一个字符。那么用两个字节表示必然存在字节序的问题，即大端小端的问题。下面就来讲讲 UTF-16BE、UTF-16LE、UTF-16 三者之间的区别吧。

UTF-16BE，其后缀是 BE 即 big-endian，大端的意思。大端就是将高位的字节放在低地址表示。
UTF-16LE，其后缀是 LE 即 little-endian，小端的意思。小端就是将高位的字节放在高地址表示。
UTF-16，没有指定后缀，即不知道其是大小端，所以其开始的两个字节表示该字节数组是大端还是小端。即FE FF表示大端，FF FE表示小端。

代码测试

TestUTF_16.java

public static void main(String[] args) {
    String test = "代码";

    try {
        //UTF-16BE编码格式 大端形式编码
        byte[] bytesUTF16BE = test.getBytes("UTF-16BE");
        printHex("UTF-16BE", bytesUTF16BE);

        //UTF-16LE编码格式 小端形式编码
        byte[] bytesUTF16LE = test.getBytes("UTF-16LE");
        printHex("UTF-16LE", bytesUTF16LE);

        //UTF-16
        byte[] bytesUTF16 = test.getBytes("UTF-16");
        printHex("UTF-16", bytesUTF16);
        
        //大端编码格式的字节数组转化
        String strUTF16BE = new String(bytesUTF16BE, "UTF-16BE");
        String strUTF16LE = new String(bytesUTF16BE, "UTF-16LE");
        String strUTF16 = new String(bytesUTF16BE, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
        //小端编码格式的字节数组转化
        strUTF16BE = new String(bytesUTF16LE, "UTF-16BE");
        strUTF16LE = new String(bytesUTF16LE, "UTF-16LE");
        strUTF16 = new String(bytesUTF16LE, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
        //自带大小端编码格式的字节数组转化
        strUTF16BE = new String(bytesUTF16, "UTF-16BE");
        strUTF16LE = new String(bytesUTF16, "UTF-16LE");
        strUTF16 = new String(bytesUTF16, "UTF-16");
        System.out.println("strUTF16BE:"+strUTF16BE);
        System.out.println("strUTF16LE:"+strUTF16LE);
        System.out.println("strUTF16:"+strUTF16);
        
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

private static void printHex(String type, byte[] data) {
    StringBuilder builder = new StringBuilder();
    builder.append("type:").append(type).append(":   ");
    int temp = 0;
    for (int i = 0; i < data.length; i++) {
        temp = data[i] & 0xFF;
        builder.append("0x").append(Integer.toHexString(temp).toUpperCase()).append(" ");
    }

    System.out.println(builder.toString());
}

在这里插入图片描述

从上面的测试结果可以看出来，指定大小端编码格式的，转化为字节数组时不会带FE FF或者FF FE。带有FE FF或者FF FE的字节数组可以转化成指定大小端编码格式的字符串。

源码

看看String源码中对getBytes的实现

/**
 * Returns a new byte array containing the characters of this string encoded using the named charset.
 *
 * <p>The behavior when this string cannot be represented in the named charset
 * is unspecified. Use {@link java.nio.charset.CharsetEncoder} for more control.
 *
 * @throws UnsupportedEncodingException if the charset is not supported
 */
public byte[] getBytes(String charsetName) throws UnsupportedEncodingException {
    return getBytes(Charset.forNameUEE(charsetName));
}
/**
 * Returns a new byte array containing the characters of this string encoded using the
 * given charset.
 *
 * <p>The behavior when this string cannot be represented in the given charset
 * is to replace malformed input and unmappable characters with the charset's default
 * replacement byte array. Use {@link java.nio.charset.CharsetEncoder} for more control.
 *
 * @since 1.6
 */
public byte[] getBytes(Charset charset) {
    String canonicalCharsetName = charset.name();
    if (canonicalCharsetName.equals("UTF-8")) {
        return Charsets.toUtf8Bytes(value, offset, count);
    } else if (canonicalCharsetName.equals("ISO-8859-1")) {
    	return Charsets.toIsoLatin1Bytes(value, offset, count);
    } else if (canonicalCharsetName.equals("US-ASCII")) {
        return Charsets.toAsciiBytes(value, offset, count);
    } else if (canonicalCharsetName.equals("UTF-16BE")) {
        return Charsets.toBigEndianUtf16Bytes(value, offset, count);
    } else {
        CharBuffer chars = CharBuffer.wrap(this.value, this.offset, this.count);
        ByteBuffer buffer = charset.encode(chars.asReadOnlyBuffer());
        byte[] bytes = new byte[buffer.limit()];
        buffer.get(bytes);
        return bytes;
    }
}