Java Encoding Issues


Orange Summer


Article link: https://blog.csdn.net/fxt1017748664/article/details/129884865


For background on character encoding, see the article "字符编码详解" (Character Encoding Explained).

String.length()

public class StringBytes {
    public static void main(String[] args) {
        String temp1 = "𝄞";
        String temp2 = "\uD834\uDD1E"; // UTF-16 encoding (a surrogate pair) of the character above
        System.out.println(temp2);
        System.out.println(temp1.length());
        // Count code points over temp1's own length rather than temp2's
        System.out.println(temp1.codePointCount(0, temp1.length()));
    }
}
// Output
𝄞
2
1

/**
 * Returns the length of this string.
 * The length is equal to the number of Unicode code units in the string.
 *
 * @return  the length of the sequence of characters represented by this
 *          object.
 */
public int length() {
    return value.length;
}

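The Javadoc comment on String.length() in the Java source states that the returned length equals the number of Unicode code units in the string. Java represents strings internally as UTF-16, where each char is one 16-bit code unit. The escape sequence \uD834\uDD1E above is a surrogate pair (a high surrogate followed by a low surrogate), which means the character lies outside the Basic Multilingual Plane: it is U+1D11E and takes 32 bits in UTF-16. At 16 bits per code unit that is 2 code units, so the string's length is 2, which does not equal the number of characters in the string.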
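To make the code-unit arithmetic concrete, here is a minimal sketch that splits U+1D11E into its surrogate pair by hand and cross-checks the result against the JDK's `Character.highSurrogate` and `Character.lowSurrogate`. The class name `SurrogatePairDemo` and the two helper methods are illustrative names, not part of the original article, and the helpers assume the input is a supplementary code point (U+10000 or above).

```java
public class SurrogatePairDemo {
    // Hypothetical helper: high (leading) surrogate of a supplementary code point.
    static char highSurrogateOf(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Hypothetical helper: low (trailing) surrogate of a supplementary code point.
    static char lowSurrogateOf(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // the character used in the example above
        // Prints "D834 DD1E"
        System.out.printf("%04X %04X%n", (int) highSurrogateOf(clef), (int) lowSurrogateOf(clef));
        // Cross-check against the JDK's own helpers; both lines print "true"
        System.out.println(highSurrogateOf(clef) == Character.highSurrogate(clef));
        System.out.println(lowSurrogateOf(clef) == Character.lowSurrogate(clef));
        // Number of UTF-16 code units needed for this code point; prints 2
        System.out.println(Character.charCount(clef));
    }
}
```

The top 10 bits of (code point minus 0x10000) go into the high surrogate and the bottom 10 bits into the low surrogate, which is why one supplementary character always costs exactly two code units.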

String.codePointCount()

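As the name suggests, this returns the number of code points in the given range of the string; if you do not know what a code point is, see the article linked at the top. Every Unicode character is assigned exactly one code point, so unlike length() this method is not thrown off by surrogate pairs and returns the number of Unicode characters. Note that a single user-perceived character, such as a letter followed by a combining accent, can still consist of several code points.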
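The difference between the two counts is easier to see on a mixed string. The sketch below, with the illustrative class name `CodePointDemo` and helper `codePointLength` (neither is from the original article), compares `length()` with `codePointCount()` and then walks the string one code point at a time.

```java
public class CodePointDemo {
    // Hypothetical helper: number of Unicode code points in the whole string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', U+1D11E, 'b'
        System.out.println(s.length());         // 4 code units
        System.out.println(codePointLength(s)); // 3 code points

        // Step through the string by code point, treating a surrogate pair as one unit.
        // Prints U+0061, U+1D11E, U+0062 on separate lines.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }

        // A letter plus a combining acute accent is still two code points.
        System.out.println(codePointLength("e\u0301")); // 2
    }
}
```

Indexing with `charAt(i)` and `i++` would split the surrogate pair in the middle, which is why the loop advances by `Character.charCount(cp)`.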

String.getBytes().length

/**
 * Encodes this {@code String} into a sequence of bytes using the
 * platform's default charset, storing the result into a new byte array.
 *
 * <p> The behavior of this method when this string cannot be encoded in
 * the default charset is unspecified.  The {@link
 * java.nio.charset.CharsetEncoder} class should be used when more control
 * over the encoding process is required.
 *
 * @return  The resultant byte array
 *
 * @since      JDK1.1
 */
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
public byte[] getBytes(String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    return StringCoding.encode(charsetName, value, 0, value.length);
}

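The comment says the first method encodes the string into a byte sequence using the platform's default charset. Letting the environment's default charset decide the result is clearly risky, so the usual advice is to use the second method, which takes the encoding as a parameter, to make sure you get the result you intend.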
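As a minimal sketch of that advice, the example below contrasts the no-argument `getBytes()` with the `getBytes(Charset)` overload and the `StandardCharsets` constants, which avoid both the charset-name string and the checked `UnsupportedEncodingException`. The class name `ExplicitCharsetDemo` and the helper `utf8Length` are illustrative names only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: byte length of a string when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u5B57"; // the character U+5B57, written as an escape to avoid source-encoding issues

        // Depends on the JVM's default charset, so the number can differ between environments.
        System.out.println(Charset.defaultCharset() + ": " + s.getBytes().length);

        // Always UTF-8, whatever the platform default is.
        System.out.println("UTF-8: " + utf8Length(s));
    }
}
```

The first line of output varies with the environment (for example, a GBK default would give 2 bytes for this character), while the second line is always 3.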

import java.io.UnsupportedEncodingException;

public class StringBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String temp = "字"; // Unicode code point U+5B57
        System.out.println(temp.getBytes("UTF-8").length);
        System.out.println(temp.getBytes("UTF-16").length);
    }
}
// Output
3
4

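In the example above, looking up the code point range shows that U+5B57 falls in U+0800 to U+FFFF, which UTF-8 encodes in 3 bytes, hence the first result.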
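The range lookup can be written down directly. The sketch below encodes the UTF-8 length table as a small function and checks it against what `getBytes` actually produces; `Utf8LengthDemo` and `utf8Length` are illustrative names, and the function assumes its argument is a valid code point that is not a lone surrogate.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Hypothetical helper: bytes UTF-8 needs for one code point, looked up by range.
    static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;    // U+0000..U+007F
        if (codePoint <= 0x7FF) return 2;   // U+0080..U+07FF
        if (codePoint <= 0xFFFF) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x5B57, 0x1D11E};
        for (int cp : samples) {
            // Build a one-character string from the code point and encode it for comparison.
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: table=%d, getBytes=%d%n", cp, utf8Length(cp), actual);
        }
    }
}
```

For the four samples the table and the encoder agree: 1, 2, 3 and 4 bytes respectively.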

But U+5B57 needs only 2 bytes in UTF-16, so why does the output show 4 bytes?

The UTF-16 byte order (endianness) problem

Whenever a code unit is wider than one byte and has to be stored or transmitted, byte order becomes an issue.

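Looking at the UTF-8 encoding rules, the leading bits of each byte are enough to tell where it sits in the sequence (lead byte or continuation byte), so the order of the bytes is never ambiguous. UTF-16 has no such property: swapping the two bytes of a code unit yields a different character. The byte order of UTF-16 data therefore has to be stated when it is transmitted.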
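A quick way to see the ambiguity is to decode the same two bytes under both byte orders. This is a minimal sketch; `ByteOrderDemo`, `decodeBE` and `decodeLE` are illustrative names, and each helper assumes the array holds exactly one 16-bit code unit.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Hypothetical helper: read the bytes as big-endian UTF-16 and return the first code point.
    static int decodeBE(byte[] b) {
        return new String(b, StandardCharsets.UTF_16BE).codePointAt(0);
    }

    // Hypothetical helper: read the same bytes as little-endian UTF-16.
    static int decodeLE(byte[] b) {
        return new String(b, StandardCharsets.UTF_16LE).codePointAt(0);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x5B, 0x57};
        System.out.printf("big-endian:    U+%04X%n", decodeBE(bytes)); // U+5B57
        System.out.printf("little-endian: U+%04X%n", decodeLE(bytes)); // U+575B
    }
}
```

Both results are valid code points, so nothing in the bytes themselves tells a receiver which reading was intended.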

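The 2 extra bytes in the earlier getBytes("UTF-16") result come from exactly this: when the encoding scheme does not name a byte order, two bytes are written at the start of the output to indicate it. This is the byte order mark (BOM), U+FEFF. Java's UTF-16 charset encodes in big-endian order and writes a big-endian BOM (FE FF), while UTF-16LE and UTF-16BE fix the byte order and write no BOM.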

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringBytes {
    public static void main(String[] args) {
        // Unicode code point U+5B57
        String temp = "字";
        // UTF-16: big-endian with a leading BOM (0xFE 0xFF, printed as -2, -1)
        System.out.println(temp.getBytes(StandardCharsets.UTF_16).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16)));
        // UTF-16LE: little-endian, no BOM
        System.out.println(temp.getBytes(StandardCharsets.UTF_16LE).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16LE)));
        // UTF-16BE: big-endian, no BOM
        System.out.println(temp.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16BE)));
    }
}
// Output
4
[-2, -1, 91, 87]
2
[87, 91]
2
[91, 87]
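The BOM matters in the decoding direction as well. The sketch below feeds hand-built byte arrays to the JDK decoders to show that the UTF-16 charset uses the BOM to pick the byte order and then drops it, while UTF-16BE keeps it as an ordinary character. `BomDecodeDemo` and `decodeUtf16` are illustrative names, not from the original article.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Hypothetical helper: decode with the BOM-aware UTF-16 charset.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] bigEndianWithBom    = {(byte) 0xFE, (byte) 0xFF, 0x5B, 0x57};
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x57, 0x5B};

        // The UTF-16 decoder reads the BOM to choose the byte order, then discards it.
        String a = decodeUtf16(bigEndianWithBom);
        String b = decodeUtf16(littleEndianWithBom);
        System.out.println(a.equals(b) + " " + a.length()); // true 1

        // UTF-16BE has a fixed byte order, so the BOM survives as the character U+FEFF.
        String c = new String(bigEndianWithBom, StandardCharsets.UTF_16BE);
        System.out.printf("%d U+%04X%n", c.length(), (int) c.charAt(0)); // 2 U+FEFF
    }
}
```

In practice this means data written with UTF-16 should be read back with UTF-16, and data written with UTF-16LE or UTF-16BE should be read back with the same explicit charset; mixing them either leaves a stray U+FEFF at the start or misreads the byte order.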