The right way to get a ByteBuffer from a String and a CharBuffer from a ByteBuffer


anlian523


Article link: https://blog.csdn.net/anlian523/article/details/104088208


Contents

Incorrect example
Why it fails
Unicode, UTF-8 and UTF-16
Correct usage

Incorrect example

import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class test2 {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("Some text".getBytes());
        CharBuffer cb = bb.asCharBuffer();
        String s = cb.toString();
        System.out.print(s);
    }
}

Running this prints garbled characters instead of "Some text".

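Why it fails

"Some text".getBytes() encodes the string with the JVM's default charset, turning each character into the one or more bytes that this charset assigns to it. The default is usually UTF-8: it is UTF-8 on every platform since JDK 18, while older JDKs derive it from the operating system locale. The JDK source below shows where the default comes from.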

//String.java
    public byte[] getBytes() {
        return StringCoding.encode(value, 0, value.length);//delegates to StringCoding.encode
    }
    
//StringCoding.java
    static byte[] encode(char[] ca, int off, int len) {
        String csn = Charset.defaultCharset().name();//this line looks up the default charset
        try {
            // use charset name encode() variant which provides caching.
            return encode(csn, ca, off, len);
        } catch (UnsupportedEncodingException x) {
            warnUnsupportedCharset(csn);
        }
        try {
            return encode("ISO-8859-1", ca, off, len);
        } catch (UnsupportedEncodingException x) {
            // If this code is hit during VM initialization, MessageUtils is
            // the only way we will be able to get any kind of error message.
            MessageUtils.err("ISO-8859-1 charset not available: "
                             + x.toString());
            // If we can not find ISO-8859-1 (a required encoding) then things
            // are seriously wrong with the installation.
            System.exit(1);
            return null;
        }
    }
//Charset.java
    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                String csn = AccessController.doPrivileged(
                    new GetPropertyAction("file.encoding"));//note: read from a system property
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8");
            }
        }
        return defaultCharset;
    }

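So the default charset ultimately comes from the "file.encoding" system property. On most modern systems this is UTF-8, but older JDKs on Windows commonly report a locale-specific charset such as GBK or windows-1252. UTF-8 is hard-coded only as the fallback for when the named charset is not supported (cs == null).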
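
To see which charset the no-argument getBytes() will use on a particular machine, the two values can simply be printed. This is a minimal sketch; the class name DefaultCharsetDemo and the helper defaultCharsetName are made up for illustration, and the output depends on the JDK version and platform.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the charset that the no-argument String.getBytes() encodes with.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The raw system property and the charset the JVM actually resolved.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + defaultCharsetName());
    }
}
```

Charset.defaultCharset() is the safer value to rely on, because it already includes the fallback applied when the property names an unsupported charset.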

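bb.asCharBuffer() performs no charset decoding at all. The returned cb is a view over the bytes of bb that treats every two bytes as one big-endian 16-bit char. For code points below 0x10000 this is equivalent to decoding UTF-16BE, because there the UTF-16 code unit equals the Unicode code point, and a Java char is exactly such a two-byte code unit. cb.toString() therefore just turns each pair of bytes into one char. A detailed walk-through is in the source analysis section of my post "Java源码分析 ByteBuffer.asCharBuffer打印字符串乱码原因".
In short, "Some text".getBytes() encodes each character into bytes using UTF-8,
while String s = cb.toString() decodes pairs of bytes as if they were UTF-16. The mismatch produces garbage.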
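
The pairing can be reproduced by hand, which makes the garbled output predictable. The sketch below encodes explicitly with UTF-8 so that it behaves the same on every machine; PairingDemo, pair and pairBytes are illustrative names, not JDK APIs.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class PairingDemo {
    // Combine two bytes into one char the way a big-endian view buffer does.
    static char pair(byte hi, byte lo) {
        return (char) (((hi & 0xFF) << 8) | (lo & 0xFF));
    }

    // Every two bytes become one char; a trailing odd byte is dropped.
    static String pairBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            sb.append(pair(bytes[i], bytes[i + 1]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "Some text".getBytes(StandardCharsets.UTF_8); // 9 bytes
        CharBuffer view = ByteBuffer.wrap(utf8).asCharBuffer();     // 4 chars
        for (int i = 0; i < view.length(); i++) {
            System.out.printf("U+%04X%n", (int) view.get(i));
        }
        // The view and the hand-written pairing agree.
        System.out.println(view.toString().equals(pairBytes(utf8)));
    }
}
/*output:
U+536F
U+6D65
U+2074
U+6578
true
*/
```

The nine ASCII bytes 53 6F 6D 65 20 74 65 78 74 collapse into four unrelated chars, and the ninth byte is silently ignored.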

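Unicode, UTF-8 and UTF-16

Unicode is a coded character set: it assigns every character a unique number (its code point), but it does not by itself define how those numbers are stored as bytes, so it is not a Charset in the Java sense. The Unicode code space runs from U+0000 to U+10FFFF.
A Java char holds a value from U+0000 to U+FFFF, because char is only two bytes wide. Code points above U+FFFF need two chars.
UTF-8 encodes Unicode in units of one byte. As the table below shows, converting a code point to UTF-8 simply places its significant bits, in order, into the x positions of the byte pattern. The first byte of a character tells you how many bytes the character occupies.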

Unicode code point (hex)	UTF-8 byte sequence (binary)	Notes
000000-00007F	0xxxxxxx	`0x7F` has 7 significant bits, hence 7 x's
000080-0007FF	110xxxxx 10xxxxxx	`0x7FF` has 11 significant bits, hence 11 x's
000800-00FFFF	1110xxxx 10xxxxxx 10xxxxxx	`0xFFFF` has 16 significant bits, hence 16 x's
010000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	`0x10FFFF` has 21 significant bits, hence 21 x's
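
The four rows of the table translate directly into bit operations. The following is a sketch of a hand-written encoder, checked against the JDK's own UTF-8 encoder; Utf8ByHand and encode are illustrative names, and the method assumes its argument is a valid code point outside the surrogate range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8ByHand {
    // Encode one code point into UTF-8, one branch per table row.
    // Assumes cp is a valid code point that is not a surrogate.
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 'A', 0xE9, 0x4E2D, 0x1F600 }; // one code point per table row
        for (int cp : samples) {
            byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, encode(cp).length, Arrays.equals(encode(cp), expected));
        }
    }
}
```

Each sample should report 1, 2, 3 and 4 bytes respectively, with the hand-written result matching the JDK's.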

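UTF-16 encodes Unicode in units of 16-bit unsigned integers (two bytes). So that the two cases in the table below can be told apart, a code unit in the first row never starts with 0b110110 or 0b110111; the range U+D800 to U+DFFF is reserved for this purpose.

Unicode code point (hex)	UTF-16 code units (binary)	Notes
000000-00FFFF	the code point itself (one 16-bit integer)	
010000-10FFFF	110110yyyyyyyyyy 110111xxxxxxxxxx (two 16-bit integers)	0x10000 is subtracted first; even the maximum 0x10FFFF becomes `0xFFFFF`, which has 20 significant bits. The upper 10 bits fill the y's and the lower 10 bits fill the x's; each group is combined with its fixed prefix to form one 16-bit integer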
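
The second row of the UTF-16 table can be checked with a few lines of arithmetic. This sketch splits a supplementary code point by hand and compares the result with Character.toChars; SurrogateByHand and toSurrogates are illustrative names.

```java
import java.util.Arrays;

class SurrogateByHand {
    // Split a supplementary code point (U+10000..U+10FFFF) into two UTF-16 code units.
    static char[] toSurrogates(int cp) {
        int v = cp - 0x10000;                      // at most 20 significant bits
        char high = (char) (0xD800 | (v >> 10));   // 110110 + upper 10 bits (the y's)
        char low  = (char) (0xDC00 | (v & 0x3FF)); // 110111 + lower 10 bits (the x's)
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        char[] mine = toSurrogates(cp);
        char[] jdk = Character.toChars(cp);
        System.out.printf("%04X %04X%n", (int) mine[0], (int) mine[1]);
        System.out.println(Arrays.equals(mine, jdk));
        // One code point, but two chars: a char is a code unit, not a character.
        String s = new String(jdk);
        System.out.println(s.length() + " chars, " + s.codePointCount(0, s.length()) + " code point");
    }
}
/*output:
D83D DE00
true
2 chars, 1 code point
*/
```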

Correct usage

Calling the no-argument String.getBytes

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class test2 {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("Some text".getBytes());
        String encoding = System.getProperty("file.encoding");
        CharBuffer cb = Charset.forName(encoding).decode(bb);//decode with a specific charset to obtain the CharBuffer
        String s = cb.toString();
        System.out.print(s);
    }
}
/*output:
Some text
*/

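As explained above, the no-argument getBytes encodes with the charset named by "file.encoding".
Decoding therefore uses that same "file.encoding" charset.
Encoding and decoding now agree, so the result is correct.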
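
A variation that does not depend on the machine's configuration is to name the charset on both sides. The sketch below uses StandardCharsets.UTF_8 as an example choice; ExplicitCharsetDemo and roundTrip are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encode and decode with the same explicitly named charset.
    static String roundTrip(String text) {
        ByteBuffer bb = StandardCharsets.UTF_8.encode(text); // String -> ByteBuffer
        CharBuffer cb = StandardCharsets.UTF_8.decode(bb);   // ByteBuffer -> CharBuffer
        return cb.toString();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Some text"));
    }
}
/*output:
Some text
*/
```

Because nothing here reads the default charset, the bytes are the same on every platform, which matters as soon as they are written to a file or sent over a network.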

Calling String.getBytes with a charset argument

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class test2 {
    public static void main(String[] args) throws UnsupportedEncodingException {
        ByteBuffer bb = ByteBuffer.wrap("Some text".getBytes("UTF-16BE"));
        CharBuffer cb = bb.asCharBuffer();
        String s = cb.toString();
        System.out.print(s);
    }
}
/*output:
Some text
*/

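Here getBytes is told to encode with the charset "UTF-16BE"; BE stands for big-endian.
bb.asCharBuffer() returns a big-endian CharBuffer by default, because a newly created ByteBuffer is always big-endian.
cb.toString() builds the string by treating every two bytes as one Unicode code unit. For code points that fit in two bytes, the code point and its UTF-16 encoding are identical, so reading the UTF-16 bytes pairwise yields the original characters.
The result is therefore correct.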
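
The byte order of the view has to match the byte order of the encoding. As a sketch of the little-endian counterpart, the bytes below are produced with UTF-16LE and the ByteBuffer is switched to ByteOrder.LITTLE_ENDIAN before the view is created; ByteOrderDemo and view are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    // Read the bytes through a char view that uses the given byte order.
    static String view(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        byte[] le = "Some text".getBytes(StandardCharsets.UTF_16LE);
        // Matching order: the text comes back intact.
        System.out.println(view(le, ByteOrder.LITTLE_ENDIAN));
        // Mismatched order: each char has its two bytes swapped.
        System.out.println(view(le, ByteOrder.BIG_ENDIAN).equals("Some text"));
    }
}
/*output:
Some text
false
*/
```

Plain "UTF-16" is a poor fit for this technique, because Java's encoder for it prepends a byte order mark, which the view would surface as an extra leading U+FEFF char.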

Calling put on a ByteBufferAsCharBufferB instance

import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class test2 {
    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocate(24);//must be large enough: two bytes per char
        CharBuffer cb = bb.asCharBuffer().put("Some text");
        cb.flip();
        String s = cb.toString();
        System.out.println(s);

        CharBuffer cb2 = bb.asCharBuffer();
        String s2 = cb2.toString();
        System.out.println(s2);
    }
}
/*output:
Some text
Some text
*/

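ByteBuffer.allocate(24) only allocates the backing byte array; every byte starts at its default value of zero.
CharBuffer cb = bb.asCharBuffer().put("Some text"): cb holds a reference to bb, and put writes each char into bb as the two bytes of its code unit, in the buffer's byte order (big-endian by default).
After put, cb's position is 9. Without cb.flip() the position is not reset to zero (and the limit is not set to 9), so toString would start reading after the text rather than from the beginning.
cb2 prints the same text, which shows that put changed the underlying storage of bb. Strictly speaking, cb2 covers all 24 bytes, so its string is "Some text" followed by three U+0000 chars, which most consoles do not display.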
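
The effect of position, limit and flip() is easier to see from string lengths than from console output, since U+0000 chars are usually invisible. A small sketch, with FlipDemo and lengths as illustrative names:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.util.Arrays;

class FlipDemo {
    // Returns {position after put, toString length before flip,
    //          toString length after flip, toString length of a fresh view}.
    static int[] lengths() {
        ByteBuffer bb = ByteBuffer.allocate(24);   // room for 12 chars
        CharBuffer cb = bb.asCharBuffer();
        cb.put("Some text");
        int positionAfterPut = cb.position();
        int beforeFlip = cb.toString().length();   // only the untouched tail
        cb.flip();                                 // limit = 9, position = 0
        int afterFlip = cb.toString().length();
        // bb's own position never moved, so a fresh view spans all 24 bytes.
        int freshView = bb.asCharBuffer().toString().length();
        return new int[] { positionAfterPut, beforeFlip, afterFlip, freshView };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(lengths()));
    }
}
/*output:
[9, 3, 9, 12]
*/
```

The fresh view is 12 chars long: the 9 chars of text plus 3 zero chars from the unused part of the buffer.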

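Summary

In the first approach, cb is a HeapCharBuffer instance; in the second and third, cb is a ByteBufferAsCharBufferB instance.
Printing a HeapCharBuffer reads from its char[] hb field (inherited from CharBuffer); printing a ByteBufferAsCharBufferB reads from its ByteBuffer bb field (declared in the class itself).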
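
The difference between the two kinds of buffer can be observed through the public API without touching JDK internals: a heap buffer exposes a backing char array, a view buffer does not. The sketch below assumes an OpenJDK-style implementation, where Charset.decode allocates a heap CharBuffer; the printed class names are implementation details and may differ elsewhere. BufferKindDemo, decoded and viewed are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class BufferKindDemo {
    // First approach: decode bytes with a charset.
    static CharBuffer decoded() {
        byte[] bytes = "Some text".getBytes(StandardCharsets.UTF_8);
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    }

    // Second approach: view UTF-16BE bytes as chars.
    static CharBuffer viewed() {
        byte[] bytes = "Some text".getBytes(StandardCharsets.UTF_16BE);
        return ByteBuffer.wrap(bytes).asCharBuffer();
    }

    public static void main(String[] args) {
        for (CharBuffer cb : new CharBuffer[] { decoded(), viewed() }) {
            // hasArray() is true only when the buffer is backed by an accessible char[].
            System.out.println(cb.getClass().getSimpleName()
                    + " hasArray=" + cb.hasArray() + " text=" + cb);
        }
    }
}
```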
