Why Java 9 changed String's underlying implementation from char[] to byte[]

会思考懂感受的编码人

Last modified: 2023-06-07 20:54:24

Tags: java

First published: 2023-06-07 20:45:20

Article link: https://blog.csdn.net/m0_62335720/article/details/131091877

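Preface: I read several blog posts on why Java 9 changed String's underlying storage from char[] to byte[], and some of them get details wrong. For example, the char array behind String is encoded as UTF-16, not UTF-8, and a common Chinese character encoded in UTF-8 takes 3 bytes. So I went through the official documentation and put together these notes. If you spot any mistakes, corrections are welcome.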

Official documentation: JEP 254: Compact Strings

1. The purpose of changing String's underlying implementation from char[] to byte[]

The current implementation of the String class stores characters in a char array, using two bytes (sixteen bits) for each character. Data gathered from many different applications indicates that strings are a major component of heap usage and, moreover, that most String objects contain only Latin-1 characters. Such characters require only one byte of storage, hence half of the space in the internal char arrays of such String objects is going unused.

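The passage above is quoted from the official document. In short: the String implementation before Java 9 stores characters in a char array, using 2 bytes (16 bits) per character. Data gathered from many different applications shows that strings are a major component of heap usage and that most String objects contain only Latin-1 characters (English text being the typical case). Such characters need only 1 byte of storage, while a char array allots 2 bytes to each, so half of the space in those arrays goes unused.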

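In day-to-day development, single-byte characters (for example, English letters) appear far more often than characters that need two bytes (for example, Chinese characters; most Chinese characters take 2 bytes in UTF-16). Storing single-byte strings in a byte[] instead of a char[] roughly halves the memory taken by their character data.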
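The byte counts mentioned so far can be checked with a small sketch. It does not inspect String's internal array; it only encodes the same text with the standard charsets and compares the resulting lengths. The class name `EncodedSizes` and its helper methods are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Bytes needed to encode s in Latin-1 (ISO-8859-1): one byte per character.
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed to encode s in UTF-16. UTF_16BE is used so that no
    // byte-order mark is prepended to the output.
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes needed to encode s in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "hello";
        String chinese = "\u6c49"; // the Chinese character U+6C49

        System.out.println(latin1Bytes(english)); // 5
        System.out.println(utf16Bytes(english));  // 10
        System.out.println(utf16Bytes(chinese));  // 2
        System.out.println(utf8Bytes(chinese));   // 3
    }
}
```

For the English text, the UTF-16 form is twice the size of the Latin-1 form, which is the unused half that JEP 254 refers to. The Chinese character takes 2 bytes in UTF-16 and 3 bytes in UTF-8.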

2. Encoding schemes

From Java 9 onward, the String class supports two internal encodings: Latin-1 and UTF-16.

/**
 * The identifier of the encoding used to encode the bytes in
 * {@code value}. The supported values in this implementation are
 *
 * LATIN1
 * UTF16
 *
 * @implNote This field is trusted by the VM, and is a subject to
 * constant folding if String instance is constant. Overwriting this
 * field after construction will cause problems.
 */
private final byte coder;
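To see why a one-byte `coder` flag is enough, here is a minimal model of how a character count can be derived from the byte array. It assumes the constant values LATIN1 = 0 and UTF16 = 1 and the shift-based length computation used in the JDK's String; the class `CoderLengthSketch` itself is hypothetical and is not JDK code.

```java
class CoderLengthSketch {
    // Assumed to mirror the JDK's coder constants.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Number of characters held in `value`:
    // Latin-1 uses 1 byte per char  (value.length >> 0),
    // UTF-16 uses 2 bytes per char  (value.length >> 1).
    static int length(byte[] value, byte coder) {
        return value.length >> coder;
    }

    public static void main(String[] args) {
        byte[] latin1Value = new byte[5];  // "hello" stored as Latin-1
        byte[] utf16Value = new byte[10];  // "hello" stored as UTF-16

        System.out.println(length(latin1Value, LATIN1)); // 5
        System.out.println(length(utf16Value, UTF16));   // 5
    }
}
```

Both arrays describe a five-character string; only the coder tells the reader how many bytes make up one character.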

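Encoding scheme 1: Latin-1. This encoding represents each character with 1 byte. When every character in a string fits in a single byte, the string is stored in a byte[] with the Latin-1 coder, which halves the memory used for character data compared with char[].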
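The choice between the two layouts can be sketched as follows. This is a simplified illustration, not the JDK's implementation: the class `CompactSketch` and its methods are hypothetical, and the UTF-16 branch writes bytes in big-endian order purely for readability.

```java
class CompactSketch {
    // True when every char is in the Latin-1 range 0x00..0xFF,
    // i.e. each one fits in a single byte.
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // Packs the chars of s into a byte[]: 1 byte per char when the whole
    // string fits Latin-1, otherwise 2 bytes per char (UTF-16 code units).
    static byte[] encode(String s) {
        int n = s.length();
        if (fitsLatin1(s)) {
            byte[] out = new byte[n];
            for (int i = 0; i < n; i++) {
                out[i] = (byte) s.charAt(i);
            }
            return out;
        }
        byte[] out = new byte[n * 2];
        for (int i = 0; i < n; i++) {
            char c = s.charAt(i);
            out[2 * i] = (byte) (c >> 8);  // high byte (big-endian, an assumption)
            out[2 * i + 1] = (byte) c;     // low byte
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("hello").length);   // 5
        System.out.println(encode("h\u6c49").length); // 4
    }
}
```

Note that the decision is made per string in this sketch: a single character outside the Latin-1 range moves the whole string to the two-byte layout, so the saving applies only to strings made up entirely of Latin-1 characters.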