Java String Character Set Encoding and Conversion


weixin_34292287


Original link: http://blog.51cto.com/9797337/1767774


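Basic concepts

Commonly used character sets and encodings include UTF-8, Unicode, GBK, and GB2312. For the full list of charsets a JVM supports, see the java.nio.charset.Charset class (for example, Charset.availableCharsets()).

The key charset-handling methods are: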

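String.getBytes()	Returns the bytes that this string's characters map to under the system default charset (the system default charset is discussed in detail below). Different default charsets produce different bytes, and decoding yields the correct characters only when it uses the same charset.
String.getBytes(String charset)	Returns the bytes that this string's characters map to under the specified charset. Different charsets produce different bytes, and decoding yields the correct characters only when it uses the same charset.
new String(byte[] data)	Decodes the given bytes into characters using the system default charset and constructs a string from them. As long as the default charset does not change and the bytes are valid in it, the original bytes can be recovered with String.getBytes() or String.getBytes(Charset.defaultCharset().name()).
new String(byte[] data, String charset)	Decodes the given bytes into characters using the specified charset and constructs a string from them. The original bytes can be recovered only by passing the same charset to String.getBytes(charset).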
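
The round trip described in the table can be sketched as follows. This is a minimal illustration rather than code from the original article: the class name CharsetRoundTripDemo and the helper transcode are made up for the example, and it uses the Charset-typed overloads (getBytes(Charset), new String(byte[], Charset)) instead of the String-named ones so that no checked UnsupportedEncodingException has to be handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTripDemo {
    // Hypothetical helper: encode with one charset, then decode with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String text = "\u4e2d\u6587";

        // The same characters map to different bytes under different charsets.
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 6
        System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 4

        // Decoding with the charset used for encoding restores the text.
        System.out.println(
                transcode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text));      // true

        // Decoding with a different charset produces garbled characters.
        System.out.println(
                transcode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(text)); // false
    }
}
```

Passing the charset explicitly on both sides keeps the result independent of whatever the JVM default happens to be.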

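Everyday scenarios and solutions

Scenario 1: changing the JVM system charset

The system default charset is the one reported by java.nio.charset.Charset.defaultCharset().displayName() while the JVM is running. There are several ways to try to change the JVM's system charset: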

Method 1

Properties pps = System.getProperties();
pps.put("file.encoding", "<your-charset>");
System.setProperties(pps);

Method 2

System.setProperty("file.encoding", "<your-charset>");

Method 3

java -Dfile.encoding=<your-charset>

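In the table above, replace the angle-bracketed placeholder <your-charset> with the charset you want.

Note that if the charset is changed at runtime, a subsequent call to java.nio.charset.Charset.defaultCharset().displayName() may not reflect the change, because the Charset class caches the default charset. See the Charset source: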

private static volatile Charset defaultCharset;

/**
 * Returns the default charset of this Java virtual machine.
 *
 * <p> The default charset is determined during virtual-machine startup and
 * typically depends upon the locale and charset of the underlying
 * operating system.
 *
 * @return A charset object for the default charset
 *
 * @since 1.5
 */
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
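
The effect of this caching can be observed with a small experiment. The sketch below is illustrative only: the class and method names are invented for the example, and it assumes the running code is allowed to modify system properties. It overwrites file.encoding at runtime (method 2) and checks that the property changes while Charset.defaultCharset() does not.

```java
import java.nio.charset.Charset;

class DefaultCharsetCacheDemo {
    // Returns true if Charset.defaultCharset() is unchanged after
    // file.encoding has been overwritten at runtime.
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset(); // ensures the default is already resolved
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from the current default.
        String other = "UTF-8".equals(before.name()) ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            boolean propertyChanged = other.equals(System.getProperty("file.encoding"));
            boolean defaultUnchanged = Charset.defaultCharset().equals(before);
            return propertyChanged && defaultUnchanged;
        } finally {
            // Restore the property so the experiment leaves no side effects.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultSurvivesPropertyChange()); // true
    }
}
```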

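So after using method 1 or method 2, the updated value must be read with System.getProperty("file.encoding"). In real projects, many libraries call String.getBytes() with no arguments, and that method resolves the default charset through java.nio.charset.Charset.defaultCharset().name(), so it keeps using the cached charset rather than the updated property. See the source of the String and StringCoding classes:

The String class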

/**
 * Encodes this {@code String} into a sequence of bytes using the
 * platform's default charset, storing the result into a new byte array.
 *
 * <p> The behavior of this method when this string cannot be encoded in
 * the default charset is unspecified. The {@link
 * java.nio.charset.CharsetEncoder} class should be used when more control
 * over the encoding process is required.
 *
 * @return The resultant byte array
 *
 * @since JDK1.1
 */
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}

The StringCoding class

static byte[] encode(char[] ca, int off, int len) {
    String csn = Charset.defaultCharset().name();
    try {
        // use charset name encode() variant which provides caching.
        return encode(csn, ca, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    try {
        return encode("ISO-8859-1", ca, off, len);
    } catch (UnsupportedEncodingException x) {
        // If this code is hit during VM initialization, MessageUtils is
        // the only way we will be able to get any kind of error message.
        MessageUtils.err("ISO-8859-1 charset not available: "
                         + x.toString());
        // If we can not find ISO-8859-1 (a required encoding) then things
        // are seriously wrong with the installation.
        System.exit(1);
        return null;
    }
}

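In summary, to change the system default charset reliably, method 3 is the best choice, because the property is set before the JVM resolves and caches the default charset.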
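
To check which charset the no-argument getBytes() actually uses in a given JVM, a small probe like the following may help. It is a sketch with invented names (DefaultGetBytesDemo, noArgGetBytesUsesDefault), not part of the original article. It compares the bytes from getBytes() with those from an explicit getBytes(Charset.defaultCharset()).

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class DefaultGetBytesDemo {
    // True when the no-argument getBytes() yields the same bytes as
    // encoding explicitly with the JVM default charset.
    static boolean noArgGetBytesUsesDefault(String text) {
        byte[] implicit = text.getBytes();
        byte[] explicit = text.getBytes(Charset.defaultCharset());
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file encoding does not matter.
        String text = "\u4e2d\u6587";
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset().name());
        System.out.println(noArgGetBytesUsesDefault(text)); // true
    }
}
```

Running the probe once normally and once with the -Dfile.encoding option from method 3 shows whether the launch option took effect in the JVM being used.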

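Scenario 2: charset encoding in network communication

During development, garbled text often appears in communication between front end and back end, and Chinese text is especially prone to it. In most cases the cause is that the default charsets of the front end and back end do not match the charset the business logic requires. The following common situation illustrates this.

Suppose the server's default charset is GBK and the front end's default charset is GB2312.

If both sides communicate in plaintext and the HTTP request and response headers specify the charset correctly (meaning the body's actual encoding matches the charset named in the header), each side reads the charset from the header sent by the other side and then calls new String(byte[] httpBodyRawData, charset).

If both sides communicate in plaintext but the header does not specify the charset correctly, the two sides need an agreed charset. If the agreed charset is UTF-8, calling new String(byte[] httpBodyRawData, "UTF-8") is enough.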
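
The two plaintext cases above can be combined into one decoding helper. The following is a minimal sketch under stated assumptions: BodyDecoder, charsetFromContentType, and decodeBody are hypothetical names, the Content-Type parsing is deliberately simplistic (a real HTTP library would normally do this), and UTF-8 stands in for the agreed fallback charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BodyDecoder {
    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=GBK"; returns the fallback when it is missing or unusable.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the agreed charset.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes the raw body with the header charset, or UTF-8 as the agreed fallback.
    static String decodeBody(byte[] httpBodyRawData, String contentType) {
        Charset charset = charsetFromContentType(contentType, StandardCharsets.UTF_8);
        return new String(httpBodyRawData, charset);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, escaped
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeBody(body, "application/json; charset=utf-8").equals(text)); // true
        System.out.println(decodeBody(body, null).equals(text));                              // true
    }
}
```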

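If the two sides communicate in ciphertext with UTF-8 as the agreed charset, let toEncryStr be the string to encrypt and byte[] decryData be the raw data after decryption. Encryption falls into the following cases:

If the encryption algorithm accepts a byte[], pass it toEncryStr.getBytes("UTF-8").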

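If the encryption algorithm accepts a String and calls String.getBytes("UTF-8") internally, the conversion to apply is: new String(toEncryStr.getBytes("UTF-8"), "UTF-8").

This conversion first maps the characters to bytes with UTF-8 (bytes that display correctly only when decoded as UTF-8), then maps those bytes back to characters with UTF-8, which always yields valid characters. When encryption later calls String.getBytes("UTF-8"), the converted characters are turned back into the same UTF-8 bytes. Note that for a well-formed string this round trip produces a string equal to toEncryStr, so passing toEncryStr directly yields the same bytes.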

Decoding after decryption is simple: new String(decryData, "UTF-8"). // the agreed charset is UTF-8

Reposted from: https://blog.51cto.com/9797337/1767774
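
The byte[]-based case can be sketched end to end as follows. The original does not name an encryption algorithm, so AES-GCM from javax.crypto is used here purely as an assumed example, with a key and IV generated on the spot. EncryptedTextDemo and its methods are hypothetical names, and key exchange is out of scope.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

class EncryptedTextDemo {
    // Encode with the agreed charset (UTF-8), then encrypt the bytes.
    static byte[] encrypt(String toEncryStr, SecretKey key, byte[] iv) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        return cipher.doFinal(toEncryStr.getBytes(StandardCharsets.UTF_8));
    }

    // Decrypt to raw bytes, then decode with the same agreed charset.
    static String decrypt(byte[] cipherText, SecretKey key, byte[] iv) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] decryData = cipher.doFinal(cipherText);
        return new String(decryData, StandardCharsets.UTF_8);
    }

    // Convenience wrapper for the demo: fresh key and IV, encrypt, then decrypt.
    static String roundTrip(String text) {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            SecretKey key = generator.generateKey();
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            return decrypt(encrypt(text, key, iv), key, iv);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("AES-GCM is not available in this JVM", e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters, escaped
        System.out.println(roundTrip(text).equals(text)); // true
    }
}
```

Because both sides name UTF-8 explicitly, the result does not depend on either JVM's default charset.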
