Things to Know About Java Character Encoding

码路编

Category: Java基础 (Java Basics)

Article link: https://blog.csdn.net/l2580258/article/details/103839029

1. Introducing the problem

1.1 GBK, UTF-8, and ISO-8859-1

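GBK encodes a Chinese character in 2 bytes, while UTF-8 encodes most Chinese characters in 3 bytes and rare characters outside the Basic Multilingual Plane in 4 bytes. The two encodings map the same character to entirely different byte sequences, so GBK bytes cannot simply be reinterpreted as UTF-8: they have to be decoded as GBK into characters first and then re-encoded as UTF-8. If GBK bytes are decoded as UTF-8 directly, Java's decoder replaces every byte sequence it cannot parse with the replacement character U+FFFD, and the result is garbled text.

Once a byte array produced by a non-UTF-8 encoding has been decoded as UTF-8, there is usually no way back! The replaced bytes are gone, so re-encoding the string does not restore the original byte array.

ISO-8859-1, by contrast, is a single-byte encoding that maps every byte value from 0x00 to 0xFF to exactly one character. Decoding arbitrary bytes as ISO-8859-1 and encoding them back returns the original bytes unchanged, so it is safe to use as an intermediate encoding in code; no replacement takes place.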
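The difference between the two round trips can be checked with a small sketch. The class name `RoundTripDemo` and the helper `roundTrip` are made up for this illustration; the four bytes are the GBK encoding of the two characters U+6D4B U+8BD5 (测试).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // GBK encoding of the two characters U+6D4B U+8BD5: B2 E2 CA D4
    static final byte[] GBK_BYTES = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4};

    // Decode the bytes with the given charset, then encode the string back.
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] viaUtf8 = roundTrip(GBK_BYTES, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(GBK_BYTES, StandardCharsets.ISO_8859_1);
        // false: the malformed sequences were replaced during decoding
        System.out.println("UTF-8 round trip intact:      " + Arrays.equals(GBK_BYTES, viaUtf8));
        // true: every byte maps to exactly one char and back
        System.out.println("ISO-8859-1 round trip intact: " + Arrays.equals(GBK_BYTES, viaLatin1));
    }
}
```

The ISO-8859-1 string is of course not readable text; it only carries the bytes through unchanged until they can be decoded with the right charset.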

1.2 Unicode

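Java represents text as Unicode: a char is a 16-bit UTF-16 code unit, and a String is a sequence of such units. Unicode is the universal character set. It assigns a unique code point to every character of the world's writing systems, which makes text portable across languages and platforms.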
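Because a Java char holds 16 bits, a character whose code point is above U+FFFF occupies two chars (a surrogate pair). A minimal sketch, with the class name `CodeUnitDemo` chosen only for this example:

```java
class CodeUnitDemo {
    // U+1F600 lies outside the BMP; in UTF-16 it is the surrogate pair D83D DE00.
    static final String EMOJI = "\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(EMOJI.length());                            // 2 (chars, i.e. code units)
        System.out.println(EMOJI.codePointCount(0, EMOJI.length()));   // 1 (code point)
        System.out.println(Integer.toHexString(EMOJI.codePointAt(0))); // 1f600
    }
}
```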

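Unicode escape rule:
In the "\u" escape notation, each char is written as 4 hexadecimal digits. The rule is: take the high 8 bits and the low 8 bits of the char separately, convert each to hexadecimal, pad it with a leading 0 if it is shorter than 2 digits, concatenate the two hex strings, and prefix the result with "\u". A character outside the Basic Multilingual Plane occupies two chars (a surrogate pair) and therefore takes two such escapes.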
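The rule above can be sketched in a few lines. `UnicodeEscapeDemo` and `escape` are names invented for this example; formatting the whole char with `%04x` gives the same result as padding the high and low bytes separately.

```java
class UnicodeEscapeDemo {
    // Writes every char of the input as a backslash, the letter u, and 4 lowercase hex digits.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            // %04x pads with leading zeros, so 'A' (0x41) becomes 0041
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints a backslash followed by u0041
        System.out.println(escape("A"));
    }
}
```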

2. Converting GBK to UTF-8

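A Java String is already decoded text, so the conversion step that remains is encoding its characters as UTF-8 bytes, which takes more bytes per Chinese character than GBK does. The code below performs that encoding by hand:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public static String getUTF8StringFromGBKString(String gbkStr) {
    return new String(getUTF8BytesFromGBKString(gbkStr), StandardCharsets.UTF_8);
}

public static byte[] getUTF8BytesFromGBKString(String gbkStr) {
    int n = gbkStr.length();
    byte[] utfBytes = new byte[3 * n]; // worst case: 3 bytes per char
    int k = 0;
    for (int i = 0; i < n; ) {
        int m = gbkStr.codePointAt(i); // reads a surrogate pair as one code point
        i += Character.charCount(m);
        if (m < 0x80) { // ASCII: 1 byte
            utfBytes[k++] = (byte) m;
        } else if (m < 0x800) { // 2 bytes; a 3-byte form here would be an invalid overlong encoding
            utfBytes[k++] = (byte) (0xc0 | (m >> 6));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        } else if (m < 0x10000) { // 3 bytes: most Chinese characters
            utfBytes[k++] = (byte) (0xe0 | (m >> 12));
            utfBytes[k++] = (byte) (0x80 | ((m >> 6) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        } else { // 4 bytes: characters outside the BMP
            utfBytes[k++] = (byte) (0xf0 | (m >> 18));
            utfBytes[k++] = (byte) (0x80 | ((m >> 12) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | ((m >> 6) & 0x3f));
            utfBytes[k++] = (byte) (0x80 | (m & 0x3f));
        }
    }
    // Trim the buffer to the number of bytes actually written
    return Arrays.copyOf(utfBytes, k);
}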
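When the input is a GBK byte array rather than a String, the standard library can do the whole conversion: decode with the GBK charset, then encode as UTF-8. The sketch below assumes the running JDK provides a charset named "GBK" (it is not one of the charsets every JVM is required to support); `GbkToUtf8Demo`, `gbkToUtf8`, and `toHex` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkToUtf8Demo {
    // Assumption: the JDK ships the optional GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Decode the bytes as GBK, then re-encode the characters as UTF-8.
    static byte[] gbkToUtf8(byte[] gbkBytes) {
        String text = new String(gbkBytes, GBK);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as uppercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // GBK encoding of U+6D4B U+8BD5: 4 bytes in, 6 bytes out
        byte[] gbkBytes = {(byte) 0xB2, (byte) 0xE2, (byte) 0xCA, (byte) 0xD4};
        System.out.println(toHex(gbkToUtf8(gbkBytes))); // E6B58BE8AF95
    }
}
```

The byte count grows from 2 to 3 per character, which is the "extra bytes" the section title refers to.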

3. Converting between strings and Unicode

The conversion utility class:

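package sherry.com.javalib.base;

/**
 * Utility class for converting between strings and their Unicode (UTF-16) representations.
 */
public class UnicodeConvertUtil {

    // Default byte order: big-endian
    private static boolean isBig = true;

    /**
     * Converts a string to its Unicode bytes (2 bytes per char, without the \\u prefix),
     * using the default big-endian byte order.
     *
     * @param str the string to convert
     * @return the Unicode bytes
     */
    public static byte[] putString2UnicodeBytes(String str) {
        return putString2UnicodeBytes(str, isBig);
    }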

    /**
     * Converts a string to its Unicode bytes (2 bytes per char).
     *
     * @param str   the string to convert; null is treated as ""
     * @param isBig true for big-endian (high byte first), false for little-endian
     * @return the Unicode bytes
     */
    public static byte[] putString2UnicodeBytes(String str, boolean isBig) {
        str = (str == null ? "" : str);
        char c;
        int i, j;

        byte[] strByteRes = new byte[str.length() * 2];
        for (i = 0; i < str.length(); i++) {
            c = str.charAt(i);

            if (isBig) { // big-endian
                j = (c >>> 8); // high 8 bits
                strByteRes[2 * i] = (byte) j;

                j = (c & 0xFF); // low 8 bits
                strByteRes[2 * i + 1] = (byte) j;

            } else { // little-endian
                j = (c & 0xFF); // low 8 bits
                strByteRes[2 * i] = (byte) j;

                j = (c >>> 8); // high 8 bits
                strByteRes[2 * i + 1] = (byte) j;
            }
        }

        return strByteRes;
    }

    /**
     * Converts a string to its Unicode escape form, one \\uXXXX escape per char.
     *
     * @param str the string to convert; null is treated as ""
     * @return the escaped string
     */
    public static String putString2UnicodeString(String str) {
        str = (str == null ? "" : str);
        String tmp;
        StringBuilder sb = new StringBuilder(str.length() * 6);
        char c;
        int i, j;
        for (i = 0; i < str.length(); i++) {
            c = str.charAt(i);
            sb.append("\\u");
            j = (c >>> 8); // high 8 bits
            tmp = Integer.toHexString(j);
            if (tmp.length() == 1)
                sb.append("0"); // pad to 2 digits with a leading zero
            sb.append(tmp);
            j = (c & 0xFF); // low 8 bits
            tmp = Integer.toHexString(j);
            if (tmp.length() == 1)
                sb.append("0"); // pad to 2 digits with a leading zero
            sb.append(tmp);

        }
        return sb.toString();
    }

    /**
     * Converts big-endian Unicode bytes (no \\u prefix) to a string.
     *
     * @param unicodeBytes the Unicode bytes, 2 bytes per char
     * @return the decoded string
     */
    public static String unicodeBytes2Str(byte[] unicodeBytes) {
        return unicodeBytes2Str(unicodeBytes, true);
    }

    /**
     * Converts Unicode bytes to a string, e.g. new byte[]{0x5f, 0x20} is the char \\u5f20.
     *
     * @param unicodeBytes the Unicode bytes, 2 bytes per char
     * @param isBig        true for big-endian (high byte first), false for little-endian
     * @return the decoded string; "" if the input is null
     */
    public static String unicodeBytes2Str(byte[] unicodeBytes, boolean isBig) {
        if (null == unicodeBytes) {
            return "";
        }

        StringBuilder strRes = new StringBuilder(unicodeBytes.length / 2);
        // Stop before a trailing odd byte, which cannot form a complete char
        for (int i = 0; i + 1 < unicodeBytes.length; i += 2) {
            int high = isBig ? unicodeBytes[i] : unicodeBytes[i + 1];
            int low = isBig ? unicodeBytes[i + 1] : unicodeBytes[i];
            // Combine the two bytes into one char. This is inlined in place of the
            // TBaseNumber.byte2Char helper, which is not shown; it is assumed to do the same.
            strRes.append((char) (((high & 0xFF) << 8) | (low & 0xFF)));
        }

        return strRes.toString();
    }



    public static String unicodeToCn(String unicode) {
        // Split on \ u (written with a space here because Java also processes Unicode escapes inside comments)
        String[] strs = unicode.split("\\\\u");
        StringBuilder returnStr = new StringBuilder();
        // The escaped string starts with \ u, so the first element of the split is "" and is skipped.
        for (int i = 1; i < strs.length; i++) {
            returnStr.append((char) Integer.parseInt(strs[i], 16));
        }
        return returnStr.toString();
    }

}

Test (TBaseNumber is a hex/byte conversion helper class that is not shown in this article):

    public static void main(String[] args) {
        String characterStr = "测试";

        System.out.println("-----String to Unicode escapes (with \\u): " + UnicodeConvertUtil.putString2UnicodeString(characterStr));
        System.out.println("-----String to Unicode bytes: " + TBaseNumber.byte2HexString(UnicodeConvertUtil.putString2UnicodeBytes(characterStr, true)));

        byte[] unicodeByte = TBaseNumber.hexStringToByte("6D4B8BD5");
        System.out.println("-----Unicode bytes to string: " + UnicodeConvertUtil.unicodeBytes2Str(unicodeByte, true));
        System.out.println("-----Unicode escapes to string: " + UnicodeConvertUtil.unicodeToCn("\\u6d4b\\u8bd5"));
    }

Result:
-----String to Unicode escapes (with \u): \u6d4b\u8bd5
-----String to Unicode bytes: 6D4B8BD5
-----Unicode bytes to string: 测试
-----Unicode escapes to string: 测试
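For comparison, the byte conversions in the utility class correspond to the JDK's built-in UTF-16BE and UTF-16LE charsets, which write 2 bytes per char without a byte order mark. A short sketch (the class name `Utf16BytesDemo` and the `toHex` helper are made up for this example and stand in for the unshown TBaseNumber helper):

```java
import java.nio.charset.StandardCharsets;

class Utf16BytesDemo {
    // Formats bytes as uppercase hex for display.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6d4b\u8bd5"; // the test string used above
        byte[] be = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] le = s.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(toHex(be)); // 6D4B8BD5
        System.out.println(toHex(le)); // 4B6DD58B
        // Decoding with the same charset restores the string
        System.out.println(new String(be, StandardCharsets.UTF_16BE).equals(s)); // true
    }
}
```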
