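Chinese characters take several bytes each. When truncating a string to a given byte length, how do you handle 1/3 of a Chinese character?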

学习编程知识

Category: java. Tags: function, substring, string, encoding, utf8

Article link: https://blog.csdn.net/never_cxb/article/details/49668927

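A function that truncates a string by bytes

Write a function that truncates a string: the input is a string and a byte count, and the output is the string cut to at most that many bytes. A Chinese character must never be cut in half. For "我ABC" and 4 the result should be "我AB"; for "我ABC汉DEF" and 6 the result should be "我ABC", not "我ABC" plus half of "汉". (These expected results count each Chinese character as 2 bytes, as in GBK.)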

Analysis

substring(beginIndex, endIndex) cannot be used directly, because its indices count characters, while the problem asks for a length in bytes.

Returns a new string that is a substring of this string. The substring begins at the specified beginIndex and extends to the character at index endIndex - 1. Thus the length of the substring is endIndex-beginIndex.
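To make the difference concrete, here is a small sketch. The class name SubstringVsBytes and the helper utf8Length are illustrative names, not part of the original post, and UTF-8 is picked only as an example encoding.

```java
import java.nio.charset.StandardCharsets;

final class SubstringVsBytes {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "我ABC";
        // substring counts chars: this takes the first 2 chars, "我A".
        String firstTwoChars = s.substring(0, 2);
        System.out.println(firstTwoChars.length());    // 2 (chars)
        System.out.println(utf8Length(firstTwoChars)); // 4 (bytes: 3 for 我, 1 for A)
    }
}
```

Asking substring for 2 characters here yields 4 bytes of UTF-8, so a character index cannot stand in for a byte budget.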

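UTF-8 and GBK

UTF-8 is a variable-length, multi-byte encoding designed for international text. It encodes English (ASCII) characters in 8 bits (one byte) and common Chinese characters in 24 bits (three bytes); rarer characters outside the Basic Multilingual Plane take four bytes.

GBK extends the national standard GB2312 and stays compatible with it. It encodes each Chinese character in two bytes, while ASCII characters such as English letters still take a single byte.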

String x = "我";
System.out.println(x.getBytes("utf-8").length);
System.out.println(x.getBytes("GBK").length);
/**
 * Output:
 * 3
 * 2
 */

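String s = "我ABC汗";
System.out.println(new String(s.getBytes("GBK"), "GBK"));
// Output: "我ABC汗"

System.out.println(new String(s.getBytes(), "GBK"));
// Garbled output: 鎴慉BC姹�

System.out.println(new String(s.getBytes(), "utf8"));
// Output: "我ABC汗"

System.out.println(new String(s.getBytes("utf8"), "utf8"));
// Output: "我ABC汗"

System.out.println(new String(s.getBytes(), "ascii"));
// Output: ���ABC���

This shows that getBytes() with no argument encoded the string as UTF-8 here. Decoding those bytes as ASCII leaves the English letters intact, but each Chinese character turns into 3 replacement characters, one per byte.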

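Analysis
On this machine the default charset is UTF-8, so leaving the charset name out happens to work. The default is platform dependent, though, so naming the charset is safer. Encoding and decoding must use the same charset.
Remember that inside the JVM a String is held as Unicode (UTF-16 code units); a concrete byte encoding is chosen only when the text leaves the JVM.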
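The rule that encoding and decoding must agree can be sketched as follows. ExplicitCharset and roundTrip are hypothetical names used for illustration; the sketch passes the charset explicitly through StandardCharsets instead of relying on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharset {
    // Encode and decode with the same, explicitly named charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String s = "我ABC汗";
        // Platform dependent, so no fixed output is shown for this line.
        System.out.println(Charset.defaultCharset());

        // Matching pair: the original string comes back.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s)); // true

        // Mismatched pair: UTF-8 bytes decoded as ISO-8859-1 do not give s back.
        String wrong = new String(s.getBytes(StandardCharsets.UTF_8),
                                  StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(s)); // false
        System.out.println(wrong.length());  // 9: one char per byte (3 + 3 + 3)
    }
}
```

Using the Charset overloads also avoids the checked UnsupportedEncodingException that the String-name overloads declare.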

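Solution

The idea is to walk through the String one character at a time and subtract from the byte budget as we go: 2 for a Chinese character (its size in GBK), 1 for an English character.
String.valueOf(b[i]).getBytes().length > 1 is used to decide whether a character is Chinese.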

static String split(String original, int count) {
    // count is the byte budget.
    // Each Chinese character is counted as 2 bytes (its size in GBK).
    char[] b = original.toCharArray();
    // StringBuilder takes a capacity in chars, which never exceeds the byte count.
    StringBuilder sb = new StringBuilder(Math.max(count, 0));
    for (int i = 0; i < b.length; i++) {
        if (count <= 0) {
            break;
        }

        String temp = String.valueOf(b[i]);

        // More than one byte in the default charset: treat it as a Chinese character.
        if (temp.getBytes().length > 1) {
            count -= 2;
            // Not enough room for the whole character: stop before cutting it in half.
            if (count < 0) {
                break;
            }
        } else {
            count--;
        }

        sb.append(temp);
    }

    return sb.toString();
}
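The function above hard-codes 2 bytes per Chinese character, which fits GBK but not the 3-byte UTF-8 case from the title. A possible generalization, sketched here under the assumption that the target charset is passed in as a parameter (ByteTruncate and truncate are illustrative names, not the author's code), measures each code point in that charset. It is meant for charsets such as UTF-8 or GBK; charsets that emit a byte order mark on every getBytes call, such as UTF-16, would be over-counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ByteTruncate {
    // Longest prefix of s whose encoding in cs fits in maxBytes,
    // never splitting a character (or a surrogate pair).
    static String truncate(String s, int maxBytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        int used = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String unit = new String(Character.toChars(cp));
            int size = unit.getBytes(cs).length; // real size in the target charset
            if (used + size > maxBytes) {
                break; // the next character would not fit as a whole
            }
            sb.append(unit);
            used += size;
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // In UTF-8, 我 takes 3 bytes, so a 4-byte budget holds 我 plus one letter.
        System.out.println(truncate("我ABC", 4, StandardCharsets.UTF_8)); // 我A
        // A 2-byte budget cannot hold 我 at all, so nothing is kept.
        System.out.println(truncate("我ABC", 2, StandardCharsets.UTF_8).isEmpty()); // true
    }
}
```

With a GBK charset passed in, the same loop counts 2 bytes per Chinese character and so matches the expected results in the problem statement ("我ABC" with 4 gives "我AB").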

An attempted improvement

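static String otherSplit(String original, int count) {
    StringBuilder sb = new StringBuilder();
    // count and i close in on each other from both sides.
    // i advances by 1 on every iteration and consumes one character
    // (1 byte if English, 2 bytes if Chinese).
    // For a Chinese character, count is decremented by 1 to pay for its extra byte.
    // The "count - 1" bound is meant to handle a cut that lands on half a Chinese character.
    for (int i = 0; i < count - 1; i++) {
        char c = original.charAt(i);
        if (String.valueOf(c).getBytes().length > 1) {
            count--;
        }
        sb.append(c);
    }
    return sb.toString();
}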

This method has a problem

String s = "我ABCD爱哈BAC汗";
System.out.println(otherSplit(s, 6));
//我ABC
System.out.println(split(s, 6));
//我ABCD

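When the cut lands on an English character, otherSplit drops one letter: the count - 1 loop bound that guards against half a Chinese character also cuts off a final English letter that would have fit.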
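Both versions do the byte bookkeeping by hand, which is where the off-by-one crept in. One way to sidestep it, offered as a sketch and not as the author's method, is to let a java.nio CharsetEncoder fill a ByteBuffer of exactly the allowed size: encoding stops when the next character no longer fits, and the input position tells how many chars were consumed. EncoderTruncate and truncate are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    static String truncate(String s, int maxBytes, Charset cs) {
        if (maxBytes <= 0) {
            return "";
        }
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes); // the byte budget
        CharBuffer in = CharBuffer.wrap(s);
        // Stops with an overflow result once the next character does not fit.
        enc.encode(in, out, true);
        // in.position() is the number of chars that were fully encoded.
        return s.substring(0, in.position());
    }

    public static void main(String[] args) {
        // 我 (3 bytes) + A + B = 5 bytes in UTF-8.
        System.out.println(truncate("我ABC", 5, StandardCharsets.UTF_8)); // 我AB
    }
}
```

The trade-off is that the whole prefix is encoded once just to find the cut point, in exchange for not tracking per-character sizes manually.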

Test

public static void main(String[] args) throws UnsupportedEncodingException {

    String s = "我额A爱哈BAC汗";

    System.out.println(otherSplit(s, 2));
    System.out.println(otherSplit(s, 3));
    System.out.println(otherSplit(s, 4));
    System.out.println(otherSplit(s, 5));
    System.out.println(otherSplit(s, 6));
    System.out.println(otherSplit(s, 9));

    System.out.println(split(s, 2));
    System.out.println(split(s, 3));
    System.out.println(split(s, 4));
    System.out.println(split(s, 5));
    System.out.println(split(s, 6));
    System.out.println(split(s, 9));
}
/**
 Output (first six lines from otherSplit, last six from split):
 我
 我
 我额
 我额
 我额A
 我额A爱哈
 我
 我
 我额
 我额A
 我额A
 我额A爱哈
 */

Determining whether a character is Chinese

public static void main(String[] args) throws UnsupportedEncodingException {

        char c = '我';
        System.out.println(Character.getNumericValue(c));

        String s = String.valueOf(c);
        System.out.println(s.getBytes("utf8").length);
        System.out.println(s.getBytes("gbk").length);

        System.out.println(Arrays.toString(s.getBytes("utf8")));
        System.out.println(Arrays.toString(s.getBytes("gbk")));

    }/**
    -1
    3
    2
    [-26, -120, -111]
    [-50, -46]
     */

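One signal is that Character.getNumericValue returns -1 for a Chinese character, although it also returns -1 for other non-numeric characters such as punctuation, so it is not a reliable test on its own.
Another is the byte count: a Chinese character encodes to several bytes, an English character to 1 byte.
A third is the byte values themselves: Java bytes are signed, so the bytes of the Chinese character above print as negative numbers, while English letters and ASCII symbols fall in the range 0 to 127.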

getNumericValue() only applies to characters that represent numbers, such as the digits ‘0’ through ‘9’. As a convenience, it also treats the ASCII letters as if they were digits in a base-36 number system (so ‘A’ is 10 and ‘Z’ is 35).
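Since getNumericValue was not designed for this purpose, a more direct check is available in the standard library: Character.UnicodeScript reports the script a code point belongs to. The sketch below uses it; HanCheck and isHan are illustrative names, and treating "Chinese" as "Han script" is an assumption (full-width punctuation, for example, is not Han script).

```java
final class HanCheck {
    // True when the code point belongs to the Han script (CJK ideographs).
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(isHan('我')); // true
        System.out.println(isHan('A'));  // false
        System.out.println(isHan('!'));  // false
        // getNumericValue cannot tell these apart: '!' also yields -1.
        System.out.println(Character.getNumericValue('!')); // -1
    }
}
```

Unlike the byte-count test, this check does not depend on the platform default charset.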
