Truncating a String That Contains Chinese Characters by Bytes

Latest recommended article published on 2022-01-21 15:04:19

Styx1222


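The task is to implement a method that truncates a string by bytes. For example, for the string "我ZWR爱JAVA", taking the first four bytes should yield "我ZW" rather than "我ZWR", and the method must never cut a Chinese character in half.

English letters and Chinese characters occupy different numbers of bytes depending on the encoding. The following example shows how many bytes one English letter and one Chinese character take in several common encodings.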

Java code

import java.io.UnsupportedEncodingException;

public class EncodeTest {

    /**
     * Prints the number of bytes the string occupies in the given encoding,
     * followed by the encoding name, to the console.
     *
     * @param s            the string
     * @param encodingName the encoding name
     */
    public static void printByteLength(String s, String encodingName) {
        System.out.print("Bytes: ");
        try {
            System.out.print(s.getBytes(encodingName).length);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        System.out.println("; encoding: " + encodingName);
    }

    public static void main(String[] args) {
        String en = "A";
        String ch = "人";

        // Byte count of one English letter in each encoding
        System.out.println("English letter: " + en);
        EncodeTest.printByteLength(en, "GB2312");
        EncodeTest.printByteLength(en, "GBK");
        EncodeTest.printByteLength(en, "GB18030");
        EncodeTest.printByteLength(en, "ISO-8859-1");
        EncodeTest.printByteLength(en, "UTF-8");
        EncodeTest.printByteLength(en, "UTF-16");
        EncodeTest.printByteLength(en, "UTF-16BE");
        EncodeTest.printByteLength(en, "UTF-16LE");

        System.out.println();

        // Byte count of one Chinese character in each encoding
        System.out.println("Chinese character: " + ch);
        EncodeTest.printByteLength(ch, "GB2312");
        EncodeTest.printByteLength(ch, "GBK");
        EncodeTest.printByteLength(ch, "GB18030");
        EncodeTest.printByteLength(ch, "ISO-8859-1");
        EncodeTest.printByteLength(ch, "UTF-8");
        EncodeTest.printByteLength(ch, "UTF-16");
        EncodeTest.printByteLength(ch, "UTF-16BE");
        EncodeTest.printByteLength(ch, "UTF-16LE");
    }
}

The output is as follows:

English letter: A
Bytes: 1; encoding: GB2312
Bytes: 1; encoding: GBK
Bytes: 1; encoding: GB18030
Bytes: 1; encoding: ISO-8859-1
Bytes: 1; encoding: UTF-8
Bytes: 4; encoding: UTF-16
Bytes: 2; encoding: UTF-16BE
Bytes: 2; encoding: UTF-16LE

Chinese character: 人
Bytes: 2; encoding: GB2312
Bytes: 2; encoding: GBK
Bytes: 2; encoding: GB18030
Bytes: 1; encoding: ISO-8859-1
Bytes: 3; encoding: UTF-8
Bytes: 4; encoding: UTF-16
Bytes: 2; encoding: UTF-16BE
Bytes: 2; encoding: UTF-16LE

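UTF-16BE and UTF-16LE are two members of the Unicode encoding family. The Unicode Standard defines three encoding forms (UTF-8, UTF-16, and UTF-32) and seven encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. Java represents strings internally as sequences of UTF-16 code units. The results above show that in GB2312, GBK, and GB18030 an English letter takes one byte and the Chinese character takes two, which matches the example in the task, so any of the three can satisfy the requirement. The rest of this article uses GBK.

What happens if we simply cut the byte array at the requested length? Let's test it: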
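The table above reports 4 bytes for a single character under "UTF-16" but only 2 under UTF-16BE and UTF-16LE. The following minimal sketch dumps the encoded bytes to show where the extra two come from: Java's "UTF-16" charset prepends a byte order mark when encoding. The class name Utf16BomDemo and the helper hex are hypothetical names introduced only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {

    /** Formats a byte array as space-separated, two-digit uppercase hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" writes a big-endian byte order mark (FE FF) before the data
        System.out.println("UTF-16:   " + hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // The BE/LE variants fix the byte order and write no mark
        System.out.println("UTF-16BE: " + hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println("UTF-16LE: " + hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

Because of that mark, byte counts obtained through "UTF-16" are always two higher than the encoded text itself, which is one more reason a byte-based cut is easier to reason about in GBK.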

Java code

import java.io.UnsupportedEncodingException;

public class CutString {

    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "我ZWR爱JAVA";
        // Get the bytes of the string in GBK
        byte[] data = s.getBytes("GBK");
        byte[] tmp = new byte[6];
        // Copy the first six bytes of data into tmp
        System.arraycopy(data, 0, tmp, 0, 6);
        // Decode the six bytes with the same charset they were encoded in
        // (instead of the platform default) and print the result
        s = new String(tmp, "GBK");
        System.out.println(s);
    }
}

Output:

我ZWR?

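When the first six bytes are taken, the second Chinese character "爱" is cut in half, so it cannot be decoded and shows up as "?". This is clearly a problem.

We cannot simply use the String method substring(int beginIndex, int endIndex), because it cuts by characters: '我' and 'Z' each count as one character with a length of 1. In fact, once we can tell Chinese characters and English letters apart, the problem is solved, and the difference is that in GBK a Chinese character takes two bytes while an English letter takes one.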
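A cut that lands inside a character can also be detected programmatically rather than by looking at the console. The sketch below is one possible check, not part of the original article: it decodes with a CharsetDecoder configured to report malformed input instead of silently replacing it. The class name StrictDecodeDemo and the method isWellFormed are hypothetical names used only here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

class StrictDecodeDemo {

    /** Returns true if the bytes decode completely and without errors in the given charset. */
    static boolean isWellFormed(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown, for example, when the input ends in the middle of a two-byte character
            return false;
        }
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] data = "我ZWR爱JAVA".getBytes(gbk);
        System.out.println(isWellFormed(Arrays.copyOf(data, 5), gbk)); // true: cut right after "R"
        System.out.println(isWellFormed(Arrays.copyOf(data, 6), gbk)); // false: cut inside "爱"
    }
}
```

This only tells us that a cut is bad; it does not by itself choose a better cut position.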
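To make the gap between character counts and byte counts concrete, here is a small sketch comparing the two for the string "我ZWR爱JAVA". It assumes GBK as the target encoding, as in the rest of the article; the class name CharVsByteDemo and the helper gbkByteCount are hypothetical and introduced only for illustration.

```java
import java.nio.charset.Charset;

class CharVsByteDemo {

    private static final Charset GBK = Charset.forName("GBK");

    /** Number of bytes the string occupies when encoded in GBK. */
    static int gbkByteCount(String s) {
        return s.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        String s = "我ZWR爱JAVA";
        System.out.println(s.length());       // 9: every char counts as 1
        System.out.println(gbkByteCount(s));  // 11: two Chinese characters take 2 bytes each

        // substring works on char indexes, so asking for "4" returns 4 chars...
        String cut = s.substring(0, 4);
        System.out.println(cut);              // 我ZWR
        // ...which is 5 bytes in GBK, one more than a 4-byte limit allows
        System.out.println(gbkByteCount(cut)); // 5
    }
}
```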

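Java code

import java.io.UnsupportedEncodingException;

public class CutString {

    /**
     * Determines whether a character is a Chinese character.
     *
     * @param c the character
     * @return true if it is a Chinese character, false if it is an English letter
     * @throws UnsupportedEncodingException if an encoding not supported by Java is used
     */
    public static boolean isChineseChar(char c)
            throws UnsupportedEncodingException {
        // If it takes more than one byte, it is a Chinese character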