Java UTF-8 length: calculating the UTF-8 length of a Java String without actually encoding it

Does anyone know if the standard Java library (any version) provides a means of calculating the length of the binary encoding of a string (specifically UTF-8 in this case) without actually generating the encoded output? In other words, I'm looking for an efficient equivalent of this:

"some really long string".getBytes("UTF-8").length

I need to calculate a length prefix for potentially long serialized messages.

Solution

Here's an implementation based on the UTF-8 specification:

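public class Utf8LenCounter {

    /** Returns the number of bytes the sequence would occupy in UTF-8. */
    public static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                // U+0000..U+007F: one byte
                count++;
            } else if (ch <= 0x7FF) {
                // U+0080..U+07FF: two bytes
                count += 2;
            } else if (Character.isHighSurrogate(ch)) {
                // A surrogate pair encodes one supplementary code point: four bytes.
                // Skip the low surrogate that is assumed to follow.
                count += 4;
                ++i;
            } else {
                // Rest of the Basic Multilingual Plane: three bytes
                count += 3;
            }
        }
        return count;
    }
}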

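This implementation is not tolerant of malformed strings: an unpaired high surrogate is counted as 4 bytes and the char after it is skipped, and an unpaired low surrogate is counted as 3 bytes, whereas getBytes encodes each unpaired surrogate as a single '?' byte.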
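If malformed input is possible, the surrogate handling can be made stricter. The following is only a sketch, not part of the original answer: the class name LenientUtf8Len is made up for illustration, and it assumes you want the same result as String.getBytes with UTF-8, which replaces each unpaired surrogate with a single '?' byte.

```java
// Hypothetical variant of the counter that tolerates unpaired surrogates.
// Assumption: the goal is to match String.getBytes(StandardCharsets.UTF_8),
// which substitutes one '?' byte for every unpaired surrogate.
final class LenientUtf8Len {

    static int length(CharSequence sequence) {
        int count = 0;
        for (int i = 0, len = sequence.length(); i < len; i++) {
            char ch = sequence.charAt(i);
            if (ch <= 0x7F) {
                count++;
            } else if (ch <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < len
                    && Character.isLowSurrogate(sequence.charAt(i + 1))) {
                // Well-formed surrogate pair: four bytes, consume both chars.
                count += 4;
                i++;
            } else if (Character.isSurrogate(ch)) {
                // Unpaired surrogate: counted as the one-byte '?' replacement.
                count++;
            } else {
                count += 3;
            }
        }
        return count;
    }
}
```

The only change from the strict version is the check that a low surrogate really follows a high one. The result is still an int, so it can overflow for extremely long inputs; switching the accumulator to long would be one way around that.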

Here's a JUnit 4 test that verifies Utf8LenCounter against getBytes for every defined code point:

import java.nio.charset.Charset;

import org.junit.Assert;
import org.junit.Test;

public class LenCounterTest {

    @Test
    public void testUtf8Len() {
        Charset utf8 = Charset.forName("UTF-8");
        AllCodepointsIterator iterator = new AllCodepointsIterator();
        while (iterator.hasNext()) {
            // Build a one-code-point string and compare against the real encoder.
            String test = new String(Character.toChars(iterator.next()));
            Assert.assertEquals(test.getBytes(utf8).length,
                    Utf8LenCounter.length(test));
        }
    }

    /** Walks the defined code points, skipping the surrogate range. */
    private static class AllCodepointsIterator {
        private static final int MAX = 0x10FFFF; // see http://unicode.org/glossary/
        private static final int SURROGATE_FIRST = 0xD800;
        private static final int SURROGATE_LAST = 0xDFFF;

        private int codepoint = 0;

        public boolean hasNext() { return codepoint < MAX; }

        public int next() {
            int ret = codepoint;
            codepoint = next(codepoint);
            return ret;
        }

        // Advances to the next defined, non-surrogate code point.
        private int next(int codepoint) {
            while (codepoint++ < MAX) {
                if (codepoint == SURROGATE_FIRST) { codepoint = SURROGATE_LAST + 1; }
                if (!Character.isDefined(codepoint)) { continue; }
                return codepoint;
            }
            return MAX;
        }
    }
}

Please excuse the compact formatting.
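As for doing this with the standard library alone: one option is to run a CharsetEncoder over the string through a small, reused buffer, so the byte count is obtained without ever holding the whole encoded output. This is a sketch under assumptions rather than a recommendation: the class name EncoderLenCounter and the 1024-byte scratch size are arbitrary choices for illustration, and it still does the encoding work, so it saves memory rather than CPU time.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: counts UTF-8 bytes by encoding into a scratch buffer
// that is emptied and reused, so the full output is never materialized.
final class EncoderLenCounter {

    static long length(CharSequence sequence) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                // REPLACE mirrors what String.getBytes does with malformed input.
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        CharBuffer in = CharBuffer.wrap(sequence);
        ByteBuffer scratch = ByteBuffer.allocate(1024); // arbitrary size
        long total = 0;
        CoderResult result;
        do {
            // OVERFLOW means the scratch buffer filled up: count it and reuse it.
            result = encoder.encode(in, scratch, true);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        do {
            // Flush any bytes the encoder may still be holding.
            result = encoder.flush(scratch);
            total += scratch.position();
            scratch.clear();
        } while (result.isOverflow());
        return total;
    }
}
```

The hand-written counter above avoids the encoding pass entirely, so it is likely the cheaper choice when only the length is needed; the encoder-based version is mainly useful when you want the library's own handling of malformed input.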
