Java byte parsing: parsing Unicode bytes in Java

I'm reading some data from a file as a stream of bytes, and I've just encountered some Unicode strings that I'm not sure how best to handle.

Each character is using two bytes, with only the first seeming to contain actual data, so for example the string 'trust' is stored in the file as:

0x74 0x00(t) 0x72 0x00(r) ...and so on

Normally I'd just use a regex to replace the zeros with nothing and therefore remove the whitespace. However, the spaces between words within the file are implemented using 0x00 0x00, so trying to do a simple String 'replaceAll' is kind of messing it up a little.

I've tried playing around with the String encoding sets, such as 'ISO-8859-1' and 'UTF-8/16', but every time I end up with whitespace.

I did create a simple regex to remove the double zero hex values, which is:

new String(bytes).replaceAll("[\\00]{2,}", "");

But this obviously only works for the double zero, and I'd really like to replace single zeros with nothing, and double zeros with an actual ASCII/Unicode space character.

I could have sworn that one of the Java string format settings dealt with this kind of thing, but I might be wrong. So should I work on creating a regex to strip out the zeros, or does Java actually provide the mechanisms for doing it?

Thanks

Solution

That's "UTF-16LE". 0x00 0x00 actually encodes the NUL character in UTF-16 so that's what you will get.
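A minimal sketch of decoding such bytes in Java, assuming the whole array is UTF-16LE text with no byte-order mark. The class name `Utf16LeDecode` and the method `decode` are made up for illustration, and mapping NUL to a space is just one way to get the result asked for in the question:

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeDecode {
    // Decodes UTF-16LE bytes, then turns each NUL character (0x00 0x00) into a space.
    static String decode(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_16LE);
        return text.replace('\u0000', ' ');
    }

    public static void main(String[] args) {
        byte[] bytes = {
            0x74, 0x00, 0x72, 0x00, 0x75, 0x00, 0x73, 0x00, 0x74, 0x00, // "trust"
            0x00, 0x00,                                                 // NUL used as a word separator
            0x6D, 0x00, 0x65, 0x00                                      // "me"
        };
        System.out.println(decode(bytes)); // prints "trust me"
    }
}
```

Decoding with the right charset removes the need for a regex on the raw bytes; the only post-processing left is the NUL-to-space replacement.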

This encoding can encode about a million different characters, using 2 or 4 bytes per character. The first 256 characters are encoded with the second byte 0x00 and if the text only contains those it could be seen as useless, but it's required for the rest of the characters. For instance, the euro currency symbol € would show up as 0xAC 0x20.
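The byte layout described above can be checked by encoding a few characters and printing the result. This is only an illustrative sketch; the class `Utf16LeEncode` and helper `hex` are hypothetical names, and U+1F600 is picked arbitrarily as a character outside the 2-byte range:

```java
import java.nio.charset.StandardCharsets;

public class Utf16LeEncode {
    // Returns the UTF-16LE bytes of the text as space-separated hex values.
    static String hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_16LE)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("t"));      // 0x74 0x00
        System.out.println(hex("\u20AC")); // euro sign: 0xAC 0x20
        // A character above U+FFFF is stored as a surrogate pair, so 4 bytes.
        System.out.println(hex(new String(Character.toChars(0x1F600)))); // 0x3D 0xD8 0x00 0xDE
    }
}
```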
