Java file input stream and UTF-8: how can I detect illegal UTF-8 byte sequences and replace them in a Java input stream?

The file in question is not under my control. Most of its byte sequences are valid UTF-8; it is not ISO-8859-1 (or any other encoding).

I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences, and those should be replaced with the replacement character (U+FFFD).

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need:

UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks

-stephan

Solution:

// istream is the raw InputStream of the file
final BufferedInputStream in = new BufferedInputStream(istream);
// Substitute U+FFFD for illegal byte sequences instead of throwing
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);
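
For anyone who wants to try the snippet above in isolation, here is a minimal runnable sketch. The class name `LenientUtf8ReaderDemo`, the helper `readAll`, and the in-memory sample bytes are illustrative assumptions, not part of the original post; a `ByteArrayInputStream` stands in for the real file stream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are chosen for illustration only.
class LenientUtf8ReaderDemo {

    // Reads the whole stream as UTF-8, replacing illegal sequences with U+FFFD.
    static String readAll(InputStream istream) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new BufferedInputStream(istream), decoder)) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xFE can never appear in well-formed UTF-8.
        byte[] sample = {'h', 'i', (byte) 0xFE, '!'};
        String decoded = readAll(new ByteArrayInputStream(sample));
        System.out.println(decoded.equals("hi\uFFFD!")); // prints "true"
    }
}
```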

Answer

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
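
To illustrate the difference between the error actions, here is a small sketch that uses `CodingErrorAction.REPORT` to detect illegal input and `CodingErrorAction.REPLACE` to substitute it. The class and method names (`MalformedInputActions`, `isValidUtf8`, `decodeLenient`) are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative only.
class MalformedInputActions {

    // REPORT makes decode() throw on the first illegal sequence,
    // which turns the decoder into a validity check.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // REPLACE keeps decoding and emits the decoder's replacement
    // string (U+FFFD by default) for each illegal sequence.
    static String decodeLenient(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return lenient.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bad = {'a', (byte) 0xFF, 'b'}; // 0xFF is never legal in UTF-8
        System.out.println(isValidUtf8(bad));                        // prints "false"
        System.out.println(decodeLenient(bad).equals("a\uFFFDb"));   // prints "true"
    }
}
```

A third action, `CodingErrorAction.IGNORE`, silently drops the offending bytes, which loses the information that something was wrong at that position.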

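CharsetDecoder itself converts a ByteBuffer into a CharBuffer rather than writing to a stream. To use it on a stream, pass the configured decoder to the InputStreamReader constructor, which gives you a Reader that applies those error actions. If you need bytes rather than characters, you can re-encode the Reader's output and pipe it through java.io.PipedOutputStream into a java.io.PipedInputStream, effectively creating a filtered InputStream.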
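
As a rough sketch of such a filtered InputStream: instead of pipes (which need a second thread to feed them), the version below re-encodes the decoded characters on demand. This is a swapped-in technique, not what the answer literally describes, and `SanitizingUtf8InputStream` and `wrap` are hypothetical names, not JDK or Oracle classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical filter: yields well-formed UTF-8 bytes, with U+FFFD
// substituted for each illegal sequence in the underlying stream.
class SanitizingUtf8InputStream extends InputStream {
    private final Reader reader;
    private byte[] buf = new byte[0]; // re-encoded bytes not yet handed out
    private int pos = 0;
    private char pendingHigh = 0;     // high surrogate held back between reads (0 = none)

    private SanitizingUtf8InputStream(Reader reader) {
        this.reader = reader;
    }

    static InputStream wrap(InputStream raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        return new SanitizingUtf8InputStream(new InputStreamReader(raw, decoder));
    }

    @Override
    public int read() throws IOException {
        while (pos >= buf.length) {
            if (!fill()) {
                return -1;
            }
        }
        return buf[pos++] & 0xFF;
    }

    // Decodes the next chunk of characters and re-encodes it as UTF-8.
    private boolean fill() throws IOException {
        char[] chars = new char[1024];
        int off = 0;
        if (pendingHigh != 0) {
            chars[off++] = pendingHigh;
            pendingHigh = 0;
        }
        int n = reader.read(chars, off, chars.length - off);
        if (n == -1) {
            if (off == 0) {
                return false; // nothing left at all
            }
            n = 0;
        }
        int len = off + n;
        // Do not split a surrogate pair across two encode calls.
        if (n > 0 && Character.isHighSurrogate(chars[len - 1])) {
            pendingHigh = chars[--len];
        }
        buf = new String(chars, 0, len).getBytes(StandardCharsets.UTF_8);
        pos = 0;
        return true;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {'a', (byte) 0xFF, 'b'};
        try (InputStream in = wrap(new ByteArrayInputStream(raw))) {
            int b;
            while ((b = in.read()) != -1) {
                System.out.printf("%02X ", b); // prints "61 EF BF BD 62 "
            }
        }
    }
}
```

The sketch only overrides the single-byte `read()`, so bulk reads fall back to the inherited loop; wrapping the result in a `BufferedInputStream` is one way to reduce that overhead.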
