Java file input stream, UTF-8: How do I detect illegal UTF-8 byte sequences in a Java input stream and replace them?

Latest recommended article published on 2023-09-28 23:50:38

更好会很好

Tags: Java file input stream, UTF-8

The file in question is not under my control. Most byte sequences are valid UTF-8; it is not ISO-8859-1 (or another encoding).

I want to do my best to extract as much information as possible.

The file contains a few illegal byte sequences; those should be replaced with the replacement character.

It's not an easy task; I think it requires some knowledge of the UTF-8 state machine.

Oracle has a wrapper which does what I need:

UTF8ValidationFilter javadoc

Is there something like that available (commercially or as free software)?

Thanks

-stephan

Solution:

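import java.io.*;
import java.nio.charset.*;

// istream is the raw InputStream of the file in question.
final BufferedInputStream in = new BufferedInputStream(istream);
final CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
// Substitute the replacement character (U+FFFD) instead of throwing on bad input.
charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
charsetDecoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
final Reader inputReader = new InputStreamReader(in, charsetDecoder);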

Answer:

java.nio.charset.CharsetDecoder does what you need. This class provides charset decoding with user-definable actions on different kinds of errors (see onMalformedInput() and onUnmappableCharacter()).
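To illustrate the error actions, here is a minimal sketch that decodes a byte array containing one invalid byte. The class name `LenientUtf8Demo` and the method `decodeLenient` are made up for this example; only the `java.nio.charset` calls come from the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answer.
public class LenientUtf8Demo {

    // Decodes the bytes as UTF-8; malformed sequences become U+FFFD.
    static String decodeLenient(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected here: both error actions are REPLACE, not REPORT.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] data = {'o', 'k', (byte) 0xFF, '!'};
        String text = decodeLenient(data);
        // The invalid byte shows up as the replacement character U+FFFD.
        System.out.println(text.indexOf('\uFFFD')); // prints 2
    }
}
```

With `CodingErrorAction.REPORT` (the default for a fresh decoder) the same input would raise a `MalformedInputException` instead.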

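CharsetDecoder itself converts a ByteBuffer into a CharBuffer; it does not write to a stream. To read characters, wrap the InputStream in an InputStreamReader constructed with the configured decoder. If you need a filtered InputStream of valid UTF-8 bytes rather than a Reader, re-encode the decoded characters with an OutputStreamWriter on a java.io.PipedOutputStream and read from the connected java.io.PipedInputStream.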
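One possible way to get a filtered InputStream is sketched below, assuming a background thread is acceptable. The names `SanitizingUtf8Stream` and `sanitize` are invented for this example, and this is not the Oracle UTF8ValidationFilter. It decodes leniently, re-encodes to UTF-8, and pumps the result through a pipe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, shown only as a sketch.
public class SanitizingUtf8Stream {

    // Returns a stream of well-formed UTF-8 in which every malformed
    // sequence from 'raw' has been replaced by U+FFFD (bytes EF BF BD).
    static InputStream sanitize(InputStream raw) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        PipedInputStream sanitized = new PipedInputStream();
        PipedOutputStream sink = new PipedOutputStream(sanitized);

        // Piped streams expect the writer and the reader on different threads.
        Thread pump = new Thread(() -> {
            try (Reader reader = new InputStreamReader(raw, decoder);
                 Writer writer = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
                char[] buffer = new char[8192];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    writer.write(buffer, 0, n);
                }
            } catch (IOException e) {
                // Closing the writer closes the pipe; the consumer then
                // sees end of stream. Real code should log this failure.
            }
        });
        pump.setDaemon(true);
        pump.start();
        return sanitized;
    }
}
```

If the file fits in memory, decoding everything into a String and wrapping the re-encoded bytes in a ByteArrayInputStream avoids the extra thread.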
