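Normally, when you create a Reader with an explicit charset, no exception is thrown even if the file's actual encoding does not match that charset.
For example: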
// "input" is an InputStream that has already been opened on the file
try {
    Reader reader = new InputStreamReader(input, "GBK");
    BufferedReader bfr = new BufferedReader(reader);
    System.out.println(bfr.readLine());
} catch (IOException e) {
    // UnsupportedEncodingException (a subclass of IOException) is thrown only for an
    // unknown charset name, never because the file content fails to match the charset
    e.printStackTrace();
}
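The output is:
锘�閮ㄧ讲鏃舵敞鎰忎簨椤�
This is mojibake: the charset passed to the Reader does not match the file's actual encoding, so the decoded text is garbage. The leading 锘 is what a UTF-8 byte order mark typically looks like when decoded as GBK, which suggests the file is really UTF-8.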
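The silent substitution can be reproduced without a file. The following is a minimal sketch, not code from the Lucene source: the class name `LenientDecodeDemo` and the helper `decodeLeniently` are made up for illustration, and the sample bytes are hard-coded. It shows that an `InputStreamReader` created from a charset replaces undecodable bytes with U+FFFD instead of throwing.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LenientDecodeDemo {

    // Hypothetical helper: decodes all bytes with the default (replacing) behaviour
    // of InputStreamReader. Malformed input never causes an exception here.
    static String decodeLeniently(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xD6 0xD0 is one Chinese character in GBK, but it is not a valid UTF-8 sequence.
        byte[] notUtf8 = {(byte) 0xD6, (byte) 0xD0};
        String decoded = decodeLeniently(notUtf8, StandardCharsets.UTF_8);
        // No exception was thrown; check whether U+FFFD was substituted for the bad bytes.
        System.out.println(decoded.indexOf('\uFFFD') >= 0);
    }
}
```

Because the reader gives no signal that anything went wrong, the only symptom is garbled text further down the line.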
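So the file's charset needs to be checked when it is read. While studying the Lucene source code, I found that it already provides a helper for this: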
public static Reader getDecodingReader(InputStream stream, Charset charSet) {
    final CharsetDecoder charSetDecoder = charSet.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    return new BufferedReader(new InputStreamReader(stream, charSetDecoder));
}
With this check added when the Reader is created, reading a file whose encoding does not match the expected charset throws an exception:
java.nio.charset.UnmappableCharacterException: Input length = 2
at java.nio.charset.CoderResult.throwException(CoderResult.java:278)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:338)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:203)
at java.io.BufferedReader.read(BufferedReader.java:279)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at org.apache.lucene.util.DecodingReaderTest.test(DecodingReaderTest.java:34)
at org.apache.lucene.util.DecodingReaderTest.main(DecodingReaderTest.java:18)
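Since the mismatch surfaces as a `CharacterCodingException` (the parent of `UnmappableCharacterException` and `MalformedInputException`), it can be caught and turned into a simple check. The sketch below is an illustration under assumptions, not part of Lucene: `CharsetProbe`, `decodesCleanly`, and `firstMatching` are hypothetical names, and the input is an in-memory byte array rather than a file. It builds the same strict decoder and tries a list of candidate charsets in order.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CharsetProbe {

    // Hypothetical helper: true if every byte decodes without error under the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes),
                charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT))) {
            char[] buffer = new char[4096];
            while (reader.read(buffer) != -1) {
                // Drain the whole stream so that a bad byte late in the input is also seen.
            }
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input: the bytes are not valid in this charset.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: returns the first candidate that decodes the bytes cleanly.
    static Optional<Charset> firstMatching(byte[] bytes, List<Charset> candidates) {
        return candidates.stream().filter(cs -> decodesCleanly(bytes, cs)).findFirst();
    }

    public static void main(String[] args) {
        // 0xE9 0x83 0xA8 is the UTF-8 encoding of U+90E8, a single Chinese character.
        byte[] utf8Text = {(byte) 0xE9, (byte) 0x83, (byte) 0xA8};
        List<Charset> candidates = Arrays.asList(StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println(firstMatching(utf8Text, candidates));
    }
}
```

A clean decode is necessary but not sufficient evidence: ISO-8859-1, for instance, accepts any byte sequence, so permissive charsets belong at the end of the candidate list, if they are included at all.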
The complete code is as follows:
package org.apache.lucene.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DecodingReaderTest {

    public static void main(String[] args) {
        test();
    }

    private static void test() {
        String path = "./mytest/configuration.properties";
        // try-with-resources closes the stream and avoids passing a null stream on if opening fails
        try (InputStream input = Files.newInputStream(Paths.get(path));
             BufferedReader bfr = new BufferedReader(getDecodingReader(input, Charset.forName("GBK")))) {
            System.out.println(bfr.readLine());
        } catch (IOException e) {
            // A charset mismatch is reported here as a CharacterCodingException
            e.printStackTrace();
        }
    }

    /**
     * Wraps the given {@link InputStream} in a reader using a {@link CharsetDecoder}.
     * Unlike Java's defaults, this reader throws an exception if it detects that
     * the content read does not match the expected {@link Charset}.
     * <p>
     * Decoding readers are useful for loading configuration files, stopword lists, or synonym files
     * while detecting character set problems. However, it is not recommended as a general-purpose
     * reader.
     * <p>
     * In short: it checks whether the charset of configuration files, dictionary files, and the like
     * matches the configured charset.
     *
     * @param stream the stream to wrap in a reader
     * @param charSet the expected charset
     * @return a wrapping reader
     */
    public static Reader getDecodingReader(InputStream stream, Charset charSet) {
        final CharsetDecoder charSetDecoder = charSet.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return new BufferedReader(new InputStreamReader(stream, charSetDecoder));
    }
}
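The test above reads only the first line, so it can only report problems in the part of the file that has been decoded by then. To validate a whole file, every line has to be read. The sketch below does that with the JDK's own `Files.newBufferedReader(Path, Charset)`, which is documented to throw an `IOException` when a malformed or unmappable byte sequence is read. The class `StrictFileCheck`, the helper `fileMatchesCharset`, and the temporary file contents are assumptions made for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class StrictFileCheck {

    // Hypothetical helper: reads the whole file so that a bad byte anywhere is detected.
    static boolean fileMatchesCharset(Path file, Charset charset) {
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            while (reader.readLine() != null) {
                // Nothing to do: we only care whether decoding succeeds.
            }
            return true;
        } catch (CharacterCodingException e) {
            // The file contains bytes that are invalid in the expected charset.
            return false;
        } catch (IOException e) {
            // Any other I/O failure (missing file, permissions) is a different problem.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("charset-check", ".properties");
        try {
            // "k=" followed by the GBK bytes 0xD6 0xD0, which are not valid UTF-8.
            Files.write(tmp, new byte[] {'k', '=', (byte) 0xD6, (byte) 0xD0, '\n'});
            System.out.println(fileMatchesCharset(tmp, StandardCharsets.UTF_8));
            System.out.println(fileMatchesCharset(tmp, StandardCharsets.ISO_8859_1));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Distinguishing `CharacterCodingException` from other `IOException`s keeps "wrong encoding" separate from "file could not be read", which is usually what a configuration loader wants to report.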