Overview
Projects frequently accept uploaded text files, and files containing Chinese text often come out garbled. The root cause is that the encoding the user saved the file with differs from the encoding used to decode it. There are two basic solutions:
- Require users to upload files in a prescribed encoding
- Detect the file encoding automatically
The first approach is simple and blunt: typically you provide a sample file in the required default encoding for users to download. But it leaves a lot to chance, so a general-purpose automatic detection is worth considering. There are many toolkits for detecting file encodings; a few are sampled here for study.
These toolkits share the same basic principle: take a run of bytes and match it against the rules of each candidate encoding in turn. To keep things simple, the tests below use local files instead of a real web environment (a web upload delivers a byte stream; simpler still, a byte array can be used directly).
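The matching described above usually starts with the cheapest rule: a byte-order mark at the head of the file. A minimal sketch of that first step (BomSniffer is a made-up name, not part of any library):

```java
// Hypothetical sketch: most detectors first look for a byte-order mark (BOM)
// before falling back to statistical matching against each candidate encoding.
public class BomSniffer {
    // Returns the charset implied by the BOM, or null if no BOM is present.
    public static String sniff(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // Notepad's "unicode big endian"
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // Notepad's "unicode"
        }
        return null; // no BOM: e.g. ANSI/GBK text, or BOM-less UTF-8
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[]{(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'}));
        // prints UTF-8
    }
}
```

A null result is exactly where the statistical detectors shown later take over.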
Example
There are many detection utility classes; those shown here are examples for reference only.
The tests cover ansi, unicode, unicode big endian, and utf-8 (the names Windows Notepad uses when saving), each read as a file stream.
A further simplification is to test with a byte array; to get the same effect as a file stream, wrap the byte array in a stream:
/* A stream that must be read repeatedly (once to detect the encoding, once to
   get the content) has to support reset.
   Note: some parsers accept byte arrays directly, but processing a byte array
   is not the same as processing a stream and may yield different results. */
BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(content.getBytes("GBK")));
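The detect-then-read pattern behind that reset requirement can be sketched as follows (a minimal sketch; MarkResetDemo and readTwice are made-up names):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Sketch of reading the same stream twice: the first pass consumes a few bytes
// (as a detector would), then reset() rewinds so the second pass sees everything.
public class MarkResetDemo {
    public static String readTwice(byte[] data, String charset) throws IOException {
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        in.mark(data.length + 1); // remember position 0; the limit must cover what we read
        in.read(new byte[2]);     // first pass: pretend to sniff the encoding
        in.reset();               // rewind; without mark/reset these bytes would be lost
        byte[] all = new byte[data.length];
        int n = in.read(all);     // second pass: read the full content
        return new String(all, 0, n, charset);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTwice("中国".getBytes("GBK"), "GBK")); // prints 中国
    }
}
```

A raw FileInputStream does not support mark/reset, which is why the wrapping BufferedInputStream matters.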
tika
package charset;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.detect.AutoDetectReader;
import org.junit.Test;

public class ParseCharset {
    public static String content = "中国";

    @Test
    public void parseByTika() {
        AutoDetectReader detect = null;
        InputStream in = null;
        try {
            in = new FileInputStream("C:\\Users\\admin\\Desktop\\temp\\test.txt");
            //detect = new AutoDetectReader(getInputStream(charsetName));
            detect = new AutoDetectReader(in);
            Charset charset = detect.getCharset();
            //System.out.println(charset.name());
            String row = null;
            while ((row = detect.readLine()) != null) {
                // non-UTF results are re-decoded so the console shows readable text
                if (!charset.name().startsWith("UTF"))
                    row = new String(row.getBytes(charset.name()), "GBK");
                System.out.println("charset : " + charset.name() + "; content : " + row);
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (detect != null) {
                    detect.close(); // also closes the wrapped stream
                } else if (in != null) {
                    in.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        /***************Run results****************/
        /* unicode big endian
        charset : UTF-16BE; content : 中国
        ansi
        charset : IBM855; content : 中国
        unicode
        charset : UTF-16LE; content : 中国
        utf-8
        charset : UTF-8; content : 中国
        Note: ANSI files generally cannot be identified reliably; they end up
        handled as ISO-8859-1 (a single-byte encoding, so no data is lost).
        */
        /***************Dependencies****************/
        /*
         * pom dependencies
         * <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.16</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.16</version>
        </dependency>
         * Tika can parse essentially every common file format, returning the
         * file's metadata, content, and other structured information.
         * It can report the file format, file content, file encoding, the
         * language of the text, and more.
         */
}
}
The core of Tika's detection: AutoDetectReader is configured with three detectors, Icu4jEncodingDetector, UniversalEncodingDetector, and HtmlEncodingDetector, which it polls in turn. UniversalEncodingDetector serves as the example below.
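That polling is a simple first-non-null loop, sketched below. This is a hypothetical re-creation, not Tika's code: the nested Detector interface stands in for Tika's EncodingDetector, which actually takes an InputStream and a Metadata object rather than a byte array.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Sketch of the polling loop: each detector gets a chance in order, and the
// first non-null answer wins; null from every detector means detection failed.
public class DetectorChain {
    interface Detector {
        Charset detect(byte[] head);
    }

    static Charset detect(List<Detector> detectors, byte[] head) {
        for (Detector d : detectors) {
            Charset cs = d.detect(head); // a detector returns null when unsure
            if (cs != null) {
                return cs;               // first successful detector wins
            }
        }
        return null;                     // no detector matched
    }

    public static void main(String[] args) {
        Detector icuStandIn = head -> null;                       // gives up
        Detector universalStandIn = head -> StandardCharsets.UTF_8; // matches
        System.out.println(detect(Arrays.asList(icuStandIn, universalStandIn), new byte[0]));
        // prints UTF-8
    }
}
```

Ordering the detectors therefore matters: an early detector that guesses eagerly shadows later, possibly more accurate ones.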
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.tika.parser.txt;

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.metadata.Metadata;

public class UniversalEncodingDetector implements EncodingDetector {

    private static final int BUFSIZE = 1024;

    private static final int LOOKAHEAD = 16 * BUFSIZE;

    public Charset detect(InputStream input, Metadata metadata)
            throws IOException {
        if (input == null) {
            return null;
        }

        input.mark(LOOKAHEAD);
        try {
            UniversalEncodingListener listener =
                    new UniversalEncodingListener(metadata);

            byte[] b = new byte[BUFSIZE];
            int n = 0;
            int m = input.read(b);
            while (m != -1 && n < LOOKAHEAD && !listener.isDone()) {
                n += m;
                listener.handleData(b, 0, m);
                m = input.read(b, 0, Math.min(b.length, LOOKAHEAD - n));
            }

            return listener.dataEnd();
        } catch (LinkageError e) {
return null; // j