Detecting the Character Encoding of Text

元小帅

Last modified 2023-11-10 22:08:34

Tags: CharsetDetector CharsetMatch

First published 2023-11-10 14:42:13

Article link: https://blog.csdn.net/zengrenyuan/article/details/134332563

Apache Tika

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parser-text-module</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>

Reading the character encoding

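import org.apache.commons.io.FileUtils;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;


  // `file` is the java.io.File to inspect; FileUtils comes from Apache Commons IO.
  CharsetDetector detector = new CharsetDetector();
  detector.setText(FileUtils.readFileToByteArray(file));
  // Returns the highest-confidence match (may be null if nothing matches).
  CharsetMatch charsetMatch = detector.detect();
  // Returns all candidate matches, ordered by decreasing confidence.
  CharsetMatch[] matches = detector.detectAll();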
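
Detection only yields a guess, so it can help to check that the bytes really decode with the reported charset name (for example the value of `CharsetMatch.getName()`) before trusting it. The following is a minimal sketch that uses only the JDK. `StrictDecode` and its `decode` method are hypothetical helpers written for illustration and are not part of Tika. The decoder is configured to report malformed input rather than silently substituting replacement characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (not a Tika class): strictly decodes bytes with a candidate charset.
final class StrictDecode {

    // Returns the decoded text, or empty if the charset name is unknown
    // or the bytes are not valid in that charset.
    static Optional<String> decode(byte[] data, String charsetName) {
        Charset charset;
        try {
            charset = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // Covers null, illegal, and unsupported charset names.
            return Optional.empty();
        }
        try {
            String text = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
            return Optional.of(text);
        } catch (CharacterCodingException e) {
            // The candidate charset cannot represent these bytes.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Two CJK characters encoded as UTF-8 (written as escapes to avoid source-encoding issues).
        byte[] utf8 = "\u7f16\u7801".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8, "UTF-8").isPresent());
        System.out.println(decode(utf8, "US-ASCII").isPresent());
    }
}
```

One possible use is to walk the array returned by `detectAll()` in order and keep the first candidate whose name decodes cleanly. Note that single-byte charsets such as ISO-8859-1 accept any byte sequence, so a clean decode is a sanity check and not proof that the guess is right.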