A Simple Apache Tika Example

程裕强

Category: Java Programming. Tags: tika, document parsing

Article link: https://blog.csdn.net/chengyuqiang/article/details/85295902

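1. Maven pom.xml

Create a Maven project and add the following dependencies. Note that tika-parsers already pulls in tika-core transitively, and tika-app is the standalone command-line bundle, so the code below only strictly needs tika-parsers.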

        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-core -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-core</artifactId>
            <version>1.20</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-parsers -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.20</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.tika/tika-app -->
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-app</artifactId>
            <version>1.20</version>
        </dependency>

2. Java class

package cn.hadron.tikademo.util;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

public class TikaUtil {

    // Default write limit: 10 * 1024 * 1024 = 10,485,760 characters.
    private static final int DEFAULT_LIMIT = 10 * 1024 * 1024;

    // Extracts the body text of a file using the default limit.
    public static String parse(String filePath) throws Exception {
        return parse(filePath, DEFAULT_LIMIT);
    }

    // Extracts the body text of a file. A limit below DEFAULT_LIMIT is raised to DEFAULT_LIMIT.
    public static String parse(String filePath, int limit) throws Exception {
        File file = new File(filePath);
        if (!file.exists()) {
            System.out.println("Target file does not exist!");
            return null;
        }
        BodyContentHandler handler = new BodyContentHandler(Math.max(limit, DEFAULT_LIMIT));
        Metadata meta = new Metadata();
        ParseContext context = new ParseContext();
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream input = new FileInputStream(file)) {
            new AutoDetectParser().parse(input, handler, meta, context);
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String content = TikaUtil.parse("D:\\tika\\a.doc");
        System.out.println(content);
    }
}

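Notes:
The no-argument BodyContentHandler() constructor accepts at most 100,000 characters. Parsing a larger document with it fails with the following error.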
org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
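Solution:
Pass a larger character limit to the BodyContentHandler(int) constructor. For example, 10 * 1024 * 1024 allows documents of up to about 10 million characters (10,485,760).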

 new BodyContentHandler(10 * 1024 * 1024);
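
To make the meaning of the limit concrete, here is a minimal plain-Java sketch of a write-limited text buffer. It is a simplified model based on what the error message above describes (text up to the limit stays available), not Tika's actual implementation. The names WriteLimitSketch, LimitedBuffer, LimitReachedException and collect are hypothetical and exist only for this illustration. Like Tika's BodyContentHandler(int), the sketch treats -1 as "no limit".

```java
// Simplified, hypothetical model of a character write limit; not Tika's real code.
public class WriteLimitSketch {

    /** Thrown when more characters arrive than the limit allows. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Document contained more than " + limit + " characters");
        }
    }

    /** Collects text up to a fixed number of characters; -1 disables the limit. */
    static class LimitedBuffer {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedBuffer(int limit) {
            this.limit = limit;
        }

        void append(String chunk) {
            if (limit == -1 || text.length() + chunk.length() <= limit) {
                text.append(chunk);
                return;
            }
            // Keep whatever still fits, then signal that the limit was hit.
            text.append(chunk, 0, limit - text.length());
            throw new LimitReachedException(limit);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Feeds the chunks into a buffer and returns the collected text, even if truncated. */
    public static String collect(int limit, String... chunks) {
        LimitedBuffer buffer = new LimitedBuffer(limit);
        try {
            for (String chunk : chunks) {
                buffer.append(chunk);
            }
        } catch (LimitReachedException e) {
            System.err.println("Truncated: " + e.getMessage());
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(5, "abc", "defg"));  // prints abcde (limit hit)
        System.out.println(collect(-1, "abc", "defg")); // prints abcdefg (no limit)
        System.out.println(10 * 1024 * 1024);           // prints 10485760
    }
}
```

The limit counts characters, not bytes of the source file, so a large binary document can still produce a short text. With a real handler, the same pattern (catch the limit exception, then read handler.toString()) would presumably return the truncated text, but that should be verified against the Tika version in use.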

3. Output

[Image: screenshot of the program output]
