How to Extract File Content with Tika

What Is Tika?

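  • Tika, in full Apache Tika, is a library for detecting file types and extracting content from files in many different formats.

  • Tika relies on existing file parsers and document type detection techniques to detect and extract data.

  • With Tika you can easily extract content from many kinds of files, such as spreadsheets, text files, images, PDFs, and even multimedia formats, recovering structured text and metadata to a certain extent.

  • Tika provides one generic API for parsing different file formats. It draws on 83 existing specialized parser libraries, all of which are wrapped behind a single interface called Parser.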
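Because every parser implements the same Parser interface, calling code does not have to change when the file format does. The following is a minimal sketch of that idea, assuming tika-parsers is on the classpath. The class name ParserInterfaceSketch and the helper extractText are made up for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

public class ParserInterfaceSketch {

    // Hypothetical helper: accepts any Parser implementation,
    // since they all expose the same four-argument parse(...) method.
    static String extractText(Parser parser, InputStream stream) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "Hello from Tika".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(sample)) {
            // Swap TXTParser for PDFParser, OOXMLParser, etc. without touching extractText.
            System.out.println(extractText(new TXTParser(), in).trim());
        }
    }
}
```

Passing a different Parser implementation is the only change needed to handle another format.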

File Formats Supported by Tika

| File format | Package (and underlying library) | Parser class in Tika |
| --- | --- | --- |
| XML | org.apache.tika.parser.xml | XMLParser |
| HTML | org.apache.tika.parser.html (uses the TagSoup library) | HtmlParser |
| MS Office compound documents (OLE2 up to 2007, OOXML from 2007 onwards) | org.apache.tika.parser.microsoft and org.apache.tika.parser.microsoft.ooxml (use the Apache POI library) | OfficeParser (OLE2), OOXMLParser (OOXML) |
| OpenDocument Format (OpenOffice) | org.apache.tika.parser.odf | OpenDocumentParser |
| Portable Document Format (PDF) | org.apache.tika.parser.pdf (uses the Apache PDFBox library) | PDFParser |
| Electronic Publication Format (digital books) | org.apache.tika.parser.epub | EpubParser |
| Rich Text Format | org.apache.tika.parser.rtf | RTFParser |
| Compression and packaging formats | org.apache.tika.parser.pkg (uses the Commons Compress library) | PackageParser, CompressorParser and its subclasses |
| Text format | org.apache.tika.parser.txt | TXTParser |
| Feed and syndication formats | org.apache.tika.parser.feed | FeedParser |
| Audio formats | org.apache.tika.parser.audio and org.apache.tika.parser.mp3 | AudioParser, MidiParser, Mp3Parser (for MP3) |
| Image formats | org.apache.tika.parser.jpeg | JpegParser (for JPEG images) |
| Video formats | org.apache.tika.parser.mp4 and org.apache.tika.parser.video (Flash video is parsed with a simple built-in algorithm) | MP4Parser, FLVParser |
| Java class files and JAR files | org.apache.tika.parser.asm | ClassParser, CompressorParser |
| Mbox format (email messages) | org.apache.tika.parser.mbox | MboxParser |
| CAD formats | org.apache.tika.parser.dwg | DWGParser |
| Font formats | org.apache.tika.parser.font | TrueTypeParser |
| Executable programs and libraries | org.apache.tika.parser.executable | ExecutableParser |
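You rarely need to pick one of these classes by hand: AutoDetectParser detects the type and delegates to a matching parser. The sketch below assumes tika-parsers is on the classpath. The class name AutoDetectSketch and the helper parseAndDescribe are hypothetical. It prints the detected Content-Type together with the X-Parsed-By metadata values, which record the parser classes that handled the input.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AutoDetectSketch {

    // Hypothetical helper: parses the bytes with AutoDetectParser and
    // returns the metadata that Tika filled in along the way.
    static Metadata parseAndDescribe(byte[] content) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = new ByteArrayInputStream(content)) {
            new AutoDetectParser().parse(in, new BodyContentHandler(), metadata, new ParseContext());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "plain text sample".getBytes(StandardCharsets.UTF_8);
        Metadata metadata = parseAndDescribe(sample);

        System.out.println("Content-Type: " + metadata.get("Content-Type"));
        // Each value is the class name of a parser that took part in the parse.
        for (String parserName : metadata.getValues("X-Parsed-By")) {
            System.out.println("Parsed by: " + parserName);
        }
    }
}
```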

Graphical User Interface (GUI)

[Screenshot of the Tika GUI; image not available]

Code Implementation

Maven dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.17</version>
    </dependency>

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>jbig2-imageio</artifactId>
        <version>3.0.0</version>
    </dependency>

    <dependency>
        <groupId>org.xerial</groupId>
        <artifactId>sqlite-jdbc</artifactId>
        <version>3.8.11.2</version>
    </dependency>

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>

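Note: the second and third dependencies (jbig2-imageio and sqlite-jdbc) are optional. Extraction works without them, but Tika prints warnings at runtime.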

Extracting PDF Content with Tika

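// Imports needed by this method
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public String parsePdf() {
    File file = new File("C:\\Users\\FileRecv\\test1.pdf");

    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();

        // To extract image information instead, use JpegParser:
        // JpegParser jpegParser = new JpegParser();
        // Parse the PDF
        PDFParser pdfParser = new PDFParser();
        pdfParser.parse(fileInputStream, handler, metadata, parseContext);

        // To inspect the metadata collected during parsing:
        // for (String name : metadata.names()) {
        //     System.out.println(name + ": " + metadata.get(name));
        // }
        return handler.toString();
    } catch (Exception e) {
        e.printStackTrace();
    }

    // Reached only when parsing failed
    return "";
}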
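One thing to watch for with large documents: the no-argument BodyContentHandler constructor caps the extracted text at 100,000 characters and throws a SAXException once the cap is exceeded. Passing -1 as the write limit removes the cap. A minimal sketch, assuming tika-parsers is on the classpath; the class name WriteLimitSketch and the helper extractAll are made up for illustration, and generated text stands in for a real PDF so the example stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WriteLimitSketch {

    // Hypothetical helper: extracts all text from the stream with no size cap.
    static String extractAll(InputStream stream) throws Exception {
        // -1 disables the write limit; the default constructor stops at 100,000 characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build an input that is well over the default limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append("line of text\n");
        }
        byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);

        try (InputStream in = new ByteArrayInputStream(bytes)) {
            String text = extractAll(in);
            System.out.println("Extracted characters: " + text.length());
        }
    }
}
```

Removing the cap means the whole document is held in memory, so for very large inputs a finite limit may still be the safer choice.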

Extracting Excel Content with Tika

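// Imports needed by this method
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;

public String parseExcel() {
    File file = new File("C:\\Users\\FileRecv\\book1.xlsx");

    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();

        // OOXMLParser handles Office 2007+ formats such as .xlsx
        OOXMLParser ooxmlParser = new OOXMLParser();
        ooxmlParser.parse(fileInputStream, handler, metadata, parseContext);
        return handler.toString();
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Reached only when parsing failed
    return "";
}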

Extracting a Text Document with Tika

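// Imports needed by this method
import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.txt.TXTParser;
import org.apache.tika.sax.BodyContentHandler;

public String parseTxt() {
    File file = new File("C:\\Users\\FileRecv\\笔记.txt");

    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();

        // TXTParser detects the character encoding and reads the plain text
        TXTParser txtParser = new TXTParser();
        txtParser.parse(fileInputStream, handler, metadata, parseContext);
        return handler.toString();
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Reached only when parsing failed
    return "";
}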

Language Detection with Tika

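// Imports needed by this method
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public String detectLanguage() throws IOException, TikaException, SAXException {
    Parser parser = new AutoDetectParser();
    File file = new File("C:\\Users\\FileRecv\\笔记.txt");

    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();

    // Extract the text first; try-with-resources closes the stream afterwards
    try (FileInputStream fileInputStream = new FileInputStream(file)) {
        parser.parse(fileInputStream, handler, metadata, parseContext);
    }

    // Identify the language of the extracted text; returns an ISO 639 code such as "en".
    // Note: LanguageIdentifier is deprecated in newer Tika releases in favor of the LanguageDetector API.
    LanguageIdentifier languageIdentifier = new LanguageIdentifier(handler.toString());
    return languageIdentifier.getLanguage();
}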

Detecting the File Type and Extracting a .docx File with Tika

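// Imports needed by this method
import java.io.File;
import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public String getContext() throws IOException, TikaException {
    File file = new File("C:\\Users\\FileRecv\\oracle安装教程.docx");
    Tika tika = new Tika();
    // Detect the file type (returned as a MIME type string)
    String detect = tika.detect(file);
    // Extract the text content
    String fileContent = tika.parseToString(file);

    // This example returns the detected type; return fileContent instead to get the text.
    return detect;

    /* Alternative: read the .docx directly with Apache POI
       (the XWPF classes come from the poi-ooxml artifact):
    FileInputStream inputStream = new FileInputStream(file);
    XWPFDocument document = new XWPFDocument(inputStream);
    XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
    String doc = wordExtractor.getText();
    return doc;*/
}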
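To give a feel for what type detection involves, here is a deliberately simplified, stdlib-only sketch that checks a file's leading "magic bytes". This is not Tika's implementation: Tika combines a much larger signature database with file-name hints and inspection of container formats. The class name MagicByteSketch and its three signatures are illustrative assumptions only.

```java
import java.nio.charset.StandardCharsets;

public class MagicByteSketch {

    // Returns true if data begins with the given signature bytes.
    static boolean startsWith(byte[] data, byte[] signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (data[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a MIME type from the first bytes of a file (three signatures only).
    static String guessType(byte[] head) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        // ZIP local file header; .docx and .xlsx files are ZIP containers.
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        // Fallback for unknown content
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guessType(pdfHead));
    }
}
```

A byte-signature check alone can only say that a .docx file is a ZIP archive; telling it apart from other ZIP-based formats requires looking inside the container, which is part of what a library like Tika does for you.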