How to Extract File Content with Tika
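What is Tika?
- Tika, in full Apache Tika, is a library for detecting file types and extracting content from files in many different formats.
- Tika relies on existing file parsers and document type detection techniques to detect and extract data.
- With Tika you can extract content from many kinds of files, such as spreadsheets, text files, images, PDFs, and even multimedia formats, recovering structured text and metadata to a certain extent.
- Tika provides one generic API for parsing different file formats. It uses 83 existing specialized parser libraries, all of which are wrapped behind a single interface called Parser.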
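To make the "single interface" idea concrete, here is a minimal sketch of many format-specific parsers sitting behind one interface and being selected by media type. It uses only the JDK. `SimpleParser`, `ParserRegistrySketch`, and the `text/x-upper` media type are hypothetical names invented for this illustration; Tika's real `Parser` interface has a different, richer signature.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ParserRegistrySketch {

    /** Hypothetical, simplified stand-in for a parser interface (not Tika's real signature). */
    public interface SimpleParser {
        String parse(byte[] content);
    }

    // One registry entry per media type; callers never see the concrete parser.
    private final Map<String, SimpleParser> parsersByType = new HashMap<>();

    public void register(String mediaType, SimpleParser parser) {
        parsersByType.put(mediaType, parser);
    }

    /** Looks up the parser registered for the media type and delegates to it. */
    public String parse(String mediaType, byte[] content) {
        SimpleParser parser = parsersByType.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.parse(content);
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        // Two toy "parsers" standing in for format-specific libraries.
        registry.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        registry.register("text/x-upper",
                bytes -> new String(bytes, StandardCharsets.UTF_8).toUpperCase(Locale.ROOT));

        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/plain", data));
        System.out.println(registry.parse("text/x-upper", data));
    }
}
```

The calling code stays the same whichever parser is registered, which is the property the generic API is meant to provide.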
File formats supported by Tika
File format | Package and underlying library | Tika class |
---|---|---|
XML | org.apache.tika.parser.xml | XMLParser |
HTML | org.apache.tika.parser.html (uses the TagSoup library) | HtmlParser |
MS Office compound documents (OLE2 up to 2007, OOXML from 2007 onward) | org.apache.tika.parser.microsoft and org.apache.tika.parser.microsoft.ooxml (use the Apache POI library) | OfficeParser (OLE2), OOXMLParser (OOXML) |
OpenDocument Format (OpenOffice) | org.apache.tika.parser.odf | OpenDocumentParser |
Portable Document Format (PDF) | org.apache.tika.parser.pdf (uses the Apache PDFBox library) | PDFParser |
Electronic Publication Format (digital books) | org.apache.tika.parser.epub | EpubParser |
Rich Text Format | org.apache.tika.parser.rtf | RTFParser |
Compression and packaging formats | org.apache.tika.parser.pkg (uses the Apache Commons Compress library) | PackageParser, CompressorParser, and their subclasses |
Text format | org.apache.tika.parser.txt | TXTParser |
Feed and syndication formats | org.apache.tika.parser.feed | FeedParser |
Audio formats | org.apache.tika.parser.audio and org.apache.tika.parser.mp3 | AudioParser, MidiParser, Mp3Parser (for MP3) |
Image formats | org.apache.tika.parser.jpeg | JpegParser (for JPEG images) |
Video formats | org.apache.tika.parser.mp4 and org.apache.tika.parser.video (the latter uses a simple algorithm to parse Flash video) | MP4Parser, FLVParser |
Java class files and jar files | org.apache.tika.parser.asm | ClassParser, CompressorParser |
Mbox format (email messages) | org.apache.tika.parser.mbox | MboxParser |
CAD formats | org.apache.tika.parser.dwg | DWGParser |
Font formats | org.apache.tika.parser.font | TrueTypeParser |
Executable programs and libraries | org.apache.tika.parser.executable | ExecutableParser |
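Graphical user interface (GUI)
- Tika ships a runnable jar file (the tika-app jar) that provides a graphical interface.
- Press Win+R, enter cmd to open a command prompt, and run `java -jar <path to the jar file>` to launch the GUI.
- Click Open and select a file; Tika detects its type and displays the parsed result.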
Code implementation
Maven dependencies:

    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>1.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>jbig2-imageio</artifactId>
            <version>3.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.xerial</groupId>
            <artifactId>sqlite-jdbc</artifactId>
            <version>3.8.11.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.17</version>
        </dependency>
    </dependencies>

Note: the second and third dependencies are optional. Parsing works without them, but Tika prints warnings at runtime.
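Extracting PDF content with Tika

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    public String parsePdf() {
        File file = new File("C:\\Users\\FileRecv\\test1.pdf");
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // To extract image information instead, use JpegParser:
            // JpegParser jpegParser = new JpegParser();
            // Extract the PDF text
            PDFParser pdfParser = new PDFParser();
            pdfParser.parse(inputStream, handler, metadata, parseContext);
            // To inspect the metadata, iterate over metadata.names():
            // for (String name : metadata.names()) {
            //     System.out.println(name + ": " + metadata.get(name));
            // }
            return handler.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Fall back to an empty string when the file cannot be parsed
        return "";
    }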
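Tika parsers report what they read as XHTML SAX events, and `BodyContentHandler` keeps the text found inside the body. The sketch below imitates that idea with the JDK's own SAX parser, so it runs without Tika on the classpath. `BodyTextSketch` and `BodyTextHandler` are hypothetical names for this illustration only, and the real handler does more (for example, it enforces a write limit).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    /** Collects character data that appears inside the body element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // 0 means "outside body"

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track nesting of everything below it.
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Parses an XHTML string and returns only the text inside its body. */
    public static String extractBodyText(String xhtml) {
        BodyTextHandler handler = new BodyTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The title text is dropped because it sits outside the body, which mirrors why `handler.toString()` in the method above returns the document text rather than its metadata.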
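Extracting Excel content with Tika

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
    import org.apache.tika.sax.BodyContentHandler;

    public String parseExcel() {
        File file = new File("C:\\Users\\FileRecv\\book1.xlsx");
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // OOXMLParser handles the Office 2007+ formats such as .xlsx
            OOXMLParser msOfficeParser = new OOXMLParser();
            msOfficeParser.parse(inputStream, handler, metadata, parseContext);
            return handler.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }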
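Extracting a text document with Tika

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.txt.TXTParser;
    import org.apache.tika.sax.BodyContentHandler;

    public String parseTxt() {
        File file = new File("C:\\Users\\FileRecv\\笔记.txt");
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext parseContext = new ParseContext();
            // TXTParser handles plain-text files
            TXTParser txtParser = new TXTParser();
            txtParser.parse(inputStream, handler, metadata, parseContext);
            return handler.toString();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "";
    }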
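Language detection with Tika

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.language.LanguageIdentifier;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    public String detectLanguage() throws IOException, TikaException, SAXException {
        File file = new File("C:\\Users\\FileRecv\\笔记.txt");
        BodyContentHandler handler = new BodyContentHandler();
        // AutoDetectParser chooses the concrete parser from the detected file type
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream inputStream = new FileInputStream(file)) {
            parser.parse(inputStream, handler, new Metadata(), new ParseContext());
        }
        // Identify the language of the extracted text
        return new LanguageIdentifier(handler.toString()).getLanguage();
    }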
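Language identification of this kind is typically based on statistical character n-gram profiles. As a rough illustration of the principle only, the toy below builds two tiny trigram profiles from one sample sentence each and picks the profile that shares the most trigrams with the input. `TrigramLanguageSketch`, its sample sentences, and the overlap score are assumptions made for this sketch. It is not how `LanguageIdentifier` is implemented, and it would be unreliable on real text.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class TrigramLanguageSketch {

    // Toy profiles from a single sentence each; real profiles come from large corpora.
    private static final Map<String, Set<String>> PROFILES = new LinkedHashMap<>();
    static {
        PROFILES.put("en", trigrams(
                "the quick brown fox jumps over the lazy dog and then the cat sat on the mat"));
        PROFILES.put("de", trigrams(
                "der schnelle braune fuchs und der faule hund und die katze sitzt auf der matte"));
    }

    /** Returns the distinct three-character windows of the lower-cased text. */
    static Set<String> trigrams(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= lower.length(); i++) {
            grams.add(lower.substring(i, i + 3));
        }
        return grams;
    }

    /** Picks the profile sharing the most distinct trigrams with the text, or "unknown". */
    public static String detect(String text) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> profile : PROFILES.entrySet()) {
            int score = 0;
            for (String gram : grams) {
                if (profile.getValue().contains(gram)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(detect("the dog and the cat"));
        System.out.println(detect("der hund und die katze"));
    }
}
```

Short inputs give such a scheme very little evidence, so in practice language detection works best on a reasonable amount of extracted text.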
Detecting the file type and extracting a .docx file with Tika

    import java.io.File;
    import java.io.IOException;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public String getContext() throws IOException, TikaException {
        File file = new File("C:\\Users\\FileRecv\\oracle安装教程.docx");
        Tika tika = new Tika();
        // Detect the media type
        String detect = tika.detect(file);
        // Extract the text content
        String fileContent = tika.parseToString(file);
        // Alternative for .docx only: read the text directly with Apache POI
        // XWPFDocument document = new XWPFDocument(new FileInputStream(file));
        // XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
        // return wordExtractor.getText();
        return detect; // return fileContent instead to get the extracted text
    }
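Type detection relies in part on "magic bytes", the fixed signatures that many formats place at the start of a file. The JDK-only sketch below checks two well-known signatures to show the idea. `MagicByteSketch` is a hypothetical name, and the real detector goes much further: a .docx or .xlsx file is a ZIP container, so its specific type can only be determined by also looking at the file name or inside the container.

```java
import java.nio.charset.StandardCharsets;

public class MagicByteSketch {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04}; // "PK" 0x03 0x04

    /** Guesses a media type from the first bytes of a file. */
    public static String detect(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            // Also the container format behind .docx, .xlsx, and .jar files.
            return "application/zip";
        }
        // Generic fallback for unrecognized binary data.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00}));
        System.out.println(detect(new byte[] {0x00, 0x01}));
    }
}
```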