Tika HTML body, Apache Tika

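Apache Tika offers a convenient way to parse file content, including automatically detecting the file type and invoking the appropriate parser. It can return the text content in different formats, such as plain text, HTML, or XHTML. By using specific content handlers, you can extract the document body, run XPath queries, and even detect phone numbers. Tika also supports a translation system that lets you translate content into other languages, and it can identify the language of a text.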

Apache Tika API Usage Examples

This page provides a number of examples of how to use the various Tika APIs. All of the examples shown are also available in the Tika Example module in Git.

Tika provides a number of different ways to parse a file. These provide different levels of control, flexibility, and complexity.

The Tika facade provides a number of quick and easy ways to have your content parsed by Tika and to get back the resulting plain text.

For more control, you can call the Tika Parsers directly. Most likely, you'll want to start out using the Auto-Detect Parser, which automatically figures out what kind of content you have, then calls the appropriate parser for you.

With Tika, you can get the textual content of your files returned in a number of different formats: plain text, HTML, XHTML, the XHTML of just one part of the file, and so on. The format is determined by the ContentHandler you supply to the Parser.

By using the BodyContentHandler, you can request that Tika return only the content of the document's body as a plain-text string.

By using the ToXMLContentHandler, you can get the XHTML content of the whole document as a string.

If you just want the body of the XHTML document, without the header, you can chain together a BodyContentHandler and a ToXMLContentHandler by passing the ToXMLContentHandler to the BodyContentHandler constructor.

It is possible to execute XPath queries on the parse results, to fetch only certain parts of the XHTML.
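To make the idea concrete, here is a minimal sketch that runs an XPath expression over an XHTML string such as the one a ToXMLContentHandler collects. Note that it swaps in the JDK's own `javax.xml.xpath` API rather than Tika's XPath support, so that it runs without Tika on the classpath. The class name `XhtmlXPathSketch` and the `select` helper are hypothetical, and the sketch assumes the input is well-formed XML without a DOCTYPE.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: evaluates an XPath expression against an XHTML string. */
class XhtmlXPathSketch {

    /** Returns the text content of every node matched by the expression. */
    static List<String> select(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // XHTML elements live in a namespace, so parse namespace-aware
            // and match on local-name() in the expression.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }
}
```

For example, `select(xhtml, "//*[local-name()='body']//*[local-name()='p']")` would return the text of each paragraph in the body. Unlike a streaming handler, this approach builds the whole document in memory first, which is a trade-off to keep in mind for large files.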

The textual output of parsing a file with Tika is returned via the SAX ContentHandler you pass to the parse method. You can customise your parsing by supplying your own ContentHandler that performs whatever extra processing you need.
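As an illustration of a custom handler, the sketch below collects the text of every XHTML heading (`h1` to `h6`) from a SAX event stream. It uses only the JDK's `org.xml.sax` classes, so it compiles without Tika. The class name `HeadingCollector` is hypothetical, and the sketch assumes the parser reports headings as `h1`..`h6` elements in its XHTML output. To use it with Tika, you would pass an instance as the ContentHandler argument of `Parser.parse(InputStream, ContentHandler, Metadata, ParseContext)`.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of every h1..h6 element it sees. */
class HeadingCollector extends DefaultHandler {
    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a heading

    private static boolean isHeading(String localName, String qName) {
        // Namespace-aware producers fill localName; otherwise fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return name != null && name.matches("h[1-6]");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (current == null && isHeading(localName, qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside headings is ignored.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null && isHeading(localName, qName)) {
            headings.add(current.toString().trim());
            current = null;
        }
    }

    public List<String> getHeadings() {
        return headings;
    }
}
```

Because the handler keeps only heading text, memory use stays small even when the document body is large.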

By using the PhoneExtractingContentHandler, you can have any phone numbers found in the textual content of the document extracted and placed into the Metadata object for you.

Sometimes you want to chunk the resulting text up, perhaps to output it as you go and so minimise memory use, perhaps to write it to HDFS files, or for any other reason! With a small custom content handler, you can do that.
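A chunking handler along these lines could look like the following minimal sketch. It is not the code from the Tika Example module: the class name `ChunkingHandler`, the fixed chunk size, and the `Consumer<String>` sink are all assumptions made for illustration. It relies only on JDK SAX classes and assumes the parser calls `endDocument()` when it finishes, so that the final partial chunk is flushed.

```java
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: emits parsed text in fixed-size chunks as it arrives. */
class ChunkingHandler extends DefaultHandler {
    private final int chunkSize;
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        // Hand off every full chunk immediately, so at most
        // chunkSize - 1 characters stay buffered between calls.
        while (buffer.length() >= chunkSize) {
            sink.accept(buffer.substring(0, chunkSize));
            buffer.delete(0, chunkSize);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left as a final, shorter chunk.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

The sink could append to a list, write to a file, or push to a queue. Note that cutting at a fixed character count can split a word (or a surrogate pair) across two chunks; splitting on whitespace instead would be a reasonable refinement.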

Tika provides a pluggable Translation system, which allows you to send the results of parsing off to an external system or program to have the text translated into another language.

In order to use the Microsoft Translation API, you need to sign up for a Microsoft account, get an API key, then pass the key to Tika before translating.

Tika provides support for identifying the language of text, through the LanguageIdentifier class.

A number of other examples are also available, including all of the examples from the Tika in Action book. These can all be found in the Tika Example module in Git.
