Running Apache Tika in Server Mode


We use Apache Tika for plain-text extraction from PDF files. Tika does a good job, except that it takes quite long to produce results. As an example, extracting the text from a 234-slide PDF presentation takes about 3.5 seconds on my laptop. This can become a performance problem if you want to extract text not from a single file but from, say, 12,000 files.
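
To put that number in perspective, here is a rough back-of-the-envelope estimate. It assumes, purely for illustration, that every file costs the same 3.5 seconds as the large presentation mentioned above, which is probably pessimistic for smaller documents. The helper name estimate_hours is hypothetical and not part of Tika.

```shell
# Hypothetical helper: estimate total wall-clock hours for a batch,
# assuming every file takes the same number of seconds (a simplification).
estimate_hours() {
    # $1 = number of files, $2 = seconds per file
    awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f\n", n * s / 3600 }'
}

estimate_hours 12000 3.5   # 12,000 files at 3.5 s each
```

Under that assumption, the batch needs roughly 11.7 hours of sequential processing, which is why the per-file cost matters.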

Here is the command I used to measure how long it takes to extract the content of a document (note that -h selects HTML output, while -t selects plain text):

$ time java -jar tika-app-1.3.jar -h some.pdf
[...]
real    0m2.935s
user    0m4.640s
sys     0m0.178s

Now Tika can also be run in a server mode. Here is the command to start Tika as a server:

$ java -jar tika-app-1.3.jar -t --server --port 12345

You can now pass your (PDF) documents to that server (e.g. with NetCat) and get the same kind of result as before. As you can see, things become a lot faster:

$ time nc 127.0.0.1 12345 < some.pdf
real    0m0.386s
user    0m0.003s
sys     0m0.015s

So if you have the same performance problems with Tika as we had, this might be a solution!
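
For the many-files case, the same NetCat call can be wrapped in a loop. The following is only a sketch under a few assumptions: a Tika server is already running on 127.0.0.1:12345 as started above, the inputs are *.pdf files in a single directory, and the function name extract_dir, the variables TIKA_HOST and TIKA_PORT, and the .txt output naming are made up for this example.

```shell
# Assumed defaults: they match the server started with --port 12345.
TIKA_HOST="${TIKA_HOST:-127.0.0.1}"
TIKA_PORT="${TIKA_PORT:-12345}"

# Hypothetical helper: send every *.pdf in directory $1 to the running
# Tika server and store each response as $2/<name>.txt.
# The body runs in a subshell ( ... ) so its variables stay local.
extract_dir() (
    in_dir="$1"
    out_dir="$2"
    mkdir -p "$out_dir"
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # skip when the glob matches nothing
        name="$(basename "$pdf" .pdf)"
        # One connection per document, exactly like the single-file call.
        nc "$TIKA_HOST" "$TIKA_PORT" < "$pdf" > "$out_dir/$name.txt"
    done
)
```

A call such as extract_dir ./pdfs ./texts would then process the whole directory while paying the JVM startup cost only once, for the server. Some netcat variants keep the connection open after the input ends; if the loop hangs, check your variant's option for closing on end of input.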


Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional, but also more recent, one is the JAX-RS (JSR-311) server component, which is a separate jar.

The Tika-App Network Server is very simple to use. Start Tika-App with the --server flag and a --port ### flag telling it which port to listen on. Then connect to that port and send it a single file; you'll get back the HTML version. NetCat works well for this: something like java -jar tika-app.jar --server --port 12345 followed by nc 127.0.0.1 12345 < MyFileToExtract will get you back the HTML.

The JAX-RS (JSR-311) server component supports a few different URLs for things like metadata, plain text, etc. You start the server with java -jar tika-server.jar, then make HTTP PUT calls to the appropriate URL with your input document, and you'll get the result back. There are loads of details and examples (including using curl for testing) on the wiki page.
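
As a minimal sketch of such PUT calls with curl: the default port 9998 and the /tika and /meta paths below are what tika-server has commonly used, but treat them as assumptions and verify them against the wiki page for your version. The function names tika_text and tika_meta and the TIKA_URL variable are made up for this example.

```shell
# Assumed base URL: tika-server started with java -jar tika-server.jar
# is expected to listen on port 9998 by default (verify for your version).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper: PUT a document to the (assumed) /tika endpoint
# and ask for plain text. curl's -T option uploads the file via HTTP PUT.
tika_text() {
    curl -sS -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika"
}

# Hypothetical helper: PUT a document to the (assumed) /meta endpoint
# and ask for the metadata as JSON.
tika_meta() {
    curl -sS -T "$1" -H "Accept: application/json" "$TIKA_URL/meta"
}
```

With a server running, tika_text some.pdf would print the extracted text and tika_meta some.pdf the document metadata.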

The Tika App Network Server is fairly simple, supports only one mode (extract to HTML), and is generally used for testing, demos, and prototyping. The Tika JAX-RS Server is a fully RESTful service that talks HTTP and exposes a wide range of Tika's modes. It's the generally recommended way these days to interface with Tika over the network and/or from non-Java stacks.

