Parsing encrypted Office files with Tika

Use Tika (https://tika.apache.org) to detect a file's MIME type and check whether it is the correct type for the file's extension.

For internal MIME types or file extensions not covered by Tika, we can declare them in a custom MIME type configuration file like the one below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<mime-info>
  <mime-type type="application/octet-stream">
    <glob pattern="*.unl"/>
  </mime-type>
</mime-info>

 

As mentioned in the Tika documentation (https://tika.apache.org/1.13/detection.html), magic detection may not be enough for container-based formats.

Password-protected OOXML files are actually stored in an OLE2 (application/x-tika-msoffice) container. When I tried this with tika-parsers, encrypted Microsoft Office OOXML files all returned the same media type, 'application/x-tika-ooxml-protected'. For reference, see the functions testDetectProtectedOOXML() and testDetectProtectedOLE2() in
https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java

The relevant entries in tika-mimetypes.xml, which defines the valid MIME types used by Tika, are:

<mime-type type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <_comment>Office Open XML Workbook</_comment>
  <glob pattern="*.xlsx"/>
  <sub-class-of type="application/x-tika-ooxml"/>
</mime-type>

...

<mime-type type="application/vnd.ms-excel">
  <!-- Use DefaultDetector / org.apache.tika.parser.microsoft.POIFSContainerDetector for more reliable detection of OLE2 documents -->
  ...
  <glob pattern="*.xls"/>
  ...
  <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>

So the check passes if you change the file extension from '.xlsx' to '.xls': the input stream and the file name then agree on the media type 'application/x-tika-msoffice', because '.xls' maps to application/vnd.ms-excel, which is a sub-class of it.
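
To make that reasoning concrete, here is a minimal sketch in plain Java. It assumes the check walks the sub-class-of chain upward from the type the extension maps to and accepts the file once it reaches the detected container type. The class ExtensionCheck and both lookup tables are hypothetical and copied by hand from the two excerpts above; they are not Tika APIs, and real code would more likely query Tika's own media type registry instead of keeping a private table.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: models the extension check with hand-copied data.
final class ExtensionCheck {
    // sub-class-of links taken from the two tika-mimetypes.xml excerpts.
    private static final Map<String, String> PARENT = Map.of(
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/x-tika-ooxml",
        "application/vnd.ms-excel",
        "application/x-tika-msoffice");

    // glob patterns from the same excerpts, reduced to bare extensions.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "xls", "application/vnd.ms-excel");

    // True when the detected type is the extension's type or one of its ancestors.
    static boolean isConsistent(String fileName, String detectedType) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = BY_EXTENSION.get(ext);
        while (type != null) {
            if (type.equals(detectedType)) {
                return true;
            }
            type = PARENT.get(type); // climb one level; null ends the walk
        }
        return false;
    }
}
```

With the detected type 'application/x-tika-msoffice', isConsistent("report.xlsx", ...) returns false, while isConsistent("report.xls", ...) returns true, which matches the behaviour described above for an encrypted workbook.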

Note:

Using magic detection, it is easy to spot that a given file is an OLE2 document, or a Zip file. Using magic detection alone, it is very difficult (and often impossible) to tell what kind of file lives inside the container.

For some use cases, speed is important, so having a quick way to know the container type is sufficient. For other cases however, you don't mind spending a bit of time (and memory!) processing the container to get a more accurate answer on its contents. For these cases, the additional container aware detectors contained in the Tika Parsers jar should be used.
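
As an illustration of why magic detection only identifies the container, the sketch below classifies a file from its first bytes alone. MagicSniffer is a hypothetical helper written for this post, not a Tika class; it knows only the OLE2 compound document signature and the ZIP local file header signature, and it reuses Tika's container type names as return values purely for readability.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: looks at leading bytes only.
final class MagicSniffer {
    // OLE2 compound document signature (legacy Office files, encrypted OOXML).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature (unencrypted OOXML and other ZIP-based formats).
    private static final byte[] ZIP = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a byte prefix; says nothing about what lives inside the container.
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return "application/x-tika-msoffice";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    // Reads at most 8 bytes from the file and classifies them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[OLE2.length];
            int read = in.readNBytes(head, 0, head.length);
            return sniff(Arrays.copyOf(head, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

An encrypted .xlsx and an ordinary .xls both start with the OLE2 signature, so a check like this cannot tell them apart; that distinction needs the container-aware detectors mentioned in the note.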

 

 

 

 

Reposted from: https://my.oschina.net/cdt/blog/1837606
