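The highest Tika version compatible with JDK 1.6 is Tika 1.4.
Tika 1.4 has some problems when reading the content of TXT files. If you know a better solution, please leave a comment below.
The methods below cover the doc, docx, ppt, pptx, xls, xlsx, wps, pdf, rtf, htm, and txt formats.
They show how to obtain a stream, the metadata, and the content (all tested by the author).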
1) Obtaining a BufferedReader stream
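// Requires java.io.*, java.nio.charset.Charset, org.apache.commons.io.FilenameUtils,
// org.apache.tika.Tika and org.apache.tika.detect.AutoDetectReader.
public BufferedReader getReader(File file) throws Exception
{
    String extension = FilenameUtils.getExtension(file.getName());
    if (!"txt".equalsIgnoreCase(extension))
    {
        // Every other format: let Tika parse the file and expose its text as a Reader.
        return new BufferedReader(new Tika().parse(file));
    }
    // TXT: use AutoDetectReader only to detect the charset, then reopen the file with it.
    FileInputStream fis = new FileInputStream(file);
    Charset charset;
    try
    {
        charset = new AutoDetectReader(fis).getCharset();
    }
    finally
    {
        fis.close(); // the detection stream is closed even if detection fails
    }
    // The caller is responsible for closing the returned reader.
    return new BufferedReader(new InputStreamReader(new FileInputStream(file), charset));
}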
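Whoever receives the BufferedReader has to consume it and close it. The following is a minimal, JDK-only sketch of that step, written without try-with-resources so that it still compiles on JDK 1.6. The class name ReaderUsage and the method readLines are hypothetical, and a StringReader stands in for the reader that getReader(file) would return.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it is not part of Tika.
final class ReaderUsage
{
    // Reads every line from the reader and always closes it afterwards.
    static List<String> readLines(BufferedReader reader) throws IOException
    {
        List<String> lines = new ArrayList<String>();
        try
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                lines.add(line);
            }
        }
        finally
        {
            reader.close();
        }
        return lines;
    }

    public static void main(String[] args) throws IOException
    {
        // Stand-in for getReader(file); real code would pass the File instead.
        BufferedReader reader =
            new BufferedReader(new StringReader("first line\nsecond line"));
        for (String line : readLines(reader))
        {
            System.out.println(line);
        }
    }
}
```

Closing in a finally block matters most on the TXT path, where the reader wraps a plain FileInputStream.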
2) Obtaining the content (TXT content is read with the Apache Commons IO package)
public String getContent(File file) throws Exception
{
    String extension = FilenameUtils.getExtension(file.getName());
    if (!"txt".equalsIgnoreCase(extension))
    {
        // Every other format: Tika extracts the plain text directly.
        // Note: parseToString truncates at the Tika instance's maximum string
        // length (100,000 characters by default).
        return new Tika().parseToString(file);
    }
    // TXT: detect the charset with AutoDetectReader, then read the file with Commons IO.
    FileInputStream fis = new FileInputStream(file);
    try
    {
        Charset charset = new AutoDetectReader(fis).getCharset();
        return FileUtils.readFileToString(file, charset);
    }
    finally
    {
        fis.close();
    }
}
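The post does not say what exactly goes wrong with TXT files in Tika 1.4. If the trouble turns out to be charset detection, one JDK-only fallback is to look for a byte order mark before trusting any detector. The sketch below rests on that assumption and is not the author's method: the class BomSniffer and its method detect are hypothetical, and it recognizes only UTF-8 and UTF-16 marks, returning null otherwise so the caller can fall back to AutoDetectReader.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration; it is not part of Tika or Commons IO.
final class BomSniffer
{
    // Returns the charset announced by a byte order mark at the start of
    // the given bytes, or null when no UTF-8/UTF-16 mark is present.
    // UTF-32 marks are deliberately ignored in this sketch.
    static Charset detect(byte[] head)
    {
        if (head.length >= 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF)
        {
            return Charset.forName("UTF-8");
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
        {
            return Charset.forName("UTF-16BE");
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
        {
            return Charset.forName("UTF-16LE");
        }
        return null;
    }

    public static void main(String[] args)
    {
        // First bytes of a made-up UTF-8 file that starts with a BOM.
        byte[] sample = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0x61 };
        Charset charset = detect(sample);
        System.out.println(charset == null ? "no BOM found" : charset.name());
    }
}
```

Java's UTF-8 decoder does not skip the mark itself, so text read this way may begin with the character U+FEFF, which the caller may want to drop.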
3) Obtaining the metadata
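public Metadata getMetadata(File file) throws Exception
{
    FileInputStream fis = new FileInputStream(file);
    Metadata metadata = new Metadata();
    Tika tika = new Tika();
    // A maximum length of 0 discards the extracted text; only the metadata is wanted here.
    tika.setMaxStringLength(0);
    try
    {
        tika.parseToString(fis, metadata, 0);
    }
    finally
    {
        fis.close(); // release the stream even if parsing fails
    }
    return metadata;
}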
Wrapping the metadata further, into a plain Map:
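Map<String, String> metaMap = new HashMap<String, String>();
Metadata metadata = getMetadata(file); // file: the document being inspected
for (String name : metadata.names())
{
    // get(name) returns only the first value of a multi-valued property.
    metaMap.put(name, metadata.get(name));
}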
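Metadata.get(name) yields a single value, whereas Tika's Metadata.getValues(name) returns a String[] for properties that carry several values. If those extra values matter, they can be joined before they go into the map. The sketch below is JDK-only so that it runs without Tika on the classpath: the class MetaFlattener, the method flatten, and the "; " separator are hypothetical, and the input map stands in for what metadata.getValues(name) would return for each name. The sample keys and values in main are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it is not part of Tika.
final class MetaFlattener
{
    // Joins all values of each property into one string, keeping key order.
    static Map<String, String> flatten(Map<String, String[]> multi)
    {
        Map<String, String> flat = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String[]> entry : multi.entrySet())
        {
            String[] values = entry.getValue();
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < values.length; i++)
            {
                if (i > 0)
                {
                    joined.append("; ");
                }
                joined.append(values[i]);
            }
            flat.put(entry.getKey(), joined.toString());
        }
        return flat;
    }

    public static void main(String[] args)
    {
        // Made-up sample standing in for name -> metadata.getValues(name).
        Map<String, String[]> sample = new LinkedHashMap<String, String[]>();
        sample.put("Content-Type", new String[] { "application/pdf" });
        sample.put("Author", new String[] { "First Author", "Second Author" });
        for (Map.Entry<String, String> entry : flatten(sample).entrySet())
        {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```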
Performance has not yet been tested in detail.