The title may be a little confusing. The simplest approach is to decide by the file extension, like this:
// `is` is the InputStream opened on the file at filePath
if (filePath.endsWith("doc")) {
    // Legacy binary .doc: use the HWPF-based extractor
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith("docx")) {
    // OOXML .docx: use the XWPF-based extractor
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. However, I have found that some files with a doc extension are actually docx files: if you open one with WinRAR, you will find XML files inside. As is well known, a docx file is a ZIP archive consisting of XML files.
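To illustrate the mismatch, here is a minimal sketch of checking the content instead of the extension by peeking at the first bytes of the stream. It relies on two well-known signatures: legacy binary Office files start with the OLE2 header `D0 CF 11 E0 A1 B1 1A E1`, and ZIP archives (which is what a docx is) start with `50 4B 03 04`. The class name `WordFormatSniffer`, the `Format` enum, and the `sniff` method are hypothetical names made up for this example, not part of Apache POI. Note the check only tells you "OLE2 container" or "ZIP archive"; it does not prove the file is a Word document.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not a POI class): guesses the container format from magic bytes.
final class WordFormatSniffer {
    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound files (legacy .doc, .xls, .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Signature of a ZIP local file header (.docx and any other ZIP archive)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Peeks at the header and rewinds, so the caller can still parse the stream afterwards.
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[OLE2_MAGIC.length];
        in.mark(header.length);
        int total = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        } finally {
            in.reset();
        }
        if (startsWith(header, total, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, total, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, the idea would be to wrap `is` in a `java.io.BufferedInputStream` (which supports mark/reset), call `WordFormatSniffer.sniff(...)` on it, and then hand the same buffered stream to `WordExtractor` for `OLE2` or to `XWPFDocument` for `ZIP`, regardless of what the extension says. Recent Apache POI releases also seem to include a `FileMagic` class for this kind of check, so it may be worth looking in the Javadoc of your version before rolling your own.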
I believe this problem is not rare, but I have not found any information about thi