This article comes from 脚本之家; sincere thanks to the editors of that site.
Maven dependency for reading PDF files

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>1.8.13</version>
    </dependency>
Maven dependencies for reading Word files

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.16-beta1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.16-beta1</version>
    </dependency>
Method for reading a Word file

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.poi.hwpf.extractor.WordExtractor;

    /**
     * Reads the text content of a Word file.
     *
     * @param filePath path of the file
     * @return the text read from the Word file, or null if reading fails
     */
    public static String getTextFromWord(String filePath) {
        String result = null;
        File file = new File(filePath);
        FileInputStream fis = null;
        try {
            fis = new FileInputStream(file);
            // WordExtractor parses the document and exposes its plain text
            @SuppressWarnings("resource")
            WordExtractor wordExtractor = new WordExtractor(fis);
            result = wordExtractor.getText();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Always release the input stream, even when extraction fails
            if (fis != null) {
                try {
                    fis.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return result;
    }
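One practical caveat: the WordExtractor in poi-scratchpad targets the legacy binary .doc format, while .docx files are ZIP containers and would need a different extractor. Before calling getTextFromWord, it may help to look at the first few bytes of the file. The following is a minimal sketch using only the JDK; the class name FileSignature and the labels it returns are hypothetical and not part of POI or PDFBox. Note that an OLE2 header only indicates a compound file (it could also be .xls or .ppt), and a ZIP header only indicates some ZIP archive.

```java
final class FileSignature {
    // OLE2 compound file header, used by legacy binary Office files such as .doc
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header, the container format behind .docx
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    // Every PDF file starts with "%PDF-"
    private static final byte[] PDF = {'%', 'P', 'D', 'F', '-'};

    /** Returns "pdf", "ole2", "zip" or "unknown" for the given leading bytes. */
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) return "pdf";
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {'%', 'P', 'D', 'F', '-', '1', '.', '4'};
        System.out.println(detect(sample)); // prints "pdf"
    }
}
```

In real use, the leading bytes would come from reading the first eight bytes of the file; a result of "zip" for a Word document suggests the file is .docx rather than .doc.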
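Method for reading a PDF file

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import org.apache.pdfbox.pdfparser.PDFParser;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    /**
     * Reads the text content of a PDF file.
     *
     * @param filePath path of the file
     * @return the text read from the PDF file, or null if reading fails
     */
    public static String getTextFromPdf(String filePath) {
        String result = null;
        FileInputStream is = null;
        PDDocument document = null;
        try {
            is = new FileInputStream(filePath);
            // Parse the raw stream into a PDDocument
            PDFParser parser = new PDFParser(is);
            parser.parse();
            document = parser.getPDDocument();
            // PDFTextStripper walks the pages and collects their text
            PDFTextStripper stripper = new PDFTextStripper();
            result = stripper.getText(document);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release both the input stream and the parsed document
            if (is != null) {
                try {
                    is.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (document != null) {
                try {
                    document.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return result;
    }

The core of this code lies in the WordExtractor and PDFParser classes, both of which come from Apache projects (POI and PDFBox respectively). The processing of Word and PDF documents and the underlying low-level calls are encapsulated in these two classes. Interested readers can download their source code to study how they are implemented.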
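Since both methods take a file path and return the extracted text (or null on failure), they could be placed behind a single entry point that picks a reader by file extension. The sketch below is one possible arrangement, not something the original article provides: the class TextExtractorRegistry and its methods are hypothetical, and the extractors registered in main are stand-in lambdas so that the example runs without POI or PDFBox on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

final class TextExtractorRegistry {
    // Maps a lower-case file extension to a function from path to extracted text
    private final Map<String, Function<String, String>> byExtension = new HashMap<>();

    void register(String extension, Function<String, String> extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the extracted text, or null when no extractor matches the extension. */
    String extract(String filePath) {
        int dot = filePath.lastIndexOf('.');
        String ext = dot < 0 ? "" : filePath.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<String, String> extractor = byExtension.get(ext);
        return extractor == null ? null : extractor.apply(filePath);
    }

    public static void main(String[] args) {
        TextExtractorRegistry registry = new TextExtractorRegistry();
        // Stand-ins: a real project would pass method references to
        // getTextFromWord and getTextFromPdf here instead.
        registry.register("doc", path -> "word:" + path);
        registry.register("pdf", path -> "pdf:" + path);
        System.out.println(registry.extract("report.PDF")); // prints "pdf:report.PDF"
    }
}
```

Returning null for an unsupported extension mirrors the article's methods, which also return null when reading fails, so callers only need one null check.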