Apache POI - HWPF and XWPF - Java API to Handle Microsoft Word Files
Parsing .doc files
Requires poi-scratchpad/3.7.jar
POI-HWPF - A Quick Guide
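Basic text extraction
For basic text extraction, use org.apache.poi.hwpf.extractor.WordExtractor. It accepts either of two inputs: an InputStream or an HWPFDocument.
getText() returns all of the text in the document,
getParagraphText() returns the text of each paragraph in turn,
getTextFromPieces() reads the text directly from the document's text pieces; it is faster, but it tends to return things that are not text from the page.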
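The extractor calls described above can be sketched as follows. This is a minimal sketch, assuming a binary .doc input; the class name DocTextExtractSketch, its helper methods, and the command-line path argument are hypothetical and only serve the illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.extractor.WordExtractor;

// Hypothetical helper class, not part of POI.
final class DocTextExtractSketch {

    /** Returns the whole document text from a binary .doc stream. */
    static String allText(InputStream in) throws IOException {
        WordExtractor extractor = new WordExtractor(in);
        return extractor.getText();
    }

    /** Returns one string per paragraph from a binary .doc stream. */
    static String[] paragraphs(InputStream in) throws IOException {
        WordExtractor extractor = new WordExtractor(in);
        return extractor.getParagraphText();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String paragraph : paragraphs(in)) {
                System.out.print(paragraph);
            }
        }
    }
}
```

Each helper consumes the stream it is given, so open a fresh stream for each call.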
Extracting specific text properties
To get specific bits of text, first create an org.apache.poi.hwpf.HWPFDocument. Fetch the range with getRange(), then get paragraphs from that. You can then get text and other properties.
Step 1: Create the HWPFDocument
Step 2: Get the Range
getRange(): Returns the range which covers the whole of the document, but excludes any headers and footers.
int numParagraphs(): Used to get the number of paragraphs in a range.
int numSections(): Used to get the number of sections in a range. (A "section" here is the Word section created with Insert > Break > Section break.)
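As a rough illustration of these two counters, the sketch below reports the paragraph and section counts of the main range, then the paragraph count of each section. The class name DocRangeCountsSketch, the counts helper, and the path argument are assumptions made for the example.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

// Hypothetical helper class, not part of POI.
final class DocRangeCountsSketch {

    /** Returns {paragraph count, section count} for the main document range. */
    static int[] counts(InputStream in) throws IOException {
        HWPFDocument doc = new HWPFDocument(in);
        Range range = doc.getRange();
        return new int[] { range.numParagraphs(), range.numSections() };
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .doc file.
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange();
            System.out.println("Paragraphs: " + range.numParagraphs());
            // A Section is itself a Range, so it can report its own paragraphs.
            for (int i = 0; i < range.numSections(); i++) {
                Section section = range.getSection(i);
                System.out.println("Section " + i + ": "
                        + section.numParagraphs() + " paragraph(s)");
            }
        }
    }
}
```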
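Step 3: Get the paragraphs
getParagraph(int index): returns the paragraph at the given index in the range.
text(): returns the text of the paragraph.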
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;

public static void main(String[] args) throws Exception {
    // HWPFDocument reads only the binary .doc format, so the input must be a .doc file.
    try (InputStream istream = new FileInputStream("e:\\Users\\ywf\\Desktop\\文本校对\\1.doc")) {
        HWPFDocument doc = new HWPFDocument(istream);
        // The range covers the whole document, excluding headers and footers.
        Range range = doc.getRange();
        for (int i = 0; i < range.numParagraphs(); i++) {
            Paragraph poiPara = range.getParagraph(i);
            // Visit every character run in the paragraph.
            for (int j = 0; j < poiPara.numCharacterRuns(); j++) {
                CharacterRun run = poiPara.getCharacterRun(j);
                System.out.println("Color " + run.getColor()); // color
                System.out.println("Font size " + run.getFontSize()); // font size, in half-points
                System.out.println("Font Name " + run.getFontName()); // font name
                System.out.println(run.isBold() + " " + run.isItalic() + " "
                        + run.getUnderlineCode()); // bold, italic, underline
                System.out.println("Text is " + run.text()); // text content
            }
        }
    }
}
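HWPFDocument reads only the binary (OLE2) .doc format, while a .docx file is a ZIP package that needs XWPFDocument. One way to choose between the two before opening a file is to look at its leading bytes. The sketch below is an assumption-laden illustration in plain Java, not a POI API: the class WordFormatSniffSketch and the Kind enum are hypothetical names. Note that the signatures only identify the container, so an OLE2 file could also be an .xls and a ZIP file could be any archive.

```java
// Hypothetical helper class, not part of POI.
final class WordFormatSniffSketch {

    enum Kind { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound document signature (used by binary .doc files).
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4" (used by .docx packages).
    private static final byte[] ZIP = { 'P', 'K', 3, 4 };

    /** Classifies a file by its first bytes. */
    static Kind sniff(byte[] head) {
        if (startsWith(head, OLE2)) {
            return Kind.DOC_OLE2;
        }
        if (startsWith(head, ZIP)) {
            return Kind.DOCX_ZIP;
        }
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first eight bytes of the file, pass them to sniff, and then construct either an HWPFDocument or an XWPFDocument from a fresh stream.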
Parsing .docx files
Requires poi-ooxml/3.7.jar
package test;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class ParseWordDocxTest {
    /** Prints the formatting and text of every run in a .docx file. */
    public static void main(String[] args) throws Exception {
        try (InputStream istream = new FileInputStream("e:\\Users\\ywf\\Desktop\\文本校对\\1.docx")) {
            XWPFDocument docx = new XWPFDocument(istream);
            List<XWPFParagraph> paragraphs = docx.getParagraphs();
            for (XWPFParagraph para : paragraphs) {
                List<XWPFRun> runs = para.getRuns();
                for (XWPFRun r : runs) {
                    System.out.println("Font color: " + r.getColor());
                    System.out.println("Font name: " + r.getFontFamily());
                    System.out.println("Font size: " + r.getFontSize());
                    System.out.println("Text: " + r.getText(0)); // text at position 0 of the run
                    System.out.println("Bold?: " + r.isBold());
                    System.out.println("Italic?: " + r.isItalic());
                }
            }
        }
    }
}
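To try the XWPF calls without a file on disk, a small document can be built in memory and read back. This is a minimal sketch under stated assumptions: the class DocxRoundTripSketch and its two helpers are hypothetical names, and the sample text is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFRun;

// Hypothetical helper class, not part of POI.
final class DocxRoundTripSketch {

    /** Builds a one-paragraph, one-run .docx in memory and returns its bytes. */
    static byte[] buildSample() {
        try {
            XWPFDocument doc = new XWPFDocument();
            XWPFRun run = doc.createParagraph().createRun();
            run.setText("Hello POI");
            run.setBold(true);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            doc.write(out);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the first run of the first paragraph and summarizes it. */
    static String describeFirstRun(byte[] docxBytes) {
        try {
            XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docxBytes));
            XWPFRun run = doc.getParagraphs().get(0).getRuns().get(0);
            return run.getText(0) + "|bold=" + run.isBold() + "|italic=" + run.isItalic();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFirstRun(buildSample()));
    }
}
```

The same read-back helper works on bytes loaded from a real .docx file, provided the document has at least one paragraph with at least one run.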