读取word文件有两种方法,用jacob包,可以修改生成word文件内容http://danadler.com/jacob/。如果只读取word里的文本内容的话,可以用poi读取word文件,先到http://www.ibiblio.org/maven2/org/textmining/tm-extractors/下载tm-extractors-0.4.jar包
1。用jacob.
其实jacob是一个bridage,连接java和com或者win32函数的一个中间件,jacob并不能直接抽取word,excel等文件,需要自己写dll哦,不过已经有为你写好的了,就是jacob的作者一并提供了。
jacob下载:http://www.matrix.org.cn/down_view.asp?id=13
下载了jacob并放到指定的路径之后(dll放到path,jar文件放到classpath),就可以写你自己的抽取程序了,下面是一个例子:
import java.io.File;
import com.jacob.com.*;
import com.jacob.activeX.*;
public class FileExtracter{
public static void main(String[] args) {
ActiveXComponent app = new ActiveXComponent("Word.Application");
String inFile = "c://test.doc";
String tpFile = "c://temp.htm";
String otFile = "c://temp.xml";
boolean flag = false;
try {
app.setProperty("Visible", new Variant(false));
Object docs = app.getProperty("Documents").toDispatch();
Object doc = Dispatch.invoke(docs,"Open", Dispatch.Method, new Object[]{inFile,new Variant(false), new Variant(true)}, new int[1]).toDispatch();
Dispatch.invoke(doc,"SaveAs", Dispatch.Method, new Object[]{tpFile,new Variant(8)}, new int[1]);
Variant f = new Variant(false);
Dispatch.call(doc, "Close", f);
flag = true;
} catch (Exception e) {
e.printStackTrace();
} finally {
app.invoke("Quit", new Variant[] {});
}
}
}
2、OFFICE文档使用POI控件,PDF可以使用PDFBOX0.7.3控件,完全支持中文,用XPDF也行,不过感觉PDFBOX比较好,而且作者也在更新。水平有限,万望各位指正
WORD:
import
org.apache.lucene.document.Document;
import
org.apache.lucene.document.Field;
import
org.apache.poi.hwpf.extractor.WordExtractor;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
java.io.File;
import
java.io.InputStream;
import
java.io.FileInputStream;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
com.search.code.Index;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
public
Document getDocument(Index index, String url, String title, InputStream is)
throws
DocCenterException
...
{
![](https://i-blog.csdnimg.cn/blog_migrate/6a9c071a08f1dae2d3e1c512000eef41.gif)
String bodyText = null;
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
WordExtractor ex = new WordExtractor(is);//is是WORD文件的InputStream
bodyText = ex.getText();
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if(!bodyText.equals(""))...{
index.AddIndex(url, title, bodyText);
}
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
}catch (DocCenterException e) ...{
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
}catch(Exception e)...{
e.printStackTrace();
}
}
return
null
;
}
Excel:
import
org.apache.lucene.document.Document;
import
org.apache.lucene.document.Field;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
org.apache.poi.hwpf.extractor.WordExtractor;
import
org.apache.poi.hssf.usermodel.HSSFWorkbook;
import
org.apache.poi.hssf.usermodel.HSSFSheet;
import
org.apache.poi.hssf.usermodel.HSSFRow;
import
org.apache.poi.hssf.usermodel.HSSFCell;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
java.io.File;
import
java.io.InputStream;
import
java.io.FileInputStream;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
com.search.code.Index;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
public
Document getDocument(Index index, String url, String title, InputStream is)
throws
DocCenterException
...
{
StringBuffer content = new StringBuffer();
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try...{
HSSFWorkbook workbook = new HSSFWorkbook(is);//创建对Excel工作簿文件的引用
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
for (int numSheets = 0; numSheets < workbook.getNumberOfSheets(); numSheets++) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (null != workbook.getSheetAt(numSheets)) ...{
HSSFSheet aSheet = workbook.getSheetAt(numSheets);//获得一个sheet
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
for (int rowNumOfSheet = 0; rowNumOfSheet <= aSheet.getLastRowNum(); rowNumOfSheet++) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (null != aSheet.getRow(rowNumOfSheet)) ...{
HSSFRow aRow = aSheet.getRow(rowNumOfSheet); //获得一个行
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
for (short cellNumOfRow = 0; cellNumOfRow <= aRow.getLastCellNum(); cellNumOfRow++) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (null != aRow.getCell(cellNumOfRow)) ...{
HSSFCell aCell = aRow.getCell(cellNumOfRow);//获得列值
content.append(aCell.getStringCellValue());
}
}
}
}
}
}
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if(!content.equals(""))...{
index.AddIndex(url, title, content.toString());
}
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
}catch (DocCenterException e) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/6a9c071a08f1dae2d3e1c512000eef41.gif)
throw new DocCenterException("无法从该Mocriosoft Word文档中提取内容", e);
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
}catch(Exception e) ...{
System.out.println("已运行xlRead() : " + e );
}
return null;
}
PowerPoint:
import
java.io.InputStream;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
org.apache.lucene.document.Document;
import
org.apache.poi.hslf.HSLFSlideShow;
import
org.apache.poi.hslf.model.TextRun;
import
org.apache.poi.hslf.model.Slide;
import
org.apache.poi.hslf.usermodel.SlideShow;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
public
Document getDocument(Index index, String url, String title, InputStream is)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
throws
DocCenterException
...
{
StringBuffer content = new StringBuffer("");
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try...{
SlideShow ss = new SlideShow(new HSLFSlideShow(is));//is 为文件的InputStream,建立SlideShow
Slide[] slides = ss.getSlides();//获得每一张幻灯片
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
for(int i=0;i<slides.length;i++)...{
TextRun[] t = slides[i].getTextRuns();//为了取得幻灯片的文字内容,建立TextRun
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
for(int j=0;j<t.length;j++)...{
content.append(t[j].getText());//这里会将文字内容加到content中去
}
content.append(slides[i].getTitle());
}
index.AddIndex(url, title, content.toString());
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
}catch(Exception ex)...{
System.out.println(ex.toString());
}
return null;
}
3、PDF:但是pdfbox对中文支持还不好,先下载pdfbox
import
java.io.InputStream;
import
java.io.IOException;
import
org.apache.lucene.document.Document;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
org.pdfbox.cos.COSDocument;
import
org.pdfbox.pdfparser.PDFParser;
import
org.pdfbox.pdmodel.PDDocument;
import
org.pdfbox.pdmodel.PDDocumentInformation;
import
org.pdfbox.util.PDFTextStripper;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
import
com.search.code.Index;
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
public
Document getDocument(Index index, String url, String title, InputStream is)
throws
DocCenterException
...
{
COSDocument cosDoc = null;
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
cosDoc = parseDocument(is);
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} catch (IOException e) ...{
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (cosDoc.isEncrypted()) ...{
if (cosDoc != null)
closeCOSDocument(cosDoc);
throw new DocCenterException("该PDF文档是加密文档,无法处理");
}
String docText = null;
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
PDFTextStripper stripper = new PDFTextStripper();
docText = stripper.getText(new PDDocument(cosDoc));
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} catch (IOException e) ...{
closeCOSDocument(cosDoc);
throw new DocCenterException("无法处理该PDF文档", e);
}
![](https://i-blog.csdnimg.cn/blog_migrate/6a9c071a08f1dae2d3e1c512000eef41.gif)
PDDocument pdDoc = null;
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
pdDoc = new PDDocument(cosDoc);
PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if(docInfo.getTitle()!=null && !docInfo.getTitle().equals(""))...{
title = docInfo.getTitle();
}
![](https://i-blog.csdnimg.cn/blog_migrate/6a9c071a08f1dae2d3e1c512000eef41.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} catch (Exception e) ...{
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
System.err.println("无法取得该PDF文档的元数据" + e.getMessage());
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} finally ...{
closeCOSDocument(cosDoc);
closePDDocument(pdDoc);
}
return null;
}
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
private
static
COSDocument parseDocument(InputStream is)
throws
IOException
...
{
PDFParser parser = new PDFParser(is);
parser.parse();
return parser.getDocument();
}
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
private
void
closeCOSDocument(COSDocument cosDoc)
...
{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (cosDoc != null) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
cosDoc.close();
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} catch (IOException e) ...{
}
}
}
![](https://i-blog.csdnimg.cn/blog_migrate/6810355c2f78c12e91b7997a8e8c583a.gif)
![](https://i-blog.csdnimg.cn/blog_migrate/a41954a27d6ad96fa2c2cf816e677448.gif)
private
void
closePDDocument(PDDocument pdDoc)
...
{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
if (pdDoc != null) ...{
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
try ...{
pdDoc.close();
![](https://i-blog.csdnimg.cn/blog_migrate/37c8bf68cdc3cc81759c34160776bc53.gif)
} catch (IOException e) ...{
}
}
}
4.抽取支持中文的pdf文件-xpdf
xpdf是一个开源项目,我们可以调用他的本地方法来实现抽取中文pdf文件。
下载xpdf函数包:http://www.matrix.org.cn/down_view.asp?id=15
同时需要下载支持中文的补丁包:http://www.matrix.org.cn/down_view.asp?id=16
按照readme放好中文的patch,就可以开始写调用本地方法的java程序了
下面是一个如何调用的例子:
import java.io.*;
/**
* <p>Title: pdf extraction</p>
* <p>Description: email:chris@matrix.org.cn</p>
* <p>Copyright: Matrix Copyright (c) 2003</p>
* <p>Company: Matrix.org.cn</p>
* @author chris
* @version 1.0,who use this example pls remain the declare
*/
public class PdfWin {
public PdfWin() {
}
public static void main(String args[]) throws Exception
{
String PATH_TO_XPDF="C://Program Files//xpdf//pdftotext.exe";
String filename="c://a.pdf";
String[] cmd = new String[] { PATH_TO_XPDF, "-enc", "UTF-8", "-q", filename, "-"};
Process p = Runtime.getRuntime().exec(cmd);
BufferedInputStream bis = new BufferedInputStream(p.getInputStream());
InputStreamReader reader = new InputStreamReader(bis, "UTF-8");
StringWriter out = new StringWriter();
char [] buf = new char[10000];
int len;
while((len = reader.read(buf))>= 0) {
//out.write(buf, 0, len);
System.out.println("the length is"+len);
}
reader.close();
String ts=new String(buf);
System.out.println("the str is"+ts);
}
}
代码复制可能出错,不过代码经过测试,绝对能用,POI为3.0-rc4,PDFBOX为0.7.3