Extracting Word Document Data with POI and ANTLR

1. Extracting Word document content with POI

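POI is an Apache open-source project, implemented in Java, that parses MS Word/Excel documents across platforms. In other words, the content of MS Word/Excel documents can be extracted on non-Windows platforms as well. This article uses tm-extractors_0.4.jar, an extension Jar for POI, to extract the content of a Word document. The sample Word document has 2 pages, and each page records one component.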

In the Java class, add import org.textmining.text.extraction.WordExtractor; and then, in a method:

  protected String getText() throws Exception {
                WordExtractor extractor=null;
                String text=null;
                extractor = new WordExtractor();
                text=extractor.extractText(in);   // in is a FileInputStream, e.g. new FileInputStream(new File("path to the Word document"))
                return text;
 }

 The result is as follows:

Colimas Component Specification

1. Component: Apache Jakarta POI Java API To Access Microsoft Format Files
 
1.1 Basic Information
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

Alias                               : POI
Author                            :
http://jakarta.apache.org/poi/index.html
Version                           : 0.0.1
Language                        : Java
Platform                          : Windows, Linux, Unix
Status                              : Confirmed
Is public?                         : Y
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1.2 Developers
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

Apache developer1
Apache developer2
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

1.3 License
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
Apache License 2.0
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

1.4 Function Description
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
http://jakarta.apache.org/poi/index.html
The POI project consists of APIs for manipulating various file formats based upon Microsoft's OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you'll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.
OLE 2 Compound Document Format based files include most Microsoft Office files such as XLS and DOC as well as MFC serialization API based file formats.
As a general policy we try to collaborate as much as possible with other projects to provide this functionality. Examples include: Cocoon for which there are serializers for HSSF; Open Office.org with whom we collaborate in documenting the XLS format; and Lucene for which we'll soon have file format interpretors. When practical, we donate components directly to those projects for POI-enabling them.
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1.5 Extends Info
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1. Component: ANother Tool for Language Recognition
 
1.1 Basic Information
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
Alias                               : ANTLR
Author                            : http://www.antlr.org/
Version                           : 2.7.6
Language                        : Java
Platform                          : Windows, Linux
Status                              : Confirmed
Is public?                         : Y
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1.2 Developers
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
Terence Parr
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

1.3 License
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
ANTLR 3 License
 [The BSD License]Copyright (c) 2005, Terence Parr All rights reserved. http://www.antlr.org/license.html
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1.4 Function Description
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
What is ANTLR?
ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. ANTLR provides excellent support for tree construction, tree walking, and translation. There are currently about 5,000 ANTLR source downloads a month.
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦
1.5 Extends Info
¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦


¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦

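Because the blog does not support Japanese characters, the decorative delimiter symbol appears garbled above. That does not matter: all you need to know is that the character's hex code is 0x81, 0xA6.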
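The lexer in section 2 matches the delimiter as two separate characters, U+0081 and U+00A6. The sketch below shows why that works under one assumption: that the lexer sees each input byte as one character, which is what an ISO-8859-1 decoding does. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class DelimiterBytes {
    // Decode raw bytes one-to-one into chars, as ISO-8859-1 does.
    static String decodeBytewise(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The two bytes of the delimiter symbol named in the text.
        byte[] delimiter = { (byte) 0x81, (byte) 0xA6 };
        String s = decodeBytewise(delimiter);
        // Prints U+0081 and then U+00A6, one per line.
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

U+00A6 is the broken bar that shows up in the listing above, which is consistent with the second byte of each delimiter character surviving the blog's rendering.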

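The symbol acts as a delimiter that separates blocks of text. Another delimiter could be used instead, as long as that character never appears in the document text itself.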
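As a rough illustration of the delimiter's role, the sketch below splits extracted text into blocks wherever a row made up only of delimiter characters occurs. It uses plain string scanning rather than the ANTLR grammar this article builds, it represents the delimiter as the single character U+00A6 the way it is displayed in the listing above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterSplit {
    // Split text into blocks separated by rows that consist solely of the delimiter.
    static List<String> splitBlocks(String text, char delimiter) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\r?\n")) {
            if (isDelimiterRow(line, delimiter)) {
                blocks.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(line).append('\n');
            }
        }
        // Keep trailing text after the last delimiter row, if there is any.
        String last = current.toString().trim();
        if (!last.isEmpty()) {
            blocks.add(last);
        }
        return blocks;
    }

    // A delimiter row is a non-empty line whose every character is the delimiter.
    static boolean isDelimiterRow(String line, char delimiter) {
        String t = line.trim();
        if (t.isEmpty()) {
            return false;
        }
        for (int i = 0; i < t.length(); i++) {
            if (t.charAt(i) != delimiter) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "1.2 Developers\n\u00A6\u00A6\u00A6\u00A6\n\n"
                    + "Apache developer1\nApache developer2\n\u00A6\u00A6\u00A6\u00A6\n";
        for (String block : splitBlocks(text, '\u00A6')) {
            System.out.println("[" + block + "]");
        }
    }
}
```

This only works because the delimiter never occurs inside the document text, which is exactly the constraint stated above.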

2. ANTLR programming
 The data to extract are:

the text to the right of ':' for Alias, Author, Version, Language, Platform, Status, and Is public?, and

the text between the delimiter rows under Developers, License, and Function Description.
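Before turning to the grammar, the first kind of extraction (the text to the right of ':') can be illustrated with a regular expression. This is only a sketch for comparison and not the approach used in this article; the class name, method name, and pattern are assumptions. As in the sample document, a value may sit on the line after the colon, so whitespace after ':' is allowed to span a line break.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldSketch {
    // Field labels taken from the sample document; matching ignores case.
    // \s* after the colon may cross a line break, so an empty value would
    // pick up the following line (a known limitation of this sketch).
    private static final Pattern FIELD = Pattern.compile(
        "^(Alias|Author|Version|Language|Platform|Status|Is public\\?)\\s*:\\s*(.*)$",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Return label -> value in the order the fields appear in the text.
    static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "Alias                               : POI\n"
                    + "Author                            :\n"
                    + "http://jakarta.apache.org/poi/index.html\n"
                    + "Version                           : 0.0.1\n";
        System.out.println(extractFields(text));
    }
}
```

A regular expression handles these single-line fields well enough, but the delimited multi-line sections are where a grammar becomes more convenient.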

The ANTLR lexer is as follows:

// ColimasComponentWordLexer
//   
class ColimasComponentWordLexer extends Lexer;
options {
 exportVocab=ColimasComponentWord;     // call the vocabulary "ColimasComponentWord"
 testLiterals=true;                    // test for literals
 k=3;                   // 3 characters of lookahead
 charVocabulary='\u0003'..'\uFFFF';
 codeGenBitsetTestThreshold=20;
}
{
 List records=new ArrayList();
 Component component=null; // holds the data being extracted
 
 public List getRecords(){
  return this.records;
 }
}
DIGIT  : '0'..'9';
// character literals
PARA   : ('\u0081''\u00A6')+ EMPTY;  // hex codes of the delimiter symbol, followed by spaces and line breaks
CHAR_LITERAL
 : (~('\n'|'\r'|'\\'|'\u0081'|'\u00A6'))| '\u000C'|'\u0003'|'\u0004'   // characters used in the document text
 ;
protected
VALUE  : (CHAR_LITERAL)+ ;  // data
EMPTY  : ( (('.')?((('\t')+)?(('\r')?'\n'))+)| ((' ')+) )?;  // spaces or line breaks
TITLE  : ("C"|"c")"olimas "("C"|"c")"omponent "("S"|"s")"pecification" EMPTY  // first line of the document
  ;
NAME   : ( DIGIT "." EMPTY)?("C"|"c")"omponent:" EMPTY v:VALUE EMPTY // component name
 {
  if(component==null)
    component=new Component();
  component.setName(v.getText());  // store the component name in the Component object
 }
 ;
BASIC  : (DIGIT ".1" EMPTY)?("B"|"b")"asic "("I"|"i")"nformation" EMPTY
   ;
COLON  : ':';
ALIAS : // component alias
  (("A"|"a")"lias") EMPTY COLON  EMPTY v:VALUE EMPTY
  {
   if(component==null)
    component=new Component();
   component.setAlias(v.getText());
  }
  ;
AUTHOR :  ("Author"|"author") EMPTY COLON  EMPTY v:VALUE EMPTY  // component source
 {
  if(component==null)
    component=new Component();
  component.setAuthor(v.getText());
 }
 ;
VERSION:  ("Version"|"version") EMPTY COLON  EMPTY v:VALUE EMPTY // component version
 {
  if(component==null)
    component=new Component();
  component.setVersion(v.getText());
 }
 ;
LANGUAGE: (("L"|"l")"anguage") EMPTY COLON  EMPTY v:VALUE EMPTY  // language the component is written in
 {
  if(component==null)
    component=new Component();
  component.setLanguage(v.getText());
 }
 ;
PLATFORM: (("P"|"p")"latform") EMPTY COLON  EMPTY v:VALUE EMPTY  // platforms the component runs on
 {
  if(component==null)
    component=new Component();
  component.setPlatform(v.getText());
 }
 ;
STATUS  : (("S"|"s")"tatus") EMPTY COLON  EMPTY v:VALUE EMPTY // component status
 {
  if(component==null)
    component=new Component();
  component.setStatus(v.getText());
 }
 ;
PUBLIC  : (("I"|"i")"s public?") EMPTY COLON  EMPTY v:VALUE EMPTY  // whether the component is public
 {
  if(component==null)
    component=new Component();
  component.setIspublic(v.getText());
 }
 ;
SECTION : (VALUE EMPTY)+;
USERIDS : (DIGIT ".2" EMPTY)?("D"|"d")"evelopers" EMPTY PARA  // developers
    v:SECTION
    PARA
 {
  if(component==null)
    component=new Component();
  component.setDevelopers(v.getText());
 }
 ;
LICENSE : (DIGIT ".3" EMPTY)?("L"|"l")"icense" EMPTY PARA (v:VALUE EMPTY)* PARA  // license
 {
  if(component==null)
    component=new Component();
  component.setLicense(v.getText());
 }
 ;
FUNCTION: (DIGIT ".4" EMPTY)?("F"|"f")"unction "("D"|"d")"escription" EMPTY PARA v:SECTION PARA  // function description
 {
  if(component==null)
    component=new Component();
  component.setFunction(v.getText());
  records.add(component);
  component=null;
 }
 ;
EXTEND  : (DIGIT ".5" EMPTY)?("E"|"e")"xtends "("I"|"i")"nfo" EMPTY PARA (VALUE EMPTY)* PARA ;// other information

The ANTLR parser is as follows:

//----------------------------------------------------------------------------
// The Colimas component specification word scanner
//----------------------------------------------------------------------------
header{
 package org.colimas.doc.parser;
 import java.util.List;
 import java.util.ArrayList;
}
class ColimasComponentWordParser extends Parser;
options {
 k = 2;                                   // two token lookahead
 exportVocab=ColimasComponentWord;   // Call its vocabulary "ColimasComponentWord"
 codeGenMakeSwitchThreshold = 2;          // Some optimizations
 codeGenBitsetTestThreshold = 3;
 defaultErrorHandler = false;              // Don't generate parser error handlers
}

document :  // whole document
 TITLE document   
 | component document  
 | footer 
 ; 
component : // component information
 NAME basic USERIDS LICENSE FUNCTION (EXTEND)? (footer)? ;
basic     :
 BASIC PARA ALIAS AUTHOR VERSION LANGUAGE
 PLATFORM STATUS PUBLIC PARA
    ;
footer    : // other
 CHAR_LITERAL (EMPTY)?
 | EMPTY
 ;
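
To make the rule structure concrete, here is a hand-written sketch of what the document, component, and basic rules accept, expressed as a small recursive-descent recognizer over a list of token kinds. It is an illustration only and not the code ANTLR generates: the Tok enum and all method names are hypothetical, footer is simplified to a single optional FOOTER token, and the right-recursive document rule is written as a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DocumentRuleSketch {
    // Hypothetical token kinds named after the lexer rules the parser refers to.
    enum Tok { TITLE, NAME, BASIC, PARA, ALIAS, AUTHOR, VERSION, LANGUAGE,
               PLATFORM, STATUS, PUBLIC, USERIDS, LICENSE, FUNCTION, EXTEND, FOOTER }

    private final List<Tok> tokens;
    private int pos = 0;

    DocumentRuleSketch(List<Tok> tokens) { this.tokens = tokens; }

    // Current token, or null at end of input.
    private Tok peek() { return pos < tokens.size() ? tokens.get(pos) : null; }

    private void expect(Tok kind) {
        if (peek() != kind) {
            throw new IllegalStateException("expected " + kind + " at " + pos + " but found " + peek());
        }
        pos++;
    }

    // document : TITLE document | component document | footer
    // Returns the number of components recognized.
    int document() {
        int count = 0;
        while (peek() == Tok.TITLE || peek() == Tok.NAME) {
            if (peek() == Tok.TITLE) {
                pos++;
            } else {
                component();
                count++;
            }
        }
        if (peek() == Tok.FOOTER) pos++;   // simplified footer
        if (peek() != null) {
            throw new IllegalStateException("unexpected " + peek() + " at " + pos);
        }
        return count;
    }

    // component : NAME basic USERIDS LICENSE FUNCTION (EXTEND)? (footer)?
    private void component() {
        expect(Tok.NAME);
        basic();
        expect(Tok.USERIDS);
        expect(Tok.LICENSE);
        expect(Tok.FUNCTION);
        if (peek() == Tok.EXTEND) pos++;
        if (peek() == Tok.FOOTER) pos++;
    }

    // basic : BASIC PARA ALIAS AUTHOR VERSION LANGUAGE PLATFORM STATUS PUBLIC PARA
    private void basic() {
        for (Tok t : new Tok[] { Tok.BASIC, Tok.PARA, Tok.ALIAS, Tok.AUTHOR, Tok.VERSION,
                                 Tok.LANGUAGE, Tok.PLATFORM, Tok.STATUS, Tok.PUBLIC, Tok.PARA }) {
            expect(t);
        }
    }

    static int countComponents(List<Tok> tokens) {
        return new DocumentRuleSketch(tokens).document();
    }

    // Token sequence for one complete component, including the optional EXTEND.
    static List<Tok> componentTokens() {
        return Arrays.asList(Tok.NAME, Tok.BASIC, Tok.PARA, Tok.ALIAS, Tok.AUTHOR, Tok.VERSION,
                Tok.LANGUAGE, Tok.PLATFORM, Tok.STATUS, Tok.PUBLIC, Tok.PARA,
                Tok.USERIDS, Tok.LICENSE, Tok.FUNCTION, Tok.EXTEND);
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(Tok.TITLE);
        tokens.addAll(componentTokens());
        tokens.addAll(componentTokens());
        System.out.println(countComponents(tokens)); // prints 2
    }
}
```

In the article's design the parser only checks this ordering; the data itself is collected by the actions attached to the lexer rules.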

Compiling the grammar with ANTLR generates the Java classes ColimasComponentWordLexer, ColimasComponentWordParser, and ColimasComponentWordTokenTypes.

The class that holds the component data is as follows:

import java.util.ArrayList;
import java.util.List;

public class Component {
 
 private String alias=null;
 
 private String name=null;
 
 private String version=null;
 
 private String author=null;
 
 private String language=null;
 
 private String platform=null;
 
 private String ispublic=null;
 
 private List developers=new ArrayList();
 
 private List function=new ArrayList();
 
 private List license=new ArrayList();

 private String status=null;
 
 /**
  *<p>get status</p>
  * @return Returns the status.
  */
 public String getStatus() {
  return status;
 }

 /**
  * <p>set status</p>
  * @param status The status to set.
  */
 public void setStatus(String status) {
  this.status = status;
 }

 /**
  *<p>get alias</p>
  * @return Returns the alias.
  */
 public String getAlias() {
  return alias;
 }

 /**
  * <p>set alias</p>
  * @param alias The alias to set.
  */
 public void setAlias(String alias) {
  this.alias = alias;
 }

 /**
  *<p>get author</p>
  * @return Returns the author.
  */
 public String getAuthor() {
  return author;
 }

 /**
  * <p>set author</p>
  * @param author The author to set.
  */
 public void setAuthor(String author) {
  this.author = author;
 }

 /**
  *<p>get developers</p>
  * @return Returns the developers.
  */
 public List getDevelopers() {
  return this.developers;
 }

 /**
  * <p>set developers</p>
  * @param developers The developers to set.
  */
 public void setDevelopers(String developers) {
  this.developers.add(developers);
 }

 /**
  *<p>get function</p>
  * @return Returns the function.
  */
 public String getFunction() {
  String functions="";
  for(int i=0;i<function.size();i++){
   String tmp=(String)function.get(i);
   functions+=tmp+"\n";
  }
  return functions;
 }

 /**
  * <p>set function</p>
  * @param function The function to set.
  */
 public void setFunction(String function) {
  this.function.add(function);
 }

 /**
  *<p>get ispublic</p>
  * @return Returns the ispublic.
  */
 public String getIspublic() {
  return ispublic;
 }

 /**
  * <p>set ispublic</p>
  * @param ispublic The ispublic to set.
  */
 public void setIspublic(String ispublic) {
  this.ispublic = ispublic;
 }

 /**
  *<p>get language</p>
  * @return Returns the language.
  */
 public String getLanguage() {
  return language;
 }

 /**
  * <p>set language</p>
  * @param language The language to set.
  */
 public void setLanguage(String language) {
  this.language = language;
 }

 /**
  *<p>get license</p>
  * @return Returns the license.
  */
 public String getLicense() {
  String licenses="";
  for(int i=0;i<this.license.size();i++){
   String tmp=(String)this.license.get(i);
   licenses+=tmp+"\n";
  }
  return licenses;
 }

 /**
  * <p>set license</p>
  * @param license The license to set.
  */
 public void setLicense(String license) {
  this.license.add(license);
 }

 /**
  *<p>get name</p>
  * @return Returns the name.
  */
 public String getName() {
  return name;
 }

 /**
  * <p>set name</p>
  * @param name The name to set.
  */
 public void setName(String name) {
  this.name = name;
 }

 /**
  *<p>get platform</p>
  * @return Returns the platform.
  */
 public String getPlatform() {
  return platform;
 }

 /**
  * <p>set platform</p>
  * @param platform The platform to set.
  */
 public void setPlatform(String platform) {
  this.platform = platform;
 }

 /**
  *<p>get version</p>
  * @return Returns the version.
  */
 public String getVersion() {
  return version;
 }

 /**
  * <p>set version</p>
  * @param version The version to set.
  */
 public void setVersion(String version) {
  this.version = version;
 }

 public String toString(){
  String result=null;
  List deve=this.getDevelopers();
  String deves="";
  for(int i=0;i<deve.size();i++){
   String tmp=(String)deve.get(i);
   deves+=tmp;
  }

  result="alias:"+this.getAlias()+
    ";name:"+this.getName()+
    ";author:"+this.getAuthor()+
    ";version:"+this.getVersion()+
    ";language:"+this.getLanguage()+
    ";platform:"+this.getPlatform()+
    ";status:"+this.getStatus()+
    ";ispublic:"+this.getIspublic()+
    ";developers:"+deves+
    ";function:"+this.getFunction()+
    ";license:"+this.getLicense();
  return result;
 }
}

 

3. Output

/*
 *  Colimas is copyright (C) 2005-2006 Zhao Lei.
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 */
package org.colimas.doc.parser;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.util.List;

import antlr.RecognitionException;
import antlr.TokenStreamException;

/**
 * <h3>ColimasComponentWordLexerTest.java</h3>
 *
 * <P>
 * Function:<BR />
 * test
 * </P>
 * @author zhao lei
 * @version 1.0
 * <br>
 * Modification History:
 * <PRE>
 * SEQ DATE       ORDER DEVELOPER      DESCRIPTION
 * --- ---------- ----- -------------- -----------------------------
 * 001 2006/04/15          tyrone        INIT
 * </PRE>
 */
public class ColimasComponentWordLexerTest {

 /**
  * <p> </p>
  * @param args
  */
 public static void main(String[] args) {
  try {
   FileInputStream in=new FileInputStream(new File(args[0])); // input file (the plain-text file produced by extracting the Word document)
   ColimasComponentWordLexer lexer=
    new ColimasComponentWordLexer(in); // create the lexer
   ColimasComponentWordParser parser = new ColimasComponentWordParser(lexer); // create the parser
   try {
    parser.document(); // parse the whole document
    List records=lexer.getRecords(); // print the results
    for(int i=0;i<records.size();i++){
     Component component=(Component)records.get(i);
     System.out.println(component.toString());
    }
   } catch (RecognitionException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
   } catch (TokenStreamException e1) {
    // TODO Auto-generated catch block
    e1.printStackTrace();
   }
  } catch (FileNotFoundException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }

 }
}

The result is as follows:

alias:POI;name:Apache Jakarta POI Java API To Access Microsoft Format Files;author:http://jakarta.apache.org/poi/index.html;version:0.0.1;language:Java;platform:Windows, Linux, Unix;status:Confirmed;ispublic:Y;developers:Apache developer1
Apache developer2
;function:http://jakarta.apache.org/poi/index.html
The POI project consists of APIs for manipulating various file formats based upon Microsoft's OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you'll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution. However, we have a complete API for porting other OLE 2 Compound Document formats and welcome others to participate.
OLE 2 Compound Document Format based files include most Microsoft Office files such as XLS and DOC as well as MFC serialization API based file formats.
As a general policy we try to collaborate as much as possible with other projects to provide this functionality. Examples include: Cocoon for which there are serializers for HSSF; Open Office.org with whom we collaborate in documenting the XLS format; and Lucene for which we'll soon have file format interpretors. When practical, we donate components directly to those projects for POI-enabling them.

;license:Apache License 2.0

alias:ANTLR;name:ANother Tool for Language Recognition;author:http://www.antlr.org/;version:2.7.6;language:Java;platform:Windows, Linux;status:Confirmed;ispublic:Y;developers:Terence Parr
;function:What is ANTLR?
ANTLR, ANother Tool for Language Recognition, (formerly PCCTS) is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing Java, C#, C++, or Python actions. ANTLR provides excellent support for tree construction, tree walking, and translation. There are currently about 5,000 ANTLR source downloads a month.

;license: [The BSD License]Copyright (c) 2005, Terence Parr All rights reserved. http://www.antlr.org/license.html

 
