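Requirement: extract mobile phone numbers from Word documents.
First, download Apache POI from its download page.
I downloaded the binary distribution and unpacked it.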
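Note: for convenience I imported all six POI jars from the distribution into Eclipse. The jars in the ooxml-lib folder must be imported as well, otherwise you get: java.lang.ClassNotFoundException: org.apache.xmlbeans.XmlException
So in the end both the six POI jars and the ooxml-lib jars are on the build path.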
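Also make sure that no jar from an older POI version is still on the classpath, or you will run into errors. At the very beginning I had imported tm-extractors-0.4.jar (which bundles POI internally) and did not remove it before adding the new POI jars, so every run failed with: java.lang.VerifyError: (class: ...
After some searching on Google I found that the cause was the old POI version, as described in the FAQ.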
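One way to track down this kind of conflict is to ask the JVM where it actually loaded a class from. The following is only a minimal sketch: the class name `ClasspathCheck` and the method `locate` are made up for illustration, and the three class names in `main` are simply the ones mentioned in this post. If a POI class resolves to an unexpected jar, or `org.apache.xmlbeans.XmlException` is reported as missing, the build path probably needs cleaning up.

```java
import java.security.CodeSource;

public class ClasspathCheck {

    /** Returns the location a class was loaded from, or "NOT FOUND" if it is missing. */
    public static String locate(String className) {
        try {
            // initialize=false: only load the class, do not run its static initializers
            Class<?> c = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself usually have no code source.
            if (src == null || src.getLocation() == null) {
                return "(JDK / unknown location)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "org.apache.poi.hwpf.extractor.WordExtractor",
            "org.apache.poi.xwpf.extractor.XWPFWordExtractor",
            "org.apache.xmlbeans.XmlException"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

Run it with the same build path as the project. The printed jar path shows which copy of a class wins when several jars contain it.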
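With everything set up, we can start working with Word files. Word documents usually come in two formats, .doc and .docx, and POI reads them with different classes, so the code has to check the format and handle each case separately: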
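The code in this post decides by file extension. If the extension cannot be trusted, the first bytes of the file are another hint: a .doc file is an OLE2 container and a .docx file is a ZIP archive. The sketch below is not part of the original program; `WordContainerSniffer`, `Container` and `detect` are hypothetical names, and the check only identifies the container type, since other Office formats share the same signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class WordContainerSniffer {

    public enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound files (.doc, but also .xls and .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of ZIP archives (.docx, but also any other zip)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private static boolean startsWith(byte[] head, int[] magic) {
        if (head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((head[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Classifies a file by its leading bytes. */
    public static Container detect(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    public static Container detect(String path) throws IOException {
        byte[] head = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int n;
            while (filled < head.length
                    && (n = in.read(head, filled, head.length - filled)) > 0) {
                filled += n;
            }
        }
        return detect(Arrays.copyOf(head, filled));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(detect(args[0]));
        }
    }
}
```

An `OLE2` result would go to the .doc branch and a `ZIP` result to the .docx branch. `UNKNOWN` means the file is probably not a Word document at all.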
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.poi.POIXMLDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
public class ReadWordByPoi4 {

    // An 11-digit mobile number whose second digit is 3, 4, 5, 7 or 8, optionally
    // prefixed with the country code 86. The lookarounds reject longer digit runs.
    private static final Pattern PHONE = Pattern.compile(
            "(?<!\\d)(?:(?:1[34578]\\d{9})|(?:861[34578]\\d{9}))(?!\\d)");

    /** Returns all phone numbers found in the Word file, joined with commas. */
    public static String getPhoneNum(String filePath, String fileName) {
        String text = "";
        String phoneNum = "";
        String realPath = filePath + "/" + fileName; // full path including the file name
        try {
            if (fileName.endsWith(".doc")) { // files with the .doc suffix
                // try-with-resources closes the stream even if extraction fails
                try (FileInputStream in = new FileInputStream(realPath)) {
                    WordExtractor extractor = new WordExtractor(in);
                    text = extractor.getText();
                }
            } else if (fileName.endsWith(".docx")) { // files with the .docx suffix
                XWPFWordExtractor docx = new XWPFWordExtractor(POIXMLDocument.openPackage(realPath));
                text = docx.getText();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Use the regular expression to find the phone numbers
        if (!"".equals(text)) {
            Matcher matcher = PHONE.matcher(text);
            StringBuilder sb = new StringBuilder(64);
            while (matcher.find()) {
                sb.append(matcher.group()).append(",");
            }
            int len = sb.length();
            if (len > 0) {
                sb.deleteCharAt(len - 1); // drop the trailing comma
            }
            phoneNum = sb.toString();
        }
        return phoneNum;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(getPhoneNum("D:/shiyanshuju", "xx.doc"));
    }
}
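The regular expression can be tried out without POI or a Word file. The sketch below reuses the exact pattern from getPhoneNum on a plain string; the class name `PhoneRegexDemo`, the method `extract` and the sample text are made up for illustration. Note that the pattern only accepts 3, 4, 5, 7 or 8 as the second digit, so the character class may need widening if other prefixes have to be matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneRegexDemo {

    // Same pattern as in getPhoneNum: 11 digits starting with 1 and one of 3/4/5/7/8,
    // optionally preceded by 86, and not embedded in a longer run of digits.
    private static final Pattern PHONE = Pattern.compile(
            "(?<!\\d)(?:(?:1[34578]\\d{9})|(?:861[34578]\\d{9}))(?!\\d)");

    /** Returns every match in the order it appears in the text. */
    public static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "Tel: 13812345678, +8613912345678; id 123456789012345";
        // prints [13812345678, 8613912345678]
        System.out.println(extract(sample));
    }
}
```

The 15-digit id is skipped because of the lookbehind `(?<!\\d)` and lookahead `(?!\\d)`: a match may not have a digit directly before or after it, so a phone-like substring inside a longer number is never reported.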