Parsing Excel into objects with POI: using Java POI to parse Word and extract data to Excel

Latest recommended article published on 2022-08-29 16:10:30

weixin_39715348


Parsing Word with Java POI and storing the extracted data in Excel

I. Getting to know POI

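I had some earlier exposure to POI. This time the requirement is to parse a Word document, read its headings, and filter the body text to pull out the fields I need.

After two days of study I started to get familiar with how Java reads and parses Word documents.

The approach in this article is to read the range of the whole document and filter the data inside that range to get the data I want.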

https://github.com/zxiang179/POI

http://deepoove.com/poi-tl/#_why_poi_tl

The main goal is to understand how POI exposes the document data and how to parse it.

https://poi.apache.org/ (official documentation and API)

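Desired result

The IPs and domain names used in the tests below were picked arbitrarily to cover different forms: IP+port, URL, http://www.XXX.com, www.xxx.net:8080, and other URL combinations that occur in practice.

The current requirement is to extract these few key fields.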

II. Parsing Word with POI

Add the jar dependencies with Maven

All of the POI dependencies

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>

Key code

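import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class GetWord1 {

    public static void main(String[] args) {
        try {
            // Change "path" to the location of the Word (.doc) document.
            InputStream is = new FileInputStream(new File("path"));
            POIFSFileSystem fs = new POIFSFileSystem(is);
            HWPFDocument document = new HWPFDocument(fs);
            Range range = document.getRange();

            CharacterRun run1 = null; // formatting of the first paragraph of the pair
            CharacterRun run2 = null; // formatting of the second paragraph of the pair

            // These declarations are not shown in the original excerpt; the types are assumed.
            // Word is the author's own bean class and getDomain is the helper listed further below.
            String content = "";
            Word word = new Word();
            List<String> list2 = new ArrayList<>();
            List<Word> wordmax = new ArrayList<>();
            int num = 1;
            int q = 1;

            for (int i = 0; i < range.numParagraphs() - 2; i++) {
                Paragraph para1 = range.getParagraph(i);     // paragraph i
                Paragraph para2 = range.getParagraph(i + 1); // paragraph i + 1
                int t = i; // index of the paragraph currently being analyzed
                String paratext1 = para1.text(); // current paragraph
                String paratext2 = para2.text(); // next paragraph
                run1 = para1.getCharacterRun(0);
                run2 = para2.getCharacterRun(0);

                if (paratext1.length() > 0 && paratext2.length() > 0) {
                    // Skip the main title: three consecutive paragraphs with decreasing font size.
                    if (run1.getFontSize() > run2.getFontSize()
                            && run2.getFontSize() > range.getParagraph(i + 2).getCharacterRun(0).getFontSize()) {
                        continue;
                    }
                    // Two consecutive paragraphs with different font sizes: a heading followed by content.
                    if (run1.getFontSize() > run2.getFontSize()) {
                        content = paratext2;
                        run1 = run2; // re-anchor run1 and run2
                        run2 = range.getParagraph(t + 2).getCharacterRun(0);
                        t = t + 1;
                        // Consecutive paragraphs with the same font size belong to the same block.
                        while (t + 2 < range.numParagraphs() && run1.getFontSize() == run2.getFontSize()) {
                            content += range.getParagraph(t + 1).text();
                            // Stop at the "是否系统自身渗透结果" (is this the system's own penetration result) line.
                            if (content.matches("是否系统自身渗透结果")) break;
                            if (content.contains(":") || content.contains("/")) {
                                // http URL or IP+port form
                                word.setName(getDomain(content)); // IP
                            } else {
                                // plain IP form
                                word.setName(content); // IP
                            }
                            word.setAsset_name(paratext1); // heading
                            run1 = run2;
                            run2 = range.getParagraph(t + 2).getCharacterRun(0);
                            t++;
                            // ……..
                        }
                        // ………
                    }
                    // Collect severity labels: 中危 (medium), 很高 (very high), 低危 (low).
                    if (paratext1.matches("中危") || paratext1.matches("很高") || paratext1.matches("低危")) {
                        list2.add(paratext1);
                    }
                }
            }

            // The code that binds word1, wordname and wordend is elided in the original.
            if (word1.getAsset_name().equals(wordname.getAsset_name())) {
                // …..
                wordend.setId(num);
                wordend.setName(word1.getName());
                wordend.setAsset_name(word1.getAsset_name());
                // Map the severity label to a numeric score.
                if (wordname.getAvailability_assignment().equals("很高")) { // very high
                    wordend.setAvailability_assignment("5");
                }
                if (wordname.getAvailability_assignment().equals("高危")) { // high
                    wordend.setAvailability_assignment("4");
                }
                if (wordname.getAvailability_assignment().equals("中危")) { // medium
                    wordend.setAvailability_assignment("3");
                }
                if (wordname.getAvailability_assignment().equals("低危")) { // low
                    wordend.setAvailability_assignment("2");
                }
                wordend.setRisk("网络攻击"); // "network attack"
                wordmax.add(wordend);
                num++;
                // ……
            }

            System.out.println("------------------Finished-详情查看-中间文档转excel.xlsx------------------");
            // Export to Excel without a template (ExcelUtil is the author's own utility, not shown).
            ExcelUtil.getInstar().exportl(wordmax, Word.class, "中间文档转excel.xlsx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}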
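The chain of if statements that turns a severity label into a score can also be written as a lookup table. The following is a minimal sketch rather than the author's code: the class name SeverityScore, the method scoreOf, and the fallback for unknown labels are all hypothetical. Only the four labels and their scores come from the code above. The trim() call is an assumption that the label may carry a trailing paragraph mark, which HWPF paragraph text typically does.

```java
import java.util.Map;

class SeverityScore {
    // Severity label -> score, copied from the if chain in the parsing code:
    // 很高 (very high) = 5, 高危 (high) = 4, 中危 (medium) = 3, 低危 (low) = 2.
    private static final Map<String, String> SCORES = Map.of(
            "很高", "5",
            "高危", "4",
            "中危", "3",
            "低危", "2");

    // Returns the score for a label, or the caller's fallback when the label is null or unknown.
    static String scoreOf(String label, String fallback) {
        if (label == null) {
            return fallback;
        }
        // trim() drops whitespace and control characters such as a trailing '\r'.
        return SCORES.getOrDefault(label.trim(), fallback);
    }

    public static void main(String[] args) {
        System.out.println(scoreOf("很高", "0"));
        System.out.println(scoreOf("中危\r", "0"));
        System.out.println(scoreOf("unknown", "0"));
    }
}
```

With a table like this, adding or changing a severity level touches one line instead of a new if block.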

The code above uses a helper that extracts the IP or domain from a URL:

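import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    public static void main(String[] args) {
        String url = "http://www.abc.def.com.cn/123456";
        String url1 = "192.168.1.100/admin";
        System.out.println(getDomain(url));
        System.out.println(getDomain(url1));
    }

    // Returns the domain or IP contained in the URL, or null if nothing matches.
    private static String getDomain(String url) {
        // Optional scheme ("http://"), optional leading letters-only label ("www."),
        // then the captured host, an optional port, and an optional path.
        String regex = "^(?:[a-zA-Z]+://)?(?:[a-zA-Z]+[.])?(\\w+([.]\\w+)*)(?::\\d+)?(?:/.*)?$";
        Matcher matcher = Pattern.compile(regex).matcher(url.trim());
        String result = null;
        if (matcher.find()) {
            result = matcher.group(1);
        }
        return result;
    }
}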
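As an alternative to a hand-written regular expression, the host can be read with the standard java.net.URI class. This is a sketch under assumptions, not the method used in the article: the class HostExtractor and the method hostOf are hypothetical names, and prefixing "http://" when no scheme is present is a workaround chosen here because URI only reports a host for inputs that have one. Unlike the regex above, this version keeps a leading "www." label.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostExtractor {
    // Returns the host of inputs such as "http://host/path", "host:8080" or "1.2.3.4/admin",
    // or null when the input cannot be parsed as a URI with a host.
    static String hostOf(String target) {
        String s = target.trim();
        // Add a placeholder scheme so that URI treats the first segment as the authority.
        if (!s.contains("://")) {
            s = "http://" + s;
        }
        try {
            return new URI(s).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.abc.def.com.cn/123456"));
        System.out.println(hostOf("192.168.1.100/admin"));
        System.out.println(hostOf("www.xxx.net:8080"));
    }
}
```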

Result of the run:

Complete test data

Output to Excel

Data processing and Excel export through the GUI

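III. Summary

1. After about three days with POI, the main approach is to traverse the whole range of the document to collect the data, then filter and process it according to the specific data needed.

2. Extracting data from Word in Java is harder than extracting it from HTML. HTML has tags, so data can be located through JS, CSS, and HTML tags. Word has no such tags to select on, so the only option is to traverse the document and compare text size, font, and similar attributes.

3. Together with the earlier HTML parsing work, this Word parsing exercise gave me a new understanding of how Java handles different kinds of file streams.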
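The font-size comparison described in point 2 can be illustrated without POI. The sketch below models each paragraph as a text plus a font size and groups paragraphs the same way the parsing code does: three paragraphs of decreasing size mark the main title and are skipped, a paragraph followed by a smaller one is treated as a heading, and the equally sized paragraphs after it are its content. The names FontSizeGrouping, Para, and groupByHeading are hypothetical and the sample data is made up. With POI, the sizes would come from CharacterRun.getFontSize().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FontSizeGrouping {
    // A paragraph reduced to the two attributes the heuristic looks at.
    static final class Para {
        final String text;
        final int fontSize;

        Para(String text, int fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    // Maps each heading to the concatenated text of the equally sized paragraphs that follow it.
    static Map<String, String> groupByHeading(List<Para> paras) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        while (i + 1 < paras.size()) {
            Para head = paras.get(i);
            Para next = paras.get(i + 1);
            boolean shrinking = head.fontSize > next.fontSize;
            // Main title: three consecutive paragraphs with decreasing font size.
            if (shrinking && i + 2 < paras.size() && next.fontSize > paras.get(i + 2).fontSize) {
                i++;
                continue;
            }
            if (shrinking) {
                StringBuilder content = new StringBuilder(next.text);
                int j = i + 2;
                // Absorb the following paragraphs that share the content font size.
                while (j < paras.size() && paras.get(j).fontSize == next.fontSize) {
                    content.append(paras.get(j).text);
                    j++;
                }
                result.put(head.text, content.toString());
                i = j;
            } else {
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Para> paras = new ArrayList<>();
        paras.add(new Para("Report title", 22));
        paras.add(new Para("Asset A", 16));
        paras.add(new Para("192.168.1.100", 12));
        paras.add(new Para(":8080", 12));
        paras.add(new Para("Asset B", 16));
        paras.add(new Para("www.xxx.net", 12));
        System.out.println(groupByHeading(paras));
    }
}
```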

Java crawler & HTML parsing with Jsoup (NSFOCUS Aurora report) (https://mp.weixin.qq.com/s?__biz=MzIyNjk0ODYxMA==&mid=2247483708&idx=1&sn=4734ff8b79069eef3d47b9f7eedca269&chksm=e869e251df1e6b471ac158e121b845ed0fe0d0a88fe7b16d8124db5180eb3b6d4091d1444f72&token=1606015301&lang=zh_CN#rd)

IV. References

https://github.com/zxiang179/POI

http://deepoove.com/poi-tl/#_why_poi_tl

https://poi.apache.org

WeChat official account: thelostworld

Personal Zhihu: https://www.zhihu.com/people/fu-wei-43-69/columns

Personal Jianshu: https://www.jianshu.com/u/bf0e38a8d400
