Parsing Word with Java POI and Storing the Extracted Data in Excel

Latest recommended article published 2024-07-26 16:03:13

thelostworld (WeChat official account)

Categories: Network Security, tools. Tags: java, poi

Article link: https://blog.csdn.net/qq_37602797/article/details/116654375

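Parsing Word with Java POI and extracting data to Excel

I. Getting to know POI

I had some prior exposure to POI. This time the requirement was to parse a Word document, read its headings, and filter the body text to extract only the content I needed.

After two days of study I became familiar with how to read and parse Word documents in Java.

The approach in this article reads the Range of the whole document and filters the data inside that Range to obtain the desired values.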
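To make the filtering idea concrete, here is a minimal sketch of the font-size heuristic that the key code in section II relies on: a paragraph followed by a smaller-font paragraph is treated as a heading, and the following paragraphs with that same smaller font are collected as its body. It works on plain records rather than on POI objects so it can run without the library; the `Para` class, the `HeadingGrouper` name, and the sample data are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class HeadingGrouper {
    /** Hypothetical stand-in for a Word paragraph: its text and the font size of its first run. */
    static final class Para {
        final String text;
        final int fontSize;
        Para(String text, int fontSize) { this.text = text; this.fontSize = fontSize; }
    }

    /** Groups body paragraphs under the heading that precedes them, based on font-size drops. */
    static Map<String, List<String>> group(List<Para> paras) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        int i = 0;
        while (i + 1 < paras.size()) {
            Para head = paras.get(i);
            Para next = paras.get(i + 1);
            boolean drop = !head.text.isEmpty() && !next.text.isEmpty()
                    && head.fontSize > next.fontSize;
            // Three consecutive decreasing sizes: the first one is a top-level title, skip it.
            boolean topTitle = drop && i + 2 < paras.size()
                    && next.fontSize > paras.get(i + 2).fontSize;
            if (drop && !topTitle) {
                List<String> body = new ArrayList<>();
                int j = i + 1;
                // Collect the run of paragraphs that share the body font size.
                while (j < paras.size() && paras.get(j).fontSize == next.fontSize) {
                    body.add(paras.get(j).text);
                    j++;
                }
                result.put(head.text, body);
                i = j;
            } else {
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Para> paras = Arrays.asList(
                new Para("Report", 22),
                new Para("Asset A", 16),
                new Para("192.168.1.100", 10),
                new Para("www.xxx.net:8080", 10),
                new Para("Asset B", 16),
                new Para("10.0.0.1", 10));
        // Prints {Asset A=[192.168.1.100, www.xxx.net:8080], Asset B=[10.0.0.1]}
        System.out.println(group(paras));
    }
}
```

With POI, each record would presumably be built from `Paragraph.text()` and `getCharacterRun(0).getFontSize()`, as the key code does.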

https://github.com/zxiang179/POI

http://deepoove.com/poi-tl/#_why_poi_tl

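The main goal is to understand how POI's calls parse the data.

https://poi.apache.org/ (official documentation and API)

The desired result

The IPs and domain names used for testing below were picked arbitrarily to cover different forms: IP+port, URL, http://www.XXX.com/, www.xxx.net:8080, and other URL combinations that occur in practice.

The current requirement is to capture these few key pieces of content.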

II. Parsing Word with POI

Add the jar dependencies with Maven

The full set of POI dependencies

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>

Key code

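import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class GetWord {
    public static void main(String[] args) {
        try {
            // Change "path" to the location of the Word document.
            InputStream is = new FileInputStream(new File("path"));
            POIFSFileSystem fs = new POIFSFileSystem(is);
            HWPFDocument document = new HWPFDocument(fs);
            Range range = document.getRange();

            // word, list2 and the other objects used below are declared in code not shown in this excerpt.
            CharacterRun run1 = null; // formatting of the current paragraph
            CharacterRun run2 = null; // formatting of the next paragraph
            String content = "";      // body text collected under the current heading
            int q = 1;
            for (int i = 0; i < range.numParagraphs() - 2; i++) {
                Paragraph para1 = range.getParagraph(i);     // paragraph i
                Paragraph para2 = range.getParagraph(i + 1); // paragraph i+1
                int t = i; // index of the paragraph currently being analyzed

                String paratext1 = para1.text(); // current paragraph and the next one
                String paratext2 = para2.text();
                run1 = para1.getCharacterRun(0);
                run2 = para2.getCharacterRun(0);
                if (paratext1.length() > 0 && paratext2.length() > 0) {
                    // Filters out top-level titles: skip when three consecutive paragraphs have decreasing font sizes.
                    if (run1.getFontSize() > run2.getFontSize()
                            && run2.getFontSize() > range.getParagraph(i + 2).getCharacterRun(0).getFontSize()) {
                        break;
                    }
                    // Two consecutive paragraphs with different font sizes: heading followed by body.
                    if (run1.getFontSize() > run2.getFontSize()) {
                        content = paratext2;
                        run1 = run2; // re-anchor run1 and run2 one paragraph further
                        run2 = range.getParagraph(t + 2).getCharacterRun(0);
                        t = t + 1;
                        while (run1.getFontSize() == run2.getFontSize()) {
                            // Consecutive paragraphs with the same font size belong to the same body.
                            content += range.getParagraph(t + 1).text();
                            if (content.matches("是否系统自身渗透结果")) break;
                            if (content.contains(":") || content.contains("/")) { // http URL or IP+port form
                                word.setName(getDomain(content)); // IP or domain

                            } else { // plain IP form
                                word.setName(content); // IP
                            }
                            word.setAsset_name(paratext1); // heading
                            run1 = run2;
                            run2 = range.getParagraph(t + 2).getCharacterRun(0);
                            t++;
        ……..
                        }
                  ………
                    }
                    // Severity labels: "中危" = medium, "很高" = very high, "低危" = low.
                    if (paratext1.matches("中危") || paratext1.matches("很高") || paratext1.matches("低危")) {
                        list2.add(paratext1);
                    }
                }
            }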
                    // The loops that enclose this block are not shown in this excerpt.
                    if (word1.getAsset_name().equals(wordname.getAsset_name())) {
…..
                        wordend.setId(num);
                        wordend.setName(word1.getName());
                        wordend.setAsset_name(word1.getAsset_name());

                        // Map the severity label to a numeric score.
                        if (wordname.getAvailability_assignment().equals("很高")) { // very high
                            wordend.setAvailability_assignment("5");
                        }
                        if (wordname.getAvailability_assignment().equals("高危")) { // high
                            wordend.setAvailability_assignment("4");
                        }
                        if (wordname.getAvailability_assignment().equals("中危")) { // medium
                            wordend.setAvailability_assignment("3");
                        }
                        if (wordname.getAvailability_assignment().equals("低危")) { // low
                            wordend.setAvailability_assignment("2");
                        }
                        wordend.setRisk("网络攻击"); // "network attack"
                        wordmax.add(wordend);
                        num++;
      ……
                    }
                }
            }
            System.out.println("------------------Finished-详情查看-中间文档转excel.xlsx------------------");
            // Export to Excel without a template
            ExcelUtil.getInstar ().exportl(wordmax, Word.class, "中间文档转excel.xlsx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
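The chain of `if` statements that turns a severity label into a score could also be written as a lookup table. The following is a minimal sketch, assuming exactly the four labels and scores used in the code above; the class name `SeverityScore` and the method `toScore` are hypothetical and do not appear in the original. The label is trimmed first, on the assumption that text read from a paragraph may carry trailing whitespace or a paragraph mark.

```java
import java.util.HashMap;
import java.util.Map;

final class SeverityScore {
    // Label-to-score table taken from the if chain in the key code.
    private static final Map<String, String> SCORES = new HashMap<>();
    static {
        SCORES.put("很高", "5"); // very high
        SCORES.put("高危", "4"); // high
        SCORES.put("中危", "3"); // medium
        SCORES.put("低危", "2"); // low
    }

    /** Returns the score for a severity label, or null when the label is unknown. */
    static String toScore(String label) {
        if (label == null) {
            return null;
        }
        return SCORES.get(label.trim());
    }

    public static void main(String[] args) {
        System.out.println(toScore("很高"));    // prints 5
        System.out.println(toScore("低危\r"));  // prints 2, the trailing \r is trimmed
        System.out.println(toScore("unknown")); // prints null
    }
}
```

A table keeps the labels in one place, so adding or changing a level touches a single line, and an unknown label shows up as `null` instead of silently leaving the field unset.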

The code above uses a helper that extracts the IP or domain from a URL:

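import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        String url = "www.abc.def.com.cn/123456 ";
        String url1 = "192.168.1.100/admin";
        System.out.println(getDomain(url));  // prints abc.def.com.cn
        System.out.println(getDomain(url1)); // prints 192.168.1.100
    }

    // Returns the part before the first "/", minus one leading alphabetic label such as "www.";
    // returns null when the input does not match.
    private static String getDomain(String url) {
        String regex = "^(?:[a-zA-Z]+[.])?(\\w+([.]\\w+)*)/.*$";
        Matcher matcher = Pattern.compile(regex).matcher(url);
        String result = null;
        if (matcher.find()) {
            result = matcher.group(1);
        }
        return result;
    }
}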
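The regular expression above expects the host to be followed directly by a `/`, so inputs such as `http://www.XXX.com/` or `www.xxx.net:8080` do not match and yield `null`. One possible way to cover those forms is to let `java.net.URI` do the parsing. This is a sketch under that assumption, not the author's method; the class name `HostExtractor` and the method `extractHost` are hypothetical. Unlike the regex, it keeps a leading `www.`.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class HostExtractor {
    /** Returns the host (IP or domain) of a URL-like string, or null if none can be parsed. */
    static String extractHost(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim();
        if (s.isEmpty()) {
            return null;
        }
        // URI only recognizes a host when a scheme separator is present, so add one if missing.
        if (!s.contains("://")) {
            s = "http://" + s;
        }
        try {
            return new URI(s).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extractHost("http://www.example.com/"));     // prints www.example.com
        System.out.println(extractHost("www.xxx.net:8080"));            // prints www.xxx.net
        System.out.println(extractHost("192.168.1.100/admin"));         // prints 192.168.1.100
        System.out.println(extractHost("www.abc.def.com.cn/123456 "));  // prints www.abc.def.com.cn
    }
}
```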

Output: