Converting Word Documents to HTML and Extracting Headings

weixin_30416871

Original link: http://www.cnblogs.com/channingwong/p/9698924.html

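I recently built a feature that converts Word documents to HTML and extracts the headings to generate a navigation menu. Given how complex the full feature would be, I narrowed the requirement to extracting only paragraphs that use the "Heading 1" ("标题1") style.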

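Files with the .docx extension (Word 2007 and later) use an XML-based file format. A .docx file is essentially a zip archive, and unzipping it exposes all of its contents. Perhaps for that reason, XHTMLConverter is enough to produce the corresponding HTML document, and heading elements come out with a class attribute of "X" + n, where n is the heading level.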
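To illustrate how the "X" + n class can drive heading extraction, here is a minimal sketch that uses only the JDK's DOM parser. The sample markup and the HeadingClassExtractor class are hypothetical, made up for this example; real XHTMLConverter output may use different elements and may need a more lenient parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of POI or XHTMLConverter.
final class HeadingClassExtractor {

    /** Returns the text of every element whose class list contains "X" + level. */
    static List<String> extractHeadings(String xhtml, int level) {
        String wanted = "X" + level;
        List<String> headings = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "*" matches every element, so no assumption is made about the tag name.
            NodeList elements = doc.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                Element element = (Element) elements.item(i);
                // getAttribute returns "" when the attribute is absent.
                for (String cls : element.getAttribute("class").trim().split("\\s+")) {
                    if (cls.equals(wanted)) {
                        headings.add(element.getTextContent().trim());
                        break;
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return headings;
    }

    public static void main(String[] args) {
        // Assumed shape of the converter output, for illustration only.
        String sample = "<html><body>"
                + "<p class=\"X1\"><span>Overview</span></p>"
                + "<p class=\"a\"><span>Body text.</span></p>"
                + "<p class=\"X2\"><span>Details</span></p>"
                + "<p class=\"X1 b\"><span>Summary</span></p>"
                + "</body></html>";
        System.out.println(extractHeadings(sample, 1));
    }
}
```

The returned list keeps document order, so it can be turned into a navigation menu directly.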

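.doc files are more troublesome. They are usually read with POI, and the most common way to convert them to HTML is POI's WordToHtmlConverter. This converter gives headings no special treatment: it handles each one as an ordinary styled paragraph (Paragraph), so headings end up mixed in with regular paragraphs. There are two ways to deal with this: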

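Option 1: Override the processParagraph method and, at the commented spot, add a check for headings so that they get special handling. However, WordToHtmlConverter declares all of its member variables private, so a subclass cannot access the fields this method relies on. For that reason I went with the other option.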

protected void processParagraph(HWPFDocumentCore hwpfDocument, Element parentElement, int currentTableLevel, Paragraph paragraph, String bulletText) {
    Element pElement = this.htmlDocumentFacade.createParagraph();
    parentElement.appendChild(pElement);
    StringBuilder style = new StringBuilder();
    WordToHtmlUtils.addParagraphProperties(paragraph, style);
    int charRuns = paragraph.numCharacterRuns();
    if(charRuns != 0) {
        CharacterRun characterRun = paragraph.getCharacterRun(0);
        String pFontName;
        int pFontSize;
        if(characterRun != null) {
            Triplet triplet = this.getCharacterRunTriplet(characterRun);
            pFontSize = characterRun.getFontSize() / 2;
            pFontName = triplet.fontName;
            WordToHtmlUtils.addFontFamily(pFontName, style);
            WordToHtmlUtils.addFontSize(pFontSize, style);
        } else {
            pFontSize = -1;
            pFontName = "";
        }

        this.blocksProperies.push(new WordToHtmlConverter.BlockProperies(pFontName, pFontSize));

        try {
            if(WordToHtmlUtils.isNotEmpty(bulletText)) {
                if(bulletText.endsWith("\t")) {
                    float defaultTab = 720.0F;
                    float firstLinePosition = (float)(paragraph.getIndentFromLeft() + paragraph.getFirstLineIndent() + 20);
                    float nextStop = (float)(Math.ceil((double)(firstLinePosition / 720.0F)) * 720.0D);
                    float spanMinWidth = nextStop - firstLinePosition;
                    Element span = this.htmlDocumentFacade.getDocument().createElement("span");
                    this.htmlDocumentFacade.addStyleClass(span, "s", "display: inline-block; text-indent: 0; min-width: " + spanMinWidth / 1440.0F + "in;");
                    pElement.appendChild(span);
                    Text textNode = this.htmlDocumentFacade.createText(bulletText.substring(0, bulletText.length() - 1) + '\u200b' + ' ');
                    span.appendChild(textNode);
                } else {
                    Text textNode = this.htmlDocumentFacade.createText(bulletText.substring(0, bulletText.length() - 1));
                    pElement.appendChild(textNode);
                }
            }

            this.processCharacters(hwpfDocument, currentTableLevel, paragraph, pElement);
        } finally {
            this.blocksProperies.pop();
        }

        // This is the spot that needs to change: add the heading check here
        if(style.length() > 0) {
            this.htmlDocumentFacade.addStyleClass(pElement, "p", style.toString());
        }

        WordToHtmlUtils.compactSpans(pElement);
    }
}
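The heading check that Option 1 calls for could look roughly like the following, shown in isolation from the converter. This is only a sketch: it assumes the paragraph's style name has already been looked up (for example through StyleSheet.getStyleDescription(...).getName()) and that heading styles are named like "heading 1" or "标题 1". The HeadingStyle class and both name patterns are assumptions for illustration, not POI API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decides whether a style name denotes a heading.
final class HeadingStyle {

    // Assumed naming: "heading N" (English Word) or the Chinese equivalent, N = 1..9.
    // The Chinese word is written with Unicode escapes to avoid source-encoding issues.
    private static final Pattern HEADING_NAME = Pattern.compile(
            "^(?:heading|\u6807\u9898)\\s*([1-9])$", Pattern.CASE_INSENSITIVE);

    /** Returns the heading level (1-9), or 0 if the style name is not a heading. */
    static int headingLevel(String styleName) {
        if (styleName == null) {
            return 0;
        }
        Matcher matcher = HEADING_NAME.matcher(styleName.trim());
        return matcher.matches() ? Integer.parseInt(matcher.group(1)) : 0;
    }

    /** Picks the HTML tag for a paragraph: h1..h6 for headings, p otherwise. */
    static String tagFor(String styleName) {
        int level = headingLevel(styleName);
        return level == 0 ? "p" : "h" + Math.min(level, 6);
    }

    public static void main(String[] args) {
        System.out.println(tagFor("heading 1"));
        System.out.println(tagFor("Normal"));
    }
}
```

Inside an overridden processParagraph, the result of such a check would decide whether to emit a heading element or tag the paragraph with a distinguishing class instead of the default "p" style class.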

Option 2: Plant markers in the Word document, then post-process the converted HTML using idTitleMap.

private Map<String, String> setTitleElements(HWPFDocument wordObject) {
    // Get the style sheet
    StyleSheet styleSheet = wordObject.getStyleSheet();
    int styleTotal = styleSheet.numStyles();
    // Map each marker UUID to the heading text it replaces
    Map<String, String> idTitleMap = Maps.newHashMap();
    Range range = wordObject.getRange();
    for (int i = 0; i < range.numParagraphs(); i++) {
        // Look up the paragraph's style
        Paragraph paragraph = range.getParagraph(i);
        int styleIndex = paragraph.getStyleIndex();
        if (styleTotal > styleIndex) {
            StyleDescription styleDescription = styleSheet.getStyleDescription(styleIndex);
            String descriptionName = styleDescription.getName();
            // FIRST_LEVEL_TITLE_DESCRIPTION is a constant defined elsewhere (not shown)
            if (descriptionName != null && descriptionName.contains(FIRST_LEVEL_TITLE_DESCRIPTION)) {
                // UUIDHelper is a helper class that is not shown here
                String uuid = UUIDHelper.getUuid();
                String text = paragraph.text().replaceAll("[\r\n]", "");
                // Replace the heading text in the document with the UUID marker
                paragraph.replaceText(uuid, false);
                idTitleMap.put(uuid, text);
            }
        }
    }

    return idTitleMap;
}
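The post-processing step is not shown in the original, so here is a minimal sketch of what it might look like, using plain string replacement. It assumes each UUID marker survives conversion as literal text in the HTML. The TitlePlaceholderDemo class, the "title-" id prefix, and the shape of the generated markup are all hypothetical choices for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical post-processor: swaps UUID markers back to headings and builds a nav list.
final class TitlePlaceholderDemo {

    /** Rewritten HTML plus the navigation markup generated from the headings. */
    static final class Result {
        final String html;
        final String nav;

        Result(String html, String nav) {
            this.html = html;
            this.nav = nav;
        }
    }

    static Result restoreTitles(String html, Map<String, String> idTitleMap) {
        StringBuilder nav = new StringBuilder("<ul>");
        String out = html;
        for (Map.Entry<String, String> entry : idTitleMap.entrySet()) {
            String uuid = entry.getKey();
            if (!out.contains(uuid)) {
                continue; // marker did not survive conversion; skip it
            }
            String title = escape(entry.getValue());
            String anchorId = "title-" + uuid;
            // Put the original heading text back, wrapped in an anchor the nav can target.
            out = out.replace(uuid, "<a id=\"" + anchorId + "\">" + title + "</a>");
            nav.append("<li><a href=\"#").append(anchorId).append("\">")
               .append(title).append("</a></li>");
        }
        nav.append("</ul>");
        return new Result(out, nav.toString());
    }

    /** Escapes the characters that are unsafe in HTML text content. */
    private static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> idTitleMap = new LinkedHashMap<>();
        idTitleMap.put("3f2a9c", "Chapter 1");
        Result result = restoreTitles("<p>3f2a9c</p><p>Body text.</p>", idTitleMap);
        System.out.println(result.html);
        System.out.println(result.nav);
    }
}
```

A HashMap gives no ordering guarantee, so if the navigation should follow document order, the map that collects the markers would need to preserve insertion order, for example a LinkedHashMap.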

Reposted from: https://www.cnblogs.com/channingwong/p/9698924.html
