Notes on Parsing ODS Files

霂宇

Last modified: 2024-10-05 16:02:19

Tags: java

First published: 2024-10-05 15:58:33

Article link: https://blog.csdn.net/szy350/article/details/142715262

1. What Is an ODS File

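An ods file is the spreadsheet member of the XML-based OpenDocument Format (ODF) family, and it can be opened with Excel. If you unzip an ods file, you will find that it is actually made up of several XML files, including settings.xml, content.xml, meta.xml, and styles.xml. Of these, content.xml holds the data of the file.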
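To see the archive layout without unzipping by hand, the entries can be listed with the JDK's own zip support. This is a minimal sketch, not part of the original project: the class name OdsEntryLister and the method listEntries are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not from the original repo): lists what is inside an .ods archive.
final class OdsEntryLister {

    /** Returns the names of all entries inside the given .ods (zip) archive. */
    static List<String> listEntries(Path odsFile) throws IOException {
        List<String> names = new ArrayList<>();
        // An .ods file is an ordinary zip archive, so ZipFile can open it directly.
        try (ZipFile zip = new ZipFile(odsFile.toFile())) {
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
        }
        return names;
    }
}
```

Calling listEntries on a real ods file should show content.xml among the returned names, along with the other XML files mentioned above.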

Opening content.xml in an editor, the meaning of its tags can be roughly worked out as follows:

table:table tag: a table
table:table-column tag: a table column
table:table-row tag: a table row
table:table-cell tag: a table cell
text:p tag: the content of a cell
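The nesting of these tags can be illustrated with a small SAX reader over a hand-written fragment. This is only a sketch: SAMPLE is a simplified stand-in I wrote for a content.xml body (it is not taken from a real file), and OdsTagDemo and readRows are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows how rows, cells and text:p nest inside table:table.
final class OdsTagDemo {

    // Simplified, hand-written stand-in for the table part of content.xml.
    static final String SAMPLE =
        "<table:table xmlns:table=\"urn:oasis:names:tc:opendocument:xmlns:table:1.0\""
      + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\" table:name=\"Sheet1\">"
      + "<table:table-column/>"
      + "<table:table-row>"
      + "<table:table-cell><text:p>id</text:p></table:table-cell>"
      + "<table:table-cell><text:p>name</text:p></table:table-cell>"
      + "</table:table-row>"
      + "<table:table-row>"
      + "<table:table-cell><text:p>1</text:p></table:table-cell>"
      + "<table:table-cell><text:p>Alice</text:p></table:table-cell>"
      + "</table:table-row>"
      + "</table:table>";

    /** Reads the XML into rows of cell strings, one inner list per table:table-row. */
    static List<List<String>> readRows(String xml) throws Exception {
        List<List<String>> rows = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("table:table-row")) {
                    rows.add(new ArrayList<>());   // a new row starts
                } else if (qName.equals("table:table-cell")) {
                    text.setLength(0);             // a new cell starts, reset its text
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);    // text:p content arrives here
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("table:table-cell")) {
                    rows.get(rows.size() - 1).add(text.toString());
                }
            }
        };
        // The default factory is not namespace-aware, so qName is the prefixed tag name.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return rows;
    }
}
```

For the sample above, readRows(OdsTagDemo.SAMPLE) yields two rows: [id, name] and [1, Alice].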

2. Parsing ODS Files

2.1 Parsing with odfTool (ODF Toolkit)

Project: https://github.com/szy350/odftest.git
Maven dependency (pom.xml):

    <dependency>
        <groupId>org.odftoolkit</groupId>
        <artifactId>odfdom-java</artifactId>
        <version>0.8.7</version>
    </dependency>

Code:

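private static void convertOdsToExcelByOdfTool() throws Exception {
    // loadDocument reads the whole spreadsheet into memory as a table model.
    OdfSpreadsheetDocument odfSheet = OdfSpreadsheetDocument.loadDocument("src/main/resources/test.ods");
    List<OdfTable> odfTableList = odfSheet.getTableList();
    Workbook workbook = new XSSFWorkbook();
    for (int i = 0; i < odfTableList.size(); i++) {
        OdfTable table = odfTableList.get(i);
        Sheet sheet = workbook.createSheet(table.getTableName());
        // Copies a fixed window of 2000 rows x 20 columns from each table.
        for (int j = 0; j < 2000; j++) {
            OdfTableRow row = table.getRowByIndex(j);
            Row workBookRow = sheet.createRow(j);
            for (int k = 0; k < 20; k++) {
                Cell workBookCell = workBookRow.createCell(k);
                OdfTableCell cell = row.getCellByIndex(k);
                workBookCell.setCellValue(cell.getStringValue());
            }
        }
    }
    // Write the result; the output path is an assumption, the original snippet stopped before this step.
    try (FileOutputStream fos = new FileOutputStream("src/main/resources/test.xlsx")) {
        workbook.write(fos);
    }
}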

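The code above converts an ods file into an Excel file. In practice, though, I found two problems with this way of parsing:
(1) Parsing is slow and memory use is high. The loadDocument method reads the ods file into memory as a table model. The upside is that later operations on the file become very simple, but it also drives up memory consumption. When I tried an 11 MB ods file with about 160,000 rows, memory use was 2-3 GB, and reading it into memory and converting it to Excel took more than 2 hours.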

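(2) The row count returned by OdfTable::getRowCount and the cell count returned by OdfTableRow::getCellCount are sometimes far larger than expected. The cause lies in how the attributes in content.xml are interpreted.

In my file, for example, the table:number-rows-repeated attribute has the value 1048474, so getRowCount reports more than one million rows. The table:number-columns-repeated attribute has the value 1024, so getCellCount reports more than 1,000 columns. The actual table is nowhere near that size.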
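One way to tell the real size from the inflated one is to count rows while remembering the last row that actually held text. The sketch below is my own illustration using plain SAX, not how the ODF Toolkit works internally; UsedRowCounter and its fields are hypothetical names, and it treats a row as "used" only if some text appears inside it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: separates the logical row count from the rows that hold data.
final class UsedRowCounter extends DefaultHandler {
    long totalRows;              // logical rows, honoring table:number-rows-repeated
    long usedRows;               // rows up to and including the last one that holds text
    private long repeat;         // repeat count of the row currently being read
    private boolean rowHasText;  // whether the current row contained any text

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("table:table-row")) {
            String value = atts.getValue("table:number-rows-repeated");
            repeat = (value == null) ? 1 : Long.parseLong(value);
            rowHasText = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Any non-whitespace text marks the current row as used.
        if (!new String(ch, start, length).trim().isEmpty()) {
            rowHasText = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("table:table-row")) {
            totalRows += repeat;
            if (rowHasText) {
                usedRows = totalRows;
            }
        }
    }

    /** Parses the given table XML and returns the filled-in counter. */
    static UsedRowCounter count(String xml) throws Exception {
        UsedRowCounter counter = new UsedRowCounter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), counter);
        return counter;
    }
}
```

For a sheet with two data rows followed by one empty row repeated 1048474 times, totalRows ends at 1048476 while usedRows stays at 2, which is the gap described above.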

2.2 Parsing the XML with XMLReader to Read the ODS File

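Since the content of an ods file is actually stored in content.xml, the file can be parsed in the following steps:
(1) Unzip the ods file and get its content.xml.
(2) Read content.xml with XMLReader. The advantage of XMLReader here is that it avoids loading a large amount of file content into memory.
(3) Handle each tag of content.xml as it is parsed.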
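The three steps can be sketched with JDK classes only, before bringing in any Excel writing. This is a minimal illustration, not the code from the project: OdsContentStreamer and countRowElements are hypothetical names, and the handler merely counts table:table-row elements to show that the stream is consumed tag by tag.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams content.xml out of the archive straight into a SAX handler.
final class OdsContentStreamer {

    /** Returns the number of table:table-row elements found in content.xml. */
    static int countRowElements(Path odsFile) throws Exception {
        int[] rowElements = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // Step 3: react to each tag as it streams past.
                if (qName.equals("table:table-row")) {
                    rowElements[0]++;
                }
            }
        };
        try (ZipFile zip = new ZipFile(odsFile.toFile())) {
            // Step 1: locate content.xml inside the archive (no extraction to disk).
            ZipEntry entry = zip.getEntry("content.xml");
            if (entry == null) {
                throw new IOException("content.xml not found in " + odsFile);
            }
            // Step 2: feed the entry's stream to the parser; nothing is held as a full document.
            try (InputStream in = zip.getInputStream(entry)) {
                SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
            }
        }
        return rowElements[0];
    }
}
```

Note that this counts row elements, not logical rows: a single element with table:number-rows-repeated still counts once here.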

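Parsing this way solves the high memory use and slow processing seen with odfTool. In my own test, converting 160,000 rows to Excel used about 200 MB of memory and finished in roughly 10 s. You can also write your own handling to avoid the many empty cells that parsing would otherwise produce.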
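As one possible way to skip empty cells while keeping column positions right, the handler can advance a column counter by table:number-columns-repeated and store only cells that have text. The sketch below is an assumption-laden illustration, not the project's code: SparseRowReader, readRow and SAMPLE_ROW are hypothetical, and it ignores other cell elements such as covered cells.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reads one row and keeps only the non-empty cells.
final class SparseRowReader {

    // Hand-written sample row: "a", three empty cells, "b", then 1020 empty cells.
    static final String SAMPLE_ROW =
        "<table:table-row xmlns:table=\"urn:oasis:names:tc:opendocument:xmlns:table:1.0\""
      + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
      + "<table:table-cell><text:p>a</text:p></table:table-cell>"
      + "<table:table-cell table:number-columns-repeated=\"3\"/>"
      + "<table:table-cell><text:p>b</text:p></table:table-cell>"
      + "<table:table-cell table:number-columns-repeated=\"1020\"/>"
      + "</table:table-row>";

    /** Returns zero-based column index mapped to text, for non-empty cells only. */
    static Map<Integer, String> readRow(String rowXml) throws Exception {
        Map<Integer, String> cells = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private int column = 0;   // column index of the cell being read
            private int repeat = 1;   // how many columns the current cell spans by repetition
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals("table:table-cell")) {
                    String value = atts.getValue("table:number-columns-repeated");
                    repeat = (value == null) ? 1 : Integer.parseInt(value);
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("table:table-cell")) {
                    if (text.length() > 0) {
                        // A repeated cell with content repeats that content in each column.
                        for (int i = 0; i < repeat; i++) {
                            cells.put(column + i, text.toString());
                        }
                    }
                    column += repeat;  // empty cells only move the index forward
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(rowXml)), handler);
        return cells;
    }
}
```

For SAMPLE_ROW, readRow returns two entries: column 0 with "a" and column 4 with "b". The 1020 trailing empty cells cost nothing beyond an integer addition.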

The rough code is below. It also converts an ods file into an Excel file. For the full code, see: https://github.com/szy350/odftest.git
Main function

Workbook workbook = new SXSSFWorkbook();
OdsParseHandler handler = new OdsParseHandler(new OdsParseContext(workbook));
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
SAXParser saxParser = saxFactory.newSAXParser();
XMLReader xmlReader = saxParser.getXMLReader();
xmlReader.setContentHandler(handler);

try (ZipFile zipFile = new ZipFile("src/main/resources/test.ods")) {
    // Only content.xml inside the ods file is parsed here.
    ZipEntry entry = zipFile.getEntry(OdsParseConstant.CONTENT_XML);
    if (entry == null) {
        throw new IOException("content.xml not found in the ods file");
    }
    // The stream is closed automatically once parsing finishes or fails.
    try (InputStream inputStream = zipFile.getInputStream(entry)) {
        xmlReader.parse(new InputSource(inputStream));
    }
}

// Closing the output stream (not just flushing it) ensures the xlsx is fully written.
try (FileOutputStream fos = new FileOutputStream("src/main/resources/test.xlsx")) {
    workbook.write(fos);
}

XML handler

public class OdsParseHandler extends DefaultHandler {

    private OdsParseContext context;

    public OdsParseHandler() {
    }

    public OdsParseHandler(OdsParseContext context) {
        this.context = context;
    }

    public Map<String, AbstractTabHandler> tabHandlerMap = new HashMap<String, AbstractTabHandler>() {
        {
            put(OdsParseConstant.TAB_TABLE, new TableTabHandler());
            put(OdsParseConstant.TAB_ROW, new RowTabHandler());
            put(OdsParseConstant.TAB_CELL, new CellTabHandler());
            put(OdsParseConstant.TAB_P, new ValueTabHandler());
        }
    };

    public Map<String, AbstractTabHandler> tabEndHandlerMap = new HashMap<String, AbstractTabHandler>() {
        {
            put(OdsParseConstant.TAB_P, new ValueTabEndHandler());
        }
    };

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        CurrentCell currentCell = context.getCell();
        if (currentCell == null) {
            return;
        }
        Cell cell = currentCell.getCurrentCell();
        String valueType = currentCell.getValueType();
        String exactValue = currentCell.getExactValue();
        if (StringUtils.isNotBlank(valueType) && StringUtils.isNotBlank(exactValue)) {
            // If the cell carries an explicit value, use that value
            cell.setCellValue(exactValue);
            return;
        }
        if (context.getValueBuilder() != null) {
            context.getValueBuilder().append(ch, start, length);
            cell.setCellValue(context.getValueBuilder().toString());
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        AbstractTabHandler handler = tabEndHandlerMap.get(qName);
        if (handler == null) {
            return;
        }
        handler.doHandle(uri, localName, qName, null, context);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        AbstractTabHandler handler = tabHandlerMap.get(qName);
        if (handler == null) {
            return;
        }
        handler.doHandle(uri, localName, qName, attributes, context);
    }
}
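One detail worth knowing when writing a characters callback: a SAX parser is allowed to deliver the text of a single element in several chunks, so each call may carry only part of the value. Appending to a builder, as the handler above does, copes with that. The sketch below shows the same idea in isolation, collecting the text once at the end of text:p. TextCollector is a hypothetical class, not one from the project.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers the full text of each text:p element, chunk by chunk.
final class TextCollector extends DefaultHandler {
    final List<String> values = new ArrayList<>();
    private StringBuilder current;  // non-null only while inside a text:p element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("text:p")) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element; keep appending.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("text:p") && current != null) {
            values.add(current.toString());  // the value is complete only here
            current = null;
        }
    }
}
```

Writing the value once in endElement, rather than on every chunk, is a design choice: it does the same job with fewer writes to the cell, at the cost of needing an end-tag handler.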

XML tag handlers

public class TableTabHandler extends AbstractTabHandler {

    @Override
    public void doHandle(String uri, String localName, String qName, Attributes attributes,
        OdsParseContext odsParseContext) {
        // Handles a table tag: each table becomes one sheet
        String sheetName = Optional.ofNullable(attributes.getValue(OdsParseConstant.SHEET_NAME))
            .orElse(OdsParseConstant.DEFAULT_SHEET_NAME);
        Workbook workbook = odsParseContext.getWorkbook();
        Sheet currentSheet = workbook.createSheet(sheetName);
        odsParseContext.initSheet(currentSheet);
    }
}

public class RowTabHandler extends AbstractTabHandler {

    @Override
    public void doHandle(String uri, String localName, String qName, Attributes attributes,
        OdsParseContext odsParseContext) {
        // Row handler
        Sheet sheet = odsParseContext.getCurrentSheet();
        Row row = sheet.createRow(odsParseContext.getCurrentSheetRowIndex());
        if (odsParseContext.getRow() == null) {
            odsParseContext.initRow(row);
        } else {
            odsParseContext.changeRowIndex(row, 1);
        }
    }
}

public class CellTabHandler extends AbstractTabHandler {
    @Override
    public void doHandle(String uri, String localName, String qName, Attributes attributes,
        OdsParseContext odsParseContext) {
        // Cell handler
        Row row = odsParseContext.getCurrentRow();
        Cell cell = row.createCell(odsParseContext.getCurrentRowCellIndex());
        String cellType = attributes.getValue(OdsParseConstant.VALUE_TYPE);
        String exactValue = attributes.getValue(OdsParseConstant.OFFICE_VALUE);
        odsParseContext.initCell(cell, cellType, exactValue);
    }
}