Apache POI 用eventmodel 读取大文件Excel 【项目已开源】

代码介绍

此代码来源于我开发的一个小工具jar包,项目名称为ExcelToDB,顾名思义,从各种文件系统读取Excel提取数据插入到各种数据库。

特点:轻量级jar包(小jar包灵活性强,而不是kettle那么笨重),配置好后一键运行,可搭载到Job调度中一键调用,甚至可以把它集成到你们公司的管理系统后端🤭。

内存消耗小,速度快,灵活性强,编写代码自动化配置既可实现一键导入,否则需要简单的手工配置。

读取几十万行的.xlsx文件,并将数据转换为标准SQL并执行的一个小jar包,打包后17MB左右,优化后可以更小,可配置日志记录内容,可写到多种目标数据库,从多种文件系统读取Excel,只需要简单的XML配置即可。

转换并执行的速度:
实际测试中,8172行,8MB,每行60个字段的Excel里的真实业务数据,8700ms即可导入到数据库,初版未优化,预计优化后速度可以更快。

https://github.com/airdest/ExcelToPostgreDB

认准v1.0.4版本哦!项目已开源,拿走不谢哈哈哈哈哈哈
在这里插入图片描述

下面就是研发过程需要的知识了,进行了详细整理(●ˇ∀ˇ●),不想浪费时间再造一遍轮子的,可以去我项目里把写好的代码拿来改吧改吧,就是你的了嘿嘿嘿嘿。

1,初识XLSX格式

前言

最近被要求用Java读取大文件Xlsx(超过10MB的Excel文档),并导入数据库,百度搜到了POI。

一开始对POI了解不够深入,于是轻松地用POI的usermodel开发出了第一版,然鹅,usermodel对于内存的要求太高,只要文件超过5MB就开始GC overhead limit exceeded了。

(而且读取6MB的xlsx文件居然能占用3个G的内存??)

于是我决定使用eventmodel模式,然鹅,由于excel里带有公式的缘故,需要开发的代码量非常大,而且在百度很难找到使用event model的POI的案例,百度,谷歌找了一整天都没发现合适的案例,最终决定自己从底层开始,一点一点啃掉这个知识点。顺便整理个文档,输入输出,福曼学习法

也不是没考虑过市面上其它的工具,比如:

  • easyexcel
  • hutool操作poi
  • excel-streaming-reader
  • gridexcel
  • jxl

然鹅这些工具都不能同时满足以下几个特性:

  • 低内存消耗(至少不能导致GC overhead limit exceeded,或者OOM
  • 自动处理复杂公式(最难受的是把2020/12/20,ABC/BCS-DS/BUC这样的日期字符串读取为公式,总之现有的类库对公式的支持都不够多)
  • 读取超大文件(增量数据20MB以上,全量数据将近500MB)

也许easyexcel可以做到,然鹅黑盒子不敢用呀,而且easyexcel抽象级别过高,不够灵活,没有文档,学习成本过高的同时无法满足项目需求。

Excel 之 Xlsx格式详解

以下内容部分来自于以下链接,并对其内容做了部分修改。
https://www.loc.gov/preservation/digital/formats/fdd/fdd000398.shtml

了解一下 XLSX, (Office Open XML, Spreadsheet ML)

Office Open XML简称 OOXML,是xlsx格式的标准,它在ECMA-376中制定,
下面这个链接是ECMA-376的文档下载地址。
http://www.ecma-international.org/publications/standards/Ecma-376.htm
还有关于OOXML,wiki上有详细的介绍
https://en.wikipedia.org/wiki/Office_Open_XML

The Open Office XML-based spreadsheet format using .xlsx as a file extension has been the default format produced for new documents by versions of Microsoft Excel since Excel 2007.

The format was designed to be equivalent to the binary .xls format produced by earlier versions of Microsoft Excel (see MS-XLS).

For convenience, this format description uses XLSX to identify the corresponding format.

The primary content of a XLSX file is marked up in SpreadsheetML, which is specified in parts 1 and 4 of ISO/IEC 29500, Information technology – Document description and processing languages – Office Open XML File Formats (OOXML).

This description focuses on the specification in ISO/IEC 29500:2012 and represents the format variant known as “Transitional.”

Although editions of ISO 29500 were published in 2008, 2011, 2012, and 2016, the specification has had very few changes other than clarifications and corrections to match actual usage in documents since SpreadsheetML was first standardized in ECMA-376, Part 1 in 2006.

This description can be read as applying to all SpreadsheetML versions published by ECMA International and by ISO/IEC through 2016.

See Notes below for more detail on the chronological versions and differences.

The XLSX format uses the SpreadsheetML markup language and schema to represent a spreadsheet “document.”

Conceptually, using the terminology of the Spreadsheet ML specification in ISO/IEC 29500-1, the document comprises one or more worksheets in a workbook.

A worksheet typically consists of a rectangular grid of cells.

Each cell can contain a value or a formula, which will be used to calculate a value, with a cached value usually stored pending the next recalculation.

A single spreadsheet document may serve several purposes: as a container for data values;

as program code (based on the formulas in cells) to perform analyses on those values; and as one or more formatted reports (including charts) of the analyses.

Beyond basics, spreadsheet applications have introduced support for more advanced features over time.

These include mechanisms to extract data dynamically from external sources, to support collaborative work, and to perform an increasing number of functions that would have required a database application in the past, such as sorting and filtering of entries in a table to display a temporary subset.

The markup specification must support both basic and more advanced functionalities in a structure that supports the robust performance expected by users.

看看它的结构:

An XLSX file is packaged using the Open Packaging Conventions (OPC/OOXML_2012, itself based on ZIP_6_2_0).

The package can be explored, by opening with ZIP software, typically by changing the file extension to .zip.

The top level of a minimal package will typically have three folders (_rels, docProps, and xl) and one file part ([Content_Types].xml).

在这里插入图片描述

The xl folder holds the primary content of the document including the file part workbook.xml and a worksheets folder containing a file for each worksheet, as well as other files and folders that support functionality (such as controlling calculation order) and presentation (such as formatting styles for cells) for the spreadsheet.

在这里插入图片描述

Any embedded graphics are also stored in the xl folder as additional parts.

The other folders and parts at the top level of the package support efficient navigation and manipulation of the package:

_rels

_rels is a Relationships folder, containing a single file .rels (which may be hidden from file listings, depending on operating system and settings).

It lists and links to the key parts in the package, using URIs to identify the type of relationship of each key part to the package.

In particular it specifies a relationship to the primary officeDocument (typically named /xl/workbook.xml ) and typically to parts within docProps as core and extended properties.

docProps

docProps is a folder that contains properties for the document as a whole, typically including a set of core properties, a set of extended or application-specific properties, and a thumbnail preview for the document.

[Content_Types].xml

[Content_Types].xml is a file part, a mandatory part in any OPC package, that lists the content types (using MIME Internet Media Types as defined in RFC 6838) for parts within the package.

The standards documents that specify this format run to over six thousand pages.

Useful introductions to the XLSX format can be found at:

Anatomy of a SpreadsheetML File by Daniel Dick of Reuters.

Structure of a SpreadsheetML document from Open XML SDK documentation. Includes diagram showing typical spreadsheet document parts.

下一篇:Apache POI 用eventmodel 读取大文件Excel (2) Xlsx格式内容详解

2,Xlsx格式详解

此节部分内容来自于:
http://officeopenxml.com/anatomyofOOXML-xlsx.php
http://officeopenxml.com/SScontentOverview.php
http://officeopenxml.com/SSstyles.php

Package Structure

A SpreadsheetML or .xlsx file is a zip file (a package) containing a number of “parts” (typically UTF-8 or UTF-16 encoded) or XML files.

The package may also contain other media files such as images.

The structure is organized according to the Open Packaging Conventions as outlined in Part 2 of the OOXML standard ECMA-376.

You can look at the file structure and the files that comprise a SpreadsheetML file by simply unzipping the .xlsx file.
在这里插入图片描述
The number and types of parts will vary based on what is in the spreadsheet, but there will always be a [Content_Types].xml, one or more relationship parts, a workbook part , and at least one worksheet.

The core data of the spreadsheet is contained within the worksheet part(s), discussed in more detail at xsxl Content Overview.

Content Types

Every package must have a [Content_Types].xml, found at the root of the package.

This file contains a list of all of the content types of the parts in the package.

Every part and its type must be listed in [Content_Types].xml.

The following is a content type for the main content part:

<Override PartName="/xl/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>

It’s important to keep this in mind when adding new parts to the package.

Relationships

Every package contains a relationships part that defines the relationships between the other parts and to resources outside of the package.

This separates the relationships from content and makes it easy to change relationships without changing the sources that reference targets.

在这里插入图片描述
For an OOXML package, there is always a relationships part (.rels) within the _rels folder that identifies the starting parts of the package, or the package relationships.

For example, the following defines the identity of the start part for the content:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook.xml"/>.

There are also typically relationships within .rels for app.xml and core.xml.

In addition to the relationships part for the package, each part that is the source of one or more relationships will have its own relationships part.

Each such relationship part is found within a _rels sub-folder of the part and is named by appending '.rels' to the name of the part.

Typically the main content part (workbook.xml) has its own relationships part (workbook.xml.rels).

It will contain relationships to the other parts of the content, such as sheet1.xml, sharedStrings.xml, styles.xml, theme1.xml, as well as the URIs for external links.

在这里插入图片描述
A relationship can be either explicit or implicit.

For an explicit relationship, a resource is referenced using the Id attribute of a <Relationship> element.

That is, the Id in the source maps directly to an Id of a relationship item, with an explicit reference to the target.

For example, a worksheet might contain a hyperlink such as this:

<w:hyperlink ref="A11" r:id="rId4">

The r:id=“rId4” references the following relationship within the relationships part for the worksheet (worksheet1.xml.rels).

<Relationship Id="rId4" Type="http://. . ./hyperlink" Target="http://www.google.com/" TargetMode="External"/>

For an implicit relationship, there is no such direct reference to a <Relationship> Id.

Instead, the reference is understood.

Parts Specific to SpreadsheetML Documents

Below is a list of the possible parts of a SpreadsheetML package that are specific to SpreadsheetML spreadsheets.

Keep in mind that a spreadsheet may only have a few of these parts.

For example, if a spreadsheet has no pivot table, then a pivot table part will not be included in the package.

PartDescription
Calculation ChainWhen the values of cells are calculated from formulas, the order of calculation can be affected by the order in which the values are calculated. This part contains specifies the ordering. A package can contain only one such part.
ChartsheetContains a chart that is stored in its owne sheet. A package can contain multiple such parts, referenced from the workbook part.
CommentsContains the comments for a given worksheet. Since there may be more than one worksheet, there may be more than one comments part.
(可能是每个sheet的名称)
ConnectionsA spreadsheet may have connections to external data sources. This part contains such connections, explaining both how to get such externnal data and how the connection is used within the workbook. There may be only one such part.
Custom PropertyContains user-defined data. There may be zero or more such parts.
Customer XML MappingsContains a schema for an XML file, and information on the behavior to be used when allowing the schema to be mapped into the spreadsheet. There may be only one such part.
DialogsheetContains information about a legacy customer dialog box for a user form. There may be zero or more such parts.
DrawingsContains the presentation and layout information for one or more drawing elements that are present in the worksheet. There should be drawings part for each worksheet that has a drawing.
External Workbook ReferencesContains information about data referenced in other spreadsheet packages. For example, a spreadsheet may have a cell whose value is calculated from data in another spreadsheet. There may be zero or more such parts.
MetadataContains information relating to a cell whose value is related to one or more other cells via Online Analytical Processing (OLAP) technology.
Pivot TableContains the definition of a pivot table. It describes the particulars of the layout of the pivot table, indicating what fields are on the row axis, the column axis, and values areas of the pivot table. And it indicates formatting for the pivot table. There is a pivot table part for each pivot table within the package.
Pivot Table Cache DefinitionThe pivot table cache definition defines each field in the pivot cache records part (i.e., the underlying data), including field name and information about the data contained in the field. There is a pivot table cahe definition part for each pivot table within the package.
Pivot Table Cache RecordsContains the underlying data for a pivot table. There will be zero or one such part for each pivot table in the package.
Query TableContains information that describes how the source table is connected to an external data source and defines the properties that are used when the table is refreshed from the source. There may be one such part for each table.
Shared String TableContains one occurrence of each unique string that occurs in any worksheet within the workbook. There is one such part for a package.
Shared Workbook Revision LogContains information about edits performed on individual cells in the workbook’s worksheets. There should be one such part for each editing session
Shared Workbook User DataContains a list of all users that are sharing the workbook. A package contained zero or one such part.
Single Cell Table DefinitionContains information on how to map non-repeating elements from a custom XML file into cells in a worksheet. There may be one such part per worksheet.
StylesContains all the characteristics for all cells in the workbook, including numeric and text formatting, alignment, font, color, and border. A package contains no more than one such part.
Table DefinitionA spreadsheet can contain regions that have clearly labeled columns, rows, and data regions. Such regions are called tables, and each worksheet can have zero or more tables. For each table in a worksheet there must be a table definitions part containing a description of the table. (The data for the table is stored in the corresponding worksheet part.)
Volatile DependenciesCells can contain real-time data formulas that return values that change over time and that require connectivity to programs outside of the workbook. In cases where those programs are not available, the formulas can use information stored in the volatile dependencies part. A package can have only one such part.
WorkbookContains data and references to all of the worksheets. There must be one and only one workbook part.
WorksheetContains all the data, formulas, and characteristics of a given worksheet. There is one such part for each worksheet in the package.

Parts Shared by Other OOXML Documents

There are a number of part types that may appear in any OOXML package.

Below are some of the more relevant parts for SpreadsheetML documents.

PartDescription
Embedded packageContains a complete package, either internal or external to the referencing package. For example, a SpreadsheetML document might contain a Wordprocessing or PresentationML document.
Extended File Properties (often found at docProps/app.xml)Contains properties specific to an OOXML document–properties such as the template used, the number of pages and words, and the application name and version.
File Properties, CoreCore file properties enable the user to discover and set common properties within a package–properties such as creator name, creation date, title. Dublin Core properties (a set of metadate terms used to describe resources) are used whenever possible.
ImageSpreadsheets often contain images. An image can be stored in a package as a zip item. The item must be identified by an image part relationship and the appropriate content type.
ThemeDrawingML is a shared language across the OOXML document types. It includes a theme part that is included in SpreadsheetML documents when the spreadsheet uses a theme. The theme part contains information about a document’s theme, that is, such information as the color scheme, font and format schemes.

Spreadsheet Content Overview

A SpreadsheetML document is a package containing a number of different parts, mostly XML files.

在这里插入图片描述

However, most of the actual content is found within one or more worksheet parts (one for each worksheet), and one sharedStrings part.

For Microsoft Excel, the content is found within an xl folder, and the worksheets are within a worksheet sub-folder.

在这里插入图片描述

The workbook part contains no actual content but merely some properties of the spreadsheet, with references to the separate worksheet parts which contain the data.

<workbook . . .>
	. . .
	<workbookPr . . ./>
	<sheets>
		<sheet name="sheet1" r:id="rId1">
		<sheet name="sheet2" r:id="rId2">
		<sheet name="sheet3" r:id="rId3">
	</sheets>
	. . .
</workbook>

A worksheet can be either a grid, a chart, or a dialog sheet.

The Grid(主要用到这些)

A grid of cells (or a “cell table”) is the most common type or worksheet.

Cells can contain text, booleans, numbers, dates, and formulas.

It is important to understand from the outset that most text values are not stored within a worksheept part.

In an effort to minimize duplication of values, a cell value that is a string is stored separately in the shareStrings part.(There is an exception to this generalization, however. A cell can be of type inlineStr, in which case the string is stored in the cell itself, within an is element.)

All other cell values–booleans, numbers, dates, and formulas (as well as the values of formulas) are stored within the cell.

Some properties for the sheet are at the beginning of the root <worksheet> element.

The number and sizes of the columns of the grid are defined within a <cols>.

And then the core data of the worksheet follows within the <sheetData> element.

The sheet data is divided into rows (<row>), and within each row are cells (<c>).

Rows are numbered or indexed, beginning with 1, with the r attribute (e.g., row r="1").

Each cell in the row also has a reference attribute which combines the row number with the column to make the reference attribute (e.g., <c r="D3">).

If a cell within a row has no content, then the cell is omitted from the row definition.

<worksheet . . .>
	. . .
	<cols>
		<col min="1" max="1" width="26.140625" customWidth="1"/>
		. . .
	</cols>
	<sheetData>
		<row r="1">
			<c r="A1" s="1" t="s">
				<v>0</v>
				. . .
			</c>
		</row>
		. . .
	</sheetData>
	. . .
	<mergeCells count="1">
		<mergeCell ref="B12:J16"/>
	</mergeCells>
	<pageMargins . . ./>
	<pageSetup . . ./>
	<tableParts ccount="1">
		<tableParts count="1">
	</tablePart r:id="rId2"/>
</worksheet>

The make-up of a cell is important in understanding the overall architecture of the spreadsheet content.

Each cell specifies its type with the t attribute.

Possible values include:

  • b for boolean
  • d for date
  • e for error
  • inlineStr for an inline string (i.e., not stored in the shared strings part, but directly in the cell)
  • n for number
  • s for shared string (so stored in the shared strings part and not in the cell)
  • str for a formula (a string representing the formula)

When a cell is a number, then the value is stored in the <v> element as a child of <c> (the cell element).

<c r="B2" s="5" t="n">
	<v>400</v>
</c>

A date is the same, though the date is stored as a value in the ISO 8601 format.

For inline strings, the value is within an <is> element.

But of course the actual text is further nested within a t since the text can be formatted.

<c r="C4" s="2" t="inlineStr">
	<is>
		<t>my string</t>
	</is>
</c>

For a formula, the formula itself is stored within an f element as a child element of <c>.

Following the formula is the actual calculated value within a <v> element.

<c r="B9" s="3" t="str">
	<f>SUM(B2:B8)</f>
	<v>2105</v>
</c>

When the data type of the cell is s for shared string, then the string is stored in the shared strings part.

However, the cell still contains a value within a <v> element, and that value is the index (zero-based) of the stored string in the shared strings part.

So, for example, in the example below, the actual string is the 9th occurrence of the <si> element within the shared strings part.

<c r="C1" s="4" t="s">
	<v>8</v>
</c>

The shared string part may look like this:
(感觉shared string应该是类似于常量的东西)

<sst xmls="http://schemas.openmlformats.org/spreadsheetml/2006/main" count="19" uniqueCount="13">
	<si><t>Expenses</t></si>
	<si><t>Amount</t></si>
	<si><t>Food</t></si>
	<si><t>Totals</t></si>
	<si><t>Entertainment</t></si>
	<si><t>Car Payment</t></si>
	<si><t>Rent</t></si>
	<si><t>Utilities</t></si>
	<si><t>Insurance</t></si>
	<si><t>Date Paid</t></si>
	. . .
</sst>

Tables

Data on a worksheet can be organized into tables.

Tables help provide structure and formatting to the data by having clearly labeled columns, rows, and data regions.

Rows and columns can be added easily, and filter and sort abilities are automatically added with the drop down arrows.
在这里插入图片描述
The actual table data for the cells is usually stored in the worksheet part as any other data, but the definition of the table is stored in a separate table part which is referenced from the worksheet in which the table appears.

<worksheet . . .>
	. . .
	<sheetData>
		. . .
	</sheetData>
	. . .
	<tableParts count="1">
		<tableParts count="1">
	</tablePart r:id="rId2"/>
</worksheet>

Within the rels part for the worksheet is the following:

<Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/table" Target=".. /tables/table1.xml"/>

The table part is shown below.
在这里插入图片描述
The content of the table part is below.

<table xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" id="1" name="Table1" displayName="Table1" ref="A18:C22" totalRowShown="0">
	<autoFilter ref="A18:C22"/>
	<tableColumns count="3">
		tableColumn id="1" name="Expenses"
		tableColumn id="2" name="Amount"
		tableColumn id="3" name="Date Paid"
	</tableColumns>
	<tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/>
</table>

The ref attribute in red above defines the range of cells within the worksheet that comprise the table.

Pivot Tables

Pivot tables are used to aggregate data, and to summarize and display it in an understandable layout.

For example, suppose I have a large spreasheet which captures the sales of four products in four cities.

I may have a column for the product, date, quantity sold, city, and state.

Each day has an entry for each product in each city, or 16 entries per day.

So even with only 4 products in 4 cities, I could have 5840 rows of data for a year.

What if wanted to determine what city had the most sales in the spring months?

What product was improving?

What city had the greatest sales of red widgets?

Pivot tables help to summarize the data and quickly provide the answers to these questions.

在这里插入图片描述
Pivot tables have a row axis, a column axis, a values area, and a report filter area.

Each table also has a field list from which users can select which fields to include in the pivot table.

Below is a pivot table that summarizes the sales and revenue by product.

在这里插入图片描述
A pivot table is comprised of the following components.

  • There is the underlying data that the pivot table summarizes. This data may be on the same worksheet as the pivot table, on a different worksheet, or it may be from an external source.
  • A cache or copy of that data is created in a part called the pivotCacheRecords part; a cache is needed when, e.g., the external data source is unavailable.
  • There is a pivotCacheDefinition part that defines each field in the pivot table and contains shared items, much like the sharedStrings part contains strings to remove redundancy in a worksheet.
  • The pivotTable part defines the layout of the pivot table itself, specifying what fields are on the row axix, the column axix, the report filter, and the values area.

在这里插入图片描述
The workbook points to and owns the pivotCacheDefinition part.

There is the reference in the workbook to the cache of data for the pivot table, following the references to the worksheets:

<pivotCaches>
	<pivotCache cacheId="13" r:id="rId4"/>
</pivotCaches>

The rels part for the workbook contains that reference:

<Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/pivotCacheDefinition" Target="pivotCache/pivotCacheDefinition1.xml"/>

The pivotCacheDefinition part in turn points to the pivotCacheRecords part.

<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/spreadsheetml/2006/relationships" r:id="rId1" refreshBy="XXXX" refreshedDate="41059.666109143516" createdVersion="1" refreshedVersion="3" recordCount="32" upgradeOnRefresh="1">
. . .
</pivotCacheDefinition>

The rels part for the pivotCacheDefinition contains that reference:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/pivotCacheRecord" Target="pivotCacheRecords1.xml"/>

The pivotCacheDefinition part also references the source data in its

<cacheSource> element:

<cacheSource type="worksheet">
<worksheetSource ref="A1:F33" sheet="Sheet1"/>
</cacheSource>

The worksheet that contains the pivot table references the pivotTable part.

(There may be more than one, since a worksheet can have more than one pivot table.)

The rels part for the worksheet contains that reference:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/pivotTable" Target="../pivotTables/pivotTable1.xml"/>

The pivotTable part references the pivotCacheDefinitions part.

The rels part for the pivotTable part contains that reference:

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/pivotCacheDefinition" Target="../pivotCache/pivotCacheDefinition1.xml"/>
pivotCacheDefinition

Now let’s look briefly at these parts and try to make sense out of them.

Let’s begin with the pivotCacheDefinition.

As mentioned above, it specifies the location of the source data.

It also defines each field (such as data type and formatting to be used) in the source data, including those not used in the pivot table.

(What fields are actually used is specified in the pivot table part.) And it is used as a cache for shared strings, just as the SharedStrings part is used to store strings that appear in worksheets.

The definition of the six fields in our example worksheet is below.

<pivotCacheDefinition . . .>
	<cacheSource type="worksheet">
		>worksheetSource ref="A1:F33" sheet="Sheet1"/>
	</cacheSource>
	<cacheFields count="6">
		<cacheField name="Product" numFmtId="0">
			<sharedItems count="4">
				<s v="Green Widget"/>
				<s v="Red Widget"/>
				<s v="Grey Widget"/>
				<s v="Blue Widget"/>
			</sharedItems>
		</cacheField>
		<cacheField name="Quantity Sold" numFmtId="0">
			<sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="9"/>
		</cacheField>
		<cacheField name="Date" numFmtId="14">
			<sharedItems containsSemiMixedTypes="0" containsNoDate="0" containsDate="1" containsString="0" minDate="2012-03-04T00:00:00" maxDate="2012-03-06T00:00:00 count=2">
				<d v="2012-03-04T00:00:00"/>
				<d v="2012-03-05T00:00:00"/>
			</sharedItems>
		</cacheField>
		<cacheField name="Revenue" numFmtId="165">
			<sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" minValue="1" maxValue="9"/>
		</cacheField>
		<cacheField name="City" numFmtId="0">
			<sharedItems count="4">
				<s v="Rochester"/>
				<s v="Albany"/>
				<s v="Pittsburgh"/>
				<s v="Philadelphia"/>
			</sharedItems>
		</cacheField>
		<cacheField name="State" numFmtId="0">
			<sharedItems count=2">
				<s v="NY"/>
				<s v="PA"/>
			</sharedItems>
		</cacheField>
	</cacheFields>
</pivotCacheDefinition>

The first field defined above is the product field.

It consists of shared string values.

If the field does not have shared string values (such as the second field defined above–the Quantity Sold field), then the values are stored directly in the pivotCacheRecords part.

pivotCacheRecords

Let’s look at the pivotCacheRecords part to see how the field definitions relate to the cached data.

Below are the first two rows of data in the cache.

<pivotCacheRecords . . .>
	<r>
		<x v="0"/>
		<n v="2"/>
		<x v="0"/>
		<n v="2"/>
		<x v="0"/>
		<x v="0"/>
	</r>
	<r>
		<x v="1"/>
		<n v="3"/>
		<x v="0"/>
		<n v="3"/>
		<x v="0"/>
		<x v="0"/>
	</r>
	. . .
</pivotCacheRecords>

This corresponds to the data from the worksheet shown below.

在这里插入图片描述
Note first that each record (<r>) of the cached data has the same number of values as are defined in the pivotCacheDefinition–in our case, six.

Within each record are the following possible elements:

<x> - indicating the index value referencing an item for the field as defined in the pivotCacheDefinition
<s> - indicating a string value is being expressed inline in the record
<n> - indicating a numeric value is being expressed inline in the record

Looking at the two sample records from the pivotCacheRecords above, we know from the pivotCacheDefinition that the six values are product, quantity, date, revenue, city, and state in that order.

The Product field in the first record is <x v="0"/>, so the value (0) is an index into the items listed in the product field.

The first one listed (index 0) is Green Widget.

The second or quantity field value is <n v="2"/>, so the value (2) is a numeric value expressed inline.

The third or date field value is <x v="0"/>, so the value (0) is an index into the items listed in the date field (2012-03-04T00:00:00 or 3/4). Etc.

pivotTable

Now let’s look at the pivotTable part.

The root element is the <pivotTableDefinition> element.

There are several components within this.

First, the location of the pivot table on the worksheet is specified.

The location is straightforward.

Note that both the first header and data columns are specified.

<pivotTableDefinition . . .>
	<location ref="B37:G43 firstHeaderRow="1" firstDataRow="2" firstDataCol="4"/>
</cacheSource>

The order of items for fields and other field information for each field is then specified by <pivotField> elements within a <pivotFields>.

<pivotFields count="6">
	<pivotField axis="axisRow" compact="0" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1">
	<items count="5">
		<item sd="0" x="3"/>
		<item sd="0" x="0"/>
		<item sd="0" x="2"/>
		<item sd="0" x="1"/>
		<item t="default"/>
	</items>
	</pivotField>
	<pivotField dataField="1" compact="0" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1"/>
	<pivotField axis="axisRow" compact="0" numFmtId="14" outline="0" subtotalTop="0" showAll="0" includeNewItemsFilter="1">
		<items count="3">
			<item sd="0" x="0"/>
			<item sd="0" x="1"/>
			<item t="default"/>
		</items>
	</pivotField>
	. . .
</pivotFields count="6">

From the pivotCacheDefinition we know that the first <pivotField> above is the product. It has 5 items listed.

The first one is <item sd="0" x="3"/>.

The sd attribute indicates whether the item is hidden.

A value of 0 means the item is not hidden.

The x attribute is the index for the items in the <cacheField> for the product in the pivotCacheDefinition.

The <cacheField> is shown below.

Note that the value of the item at index 3 is Blue Widget, so Blue Widget should appear first in the pivot table if and where the product field is shown.

<cacheField name="Product" numFmtId="0">
	<sharedItems count="4">
		<s v="Green Widget"/>
		<s v="Red Widget"/>
		<s v="Grey Widget"/>
		<s v="Blue Widget"/>
	</sharedItems>
</cacheField>

The second item has an index of 0, or “Green Widget”, the third is 2 or “Grey Widget,” and the fourth is 1 or “Red Widget.”

Note that <item t="default"/> indicates a subtotal or total.

Following the <pivotFields> collection is the <rowFields> collection.

This collection specifies what fields are actually in the pivot table on the row axis, and in what order.

In our example, when we fully expand the first row, we see that a row consists of first a product, then a city, followed by a state, and then a date.

These are the row fields.

在这里插入图片描述
Following the index order in the <pivotFields> collection, this is 0, 4, 5, 2. The corresponding <rowFields> looks like this.

<rowFields count="4">
	<field x="0"/>
	<field x="4"/>
	<field x="5"/>
	<field x="2"/>
</rowFields>

After the <rowFields> collection is the <rowItems> collection.

This is a collection of all the values in the row axis.

There is an <i> element for each row in the pivot table.

And for each <i> there are as many <x> elements as there are item values in the row.

The v attribute is a zero-based index referencing a <pivotField> item value.

If there is no v then the value is assumed to be 0.

The value of grand for t indicates a grand total as the last row item value.

<rowItems count="5">
	<i>
		<x/>
	</i>
	<i>
		<x v="1"/>
	</i>
	<i>
		<x v="2"/>
	</i>
	<i>
		<x v="3"/>
	</i>
	<i t="grand">
		<x/>
	</i>
</rowItems>

The <colFields> collection follows, indicating which fields are on the column axis of the pivot table.

Here again <x> is an index into the <pivotField> collection.

<colFields count="1">
	<field x="-2"/>
</colFields>

The <colItems> collection follows, listing all of the values on the column axis.

<colItems count="2">
	<i>
		<x/>
	</i>
	<i i="1">
		<x v="1"/>
	</i>
</colItems>

There may also be a <pageFields> collection which describes which fields are found in the report filter area.

Finally, there is a <dataFields> collection, which describes what fields are found in the values area of the pivot table.

In our example, there are two fields in the values area – sum of quantity sold and sum of revenue.

Below is the collection.

The fld attribute is the index of the field being summarized.

<dataFields count="2">
	<dataField name="Sum of Quantity Sold" fld="1" baseField="0" baseItem="0"/>
	<dataField name="Sum of Revenue" fld="3" baseField="0" baseItem="0"/>
</dataFields>

Spreadsheet Styles(也会用到一部分)

Spreadsheets can be styled using styles, themes, and direct formatting.

There are cell styles, table styles, and pivot styles.

However, unlike in WordprocessingML, styling XML never appears with the content in a worksheet.

The formatting is always stored separately within a single styles part for the workbook.

There is also a single theme part for the entire workbook.

A cell style can specify number format, cell alignment, font information, cell borders, and background/foreground fills.

Table styles specify formatting for regions of a table, such as, e.g., headers are bold or a gray fill should be applied to alternating rows.

Pivot table styles specify formatting for regions of a pivot table, such as colors for totals or for the row axis.

Themese define a set of colors, font information, and effects on shapes.

A style or formatting element can define a color, font, or effect by referencing a theme, but of course that format may change if the thme is changed.

Text-Level Formatting

Before getting to the styles applied to a worksheet, however, let’s first cover formatting at the text level, that is, not formatting applied to the entire cell, but formatting that might change from word to word, such as different colors or effects.

For example, see cell A13 below, with blue color for the first word and orange underline for the second.

Obviously this cannot be accomplished with a cell style.

This formatting is done within the shared string part where the text of the cell is stored.

在这里插入图片描述
Let’s look at the XML for the first cell in row 13 of the worksheet part.

We know from the type attribute for the cell t="s" that the text is stored in the shared strings part, and from <v="25"/> we know that string is the 26th string or . (Remember that it is a zero-based index.)

<row r="13">
	<c r="A13" t="s">
		<v>25</v>
	</c>
	. . .
</row>

The XML for the string is below.

Note that the formatting is applied directly within the string item, just as direct formatting is applied to text runs (<r>) using run properties (<rPr>) within wordprocessingML (docx) documents.

<si>
	<r>
		<rPr>
			<sz val="11"/>
			<color theme="4"/>
			<rFont val="Calibri"/>
			<family val="2"/>
			<scheme val="minor"/>
		</rPr>
		<t>Blue</t>
	</r>
	<r>
		<rPr>
			<sz val="11"/>
			<color theme="1"/>
			<rFont val="Calibri"/>
			<family val="2"/>
			<scheme val="minor"/>
		</rPr>
		<t xml:space="preserve"> </t>
	</r>
	<r>
		<rPr>
			<u/>
			<sz val="11"/>
			<color theme="9"/>
			<rFont val="Calibri"/>
			<family val="2"/>
			<scheme val="minor"/>
		</rPr>
		<t>Widget</t>
	</r>
</si>

Cell-Level Formatting

Now let’s return to cell styles.

Styles within spreadsheetML are implemented to minimize repetition, and this is done with collections.

Within the styles part there are the collections shown below.

<stylesheet xmls="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
	<numFmts/>
	<fonts/>
	<fills/>
	<borders/>
	<cellStyleXfs/>
	<cellXfs/>
	<cellStyles/>
	<dxfs/>
	<tableStyles/>
</stylesheet>

Most of the collections above (except for <dxfs> and <tableStyles>) relate to cells.

And the first four–numFmts, fonts, fills, and borders–contain all of the possible charateristics for every cell in the workbook.

Each may have many elements, each one defining the characteristics for a set of cells that have the same such characteristics.

For example, below is a sample of the <fills> for a workbook.

Every cell in the workbook will use one of these fill definitions.

<fills count="5">
	<fill>
		<patternFill patternType="none"/>
	</fill>
	<fill>
		<patternFill patternType="gray125"/>
	</fill>
	<fill>
		<patternFill patternType="solid">
			<fgColor rgb="FFFFEB9C"/>
		</patternFill/>
	</fill>
	<fill>
		<patternFill patternType="solid">
			<fgColor theme="5" tint="0.39997558519241921"/>
			<bgColor indexed="65"/>
		</patternFill/>
	</fill>
	<fill>
		<patternFill patternType="solid">
			<fgColor rgb="FFC6EFCE"/>
		</patternFill/>
	</fill>
</fills>

The <fill> for a particular cell is specified with a zero-based index into the above fills collection.

The same is true of the font for the cell, the number format, and the borders.

So the formatting for a cell can be specified with a list or collection of indices into these four collections.

And in fact, that is what the <cellXfs> is.

It contains a collection of groups of indices, one group for every combination of cell formatting characteristics found in the workbook.

Below is one such grouping.

<cellXfs count="14">
	<xf numFmtId="0" fontId="0" fillId="0" borderId="0" xfId="0"/>
	. . .
</cellXfs>

Every cell will have a reference to one <xf> in the <cellXfs> collection.

This is direct formatting for the cell.

To apply a style to the cell, the <xf> references the style using the xfId attribute.

The xfId attribute is an index into the <cellStyleXFs> collection, which collects the cell styles available to the user.

The <cellStyleXFs> contains one <xf> for each style.

Each such <xf> is tied to its name via an index (in its xfId attribute) from the <cellStyles> collection.

Let’s try and tie it all together by looking at a sample.

Consider row 10 in the sample below.

在这里插入图片描述
The first cell A10 has a cell style applied.

The XML for the cell in the worksheet is below.

<row r="10">
	<c r="A10" s="12" t="s">
		<v>6</v>
	</c>
	. . .
</row>

From the attribute s="12" we know that the cell’s formatting is stored at the 13th (zero-based index) <xf> within the <cellXfs> collection in the styles part.

The 13th <xf> is below.

<xf numFmtId="0" fontId="8" fillId="4" borderId="0" xfId="3"/>

So for this cell, the number format is the first (index value is 0) within the <numFmts> collection.

The cell uses the font format found within the 9th <font> in the <fonts> collection, the 5th <fill> within the <fills> collection (which references a theme for the green), and the first <border> within the <borders> collection.

This cell also applies a style (xfId="3")–the 4th <xf> within the <cellStyleXfs> collection.

The style is shown below.

<xf numFmtId="0" fontId="8" fillId="4" borderId="0" applyNumberFormat="0" applyBorder="0" applyAlignment="0" applyProtection="0"/>

The formatting of the style is same as the direct formatting, and the attributes applyNumberFormat, applyBorder, applyAlignment, and applyProtection, each with values of 0, tell us not to apply the corresponding values of the style but instead apply the values for the direct formatting.

In this case they are the same, so there is no difference anyway.

Table-Level Formatting

A table applies a table style by specifying a <tableStyleInfo> element within the table definition in the tables part.

For example, the following sample table definition specifies the TableStyleMedium9 style.

Note that it is specified by name.

Note also that not only is the style specified, but the specification also tells us which aspects of the style are turned on (e.g., showRowStripes=“1”) and which are turned off (e.g., showLastColumn=“0”).

Each table style is made up of a collection of formatting definitions, each of which corresponds to a particular region of the table–e.g., whole table, first column stripe, first row stripe, first column, header column, first header cell, etc.

Each of these formatting definitions can be turned on or off.

<table xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" id="1" name="Table1" displayName="Table1" ref="A18:C22" totalRowShown="0">
	<autoFilter ref="A18:C22"/>
	<tableColumns count="3">
		tableColumn id="1" name="Expenses"
		tableColumn id="2" name="Amount"
		tableColumn id="3" name="Date Paid"
	</tableColumns>
	<tableStyleInfo name="TableStyleMedium9" showFirstColumn="0" showLastColumn="0" showRowStripes="1" showColumnStripes="0"/>
</table>

Annex G of the ECMA-376, 3rd Edition (June, 2011) OOXML specification defines built-in styles for cells, tables, and pivot tables, and style TableStyleMedium9 is among the built-in table styles.

The built-in table and pivot table styles are not stored in the styles part–only custom styles are.

Below is a custom style defined in the styles part, based on the TableStyleMedium9 style.

<tableStyles count="1" defaultTableStyle="TableStyleMedium9" defaultPivotStyle="PivotStyleLight16">
	<tableStyle name="My Custom Table Style" pivot="0" count="3">
		<tableStyleElement type="wholeTable" dxfId="2">
		<tableStyleElement type="headerRow" dxfId="1">
		<tableStyleElement type="firstColumn" dxfId="0">
	</tableStyle>
</tableStyles>

The style looks like this:
在这里插入图片描述
The style definition above uses differential formatting records (<dxf> elements referenced from the dxfId attribute), which enables subsets of formatting to be specified instead of specifying all formatting.

Looking at the sample above, we see that the default style is the TableStyleMedium9 style.

From that we are altering three aspects – the wholeTable, headerRow, and firstColumn.

Each of these elements references (again using a zero-based index) the <dxfs> collection within the styles part.

For example, the <headerRow> element references the second <dxf> (dxfId=“1”).

It applies bold and a fill background color to the default table style defaultTableStyle="TableStyleMedium9".

<dxfs count="3">
	<dxf>
		<font>
			<b val="0"/>
			<i/>
			<strike/>
		</font>
		<fill>
			<patternFill>
				<bgColor theme="2" tint="-0.2499465926081701"/>
			</patternFill>
		</fill>
	</dxf>

	<dxf>
		<font>
			<b/>
			<i val="0"/>
			<strike val="0"/>
		</font>
		<fill>
			<patternFill>
				<bgColor theme="8" tint="0.59996337778862885"/>
			</patternFill>
		</fill>
	</dxf>

	<dxf>
		<fill>
			<patternFill>
				<bgColor theme="5" tint="0.59996337778862885"/>
			</patternFill>
		</fill>
		<border>
			<left style="hair">
				<color auto="1"/>
			</left>
			<right style="hair">
				<color auto="1"/>
			</right>
			<top style="hair">
				<color auto="1"/>
			</top>
			<bottom style="hair">
				<color auto="1"/>
			</bottom>
			<vertical style="hair">
				<color auto="1"/>
			</vertical>
			<horizontal style="hair">
				<color auto="1"/>
			</horizontal>
		</border>
	</dxf>
</dxfs>

Conditional Formatting

Conditional formatting is a format such as cell shading or font color that a spreadsheet can apply automatically to cells if a specified condition is true.

For example, you can specify that a cell fill color should be red if the value in the cell is above 50.

It can be a very effective tool for visually highlighting important aspects of the data in a worksheet.

Conditional formatting rules are stored in the worksheet part, within a <conditionalFormatting> element after the <sheetData> element.

The range of cells to which the formatting applies is specified with the sqref attribute.

Each condition is within a <cfRule> element.

Multiple rules can be set, each with a different priority.

There are several different types of rules to specify different conditions.

For example, type="cellIs" will determine a cell format based on whether a cell value is greater than or less than a specified value, or between two values.

A type="dataBar" will display a bar of varying length within a cell based on the value in the cell.

Theses types are set with the type attribute on <cfRule>. A type="iconSet" will display an icon in the cell based upon the value in a cell.

Below is a sample table which applies two conditions to the cells B2:B7 - one which applies a pink color fill if the value of the cell is greater than 500 and the other which applies a green fill if the value is less than 300.

在这里插入图片描述
The XML for the conditions is below.

<conditionalFormatting sqref="B2:B7">
	<cfRule type="cellIs" dxfId="0" priority="2" operator="greaterThan">
		<formula>500</formula>
	</cfRule>
	<cfRule type="cellIs" dxfId="1" priority="1" operator="lessThan">
		<formula>300</formula>
	</cfRule>
</conditionalFormatting>

一口气看到这里,说明你对Xlsx的格式已经有了基本的了解,加油,后面还有呐!

3,Xlsx格式内容补充

在上一节中,已经了解了OOXML大部分的构成,然鹅在实际操作中,遇到了t="shared"这样的共享公式,是上一篇中没有出现过的内容,stackOverflow上查到是共享公式shared formula但是更详细的内容就没了。

在百度,谷歌,stackOverflow上搜到的描述,讲解OOXML的信息很少,导致最后不得不追根溯源,去wiki上看了一下,发现OOXML是ECMA-376制定的。

.xlsx格式出现的时间是2007年,且它遵守了这个标准,那这个标准诞生的时间必须在2007年以及之前,经过查询发现ECMA-376 1st 第一版就是2006年发布的,所以把它下载下来,看看内容。

经过查询发现,关于.xlsx的XML的cell里的字段定义是在SpreadsheetML里定义的,以下是ECMA原文内容:

8.4 SpreadsheetML
This subclause introduces the overall form of a SpreadsheetML package, and identifies some of its main element types.
(See Part 3 for a more detailed introduction.) A SpreadsheetML package
has a relationship of type officeDocument, which specifies the
location of the main part in the package. For a SpreadsheetML
document, that part contains the workbook definition.

这里它说详细的内容在第三部分,于是我找到第三部分,发现果然是有的:
在这里插入图片描述

在这里插入图片描述

经过查找,补充一部分需要使用的内容,这些都来自于ECMA-376 1st part3

3.2.9.2.1 Shared Formulas

<row r="7" spans="4:8">
  <c r="H7" s="1">
    <f t="shared" ref="H7:H11" ce="1" si="0">SUM(E7:G7)</f>
    <v>1.0246225028914113</v>
  </c>
</row>
<row r="8" spans="4:8">
  <c r="H8" s="1">
    <f t="shared" ce="1" si="0">SUM(E8:G8)</f>
    <v>0.9063376048733931</v>
  </c>
</row>

Just as strings in cells can be extremely pervasive and redundant in a sheet (and therefore must be optimized), formulas are also extremely pervasive in a sheet, and often can be optimized.

Consider the table in the above example, where column H contains a formula that sums the numbers in columns E through G, for each row.

The only difference between the formulas in H6:H12 is that the reference increases by 1 row from one row to the next.

Therefore, an optimization is created where only the formula in H6 needs to be written out, with some additional information indicating how far to propagate the formula once loaded.

This enables the loading application to load and parse only the first of the shared formulas, and then more quickly apply the necessary transforms to produce the additional related formulas in subsequent cells.

Note that while formulas can be shared, it is desirable to enable easy access to the contents of a cell.

Therefore, it is allowed that all formulas may be written out, but only the primary formula in a shared formula need be loaded and parsed.

3.2.9 Cell

<c r="B3">
  <f>B2+1</f>
  <v>2</v>
</c>

The cell itself is expressed by the c collection.

Each cell indicates it’s location in the grid using A1-style reference notation.

A cell can also indicate a style identifier (attribute s) and a data type (attribute t).

The cell types include string, number, and Boolean.

In order to optimize load/save operations, default data values are not written out.

3.2.9.1 Cell Values

Cells contain values, whether the values were directly typed in (e.g., cell A2 in our example has the value External Link:) or are the result of a calculation (e.g., cell B3 in our example has the formula B2+1).

String values in a cell are not stored in the cell table unless they are the result of a calculation.

Therefore, instead of seeing External Link: as the content of the cell’s v node, instead you see a zero-based index into the shared string table where that string is stored uniquely.

This is done to optimize load/save performance and to reduce duplication of information.

To determine whether the 0 in v is a number or an index to a string, the cell’s data type must be examined.

When the data type indicates string, then it is an index and not a numeric value.

在ECMA part4中,终于找到了关于公式的定义

在这里插入图片描述

还有普通类型的定义
在这里插入图片描述

当前就补充到这里,如果后续有需要再返回来补充

4,代码简单讲解

前提条件

使用场景:
1,大文件Excel,比如100MB,几十万行
2,仅限于.xlsx格式的Excel
3,要求低内存消耗

<dependency>
	<groupId>org.apache.poi</groupId>
	<artifactId>poi-ooxml</artifactId>
	<version>4.1.2</version>
</dependency>

第一步,参考一下官方文档的示例代码

相信在看到这篇博客之前,你已经了解了POI官网对于Event API的代码示例:
https://poi.apache.org/components/spreadsheet/how-to.html#xssf_sax_api

import java.io.InputStream;
import java.util.Iterator;
import org.apache.poi.util.XMLHelper;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.ParserConfigurationException;
public class ExampleEventUserModel {
    public void processOneSheet(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();
        XMLReader parser = fetchSheetParser(sst);
        // To look up the Sheet Name / Sheet Order / rID,
        //  you need to process the core Workbook stream.
        // Normally it's of the form rId# or rSheet#
        InputStream sheet2 = r.getSheet("rId2");
        InputSource sheetSource = new InputSource(sheet2);
        parser.parse(sheetSource);
        sheet2.close();
    }
    public void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = OPCPackage.open(filename);
        XSSFReader r = new XSSFReader( pkg );
        SharedStringsTable sst = r.getSharedStringsTable();
        XMLReader parser = fetchSheetParser(sst);
        Iterator<InputStream> sheets = r.getSheetsData();
        while(sheets.hasNext()) {
            System.out.println("Processing new sheet:\n");
            InputStream sheet = sheets.next();
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource);
            sheet.close();
            System.out.println("");
        }
    }
    public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException, ParserConfigurationException {
        XMLReader parser = XMLHelper.newXMLReader();
        ContentHandler handler = new SheetHandler(sst);
        parser.setContentHandler(handler);
        return parser;
    }
    /**
     * See org.xml.sax.helpers.DefaultHandler javadocs
     */
    private static class SheetHandler extends DefaultHandler {
        private SharedStringsTable sst;
        private String lastContents;
        private boolean nextIsString;
        private SheetHandler(SharedStringsTable sst) {
            this.sst = sst;
        }
        public void startElement(String uri, String localName, String name,
                                 Attributes attributes) throws SAXException {
            // c => cell
            if(name.equals("c")) {
                // Print the cell reference
                System.out.print(attributes.getValue("r") + " - ");
                // Figure out if the value is an index in the SST
                String cellType = attributes.getValue("t");
                if(cellType != null && cellType.equals("s")) {
                    nextIsString = true;
                } else {
                    nextIsString = false;
                }
            }
            // Clear contents cache
            lastContents = "";
        }
        public void endElement(String uri, String localName, String name)
                throws SAXException {
            // Process the last contents as required.
            // Do now, as characters() may be called more than once
            if(nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = sst.getItemAt(idx).getString();
                nextIsString = false;
            }
            // v => contents of a cell
            // Output after we've seen the string contents
            if(name.equals("v")) {
                System.out.println(lastContents);
            }
        }
        public void characters(char[] ch, int start, int length) {
            lastContents += new String(ch, start, length);
        }
    }
    public static void main(String[] args) throws Exception {
        ExampleEventUserModel example = new ExampleEventUserModel();
        example.processOneSheet(args[0]);
        example.processAllSheets(args[0]);
    }
}

第二步,改吧改吧变成自己的代码

如下代码处理每一个Element的数据,并将其拼接成为SQL
很抱歉,暂时为了保密,只能展示部分代码,希望这些代码能对你有所帮助。

SheetHandler

package com.airde.handler;
/**
 * Author : Airde
 * Date: 2020/12/7 14:36
 */

import com.airde.dto.Constants;
import com.airde.dto.DBInfo;
import com.airde.pojo.Column;
import com.airde.pojo.JDBC;
import com.airde.util.JdbcUtil;
import com.airde.util.SqlUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import java.util.*;

/**
 * @ClassName XlsxHandler
 * @Description 采用SAX方式依次处理每一个element
 * 使用map限定元素数量,避免VALUES长度不一致的问题。
 * @Author airde
 * @Date 2020/12/7 14:36
 * @Version 1.0
 **/
@Slf4j
public class SheetHandler extends DefaultHandler {

    //共享区
    private SharedStringsTable sharedStringsTable;
    private String nowCellColumnNo;

    //内容区
    private String lastContents;
    private boolean nextIsSharedString;


    //拼接SQL
    private StringBuilder insertSql;

    private String nowRowNum;


    //通用属性区
    private JDBC jdbc;
    private DBInfo dbInfo;
    private String insertSqlHead;


    //计数器
    private int valuesNum;
    private int commitNum;


    private int valuesNumCount = 0;
    private int commitNumCount = 0;


    private int dataStartNum;
    private int dataStopNum;
    private int skipNumCount = 0;

    //拼接区

    private List<String> columnNoList = new ArrayList<>();
    private Map<String, String> valuesMap = new HashMap<>();

    //对ColumnList的每一个Column,都产生对应的值
    //在每一个columnNo都生成对应的值之后,检查每一对值是否为空
    //如果出现为空的值,那就填补为空字符串,最后再进行拼接(endEle遇到row进行整理拼接)

    public void initValuesMap() {
        for (String columnNo : columnNoList) {
            valuesMap.put(columnNo, "");
        }
    }


    @Override
    public void startDocument() throws SAXException {
        super.startDocument();
        //初始化
        this.insertSqlHead = SqlUtil.generateInsertSqlHead(dbInfo);

        insertSql = new StringBuilder("");
        insertSql.append(insertSqlHead);

        initValuesMap();

        JdbcUtil.openFullConnection(jdbc);

    }

    @Override
    public void endDocument() throws SAXException {
        //防止没到指定次数的剩余记录未提交,只要不为Null或空字符串,就执行提交
        if (insertSql.indexOf(",") != -1) {
            if (insertSql != null && !"".equals(insertSql.toString())) {
                //最后一次拼接与执行
                insertSql.delete(insertSql.lastIndexOf(","), insertSql.length());
                insertSql.append(";");
                //执行
                try {
                    JdbcUtil.executeByFullConnection(insertSql.toString());
                } catch (Exception e) {
                    e.printStackTrace();
                    log.error("execute sql error!" + nowRowNum);
                }

            }
        }


        //用完连接就关上
        JdbcUtil.commitFullConnection();

        JdbcUtil.closeFullConnection();
        super.endDocument();
    }

    public SheetHandler(SharedStringsTable sst, DBInfo dbInfo, JDBC jdbc, List<Column> columnList, String dataStartNum, String dataStopNum) {
        //默认一条insert 包含50个记录
        this.valuesNum = 50;

        //默认每1000个insert commit一次
        this.commitNum = 1000;

        //默认起始数据行
        this.dataStartNum = Integer.parseInt(dataStartNum) - 1;
        this.dataStopNum = Integer.parseInt(dataStopNum) - 2;


        for (Column column : columnList) {
            columnNoList.add(column.getNo());
        }

        this.dbInfo = dbInfo;
        this.jdbc = jdbc;
        this.sharedStringsTable = sst;
    }


    /**
     * 遇到一个元素之前,进行什么动作
     *
     * @param uri        XML命名空间标识符
     * @param localName  不带前缀的元素名
     * @param qName      当前Element的元素名
     * @param attributes Cell里对应的属性列表
     * @throws SAXException SAX解析异常
     */
    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {

        if ("row".equals(qName)) {
            nowRowNum = attributes.getValue("r");
            //新的row对应新的一条插入语句

        } else if ("c".equals(qName)) {
            // c => cell
            String cellType = attributes.getValue("t");


            nowCellColumnNo = attributes.getValue("r");
            //去掉结尾的数字
            nowCellColumnNo = nowCellColumnNo.substring(0, nowCellColumnNo.lastIndexOf(nowRowNum));

            //下一个元素是否为sharedString?
            nextIsSharedString = "s".equals(cellType);

        }
        // Clear contents cache
        lastContents = "";
    }

    /**
     * 遇到一个元素之后,进行什么动作
     *
     * @param uri       XML命名空间标识符
     * @param localName 不带前缀的元素名
     * @param qName     当前Element的元素名
     * @throws SAXException SAX解析异常
     */
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {



        if ("row".equals(qName)) {
            //如果当前已积累skipNumCount不等于skipNum,说明当前读取的行还不是数据行,那就不拼接
            if (skipNumCount < dataStartNum || skipNumCount >= dataStopNum) {
                //log.error("skip this row!");
                //每个row完了,就+1
                skipNumCount += 1;

            }else {
                //map形式拼接
                StringBuilder singleMapSql = new StringBuilder("");
                singleMapSql.append("(");
                for (String columnNo : columnNoList) {
                    if ("-1".equals(columnNo)){
                        singleMapSql.append("'',");
                    }else {
                        singleMapSql.append("'");
                        singleMapSql.append(valuesMap.get(columnNo));
                        singleMapSql.append("',");
                    }

                }

                //删除多出来的逗号,闭合
                singleMapSql.delete(singleMapSql.lastIndexOf(","), singleMapSql.length());
                singleMapSql.append("),");


                //每处理完一行,就要初始化map
                initValuesMap();

                //拼接完毕,就把它摁到InsertSql里
                insertSql.append(singleMapSql);


                //每一行结束,都得把值+1
                commitNumCount += 1;
                valuesNumCount += 1;
            }

            //如果达到了对应执行的阈值,就进行执行处理
            if (valuesNumCount == valuesNum) {
                insertSql.delete(insertSql.lastIndexOf(","), insertSql.length());
                insertSql.append(";");


                //执行
                try {
                    //log.info(insertSql.toString());
                    JdbcUtil.executeByFullConnection(insertSql.toString());
                } catch (Exception e) {
                    e.printStackTrace();
                    log.error("execute sql error" + nowRowNum);
                }


                //执行完初始化insertSql
                insertSql = new StringBuilder("");
                insertSql.append(insertSqlHead);

                //初始化valuesNumCount
                valuesNumCount = 0;
            }
            //如果达到了对应commit的阈值,就进行commit处理
            if (commitNumCount == commitNum) {
                JdbcUtil.commitFullConnection();
                //执行完初始化commitNumCount
                commitNumCount = 0;
            }

        } else {
            // TODO 这里否则之后默认为cell的情况?

            //处理SharedString的情况,可能也可以换成
            //if(Constants.CELL_TYPE_SHAREDSTRING.equals(nowCellType)){
            //}
            if (nextIsSharedString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = sharedStringsTable.getItemAt(idx).getString();
                nextIsSharedString = false;
            }

            //如果nowCellNo不在columnNoList里,则不拼接这个属性,否则才拼接这个属性

            boolean ifInColumnNoList = false;
            for (String columnNo : columnNoList) {
                if (nowCellColumnNo != null && nowCellColumnNo.equals(columnNo)) {
                    ifInColumnNoList = true;
                    break;
                }
            }
            // v => contents of a cell
            // Output after we've seen the string contents
            // 拼接SQL
            if ("v".equals(qName)) {
                if (ifInColumnNoList) {
                    valuesMap.put(nowCellColumnNo, lastContents);
                    //同时检查put进去的是否为Null,不为null就行为null就替换为""空字符串
                    if (valuesMap.get(nowCellColumnNo) == null) {
                        valuesMap.put(nowCellColumnNo, "");
                    }
                }

            }


        }


    }

    @Override
    public void characters(char[] ch, int start, int length) {
        lastContents += new String(ch, start, length);
    }



}

SheetHandlerWrapper

package com.airde.handler;
/**
 * Author : Airde
 * Date: 2020/12/9 9:52
 */

import com.airde.pojo.Task;
import com.airde.util.SqlUtil;
import lombok.extern.slf4j.Slf4j;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

import javax.xml.parsers.ParserConfigurationException;
import java.io.InputStream;

/**
 * @ClassName SheetHandlerWrapper
 * @Description 用于触发sheethandler的执行
 * @Author airde
 * @Date 2020/12/9 9:52
 * @Version 1.0
 **/
@Slf4j
public class SheetHandlerWrapper {


    public static void generateInsertSqlAndExecute(Task task){
        try {
            if ("false".equals(task.getIsFtp() )){
                OPCPackage opcPackage = OPCPackage.open(task.getSrcPath());
                XSSFReader xssfReader = new XSSFReader( opcPackage );
                SharedStringsTable sharedStringsTable = xssfReader.getSharedStringsTable();
                XMLReader xmlReader = getXmlReader(sharedStringsTable, task);

                //虽然现在必须使用rId1,rId2这种形式,但是可以通过workbookdata根据sheetname查找sheetid,
                InputStream sheet = xssfReader.getSheet("rId"+task.getSheet().getSheetId());
                InputSource sheetSource = new InputSource(sheet);
                //智能产生插入SQL并执行
                xmlReader.parse(sheetSource);
                sheet.close();

            }else {
                log.error("This is a ftp srcPath ,this version can't handle it, skip this Task");
                return;
            }

        }catch (Exception e){
            e.printStackTrace();
            log.error("GenerateInsertSqlAndExecute ERROR!");
        }

    }


    public static XMLReader getXmlReader(SharedStringsTable sst,Task task) throws SAXException, ParserConfigurationException {
        try {
            XMLReader xmlReader = XMLReaderFactory.createXMLReader();
            ContentHandler handler = new SheetHandler(sst,SqlUtil.getDbInfoFromTask(task),task.getJdbc(),task.getColumnList(),task.getSheet().getDataLine(),task.getSheet().getDataEndLine());
            xmlReader.setContentHandler(handler);
            return xmlReader;
        }catch (Exception e){
            e.printStackTrace();
            log.error("The function getXmlReader Error,it's weired to find this problem.");
        }
       return null;
    }

}

你居然有耐心看完!相信你一定收获颇丰!一键三连!拜托了!(●’◡’●)

评论 3
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值