一.概述
1. Excel 2003文件(即后缀为xls)是二进制文件,存储结构为复合文档,POI读取xls文件有两种方式
- 用户模式(usermodel):一次性将xls文件读入到内存,创建dom结构处理
- 事件模式(eventusermodel):以流的形式读取xls文件,读取xls文件占用相对较小的内存
2. 事件模式适用于愿意学习一点低级API结构的中间开发人员。它使用起来相对简单,但需要对Excel文件j结构有个基本了解。
二. 存储格式
2.1 Workbook document
2.2 文档流Workbook Stream
- Workbook Globals Substream - Workbook globals对应的流,包含workbook的全局信息
- Sheet Substream - Sheet对应的流,包含一个Sheet的信息
2.3 子流SubStreams
- 最先存储的是Workbook Globals Substream
- 接着是第一个Sheet Substream
- Sheet Substream的存储顺序是根据Excel中Sheet的顺序来的
三. 目录结构Directory
- 目录结构Directory是复合文档一种内部控制流
- 目录结构Directory由一系列的目录条目Directory Entry组成
- 每一个目录条目Directory Entry都指向复合文档的一个仓库Storage或流Stream
- 目录条目Directory Entry根据对应仓库或流在文件流中出现的顺序被列举
- 目录条目Directory Entry的索引从0开始,其索引称为DirID,见下图
- DirID为0的表示一个特殊的目录条目,它代表根仓库条目 - root storage entry
3.1 目录条目按序列举
3.2 目录条目结构
- 每个目录条目指向一个仓库Storage或流Stream
- 每个目录条目Directory Entry固定大小为128字节
- 第一个目录条目是根仓库条目 - root storage Entry
- 第二个目录条目的名字是“Workbook”,它表示一个流
- 根仓库Root Storage描述的是根仓库条目root storage entry ,由于它没有父目录条目,所以无需构建红黑树
- 根仓库的所有直接成员(Storage1、Stream1、Stream2、Storage2、Stream3、Stream4)将组成一颗红黑树,该树的根节点的DID记录在root storage entry中
- Storage1只有一个直接成员Stream1,Stream1将构成一颗红黑树,此树只有一个节点,Stream1的DID入口记录在Storage1目录条目中
- Storage2有3个直接成员,Stream21、Stream22、Stream23这3个直接成员将组成一颗红黑树,此树根节点的DID记录在Storage2的目录条目中
四. 记录Record
- BOFRecord : 记录了Workbook或一个sheet的开始
- EOFRecord : 记录了Workbook或一个sheet的结尾
- STRecord : 记录了Excel中所有文件大院个的文本值
- .......
- Identifier:Record的标识符sid,POI读取到sid就知道将流解析成对应的Record
- size :记录了当前Record内容占据的大小,单位为字节
- content:当前Record的内容
@SuppressWarnings("unchecked")
private static final Class<? extends Record>[] recordClasses = new Class[] {
ArrayRecord.class,
AutoFilterInfoRecord.class,
BackupRecord.class,
BlankRecord.class,
BOFRecord.class,
BookBoolRecord.class,
BoolErrRecord.class,
BottomMarginRecord.class,
BoundSheetRecord.class,
CalcCountRecord.class,
CalcModeRecord.class,
CFHeaderRecord.class,
CFHeader12Record.class,
CFRuleRecord.class,
CFRule12Record.class,
ChartRecord.class,
ChartTitleFormatRecord.class,
CodepageRecord.class,
ColumnInfoRecord.class,
ContinueRecord.class,
CountryRecord.class,
CRNCountRecord.class,
CRNRecord.class,
DateWindow1904Record.class,
DBCellRecord.class,
DConRefRecord.class,
DefaultColWidthRecord.class,
DefaultRowHeightRecord.class,
DeltaRecord.class,
DimensionsRecord.class,
DrawingGroupRecord.class,
DrawingRecord.class,
DrawingSelectionRecord.class,
DSFRecord.class,
DVALRecord.class,
DVRecord.class,
EOFRecord.class,
ExtendedFormatRecord.class,
ExternalNameRecord.class,
ExternSheetRecord.class,
ExtSSTRecord.class,
FeatRecord.class,
FeatHdrRecord.class,
FilePassRecord.class,
FileSharingRecord.class,
FnGroupCountRecord.class,
FontRecord.class,
FooterRecord.class,
FormatRecord.class,
FormulaRecord.class,
GridsetRecord.class,
GutsRecord.class,
HCenterRecord.class,
HeaderRecord.class,
HeaderFooterRecord.class,
HideObjRecord.class,
HorizontalPageBreakRecord.class,
HyperlinkRecord.class,
IndexRecord.class,
InterfaceEndRecord.class,
InterfaceHdrRecord.class,
IterationRecord.class,
LabelRecord.class,
LabelSSTRecord.class,
LeftMarginRecord.class,
LegendRecord.class,
MergeCellsRecord.class,
MMSRecord.class,
MulBlankRecord.class,
MulRKRecord.class,
NameRecord.class,
NameCommentRecord.class,
NoteRecord.class,
NumberRecord.class,
ObjectProtectRecord.class,
ObjRecord.class,
PaletteRecord.class,
PaneRecord.class,
PasswordRecord.class,
PasswordRev4Record.class,
PrecisionRecord.class,
PrintGridlinesRecord.class,
PrintHeadersRecord.class,
PrintSetupRecord.class,
ProtectionRev4Record.class,
ProtectRecord.class,
RecalcIdRecord.class,
RefModeRecord.class,
RefreshAllRecord.class,
RightMarginRecord.class,
RKRecord.class,
RowRecord.class,
SaveRecalcRecord.class,
ScenarioProtectRecord.class,
SelectionRecord.class,
SeriesRecord.class,
SeriesTextRecord.class,
SharedFormulaRecord.class,
SSTRecord.class,
StringRecord.class,
StyleRecord.class,
SupBookRecord.class,
TabIdRecord.class,
TableRecord.class,
TableStylesRecord.class,
TextObjectRecord.class,
TopMarginRecord.class,
UncalcedRecord.class,
UseSelFSRecord.class,
UserSViewBegin.class,
UserSViewEnd.class,
ValueRangeRecord.class,
VCenterRecord.class,
VerticalPageBreakRecord.class,
WindowOneRecord.class,
WindowProtectRecord.class,
WindowTwoRecord.class,
WriteAccessRecord.class,
WriteProtectRecord.class,
WSBoolRecord.class,
// chart records
BeginRecord.class,
ChartFRTInfoRecord.class,
ChartStartBlockRecord.class,
ChartEndBlockRecord.class,
// TODO ChartFormatRecord.class,
ChartStartObjectRecord.class,
ChartEndObjectRecord.class,
CatLabRecord.class,
DataFormatRecord.class,
EndRecord.class,
LinkedDataRecord.class,
SeriesToChartGroupRecord.class,
// pivot table records
DataItemRecord.class,
ExtendedPivotTableViewFieldsRecord.class,
PageItemRecord.class,
StreamIDRecord.class,
ViewDefinitionRecord.class,
ViewFieldsRecord.class,
ViewSourceRecord.class,
};
4.1 Record解析顺序
五. Workbook解析步骤
5.1 BOFRecord / EOFRecord
1.位置:org.apache.poi.hssf.record.BOFRecord、org.apache.poi.hssf.record.EOFRecord
2.BOFRecord表示Workbook或一个sheet的开始,EOFRecord表示Workbook或一个sheet的结尾
3.如图:
5.2 FormatRecord
5.3 ExtendedFormatRecord
- 单元格样式索引:XFIndex
- 单元格边框样式:border
- 单元格水平垂直样式:alignment
- 单元格填充色:fill
5.4 BoundSheetRecord
- Excel中有几个Sheet,就有几个BoundSheetRecord对象
- BoundSheetRecord对应Sheet在Excel中出现顺序
5.5 SSTRecord
1.位置:org.apache.poi.hssf.record.SSTRecord
2.SSTRecord中存储了在Excel中文本单元格中的文本值,文本单元格通过索引获取文本值
3.如图
六. WorkSheet解析步骤
6.1 BOFRecord/EOFRecord
6.2 DimensionsRecord
1.位置:org.apache.poi.hssf.record.DimensionsRecord
2.DimensionsRecord存储了一个sheet的行列范围
DimensionsRecord | 描述说明 |
---|---|
field_1_first_row | sheet中第一有效行行号 |
field_2_last_row | sheet中最后有效行行号+1 |
field_3_first_col | sheet中第一有效列列号 |
field_4_last_col | sheet中最后有效列列号+1 |
3.如图
6.3 ColumnInfoRecord
1.位置:org.apache.poi.hssf.record.ColumnInfoRecord
2.ColumnInfoRecord存储了sheet中一列的信息
3.如图:
6.4 RowRecord
- 当前行索引
- 当前行是否隐藏
6.5 LabelSSTRecord
1. 位置:org.apache.poi.hssf.record.LabelSSTRecord
2. LabelSSTRecord记录了一个sheet中的文本单元格
3. 如图:
Record | 描述说明 |
---|---|
NumberRecord | 数值单元格 |
LabelSSTRecord | 引用了SSTRecord中一个String类型的单元格值 |
BoolErrRecord | 布尔或错误单元格,根据属性isError判断是布尔还是错误单元格 |
FormulaRecord | 公式单元格 |
BlankRecord | 空白单元格,单元格没有值,但是有单元格样式 |
StringRecord | 存储文本公式的缓存结果 |
LabelRecord | 只读,支持读取直接存储在单元格中的字符串,而不是存储在SSTRecord中,除了读取不要使用LabelRecord,应该使用SSTRecord替代 |
6.6 NumberRecord
6.7 BoolErrRecord
6.8 FormulaRecord
6.9 BlankRecord
6.10 MergeCellsRecord
七. 解析步骤
- 实现接口HSSFListener,实现自己的监听器listener
- 通过Record。sid为某些特定的Record设置监听listener
- 根据Excel 2003文件路径获取该文件的输入流FileInputStream - in
- 根据输入流in创建POIFSFileSysytem实例对象poifs
- 第二个目录条目名字是Workbook,找到第二目录条目,根据对应的流创建一个输入流DocumentInputStream
- 根据输入流DocumentInputStream,解析为一个个记录Record
- 如果解析出的Record设置了监听,触发监听事件
- 处理监听器中事件
7.1 设置监听的Record
BOFRecord.sid, // HSSFWorkbook、HSSFSheet的开始
EOFRecord.sid, // HSSFWorkbook、HSSFSheet的结束
BoundSheetRecord.sid, // BoundSheetRecord记录了sheetName
SSTRecord.sid, // SSTRecord记录了所有Sheet的文本单元格的文本
DimensionsRecord.sid, // DimensionsRecord记录了每个Sheet的有效起始结束行列索引
MergeCellsRecord.sid, // MergeCellsRecord记录了每个Sheet中的合并单元格信息
ExtendedFormatRecord.sid, // ExtendedFormatRecord记录了扩展的单元格样式
FormatRecord.sid, // FormatRecord记录单元格样式信息
ColumnInfoRecord.sid, // ColumnInfoRecord记录了Sheet中列信息,如列是否隐藏
RowRecord.sid, // RowRecord记录了Sheet中行信息,如行索引,行是否隐藏
BlankRecord.sid, // Sheet中空单元格,存在单元格样式
BoolErrRecord.sid, // Sheet中布尔或错误单元格
FormulaRecord.sid, // Sheet中公式单元格
LabelSSTRecord.sid, // Sheet中文本单元格
NumberRecord.sid // Sheet中数值单元格:数字单元格和日期单元格
7.2 org.apache.poi.poifs.filesystem.POIFSFileSystem类
- 根据fs可以获取到根目录条目DirectoryNode - root entry
- 根目录root entry根据Entry实体名Workbook获取对应的Entry
- 根据Workbook Entry创建该实体的输入流DocumentInputStream - ds
- 解析输入流ds,根据Excel XLS文档格式解析为各种Record
- 解析到设置监听器的Record,触发监听器时间listener,处理Record
/ xls文件的输入流
FileInputStream fin = newFileInputStream("C:\\Users\\Administrator\\Desktop\\测试.xls");
// 创建一个POIFSFileSystem实例
POIFSFileSystem poifs = newPOIFSFileSystem(fin);
// 从流中获取Excel的WorkBook流
InputStream workBookInputStream = poifs.createDocumentInputStream("Workbook");
7.3 org.apache.poi.hssf.eventusermodel.HSSFListener
HSSFListener是与HSSFRequest和HSSFEventFactory一起使用的接口- 用户应该实现接口HSSFListener,创建一个自己的监听器类Workbook
- listener可以注册到HSSFRequest实例request中,用于监听特定的Record
- 一个Record可以设置多个监听器,处理不同的事
7.4 org.apache.poi.hssf.eventusermodel.HSSFRequest
HSSFRequest | 类方法描述 |
---|---|
addListener(HSSFListener lsnr, short sid) | 为sid的记录record注册一个监听器lsnr |
addListenerForAllRecords(HSSFListener lsnr) | 为org.apache.poi.hssf.record.Record包中所有的记录注册一个监听器lsnr 不推荐用这种方法,影响性能 |
processRecord(Record rec) | 由HSSFEventFactory调用,处理记录rec 记录rec可能注册了多个监听器,循环触发每个注册的监听器,处理记录record |
7.5 org.apache.poi.hssf.eventusermodel.HSSFEventFactory
HSSFEventFactory | 类方法描述 |
---|---|
processWorkbookEvents(HSSFRequest req, POIFSFileSystem fs) | 将一个文件处理为基本的Record事件 @param req 一个HSSFRequest实例,记录了Record的所有监听器 @param fs 包含WorkBook的POIFS文件系统 |
processWorkbookEvents(HSSFRequest req, DirectoryNode dir) | 将一个文件处理为基本的Record事件 @param req 一个HSSFRequest实例,记录了Record的所有监听器 @param dir 包含WorkBook的DirectoryNode |
processEvents(HSSFRequest req, InputStream in) | 将一个文件处理为基本的Record事件 @param req 一个HSSFRequest实例,记录了Record的所有监听器 @param in 包含WorkBook的DirectoryNode的输入流 |
short abortableProcessWorkbookEvents(HSSFRequest req, POIFSFileSystem fs) | 将一个文件处理为基本的Record事件
返回数值,如果监听器是继承AbortableHSSFListener,返回值不为0,则不会触发当前记录的其他监听器,
就会继续处理下一个记录
|
short abortableProcessWorkbookEvents(HSSFRequest req, DirectoryNode dir) | 将一个文件处理为基本的Record事件
返回数值,如果监听器是继承AbortableHSSFListener,返回值不为0,则不会触发当前记录的其他监听器,
就会继续处理下一个记录
|
short abortableProcessEvents(HSSFRequest req, InputStream in) | 将一个文件处理为基本的Record事件
返回数值,如果监听器是继承AbortableHSSFListener,返回值不为0,则不会触发当前记录的其他监听器,
就会继续处理下一个记录
|
八.一个事件模式实例
8.1 HSSFListener接口的实现类
package poi.hssf.event;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.record.BOFRecord;
import org.apache.poi.hssf.record.BoundSheetRecord;
import org.apache.poi.hssf.record.LabelSSTRecord;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.RowRecord;
import org.apache.poi.hssf.record.SSTRecord;
public class HSSFListenerImpl implements HSSFListener {
private SSTRecord sstrec;
/**
* This method listens for incoming records and handles them as required.
* @param record The record that was found while reading.
*/
public void processRecord(Record record) {
switch (record.getSid()) {
// the BOFRecord can represent either the beginning of a sheet or the workbook
case BOFRecord.sid:
BOFRecord bof = (BOFRecord) record;
if (bof.getType() == BOFRecord.TYPE_WORKBOOK) {
System.out.println("处理 workbook");
// assigned to the class level member
} else if (bof.getType() == BOFRecord.TYPE_WORKSHEET) {
System.out.println("处理sheet");
}
break;
case BoundSheetRecord.sid:
BoundSheetRecord bsr = (BoundSheetRecord) record;
System.out.println("New sheet named: " + bsr.getSheetname());
break;
case RowRecord.sid:
RowRecord rowrec = (RowRecord) record;
System.out.println("Row found, first column at "
+ rowrec.getFirstCol() + " last column at " + rowrec.getLastCol());
break;
case NumberRecord.sid:
NumberRecord numrec = (NumberRecord) record;
System.out.println("Cell found with value " + numrec.getValue()
+ " at row " + numrec.getRow() + " and column " + numrec.getColumn());
break;
// SSTRecords store a array of unique strings used in Excel.
case SSTRecord.sid:
sstrec = (SSTRecord) record;
for (int k = 0; k < sstrec.getNumUniqueStrings(); k++) {
System.out.println("String table value " + k + " = " + sstrec.getString(k));
}
break;
case LabelSSTRecord.sid:
LabelSSTRecord lrec = (LabelSSTRecord) record;
System.out.println("String cell found with value "
+ sstrec.getString(lrec.getSSTIndex()));
break;
}
}
}
package poi.hssf.event;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.record.BOFRecord;
import org.apache.poi.hssf.record.BoundSheetRecord;
import org.apache.poi.hssf.record.LabelSSTRecord;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.RowRecord;
import org.apache.poi.hssf.record.SSTRecord;
public class HSSFListenerImpl implements HSSFListener {
private SSTRecord sstrec;
/**
* This method listens for incoming records and handles them as required.
* @param record The record that was found while reading.
*/
public void processRecord(Record record) {
switch (record.getSid()) {
// the BOFRecord can represent either the beginning of a sheet or the workbook
case BOFRecord.sid:
BOFRecord bof = (BOFRecord) record;
if (bof.getType() == BOFRecord.TYPE_WORKBOOK) {
System.out.println("处理 workbook");
// assigned to the class level member
} else if (bof.getType() == BOFRecord.TYPE_WORKSHEET) {
System.out.println("处理sheet");
}
break;
case BoundSheetRecord.sid:
BoundSheetRecord bsr = (BoundSheetRecord) record;
System.out.println("New sheet named: " + bsr.getSheetname());
break;
case RowRecord.sid:
RowRecord rowrec = (RowRecord) record;
System.out.println("Row found, first column at "
+ rowrec.getFirstCol() + " last column at " + rowrec.getLastCol());
break;
case NumberRecord.sid:
NumberRecord numrec = (NumberRecord) record;
System.out.println("Cell found with value " + numrec.getValue()
+ " at row " + numrec.getRow() + " and column " + numrec.getColumn());
break;
// SSTRecords store a array of unique strings used in Excel.
case SSTRecord.sid:
sstrec = (SSTRecord) record;
for (int k = 0; k < sstrec.getNumUniqueStrings(); k++) {
System.out.println("String table value " + k + " = " + sstrec.getString(k));
}
break;
case LabelSSTRecord.sid:
LabelSSTRecord lrec = (LabelSSTRecord) record;
System.out.println("String cell found with value "
+ sstrec.getString(lrec.getSSTIndex()));
break;
}
}
}
8.2 Test
packagepoi.hssf.event;
importjava.io.FileInputStream;
importjava.io.IOException;
importjava.io.InputStream;
importorg.apache.poi.hssf.eventusermodel.HSSFEventFactory;
importorg.apache.poi.hssf.eventusermodel.HSSFRequest;
importorg.apache.poi.poifs.filesystem.POIFSFileSystem;
publicclass TestEventAPI {
publicstatic void main(String[] args) throwsIOException {
FileInputStream fin = newFileInputStream("C:\\Users\\Administrator\\Desktop\\测试.xls");
try{
POIFSFileSystem poifs = newPOIFSFileSystem(fin);
try{
// 从流中获取Excel的WorkBook流
InputStream workBookInputStream = poifs.createDocumentInputStream("Workbook");
try{
HSSFRequest hssfRequest = newHSSFRequest();
// 为所有的record注册一个监听器
hssfRequest.addListenerForAllRecords(newHSSFListenerImpl());
// 创建事件工厂
HSSFEventFactory factory = newHSSFEventFactory();
// 根据WorkBook输入流处理所有事件
factory.processEvents(hssfRequest, workBookInputStream);
}finally{
workBookInputStream.close();
}
}finally{
poifs.close();
}
}finally{
// 一旦所有的监听器处理完成,关闭文件输入流
fin.close();
}
}
}