本文主要记录最近的工作内容,使用java实现对Excel(03,07)的上传、解析、验证和入库(PostgreSQL)。
一。上传
实际就是实现文件上传至服务器即可,但是方法有很多,首先要考虑采用何种方法实现上传:
1.参考文章:
主要有以下几种方法:
JSP+Servlet(或纯JSP);Struts2;Struts;FTP;ExtJs;Flex;
2.我采用的是jsp+Servlet的方式来实现的。需要先从apache官网下载fileupload.jar。
UploadFileServlet.java的主要代码:
testfileupload.jsp的主要代码:
3.需要注意的几个问题
(1)上传文件的大小限制
设置 sfu.setFileSizeMax(1024*1024*10000);单位byte.表示近10G。
测试上传1.5G的文件没问题,2.9G以上的文件上传失败,无错误提示,而是“无法显示该页面”的错误:
所以目前不清楚到底支持多大的文件上传。
二。解析
1.java解析excel通常有两种方法:(1)jxl(2)poi.
至于选用哪种那做得看具体的需求,我这次主要是要支持03,07的excel,目前jxl更新慢,尚不支持07,所以只能选用poi;本次只涉及到excel的读操作,没有写操作,jxl的写快于poi,但jxl的读慢于poi.
综合下来,选用poi。这个poi具体是啥意思我不太清楚,但不是常说的感兴趣点(point of interest)哈.
2.是否支持大文件解析
excel03最多只能有65536条记录,65536行*256列;excel07:1048576行*16384列。
excel07文件超过65536条记录,另存为xls时会报错:
所以要支持大文件,必须得用excel07格式来操作。
需要考虑这点,因为这决定了是用DOM做,还是用SAX。
就1和2来看,java解析excel比起.net解析excel是麻烦了很多很多啊,.net解析excel主要有三种方法:com组件;oledb;open xml。其中使用oledb最常用也最简单。
3.jar包下载
jxl;http://www.andykhan.com/jexcelapi/download.html,最新版本为JExcelApi v2.6.12
poi:http://poi.apache.org/download.html#POI-3.9,最新版本为poi-bin-3.9-20121203.zip
4.poi+SAX,支持excel07大文件解析(EventUserModel)
这种方式应该是最高级别的,能满足大数据量需求,也不会造成oom错误的。我这次需要采用的就是这种实现方式。
主要定义了三个类来实现:
(1)Excel2003Reader.java,操作03Excel。
package com.cbe.excel;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.poi.hssf.eventusermodel.EventWorkbookBuilder.SheetRecordCollectingListener;
import org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
import org.apache.poi.hssf.eventusermodel.HSSFListener;
import org.apache.poi.hssf.eventusermodel.HSSFRequest;
import org.apache.poi.hssf.eventusermodel.MissingRecordAwareHSSFListener;
import org.apache.poi.hssf.eventusermodel.dummyrecord.LastCellOfRowDummyRecord;
import org.apache.poi.hssf.eventusermodel.dummyrecord.MissingCellDummyRecord;
import org.apache.poi.hssf.model.HSSFFormulaParser;
import org.apache.poi.hssf.record.BOFRecord;
import org.apache.poi.hssf.record.BlankRecord;
import org.apache.poi.hssf.record.BoolErrRecord;
import org.apache.poi.hssf.record.BoundSheetRecord;
import org.apache.poi.hssf.record.FormulaRecord;
import org.apache.poi.hssf.record.LabelRecord;
import org.apache.poi.hssf.record.LabelSSTRecord;
import org.apache.poi.hssf.record.NumberRecord;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.SSTRecord;
import org.apache.poi.hssf.record.StringRecord;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
public class Excel2003Reader implements HSSFListener {
private int minColumns = -1;
private POIFSFileSystem fs;
private int lastRowNumber;
private int lastColumnNumber;
private boolean outputFormulaValues = true;
private SheetRecordCollectingListener workbookBuildingListener;
//excel2003工作薄
private HSSFWorkbook stubWorkbook;
// Records we pick up as we process
private SSTRecord sstRecord;
private FormatTrackingHSSFListener formatListener;
//表索引
private int sheetIndex = -1;
private BoundSheetRecord[] orderedBSRs;
//@SuppressWarnings("unchecked")
@SuppressWarnings("rawtypes")
private ArrayList boundSheetRecords = new ArrayList();
// For handling formulas with string results
private int nextRow;
private int nextColumn;
private boolean outputNextStringRecord;
//当前行
//private int curRow = 0;
//存储行记录的容器
private List<String> rowlist = new ArrayList<String>();
private List<List<String>> exceldata=new ArrayList<List<String>>();
@SuppressWarnings( "unused")
private String sheetName;
public List<List<String>> getExcelData(){
return exceldata;
}
public void process(String fileName) throws IOException {
this.fs = new POIFSFileSystem(new FileInputStream(fileName));
MissingRecordAwareHSSFListener listener = new MissingRecordAwareHSSFListener(
this);
formatListener = new FormatTrackingHSSFListener(listener);
HSSFEventFactory factory = new HSSFEventFactory();
HSSFRequest request = new HSSFRequest();
if (outputFormulaValues) {
request.addListenerForAllRecords(formatListener);
} else {
workbookBuildingListener = new SheetRecordCollectingListener(
formatListener);
request.addListenerForAllRecords(workbookBuildingListener);
}
factory.processWorkbookEvents(request, fs);
}
@SuppressWarnings("unchecked")
public void processRecord(Record record) {
//List<String> rowlist = new ArrayList<String>();
int thisRow = -1;
int thisColumn = -1;
String thisStr = null;
String value = null;
switch (record.getSid()) {
case BoundSheetRecord.sid:
boundSheetRecords.add(record);
break;
case BOFRecord.sid:
BOFRecord br = (BOFRecord) record;
if (br.getType() == BOFRecord.TYPE_WORKSHEET) {
// 如果有需要,则建立子工作薄
if (workbookBuildingListener != null && stubWorkbook == null) {
stubWorkbook = workbookBuildingListener
.getStubHSSFWorkbook();
}
sheetIndex++;
if (orderedBSRs == null) {
orderedBSRs = BoundSheetRecord
.orderByBofPosition(boundSheetRecords);
}
sheetName = orderedBSRs[sheetIndex].getSheetname();
}
break;
case SSTRecord.sid:
sstRecord = (SSTRecord) record;
break;
case BlankRecord.sid:
BlankRecord brec = (BlankRecord) record;
thisRow = brec.getRow();
thisColumn = brec.getColumn();
thisStr = "";
rowlist.add(thisColumn, thisStr);
break;
case BoolErrRecord.sid: //单元格为布尔类型
BoolErrRecord berec = (BoolErrRecord) record;
thisRow = berec.getRow();
thisColumn = berec.getColumn();
thisStr = berec.getBooleanValue()+"";
rowlist.add(thisColumn, thisStr);
break;
case FormulaRecord.sid: //单元格为公式类型
FormulaRecord frec = (FormulaRecord) record;
thisRow = frec.getRow();
thisColumn = frec.getColumn();
if (outputFormulaValues) {
if (Double.isNaN(frec.getValue())) {
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
nextRow = frec.getRow();
nextColumn = frec.getColumn();
} else {
thisStr = formatListener.formatNumberDateCell(frec);
}
} else {
thisStr = '"' + HSSFFormulaParser.toFormulaString(stubWorkbook,
frec.getParsedexpression_r()) + '"';
}
rowlist.add(thisColumn,thisStr);
break;
case StringRecord.sid://单元格中公式的字符串
if (outputNextStringRecord) {
// String for formula
StringRecord srec = (StringRecord) record;
thisStr = srec.getString();
thisRow = nextRow;
thisColumn = nextColumn;
outputNextStringRecord = false;
}
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) record;
//curRow = thisRow = lrec.getRow();
thisColumn = lrec.getColumn();
value = lrec.getValue().trim();
value = value.equals("")?" ":value;
rowlist.add(thisColumn, value);
break;
case LabelSSTRecord.sid: //单元格为字符串类型
LabelSSTRecord lsrec = (LabelSSTRecord) record;
//curRow = thisRow = lsrec.getRow();
thisColumn = lsrec.getColumn();
if (sstRecord == null) {
rowlist.add(thisColumn, " ");
} else {
value = sstRecord
.getString(lsrec.getSSTIndex()).toString().trim();
value = value.equals("")?" ":value;
rowlist.add(thisColumn,value);
}
break;
case NumberRecord.sid: //单元格为数字类型
NumberRecord numrec = (NumberRecord) record;
//curRow = thisRow = numrec.getRow();
thisColumn = numrec.getColumn();
value = formatListener.formatNumberDateCell(numrec).trim();
value = value.equals("")?" ":value;
// 向容器加入列值
rowlist.add(thisColumn, value);
break;
default:
break;
}
// 遇到新行的操作
if (thisRow != -1 && thisRow != lastRowNumber) {
lastColumnNumber = -1;
}
// 空值的操作
if (record instanceof MissingCellDummyRecord) {
MissingCellDummyRecord mc = (MissingCellDummyRecord) record;
//curRow = thisRow = mc.getRow();
thisColumn = mc.getColumn();
rowlist.add(thisColumn," ");
}
// 更新行和列的值
if (thisRow > -1)
lastRowNumber = thisRow;
if (thisColumn > -1)
lastColumnNumber = thisColumn;
// 行结束时的操作
if (record instanceof LastCellOfRowDummyRecord) {
if (minColumns > 0) {
// 列值重新置空
if (lastColumnNumber == -1) {
lastColumnNumber = 0;
}
}
lastColumnNumber = -1;
// 每行结束时, 调用getRows() 方法
//rowReader.getRows(sheetIndex,curRow, rowlist);
List<String> rowlistcopy = new ArrayList<String>();
for (int i = 0; i < rowlist.size(); i++) {
rowlistcopy.add(rowlist.get(i));
}
exceldata.add(rowlistcopy);
// 清空容器
rowlist.clear();
}
}
}
(2)Excel2007ReaderNew.java,操作07excel,这个是核心,之所以有个New,是取代了之前的Excel2007Reader.java,解决有单元格为空的判定处理问题。
参考:http://gaosheng08.iteye.com/blog/624758
package com.cbe.excel;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class Excel2007ReaderNew extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
@SuppressWarnings("unused")
private int sheetIndex = -1;
private List<String> rowlist = new ArrayList<String>();
private int curRow = 0; //当前行
private int curCol = 0; //当前列索引
private int preCol = 0; //上一列列索引
private int titleRow = 0; //标题行,一般情况下为0
private int rowsize = 0; //列数
private List<List<String>> exceldata=new ArrayList<List<String>>();//整个excel数据
public List<List<String>> getExcelData(){
return exceldata;
}
//excel记录行操作方法,以行索引和行元素列表为参数,对一行元素进行操作,元素为String类型
// public abstract void optRows(int curRow, List<String> rowlist) throws SQLException ;
//excel记录行操作方法,以sheet索引,行索引和行元素列表为参数,对sheet的一行元素进行操作,元素为String类型
// public abstract void optRows(int sheetIndex,int curRow, List<String> rowlist) throws SQLException;
//只遍历一个sheet,其中sheetId为要遍历的sheet索引,从1开始,1-3
public void processOneSheet(String filename,int sheetId) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
// rId2 found by processing the Workbook
// 根据 rId# 或 rSheet# 查找sheet
InputStream sheet2 = r.getSheet("rId"+sheetId);
sheetIndex++;
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
pkg.close();//add by lfc 20130710,解决文件使用后无法删除的问题
}
public void process(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
XMLReader parser = fetchSheetParser(sst);
Iterator<InputStream> sheets = r.getSheetsData();
while (sheets.hasNext()) {
curRow = 0;
sheetIndex++;
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
}
pkg.close();//add by lfc 20130710,解决文件使用后无法删除的问题
}
public XMLReader fetchSheetParser(SharedStringsTable sst)
throws SAXException {
XMLReader parser = XMLReaderFactory
.createXMLReader("org.apache.xerces.parsers.SAXParser");
this.sst = sst;
parser.setContentHandler(this);
return parser;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => 单元格
if (name.equals("c")) {
// 如果下一个元素是 SST 的索引,则将nextIsString标记为true
String cellType = attributes.getValue("t");
String rowStr = attributes.getValue("r");
curCol = this.getRowIndex(rowStr);
if (cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// 置空
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// 根据SST的索引值的到单元格的真正要存储的字符串
// 这时characters()方法可能会被调用多次
if (nextIsString) {
try {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx))
.toString();
} catch (Exception e) {
}
}
// v => 单元格的值,如果单元格是字符串则v标签的值为该字符串在SST中的索引
// 将单元格内容加入rowlist中,在这之前先去掉字符串前后的空白符
if (name.equals("v")) {
String value = lastContents.trim();
value = value.equals("")?" ":value;
int cols = curCol-preCol;
if (cols>1){
for (int i = 0;i < cols-1;i++){
rowlist.add(preCol,"");
}
}
preCol = curCol;
rowlist.add(curCol-1, value);
}else {
//如果标签名称为 row ,这说明已到行尾,调用 optRows() 方法
if (name.equals("row")) {
int tmpCols = rowlist.size();
if(curRow>this.titleRow && tmpCols<this.rowsize){
for (int i = 0;i < this.rowsize-tmpCols;i++){
rowlist.add(rowlist.size(), "");
}
}
// add by lfc 20130710,注释掉
if(curRow==this.titleRow){
this.rowsize = rowlist.size();
}
List<String> rowlistcopy = new ArrayList<String>();
for (int i = 0; i < rowlist.size(); i++) {
rowlistcopy.add(rowlist.get(i));
}
exceldata.add(rowlistcopy);
rowlist.clear();
curRow++;
curCol = 0;
preCol = 0;
}
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
//得到单元格内容的值
lastContents += new String(ch, start, length);
}
//得到列索引,每一列c元素的r属性构成为字母加数字的形式,字母组合为列索引,数字组合为行索引,
//如AB45,表示为第(A-A+1)*26+(B-A+1)*26列,45行
public int getRowIndex(String rowStr){
rowStr = rowStr.replaceAll("[^A-Z]", "");
byte[] rowAbc = rowStr.getBytes();
int len = rowAbc.length;
float num = 0;
for (int i=0;i<len;i++){
num += (rowAbc[i]-'A'+1)*Math.pow(26,len-i-1 );
}
return (int) num;
}
public int getTitleRow() {
return titleRow;
}
public void setTitleRow(int titleRow) {
this.titleRow = titleRow;
}
}
(3)ExcelReaderUtil.java
核心代码:
public static final String EXCEL03_EXTENSION = ".xls"; //excel2003扩展名
public static final String EXCEL07_EXTENSION = ".xlsx"; //excel2007扩展名
public static List<List<String>> readExcel(String fileName) throws Exception{
List<List<String>> exceldata=new ArrayList<List<String>>();
if (fileName.endsWith(EXCEL03_EXTENSION)){// 处理excel2003文件
Excel2003Reader excel03 = new Excel2003Reader();
excel03.process(fileName);
exceldata=excel03.getExcelData();
}
else if (fileName.endsWith(EXCEL07_EXTENSION)){// 处理excel2007文件
//Excel2007Reader excel07 = new Excel2007Reader();
Excel2007ReaderNew excel07 = new Excel2007ReaderNew();
excel07.process(fileName);
exceldata=excel07.getExcelData();
}
else {
throw new Exception("文件格式错误,fileName的扩展名只能是xls或xlsx!");
}
return exceldata;
}
5.poi+DOM,普通模式(UserModel)
相对简单,但数据量大了报OOM错误,所以没有发展前景。
ExcelUtil.java类,代码如下:
import java.io.FileInputStream;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DateUtil;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import com.cbe.entity.Index;
public class ExcelUtil {
public static List<Index> GetDistrictIndex(String filename){
List<Index> indexsList=new ArrayList<Index>();
try {
}
catch (Exception ex) {
ex.printStackTrace();
}
return indexsList;
}
public static List<List<Object>> ReadExcel(String fileName) throws Exception{
Workbook workbook=null;
try {
if (fileName.endsWith(".xlsx") || fileName.endsWith(".xls")){
//注:add by lfc,20130702,采用这种方式同时支持xls和xlsx,而不必分别用HSSF和XSSF来做
FileInputStream inputStream = new FileInputStream(fileName);
workbook= WorkbookFactory.create(inputStream);
}
}
catch (Exception e) {
e.printStackTrace();
}
return analyzeWorkbook(workbook);
}
public static List<List<Object>> analyzeWorkbook(Workbook workbook){
Sheet sheet = workbook.getSheetAt(0);//第一个表单
List<String> columnNames=new ArrayList<String>();//列
List<Object> row=new ArrayList<Object>();//一行
List<List<Object>> rows=new ArrayList<List<Object>>();//所有行
int rowCount = sheet.getLastRowNum();//行数
int columnCount=sheet.getRow(0).getPhysicalNumberOfCells();//列数
for (int i = 0; i <columnCount; i++) {
Cell cell=sheet.getRow(0).getCell(i);
columnNames.add(getCellValue(cell).toString());
}
for(int j=0;j<=rowCount;j++){ //第一行为列名,若只是取数据则从第二行开始
Row r=sheet.getRow(j);
for(Cell cell : r){
row.add(getCellValue(cell));
}
rows.add(row);
}
return rows;
}
public static Object getCellValue(Cell cell){
int cellType = cell.getCellType();
switch (cellType) {
case Cell.CELL_TYPE_STRING://1
return cell.getStringCellValue();
case Cell.CELL_TYPE_NUMERIC://0
if (DateUtil.isCellDateFormatted(cell)) {//日期格式
Date t = cell.getDateCellValue();
SimpleDateFormat format=new SimpleDateFormat("yyyy-MM-dd");
return format.format(t);
}
else{
return cell.getNumericCellValue();
}
case Cell.CELL_TYPE_BOOLEAN://4
return cell.getBooleanCellValue();
case Cell.CELL_TYPE_FORMULA://2
case Cell.CELL_TYPE_BLANK://3
case Cell.CELL_TYPE_ERROR://5
default:
return cell.getStringCellValue();
}
}
}
附:
测试的时候可能要监控一下耗时什么的,可以使用MyEclipse自带的jvisualvm.exe工具(我没测过)。
路径:D:\Program Files\MyEclipse\Common\binary\com.sun.java.jdk.win32.x86_1.6.0.013\bin
初次运行时会提示要做jdk的校准标准化(calibration),没细研究是个什么概念,直接点确定就好。
运行后的主界面:
三。验证
这个就根据自己的需求,对特定的excel模板做特定的检查,废话就不多说了。
验证处理还是比较费劲的,所以还是控制好源头“excel模板”,整理好数据,否则问题会很多的,本身excel07采用xml的解析方式就存在不少问题的。
如图,不同的格式会有不同的xml,这个在调试程序的时候我也费了不少周折
四。入库
这个工作就交个dao和daoImp包下的接口和类来实现了,自己写。