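Our project has recently reached a milestone, so I am using this article to summarize the techniques used so far.
The project is being developed for a government agency, and the back end is essentially a CMS. The client insisted on one feature: importing a Word document directly into the web editor. We use FCKeditor 2.5. This feature gave me a real headache, and after several days I still had no idea how to approach it, until I came across this forum thread:
http://topic.csdn.net/u/20091020/21/b77f825b-4a18-4a86-b642-8d38ffef9e12.html
The poster on the 3rd floor shared his code, and the approach is a good one.
First, call a COM component to convert the Word file to HTML; then extract the relevant parts of the generated source; finally, put that code into the FCKeditor instance. I ran into quite a few technical details along the way. My implementation follows.
1. Use JACOB to convert Word to HTML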
Java code
// Imports required by this method (JACOB)
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;

/**
 * Converts a Word file to an HTML file.
 *
 * @param src
 *            source Word file
 * @param out
 *            target HTML file
 */
public static synchronized void word2Html(String src, String out) {
    ActiveXComponent app = null;
    try {
        app = new ActiveXComponent("Word.Application"); // start Word
        app.setProperty("Visible", new Variant(false)); // keep Word hidden
        Dispatch docs = app.getProperty("Documents").toDispatch();
        // Open the file: ConfirmConversions = false, ReadOnly = true
        Dispatch doc = Dispatch.invoke(docs, "Open", Dispatch.Method,
                new Object[] { src, new Variant(false), new Variant(true) }, new int[1]).toDispatch();
        // Save in another format: 8 = HTML, 9 = MHT
        Dispatch.invoke(doc, "SaveAs", Dispatch.Method, new Object[] { out, new Variant(8) }, new int[1]);
        // Close the document without saving changes
        Dispatch.call(doc, "Close", new Variant(false));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Always quit Word; otherwise winword.exe processes pile up on the server
        if (app != null) {
            app.invoke("Quit", new Variant[] {});
        }
    }
}
The code above simply uses a COM component to start Word, hide its window, and save the opened Word file as HTML.
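2. Read the file with Apache Commons IO
Java code
// Imports required by this method
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.commons.io.IOUtils;

/**
 * Reads the HTML source from the given file.
 *
 * @param fileName
 *            path of the HTML file
 * @return the file content, or null if reading fails
 */
public static synchronized String getHtmlCode(String fileName) {
    InputStream in = null;
    String result = null;
    try {
        in = new FileInputStream(fileName);
        // Decode as gb2312; the whole stream is read, so whitespace and line breaks are kept
        result = IOUtils.toString(in, "gb2312");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        IOUtils.closeQuietly(in);
    }
    return result;
}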
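By default the generated HTML file is gb2312-encoded. Note that the string you read must keep its whitespace: if you paste it into a text file, it should look exactly like the HTML source file. The style extraction in step 4 depends on this, because it checks exact character offsets.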
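For readers who do not have Commons IO at hand, the same whitespace-preserving read can be sketched with the JDK alone. This is only an illustration under my own assumptions: the class name `HtmlFileReader` and the method `read` are hypothetical and not part of the original project, and it assumes the JDK in use ships the gb2312 charset.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original project.
class HtmlFileReader {

    /**
     * Reads the whole file into one string, decoding it with the given
     * charset. Nothing is trimmed or normalized, so spaces and line
     * breaks stay exactly as they are in the file.
     */
    static String read(Path file, String charsetName) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

Calling `HtmlFileReader.read(path, "gb2312")` would play the same role as `getHtmlCode`, except that a read failure surfaces as an `IOException` instead of a null result.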
3. Extract the body code
Java code
/**
 * Extracts the body content.
 *
 * @param htmlCode
 *            full HTML source generated by Word
 * @return everything from the opening body tag up to (not including) the
 *         closing html tag, or "" if either marker is missing
 */
public static synchronized String performBodyCode(String htmlCode) {
    String bodyCode = "";
    int bodyIndex = htmlCode.indexOf("<body");
    int bodyEndIndex = htmlCode.indexOf("</html>");
    if (bodyIndex != -1 && bodyEndIndex != -1) {
        bodyCode = htmlCode.substring(bodyIndex, bodyEndIndex);
        // Tag replacement tried earlier and left disabled:
        // bodyCode = StringUtils.replace(bodyCode, "v:imagedata", "img");
        // bodyCode = StringUtils.replace(bodyCode, "</v:imagedata>", "");
    }
    return bodyCode;
}
Much of the generated HTML is useless, so we need to slim it down and replace some tags.
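As a small illustration of the slimming idea, here is a sketch of a body extraction that ignores tag case and stops at the closing body tag. It is not the code used in the project; `BodyExtractor`, `extractBody` and `indexOfIgnoreCase` are names I made up for the example, and it assumes the document contains a single body element.

```java
// Hypothetical helper, not part of the original project.
class BodyExtractor {

    /** Case-insensitive indexOf; returns -1 when the needle is not found. */
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns the text from the opening body tag through the closing
     * body tag, or "" if either tag is missing.
     */
    static String extractBody(String html) {
        int start = indexOfIgnoreCase(html, "<body", 0);
        if (start == -1) {
            return "";
        }
        int end = indexOfIgnoreCase(html, "</body>", start);
        if (end == -1) {
            return "";
        }
        return html.substring(start, end + "</body>".length());
    }
}
```

Compared with cutting at the closing html tag, this keeps the fragment limited to the body element itself, which may or may not matter depending on what the editor does with trailing markup.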
4. Process the style tags in the HTML code
Java code
/**
 * Extracts the useful style block.
 *
 * @param htmlCode
 *            full HTML source generated by Word
 * @return the style block whose content is wrapped in an HTML comment,
 *         or "" if no such block is found
 */
public static synchronized String performStyleCode(String htmlCode) {
    int index = 0;
    int styleStartIndex = -1;
    int styleEndIndex = -1;
    // Find the <style> tag that is followed by "<!--" exactly 9 characters later
    while (index < htmlCode.length()) {
        int styleIndexStartTemp = htmlCode.indexOf("<style>", index);
        if (styleIndexStartTemp == -1) {
            break;
        }
        int styleContentStartIndex = htmlCode.indexOf("<!--", styleIndexStartTemp);
        if (styleContentStartIndex - styleIndexStartTemp == 9) {
            styleStartIndex = styleIndexStartTemp;
            break;
        }
        index = styleIndexStartTemp + 7; // 7 is the length of "<style>"
    }
    if (styleStartIndex == -1) {
        return "";
    }
    // Find the "-->" that is followed by "</style>" exactly 5 characters later
    index = styleStartIndex;
    while (index < htmlCode.length()) {
        int styleContentEndIndex = htmlCode.indexOf("-->", index);
        if (styleContentEndIndex == -1) {
            break;
        }
        int styleEndIndexTemp = htmlCode.indexOf("</style>", styleContentEndIndex);
        if (styleEndIndexTemp - styleContentEndIndex == 5) {
            styleEndIndex = styleEndIndexTemp;
            break;
        }
        index = styleContentEndIndex + 3; // 3 is the length of "-->"
    }
    if (styleEndIndex == -1) {
        return "";
    }
    // 8 is the length of "</style>"
    return htmlCode.substring(styleStartIndex, styleEndIndex + 8);
}
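After conversion the HTML contains many style tags. Only a style tag whose content is wrapped in an HTML comment, like this one,
<style>
<!-- content omitted
-->
</style>
is useful; all the others can be discarded. The code above extracts that useful block. It matches on exact offsets ("<!--" must start 9 characters after "<style>", and "</style>" 5 characters after "-->"), so it only returns the right block if the file was read with its formatting intact in step 2.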
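If the fixed offsets feel fragile, the same selection rule (keep the style block whose content is an HTML comment) can be expressed with a regular expression. This is a sketch of an alternative, not what the project uses; `StyleExtractor` and `extractCommentedStyle` are hypothetical names, and it assumes the style content itself does not contain the text "</style>".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class StyleExtractor {

    // A style tag, optional whitespace, an HTML comment, optional
    // whitespace, then the closing tag. DOTALL lets '.' span line breaks.
    private static final Pattern COMMENTED_STYLE = Pattern.compile(
            "<style[^>]*>\\s*<!--.*?-->\\s*</style>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the first comment-wrapped style block, or "" if there is none. */
    static String extractCommentedStyle(String html) {
        Matcher m = COMMENTED_STYLE.matcher(html);
        return m.find() ? m.group() : "";
    }
}
```

Because whitespace between the tags is matched with `\s*`, this version does not care whether the line breaks are CR LF or LF, at the cost of being a little slower on very large inputs.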
5. Process the images in the Word file
Java code
// Imports required by this method (ImageUtils and ImgTagVisitor are project classes)
import java.util.List;
import java.util.ResourceBundle;
import java.util.UUID;
import org.apache.commons.lang.StringUtils;
import org.htmlparser.Parser;
import org.htmlparser.util.ParserException;

/**
 * Rewrites the image references in the body.
 *
 * @param bodyContent
 *            body HTML extracted in step 3
 * @return the body with image URLs pointing at the preview action
 */
public static synchronized String performBodyImg(String bodyContent) {
    // URL of the action that previews an image by file name
    String newImgSrc = "tumbnail.action?fileName=";
    // Physical location of the uploaded Word files
    String filePath = ResourceBundle.getBundle("sysConfig").getString("userFilePath.word");
    // Physical location of the images
    String imgPath = ResourceBundle.getBundle("sysConfig").getString("userFilePath.image");
    Parser parser = Parser.createParser(bodyContent, "gb2312");
    ImgTagVisitor imgTag = new ImgTagVisitor();
    try {
        parser.visitAllNodesWith(imgTag);
        // All image URLs found in the body
        List<String> imgUrls = imgTag.getSrcStringList();
        for (String url : imgUrls) {
            String uuid = UUID.randomUUID().toString();
            String extName = url.substring(url.lastIndexOf("."));
            String newImgFileName = newImgSrc + uuid + extName;
            // Point the tag at the preview action and copy the file under its new name
            bodyContent = StringUtils.replace(bodyContent, url, newImgFileName);
            ImageUtils.copy(filePath + url, imgPath + uuid + extName);
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    // Strip the leftover conditional-comment markers
    String result = StringUtils.replace(bodyContent, "<![endif]>", "");
    result = StringUtils.replace(result, "<![if !vml]>", "");
    return result;
}
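The code above uses the open-source HTML parsing tool htmlparser to collect the links of all images, then uses StringUtils from the Apache Commons Lang package to replace each link with the image-preview action that I modified in FCKeditor.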
Below is my own implementation of ImgTagVisitor.
Java code
package com.bettem.cms.web.utils.htmlparser;

import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Tag;
import org.htmlparser.Text;
import org.htmlparser.visitors.NodeVisitor;

/**
 * Visitor used with htmlparser to collect img tags.
 *
 * Date       Author
 * 2010-2-3   Liqiang
 */
public class ImgTagVisitor extends NodeVisitor {
    private final List<String> srcList = new ArrayList<String>();
    private final StringBuilder textAccumulator = new StringBuilder();

    @Override
    public void visitTag(Tag tag) {
        if (tag.getTagName().equalsIgnoreCase("img")) {
            String src = tag.getAttribute("src");
            // Skip img tags without a src so callers never receive null
            if (src != null) {
                srcList.add(src);
            }
        }
    }

    /** Returns the src values of all img tags visited so far. */
    public List<String> getSrcStringList() {
        return srcList;
    }

    @Override
    public void visitStringNode(Text stringNode) {
        textAccumulator.append(stringNode.getText());
    }

    /** Returns the concatenated text of all text nodes visited so far. */
    public String getText() {
        return textAccumulator.toString();
    }
}
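To make the rewriting in step 5 easier to try out without htmlparser, here is a rough JDK-only sketch that finds each img src with a regular expression and passes it through a mapping function. It is a swapped-in technique for illustration, not the project's code: `ImgSrcRewriter`, `rewrite` and the `mapper` parameter are hypothetical, and a regex will not cope with every form of HTML (for example, an attribute such as data-src would also match).

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class ImgSrcRewriter {

    // Group 1: everything up to and including "src=".
    // Groups 3, 4, 5: the value when double-quoted, single-quoted, or unquoted.
    private static final Pattern IMG_SRC = Pattern.compile(
            "(<img\\b[^>]*?\\bsrc\\s*=\\s*)(\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    /** Replaces every img src with mapper.apply(oldSrc), always double-quoted. */
    static String rewrite(String html, Function<String, String> mapper) {
        Matcher m = IMG_SRC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String src = m.group(3) != null ? m.group(3)
                    : m.group(4) != null ? m.group(4) : m.group(5);
            String replacement = m.group(1) + "\"" + mapper.apply(src) + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being treated specially
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In the project's terms, the mapper would be the place to generate the UUID-based file name and prepend the preview action URL. Rewriting by match position also avoids replacing text elsewhere in the body that merely happens to equal an image URL.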
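6. Remove the redundant v:imagedata tags
Java code
// Imports required by this method
import org.htmlparser.Parser;
import org.htmlparser.filters.NotFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Removes the redundant v:imagedata tags.
 *
 * @param content
 *            HTML to filter
 * @return the filtered HTML, or the original content if parsing fails
 */
public static synchronized String removeImagedataTag(String content) {
    try {
        Parser parser = new Parser(content, Parser.STDOUT);
        // Keep every node that is not a v:imagedata tag
        NotFilter filter = new NotFilter(new TagNameFilter("v:imagedata"));
        NodeList nl = parser.extractAllNodesThatMatch(filter);
        return nl.toHtml();
    } catch (ParserException e) {
        e.printStackTrace();
        // Fall back to the unfiltered content instead of failing on a null list
        return content;
    }
}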
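When Word converts a document to HTML, large images are automatically compressed into smaller ones, but references to the original large images remain in the code. The code above filters out these redundant tags.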
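For comparison, the same filtering can be sketched with a plain regular expression that deletes the v:imagedata tags from the string. This is an illustrative alternative rather than the project's implementation; `ImagedataRemover` and `remove` are hypothetical names, and the pattern assumes no attribute value contains a ">" character.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class ImagedataRemover {

    // Matches <v:imagedata ...>, self-closing <v:imagedata .../> and </v:imagedata>
    private static final Pattern IMAGEDATA = Pattern.compile(
            "</?v:imagedata\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the HTML with every v:imagedata tag removed; other markup is untouched. */
    static String remove(String html) {
        return IMAGEDATA.matcher(html).replaceAll("");
    }
}
```

Since only the matched tags are cut out, the rest of the markup comes back byte for byte, which makes the result easy to diff against the input when debugging.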
Finally, here is the code in my action.
Java code
/**
 * Imports a Word file.
 *
 * @return the result name
 */
public synchronized String exportWord()
{
    String content = null;
    String path = ResourceBundle.getBundle("sysConfig").getString("userFilePath.word");
    InputStream ins = null;
    OutputStream wordFile = null;
    String htmlPath = null;
    String wordPath = null;
    // Handle the uploaded Word file
    try
    {
        String uuid = UUID.randomUUID().toString();
        // Keep the original extension
        String fileName = uuid + filedataFileName.substring(filedataFileName.lastIndexOf("."));
        // Generate the HTML file name
        String wordHtmlFileName = uuid + ".html";
        ins = new FileInputStream(filedata);
        wordPath = path + fileName;
        wordFile = new FileOutputStream(wordPath);
        IOUtils.copy(ins, wordFile);
        // Convert Word to HTML
        htmlPath = path + wordHtmlFileName;
        WordUtils.word2Html(wordPath, htmlPath);
        String wordHtmlContent = WordUtils.getHtmlCode(htmlPath);
        // Extract the style block and the body
        String styleCode = WordUtils.performStyleCode(wordHtmlContent);
        String bodyCode = WordUtils.performBodyCode(wordHtmlContent);
        // Process the images in the article
        bodyCode = WordUtils.performBodyImg(bodyCode);
        content = styleCode + bodyCode;
        // Note: the returned string is not assigned, so content is left unchanged by this call
        WordUtils.removeImagedataTag(content);
    }
    catch (FileNotFoundException e)
    {
        e.printStackTrace();
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    finally
    {
        IOUtils.closeQuietly(wordFile);
        IOUtils.closeQuietly(ins);
        try
        {
            // wordPath and htmlPath are still null if the upload failed early
            if (wordPath != null)
            {
                new File(wordPath).delete();
            }
            if (htmlPath != null)
            {
                File file = new File(htmlPath);
                if (file.exists())
                {
                    file.delete();
                    // Word stores the images in a ".files" folder next to the HTML file
                    FileUtils.deleteDirectory(new File(htmlPath.substring(0, htmlPath.lastIndexOf(".")) + ".files"));
                }
            }
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
    // Put the imported content into the request
    ServletActionContext.getRequest().setAttribute("content", content);
    ServletActionContext.getRequest().setAttribute("add", true);
    return SUCCESS;
}
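The cleanup at the end of the action (delete the uploaded Word file, the generated HTML file, and the ".files" folder next to it) can also be written with the JDK's own file API. The following is only a sketch under my own assumptions; `ImportCleanup`, `companionFolder`, `deleteRecursively` and `cleanUp` are hypothetical names and do not exist in the project.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, not part of the original project.
class ImportCleanup {

    /** For "x.html" returns the sibling folder "x.files". */
    static Path companionFolder(Path htmlFile) {
        String name = htmlFile.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot == -1 ? name : name.substring(0, dot);
        return htmlFile.resolveSibling(base + ".files");
    }

    /** Deletes a directory tree; does nothing if the path does not exist. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir); // the directory is empty by the time we get here
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Removes the temporary files of one import; null arguments are skipped. */
    static void cleanUp(Path wordFile, Path htmlFile) throws IOException {
        if (wordFile != null) {
            Files.deleteIfExists(wordFile);
        }
        if (htmlFile != null) {
            Files.deleteIfExists(htmlFile);
            deleteRecursively(companionFolder(htmlFile));
        }
    }
}
```

A call such as `ImportCleanup.cleanUp(Paths.get(wordPath), Paths.get(htmlPath))` in the finally block would cover the same three deletions, with each step independent of whether the others succeeded in creating their files.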