I use the XWPF converter from xdocreport to read docx documents and convert them to HTML. The core code is as follows:
public static String word2htmlString(XWPFDocument document, FileImageExtractorImpl imageProcessor) throws IOException {
    String result = null;
    ByteArrayOutputStream htmlOutputStream = null;
    try {
        XHTMLOptions options = XHTMLOptions.create();
        // Route embedded images through our own extractor
        options.setExtractor(imageProcessor);
        options.setIgnoreStylesIfUnused(true);
        options.setFragment(true);
        htmlOutputStream = new ByteArrayOutputStream();
        XHTMLConverter.getInstance().convert(document, htmlOutputStream, options);
        // Decode the converter output as UTF-8, then rewrite local image URLs to remote ones
        result = imageProcessor.convertLocalUrlToRemoteUrl(htmlOutputStream.toString("utf-8"));
    } catch (Exception e) {
        throw new AppBizException(HttpResult.ERROR_TEMPLATE_PARSE_ERROR, e.getMessage());
    } finally {
        if (null != htmlOutputStream) {
            htmlOutputStream.close();
        }
    }
    return result;
}
The dependencies were:
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
    <version>1.0.5</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.12</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.2.6</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.12</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.12</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.12</version>
</dependency>
Later, in order to integrate easyexcel, the POI version had to be upgraded to 3.17, and the dependencies became:
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.10</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>easyexcel</artifactId>
    <version>2.2.6</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>
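After the upgrade, importing Word documents worked fine on Windows and Ubuntu, but inside Docker the output was garbled: every Chinese character turned into a question mark (????). At first I suspected a font problem, but installing Chinese fonts made no difference. In the end I attached IDEA to the Docker container for remote debugging and found the offending code: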
private void write( String content )
    throws SAXException
{
    try
    {
        if ( out != null )
        {
            out.write( content.getBytes() );
        }
        else
        {
            writer.write( content );
        }
    }
    catch ( IOException e )
    {
        throw new SAXException( e );
    }
}
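Note the call content.getBytes(). It takes no charset argument, so it encodes with the platform default charset. Stepping into it leads to this code: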
static byte[] encode(byte coder, byte[] val) {
    Charset cs = Charset.defaultCharset();
Printing this value showed US-ASCII. That was the problem, so I went off to search for how to set the default character set in Docker.
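The effect is easy to reproduce outside Docker. The following is only a minimal sketch, not code from xdocreport: the class name DefaultCharsetDemo and the helper roundTrip are made up for illustration. It encodes a Chinese string with an explicit charset, the way getBytes() would under a given default, and then decodes the bytes as UTF-8, which is what the calling code does with the converter output.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of xdocreport or the original project.
class DefaultCharsetDemo {

    // Encode the text with the given charset, then decode the bytes as UTF-8.
    // This mimics a writer that encodes with one charset while the reader assumes UTF-8.
    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so the
        // encoding of this source file does not matter.
        String chinese = "\u4e2d\u6587";

        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent these characters, so each one is
        // replaced by '?' during encoding.
        System.out.println("US-ASCII: " + roundTrip(chinese, StandardCharsets.US_ASCII)); // prints "US-ASCII: ??"

        // With UTF-8 the text survives the round trip unchanged.
        System.out.println("UTF-8 round trip intact: "
                + chinese.equals(roundTrip(chinese, StandardCharsets.UTF_8))); // prints "UTF-8 round trip intact: true"
    }
}
```

The replacement happens at encoding time, so the original characters are already lost before the bytes reach the output stream. That is probably why installing fonts made no difference.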
Add the following to the Dockerfile:
ENV LANG C.UTF-8
Reference:
https://blog.csdn.net/u011077672/article/details/70569849
After rebuilding the image it worked as expected, with no more garbled text.
Debugging again, Charset.defaultCharset() now returned UTF-8.
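To avoid needing a debugger next time, the same check can run at application startup. This is a rough sketch under my own assumptions: the class CharsetStartupCheck and the method warningFor are hypothetical names, and where to call it from is left open.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical startup check, not part of the original project.
class CharsetStartupCheck {

    // Returns null when the given charset is UTF-8, otherwise a warning message.
    static String warningFor(Charset defaultCharset) {
        if (StandardCharsets.UTF_8.equals(defaultCharset)) {
            return null;
        }
        return "Default charset is " + defaultCharset.name()
                + ", not UTF-8; String.getBytes() without a charset may turn non-ASCII text into '?'";
    }

    public static void main(String[] args) {
        String warning = warningFor(Charset.defaultCharset());
        System.out.println(warning == null ? "Default charset is UTF-8" : warning);

        // The LANG environment variable is what the Dockerfile line sets;
        // printing it helps confirm the image was rebuilt with the change.
        System.out.println("LANG=" + System.getenv("LANG"));
    }
}
```

Running this inside the container before and after the Dockerfile change should show the default charset switching from US-ASCII to UTF-8, matching what the debugger showed.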