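The prototype 'parseMhtToFile' function below extracts the html files and other artifacts from a Cognos active report 'mht' file, but it can be adapted to other purposes.
It is written in Groovy and requires the Mime4J 'core' and 'dom' jars (currently 0.7.2).

import org.apache.james.mime4j.dom.Message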
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.dom.field.ContentTypeField
import org.apache.james.mime4j.message.DefaultMessageBuilder
import org.apache.james.mime4j.stream.MimeConfig
/**
 * Use Mime4J MessageBuilder to parse an mhtml file (assumes multipart) into
 * separate files, one per body part.
 * Files will be written to outDir (or parent) as baseName + partIdx + ext.
 */
void parseMhtToFile(File mhtFile, File outDir = null) {
    if (!outDir) { outDir = mhtFile.parentFile }
    // File baseName will be used in generating new filenames
    // (the dot is escaped so only a real ".ext" suffix is stripped)
    def mhtBaseName = mhtFile.name.replaceFirst(~/\.[^.]+$/, '')

    //-- Set up Mime parser, using Default Message Builder
    MimeConfig parserConfig = new MimeConfig()
    parserConfig.setMaxHeaderLen(-1)   // The default is a mere 10k
    parserConfig.setMaxLineLen(-1)     // The default is only 1000 characters.
    parserConfig.setMaxHeaderCount(-1) // Disable the check for header count.
    DefaultMessageBuilder builder = new DefaultMessageBuilder()
    builder.setMimeEntityConfig(parserConfig)

    //-- Parse the MHT stream data into a Message object
    println "Parsing ${mhtFile}..."
    // withInputStream closes the file stream once parsing has finished
    Message message = mhtFile.withInputStream { builder.parseMessage(it) }

    //-- Process the resulting body parts, writing to file
    assert message.getBody() instanceof Multipart
    Multipart multipart = (Multipart) message.getBody()
    def parts = multipart.getBodyParts()
    parts.eachWithIndex { p, i ->
        ContentTypeField cType = p.header.getField('content-type')
        println "${p.class.simpleName}\t${i}\t${cType.mimeType}"
        // Assume mime sub-type is a "good enough" file-name extension
        // e.g. text/html = html, image/png = png, application/json = json
        String partFileName = "${mhtBaseName}_${i}.${cType.subType}"
        File partFile = new File(outDir, partFileName)
        // Write part body stream to file
        println "Writing ${partFile}..."
        if (partFile.exists()) partFile.delete()
        InputStream partStream = p.body.inputStream
        partFile.append(partStream)
    }
}

Usage is simple:

File mhtFile = new File('', 'Report-en-au.mht')
parseMhtToFile(mhtFile)
println 'Done.'

The output is:

Parsing Report-en-au.mht...
BodyPart 0 text/html
Writing Report-en-au_0.html...
BodyPart 1 image/png
Writing Report-en-au_1.png...
Done.
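
Ideas for further improvements: for 'text' MIME parts, you can access a Reader instead of a Stream, which may be more appropriate for text mining, as the OP requested.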
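
As a rough sketch of the Reader idea: Mime4J's TextBody exposes getReader(), which decodes the part using its declared charset. The helper name textOf and the tiny sample message are made up for illustration, and the @Grab coordinates assume the jars are pulled from Maven Central rather than placed on the classpath by hand. It also assumes the default builder produces a TextBody for text/* parts.

```groovy
@Grab('org.apache.james:apache-mime4j-dom:0.7.2')
import org.apache.james.mime4j.dom.Entity
import org.apache.james.mime4j.dom.Message
import org.apache.james.mime4j.dom.TextBody
import org.apache.james.mime4j.message.DefaultMessageBuilder

// Hypothetical helper: decoded text of a part, or null if it is not a text part
String textOf(Entity part) {
    def body = part.body
    if (!(body instanceof TextBody)) return null
    // getReader() yields characters, not bytes; Groovy's getText() reads and closes it
    return body.reader.text
}

// Tiny single-part message, only for demonstration
def raw = 'Content-Type: text/plain; charset=UTF-8\r\n\r\nhello report\r\n'
Message msg = new DefaultMessageBuilder()
        .parseMessage(new ByteArrayInputStream(raw.getBytes('UTF-8')))
def sampleText = textOf(msg)
println sampleText
```

Inside parseMhtToFile, the same check (p.body instanceof TextBody) could decide whether to hand the text to a mining step or to fall back to writing the raw stream.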
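For the generated file extensions, I would use another library to look up an appropriate extension rather than assume the MIME sub-type is adequate.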
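
Short of pulling in a library, a small lookup table with a fallback to the sub-type already covers the common mismatches. This is only a sketch: the function name extensionFor and the table entries are illustrative, not part of Mime4J, and the table is deliberately incomplete.

```groovy
// Hypothetical helper: map a MIME type to a file extension,
// falling back to the sub-type when the type is not in the table.
String extensionFor(String mimeType) {
    Map<String, String> known = [
        'text/plain'              : 'txt',
        'text/javascript'         : 'js',
        'image/jpeg'              : 'jpg',
        'image/svg+xml'           : 'svg',
        'application/octet-stream': 'bin'
    ]
    String key = mimeType.toLowerCase()
    // Everything after the slash is the sub-type, e.g. "html" in "text/html"
    return known[key] ?: key.substring(key.indexOf('/') + 1)
}

println extensionFor('image/jpeg')
println extensionFor('text/html')
```

In parseMhtToFile, cType.subType would then be replaced by extensionFor(cType.mimeType).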
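Handle single-body (non-multipart) and recursive multipart mhtml files, and other complexities. These may require a MimeStreamParser with a custom ContentHandler implementation.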
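
For nested multiparts and single-body files, a lighter alternative to a custom ContentHandler is to stay with the DOM API and walk the tree recursively. Note that this swaps in a different technique from the MimeStreamParser route mentioned above. The helper name leafParts and the sample message are invented for illustration, the @Grab coordinates assume Maven Central, and embedded message/rfc822 parts are simply treated as leaves here.

```groovy
@Grab('org.apache.james:apache-mime4j-dom:0.7.2')
import org.apache.james.mime4j.dom.Entity
import org.apache.james.mime4j.dom.Multipart
import org.apache.james.mime4j.message.DefaultMessageBuilder

// Hypothetical helper: collect all non-multipart entities, depth-first
List<Entity> leafParts(Entity entity, List<Entity> found = []) {
    def body = entity.body
    if (body instanceof Multipart) {
        // Descend into each child; children may themselves be multipart
        body.bodyParts.each { leafParts(it, found) }
    } else if (body != null) {
        // A single-body message ends up here as its own only leaf
        found << entity
    }
    return found
}

// Sample message with one level of nesting, only for demonstration
def raw = [
    'MIME-Version: 1.0',
    'Content-Type: multipart/related; boundary="outer"',
    '',
    '--outer',
    'Content-Type: text/html',
    '',
    '<html></html>',
    '--outer',
    'Content-Type: multipart/alternative; boundary="inner"',
    '',
    '--inner',
    'Content-Type: text/plain',
    '',
    'plain text',
    '--inner--',
    '--outer--',
    ''
].join('\r\n')
def msg = new DefaultMessageBuilder()
        .parseMessage(new ByteArrayInputStream(raw.getBytes('US-ASCII')))
def leafTypes = leafParts(msg)*.mimeType
println leafTypes
```

With this in place, parseMhtToFile could iterate over leafParts(message) instead of asserting that the top-level body is a Multipart.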