Reposted from: http://www.cnblogs.com/shanyou/archive/2007/11/28/975941.html
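To convert an Office document to an MHTML document, the Office document is first converted to HTML, and the HTML document is then converted to MHTML. Converting Office documents to HTML requires the Microsoft.HtmlTrans.Interface assembly, which becomes available after installing the "HTML conversion server". The HTML conversion server is an optional component of a Windows SharePoint Services server farm. Its installation files can be found on the Microsoft website.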
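Install it as follows:
1. Extract the downloaded archive. It contains:
   eng11probypass.mst
   htmltrbackend.msi
   HTML Viewer WhitePaper document
2. If Office is already installed, uninstall it first, then install an Office that supports HTML Viewer Services: in the Office installation path, locate the folder that contains the Setup file; copy eng11probypass.mst into that folder; at a command prompt, enter: Setup transforms=eng11probypass.mst
3. Install HTML Viewer Server: run htmltrbackend.msi.
After installation, locate Microsoft.HtmlTrans.Interface.dll, copy it into the project folder, and add a reference to it in the project. The code uses two Remoting objects from the Microsoft.HtmlTrans namespace, htmlTrLoadBalancer and htmlTrLauncher, to convert Office documents to HTML files. Note the following limitations: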
Document types not supported are:
Master documents in Word (see Word Help for an explanation of Master document)
Password protected documents, workbooks, and presentations (encrypted)
Word documents that use framesets
Files that contain Excel 4.0 macros
WordPerfect files
For files with embedded objects, VBA, scripts, etc, the following rules apply:
VBA is ignored and not executed; However, the VBA project (source code, dialog definitions, etc) is retained
Embedded and linked objects are converted to graphic images and displayed in the approximate location where they were in the source file
Linked or embedded objects with password protection are not converted
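The other difficulty in the implementation is converting HTML to MHTML. MHTML stands for MIME Encapsulation of Aggregate HTML. It is a web encoding format: a MIME standard that defines how HTML content is carried in the body of an e-mail message. Put simply, an HTML file and every resource it includes (.css files, .js files, images, and so on) are combined into a single MHTML file. Below is a typical MHTML file (lines starting with ";" are explanations):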
Mime-Version: 1.0
; Content-Location is the address of the main file and can be set freely
; Content-Type is the type of the MHTML file; multipart/related means the file contains several file types
; boundary defines the separator between files and can be chosen freely; type is the format of the main file
Content-Type: multipart/related; boundary="boundary-example";type="text/html"

; A separator line prefixed with "--" marks the start of a file
--boundary-example
; The file headers follow; text/html is the type of this file; charset is the character set it uses
Content-Type: text/html; charset="ISO-8859-1"
; Content-Transfer-Encoding is the encoding of this file;
; there are usually two kinds: text files generally use "QUOTED-PRINTABLE";
; binary files generally use "BASE64"
Content-Transfer-Encoding: QUOTED-PRINTABLE

; The body follows
... text of the HTML document, which might contain URIs
referencing resources in other body parts, for example through
statements such as:
Example of a copyright sign encoded with Quoted-Printable: =A9
Example of a copyright sign mapped onto HTML markup: &#168;
--boundary-example
; Content-Location is the address of this file, either absolute or relative to the main file; this one is absolute
Content-Location:http://www.ietf.cnri.reston.va.us/images/ietflogo1.gif
Content-Type: IMAGE/GIF
; A binary file, encoded with BASE64
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...
--boundary-example
; This one is a relative address
Content-Location: images/ietflogo2.gif
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...
--boundary-example
Content-Location:http://www.ietf.cnri.reston.va.us/images/ietflogo3.gif
Content-Transfer-Encoding: BASE64

R0lGODlhGAGgAPEAAP/ZRaCgoAAAACH+PUNvcHlyaWdodCAoQykgMTk5
NSBJRVRGLiBVbmF1dGhvcml6ZWQgZHVwbGljYXRpb24gcHJvaGliaXRlZC4A
etc...
; This is the end marker: the MHTML file ends here. The separator has "--" both before and after it
--boundary-example--
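The example above is the standard MHTML file format, but a file that follows only the standard does not display correctly in IE. The following points also need attention:
1. In every text-type file, replace each "=" with "=3D".
2. Every BASE64-encoded file must be broken into lines.
3. The separator at the start of each file is prefixed with "--", and the last separator has "--" both before and after it.
4. A body must be separated by line breaks from its file headers and from the separator of the next file.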
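The four points can be tried out in isolation before reading the full listing. The following is a minimal sketch, not the author's implementation: the class MhtmlSketch, its method names, and the file:///c:/demo/ locations are all hypothetical, and EscapeEquals handles only the "=" rule rather than full quoted-printable encoding. It relies on Convert.ToBase64String with Base64FormattingOptions.InsertLineBreaks, which breaks the output every 76 characters.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the original post or of Microsoft.HtmlTrans.
public static class MhtmlSketch
{
    // Point 1: write every "=" in a text part as "=3D".
    // Only "=" is handled; a full quoted-printable encoder would also
    // escape non-ASCII bytes and wrap long lines.
    public static string EscapeEquals(string text)
    {
        return text.Replace("=", "=3D");
    }

    // Point 2: break Base64 output into lines.
    // InsertLineBreaks adds a CRLF after every 76 characters.
    public static string ToWrappedBase64(byte[] data)
    {
        return Convert.ToBase64String(data, Base64FormattingOptions.InsertLineBreaks);
    }

    // Points 3 and 4: a part starts with "--" + boundary, and a blank
    // line separates its headers from its body.
    public static string BuildPart(string boundary, string location,
        string contentType, string transferEncoding, string body)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("\r\n--" + boundary + "\r\n");
        sb.Append("Content-Location: " + location + "\r\n");
        sb.Append("Content-Transfer-Encoding: " + transferEncoding + "\r\n");
        sb.Append("Content-Type: " + contentType + "\r\n\r\n");
        sb.Append(body + "\r\n");
        return sb.ToString();
    }

    // Assemble a two-part document: one HTML page and one image.
    public static string BuildDocument(string boundary, string html, byte[] image)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append("MIME-Version: 1.0\r\n");
        sb.Append("Content-Type: multipart/related; boundary=\"" + boundary + "\"\r\n");
        sb.Append(BuildPart(boundary, "file:///c:/demo/page.htm",
            "text/html; charset=\"us-ascii\"", "quoted-printable", EscapeEquals(html)));
        sb.Append(BuildPart(boundary, "file:///c:/demo/logo.gif",
            "image/gif", "base64", ToWrappedBase64(image)));
        // The closing separator has "--" on both sides.
        sb.Append("\r\n--" + boundary + "--\r\n");
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] fakeImage = new byte[60]; // placeholder bytes, not a real GIF
        Console.Write(BuildDocument("boundary-example", "<img src=\"logo.gif\">", fakeImage));
    }
}
```

Building the parts as strings keeps the sketch short; the listing that follows works on byte arrays instead, which avoids re-encoding the converted files.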
Implementation code:
using System;
using System.Collections;
using System.IO;
using System.Text;
using Microsoft.HtmlTrans;

namespace MSOfficeHelper
{
    public class Conversion
    {
        // Encoding used for strings
        protected static Encoding encoding = Encoding.Default;
        // URL used to create the IHtmlTrLoadBalancer remoting object
        protected static string strServiceUrl = System.Configuration.ConfigurationSettings.AppSettings["OfficeHtmlViewService"];
        public static void ConvertMHT(string inputfile, string outputfile)
        {
            // Get an IHtmlTrLoadBalancer remoting object from the URL (strServiceUrl)
            IHtmlTrLoadBalancer htmlTrLoadBalancer = (IHtmlTrLoadBalancer) System.Activator.GetObject(
                typeof(IHtmlTrLoadBalancer), strServiceUrl);
            // Use the input file name (inputfile) as the task identifier (strTask)
            string strTask = inputfile;
            // Create a task for that identifier and get the task's URL (strLauncherUri)
            string strLauncherUri = htmlTrLoadBalancer.StrGetLauncher(strTask);
            // Get an IHtmlTrLauncher remoting object (htmlTrLauncher) from the task URL
            // and use it to run the task
            IHtmlTrLauncher htmlTrLauncher = (IHtmlTrLauncher) System.Activator.GetObject(
                typeof(IHtmlTrLauncher), strLauncherUri);
            // Read the contents of the input file (inputfile) into a byte array (bFile)
            byte[] bFile = null;
            FileStream fsInputMht = null;
            BinaryReader bwInputMht = null;
            try
            {
                fsInputMht = new FileStream(inputfile, FileMode.Open);
                bwInputMht = new BinaryReader(fsInputMht, encoding);
                bFile = new byte[fsInputMht.Length];
                for (long i = 0; i < fsInputMht.Length; i++)
                    bFile[i] = bwInputMht.ReadByte();
            }
            finally
            {
                // Close with null checks so a failed open does not raise a second exception
                if (bwInputMht != null) bwInputMht.Close();
                if (fsInputMht != null) fsInputMht.Close();
            }
            // CHICreateHtml creates the HTML file and its attachments from the Office document
            // CHICreateHtml(
            //   string strLauncherUri,               URL of the task
            //   byte[] rgbFile,                      binary content of the Office document
            //   Microsoft.HtmlTrans.BrowserType bt,  target browser type (an enumeration)
            //   string strReqFile,                   path/URL of the Office document
            //   string strTaskName,                  task identifier the HTML conversion server uses to track the request
            //   int timeout,                         conversion timeout; use a larger value on a slow network
            //   bool fReturnFileBits                 whether to return the binary content, stored in the rgbMainFile
            //                                        and rgrgbThicketFiles properties of CreateHtmlInfo
            // );
            CreateHtmlInfo chi = htmlTrLauncher.CHICreateHtml(strLauncherUri, bFile,
                BrowserType.BT_IE4, inputfile, strTask, 200, true);
            // End the conversion task
            htmlTrLoadBalancer.LauncherTaskCompleted(strLauncherUri, strTask);
            // Continue only if the conversion reported no error and a main file exists
            if (chi.ce == CreationErrorType.CE_NONE && chi.fHasMainFile)
            {
                FileStream fsOutputMht = null;
                BinaryWriter bwOutputMht = null;
                try
                {
                    fsOutputMht = new FileStream(outputfile, FileMode.Create);
                    bwOutputMht = new BinaryWriter(fsOutputMht, encoding);
                    // Convert the HTML file and its attachments into an MHTML file
                    byte[] bMHTMLBody = CreateMHTMLBody(chi);
                    string temp = System.Text.Encoding.Default.GetString(bMHTMLBody);
                    StringBuilder sb = new StringBuilder();
                    foreach (char c in temp.ToCharArray())
                    {
                        string t = c.ToString();
                        // Write characters above code point 500 as numeric character references (&#NNN;)
                        if ((uint) c > 500)
                        {
                            t = "&#" + ((uint) c).ToString() + ";";
                        }
                        sb.Append(t);
                    }
                    bMHTMLBody = Encoding.ASCII.GetBytes(sb.ToString());
                    bwOutputMht.Write(bMHTMLBody);
                }
                finally
                {
                    // Close with null checks so a failed open does not raise a second exception
                    if (bwOutputMht != null) bwOutputMht.Close();
                    if (fsOutputMht != null) fsOutputMht.Close();
                }
            }
        }
        // Header of the MHTML file
        protected static string MIME = "MIME-Version: 1.0" + Environment.NewLine +
            "Content-Type: multipart/related; boundary=\"{0}\"" + Environment.NewLine + Environment.NewLine;
        // Header of each file inside the MHTML file
        protected static string HEADER = Environment.NewLine + "--{0}" + Environment.NewLine
            + "Content-Location: {1}" + Environment.NewLine
            + "Content-Transfer-Encoding: {2}" + Environment.NewLine
            + "Content-Type: {3}" + Environment.NewLine + Environment.NewLine;
        // Separator between the files inside the MHTML file
        protected static string BOUNDARY = "Define_It_Youself";
        // URL of the MHTML main file
        protected static string LOCATION = string.Format(@"file:///c:/{0}/", Guid.NewGuid());
        private static byte[] CreateMHTMLBody(CreateHtmlInfo creatHtmlInfo)
        {
            // Encode the line break (CR LF) into a byte array
            byte[] bNewLine = Encoding.UTF8.GetBytes(Environment.NewLine);
            // Encode "3D" into a byte array
            byte[] bAfterEquals = encoding.GetBytes("3D");
            // The byte value of '=' is 61
            byte bEquals = 61;
            // Length of the MHTML file
            long lMHTMLBodyLength = 0;
            // Zero-based byte offset
            long lOffset = 0;
            // Build the MHTML file header from BOUNDARY
            string strMIME = string.Format(MIME, BOUNDARY);
            // Encode the header into a byte array
            byte[] bMIME = encoding.GetBytes(strMIME);
            // Add bMIME.LongLength to the MHTML file length
            lMHTMLBodyLength += bMIME.LongLength;
            // Build the header of the main file
            string strMainHeader = string.Format(HEADER,
                BOUNDARY,
                LOCATION + creatHtmlInfo.strMainFileName,
                TransferEncoding.QUOTED_PRINTABLE,
                ContentType.TEXT_HTML);
            byte[] bMainHeader = encoding.GetBytes(strMainHeader);
            lMHTMLBodyLength += bMainHeader.LongLength;
            // Temporary growable array
            ArrayList alTempArray = new ArrayList();
            // Replace every "=" in the main file body with "=3D"
            for (int i = 0; i < creatHtmlInfo.rgbMainFile.Length; i++)
            {
                alTempArray.Add(creatHtmlInfo.rgbMainFile[i]);
                if (creatHtmlInfo.rgbMainFile[i] == bEquals)
                {
                    alTempArray.Add(bAfterEquals[0]);
                    alTempArray.Add(bAfterEquals[1]);
                }
            }
            // Store the rewritten main file body in a byte array
            byte[] bMainBody = new byte[alTempArray.Count];
            alTempArray.CopyTo(bMainBody);
            lMHTMLBodyLength += bMainBody.LongLength;
            alTempArray.Clear();
            // Byte arrays holding the body of each MHTML attachment (a jagged array)
            byte[][] bThicketContent = null;
            // Headers of the MHTML attachments
            string[] strThicketHeaders = null;
            // Run the following only if the MHTML file has attachments
            if (creatHtmlInfo.fHasThicket)
            {
                bThicketContent = new byte[creatHtmlInfo.rgrgbThicketFiles.Length][];
                strThicketHeaders = new string[creatHtmlInfo.rgrgbThicketFiles.Length];
                for (int i = 0; i < creatHtmlInfo.rgrgbThicketFiles.Length; i++)
                {
                    // Build the header of the attachment
                    string strLocation = LOCATION + creatHtmlInfo.strThicketFolderName + "/" + creatHtmlInfo.rgstrThicketFileNames[i];
                    string strTransferEncoding = TransferEncoding.GetTransferEncodingByFileName(
                        creatHtmlInfo.rgstrThicketFileNames[i]);
                    string strContentType = ContentType.GetContentTypeByFileName(
                        creatHtmlInfo.rgstrThicketFileNames[i]);
                    strThicketHeaders[i] = string.Format(HEADER,
                        BOUNDARY,
                        strLocation,
                        strTransferEncoding,
                        strContentType);
                    byte[] bThicketHeader = encoding.GetBytes(strThicketHeaders[i]);
                    StringBuilder strBase64ThicketBody = new StringBuilder();
                    byte[] bThicketBody = null;
                    // Binary attachments are encoded with BASE64
                    if (strTransferEncoding == TransferEncoding.BASE64)
                    {
                        // First convert the bytes to a Base64 string
                        strBase64ThicketBody.Append(
                            Convert.ToBase64String(creatHtmlInfo.rgrgbThicketFiles[i]));
                        // Then encode that string into a new byte array
                        bThicketBody = encoding.GetBytes(strBase64ThicketBody.ToString());
                        // Insert a line break after every 76 bytes
                        int BUFFER_SIZE = 76;
                        for (int j = 0; j < bThicketBody.Length; j++)
                        {
                            alTempArray.Add(bThicketBody[j]);
                            if (j % BUFFER_SIZE == BUFFER_SIZE - 1)
                            {
                                alTempArray.Add(bNewLine[0]);
                                alTempArray.Add(bNewLine[1]);
                            }
                        }
                        bThicketBody = new byte[alTempArray.Count];
                        alTempArray.CopyTo(bThicketBody);
                        alTempArray.Clear();
                    }
                    // Text attachments are written as plain text, with every "=" in the body replaced by "=3D"
                    else
                    {
                        for (int j = 0; j < creatHtmlInfo.rgrgbThicketFiles[i].Length; j++)
                        {
                            alTempArray.Add(creatHtmlInfo.rgrgbThicketFiles[i][j]);
                            if (creatHtmlInfo.rgrgbThicketFiles[i][j] == bEquals)
                            {
                                alTempArray.Add(bAfterEquals[0]);
                                alTempArray.Add(bAfterEquals[1]);
                            }
                        }
                        bThicketBody = new byte[alTempArray.Count];
                        alTempArray.CopyTo(bThicketBody);
                        alTempArray.Clear();
                    }
                    // For .htm attachments, insert a base tag
                    string ext = Path.GetExtension(creatHtmlInfo.rgstrThicketFileNames[i]).ToLower();
                    if (ext == ".htm")
                    {
                        string body = Encoding.Default.GetString(bThicketBody);
                        // NOTE: the search string on the next line is truncated in the source and is left as-is
                        int start = body.IndexOf("
                        if (start > -1)
                        {
                            body = body.Insert(
                                start,
                                string.Format(
                                    // NOTE: the markup inside these literals was stripped in the source; they are left as-is
                                    "\r\n\r\n"
                                    + "{0}\"\r\n" + "id=3D\"webarch_temp_base_tag\">\r\n" + "\r\n",
                                    LOCATION + creatHtmlInfo.strThicketFolderName + @"/" + creatHtmlInfo.rgstrThicketFileNames[i]
                                )
                            );
                        }
                        byte[] data = Encoding.Default.GetBytes(body);
                        bThicketBody = new byte[data.Length];
                        data.CopyTo(bThicketBody, 0);
                    }
                    // Merge the attachment's header bytes and body bytes into bThicketContent[i],
                    // and add the resulting length to lMHTMLBodyLength
                    bThicketContent[i] = new byte[bThicketHeader.LongLength + bThicketBody.LongLength + bNewLine.LongLength];
                    Array.Copy(bThicketHeader, 0, bThicketContent[i], 0, bThicketHeader.LongLength);
                    Array.Copy(bThicketBody, 0, bThicketContent[i], bThicketHeader.LongLength, bThicketBody.LongLength);
                    Array.Copy(bNewLine, 0, bThicketContent[i],
                        bThicketHeader.LongLength + bThicketBody.LongLength, bNewLine.LongLength);
                    lMHTMLBodyLength += bThicketContent[i].LongLength;
                }
            }
            // Store the closing separator of the MHTML file in a byte array
            byte[] bEndBoundary = encoding.GetBytes(
                Environment.NewLine + "--" + BOUNDARY + "--" + Environment.NewLine);
            lMHTMLBodyLength += bEndBoundary.LongLength;
            // Allocate the array that holds the whole MHTML file
            byte[] bMHTMLBody = new byte[lMHTMLBodyLength];
            // Copy every piece into bMHTMLBody in order
            Array.Copy(bMIME, 0, bMHTMLBody, lOffset, bMIME.LongLength);
            lOffset += bMIME.LongLength;
            Array.Copy(bMainHeader, 0, bMHTMLBody, lOffset, bMainHeader.LongLength);
            lOffset += bMainHeader.LongLength;
            Array.Copy(bMainBody, 0, bMHTMLBody, lOffset, bMainBody.LongLength);
            lOffset += bMainBody.LongLength;
            if (bThicketContent != null)
                for (int i = 0; i < bThicketContent.Length; i++)
                {
                    Array.Copy(bThicketContent[i], 0, bMHTMLBody, lOffset, bThicketContent[i].LongLength);
                    lOffset += bThicketContent[i].LongLength;
                }
            Array.Copy(bEndBoundary, 0, bMHTMLBody, lOffset, bEndBoundary.LongLength);
            return bMHTMLBody;
        }
    }
    // Chooses the transfer encoding from the file extension
    class TransferEncoding
    {
        public const string QUOTED_PRINTABLE = "quoted-printable";
        public const string BASE64 = "base64";

        public static string GetTransferEncodingByFileName(string fileName)
        {
            string strResult = string.Empty;
            // Path.GetExtension returns "" when the name has no extension
            string strExtension = Path.GetExtension(fileName).ToUpper();
            switch (strExtension)
            {
                // These files (and any unlisted extension) are written as plain text in the MHTML file
                default:
                case ".HTM":
                case ".HTML":
                case ".XML":
                    strResult = TransferEncoding.QUOTED_PRINTABLE;
                    break;
                // These files are written BASE64-encoded in the MHTML file
                case ".JPG":
                case ".JPEG":
                case ".PNG":
                case ".MSO":
                case ".EMZ":
                case ".GIF":
                case ".WMF":
                case ".WMZ":
                case ".CSS":
                    strResult = TransferEncoding.BASE64;
                    break;
            }
            return strResult;
        }
    }
    // Chooses the content type from the file extension
    class ContentType
    {
        public const string TEXT_HTML = "text/html; charset=\"us-ascii\"";
        public const string APPLICATION_XMSO = "application/x-mso";
        public const string IMAGE_XEMZ = "image/x-emz";
        public const string IMAGE_GIF = "image/gif";
        public const string TEXT_CSS = "text/css";
        public const string TEXT_XML = "text/xml; charset=\"utf-8\"";
        public const string IMAGE_XWMF = "image/x-wmf";
        public const string IMAGE_PNG = "image/png";
        public const string IMAGE_JPEG = "image/jpeg";
        public const string TEXT_JS = "application/javascript; charset=\"us-ascii\"";
        public const string IMAGE_WMZ = "image/x-wmz";

        public static string GetContentTypeByFileName(string fileName)
        {
            // Path.GetExtension returns "" when the name has no extension
            string strExtension = Path.GetExtension(fileName).ToUpper();
            switch (strExtension)
            {
                case ".HTM":
                case ".HTML":
                    return ContentType.TEXT_HTML;
                case ".MSO":
                    return ContentType.APPLICATION_XMSO;
                case ".EMZ":
                    return ContentType.IMAGE_XEMZ;
                case ".GIF":
                    return ContentType.IMAGE_GIF;
                case ".CSS":
                    return ContentType.TEXT_CSS;
                case ".XML":
                    return ContentType.TEXT_XML;
                case ".WMF":
                    return ContentType.IMAGE_XWMF;
                case ".PNG":
                    return ContentType.IMAGE_PNG;
                case ".JPG":
                case ".JPEG":
                    return ContentType.IMAGE_JPEG;
                case ".JS":
                    return ContentType.TEXT_JS;
                case ".WMZ":
                    return ContentType.IMAGE_WMZ;
            }
            // Unknown extensions get an empty content type
            return string.Empty;
        }
    }
}
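The two extension switches above have to be kept in sync by hand. One possible alternative, shown here only as a sketch, is a single lookup table keyed by extension. The class PartTypeTable and its members are hypothetical names; the content types and encodings are the ones the listing uses, including its choice of BASE64 for .css files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; not part of the original listing.
public static class PartTypeTable
{
    // Extension -> { content type, transfer encoding }, copied from the listing.
    private static readonly Dictionary<string, string[]> Table = CreateTable();

    private static Dictionary<string, string[]> CreateTable()
    {
        // Case-insensitive keys make an explicit ToUpper() unnecessary.
        Dictionary<string, string[]> t =
            new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase);
        Add(t, ".htm", "text/html; charset=\"us-ascii\"", "quoted-printable");
        Add(t, ".html", "text/html; charset=\"us-ascii\"", "quoted-printable");
        Add(t, ".xml", "text/xml; charset=\"utf-8\"", "quoted-printable");
        Add(t, ".js", "application/javascript; charset=\"us-ascii\"", "quoted-printable");
        Add(t, ".css", "text/css", "base64");
        Add(t, ".gif", "image/gif", "base64");
        Add(t, ".png", "image/png", "base64");
        Add(t, ".jpg", "image/jpeg", "base64");
        Add(t, ".jpeg", "image/jpeg", "base64");
        Add(t, ".mso", "application/x-mso", "base64");
        Add(t, ".emz", "image/x-emz", "base64");
        Add(t, ".wmf", "image/x-wmf", "base64");
        Add(t, ".wmz", "image/x-wmz", "base64");
        return t;
    }

    private static void Add(Dictionary<string, string[]> t, string extension,
        string contentType, string transferEncoding)
    {
        t.Add(extension, new string[] { contentType, transferEncoding });
    }

    // Returns false for unknown extensions and falls back to the listing's
    // defaults: an empty content type and quoted-printable encoding.
    public static bool TryLookup(string fileName, out string contentType,
        out string transferEncoding)
    {
        string[] entry;
        if (Table.TryGetValue(Path.GetExtension(fileName), out entry))
        {
            contentType = entry[0];
            transferEncoding = entry[1];
            return true;
        }
        contentType = string.Empty;
        transferEncoding = "quoted-printable";
        return false;
    }

    public static void Main()
    {
        string contentType, transferEncoding;
        PartTypeTable.TryLookup("image001.GIF", out contentType, out transferEncoding);
        Console.WriteLine(contentType + " / " + transferEncoding);
    }
}
```

With one table, adding a new extension touches a single line, and a file name without an extension falls through to the defaults instead of needing special handling.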