C# Docx to HTML to Docx

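This article describes how to convert Docx to HTML and HTML to Docx in C#, including the required libraries and a way to handle broken hyperlinks in Word documents. Sample code shows how to convert a Docx file to HTML content and how to handle broken links that may throw exceptions. It also covers the need for, and background of, uploading from Docx files, and the hope that this feature will reach more rich text editors.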

Introduction

I could have uploaded this whole article from my docx file in just a few seconds, if only the WYSIWYG editor I wrote it in had an Upload from Docx button. Well, I could have just used the Paste from Word button. But to paste from a Word document, we need to have the Microsoft Office package installed on the system (on Windows).

This article is the solution to that problem, and it also helps C# developers perform Docx-HTML-Docx conversion. The resources in this article have been collected from many different places and from solutions provided by many awesome developers around the globe, then combined into one small sample application so that developers don't have to hunt around for solutions to common problems.

For now, we will look at how the conversion is done. In the next part of this article, we will create our very own CKEditor plugin to upload from Docx (coming soon :D).

Requirements

DocumentFormat.OpenXml.dll (2.6.0.0) [ For Docx to Html Conversion ]

DocumentFormat.OpenXml.dll (2.5.5631.0) [ For Html to Docx Conversion ]

Ideally we would not include two different versions of the same DLL, but it was necessary here due to some DLL issues.

OpenXmlPowerTools.dll

System.IO.Packaging.dll (1.0.0.0)

System.Drawing [ Add Reference ]

System.IO.Compression [ Add Reference ]

CKEditor (4.6.1 Standard) - Your choice

Note: You can also find the above-mentioned DLLs in the project attached to this article.

Background

Docx to HTML conversion is becoming a very common requirement these days, mainly if you have a CMS or are building one and want this feature in your WYSIWYG editor. You can also find many questions about Docx to HTML conversion on Stack Overflow.

The editor I wrote this article in also has its own Paste from Word button. It would have been much better if it also had a feature to upload directly from a docx file. I hope this feature will soon be available in all the WYSIWYG editors out there.

What this article intends to do is shown in the figure below:

[Figure: overview of what this article sets out to do]

If you don't know what a Docx file is: it is simply a package, much like an ordinary zip file, with its own standardized structure. If you uncompress a docx file with a zip extractor, this is what you get:

[Figure: contents of an uncompressed docx file]
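Because a docx file is a zip package, its parts can also be listed from code. The following is a minimal sketch rather than part of the sample application. It assumes a reference to System.IO.Compression (and, on the .NET Framework, System.IO.Compression.FileSystem for the ZipFile class). The class name DocxPackageSketch and the file name sample.docx are placeholders chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class DocxPackageSketch
{
    // Returns the full name of every entry stored in the package.
    public static List<string> ListParts(string docxPath)
    {
        var names = new List<string>();
        using (ZipArchive archive = ZipFile.OpenRead(docxPath))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }

    static void Main(string[] args)
    {
        // "sample.docx" is a placeholder; pass a real path as the first argument.
        string path = args.Length > 0 ? args[0] : "sample.docx";
        if (!File.Exists(path))
        {
            Console.WriteLine("File not found: " + path);
            return;
        }
        foreach (string name in ListParts(path))
            Console.WriteLine(name);
    }
}
```

For a typical Word document, the list includes entries such as [Content_Types].xml, _rels/.rels and word/document.xml.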

For full details of the package structure, see the following link:

Using the Code

Converting the data in a Docx file to HTML content is as simple as the following code:

C#

DocxToHTML.Converter.HTMLConverter converter = new DocxToHTML.Converter.HTMLConverter();
string htmlContent = converter.ConvertToHtml(YOUR-DOCX-FILE);

If you are building an ASP.NET application, you could just send the converted HTML content to the client, but for demo purposes I show the output in a CKEditor control inside a WinForms WebBrowser control.

[Figure: converted output shown in a CKEditor control inside a WinForms WebBrowser control]

One thing to watch for while parsing the docx content is broken hyperlinks, which can cause an exception. The following code handles that exception.

C#

string htmlText = string.Empty;
try
{
    htmlText = ParseDOCX(fileInfo);
}
catch (OpenXmlPackageException e)
{
    // A malformed hyperlink prevents the package from being opened.
    if (e.ToString().Contains("Invalid Hyperlink"))
    {
        // Let UriFixer replace each broken URI via FixUri, then parse again.
        using (FileStream fs = new FileStream(fullFilePath,
            FileMode.OpenOrCreate, FileAccess.ReadWrite))
        {
            UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
        }
        htmlText = ParseDOCX(fileInfo);
    }
}
return htmlText;

The actual parsing is done by this method:

C#

private string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        // Load the file into a MemoryStream and open the document from there.
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;

                // Use the document title if there is one, else the file path.
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                        .Descendants(DC.title)
                        .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    // Embed each image in the HTML as a base64 data URI.
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            // TIFF images are saved as GIF.
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        // Skip images in any other format.
                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        // Look up the MIME type from the bitmap's raw format.
                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                            .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                            string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                            new XAttribute(NoNamespace.src, imageSource),
                            imageInfo.ImgStyleAttribute,
                            imageInfo.AltText != null ?
                                new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                    htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch (OpenXmlPackageException)
    {
        // Rethrow so the caller's "Invalid Hyperlink" handler can run;
        // the catch-all below would otherwise swallow this exception.
        throw;
    }
    catch
    {
        return "File contains corrupt data";
    }
}
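The ImageHandler above embeds each image directly in the HTML as a data URI of the form data:<mime type>;base64,<payload>. The string-building step can be tried on its own with the sketch below. The class and method names (DataUriSketch, ToDataUri) are made up for illustration and are not part of OpenXmlPowerTools or the sample project.

```csharp
using System;
using System.Text;

static class DataUriSketch
{
    // Builds "data:<mime type>;base64,<payload>" from raw bytes,
    // using the same format string as the ImageHandler.
    public static string ToDataUri(byte[] bytes, string mimeType)
    {
        return string.Format("data:{0};base64,{1}",
            mimeType, Convert.ToBase64String(bytes));
    }

    static void Main()
    {
        // Two ASCII bytes stand in for real image data here.
        byte[] sample = Encoding.ASCII.GetBytes("Hi");
        Console.WriteLine(ToDataUri(sample, "text/plain"));
        // prints data:text/plain;base64,SGk=
    }
}
```

Embedding images this way keeps the generated HTML self-contained, at the cost of a larger document, since base64 encoding grows the payload by roughly a third.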

The URI-fixing code looks like this:

C#

private static string FixUri(string brokenUri)
{
    string newURI = string.Empty;
    if (brokenUri.Contains("mailto:"))
    {
        // Remove the first "mailto:".Length characters and keep the rest.
        int mailToCount = "mailto:".Length;
        brokenUri = brokenUri.Remove(0, mailToCount);
        newURI = brokenUri;
    }
    else
    {
        // Any other broken URI is replaced with an empty string.
        newURI = "";
    }
    return newURI;
}
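Note that FixUri checks for "mailto:" anywhere in the string but always removes the first seven characters, so it relies on the prefix being at the very start. A slightly more defensive variant might locate the prefix first. The sketch below is only one possible approach, and the names UriRepair and StripMailTo are hypothetical. It also behaves differently from FixUri when the prefix is not at the start, because it keeps only the text after the prefix.

```csharp
using System;

static class UriRepair
{
    // Returns the text that follows "mailto:", or an empty string
    // when the input is null, empty, or has no "mailto:" in it.
    public static string StripMailTo(string brokenUri)
    {
        const string prefix = "mailto:";
        if (string.IsNullOrEmpty(brokenUri)) return string.Empty;

        int index = brokenUri.IndexOf(prefix, StringComparison.OrdinalIgnoreCase);
        if (index < 0) return string.Empty;

        return brokenUri.Substring(index + prefix.Length).Trim();
    }

    static void Main()
    {
        Console.WriteLine(StripMailTo("mailto:someone@example.com"));
    }
}
```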

The HTML to Docx conversion can be viewed at the link below:

Sources

I would like to thank every individual for their contributions and helpful solutions to the various problems encountered on this topic.
