Preview Word files (docx) in HTML using ASP.NET, OpenXML and LINQ to XML

Reposted on 2011-06-10 22:20:00

Since an image (or even an example) says more than any text ever will, here's what I've created over the past few evenings:


Live examples:

Want the source code? Download it here.

Want to know how?

If you want to know how I did this, let me first tell you why I created it. After searching Google for something similar, I found a SharePoint blogger who did the same using a SharePoint XSL transformation document called DocX2Html.xsl. Great, but this document cannot be distributed without a SharePoint license. The only option for me was to build something similar myself.

ASP.NET handlers

The main idea of this project was to be able to type in a URL ending in ".docx", which would then render a preview of the underlying Word document. Luckily, ASP.NET provides a mechanism for this: HTTP handlers. An HttpHandler is the class instance that the .NET runtime calls to process an incoming request for a specific extension. So let's trick ASP.NET into believing ".docx" is an extension which should be handled by a custom class...

Creating a custom handler

A custom handler can be created quite easily. Just create a new class, and make it implement the IHttpHandler interface:



using System.Web;

/// <summary>
/// Word document HTTP handler
/// </summary>
public class WordDocumentHandler : IHttpHandler
{
    #region IHttpHandler Members

    /// <summary>
    /// Is the handler reusable?
    /// </summary>
    public bool IsReusable
    {
        get { return true; }
    }

    /// <summary>
    /// Process request
    /// </summary>
    /// <param name="context">Current http context</param>
    public void ProcessRequest(HttpContext context)
    {
        // Todo...
        context.Response.Write("Hello world!");
    }

    #endregion
}



Registering a custom handler

For ASP.NET to recognise our newly created handler, we must register it in Web.config:
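The registration snippet did not survive in this copy of the post. As a rough sketch of what such an entry typically looks like in the classic `<httpHandlers>` section, assuming the handler class lives in a namespace and assembly both called `MyWebApp` (placeholder names, not taken from the original project):

```xml
<configuration>
  <system.web>
    <httpHandlers>
      <!-- "MyWebApp" is a placeholder for your own namespace and assembly name -->
      <add verb="*" path="*.docx" type="MyWebApp.WordDocumentHandler, MyWebApp" />
    </httpHandlers>
  </system.web>
</configuration>
```

The `path` attribute is what maps every request ending in ".docx" to the handler, and `verb="*"` lets it respond to any HTTP method.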


Now if you are using IIS6, you should also register this extension so that it is handled by the .NET runtime.

In the application configuration, add the extension ".docx" and make it point to the following executable: C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\aspnet_isapi.dll

This should be it. Fire up your browser, browse to your web site and type anything.docx. You should see "Hello world!" appearing in a nice, white page.


As you may already know, Word 2007 files are OpenXML packages containing WordprocessingML markup. A .docx file can be opened using the System.IO.Packaging.Package class (which is available after adding a project reference to WindowsBase.dll).

The Package class was created for accessing any OpenXML package. This includes all Office 2007 file formats, but also custom OpenXML formats which you can implement yourself. Unfortunately, if you want to use Package to access an Office 2007 file, you'll have to implement a lot of utility functions to get the right parts from the OpenXML container.
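To give an idea of the plumbing involved, here is a minimal sketch of locating the main document part with nothing but System.IO.Packaging. The class and method names (`PackageReader`, `ReadMainDocumentXml`) are made up for this illustration and are not part of the downloadable source:

```csharp
using System;
using System.IO;
using System.IO.Packaging;   // requires a reference to WindowsBase.dll
using System.Linq;

public static class PackageReader
{
    // Relationship type that points from the package root to the main document part.
    private const string OfficeDocumentRelationship =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    /// <summary>
    /// Returns the raw XML of the main document part (usually /word/document.xml).
    /// </summary>
    public static string ReadMainDocumentXml(string path)
    {
        using (Package package = Package.Open(path, FileMode.Open, FileAccess.Read))
        {
            // The package-level relationship tells us where the main part lives.
            PackageRelationship relationship = package
                .GetRelationshipsByType(OfficeDocumentRelationship)
                .First();

            // Relationship targets are relative to the package root.
            Uri partUri = PackUriHelper.ResolvePartUri(
                new Uri("/", UriKind.Relative), relationship.TargetUri);
            PackagePart part = package.GetPart(partUri);

            using (StreamReader reader = new StreamReader(
                part.GetStream(FileMode.Open, FileAccess.Read)))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Every other part (styles, images, numbering) has to be resolved through relationships in the same way, which is exactly the kind of utility code an SDK can take off your hands.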

Luckily, Microsoft released an OpenXML SDK (CTP), which I also used in order to create this Word preview handler.


As you know, the latest .NET 3.5 release brought us something new and extremely handy: LINQ (Language Integrated Query). On Doug's blog, I read about Eric White's attempts to use LINQ to XML on OpenXML.


For implementing my handler, I basically used code similar to Eric's to run queries on a Word document's contents. Here's an example which fetches all paragraphs in a Word document:



// Requires System.IO, System.Linq, System.Xml.Linq and the OpenXML SDK packaging namespace
using (WordprocessingDocument document = WordprocessingDocument.Open("test.docx", false))
{
    // Register namespace
    XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Element shortcuts
    XName w_r = w + "r";
    XName w_ins = w + "ins";
    XName w_hyperlink = w + "hyperlink";

    // Load document's MainDocumentPart (document.xml) in XDocument
    XDocument xDoc = XDocument.Load(
            new StreamReader(document.MainDocumentPart.GetStream())
        );

    // Fetch paragraphs (w:body is a child of the root w:document element)
    var paragraphs = from l_paragraph in xDoc
                    .Root
                    .Element(w + "body")
                    .Descendants(w + "p")
         select new
         {
             TextRuns = l_paragraph.Elements().Where(z => z.Name == w_r || z.Name == w_ins || z.Name == w_hyperlink)
         };

    // Write paragraphs
    foreach (var paragraph in paragraphs)
    {
        // Fetch runs
        var runs = from l_run in paragraph.TextRuns
                   select new
                   {
                       Text = l_run.Descendants(w + "t").StringConcatenate(element => (string)element)
                   };

        // Write runs
        foreach (var run in runs)
        {
            // Use run.Text to fetch a text string
        }
    }
}



Now if you try to compile this code, you will get a compilation error, because it uses an extension method, StringConcatenate, which has not been defined yet.

Extension methods

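In the above example, I used an extension method named StringConcatenate. An extension method is, as the name implies, an "extension" to a known class: a static method whose first parameter is marked with this, so that it can be called as if it were an instance method. The following example defines the extension for all IEnumerable<T> instances: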



// Requires System, System.Collections.Generic and System.Text
public static class IEnumerableExtensions
{
    /// <summary>
    /// Concatenate strings
    /// </summary>
    /// <typeparam name="T">Type</typeparam>
    /// <param name="source">Source</param>
    /// <param name="func">Function delegate</param>
    /// <returns>Concatenated string</returns>
    public static string StringConcatenate<T>(this IEnumerable<T> source, Func<T, string> func)
    {
        StringBuilder sb = new StringBuilder();
        foreach (T item in source)
        {
            // Convert each item to a string and append it
            sb.Append(func(item));
        }
        return sb.ToString();
    }
}



Lambda expressions

Another thing you may have noticed in my example code, is a lambda expression:



z => z.Name == w_r || z.Name == w_ins || z.Name == w_hyperlink



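A lambda expression is actually an anonymous function. This one is passed to the Where method, which calls it once for every child element of the paragraph and keeps only the elements for which it returns true. In this case, z is an XElement, and the lambda returns true/false depending on its Name property. The other lambda in the example, element => (string)element, is the one called by the StringConcatenate extension method; it returns a string rather than true/false.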
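Lambdas come in more shapes than the filter shown above. The following small sketch, with a class name and sample values invented purely for illustration, shows a lambda that returns true/false, one that returns a string, and one that takes no parameters at all:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LambdaExamples
{
    // A predicate lambda: takes one string, returns bool. This is the shape Where expects.
    public static List<string> KeepRunLikeNames(IEnumerable<string> names)
    {
        return names.Where(n => n == "r" || n == "ins" || n == "hyperlink").ToList();
    }

    // A projection lambda: takes one string and returns another string, not true/false.
    public static List<string> Prefix(IEnumerable<string> names)
    {
        return names.Select(n => "w:" + n).ToList();
    }

    // A lambda with no parameters at all.
    public static int Answer()
    {
        Func<int> constant = () => 42;
        return constant();
    }
}
```

Which shape is allowed in a given spot is decided by the delegate type of the method you pass the lambda to, for example Func<T, bool> for Where and Func<T, string> for StringConcatenate.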

Wrapping things up...

If you read this whole blog post, you may have noticed that I made extensive use of the new C# 3.0 language features that shipped with .NET 3.5. I combined these with OpenXML and ASP.NET to create a useful Word document preview handler. If you want the full source code, download it here.
