Parsing Office Files

ljmwork

Source: http://www.langye.com/a/2013417/237.shtml

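【Aside】

This is the last article in the series. So that it doesn't feel incomplete, I'll also cover OOXML here. To be honest, OOXML is very easy to parse, and documentation and mature open-source libraries for it are plentiful, so I'll only cover it briefly. The article cites quite a few links; follow them if you want to dig deeper.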

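【Series Index】

The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 1)
Reading the DocumentSummaryInformation and SummaryInformation of Office binary documents
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 2)
Extracting the text of Word binary documents (.doc), including body text, headers, footers, comments and so on
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Part 3)
A detailed look at the storage structure of Office binary documents, and extracting the text of PowerPoint binary documents (.ppt)
The Secrets of Office Files: Parsing Word, PowerPoint and Other Files on .NET Without Office (Final Part)
How to parse Office Open XML documents (.docx, .pptx), plus common open-source libraries for parsing Office files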

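【1. A First Look at Office Open XML (OOXML)】

Microsoft's official description of Office Open XML is a good place to start (for details see http://office.microsoft.com/zh-cn/support/HA010205815.aspx?CTT=3).

Unlike Windows compound documents, OOXML was designed to be open, and because the format is zip + XML, it is much easier to read. If all we want is to extract the text, we don't even need to read any of the document's parameters!

If you haven't looked inside OOXML before, take any docx, pptx or xlsx file you have at hand, change its extension to zip, and open it with an archive tool.

Opening a docx, a pptx and an xlsx this way shows a clearly laid-out directory structure, so all we need is a zip library to read the archive and then an XML parser for the files inside. On .NET Framework 3.0 and later you can use the built-in Package class (System.IO.Packaging, in WindowsBase.dll) to unpack the file; in my experience, if you only need to read file streams or content out of the zip, Package in WindowsBase works well. For .NET CF, or for CLR 2.0 and below, you can use SharpZipLib (supports CLR 1.1, 2.0 and 4.0; official site http://www.icsharpcode.net/) or DotNetZip (supports CLR 2.0; official site http://dotnetzip.codeplex.com/). I find the latter's license friendlier.

For example, opening an OOXML file with the built-in Package class: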

 
    // Requires: using System; using System.IO; using System.IO.Packaging; (reference WindowsBase.dll)

    #region Fields
    protected FileStream m_stream;
    protected Package m_package;
    #endregion

    #region Constructor
    /// <summary>
    /// Initializes an OfficeOpenXMLFile
    /// </summary>
    /// <param name="filePath">File path</param>
    public OfficeOpenXMLFile(String filePath)
    {
        try
        {
            this.m_stream = new FileStream(filePath, FileMode.Open, FileAccess.Read);
            this.m_package = Package.Open(this.m_stream);

            this.ReadProperties();
            this.ReadCoreProperties();
            this.ReadContent();
        }
        finally
        {
            // Always release the package and the underlying stream
            if (this.m_package != null)
            {
                this.m_package.Close();
            }

            if (this.m_stream != null)
            {
                this.m_stream.Close();
            }
        }
    }
    #endregion
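
As a quick way to see the zip + XML structure described above without an archive tool, here is a minimal sketch that lists the entries of an OOXML package. It is not the article's approach: it assumes .NET Framework 4.5 or later (or .NET Core), where System.IO.Compression.ZipArchive is available, instead of Package. The names OoxmlInspector and ListEntries are hypothetical and used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper for illustration; not part of the article's code.
public static class OoxmlInspector
{
    /// <summary>
    /// Returns the full names of all entries in a zip-based OOXML stream.
    /// </summary>
    public static List<string> ListEntries(Stream stream)
    {
        var names = new List<string>();

        // leaveOpen: true, so the caller keeps ownership of the stream
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
            {
                names.Add(entry.FullName);
            }
        }

        return names;
    }
}
```

Calling OoxmlInspector.ListEntries on a FileStream opened over a .docx would be expected to return names such as word/document.xml and docProps/core.xml, matching the directories discussed in the following sections.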

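【2. Parsing OOXML Document Properties】

An OOXML file's document properties live in the docProps directory. The three most important files are:

app.xml: records the document's properties; its content is similar to the DocumentSummaryInformation covered earlier.
core.xml: records the document's core properties, such as creation time and last-modified time; its content is similar to the SummaryInformation covered earlier.
thumbnail.*: the document's thumbnail; different file types store different formats, for example emf for Word, wmf for Excel and jpeg for PowerPoint.

To read all the properties we only need to iterate over the child nodes in each XML file. For neatness, the names from Windows compound files are reused here: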

 
    // Requires: using System; using System.Collections.Generic; using System.IO.Packaging; using System.Xml;

    #region Constants
    private const String PropertiesNameSpace = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";
    private const String CorePropertiesNameSpace = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    #endregion

    #region Fields
    protected Dictionary<String, String> m_properties;
    protected Dictionary<String, String> m_coreProperties;
    #endregion

    #region Properties
    /// <summary>
    /// Gets the DocumentSummaryInformation
    /// </summary>
    public override Dictionary<String, String> DocumentSummaryInformation
    {
        get { return this.m_properties; }
    }

    /// <summary>
    /// Gets the SummaryInformation
    /// </summary>
    public override Dictionary<String, String> SummaryInformation
    {
        get { return this.m_coreProperties; }
    }
    #endregion

    #region Read Properties
    private void ReadProperties()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws if the part is missing, so check PartExists first
        Uri partUri = new Uri("/docProps/app.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        XmlDocument doc = new XmlDocument();
        doc.Load(part.GetStream());

        XmlNodeList nodes = doc.GetElementsByTagName("Properties", PropertiesNameSpace);
        if (nodes.Count < 1)
        {
            return;
        }

        // Each child element of <Properties> is one property: name -> text
        this.m_properties = new Dictionary<String, String>();
        foreach (XmlElement element in nodes[0])
        {
            this.m_properties.Add(element.LocalName, element.InnerText);
        }
    }
    #endregion

    #region Read CoreProperties
    private void ReadCoreProperties()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws if the part is missing, so check PartExists first
        Uri partUri = new Uri("/docProps/core.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        XmlDocument doc = new XmlDocument();
        doc.Load(part.GetStream());

        XmlNodeList nodes = doc.GetElementsByTagName("coreProperties", CorePropertiesNameSpace);
        if (nodes.Count < 1)
        {
            return;
        }

        // Each child element of <coreProperties> is one property: name -> text
        this.m_coreProperties = new Dictionary<String, String>();
        foreach (XmlElement element in nodes[0])
        {
            this.m_coreProperties.Add(element.LocalName, element.InnerText);
        }
    }
    #endregion
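
To make the traversal concrete, here is a small sketch that runs the same child-node loop over an in-memory XML string instead of a package part. The names PropertyReader and ReadChildElements are hypothetical. Unlike the listing above, it skips non-element children (such as comments) and overwrites duplicate names rather than throwing; both are design choices of this sketch, not of the article.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; not part of the article's code.
public static class PropertyReader
{
    /// <summary>
    /// Reads the child elements of the first element named rootName (in namespace ns)
    /// into a name -> text dictionary.
    /// </summary>
    public static Dictionary<string, string> ReadChildElements(string xml, string rootName, string ns)
    {
        var result = new Dictionary<string, string>();

        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList roots = doc.GetElementsByTagName(rootName, ns);
        if (roots.Count < 1)
        {
            return result;
        }

        foreach (XmlNode child in roots[0].ChildNodes)
        {
            // Ignore comments, whitespace and other non-element nodes
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            // The indexer overwrites on duplicate names instead of throwing
            result[child.LocalName] = child.InnerText;
        }

        return result;
    }
}
```

Passing the text of docProps/app.xml together with "Properties" and the extended-properties namespace would yield the same kind of dictionary that ReadProperties builds.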

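【3. Parsing Word 2007 Files】

Most of the content of a Word file (.docx) is stored in the word directory. The most important parts are:

document.xml: the body text of the Word document
footer*.xml: the footers of the Word document
header*.xml: the headers of the Word document
comments.xml: the comments of the Word document
endnotes.xml: the endnotes of the Word document

Here we read only the body text. OOXML stores text in a nested structure: in Word, for example, <w:p></w:p> encloses a paragraph, and paragraphs contain nested <w:t></w:t> elements that hold the actual text. In addition, <w:tab/> is a tab character, <w:br w:type="page"/> is a page break, and so on. So we need a method that processes these tags recursively: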

 
    // Requires: using System; using System.Text; using System.Xml;

    /// <summary>
    /// Extracts the text in a node
    /// </summary>
    /// <param name="node">XmlNode</param>
    /// <returns>Text in the node</returns>
    public static String ReadNode(XmlNode node)
    {
        if ((node == null) || (node.NodeType != XmlNodeType.Element)) // node is null or not an element
        {
            return String.Empty;
        }

        StringBuilder nodeContent = new StringBuilder();

        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            switch (child.LocalName)
            {
                case "t": // text
                    String text = child.InnerText;
                    String space = ((XmlElement)child).GetAttribute("xml:space");
                    // Keep whitespace verbatim when xml:space="preserve"; otherwise trim the end.
                    // (The original trimmed and then always appended one space, which added
                    // a spurious trailing space to runs that only had leading whitespace.)
                    nodeContent.Append(space == "preserve" ? text : text.TrimEnd());
                    break;

                case "cr": // carriage return (line break)
                case "br": // break (line or page break)
                    nodeContent.Append(Environment.NewLine);
                    break;

                case "tab": // tab
                    nodeContent.Append("\t");
                    break;

                case "p": // paragraph
                    nodeContent.Append(ReadNode(child));
                    nodeContent.Append(Environment.NewLine);
                    break;

                default: // anything else: recurse into the children
                    nodeContent.Append(ReadNode(child));
                    break;
            }
        }

        return nodeContent.ToString();
    }
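
To see the recursion at work, the sketch below applies a condensed version of the same switch to a hand-written WordprocessingML fragment. The names TextSketch, Extract and ExtractFromXml are made up for illustration, and the sketch appends '\n' instead of Environment.NewLine so that the result is the same on every platform.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical condensed variant of the recursive reader, for illustration only.
public static class TextSketch
{
    /// <summary>Recursively collects text from the element children of a node.</summary>
    public static string Extract(XmlNode node)
    {
        var sb = new StringBuilder();

        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element)
            {
                continue;
            }

            switch (child.LocalName)
            {
                case "t":   // text run content
                    sb.Append(child.InnerText);
                    break;
                case "tab": // tab character
                    sb.Append('\t');
                    break;
                case "cr":  // line break
                case "br":  // line or page break
                    sb.Append('\n');
                    break;
                case "p":   // paragraph: its text followed by a newline
                    sb.Append(Extract(child)).Append('\n');
                    break;
                default:    // any other element: recurse
                    sb.Append(Extract(child));
                    break;
            }
        }

        return sb.ToString();
    }

    /// <summary>Parses an XML string and extracts text starting at its root element.</summary>
    public static string ExtractFromXml(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return Extract(doc.DocumentElement);
    }
}
```

For a body containing one paragraph with the runs "Hello", a tab and "world", followed by a second paragraph "Second", the extracted text is "Hello\tworld\nSecond\n": the w:r elements carry no case of their own, so they fall through to the default branch and are simply recursed into.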

Then we simply start reading from the body element under the document root:

 
    // Requires: using System; using System.IO.Packaging; using System.Text; using System.Xml;

    #region Constants
    private const String WordNameSpace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    #endregion

    #region Fields
    private String m_paragraphText;
    #endregion

    #region Properties
    /// <summary>
    /// Gets the body text of the document
    /// </summary>
    public String ParagraphText
    {
        get { return this.m_paragraphText; }
    }
    #endregion

    #region Read content
    protected override void ReadContent()
    {
        if (this.m_package == null)
        {
            return;
        }

        // GetPart throws if the part is missing, so check PartExists first
        Uri partUri = new Uri("/word/document.xml", UriKind.Relative);
        if (!this.m_package.PartExists(partUri))
        {
            return;
        }

        PackagePart part = this.m_package.GetPart(partUri);
        StringBuilder content = new StringBuilder();
        XmlDocument doc = new XmlDocument();
        doc.Load(part.GetStream());

        // Map the "w" prefix so the XPath below can address WordprocessingML elements
        XmlNamespaceManager nsManager = new XmlNamespaceManager(doc.NameTable);
        nsManager.AddNamespace("w", WordNameSpace);

        XmlNode node = doc.SelectSingleNode("/w:document/w:body", nsManager);
        if (node == null)
        {
            return;
        }

        content.Append(NodeHelper.ReadNode(node));
        this.m_paragraphText = content.ToString();
    }
    #endregion
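
The listing above reads only document.xml. As a sketch of how the other parts listed at the start of this section could be picked out by name, the helper below maps a part path to a category. WordPartKind and WordParts.Classify are hypothetical names, and the matching rules simply follow the file names listed above; they are an assumption about typical packages, not a complete rule set.

```csharp
using System;

// Hypothetical categories for the parts listed in this section.
public enum WordPartKind { Document, Header, Footer, Comments, Endnotes, Other }

// Hypothetical helper for illustration; not part of the article's code.
public static class WordParts
{
    /// <summary>Classifies a part path such as "/word/header1.xml".</summary>
    public static WordPartKind Classify(string partPath)
    {
        const string prefix = "/word/";

        if (partPath == null ||
            !partPath.StartsWith(prefix, StringComparison.OrdinalIgnoreCase) ||
            !partPath.EndsWith(".xml", StringComparison.OrdinalIgnoreCase))
        {
            return WordPartKind.Other;
        }

        // File name without the "/word/" prefix, lower-cased for comparison
        string name = partPath.Substring(prefix.Length).ToLowerInvariant();

        // Anything in a nested folder is not one of the parts listed above
        if (name.IndexOf('/') >= 0)
        {
            return WordPartKind.Other;
        }

        if (name == "document.xml") return WordPartKind.Document;
        if (name == "comments.xml") return WordPartKind.Comments;
        if (name == "endnotes.xml") return WordPartKind.Endnotes;
        if (name.StartsWith("header", StringComparison.Ordinal)) return WordPartKind.Header;
        if (name.StartsWith("footer", StringComparison.Ordinal)) return WordPartKind.Footer;

        return WordPartKind.Other;
    }
}
```

A reader could loop over the package's parts, classify each part's Uri, and pass the matching XML to the same recursive text reader, since headers, footers, comments and endnotes use the same paragraph and text elements.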

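【4. Parsing PowerPoint 2007 Files】

The main content of a PowerPoint file (.pptx) is stored in the ppt directory, and the slides themselves are in the slides subdirectory, named slide + page number + .xml. We read them one by one in order. Note, however, that under plain string comparison "slide10.xml" < "slide2.xml", so reading the parts in string order can scramble the page order. We therefore sort by page number first and then read each slide starting from its root element.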

 
    // Requires: using System; using System.Collections.Generic; using System.IO.Packaging; using System.Text; using System.Xml;

    #region Constants
    private const String PowerPointNameSpace = "http://schemas.openxmlformats.org/presentationml/2006/main";
    #endregion

    #region Fields
    private StringBuilder m_allText;
    #endregion

    #region Properties
    /// <summary>
    /// Gets all the text in the PowerPoint slides
    /// </summary>
    public String AllText
    {
        get { return this.m_allText.ToString(); }
    }
    #endregion

    #region Constructor
    /// <summary>
    /// Initializes a PptxFile
    /// </summary>
    /// <param name="filePath">File path</param>
    public PptxFile(String filePath) :
        base(filePath) { }
    #endregion

    #region Read content
    protected override void ReadContent()
    {
        if (this.m_package == null)
        {
            return;
        }

        this.m_allText = new StringBuilder();

        PackagePartCollection col = this.m_package.GetParts();
        // SortedList keeps the slides ordered by their numeric page index
        SortedList<Int32, XmlDocument> list = new SortedList<Int32, XmlDocument>();

        foreach (PackagePart part in col)
        {
            if (part.Uri.ToString().IndexOf("ppt/slides/slide", StringComparison.OrdinalIgnoreCase) > -1)
            {
                String pageName = part.Uri.ToString().Replace("/ppt/slides/slide", "").Replace(".xml", "");
                Int32 index = 0;

                // Skip names without a parsable page number; otherwise several of them
                // would all get index 0 and list.Add would throw on the duplicate key
                if (!Int32.TryParse(pageName, out index))
                {
                    continue;
                }

                XmlDocument doc = new XmlDocument();
                doc.Load(part.GetStream());
                list.Add(index, doc);
            }
        }

        foreach (KeyValuePair<Int32, XmlDocument> pair in list)
        {
            // Use the name table of the slide being queried, not of the last one loaded
            XmlNamespaceManager nsManager = new XmlNamespaceManager(pair.Value.NameTable);
            nsManager.AddNamespace("p", PowerPointNameSpace);

            XmlNode node = pair.Value.SelectSingleNode("/p:sld", nsManager);
            if (node == null)
            {
                continue;
            }

            this.m_allText.Append(NodeHelper.ReadNode(node));
        }
    }
    #endregion
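
The ordering pitfall described at the start of this section is easy to reproduce in isolation. The sketch below extracts the slide number from each part name and sorts by it; SlideOrder, GetSlideNumber and SortBySlideNumber are hypothetical names, and part names are assumed to follow the /ppt/slides/slideN.xml pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the article's code.
public static class SlideOrder
{
    private static readonly Regex SlidePattern =
        new Regex(@"^/ppt/slides/slide(\d+)\.xml$", RegexOptions.IgnoreCase);

    /// <summary>Returns the slide number, or -1 if the URI is not a slide part.</summary>
    public static int GetSlideNumber(string partUri)
    {
        Match m = SlidePattern.Match(partUri);
        int number;
        if (m.Success && Int32.TryParse(m.Groups[1].Value, out number))
        {
            return number;
        }

        return -1;
    }

    /// <summary>Keeps only slide parts and orders them by slide number.</summary>
    public static List<string> SortBySlideNumber(IEnumerable<string> partUris)
    {
        var slides = new List<string>();

        foreach (string uri in partUris)
        {
            if (GetSlideNumber(uri) >= 0)
            {
                slides.Add(uri);
            }
        }

        // Compare numerically, so slide2 comes before slide10
        slides.Sort((a, b) => GetSlideNumber(a).CompareTo(GetSlideNumber(b)));
        return slides;
    }
}
```

With ordinal string comparison, "slide10.xml" sorts before "slide2.xml" because '1' precedes '2'; comparing the parsed numbers avoids that. Anchoring the pattern at both ends also keeps relationship parts such as /ppt/slides/_rels/slide1.xml.rels out of the result.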

 

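Appendix: all the code for this series can be downloaded from http://files.cnblogs.com/mayswind/DotMaysWind.OfficeReader_4.rar

【5. Common Open-Source Libraries for Office Documents (Word, PowerPoint, Excel)】

1. NPOI: http://npoi.codeplex.com

Needless to say, in my view this is the best Office document library on .NET, bar none. It provides complete reading and editing of Excel files and currently supports both the binary (.xls) and OOXML (.xlsx) formats. If you have used POI, Apache's Java library, NPOI offers an almost identical API. In practice, for ASP.NET, most of the Office documents that need editing are Excel files, or can be replaced by Excel files, so NPOI covers nearly every need. It now also supports docx files, while doc support lives in NPOI.ScratchPad, which you can get from the Source Code and compile yourself. If you don't need OOXML, the library is only 1.5 MB, and it supports .NET CLR 2.0 and 4.0.

2. Open XML SDK 2.0 for Microsoft Office: http://msdn.microsoft.com/en-us/library/bb448854(office.14).aspx

Microsoft's own Open XML SDK supports reading and writing any OOXML document. It also ships with a tool that can open an Office document and generate the program code that would produce that document using the library. The library is on the large side, though, at about 5 MB, and it requires .NET Framework 3.5.

3. Office Binary Translator to Open XML: http://b2xtranslator.sourceforge.net/

I only learned about this library recently, although it has been around for a long time. It converts Windows compound documents (.doc, .ppt, .xls) to the corresponding OOXML formats (.docx, .pptx, .xlsx), and of course you can also retrieve the content stored in the files. For some reason, the site is blocked by the Great Firewall. If you want to study Windows compound documents, this is the library I recommend: NPOI is so complete that tracing its file-reading flow from start to finish is quite complicated, whereas single-stepping through this library is easy to follow. It splits support for each file type (and the supporting modules) into separate projects, needs only a few hundred KB per file type, and is based on .NET CLR 2.0.

4. EPPlus: http://epplus.codeplex.com

Back in 2010, when NPOI did not yet support OOXML, I felt EPPlus was the best library for handling .xlsx files. It is only a few hundred KB and very lightweight. For reading zip files it did not choose SharpZipLib or DotNetZip, so older versions only needed .NET Framework 3.0; I just checked, and the new version requires .NET Framework 3.5.

5. ExcelDataReader: http://exceldatareader.codeplex.com

Also a very lightweight and easy-to-use library, and it reads both .xls and .xlsx. I used it before moving to EPPlus; I no longer remember what problem made me switch, nor whether that problem has since been fixed. Its advantage is that it only needs .NET CLR 2.0 and supports .NET CF, although there is no longer any need to develop Windows Mobile applications.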

【6. Related Links】

1. OpenXMLDeveloper.org: http://openxmldeveloper.org
2. How to: Retrieve Paragraphs from an Office Open XML Document: http://msdn.microsoft.com/zh-cn/library/bb669175.aspx
3. How to Manipulate Office Open XML Format Documents: http://www.microsoft.com/china/msdn/library/office/office/howManipulateOfficexml.mspx
4. How Do I... (Open XML SDK): http://msdn.microsoft.com/zh-cn/library/bb491088.aspx

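【Afterword】

We've finally reached the last article, and the series ends here. Thank you all for your support; I have finally fulfilled a wish from two years ago. Honestly, I did not expect the first article to get so many views and recommendations, since people who need to parse Office documents are a minority. I hope these four articles serve as a starting point that draws out better work from others: at the very least they give a basic understanding of Office documents, which should make it much easier to go deeper afterwards. That is why I wrote all of this down.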

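【Addendum】

After finishing these four articles, I happened to discover that Microsoft actually has Chinese documentation on this topic. I nearly cried; why hadn't I found it earlier? So here are a few commonly used links.

1. Understanding Office Binary File Formats: http://msdn.microsoft.com/zh-cn/library/gg615407(v=office.14).aspx
2. Understanding the Word MS-DOC Binary File Format: http://msdn.microsoft.com/zh-CN/library/gg615596
3. Understanding the PowerPoint MS-PPT Binary File Format: http://msdn.microsoft.com/zh-CN/library/gg615594
4. Understanding Graphics in the Office Binary File Formats: http://msdn.microsoft.com/zh-CN/library/gg985447
5. Finding Graphics in Binary PowerPoint MS-PPT Files: http://msdn.microsoft.com/zh-CN/library/hh244173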