Converting HTML to XHTML and Removing Unnecessary Tags and Attributes

weixin_30687051

Tags: xhtml php

Original link: http://www.cnblogs.com/ranzige/p/4365579.html

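Introduction

　　This is a classic library that helps you generate valid XHTML from HTML. It also supports tag and attribute filtering: you specify which tags and attributes are allowed in the output, and everything else is filtered out. You can use the library to clean up the bloated HTML that Microsoft Word produces when a document is converted to HTML. You can also use it to clean HTML before publishing it to a blog site; otherwise blog engines such as WordPress and b2evolution will reject it.

　How it works

　　There are two classes: HtmlReader and HtmlWriter.

　　HtmlReader extends the well-known SgmlReader written by Chris Lovett. As it reads HTML, it skips every node that has a prefix. This filters out the hundreds of useless tags such as <o:p>, <o:Document> and <st1:personname>, so the HTML you read contains only the core HTML tags.

　　HtmlWriter extends the regular XmlWriter (specifically XmlTextWriter), which produces XML. XHTML is essentially HTML written as XML. The familiar tags that are never closed in HTML, such as <img>, <br> and <hr>, must take the empty-element form in XHTML: <img .. />, <br/> and <hr/>. Because XHTML is ordinary XML, you can easily read an XHTML document with an XML parser, which also makes it possible to search it with XPath.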
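To illustrate the point about XML tooling, here is a minimal sketch (not part of the original library) that loads a small XHTML fragment with the standard System.Xml classes and runs an XPath query over it. The class name XhtmlXPathDemo and the sample markup are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XhtmlXPathDemo
{
    // Returns the src attribute of every <img> element in a well-formed XHTML fragment.
    public static List<string> ImageSources(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed XML

        var sources = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//img/@src"))
            sources.Add(attr.Value);
        return sources;
    }

    public static void Main()
    {
        // Hypothetical sample input: note the empty-element form of <img> and <hr>.
        string xhtml = "<div><p>Logo: <img src=\"logo.png\" alt=\"\" /></p><hr/><img src=\"photo.jpg\" /></div>";
        foreach (string src in ImageSources(xhtml))
            Console.WriteLine(src);
    }
}
```

The query works without a namespace manager only because the sample fragment declares no namespace. A full XHTML document that declares xmlns="http://www.w3.org/1999/xhtml" would need an XmlNamespaceManager and prefixed XPath names.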

　HtmlReader

　　HtmlReader is simple. Here is the complete class:

 
    using System.IO;
    using System.Xml;

    // This class skips all nodes which have some
    // kind of prefix. This trick does the job
    // of cleaning up MS Word/Outlook HTML markup.
    public class HtmlReader : Sgml.SgmlReader
    {
        public HtmlReader( TextReader reader ) : base( )
        {
            base.InputStream = reader;
            base.DocType = "HTML";
        }

        public HtmlReader( string content ) : base( )
        {
            base.InputStream = new StringReader( content );
            base.DocType = "HTML";
        }

        public override bool Read()
        {
            bool status = base.Read();
            if( status )
            {
                if( base.NodeType == XmlNodeType.Element )
                {
                    // Got a node with prefix. This must be one
                    // of those "<o:p>" or something else.
                    // Skip this node entirely. We want prefix
                    // less nodes so that the resultant XML
                    // requires no namespace.
                    if( base.Name.IndexOf(':') > 0 )
                        base.Skip();
                }
            }
            return status;
        }
    }
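The prefix-skipping idea can be tried without SgmlReader. The sketch below is an illustration under stated assumptions, not the library's code: it uses the standard XmlTextReader with Namespaces set to false (so undeclared prefixes such as o: are accepted) and requires well-formed input, which real Word HTML usually is not. The class name PrefixSkipDemo and the sample markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class PrefixSkipDemo
{
    // Lists the names of all elements that survive prefix skipping.
    public static List<string> UnprefixedElements(string xml)
    {
        var names = new List<string>();
        using (var reader = new XmlTextReader(new StringReader(xml)) { Namespaces = false })
        {
            bool more = reader.Read();
            while (more)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    if (reader.Name.IndexOf(':') > 0)
                    {
                        // Skip() drops the element with all its children and
                        // leaves the reader ON the following node, so we must
                        // re-examine that node instead of calling Read() again.
                        reader.Skip();
                        more = !reader.EOF;
                        continue;
                    }
                    names.Add(reader.Name);
                }
                more = reader.Read();
            }
        }
        return names;
    }

    public static void Main()
    {
        string xml = "<div><p>Hello<o:p>junk<b>x</b></o:p></p>" +
                     "<st1:personname>Bob</st1:personname><i>ok</i></div>";
        Console.WriteLine(string.Join(", ", UnprefixedElements(xml)));
    }
}
```

One design point worth noticing: the loop re-checks the node that Skip() lands on. The Read() override shown in the article returns right after Skip(), so, as far as the excerpt goes, a prefixed element that immediately follows another prefixed element would appear to reach the caller unfiltered.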

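　HtmlWriter

　　This class is a bit trickier. Here are the tricks it uses:

Override WriteString so that the regular XML encoding is not applied; the text is encoded by hand for HTML instead.
Override WriteStartElement to keep disallowed tags out of the output.
Override WriteAttributes to keep unwanted attributes out of the output.

　　Let's go through the class part by part:

　　Configurability

　　You can configure HtmlWriter by modifying the following section: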

 
    public class HtmlWriter : XmlTextWriter
    {
        // If set to true, it will filter the output
        // by using tag and attribute filtering,
        // space reduce etc
        public bool FilterOutput = false;

        // If true, it will reduce consecutive &nbsp; with one instance
        public bool ReduceConsecutiveSpace = true;

        // Set the tag names in lower case which are allowed to go to output
        public string [] AllowedTags =
            new string[] { "p", "b", "i", "u", "em", "big", "small",
            "div", "img", "span", "blockquote", "code", "pre", "br", "hr",
            "ul", "ol", "li", "del", "ins", "strong", "a", "font", "dd", "dt" };

        // If any tag found which is not allowed, it is replaced by this tag.
        // Specify a tag which has least impact on output
        public string ReplacementTag = "dd";

        // New lines \r\n are replaced with space
        // which saves space and makes the
        // output compact
        public bool RemoveNewlines = true;

        // Specify which attributes are allowed.
        // Any other attribute will be discarded
        public string [] AllowedAttributes = new string[]
        {
            "class", "href", "target", "border", "src",
            "align", "width", "height", "color", "size"
        };
    }

　　The WriteString method

 
    // The reason why we are overriding
    // this method is, we do not want the output to be
    // encoded for texts inside attribute
    // and inside node elements. For example, all the &nbsp;
    // gets converted to &amp;nbsp; in output. But this does not
    // apply to HTML. In HTML, we need to have &nbsp; as it is.
    public override void WriteString(string text)
    {
        // Write non-breaking spaces (U+00A0) back as the &nbsp; entity
        // (literals reconstructed from the comments; they were lost in this copy)
        text = text.Replace( "\u00A0", "&nbsp;" );

        // When you are reading RSS feed and writing Html,
        // this line helps remove those CDATA tags
        // (literal reconstructed from the comment)
        text = text.Replace( "<![CDATA[", "" );

        // Do some encoding of our own because
        // we are going to use WriteRaw which won't
        // do any of the necessary encoding
        text = text.Replace( "<", "&lt;" );
        text = text.Replace( ">", "&gt;" );
        text = text.Replace( "'", "&apos;" );
        text = text.Replace( "\"", "&quot;" );

        if( this.FilterOutput )
        {
            text = text.Trim();

            // We want to replace consecutive spaces
            // to one space in order to save horizontal width
            if( this.ReduceConsecutiveSpace )
                text = text.Replace( "&nbsp;&nbsp;", "&nbsp;" );

            if( this.RemoveNewlines )
                text = text.Replace(Environment.NewLine, " ");

            base.WriteRaw( text );
        }
        else
        {
            base.WriteRaw( text );
        }
    }
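Because WriteRaw performs no encoding, the hand-written escaping is the only thing protecting the output. The following sketch isolates that step as a pure function so it can be tested on its own. It is an illustration, not the article's code: the names RawTextEscaper, Escape and CollapseRuns are hypothetical, and escaping "&" first is an addition that the article's method does not show.

```csharp
using System;

public static class RawTextEscaper
{
    // Escapes text by hand so that it is safe to pass to XmlWriter.WriteRaw.
    public static string Escape(string text)
    {
        // Assumption beyond the article: handle '&' first, otherwise the
        // entities produced by the lines below would themselves be re-escaped.
        text = text.Replace("&", "&amp;");
        text = text.Replace("<", "&lt;");
        text = text.Replace(">", "&gt;");
        text = text.Replace("\"", "&quot;");
        text = text.Replace("'", "&apos;");
        // Write the non-breaking space character back as an entity.
        text = text.Replace("\u00A0", "&nbsp;");
        return text;
    }

    // Collapses every run of a repeated token to a single instance.
    // A single Replace call only halves a run, hence the loop.
    public static string CollapseRuns(string text, string token)
    {
        if (string.IsNullOrEmpty(token)) return text;
        string doubled = token + token;
        while (text.Contains(doubled))
            text = text.Replace(doubled, token);
        return text;
    }

    public static void Main()
    {
        Console.WriteLine(Escape("a < b & \"c\""));
        Console.WriteLine(CollapseRuns("a&nbsp;&nbsp;&nbsp;&nbsp;b", "&nbsp;"));
    }
}
```

The loop in CollapseRuns is a deliberate difference from a single Replace call, which would leave two instances behind when the run is four tokens long.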

　　WriteStartElement: applying tag filtering

 
    public override void WriteStartElement(string prefix,
        string localName, string ns)
    {
        if( this.FilterOutput )
        {
            bool canWrite = false;
            string tagLocalName = localName.ToLower();
            foreach( string name in this.AllowedTags )
            {
                if( name == tagLocalName )
                {
                    canWrite = true;
                    break;
                }
            }

            // Disallowed tags are replaced with the configured
            // replacement tag ("dd" by default)
            if( !canWrite )
                localName = this.ReplacementTag;
        }

        base.WriteStartElement(prefix, localName, ns);
    }
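The filtering rule itself is small enough to express as a pure function. The sketch below is one possible variation, not the article's implementation: it swaps the array scan for a case-insensitive HashSet and lower-cases the allowed name, since XHTML element names are lower case. The names TagFilter and Resolve and the shortened tag list are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TagFilter
{
    // Hypothetical, shortened whitelist; comparison ignores case.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "p", "b", "i", "a", "img", "br", "dd" };

    // Returns the tag name to write: the lower-cased original when it is
    // allowed, otherwise the replacement tag.
    public static string Resolve(string localName, string replacementTag)
    {
        return Allowed.Contains(localName)
            ? localName.ToLowerInvariant()
            : replacementTag;
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("P", "dd"));
        Console.WriteLine(Resolve("table", "dd"));
    }
}
```

Replacing a disallowed tag rather than dropping it keeps start and end tags balanced, because the writer still emits a matching end element for every start element.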

　　The WriteAttributes method: applying attribute filtering

 
    // Excerpt from the body of the WriteAttributes override: "reader" is the
    // XmlReader argument, positioned on the attribute being copied.

    // Check if the attribute is allowed
    bool canWrite = false;
    string attributeLocalName = reader.LocalName.ToLower();
    foreach( string name in this.AllowedAttributes )
    {
        if( name == attributeLocalName )
        {
            canWrite = true;
            break;
        }
    }

    // If allowed, write the attribute
    if( canWrite )
        this.WriteStartAttribute(reader.Prefix,
            attributeLocalName, reader.NamespaceURI);

    while (reader.ReadAttributeValue())
    {
        if (reader.NodeType == XmlNodeType.EntityReference)
        {
            if( canWrite ) this.WriteEntityRef(reader.Name);
            continue;
        }
        if( canWrite ) this.WriteString(reader.Value);
    }

    if( canWrite ) this.WriteEndAttribute();
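To see attribute filtering end to end, the sketch below copies a well-formed fragment from a standard XmlReader to a standard XmlWriter and keeps only whitelisted attributes. It is a simplified illustration, not the library's WriteAttributes override: it reads each attribute through reader.Value instead of walking ReadAttributeValue, and the names AttributeFilterDemo and Filter and the shortened whitelist are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class AttributeFilterDemo
{
    // Hypothetical, shortened whitelist; comparison ignores case.
    private static readonly HashSet<string> AllowedAttributes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "href", "src", "class" };

    // Copies elements and text, dropping every attribute not on the whitelist.
    public static string Filter(string xml)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        // Must be read before moving onto the attributes.
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.LocalName);
                        if (reader.MoveToFirstAttribute())
                        {
                            do
                            {
                                if (AllowedAttributes.Contains(reader.LocalName))
                                    writer.WriteAttributeString(
                                        reader.LocalName.ToLowerInvariant(), reader.Value);
                            } while (reader.MoveToNextAttribute());
                            reader.MoveToElement();
                        }
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    }
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteEndElement();
                        break;
                }
            }
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Filter("<a href=\"x.html\" onclick=\"evil()\" style=\"color:red\">link</a>"));
    }
}
```

The sketch ignores comments, whitespace nodes and namespaces to stay short; a real filter would need to decide what to do with each of them.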

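　Conclusion

　　The sample application is a utility you can use right away to clean up HTML files. You can use these classes in tools, such as blogging tools, that need to publish HTML to a web service.

　　Original article: http://www.codeproject.com/Articles/10792/Convert-HTML-to-XHTML-and-Clean-Unnecessary-Tags-a

Reposted from: https://www.cnblogs.com/ranzige/p/4365579.html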
