Parsing HTML Files with libxml

Taming HTML Parsing with libxml (1)

Sep 18, 2011

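There are plenty of libraries for parsing XML and JSON, but how do we parse HTML?

TFHpple is a small wrapper around libxml that can be used to parse HTML; its query syntax is XPath.

Today I came across an article that parses HTML with libxml directly: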

      



For the NSAttributedString+HTML Open Source project I chose to implement parsing of HTML with a set of NSScanner category methods. The resulting code is relatively easy to understand but has a couple of annoying drawbacks. You have to duplicate the NSData and convert it into an NSString, effectively doubling the amount of memory needed. Then, while parsing, I am building an ad hoc tree of DTHTMLElement instances, adding yet another copy of the document in RAM.

When parsing HTML (and by extension XML) you have two operating modes available. With the Simple API for XML (SAX), walking through the document triggers events on its individual pieces. The second method is to build a tree of nodes, a Document Object Model (DOM). NSScanner lends itself to the SAX style, but in this case it is less than ideal because CSS inheritance needs some sort of hierarchy to walk up.

In this post we will begin to explore the industry-standard libxml library and see how we can thinly wrap it in Objective-C so that it plays nicely with our code.

Getting libxml into your Xcode project is straightforward. Fortunately for us, libxml is so old and established that you can find it already installed on Unix, Mac and iOS platforms. There are two kinds of libraries in C: static and dynamic. libxml is the latter, which you can recognize by the .dylib extension.

Adding the Library

First we need to add the library providing all the XML and HTML structures and functions. We link against libxml2.dylib, which is a symbolic link to the versioned library file libxml2.2.dylib.

Next, because libxml is not a framework that would package the necessary headers with it, we also need to tell Xcode where the headers can be found. Since libxml also comes with OS X, its headers, just like those of all other OS X system libraries, can be found under /usr/include. Add /usr/include/libxml2 to the Header Search Paths and we’re set.

Now all we need to do to access libxml’s parsing functions and data structures is to add the appropriate import. Most of the internal structures are shared between the XML and HTML parsers, so we just need the HTMLparser header.

#import <libxml/HTMLparser.h>

Document Structure

Before we get into parsing let me show you how libxml represents HTML documents. Everything in libxml is a node. Because C does not have a concept of objects, the classical method of representing a tree is to have C structs with member variables pointing to other structs. A child is just a pointer to the child struct/node. If there can be more than one item, i.e. a list, this is represented by a linked list where the first node points to the next and so on until the very last node has a NULL pointer.

The smallest unit in libxml is the xmlNode structure, which is defined as follows:

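typedef struct _xmlNode xmlNode;
typedef xmlNode *xmlNodePtr;
struct _xmlNode {
    void           *_private;    /* application data */
    xmlElementType   type;       /* type number, must be second */
    const xmlChar   *name;       /* the name of the node, or the entity */
    struct _xmlNode *children;   /* first child */
    struct _xmlNode *last;       /* last child */
    struct _xmlNode *parent;     /* parent node */
    struct _xmlNode *next;       /* next sibling */
    struct _xmlNode *prev;       /* previous sibling */
    struct _xmlDoc  *doc;        /* the containing document */

    /* End of the part shared with other node-like structs */
    xmlNs           *ns;         /* pointer to the associated namespace */
    xmlChar         *content;    /* the content */
    struct _xmlAttr *properties; /* list of attributes */
    xmlNs           *nsDef;      /* namespace definitions on this node */
    void            *psvi;       /* for type/PSVI information */
    unsigned short   line;       /* line number */
    unsigned short   extra;      /* extra data for XPath/XSLT */
};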

The useful links are children, last, parent, next, prev and doc. The type value describes the role this node plays. If it is a tag then it is an XML_ELEMENT_NODE. The contents of a tag are represented by an XML_TEXT_NODE. Attributes are XML_ATTRIBUTE_NODE. Note that even if the original HTML does not contain a DTD, html or body tag, these will be implied by the parser.
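To make the linked-list layout concrete, here is a minimal sketch in plain C. ToyNode and the two toy_ functions are hypothetical stand-ins written for this illustration; they only mimic the link fields of xmlNode and are not part of libxml.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for xmlNode: only the link fields are modeled. */
typedef struct ToyNode {
    const char     *name;
    struct ToyNode *children; /* first child, or NULL */
    struct ToyNode *last;     /* last child, or NULL */
    struct ToyNode *parent;   /* parent node, or NULL */
    struct ToyNode *next;     /* next sibling, or NULL */
    struct ToyNode *prev;     /* previous sibling, or NULL */
} ToyNode;

/* Append child to the end of parent's child list, keeping all links consistent. */
void toy_append_child(ToyNode *parent, ToyNode *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last)
        parent->last->next = child;  /* list was not empty */
    else
        parent->children = child;    /* first child of this parent */
    parent->last = child;
}

/* Count the children by following the next pointers until NULL. */
int toy_count_children(const ToyNode *parent)
{
    int count = 0;
    for (const ToyNode *n = parent->children; n != NULL; n = n->next)
        count++;
    return count;
}
```

The loop in toy_count_children has the same shape you would use on real nodes: start at children and follow next until it is NULL.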

Let’s Parse Already

I sense that you grow impatient with me. OK, OK, we’re getting right to it now that you understand how libxml represents DOMs. Assume we have some HTML data downloaded from the web, and the NSURL it came from is in baseURL.

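// NSData *data contains the document data
// encoding is the NSStringEncoding of the data
// baseURL is the document's base URL, i.e. its location
 
// translate the NSStringEncoding into an IANA charset name for libxml
CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
// enc may end up NULL, in which case libxml tries to detect the encoding itself
const char *enc = cfencstr ? CFStringGetCStringPtr(cfencstr, 0) : NULL;
 
// htmlReadMemory takes an explicit length; the bytes of an NSData
// are not zero-terminated, so htmlReadDoc would read past the end
htmlDocPtr _htmlDocument = htmlReadMemory([data bytes], (int)[data length],
      [[baseURL absoluteString] UTF8String],
      enc,
      HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);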

Since we don’t need any warnings or errors we can just ignore them by passing some options. The baseURL might be necessary to resolve relative URLs contained in the document. And most importantly, we cannot assume that UTF-8 is used for encoding the bytes, so we get the appropriate character set name to pass to the parser.

Remember, this is pure C, so once we don’t need this DOM any more we need to trigger a routine that walks through these linked structures and frees up the reserved memory.

if (_htmlDocument)
{
   xmlFreeDoc(_htmlDocument);
}
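Conceptually, such a cleanup routine has to visit every node exactly once and must read a node's links before releasing it. The following is only a sketch of that idea on a hypothetical HeapNode type; it is not how xmlFreeDoc is actually implemented, and all names in it are made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical heap-allocated tree node with just the links needed here. */
typedef struct HeapNode {
    struct HeapNode *children; /* first child, or NULL */
    struct HeapNode *next;     /* next sibling, or NULL */
} HeapNode;

/* Allocate a node and, if parent is given, prepend it to parent's children. */
HeapNode *heap_node_new(HeapNode *parent)
{
    HeapNode *node = calloc(1, sizeof *node);
    if (node && parent) {
        node->next = parent->children;
        parent->children = node;
    }
    return node;
}

/* Free node, its following siblings and all descendants.
   Returns the number of nodes that were freed. */
int heap_tree_free(HeapNode *node)
{
    int freed = 0;
    while (node) {
        HeapNode *next = node->next;  /* read the link before freeing */
        freed += heap_tree_free(node->children);
        free(node);
        freed++;
        node = next;
    }
    return freed;
}

/* Build a four-node tree (root, two children, one grandchild) and free it. */
int heap_demo(void)
{
    HeapNode *root = heap_node_new(NULL);
    HeapNode *a = heap_node_new(root);
    HeapNode *b = heap_node_new(root);
    HeapNode *c = heap_node_new(a);
    assert(root && a && b && c);  /* the sketch does not handle allocation failure */
    return heap_tree_free(root);
}
```

After heap_tree_free returns, every pointer into the tree is invalid, which is exactly the situation to keep in mind for any code that still holds a node of a freed document.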

If _htmlDocument is not NULL then we have successfully parsed the document. There are multiple ways we could now use this, but for the final example in this post let me show you a function that just dumps the individual elements to the log. This demonstrates how to follow the links and also how to access the contents of text elements.

xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;
 
BOOL beginOfNode = YES;
 
while (currentNode) 
{
    // output node if it is an element
    if (beginOfNode)
    {
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSMutableArray *attrArray = [NSMutableArray array];
 
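            for (xmlAttrPtr attrNode = currentNode->properties; 
                 attrNode; attrNode = attrNode->next)
            {
                // attributes without a value (e.g. "disabled") have no child node
                xmlNodePtr contents = attrNode->children;
                const xmlChar *value = (contents && contents->content)
                                     ? contents->content : (const xmlChar *)"";
 
                [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", 
                                      attrNode->name, value]];
            }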
 
            NSString *attrString = [attrArray componentsJoinedByString:@" "]; 
 
            if ([attrString length])
            {
                attrString = [@" " stringByAppendingString:attrString];
            }
 
            NSLog(@"<%s%@>", currentNode->name, attrString);
        }
        else if (currentNode->type == XML_TEXT_NODE)
        {
            NSLog(@"%s", currentNode->content);
        }
        else if (currentNode->type == XML_COMMENT_NODE)
        {
            // the comment text is stored in content
            NSLog(@"<!--%s-->", currentNode->content);
        }
    }
 
    if (beginOfNode && currentNode->children)
    {
        // descend to the first child
        currentNode = currentNode->children;
        beginOfNode = YES;
    }
    else
    {
        // done with this node and its subtree: close it
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
        }
 
        if (currentNode->next)
        {
            // continue with the next sibling
            currentNode = currentNode->next;
            beginOfNode = YES;
        }
        else
        {
            // no siblings left: go back up, but do not descend again
            currentNode = currentNode->parent;
            beginOfNode = NO;
        }
    }
}

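Note how I use %s so that I can use the zero-terminated C strings without having to convert them to NSStrings. Keep in mind that %s interprets the bytes in the system default encoding, so non-ASCII text may come out garbled; for such text convert the UTF-8 bytes to an NSString first.

Obviously there are other ways to iterate through the document, for example by means of recursion. But this is meant to show how you can walk through nodes and their children and how you can also get the attributes.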
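For comparison, here is a sketch of the recursive variant in plain C. It runs on a hypothetical DumpNode struct rather than on libxml's xmlNode, and it writes into a caller-supplied buffer instead of the log so that the result is easy to inspect; all names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { DUMP_ELEMENT, DUMP_TEXT } DumpType;

/* Hypothetical simplified node: an element with a name or a text with content. */
typedef struct DumpNode {
    DumpType         type;
    const char      *name;     /* tag name, used for elements */
    const char      *content;  /* text, used for text nodes */
    struct DumpNode *children; /* first child, or NULL */
    struct DumpNode *next;     /* next sibling, or NULL */
} DumpNode;

/* Append s to the zero-terminated string in out, truncating at cap bytes. */
void dump_append(char *out, size_t cap, const char *s)
{
    assert(out != NULL && cap > 0);
    size_t used = strlen(out);
    if (used + 1 < cap)
        snprintf(out + used, cap - used, "%s", s);
}

/* Dump node and all of its following siblings, recursing into children. */
void dump_recursive(const DumpNode *node, char *out, size_t cap)
{
    for (; node != NULL; node = node->next) {
        if (node->type == DUMP_ELEMENT) {
            dump_append(out, cap, "<");
            dump_append(out, cap, node->name);
            dump_append(out, cap, ">");
            dump_recursive(node->children, out, cap); /* the subtree */
            dump_append(out, cap, "</");
            dump_append(out, cap, node->name);
            dump_append(out, cap, ">");
        } else {
            dump_append(out, cap, node->content);
        }
    }
}
```

The recursion keeps the open and close output for an element next to each other, at the cost of one stack frame per nesting level; the iterative loop above avoids that by walking back up through the parent pointers.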

Next time we will have a look at how we can wrap this pure C code so that we can more easily find and access parts of the document. We cannot simply wrap xmlNode into an Objective-C class because then we might end up freeing the structure while a node instance is still present, thus creating a whole lot of dangling pointers and introducing crash potential.

This is the case with the Objective-C HTML Parser project on GitHub by Ben Reeves. But even though I don’t share Ben’s philosophy, his prior work served as the starting point for this article.



CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
const char *enc = cfencstr ? CFStringGetCStringPtr(cfencstr, 0) : NULL;

// htmlReadMemory takes an explicit length; NSData bytes are not zero-terminated
htmlDocPtr _htmlDocument = htmlReadMemory([data bytes], (int)[data length],
      [[baseURL absoluteString] UTF8String],
      enc,
      HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

// nodes is assumed to be an xmlNodeSetPtr obtained elsewhere
// (e.g. from an XPath query); the walk stops when it gets back to nodes->nodeTab[0]
xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;

while (currentNode)
{
    // output node if it is an element
    if (currentNode->type == XML_ELEMENT_NODE)
    {
        NSMutableArray *attrArray = [NSMutableArray array];

        for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
        {
            // attributes without a value have no child node
            xmlNodePtr contents = attrNode->children;
            const xmlChar *value = (contents && contents->content) ? contents->content : (const xmlChar *)"";

            [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name, value]];
        }

        NSString *attrString = [attrArray componentsJoinedByString:@" "];

        if ([attrString length])
        {
            attrString = [@" " stringByAppendingString:attrString];
        }

        NSLog(@"<%s%@>", currentNode->name, attrString);
    }
    else if (currentNode->type == XML_TEXT_NODE)
    {
        //NSLog(@"%s", currentNode->content);
        NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding]);
    }
    else if (currentNode->type == XML_COMMENT_NODE)
    {
        NSLog(@"<!--%s-->", currentNode->content);
    }

    if (currentNode && currentNode->children)
    {
        currentNode = currentNode->children;
    }
    else if (currentNode && currentNode->next)
    {
        currentNode = currentNode->next;
    }
    else
    {
        currentNode = currentNode->parent;

        // close node
        if (currentNode && currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
        }

        if (currentNode && currentNode->next)
        {
            currentNode = currentNode->next;
        }
        else
        {
            // walk up until an ancestor with a next sibling is found
            while (currentNode)
            {
                currentNode = currentNode->parent;
                if (currentNode && currentNode->type == XML_ELEMENT_NODE)
                {
                    NSLog(@"</%s>", currentNode->name);
                    if (strcmp((const char *)currentNode->name, "table") == 0)
                    {
                        NSLog(@"over");
                    }
                }

                if (currentNode == nodes->nodeTab[0])
                {
                    break;
                }

                if (currentNode && currentNode->next)
                {
                    currentNode = currentNode->next;
                    break;
                }
            }
        }
    }

    if (currentNode == nodes->nodeTab[0])
    {
        break;
    }
}

// free the document only after the walk is finished
if (_htmlDocument)
{
    xmlFreeDoc(_htmlDocument);
}


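Still, I prefer TFHpple because it is simple and easy to use, although it is not feature-complete. For example, it cannot return a node's children, so I wrote two methods: one that returns the child nodes and one that returns all of the contents. In addition, the key for an attribute's content is the same as the key for a node's content (both are @"nodeContent"), whereas the key for an attribute should be @"attributeContent".

So I wrote the following function, which also changes the content key used for node attributes.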

NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
{
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];

    if (currentNode->name)
    {
        NSString *currentNodeContent =
        [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
        [resultForNode setObject:currentNodeContent forKey:@"nodeName"];
    }

    if (currentNode->content)
    {
        NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];

        if (currentNode->type == XML_TEXT_NODE)
        {
            // text below an element becomes the element's content
            if (currentNode->parent->type == XML_ELEMENT_NODE)
            {
                [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
                return nil;
            }

            // text below an attribute becomes the attribute's content
            if (currentNode->parent->type == XML_ATTRIBUTE_NODE)
            {
                [parentResult
                 setObject:
                 [currentNodeContent
                  stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
                 forKey:@"attributeContent"];
                return nil;
            }
        }
    }

    xmlAttr *attribute = currentNode->properties;
    if (attribute)
    {
        NSMutableArray *attributeArray = [NSMutableArray array];
        while (attribute)
        {
            NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
            NSString *attributeName =
            [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
            if (attributeName)
            {
                [attributeDictionary setObject:attributeName forKey:@"attributeName"];
            }

            if (attribute->children)
            {
                NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
                if (childDictionary)
                {
                    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
                }
            }

            if ([attributeDictionary count] > 0)
            {
                [attributeArray addObject:attributeDictionary];
            }
            attribute = attribute->next;
        }

        if ([attributeArray count] > 0)
        {
            [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
        }
    }

    xmlNodePtr childNode = currentNode->children;
    if (childNode)
    {
        NSMutableArray *childContentArray = [NSMutableArray array];
        while (childNode)
        {
            NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
            if (childDictionary)
            {
                [childContentArray addObject:childDictionary];
            }
            childNode = childNode->next;
        }
        if ([childContentArray count] > 0)
        {
            [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
        }
    }

    return resultForNode;
}

I added two key constants to TFHppleElement.m:
NSString * const TFHppleNodeAttributeContentKey = @"attributeContent";
NSString * const TFHppleNodeChildArrayKey       = @"nodeChildArray";

and changed the method that returns the attributes to:
- (NSDictionary *) attributes
{
  NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
  for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
    [translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
                             forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
  }
  return translatedAttributes;
}

and added methods for getting the child nodes:
- (BOOL) hasChildren
{
    NSArray *childs = [node objectForKey: TFHppleNodeChildArrayKey];

    if (childs)
    {
        return YES;
    }

    return NO;
}

- (NSArray *) children
{
    if ([self hasChildren])
        return [node objectForKey: TFHppleNodeChildArrayKey];
    return nil;
}



Finally, I also added a method that returns all of the content:
- (NSString *)contentsAt:(NSString *)xPathOrCss;
See the source code.



See also: http://giles-wang.blogspot.com/2011/08/iphoneansi.html

