Parsing HTML with PHP

load_life

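I recently wanted to write a web crawler in PHP, which means parsing HTML. On SourceForge I found a project called PHP Simple HTML DOM Parser. It returns the DOM elements you ask for through CSS selectors, much as jQuery does, and it is quite powerful.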

First, include the file simple_html_dom.php at the start of your program:

 
    include_once('simple_html_dom.php');

PHP Simple HTML DOM Parser offers three ways to create a DOM object:

 
    // Create a DOM object from a string
    $html = str_get_html('<html><body>Hello!</body></html>');

    // Create a DOM object from a URL
    $html = file_get_html('http://www.google.com/');

    // Create a DOM object from an HTML file
    $html = file_get_html('test.htm');
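Parsing can fail, so it may be worth checking the result before using it. The following is a minimal sketch, assuming simple_html_dom.php sits next to the script and that the library version in use returns false from str_get_html() for input it rejects (for example an empty string), as the 1.5 and later releases I am aware of do. The helper name parse_markup is hypothetical and not part of the library.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: parse a markup string and throw instead of
 * silently handing back false when the parser rejects the input.
 */
function parse_markup(string $markup)
{
    $html = str_get_html($markup);
    if ($html === false) {
        throw new RuntimeException('Could not parse markup (empty or too large?).');
    }
    return $html;
}

$html = parse_markup('<html><body>Hello!</body></html>');
echo $html->find('body', 0)->plaintext, PHP_EOL;

// Release the parser's internal references once the document is no longer needed.
$html->clear();
```

The same check applies to file_get_html(), which can also return false when the URL or file cannot be read.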

Once you have a DOM object, you can query it in various ways:

 
    // Find all anchors; returns an array of element objects
    $ret = $html->find('a');

    // Find the (N)th anchor; returns an element object, or null if not found (zero-based)
    $ret = $html->find('a', 0);

    // Find the last anchor; returns an element object, or null if not found
    $ret = $html->find('a', -1);

    // Find all <div> elements that have an id attribute
    $ret = $html->find('div[id]');

    // Find all <div> elements whose id attribute equals foo
    $ret = $html->find('div[id=foo]');
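For a crawler, the usual next step is to loop over the matched elements and read their attributes. Here is a minimal sketch of that, assuming simple_html_dom.php sits next to the script. The function name extract_links and the sample markup are made up for illustration. Attributes are read as properties (for example $a->href), and in the versions I have seen, a missing attribute reads as false.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: collect the href and visible text of every
 * anchor in a markup string, skipping anchors without an href.
 */
function extract_links(string $markup): array
{
    $html = str_get_html($markup);
    if ($html === false) {
        return [];
    }

    $links = [];
    foreach ($html->find('a') as $a) {
        $href = $a->href;          // attribute access via property
        if ($href === false) {     // anchor has no href attribute
            continue;
        }
        $links[] = ['href' => $href, 'text' => trim($a->plaintext)];
    }

    $html->clear();
    return $links;
}

$sample = '<p><a href="/docs">Docs</a> and <a name="top">no link here</a></p>';
foreach (extract_links($sample) as $link) {
    echo $link['text'], ' -> ', $link['href'], PHP_EOL;
}
```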

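You can use a range of CSS selectors here, just as you would for DOM operations in jQuery, which is very convenient. In addition, two special selectors return the text blocks and the comments: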

 
    // Find all text blocks
    $es = $html->find('text');

    // Find all comment (<!--...-->) blocks
    $es = $html->find('comment');
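As a rough illustration of what those two selectors give you, the sketch below gathers the non-empty text blocks and the comments from a small markup string. It assumes simple_html_dom.php sits next to the script. The function name collect_text_and_comments is hypothetical. It reads text nodes through plaintext and comments through outertext, on the assumption that plaintext is empty for comment nodes in the library version in use.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: return the non-empty text blocks and the
 * comments found in a markup string.
 */
function collect_text_and_comments(string $markup): array
{
    $result = ['text' => [], 'comments' => []];

    $html = str_get_html($markup);
    if ($html === false) {
        return $result;
    }

    foreach ($html->find('text') as $node) {
        $value = trim($node->plaintext);
        if ($value !== '') {
            $result['text'][] = $value;
        }
    }

    foreach ($html->find('comment') as $node) {
        // outertext is used here because plaintext is assumed to be empty for comments
        $result['comments'][] = $node->outertext;
    }

    $html->clear();
    return $result;
}

print_r(collect_text_and_comments('<p>Hello</p><!-- note --><p>World</p>'));
```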

As in jQuery, PHP Simple HTML DOM Parser also supports chaining, along with a number of simple methods for accessing DOM elements:

 
    // Example
    echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;

    // or
    echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
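One caveat with long chains: find() with an index and children() with an index return null when nothing matches, so a chain like the ones above fails with an error as soon as one step comes back empty. A minimal sketch of a guarded walk follows, assuming simple_html_dom.php sits next to the script. The helper name descend and the sample markup are hypothetical.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

/**
 * Hypothetical helper: follow a list of child indexes from a starting
 * node and return the node reached, or null if any step is missing.
 */
function descend($node, array $path)
{
    foreach ($path as $index) {
        if ($node === null) {
            return null;
        }
        $node = $node->children($index); // null when the index does not exist
    }
    return $node;
}

$html = str_get_html('<div id="div1"><p id="first">a</p><p id="second">b</p></div>');

$target = descend($html->find('#div1', 0), [1]);
echo $target !== null ? $target->id : 'not found', PHP_EOL;

$missing = descend($html->find('#div1', 0), [5, 0]);
echo $missing !== null ? $missing->id : 'not found', PHP_EOL;
```

Passing the path as an array keeps the call sites short and means the null check is written only once.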