HTMLParser Study Notes (Part 1)

ggbwqy242

Published 2015-01-06 15:45:10

htmlparser is an excellent tool for processing web page content. Below is a summary of its basic usage:

1. The core module of HTMLParser is the org.htmlparser.Parser class, which does the actual work of analyzing an HTML page. It has the following constructors:
    public Parser ();
    public Parser (Lexer lexer, ParserFeedback fb);
    public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
    public Parser (String resource, ParserFeedback feedback) throws ParserException;
    public Parser (String resource) throws ParserException;
    public Parser (Lexer lexer);
    public Parser (URLConnection connection) throws ParserException;
    plus one static factory method: public static Parser createParser (String html, String charset);

The constructors we use most often are shown here by example:

a. public Parser (URLConnection connection), for example:

Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1/HTMLParserTester.html")).openConnection() );

b. public Parser (String resource), for example:

Parser parser = new Parser("F://HTMLParserTester.html" );
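For constructor (a), the connection object can be prepared with the standard library before it is handed to the parser. The following is a minimal sketch that uses only java.net; the class name ConnectionSketch, the helper prepare and the 5000 ms timeouts are illustrative assumptions, and the hand-off to the Parser appears only as a comment so that the sketch runs without the htmlparser jar.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, not part of htmlparser.
class ConnectionSketch {
    // Creates a URLConnection and sets explicit timeouts.
    // openConnection() does not touch the network yet.
    static URLConnection prepare(String url) throws IOException {
        URLConnection conn = URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(5000); // milliseconds, chosen for illustration
        conn.setReadTimeout(5000);
        return conn;
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = prepare("http://127.0.0.1/HTMLParserTester.html");
        System.out.println(conn.getURL() + " connectTimeout=" + conn.getConnectTimeout());
        // With htmlparser on the classpath, conn could now be passed to
        // the Parser (URLConnection connection) constructor from example (a).
    }
}
```

Since the constructor is declared with a URLConnection parameter, the cast to HttpURLConnection in example (a) is not strictly required.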

After that, HTMLParser parses the page and stores it as a tree. Node is the basic data type in which the information is held.

-------------------------------------------------------------------------------

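The methods of Node fall into several groups.
Methods for traversing the tree structure, which are the easiest to understand:
Node getParent (): returns the parent node
NodeList getChildren (): returns the list of child nodes
Node getFirstChild (): returns the first child node
Node getLastChild (): returns the last child node
Node getPreviousSibling (): returns the previous sibling node
Node getNextSibling (): returns the next sibling node
Methods for retrieving the content of a Node:
String getText (): returns the text
String toPlainTextString(): returns the plain text
String toHtml (): returns the HTML (the original HTML)
String toHtml (boolean verbatim): returns the HTML (the original HTML)
String toString (): returns a string description of the node (its type, position and content)
Page getPage (): returns the Page object this Node belongs to
int getStartPosition (): returns the start position of this Node in the HTML page
int getEndPosition (): returns the end position of this Node in the HTML page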
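To make the traversal methods concrete, here is a small self-contained sketch of a tree that offers the same kind of navigation. SimpleNode and NodeTreeSketch are hypothetical stand-ins written for this note; they are not the org.htmlparser classes, and they only model the parent/child/sibling relationships listed above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; NOT the org.htmlparser.Node interface.
class SimpleNode {
    private final String text;
    private SimpleNode parent;
    private final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String text) { this.text = text; }

    // Appends a child, records this node as its parent, and returns this node.
    SimpleNode add(SimpleNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    String getText() { return text; }
    SimpleNode getParent() { return parent; }
    List<SimpleNode> getChildren() { return children; }
    SimpleNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }
    SimpleNode getLastChild() { return children.isEmpty() ? null : children.get(children.size() - 1); }

    // Siblings are looked up through the parent's child list.
    SimpleNode getNextSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
    }

    SimpleNode getPreviousSibling() {
        if (parent == null) return null;
        int i = parent.children.indexOf(this);
        return i > 0 ? parent.children.get(i - 1) : null;
    }
}

class NodeTreeSketch {
    // Depth-first walk that uses only getFirstChild and getNextSibling.
    static void dump(SimpleNode node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(node.getText()).append('\n');
        for (SimpleNode c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, depth + 1, out);
        }
    }

    // Builds a tiny tree: a root with a nested element and a text child.
    static SimpleNode sample() {
        SimpleNode inner = new SimpleNode("div#logoindex").add(new SimpleNode("a"));
        return new SimpleNode("div#top_main").add(inner).add(new SimpleNode("text"));
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        dump(sample(), 0, out);
        System.out.print(out);
    }
}
```

The walk needs nothing beyond a first-child and a next-sibling lookup, which is why those two methods are enough to visit every node of a subtree.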

--------------------------------------------------------------------------------------------------

Given this piece of HTML (HTMLParserTester.html):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
    <div id="logoindex">
        <!--这是注释-->
        白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
    </div>
    白泽居-www.baizeju.com
</div>
</body>
</html>

Let's run a simple test:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;
import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;

public class Main {
    private static String ENCODE = "GBK";

    // Prints a message after converting it from ENCODE to the platform encoding.
    private static void message( String szMsg ) {
        try {
            System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding")));
        }
        catch( Exception e ) {}
    }

    // Reads a whole file as text in the ENCODE charset; returns "" on any error.
    public static String openFile( String szFileName ) {
        try {
            BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
            String szContent="";
            String szTemp;

            while ( (szTemp = bis.readLine()) != null) {
                szContent+=szTemp+"\n";
            }
            bis.close();
            return szContent;
        }
        catch( Exception e ) {
            return "";
        }
    }

    public static void main(String[] args) {

        try{
            Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );

            // Iterate over the top-level nodes and print each content accessor.
            for (NodeIterator i = parser.elements (); i.hasMoreNodes(); ) {
                Node node = i.nextNode();
                message("getText:"+node.getText());
                message("getPlainText:"+node.toPlainTextString());
                message("toHtml:"+node.toHtml());
                message("toHtml(true):"+node.toHtml(true));
                message("toHtml(false):"+node.toHtml(false));
                message("toString:"+node.toString());
                message("=================================================");
            }
        }
        catch( Exception e ) {
            System.out.println( "Exception:"+e );
        }
    }
}

Output:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
=================================================
getText:

getPlainText:

toHtml:

toHtml(true):

toHtml(false):

toString:Txt (121[0,121],123[1,0]): \n
=================================================
getText:head
getPlainText:白泽居-www.baizeju.com
toHtml:<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toString:HEAD: Tag (123[1,0],129[1,6]): head
Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ...
Tag (197[1,74],204[1,81]): title
    Txt (204[1,81],223[1,100]): 白泽居-www.baizeju.com
    End (223[1,100],231[1,108]): /title
End (231[1,108],238[1,115]): /head

=================================================
getText:

getPlainText:

toHtml:

toHtml(true):

toHtml(false):

toString:Txt (238[1,115],240[2,0]): \n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText:




                白泽居-www.baizeju.com
白泽居-www.baizeju.com

        白泽居-www.baizeju.com



toHtml:<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
        <div id="logoindex">
                <!--这是注释-->
                白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
        </div>
        白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
        <div id="logoindex">
                <!--这是注释-->
                白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
        </div>
        白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
        <div id="logoindex">
                <!--这是注释-->
                白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
        </div>
        白泽居-www.baizeju.com
</div>
</body>
</html>
toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283[2,43],285[3,0]): \n
Tag (285[3,0],292[3,7]): body 
    Txt (292[3,7],294[4,0]): \n
    Tag (294[4,0],313[4,19]): div id="top_main"
      Txt (313[4,19],316[5,1]): \n\t
      Tag (316[5,1],336[5,21]): div id="logoindex"
        Txt (336[5,21],340[6,2]): \n\t\t
        Rem (340[6,2],351[6,13]): 这是注释
        Txt (351[6,13],376[8,0]): \n\t\t白泽居-www.baizeju.com\n
        Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com"
          Txt (409[8,33],428[8,52]): 白泽居-www.baizeju.com
          End (428[8,52],432[8,56]): /a
        Txt (432[8,56],435[9,1]): \n\t
        End (435[9,1],441[9,7]): /div
      Txt (441[9,7],465[11,0]): \n\t白泽居-www.baizeju.com\n
      End (465[11,0],471[11,6]): /div
    Txt (471[11,6],473[12,0]): \n
    End (473[12,0],480[12,7]): /body
Txt (480[12,7],482[13,0]): \n
End (482[13,0],489[13,7]): /html

=================================================

Five parts are printed.

Part 1: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Part 2: blank (a line break)

Part 3: <head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>

Part 4: blank (a line break)

Part 5: <html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">

白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>

---------------------------------------------------------------------------------------------------------------------

The content of the first Node corresponds to the first line, <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">, which is easy to understand.

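From this output you can also see the tree structure of the content, or rather a forest structure. Each first-level tag in the Page content, such as DOCTYPE, head and html, forms its own top-level Node. (Many people may find the second and fourth Nodes a bit odd. These two Nodes are in fact just two line breaks. HTMLParser turns every line break, space, Tab and so on in the HTML page into a text Node, shown as Txt in the toString output, which is why such Nodes appear. Little content, but high rank, heh.)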

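toPlainTextString includes everything the user can see. Two points are interesting. First, the Title content inside the <head> tag is part of the plain text; perhaps being visible in the title bar counts as visible. Second, as mentioned above, the line breaks and other whitespace in the HTML source also end up in the plain text, which seems logically questionable.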
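Because the plain text carries the source whitespace along, it is often cleaned up before use. The sketch below shows one way to do that with the standard library only; PlainTextSketch, normalize and isBlank are hypothetical names introduced here, and the input string imitates (in ASCII) the shape of the plain text printed for the html node.

```java
// Hypothetical helper class; it works on plain strings and does not call htmlparser.
class PlainTextSketch {
    // Collapses every run of whitespace (spaces, tabs, line breaks) into a
    // single space and trims both ends.
    static String normalize(String plainText) {
        return plainText.replaceAll("\\s+", " ").trim();
    }

    // True when a fragment holds nothing but whitespace, like the two
    // top-level line-break nodes in the output above.
    static boolean isBlank(String text) {
        return text.trim().isEmpty();
    }

    public static void main(String[] args) {
        String raw = "\n\n\t\twww.baizeju.com\nwww.baizeju.com\n\t\n";
        System.out.println("[" + normalize(raw) + "]");
        System.out.println(isBlank("\n"));
    }
}
```

A check like isBlank can also be applied to the text of each top-level node to skip the whitespace-only ones while iterating.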

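You may also have noticed that toHtml, toHtml(true) and toHtml(false) give the same results. That is indeed the case. Tracing through the HTMLParser code shows that AbstractNode, which implements Node, provides toHtml() and has it call toHtml(false) directly, and that in the three subclasses of AbstractNode (RemarkNode, TagNode and TextNode) the implementations of toHtml(boolean verbatim) never use the verbatim parameter, so the three calls produce identical results. Unless you need some special handling of your own, simply use toHtml.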
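The delegation described here can be pictured with a few lines of Java. This is only a sketch of the pattern: SketchNode, SketchTextNode and VerbatimSketch are hypothetical classes written for this note, not the HTMLParser source.

```java
// Hypothetical base class mirroring the delegation described in the text.
abstract class SketchNode {
    // The no-argument form simply forwards to the boolean form.
    String toHtml() {
        return toHtml(false);
    }

    abstract String toHtml(boolean verbatim);
}

// Hypothetical subclass whose implementation ignores the flag.
class SketchTextNode extends SketchNode {
    private final String text;

    SketchTextNode(String text) {
        this.text = text;
    }

    // The flag is accepted but never consulted, so both values give the same result.
    @Override
    String toHtml(boolean verbatim) {
        return text;
    }
}

class VerbatimSketch {
    public static void main(String[] args) {
        SketchNode node = new SketchTextNode("<!--comment-->");
        System.out.println(node.toHtml().equals(node.toHtml(true)));
        System.out.println(node.toHtml(true).equals(node.toHtml(false)));
    }
}
```

A subclass that wanted the flag to matter would only need to override toHtml(boolean verbatim) and branch on it; the no-argument form would pick up the non-verbatim result automatically.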

The inheritance hierarchy of HTMLParser's Node classes is as follows (the original figure was copied from another article):

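AbstractNode is the direct implementation of Node and is itself an abstract class. It has three direct subclasses. RemarkNode holds comments; in the toString part of the output there is a line "Rem (340[6,2],351[6,13]): 这是注释", which is a RemarkNode. TextNode is also simple: it is the text visible to the user. TagNode is the most complex: it covers all the tags of the HTML language and can be extended (to extend HTMLParser's handling of custom tags). TagNode comes in two kinds. One is the simple Tag, which cannot contain other Tags and can only be a leaf node. The other is CompositeTag, which can contain other Tags and is a branch node.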
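The Tag, Txt, Rem and End lines in the toString output share one shape, so they can be picked apart with a regular expression when inspecting a dump. The sketch below assumes, judging only from the output shown above, that each position is written as offset[row,column]; NodeLineSketch and parse are hypothetical names, and the pattern is fitted to this sample output rather than taken from the library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that reads one line of the toString dump shown above.
class NodeLineSketch {
    // Assumed shape: Kind (startOffset[row,col],endOffset[row,col]): text
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(Tag|Txt|Rem|End) \\((\\d+)\\[(\\d+),(\\d+)\\],(\\d+)\\[(\\d+),(\\d+)\\]\\): (.*)$");

    // Returns {kind, startOffset, endOffset, text}, or null if the line has another shape.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(5), m.group(8) };
    }

    public static void main(String[] args) {
        String[] f = parse("        Tag (376[8,0],409[8,33]): a href=\"http://www.baizeju.com\"");
        System.out.println(f[0] + " " + f[1] + ".." + f[2] + " -> " + f[3]);
    }
}
```

Lines with a different prefix, such as the "HEAD: Tag (...)" line or the Doctype line, do not match this pattern and yield null, so a caller would need to handle them separately.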