Parsing HTML with the Swing classes that ship with Java is simple to use and fast. First import the two packages javax.swing.text.* and javax.swing.text.html.*. Then define a parser class that extends javax.swing.text.html.HTMLEditorKit.ParserCallback. That class has the following methods:
void flush()
void handleComment(char[] data, int pos)
void handleEndOfLineString(String eol): called after the stream has been parsed, but before flush is called
void handleEndTag(HTML.Tag t, int pos)
void handleError(String errorMsg, int pos)
void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
void handleText(char[] data, int pos)
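To see how these callbacks fit together, here is a minimal sketch that is separate from the Parser class in this post. It assumes the HTML is already in a String and uses javax.swing.text.html.parser.ParserDelegator to drive the parse. The names CallbackDemo, LinkCollector and collect are made up for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

public class CallbackDemo {

    /** Hypothetical callback that records each link target and its text. */
    static class LinkCollector extends ParserCallback {
        final List<String> hrefs = new ArrayList<>();
        final List<String> texts = new ArrayList<>();
        private boolean inLink = false;

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            // Called when an opening tag is found; only <a href="..."> is of interest here.
            if (t == HTML.Tag.A) {
                Object href = a.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    hrefs.add(href.toString());
                    inLink = true;
                }
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Called for text content; keep it only while inside a link.
            if (inLink) {
                texts.add(new String(data));
            }
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            // Called when a closing tag is found.
            if (t == HTML.Tag.A) {
                inLink = false;
            }
        }
    }

    /** Parses the given HTML string and returns the filled-in callback. */
    static LinkCollector collect(String html) throws IOException {
        LinkCollector callback = new LinkCollector();
        // The third argument (ignoreCharSet) is true because the input is already decoded text.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback;
    }

    public static void main(String[] args) throws IOException {
        LinkCollector result = collect(
            "<html><body><p>See <a href=\"http://example.com/\">Example</a></p></body></html>");
        System.out.println(result.hrefs + " " + result.texts);
    }
}
```

The boolean flag is the usual way to carry state between callbacks: handleStartTag switches it on, handleText checks it, and handleEndTag switches it off again.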
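Take handleStartTag first: it is called when the parser finds the start of an HTML tag. t is the tag (for example HTML.Tag.A; the available tags can be looked up online), and a is the attribute set, so the href attribute of an a tag can be obtained via HTML.Attribute.HREF. The attribute names are likewise exposed publicly by Swing. handleEndTag is called when a tag ends. For usage, see the Parser class I wrote, shown below: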
import java.util.Vector;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;

public class Parser extends ParserCallback {
    protected String base;
    protected boolean isLink = false;
    protected boolean isParagraph = false;
    protected boolean isTitle = false;
    protected String htmlbody = "";
    protected String urlTitle = "";
    protected Vector<String> links = new Vector<String>();
    protected Vector<String> linkname = new Vector<String>();
    protected String paragraphText = "";
    protected String linkandparagraph = "";
    protected String encode = "";

    public Parser(String baseurl) {
        base = baseurl;