http://huangrongyou.iteye.com/blog/1748694
Main library to include:
htmlparser.jar
Main steps for parsing with HtmlParser:
Extracting the URLs from an HTML page
- // Parser parser = new Parser( (HttpURLConnection) (new URL("http://www.google.com")).openConnection() );
- Parser parser = new Parser();
- parser.setEncoding(parser.getEncoding());
- parser.setURL("http://www.google.com");
- NodeFilter filter = new NodeClassFilter(LinkTag.class);
- NodeList list = parser.extractAllNodesThatMatch(filter);
- for (int i = 0; i < list.size(); i++) {
- LinkTag node = (LinkTag) list.elementAt(i);
- System.out.println(node.extractLink());
- }
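Parsing HTML with the Visitor approach
HTML can also be traversed with a visitor. This approach is rarely used, so it is only shown briefly.
- // Create the Parser from a given URLConnection object
- Parser parser = new Parser((HttpURLConnection)(new URL("http://www.google.com")).openConnection());
- // Set the Parser's character encoding; it should normally match the page's encoding
- parser.setEncoding("GB2312");
- // Create the LinkFindingVisitor
- LinkFindingVisitor lvisitor = new LinkFindingVisitor("http://www.google.com");
- // Count the links matching "http://www.google.com" (LinkFindingVisitor matches against each link's text)
- parser.visitAllNodesWith(lvisitor);
- System.out.println("Number of links on the page containing http://www.google.com: " + lvisitor.getCount());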
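When the built-in LinkFindingVisitor is too narrow, a custom visitor is one option. The following is a minimal sketch, assuming HtmlParser 1.6 or 2.x, where NodeVisitor.visitTag takes an org.htmlparser.Tag. The class names LinkCollectingVisitorSketch and LinkCollector and the sample HTML are made up for illustration, and the page is parsed from an in-memory string so no network access is needed.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

public class LinkCollectingVisitorSketch {

    /** Hypothetical visitor: records the link of every <a> tag it is shown. */
    static class LinkCollector extends NodeVisitor {
        final List<String> links = new ArrayList<>();

        @Override
        public void visitTag(Tag tag) {
            // visitTag is called for start tags; only link tags are of interest here.
            if (tag instanceof LinkTag) {
                links.add(((LinkTag) tag).getLink());
            }
        }
    }

    /** Walks every node of the given HTML and returns the links that were seen. */
    static List<String> collectLinks(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        LinkCollector collector = new LinkCollector();
        parser.visitAllNodesWith(collector);
        return collector.links;
    }

    public static void main(String[] args) throws ParserException {
        String html = "<html><body>"
                + "<a href=\"http://www.google.com/\">Google</a>"
                + "<p>no link here</p>"
                + "<a href=\"http://www.example.com/\">Example</a>"
                + "</body></html>";
        for (String link : collectLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The visitor sees every node in document order, so the same class could also override visitStringNode or visitEndTag if text or closing tags matter.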
Parsing HTML with the Filter approach
HtmlParser can also parse a local file:
- Parser parser = new Parser("d:\\1.html");
- parser.setEncoding(parser.getEncoding());
- NodeFilter filter = new NodeClassFilter(LinkTag.class);
- NodeList list = parser.extractAllNodesThatMatch(filter);
- for (int i = 0; i < list.size(); i++) {
- LinkTag node = (LinkTag) list.elementAt(i);
- System.out.println(node.extractLink());
- }
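Besides a URL or a local file, the same filter code can be run against HTML that is already in memory. The sketch below assumes the static factory Parser.createParser(String html, String charset) is available (HtmlParser 1.5 and later). The class name StringSourceSketch, the method extractLinks, and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class StringSourceSketch {

    /** Returns the href of every <a> tag found in the given HTML string. */
    static List<String> extractLinks(String html) throws ParserException {
        // Build the parser from the string instead of a URL or file path.
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeFilter filter = new NodeClassFilter(LinkTag.class);
        NodeList list = parser.extractAllNodesThatMatch(filter);

        List<String> links = new ArrayList<>();
        for (int i = 0; i < list.size(); i++) {
            LinkTag node = (LinkTag) list.elementAt(i);
            links.add(node.extractLink());
        }
        return links;
    }

    public static void main(String[] args) throws ParserException {
        String html = "<ul>"
                + "<li><a href=\"http://example.com/one\">one</a></li>"
                + "<li><a href=\"http://example.com/two\">two</a></li>"
                + "</ul>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Parsing from a string keeps the example repeatable, which is convenient when trying out filters before pointing the parser at a live page.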
Filters that ship with HtmlParser:
TagNameFilter accepts all tag nodes with the specified tag name.
- TagNameFilter filter = new TagNameFilter("a");
- NodeList nodeList = parser.parse(filter);
NodeClassFilter accepts all nodes that are instances of the specified class.
- NodeFilter filter = new NodeClassFilter(LinkTag.class); // e.g. link tags
- // or
- NodeFilter filter = new NodeClassFilter(TextNode.class); // e.g. text nodes
- NodeList nodeList = parser.parse(filter);
- Node[] nodes = nodeList.toNodeArray(); // when a Node[] array is needed
- // or
- NodeClassFilter filter = new NodeClassFilter(TableTag.class); // select tables
- NodeList nodeList = parser.parse(filter);
- TableTag tableTag = (TableTag) nodeList.elementAt(0);
- TableRow[] rows = tableTag.getRows();
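Once the rows are available, the cells can be read row by row. This is a rough sketch that assumes TableRow.getColumns() returns the TD cells of a row (TH cells come from getHeaders() instead). The class name TableCellsSketch, the method readFirstTable, and the sample table are invented for the example.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.TableColumn;
import org.htmlparser.tags.TableRow;
import org.htmlparser.tags.TableTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TableCellsSketch {

    /** Returns the trimmed text of every TD cell of the first table, indexed [row][column]. */
    static String[][] readFirstTable(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList tables = parser.extractAllNodesThatMatch(new NodeClassFilter(TableTag.class));
        if (tables.size() == 0) {
            return new String[0][];
        }

        TableTag table = (TableTag) tables.elementAt(0);
        TableRow[] rows = table.getRows();
        String[][] cells = new String[rows.length][];
        for (int r = 0; r < rows.length; r++) {
            TableColumn[] columns = rows[r].getColumns();
            cells[r] = new String[columns.length];
            for (int c = 0; c < columns.length; c++) {
                // toPlainTextString() drops the markup and keeps only the text.
                cells[r][c] = columns[c].toPlainTextString().trim();
            }
        }
        return cells;
    }

    public static void main(String[] args) throws ParserException {
        String html = "<table>"
                + "<tr><td>a</td><td>b</td></tr>"
                + "<tr><td>c</td><td>d</td></tr>"
                + "</table>";
        for (String[] row : readFirstTable(html)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```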
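HasAttributeFilter accepts all nodes that have a given attribute (a required value for that attribute can also be specified).
HasChildFilter accepts all nodes that have a child node accepted by the given filter.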
- TagNameFilter filter = new TagNameFilter(tag);
- HasChildFilter hasChildFilter = new HasChildFilter(filter);
- NodeList nodeList = parser.parse(hasChildFilter);
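HasAttributeFilter and HasChildFilter can be tried side by side on a small in-memory page. The sketch below is only an illustration: the class name AttributeAndChildFilterSketch, the helper countMatches, and the sample markup are assumptions, and a fresh Parser is created for each query because a parser instance is consumed by one pass.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.ParserException;

public class AttributeAndChildFilterSketch {

    /** Counts the nodes in the given HTML that the filter accepts. */
    static int countMatches(String html, NodeFilter filter) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        return parser.extractAllNodesThatMatch(filter).size();
    }

    public static void main(String[] args) throws ParserException {
        String html = "<div class=\"nav\">"
                + "<a href=\"http://example.com/a\"><img src=\"a.png\"></a>"
                + "<a href=\"http://example.com/b\">plain text link</a>"
                + "</div>"
                + "<div class=\"footer\">bye</div>";

        // Any tag that carries a class attribute, whatever its value.
        NodeFilter hasAnyClass = new HasAttributeFilter("class");
        // Only tags whose class attribute is exactly "nav".
        NodeFilter navOnly = new HasAttributeFilter("class", "nav");
        // Tags with a direct child <img>; by default the check is not recursive.
        NodeFilter wrapsImage = new HasChildFilter(new TagNameFilter("img"));

        System.out.println("class attribute: " + countMatches(html, hasAnyClass));
        System.out.println("class=nav:       " + countMatches(html, navOnly));
        System.out.println("wraps an image:  " + countMatches(html, wrapsImage));
    }
}
```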
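HasParentFilter accepts all nodes whose parent node is accepted by the given filter.
LinkRegexFilter accepts all LinkTag nodes whose link value matches the given regular expression.
LinkStringFilter accepts all LinkTag nodes whose link value contains the given string.
AndFilter acts as an AND operator: it accepts all nodes that satisfy both filters.
NotFilter accepts all nodes that the given filter rejects.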
OrFilter acts as an OR operator: it accepts all nodes that satisfy either of the two filters.
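XorFilter acts as an XOR operator: it accepts all nodes that satisfy exactly one of the two filters.
RegexFilter accepts all text nodes whose text matches the given regular expression.
StringFilter accepts all text nodes whose text contains the given string.
IsEqualFilter accepts all nodes that are the same as a specified node.
CssSelectorNodeFilter accepts all nodes matched by a given CSS2 selector.
HasSiblingFilter accepts all nodes that have a sibling node accepted by the given filter.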
- TagNameFilter filter = new TagNameFilter(tag);
- HasSiblingFilter hasSiblingFilter = new HasSiblingFilter(filter);
- NodeList nodeList = parser.parse(hasSiblingFilter);
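The link filters and the logical filters listed above are usually combined. The following sketch shows one way to do that, assuming the two-argument AndFilter and OrFilter constructors and the single-argument LinkStringFilter and LinkRegexFilter constructors. The class name CombinedFilterSketch, the helper countMatches, and the sample page are made up for illustration.

```java
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.LinkRegexFilter;
import org.htmlparser.filters.LinkStringFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.NotFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.ParserException;

public class CombinedFilterSketch {

    /** Counts the nodes in the given HTML that the filter accepts. */
    static int countMatches(String html, NodeFilter filter) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        return parser.extractAllNodesThatMatch(filter).size();
    }

    public static void main(String[] args) throws ParserException {
        String html = "<h1>Title</h1><h2>Subtitle</h2>"
                + "<a href=\"http://example.org/a\">A</a>"
                + "<a href=\"http://example.org/b\">B</a>"
                + "<a href=\"http://example.com/c\">C</a>";

        // Links whose URL contains the given substring.
        NodeFilter orgLinks = new LinkStringFilter("example.org");
        // Links whose URL matches the given regular expression.
        NodeFilter comLinks = new LinkRegexFilter("^http://example\\.com/");
        // NotFilter alone would also accept text and heading nodes,
        // so it is AND-ed with a class filter to stay within link tags.
        NodeFilter otherLinks = new AndFilter(
                new NodeClassFilter(LinkTag.class),
                new NotFilter(new LinkStringFilter("example.org")));
        // Either of two tag names.
        NodeFilter headings = new OrFilter(new TagNameFilter("h1"), new TagNameFilter("h2"));

        System.out.println("example.org links: " + countMatches(html, orgLinks));
        System.out.println("example.com links: " + countMatches(html, comLinks));
        System.out.println("other links:       " + countMatches(html, otherLinks));
        System.out.println("h1 or h2:          " + countMatches(html, headings));
    }
}
```

Wrapping NotFilter in an AndFilter is a deliberate choice here: a bare NotFilter accepts every node the inner filter rejects, which includes text nodes and unrelated tags.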
Tag classes
Mainly used together with NodeClassFilter
Remark:comment
AppletTag:
BaseHrefTag:
BodyTag:"BODY" // getBody() internally calls toPlainTextString()
Bullet:"LI"
BulletList:"UL","OL"
CompositeTag:
DefinitionList:"DL"
DefinitionListBullet:"DD","DT"
Div:"DIV"
DoctypeTag:"!DOCTYPE"
FormTag:
FrameSetTag:
FrameTag:
HeadingTag:"H1","H2","H3","H4","H5","H6"
HeadTag:"HEAD"
Html:"HTML"
ImageTag:
InputTag:"INPUT"
JspTag:"%","%=","%@"
LabelTag:"LABEL"
LinkTag:
MetaTag:
ObjectTag:
OptionTag:
ParagraphTag:"P"
ProcessingInstructionTag:"?"
ScriptTag:
SelectTag:"SELECT"
Span:"SPAN"
StyleTag:"STYLE"
TableColumn:"TD"
TableHeader:"TH"
TableRow:"TR"
TableTag:"TABLE"
TagNode:
TextareaTag:"TEXTAREA"
TitleTag:"TITLE"
TextNode:
HtmlParser structure:
Tags are divided into simple tags and composite tags
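To make the distinction concrete, here is a small sketch. It assumes that "composite" refers to tag classes extending org.htmlparser.tags.CompositeTag (tags that can contain child nodes, such as Div or LinkTag), while simple tags such as ImageTag do not. The class name TagKindSketch and the method isComposite are hypothetical.

```java
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.CompositeTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TagKindSketch {

    /** Reports whether the first tag with the given name is a composite tag. */
    static boolean isComposite(String html, String tagName) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList found = parser.extractAllNodesThatMatch(new TagNameFilter(tagName));
        if (found.size() == 0) {
            throw new IllegalArgumentException("no <" + tagName + "> tag in input");
        }
        // Composite tags hold children; simple tags are stand-alone nodes.
        return found.elementAt(0) instanceof CompositeTag;
    }

    public static void main(String[] args) throws ParserException {
        String html = "<div><img src=\"a.png\"><a href=\"http://example.com/\">x</a></div>";
        for (String name : new String[] {"div", "img", "a"}) {
            System.out.println(name + " composite? " + isComposite(html, name));
        }
    }
}
```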