HTML Parser 2.0 officially released (my own translation)

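I have recently been using HTMLPARSER for a small project and have been following its latest releases closely. On http://sourceforge.net I found the latest version, 2.0, and translated the 2.0 release announcement myself.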

Posted by: derrickoswald
Date: 2006-09-17 13:43
Summary: What's new in HTML Parser 2.0

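Translation: The HTML Parser project (http://sourceforge.net/projects/htmlparser), which is very popular on Sourceforge, now has a new license, a new build environment, a new repository, and a new web site. To mark these fundamental changes, the version number has been raised to 2.0. (Translation to be continued.)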

Original text:

The very popular HTML Parser project (http://sourceforge.net/projects/htmlparser) on Sourceforge has been updated with a new license, new build environment, new repository and a new web site. To identify this radical change, the version has been revved to 2.0. 
 
In response to requests from the Apache community, the htmlparser license has changed from GNU Library or Lesser General Public License, to the more Apache friendly Common Public License 1.0 (http://opensource.org/licenses/cpl1.0.txt). 
 
As most projects are doing, the htmlparser repository has been changed from CVS to Subversion (http://subversion.tigris.org/). 
 
To support automatic integration in other projects, the build environment has changed from ant to Maven 2 (http://maven.apache.org/). This has provided an opportunity to update the web site (http://htmlparser.org). Project SNAPSHOTS and releases should be available soon, bear with us as we work out the kinks. 
 
HTML Parser is a Java library used to parse HTML in either a linear or nested fashion.  
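To make the "linear or nested" distinction concrete, here is a minimal sketch that uses only the Java standard library. It is not the HTML Parser API: the class LinearVsNested, its methods tokenize and nest, and the TreeNode type are hypothetical names invented for this illustration, and the tokenizer assumes simple, well-formed markup without attributes containing '>'.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy illustration only; none of these names come from HTML Parser.
class LinearVsNested {

    /** A node in the nested view: a tag (or text run) plus its children. */
    static final class TreeNode {
        final String label;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String label) {
            this.label = label;
        }
    }

    /** Linear view: split the input into a flat list of tags and text runs. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) == '<') {
                int close = html.indexOf('>', i);
                if (close == -1) {
                    close = html.length() - 1; // unterminated tag: take the rest
                }
                tokens.add(html.substring(i, close + 1));
                i = close + 1;
            } else {
                int next = html.indexOf('<', i);
                if (next == -1) {
                    next = html.length();
                }
                String text = html.substring(i, next).trim();
                if (!text.isEmpty()) {
                    tokens.add(text);
                }
                i = next;
            }
        }
        return tokens;
    }

    /** Nested view: fold the flat tokens into a tree using a stack of open tags. */
    static TreeNode nest(List<String> tokens) {
        TreeNode root = new TreeNode("#root");
        Deque<TreeNode> open = new ArrayDeque<>();
        open.push(root);
        for (String token : tokens) {
            if (token.startsWith("</")) {
                if (open.size() > 1) {
                    open.pop(); // close the innermost open tag
                }
            } else if (token.startsWith("<")) {
                TreeNode node = new TreeNode(token);
                open.peek().children.add(node);
                if (!token.endsWith("/>")) {
                    open.push(node); // self-closing tags take no children
                }
            } else {
                open.peek().children.add(new TreeNode(token));
            }
        }
        return root;
    }

    public static void main(String[] args) {
        List<String> flat = tokenize("<p>Hello <b>world</b></p>");
        System.out.println(flat); // [<p>, Hello, <b>, world, </b>, </p>]

        TreeNode p = nest(flat).children.get(0);
        System.out.println(p.label + " has " + p.children.size() + " children"); // <p> has 2 children
    }
}
```

In the linear view every tag and text run is a sibling in one flat sequence; in the nested view the same tokens become a tree, so "world" ends up as a child of the b node rather than next to it.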

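A practical HTML parsing utility, very handy:

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class HtmlLinkParser {
    // Collects the links found on a page. url is the page URL and filter is a
    // link filter; returns a HashSet of the page's outgoing links.
    // Note: LinkFilter is not an HTML Parser class. It is assumed here to be a
    // caller-defined interface with a boolean accept(String url) method.
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("utf-8");
            // Filter for <frame> tags, used to pick up the link held in the
            // frame's src attribute.
            NodeFilter frameFilter = new NodeFilter() {
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // OrFilter accepts either <a> tags or <frame> tags. NodeClassFilter
            // selects one class of tag; LinkTag corresponds to <a>.
            OrFilter linkFilter = new OrFilter(
                    new NodeClassFilter(LinkTag.class), frameFilter);
            // All nodes that pass the filter, returned as a NodeList.
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // the link in the <a> tag
                    // Keep only links that satisfy the caller's filter.
                    if (filter.accept(linkUrl)) {
                        links.add(linkUrl);
                    }
                } else { // <frame> tag
                    // Extract the src link, e.g. <frame src="test.html"/>.
                    // This assumes the src value is quoted and is followed by
                    // a space or by '>'.
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (ParserException e) { // parser errors
            e.printStackTrace();
        }
        return links;
    }
}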
