htmlparser Reading Notes

htmlparser is an HTML parsing library written in pure Java. It does not depend on any other Java libraries and is used mainly to transform or extract HTML.

 

HTML Parser is a Java library used to parse HTML in either a linear or nested fashion. Primarily used for transformation or extraction, it features filters, visitors, custom tags and easy to use JavaBeans. It is a fast, robust and well tested package.

htmlparser ships two jar files: htmllexer.jar and htmlparser.jar.

 

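htmllexer.jar provides a lightweight way of parsing: it walks the HTML linearly and yields only string, remark (comment) and tag nodes. htmlparser.jar includes everything implemented in htmllexer.jar and additionally supports more complex HTML parsing, such as nested tags.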

 

To use the library, you will need to add either the htmllexer.jar or htmlparser.jar to your classpath when compiling and running. The htmllexer.jar provides low level access to generic string, remark and tag nodes on the page in a linear, flat, sequential manner. The htmlparser.jar, which includes the classes found in htmllexer.jar, provides access to a page as a sequence of nested differentiated tags containing string, remark and other tag nodes.

If your application requires only modest structural knowledge of the page, and is primarily concerned with individual, isolated nodes, you should consider using the lightweight lexer. But if your application requires knowledge of the nested structure of the page, for example processing tables, you will probably want to use the full parser.
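To make the "linear, flat, sequential" view more concrete, here is a rough plain-Java sketch of what a flat pass over a page produces. It is not htmlparser's Lexer and does not use the library at all: the class name FlatLexSketch, the lex method and the regular expression are assumptions made up for this illustration, and the regex ignores many real-world cases (for example a ">" inside a quoted attribute value, or script content).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: this is NOT htmlparser's Lexer.
public class FlatLexSketch {
    // Tries a comment first, then any other tag, then a run of text.
    private static final Pattern NODE =
        Pattern.compile("<!--.*?-->|<[^>]*>|[^<]+", Pattern.DOTALL);

    /** Splits html into a flat list of "REMARK:", "TAG:" or "STRING:" entries. */
    public static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        Matcher m = NODE.matcher(html);
        while (m.find()) {
            String piece = m.group();
            if (piece.startsWith("<!--")) {
                nodes.add("REMARK:" + piece);
            } else if (piece.startsWith("<")) {
                nodes.add("TAG:" + piece);
            } else {
                nodes.add("STRING:" + piece);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Prints one entry per line, in document order, with no nesting.
        for (String node : lex("<p>Hi<!-- note --><b>there</b></p>")) {
            System.out.println(node);
        }
    }
}
```

In this flat view the opening and closing tags are separate, unrelated entries, so nothing records that the b element sits inside the p element. Recovering that nesting is the extra work the full parser does, which is why table-like structures call for htmlparser.jar rather than the lexer alone.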

 

htmlparser offers two main capabilities: extraction and transformation.

 

For more information, see the official website: http://htmlparser.sourceforge.net/

 

The code segment is as follows:

 

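import java.util.ArrayList;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.FormTag;
import org.htmlparser.tags.InputTag;
import org.htmlparser.tags.SelectTag;
import org.htmlparser.tags.TextareaTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// The statements below belong inside a method that declares "throws ParserException".
// url is the location of the HTML file; it can be a local file or a network resource.
Parser parser = new Parser(url);
parser.setEncoding("UTF-8");

// One class filter per kind of form-related tag.
NodeFilter selectFilter = new NodeClassFilter(SelectTag.class);
NodeFilter textareaFilter = new NodeClassFilter(TextareaTag.class);
NodeFilter inputFilter = new NodeClassFilter(InputTag.class);
NodeFilter formFilter = new NodeClassFilter(FormTag.class);

// AndFilter: matches only input tags whose type attribute is "text".
HasAttributeFilter inputAttributeFilter = new HasAttributeFilter("type", "text");
AndFilter andFilter = new AndFilter(inputFilter, inputAttributeFilter);

// OrFilter: accepts a node if any one of the four filters accepts it.
OrFilter lastFilter = new OrFilter();
lastFilter.setPredicates(new NodeFilter[] { selectFilter, textareaFilter, andFilter, formFilter });
NodeList nl = parser.parse(null); // a null filter returns every top-level node
NodeList nodelist = nl.extractAllNodesThatMatch(lastFilter, true); // true: match recursively through child nodes
Node[] nodes = nodelist.toNodeArray();

ArrayList<String> result = new ArrayList<String>();

for (int i = 0; i < nodes.length; i++) {
    Tag tag = (Tag) nodes[i];
    if (tag instanceof FormTag) {
        tag.setAttribute("action", ""); // set the form tag's action URL
    } else if (tag instanceof SelectTag) {
        // Caveat: this string check also matches when "multiple" appears
        // anywhere in the tag's HTML, including option text.
        NodeList children = tag.getChildren();
        if (children != null && children.size() > 0 && tag.toHtml().contains("multiple")) {
            ((SelectTag) tag).removeChild(children.size() - 1); // remove the last child node
        }
        result.add(tag.getAttribute("name"));
    } else {
        // Textarea and text input tags: record the field name.
        result.add(tag.getAttribute("name"));
    }
}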
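
The boolean logic behind the combined filter above can be pictured with plain java.util.function.Predicate objects. The sketch below shows only that logic and does not use htmlparser: the class FilterLogicSketch, the SimpleTag stand-in and the select method are hypothetical names invented for this illustration, and a real tag node carries far more than a name and a type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only: plain-Java predicates, not htmlparser's filter classes.
public class FilterLogicSketch {
    /** Minimal stand-in for a tag: an element name plus an optional type attribute. */
    public static final class SimpleTag {
        final String name;
        final String type; // null when the tag has no type attribute

        public SimpleTag(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Returns a label for every tag that the combined predicate accepts. */
    public static List<String> select(List<SimpleTag> tags) {
        Predicate<SimpleTag> isSelect = t -> t.name.equals("select");
        Predicate<SimpleTag> isTextarea = t -> t.name.equals("textarea");
        Predicate<SimpleTag> isForm = t -> t.name.equals("form");
        Predicate<SimpleTag> isInput = t -> t.name.equals("input");
        Predicate<SimpleTag> typeIsText = t -> "text".equals(t.type);

        // AND: input tags whose type is "text". OR: any of the four alternatives.
        Predicate<SimpleTag> textInput = isInput.and(typeIsText);
        Predicate<SimpleTag> wanted = isSelect.or(isTextarea).or(textInput).or(isForm);

        List<String> kept = new ArrayList<>();
        for (SimpleTag t : tags) {
            if (wanted.test(t)) {
                kept.add(t.type == null ? t.name : t.name + "[" + t.type + "]");
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleTag> tags = Arrays.asList(
            new SimpleTag("form", null),
            new SimpleTag("input", "text"),
            new SimpleTag("input", "submit"),
            new SimpleTag("select", null),
            new SimpleTag("div", null));
        System.out.println(select(tags)); // prints [form, input[text], select]
    }
}
```

The submit input is rejected because the AND branch requires both conditions, while the div is rejected because none of the OR alternatives accepts it.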
