Parsing and traversing a Document

To parse an HTML document:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

(See parsing a document from a string for more info.)
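Once the string is parsed, the resulting Document can be queried directly. The following is a minimal sketch, assuming jsoup is on the classpath; the class name FirstParseDemo and its helper methods are made up for illustration, while Jsoup.parse, Document.title, Document.body and Element.text are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FirstParseDemo {
    // Same markup as the snippet above.
    static final String HTML = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";

    // Hypothetical helper: parse the markup and return the <title> text.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Hypothetical helper: parse the markup and return the combined text of <body>.
    static String bodyTextOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    public static void main(String[] args) {
        System.out.println(titleOf(HTML));    // prints "First parse"
        System.out.println(bodyTextOf(HTML)); // prints "Parsed HTML into a doc."
    }
}
```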

The parser will make every attempt to create a clean parse from the HTML you provide, regardless of whether the HTML is well-formed or not. It handles:

  • unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)

  • implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)

  • reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
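The leniency described above can be checked with a small experiment. This is a sketch, assuming jsoup is on the classpath; the class LenientParseDemo and its counting helpers are hypothetical names introduced here. It covers the unclosed-tag case and the head/body structure only, and the exact serialized output may vary between jsoup versions, so the markup is printed without a stated expected value.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseDemo {
    // Hypothetical helper: number of elements matching a CSS query anywhere in the document.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    // Hypothetical helper: number of matching elements inside <head>.
    static int countInHead(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.head().select(cssQuery).size();
    }

    // Hypothetical helper: number of matching elements inside <body>.
    static int countInBody(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.body().select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Unclosed tags: the second <p> implicitly closes the first.
        String unclosed = "<p>Lorem <p>Ipsum";
        System.out.println(count(unclosed, "p")); // prints 2
        System.out.println(Jsoup.parse(unclosed).body().html());

        // No <html>, <head> or <body> in the input: the parser creates them
        // and places <title> in the head and <p> in the body.
        String bare = "<title>Untitled</title><p>Body text";
        System.out.println(countInHead(bare, "title")); // prints 1
        System.out.println(countInBody(bare, "p"));     // prints 1
    }
}
```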

The object model of a document

  • Documents consist of Elements and TextNodes (and a couple of other misc nodes: see the nodes package tree).

  • The inheritance chain is: Document extends Element extends Node. TextNode extends Node.

  • An Element contains a list of child Nodes and has one parent Element. It also provides a filtered list of its child Elements only.
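The difference between the full node list and the filtered element list is easiest to see on a paragraph that mixes text and markup. The sketch below assumes jsoup is on the classpath; ObjectModelDemo and firstParagraph are hypothetical names, while childNodes, children, parent and tagName are jsoup methods.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class ObjectModelDemo {
    // Hypothetical helper: parse the markup and return its first <p> element.
    static Element firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first();
    }

    public static void main(String[] args) {
        Element p = firstParagraph("<p>Hello <b>bold</b> world</p>");

        // childNodes() returns every child: two TextNodes and one Element here.
        for (Node node : p.childNodes()) {
            if (node instanceof TextNode) {
                System.out.println("TextNode: " + ((TextNode) node).text());
            } else if (node instanceof Element) {
                System.out.println("Element: " + ((Element) node).tagName());
            }
        }

        // children() is the filtered view: child Elements only.
        System.out.println(p.childNodes().size()); // prints 3
        System.out.println(p.children().size());   // prints 1

        // Every Element has one parent Element; for this <p> it is <body>.
        System.out.println(p.parent().tagName());  // prints "body"
    }
}
```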

Reposted from: https://my.oschina.net/u/553266/blog/296058
