Learning Jsoup

Author: heda

Created: 2013-4-26

 

The simplest example

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";

Document doc = Jsoup.parse(html);

Jsoup makes a best-effort attempt to parse a document, even when the HTML is malformed or non-standard.
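As a rough illustration of this lenient parsing, the sketch below feeds jsoup markup with unclosed tags and no html/body wrapper. The class and method names (LenientParseSketch, countParagraphs) are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseSketch {
    // Hypothetical helper: parses possibly malformed HTML and counts the <p> elements jsoup recovers.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").size();
    }

    public static void main(String[] args) {
        // Neither <p> is closed, and there is no <html> or <body> wrapper.
        System.out.println(countParagraphs("<p>One<p>Two")); // 2
    }
}
```

The parser closes the first paragraph when it meets the second one, so both end up as separate elements under a generated body.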

Object model of a document

A Document consists of Elements and TextNodes.

Inheritance chain:

Document extends Element extends Node

TextNode extends Node

An Element contains a list of child nodes and has one parent Element. It also provides a filtered list containing only its child elements.
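The difference between all child nodes and the filtered list of child elements can be seen in a small sketch. The class and helper names here are hypothetical; childNodeSize() and children() are the jsoup calls being compared.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class NodeModelSketch {
    // Hypothetical helper: returns the first <div> of the parsed HTML.
    static Element firstDiv(String html) {
        return Jsoup.parse(html).select("div").first();
    }

    // Counts every child node, including text nodes.
    static int allChildNodes(String html) {
        return firstDiv(html).childNodeSize();
    }

    // Counts child elements only; text nodes are filtered out.
    static int childElementsOnly(String html) {
        return firstDiv(html).children().size();
    }

    public static void main(String[] args) {
        String html = "<div>Hello <b>world</b>!</div>";
        System.out.println(allChildNodes(html));     // 3: text, <b>, text
        System.out.println(childElementsOnly(html)); // 1: <b>
    }
}
```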

Getting a Document from a String

Jsoup.parse(html);

Jsoup.parse(html,baseuri);

The base URI is used to resolve relative URLs into absolute URLs.

One open question: how should the base URI be set? Is it relative to the current URL?

http://stackoverflow.com/questions/7142187/jsoup-parse-vs-jsoup-parse-or-how-does-url-detection-work-in-jsoup

According to that answer, it should be the full URL of the document.
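A small sketch of how the base URI affects URL resolution, assuming the base URI is the full URL the page was loaded from. The class name, helper name, and example.com paths are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {
    // Hypothetical helper: parses html against baseUri and resolves the first link's href.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.attr("abs:href");
    }

    public static void main(String[] args) {
        String base = "http://example.com/dir/page.html";
        // A path-relative href is resolved against the directory of the base URI.
        System.out.println(resolveFirstLink("<a href='other.html'>x</a>", base));
        // A root-relative href is resolved against the host of the base URI.
        System.out.println(resolveFirstLink("<a href='/root.html'>x</a>", base));
    }
}
```

Under these assumptions the two calls should print http://example.com/dir/other.html and http://example.com/root.html. When no usable base URI is supplied, the jsoup documentation says the abs: lookup returns an empty string for a relative href.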

Parsing a body fragment

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Parsing from a URL

Fetch a page directly from a URL and parse it:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Parsing a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Navigate document – dom methods

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Finding elements

Element data

Manipulating HTML and text

Use selector-syntax to find elements

CSS or jquery-like selector syntax

 

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png

Element masthead = doc.select("div.masthead").first();
  // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

 

select is contextual.

select is available on Document, Element, and Elements, so select calls can be chained.

select returns a list of Elements.
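A minimal sketch of chaining: the second select only searches inside the elements matched by the first. The class name, helper name, and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ChainedSelectSketch {
    static final String HTML =
        "<div class='nav'><a href='/a'>A</a></div>"
      + "<div class='content'><a href='/b'>B</a><a>no href</a></div>";

    // Hypothetical helper: counts a[href] links found inside the containers matched by containerCss.
    static int linksInside(String html, String containerCss) {
        Document doc = Jsoup.parse(html);
        Elements containers = doc.select(containerCss); // first select, on the Document
        Elements links = containers.select("a[href]");  // second select, scoped to the containers
        return links.size();
    }

    public static void main(String[] args) {
        System.out.println(linksInside(HTML, "div.content")); // 1
        System.out.println(linksInside(HTML, "div"));         // 2
    }
}
```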

Selector overview

tagname: find elements by tag, e.g. a

ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements

#id: find elements by ID, e.g. #logo

.class: find elements by class name, e.g. .masthead

[attribute]: elements with attribute, e.g. [href]

[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes

[attr=value]: elements with attribute value, e.g. [width=500]

[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]

*: all elements, e.g. *

Selector combinations

el#id: elements with ID, e.g. div#logo

el.class: elements with class, e.g. div.masthead

el[attr]: elements with attribute, e.g. a[href]

Any combination, e.g. a[href].highlight

ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"

parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag

siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div

siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p

el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
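The combinators above can be compared side by side on a small document. This is only a sketch; the class name, the count helper, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;

class CombinatorSketch {
    static final String HTML =
        "<div class='content'><p>direct</p><section><p>nested</p></section></div>"
      + "<h1>Title</h1><p>after one</p><p>after two</p>";

    // Hypothetical helper: counts the elements matched by a CSS query.
    static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    public static void main(String[] args) {
        System.out.println(count(HTML, "div.content p"));   // 2: any descendant p
        System.out.println(count(HTML, "div.content > p")); // 1: direct child only
        System.out.println(count(HTML, "h1 + p"));          // 1: immediately following sibling
        System.out.println(count(HTML, "h1 ~ p"));          // 2: all following p siblings
        System.out.println(count(HTML, "div.content, h1")); // 2: union of both selectors
    }
}
```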

Pseudo Selector

·  :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)

·  :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)

·  :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)

·  :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)

·  :not(selector): find elements that do not match the selector; e.g. div:not(.logo)

·  :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)

·  :containsOwn(text): find elements that directly contain the given text

·  :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)

·  :matchesOwn(regex): find elements whose own text matches the specified regular expression

·  Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
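A few of these pseudo-selectors, including the 0-based indexing, can be tried out with a sketch like the following. The class name, helper names, and sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class PseudoSelectorSketch {
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td></tr></table>"
      + "<div><p>Uses JSOUP</p></div><div class='logo'>logo</div>";

    // Hypothetical helper: counts the elements matched by a CSS query.
    static int count(String html, String css) {
        return Jsoup.parse(html).select(css).size();
    }

    // Hypothetical helper: text of the first match, or "" when nothing matches.
    static String firstText(String html, String css) {
        Element e = Jsoup.parse(html).select(css).first();
        return e == null ? "" : e.text();
    }

    public static void main(String[] args) {
        System.out.println(count(HTML, "td:lt(2)"));           // 2: cells at index 0 and 1
        System.out.println(firstText(HTML, "td:eq(1)"));       // b: index 1 is the second cell
        System.out.println(count(HTML, "div:has(p)"));         // 1
        System.out.println(count(HTML, "div:not(.logo)"));     // 1
        System.out.println(count(HTML, "p:contains(jsoup)"));  // 1: match is case-insensitive
    }
}
```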

Extract attributes, text, and HTML from elements

Code:

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

 

Other methods:

Working with URLs

  1. Make sure you specify a base URI when parsing the document (which is implicit when loading from a URL), and
  2. Use the abs: attribute prefix to resolve an absolute URL from an attribute:

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // http://jsoup.org/

Besides the approach above, Node.absUrl(String key) also works:

String absUrl = link.absUrl("href");
