Jsoup study notes
Author: heda
Created: 2013-4-26
The simplest example
String html = "<html><head><title>First parse</title></head>" + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Jsoup makes a best effort to parse the document, even when the HTML contains errors or is not well-formed.
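To make the lenient parsing concrete, here is a minimal sketch that feeds jsoup a deliberately broken snippet. The input string and the class name LenientParseSketch are made up for illustration; only Jsoup.parse, select, and body come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseSketch {
    // Hypothetical broken input: no <html> or <body>, and the <p> and <b> tags are never closed.
    static Document parseBroken() {
        String html = "<p>First paragraph<p>Second <b>bold";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parseBroken();
        // jsoup supplies the missing document structure and closes the open tags.
        System.out.println("paragraphs: " + doc.select("p").size());
        System.out.println("bold text:  " + doc.select("b").first().text());
        System.out.println(doc.body().html());
    }
}
```

Printing doc.body().html() is a quick way to see how jsoup repaired the markup.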
Object model of a document
A Document consists of Elements and TextNodes.
Inheritance chain:
Document extends Element extends Node
TextNode extends Node
An Element contains a list of child nodes and has one parent Element. It also exposes a filtered view of its children that contains only the child Elements.
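The difference between the full child node list and the filtered element list can be seen in a small sketch. The sample markup and the class name NodeModelSketch are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class NodeModelSketch {
    // Returns the <p> element of a small hypothetical document.
    static Element paragraph() {
        return Jsoup.parse("<p>Hello <b>world</b>!</p>").select("p").first();
    }

    public static void main(String[] args) {
        Element p = paragraph();
        // childNodes() includes text nodes; children() keeps only elements.
        for (Node node : p.childNodes()) {
            String kind = (node instanceof TextNode) ? "TextNode" : "Element";
            System.out.println(kind + ": " + node.outerHtml());
        }
        System.out.println("child nodes:    " + p.childNodes().size());
        System.out.println("child elements: " + p.children().size());
        System.out.println("parent tag:     " + p.parent().tagName());
    }
}
```

Here the paragraph has three child nodes (two text nodes around the b element) but only one child element.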
Getting a Document from a String
Jsoup.parse(html);
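Jsoup.parse(html, baseUri);
The base URI is used to resolve relative URLs into absolute URLs.
One open question is how the base URI should be set: is it relative to the current URL?
The documentation says it is the full URL.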
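A small experiment helps with the base URI question. The sketch below assumes a hypothetical page at http://example.com/section/index.html and passes that full URL as the base URI; the class name BaseUriSketch and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriSketch {
    // Hypothetical markup with one root-relative and one document-relative link.
    static final String HTML =
        "<a id='abs' href='/docs/page.html'>A</a><a id='rel' href='other.html'>B</a>";

    // Parses HTML with the given base URI and resolves the href of the link with the given id.
    static String resolve(String baseUri, String id) {
        Document doc = Jsoup.parse(HTML, baseUri);
        Element link = doc.getElementById(id);
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String base = "http://example.com/section/index.html"; // assumed URL of the page itself
        System.out.println(resolve(base, "abs")); // root-relative link
        System.out.println(resolve(base, "rel")); // resolved against the page's directory
        System.out.println("[" + resolve("", "rel") + "]"); // no base URI: nothing to resolve against
    }
}
```

With the full page URL as base, the root-relative link resolves to http://example.com/docs/page.html and the document-relative one to http://example.com/section/other.html. With an empty base URI, absUrl returns an empty string.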
Parsing a body fragment
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
Parsing a URL
Fetch a page directly from a URL and parse it:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
Parsing a file
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Navigating a document: DOM methods
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}
Finding elements
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
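The sibling and graph methods are easiest to follow on a small list. This is a sketch on made-up markup; the class name SiblingSketch is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SiblingSketch {
    // Returns the middle <li> of a hypothetical three-item list.
    static Element second() {
        Document doc = Jsoup.parse("<ul id='list'><li>one</li><li>two</li><li>three</li></ul>");
        return doc.getElementById("list").child(1); // child(int) is 0-based
    }

    public static void main(String[] args) {
        Element li = second();
        System.out.println("previous: " + li.previousElementSibling().text());
        System.out.println("next:     " + li.nextElementSibling().text());
        System.out.println("first:    " + li.firstElementSibling().text());
        System.out.println("last:     " + li.lastElementSibling().text());
        System.out.println("siblings: " + li.siblingElements().size()); // excludes li itself
        System.out.println("parent:   " + li.parent().id());
    }
}
```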
Element data
- attr(String key) to get and attr(String key, String value) to set attributes
- attributes() to get all attributes
- id(), className() and classNames()
- text() to get and text(String value) to set the text content
- html() to get and html(String value) to set the inner HTML content
- outerHtml() to get the outer HTML value
- data() to get data content (e.g. of script and style tags)
- tag() and tagName()
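The accessors above can be tried together on one element. The markup, the title value, and the class name ElementDataSketch below are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ElementDataSketch {
    // Returns the <div id="main"> of a small hypothetical document.
    static Element box() {
        String html = "<div id='main' class='box wide'>"
                    + "<p>Hi <b>there</b></p><script>var x = 1;</script></div>";
        return Jsoup.parse(html).getElementById("main");
    }

    public static void main(String[] args) {
        Element div = box();
        System.out.println(div.id());          // the id attribute
        System.out.println(div.className());   // the raw class attribute
        System.out.println(div.classNames());  // the class names as a set

        div.attr("title", "Greeting");         // set an attribute
        System.out.println(div.attr("title")); // read it back
        System.out.println(div.attributes());  // all attributes

        Element p = div.child(0);
        System.out.println(p.tagName());
        System.out.println(p.text());          // text of p and its descendants
        System.out.println(p.html());          // inner HTML
        System.out.println(p.outerHtml());     // HTML including the p tag itself

        // Script content is data rather than text, so it is read with data().
        System.out.println(div.select("script").first().data());
    }
}
```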
Manipulating HTML and text
- append(String html), prepend(String html)
- appendText(String text), prependText(String text)
- appendElement(String tagName), prependElement(String tagName)
- html(String value)
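A short sketch of how these manipulation methods combine, using a made-up fragment. The class name ManipulationSketch is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ManipulationSketch {
    // Builds a <div> from a hypothetical fragment and then edits it in place.
    static Element build() {
        Element div = Jsoup.parseBodyFragment("<div><p>middle</p></div>").body().child(0);
        div.prepend("<p>first</p>");             // parsed as HTML, becomes the first child
        div.append("<p>last</p>");               // parsed as HTML, becomes the last child
        div.appendElement("span").text("tail");  // creates an empty <span>, then sets its text
        div.child(1).appendText(" & more");      // added as plain text, not parsed as HTML
        div.child(0).html("<em>first</em>");     // replaces the inner HTML of the first <p>
        return div;
    }

    public static void main(String[] args) {
        Element div = build();
        for (Element child : div.children()) {
            System.out.println(child.tagName() + ": " + child.text());
        }
        System.out.println(div.outerHtml());
    }
}
```

The append and prepend variants that take HTML parse their argument, while appendText and prependText treat it as literal text, which matters when the string may contain markup characters.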
Use selector-syntax to find elements
CSS or jQuery-like selector syntax
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]"); // img with src ending .png
Element masthead = doc.select("div.masthead").first(); // div with class=masthead
Elements resultLinks = doc.select("h3.r > a"); // direct a after h3
select is contextual.
select is available on Document, Element, and Elements, which means select calls can be chained!
select returns a list of Elements.
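Chaining can be sketched as follows: the second select only searches inside whatever the first one matched. The markup and the class name ChainedSelectSketch are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ChainedSelectSketch {
    // Hypothetical page with links inside and outside the content block.
    static final String HTML =
        "<div class='content'><a href='/a'>A</a><a>no href</a></div>"
      + "<div class='footer'><a href='/b'>B</a></div>";

    // Links with an href, restricted to div.content.
    static Elements contentLinks() {
        Document doc = Jsoup.parse(HTML);
        Elements content = doc.select("div.content"); // select on a Document
        return content.select("a[href]");             // select on Elements: searches inside the matches only
    }

    // Links with an href anywhere in the document.
    static Elements allLinks() {
        return Jsoup.parse(HTML).select("a[href]");
    }

    public static void main(String[] args) {
        System.out.println("in content: " + contentLinks().size());
        System.out.println("everywhere: " + allLinks().size());
    }
}
```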
Selector overview
- tagname: find elements by tag, e.g. a
- ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
- #id: find elements by ID, e.g. #logo
- .class: find elements by class name, e.g. .masthead
- [attribute]: elements with attribute, e.g. [href]
- [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: elements with attribute value, e.g. [width=500]
- [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
- [attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
- *: all elements, e.g. *
Selector combinations
- el#id: elements with ID, e.g. div#logo
- el.class: elements with class, e.g. div.masthead
- el[attr]: elements with attribute, e.g. a[href]
- Any combination, e.g. a[href].highlight
- ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
- parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
- siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div
- siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p
- el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
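The combinators and attribute selectors above can be checked against a small document. This is a sketch on made-up markup; the class name CombinatorSketch and its count helper are hypothetical.

```java
import org.jsoup.Jsoup;

class CombinatorSketch {
    // Hypothetical markup: a head block, a content block with a nested section, and three images.
    static final String HTML =
        "<div class='head'>Head</div>"
      + "<div class='content'><h1>Title</h1><p>intro</p>"
      + "<section><p>nested</p></section><p>outro</p></div>"
      + "<img src='a.png'><img src='b.JPG'><img src='c.gif'>";

    // Number of elements matched by a selector in the sample document.
    static int count(String css) {
        return Jsoup.parse(HTML).select(css).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "div.content p",    // any p below div.content, at any depth
            "div.content > p",  // only p elements that are direct children
            "h1 ~ p",           // p siblings that come after an h1
            "div.head + div",   // the div immediately after div.head
            "div.head, h1",     // either selector
            "img[src$=.png]",   // src ends with .png
            "img[src~=(?i)\\.(png|jpe?g)]" // src matches a case-insensitive regex
        };
        for (String css : selectors) {
            System.out.println(css + " -> " + count(css));
        }
    }
}
```

Note that the backslash in the regex selector has to be doubled inside a Java string literal.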
Pseudo selectors
- :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
- :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)
- :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)
- :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
- :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
- :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
- :containsOwn(text): find elements that directly contain the given text
- :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
- :matchesOwn(regex): find elements whose own text matches the specified regular expression
Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc.
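A quick way to get a feel for the pseudo selectors is to count matches on a small document. The markup, the class name PseudoSelectorSketch, and its helpers are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

class PseudoSelectorSketch {
    // Hypothetical markup: one table row with four cells, plus two divs.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class='logo'><p>Jsoup rocks</p></div><div><span>plain</span></div>";

    static Elements find(String css) {
        return Jsoup.parse(HTML).select(css);
    }

    // Number of elements matched by a selector in the sample document.
    static int count(String css) {
        return find(css).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "td:lt(2)",               // cells at sibling index 0 and 1
            "td:gt(1)",               // cells at sibling index 2 and 3
            "td:eq(0)",               // the first cell only
            "div:has(p)",             // divs that contain a p
            "div:not(.logo)",         // divs without the logo class
            "p:contains(jsoup)",      // case-insensitive text search
            "div:contains(jsoup)",    // also matches via descendant text
            "div:containsOwn(jsoup)", // only text directly inside the div
            "p:matches((?i)ROCKS)"    // regex against the element text
        };
        for (String css : selectors) {
            System.out.println(css + " -> " + count(css));
        }
    }
}
```

The contrast between div:contains(jsoup) and div:containsOwn(jsoup) shows that the Own variants ignore text that lives in descendant elements.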
Extract attributes, text, and HTML from elements
- To get the value of an attribute, use the Node.attr(String key) method
- For the text on an element (and its combined children), use Element.text()
- For HTML, use Element.html(), or Node.outerHtml() as appropriate
Code:
String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();
String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"
String linkOuterH = link.outerHtml(); // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"
Other methods:
Working with URLs
- Make sure you specify a base URI when parsing the document (which is implicit when loading from a URL), and
- Use the abs: attribute prefix to resolve an absolute URL from an attribute:
Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // http://jsoup.org/
Besides the approach above, Node.absUrl(String key) also works:
String absUrl = link.absUrl("href");