Learning Jsoup

Article link: https://blog.csdn.net/softwarehe/article/details/8855737

Author: heda

Created: 2013-4-26

The simplest example

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";

Document doc = Jsoup.parse(html);

Jsoup makes a best effort to parse the document, even when the HTML contains errors or is non-standard.
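To make the lenient parsing concrete, here is a minimal sketch. The class name LenientParseDemo and the malformed input string are made up for illustration; only Jsoup.parse, select and text come from the library.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LenientParseDemo {
    // Returns the text of every <p> that jsoup finds in the given HTML.
    static List<String> paragraphTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element p : doc.select("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical malformed input: no <html> or <body>, and neither <p> is closed.
        String broken = "<p>First<p>Second";
        System.out.println(paragraphTexts(broken)); // prints [First, Second]
    }
}
```

Jsoup supplies the missing html, head and body elements and closes the first paragraph when the second one starts, so both paragraphs are found.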

Object model of a document

A Document consists of Elements and TextNodes.

Inheritance chain:

Document extends Element extends Node

TextNode extends Node

An Element contains a list of child Nodes and has one parent Element. It also provides a filtered list that contains only its child Elements.
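The difference between the full child node list and the filtered child element list can be sketched as follows. The class name NodeModelDemo and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class NodeModelDemo {
    // Hypothetical sample: a <div> holding two text nodes and one <b> element.
    static Element sampleDiv() {
        Document doc = Jsoup.parse("<div>Hello <b>world</b>!</div>");
        return doc.select("div").first();
    }

    // Counts only the TextNode children of the given element.
    static int countTextNodes(Element el) {
        int count = 0;
        for (Node child : el.childNodes()) {
            if (child instanceof TextNode) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element div = sampleDiv();
        System.out.println(div.childNodes().size()); // 3: "Hello ", <b>, "!"
        System.out.println(div.children().size());   // 1: only the <b> element
        System.out.println(countTextNodes(div));     // 2
        System.out.println(div.parent().tagName());  // body
    }
}
```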

Getting a Document from a String

Jsoup.parse(html);

Jsoup.parse(html, baseUri);

The base URI is used to resolve relative URLs into absolute URLs.

One open question: how should the base URI be set? Is it relative to the current URL?

http://stackoverflow.com/questions/7142187/jsoup-parse-vs-jsoup-parse-or-how-does-url-detection-work-in-jsoup

The answer there says to pass the full URL of the document.
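A small sketch of what that means in practice, assuming a made-up page at http://example.com/docs/page.html (the class name BaseUriDemo, the URLs and the markup are illustrative only):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriDemo {
    // Parses the HTML with the given base URI and returns the absolute form
    // of the first link's href.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"../img/logo.png\">logo</a>";

        // Full URL of the (hypothetical) page the HTML came from.
        System.out.println(resolveFirstLink(html, "http://example.com/docs/page.html"));
        // prints http://example.com/img/logo.png

        // With an empty base URI the relative href cannot be resolved,
        // so absUrl returns an empty string.
        System.out.println(resolveFirstLink(html, "").isEmpty()); // prints true
    }
}
```

The relative href is resolved against the page URL in the same way a browser would do it, which is why the full URL of the page is the natural value for the base URI.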

Parsing a body fragment

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Parsing from a URL

Fetch a page directly from a URL and parse it:

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Parsing a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Navigate document – dom methods

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Finding elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)
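The finder and navigation methods listed above can be tried on a small in-memory document. This is only a sketch: the class name DomNavigationDemo and the markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomNavigationDemo {
    // Hypothetical markup: a content block with a heading and two links.
    static final String HTML =
        "<div id=\"content\">"
        + "<h1 class=\"title\">News</h1>"
        + "<a href=\"/a\">A</a>"
        + "<a href=\"/b\">B</a>"
        + "</div>";

    static Element content() {
        return Jsoup.parse(HTML).getElementById("content");
    }

    public static void main(String[] args) {
        Element content = content();
        Elements links = content.getElementsByTag("a");
        Element firstLink = links.first();

        System.out.println(links.size());                                      // 2
        System.out.println(content.getElementsByClass("title").first().text()); // News
        System.out.println(content.getElementsByAttribute("href").size());      // 2

        // Sibling navigation skips text nodes and returns elements only.
        System.out.println(firstLink.previousElementSibling().tagName());       // h1
        System.out.println(firstLink.nextElementSibling().attr("href"));        // /b

        // Walking the tree up and down.
        System.out.println(content.parent().tagName());                         // body
        System.out.println(content.child(0).tagName());                         // h1
    }
}
```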

Element data

attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()
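A short sketch of the accessors above in use. The class name ElementDataDemo and the sample markup are assumptions; the exact whitespace of html() and outerHtml() output can depend on the output settings, so those two are printed without a claimed result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataDemo {
    // Hypothetical sample: one paragraph with several attributes, plus a script.
    static Document sample() {
        return Jsoup.parse(
            "<p id=\"intro\" class=\"lead note\" title=\"old\">An <b>example</b></p>"
            + "<script>var x = 1;</script>");
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element p = doc.getElementById("intro");

        System.out.println(p.id());          // intro
        System.out.println(p.tagName());     // p
        System.out.println(p.className());   // lead note
        System.out.println(p.classNames().contains("note")); // true
        System.out.println(p.attr("title")); // old

        // The setter form returns the element, so calls can be chained.
        p.attr("title", "new");
        System.out.println(p.attr("title")); // new

        System.out.println(p.text());        // An example
        System.out.println(p.html());        // inner HTML of the paragraph
        System.out.println(p.outerHtml());   // the paragraph including its own tag

        // Script content is data, not text.
        System.out.println(doc.select("script").first().data()); // var x = 1;

        // Setting the text replaces all existing child nodes.
        p.text("Replaced");
        System.out.println(p.children().isEmpty()); // true
    }
}
```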

Manipulating HTML and text

Use selector-syntax to find elements

CSS or jquery-like selector syntax

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png

Element masthead = doc.select("div.masthead").first();
  // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of h3.r

select is contextual

select is available on Document, Element, and Elements, so select calls can be chained.

The return value of select is an Elements list.
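Chaining can be sketched like this. The class name SelectChainDemo and the markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectChainDemo {
    // Hypothetical page with links inside and outside the masthead.
    static Document sample() {
        return Jsoup.parse(
            "<div class=\"masthead\"><a href=\"/home\">Home</a><a>No href</a></div>"
            + "<div class=\"footer\"><a href=\"/about\">About</a></div>");
    }

    public static void main(String[] args) {
        Document doc = sample();

        // Document.select searches the whole document.
        Elements allLinks = doc.select("a[href]");
        System.out.println(allLinks.size()); // 2

        // Elements.select narrows the search to the previous matches.
        Elements mastheadLinks = doc.select("div.masthead").select("a[href]");
        System.out.println(mastheadLinks.size());         // 1
        System.out.println(mastheadLinks.first().text()); // Home

        // Element.select searches below a single element.
        Element masthead = doc.select("div.masthead").first();
        System.out.println(masthead.select("a").size());  // 2
    }
}
```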

Selector overview

tagname: find elements by tag, e.g. a

ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements

#id: find elements by ID, e.g. #logo

.class: find elements by class name, e.g. .masthead

[attribute]: elements with attribute, e.g. [href]

[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes

[attr=value]: elements with attribute value, e.g. [width=500]

[attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]

*: all elements, e.g. *
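The basic selectors above can be checked against a small document. This is a sketch under assumptions: the class name SelectorOverviewDemo, the helper count and the markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorOverviewDemo {
    // Hypothetical markup that exercises each basic selector once.
    static final Document DOC = Jsoup.parse(
        "<div id=\"logo\" class=\"masthead\" data-role=\"banner\">"
        + "<img src=\"a.png\" width=\"500\">"
        + "<img src=\"b.JPG\">"
        + "<img src=\"c.gif\">"
        + "<a href=\"/path/to/x\">x</a>"
        + "<a name=\"anchor\">y</a>"
        + "</div>");

    // Number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("a"));               // 2: by tag name
        System.out.println(count("#logo"));           // 1: by id
        System.out.println(count(".masthead"));       // 1: by class
        System.out.println(count("[href]"));          // 1: has the attribute
        System.out.println(count("[^data-]"));        // 1: attribute name prefix
        System.out.println(count("[width=500]"));     // 1: exact attribute value
        System.out.println(count("[href*=/path/]"));  // 1: value contains
        System.out.println(count("img[src$=.png]"));  // 1: value ends with
        // The backslash is doubled only because this is a Java string literal.
        System.out.println(count("img[src~=(?i)\\.(png|jpe?g)]")); // 2: regex, ignoring case
    }
}
```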

Selector combinations

el#id: elements with ID, e.g. div#logo

el.class: elements with class, e.g. div.masthead

el[attr]: elements with attribute, e.g. a[href]

Any combination, e.g. a[href].highlight

ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"

parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag

siblingA + siblingB: finds sibling B element immediately preceded by sibling A, e.g. div.head + div

siblingA ~ siblingX: finds sibling X element preceded by sibling A, e.g. h1 ~ p

el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.masthead, div.logo
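The difference between the combinators is easiest to see by counting matches on one document. The class name CombinatorDemo, the helper count and the markup below are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    // Hypothetical markup: one nested block, then a heading followed by siblings.
    static final Document DOC = Jsoup.parse(
        "<div class=\"content\"><p>direct</p><section><p>nested</p></section></div>"
        + "<h1>Title</h1><p>after1</p><span>s</span><p>after2</p>");

    // Number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));   // 2: any depth below div.content
        System.out.println(count("div.content > p")); // 1: direct children only
        System.out.println(count("h1 + p"));          // 1: the p right after h1
        System.out.println(count("h1 ~ p"));          // 2: every later sibling p
        System.out.println(count("h1, span"));        // 2: either selector
        System.out.println(count("body > *"));        // 5: direct children of body
    }
}
```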

Pseudo Selector

· :lt(n): find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)

· :gt(n): find elements whose sibling index is greater than n; e.g. div p:gt(2)

· :eq(n): find elements whose sibling index is equal to n; e.g. form input:eq(1)

· :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)

· :not(selector): find elements that do not match the selector; e.g. div:not(.logo)

· :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)

· :containsOwn(text): find elements that directly contain the given text

· :matches(regex): find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)

· :matchesOwn(regex): find elements whose own text matches the specified regular expression

· Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, etc
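A sketch that exercises the pseudo selectors, including the difference between the plain and the "Own" variants. The class name PseudoSelectorDemo, the helper count and the markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorDemo {
    // Hypothetical markup: a four-cell table row and three div blocks.
    static final Document DOC = Jsoup.parse(
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div class=\"logo\"><p>Login here</p></div>"
        + "<div><span>jsoup rocks</span></div>"
        + "<div>plain</div>");

    // Number of elements matched by a CSS query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Indexed selectors are 0-based.
        System.out.println(count("td:lt(3)"));                    // 3: cells a, b, c
        System.out.println(count("td:gt(2)"));                    // 1: cell d
        System.out.println(DOC.select("td:eq(1)").first().text()); // b

        System.out.println(count("div:has(p)"));      // 1
        System.out.println(count("div:not(.logo)"));  // 2

        // :contains looks at the element's full text, :containsOwn only at its own text.
        System.out.println(count("div:contains(JSOUP)"));     // 1: case-insensitive
        System.out.println(count("div:containsOwn(jsoup)"));  // 0: the text belongs to the span
        System.out.println(count("span:containsOwn(jsoup)")); // 1

        // Same distinction for the regex variants.
        System.out.println(count("div:matches((?i)login)"));    // 1
        System.out.println(count("div:matchesOwn((?i)login)")); // 0
        System.out.println(count("p:matchesOwn((?i)login)"));   // 1
    }
}
```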

Extract attributes, text, and HTML from elements

To get the value of an attribute, use the Node.attr(String key) method
For the text on an element (and its combined children), use Element.text()
For HTML, use Element.html(), or Node.outerHtml() as appropriate

code:

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Other methods:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)

Working with URLs

Make sure you specify a base URI when parsing the document (which is implicit when loading from a URL), and
Use the abs: attribute prefix to resolve an absolute URL from an attribute:

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // http://jsoup.org/

Besides the approach above, Node.absUrl(String key) also works:

String absUrl = link.absUrl("href");
