Parsing Web Pages with Jsoup

Latest recommended article published on 2023-09-21 17:38:18

FlyingPrgApe

Category: Android Networking. Tags: Jsoup, web page parsing

Article link: https://blog.csdn.net/ccp1994/article/details/23191545

1. HTML basics

1. HTML document structure

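<html>
<head><title>School of Computer Science</title></head>
<body>
<div>
<a><span>Notice on announcing the list of students changing majors in semester 2 of 2013-2014 and on carrying out related work</span></a>
<a><span>Textbook Office notice on collecting textbooks for semester 2 of 2013-2014</span></a>
</div>
</body>
</html>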

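Notes:

1. Element: every tag in the document, such as <head>, <body>, <div>, <a> and <span>, is an Element.

2. Node: the more general type. Element is a subclass of Node, and the text inside each <span> is a TextNode, which is a Node but not an Element.

3. Sibling elements: <head> and <body> share the parent <html>, so they are sibling Elements.

4. Sibling nodes: the first and second <a> inside the <div> are sibling Nodes (and, since both are tags, sibling Elements as well).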

The tree of this HTML document is as follows:
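A tree like this can also be printed by jsoup itself. The following is a minimal sketch, assuming jsoup is on the classpath. The class name `DomTreePrinter` and the sample markup (the same shape as the document above, with placeholder text) are illustrative assumptions, not part of the original article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomTreePrinter {
    // Append one line per element, indented two spaces per nesting level.
    static void render(Element el, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(el.tagName()).append('\n');
        for (Element child : el.children()) {
            render(child, depth + 1, out);
        }
    }

    // Parse the markup and render the element tree starting at <html>.
    static String tree(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder out = new StringBuilder();
        render(doc.child(0), 0, out); // the Document's first child element is <html>
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample markup with placeholder text.
        String html = "<html><head><title>t</title></head><body><div>"
                + "<a><span>first</span></a><a><span>second</span></a>"
                + "</div></body></html>";
        System.out.print(tree(html));
    }
}
```

Only elements are printed here; iterating over `childNodes()` instead of `children()` would also include text nodes.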

2. Loading an HTML document

1. Parsing an HTML document from a string

String html = "<html><head><title>Hello</title></head><body><p>Who am I</p></body></html>";
Document doc = Jsoup.parse(html);
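Once the string is parsed, the resulting Document can be queried right away. A minimal sketch, assuming jsoup is on the classpath; the class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStringExample {
    // Return the contents of the <title> element.
    static String title(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Return the text of the first <p> element.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("p").first().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Who am I</p></body></html>";
        System.out.println(title(html));
        System.out.println(firstParagraph(html));
    }
}
```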

2. Loading an HTML document directly from a URL

GET request:

try {
    Document doc = Jsoup.connect(BASIC_URL).get();
} catch (IOException e) {
    Log.e(tag, e.toString());
}

POST request:

try {
    Document doc = Jsoup.connect(BASIC_URL)
            .data("query", "Java")   // request parameter
            .userAgent("Mozilla")    // set the User-Agent header
            .cookie("auth", "token") // set a cookie
            .timeout(60 * 1000)      // timeout in milliseconds
            .post();
} catch (IOException e) {
    Log.e(tag, e.toString());
}
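The request settings can be inspected before anything is sent, which is handy for debugging. This is a sketch under assumptions: the URL `http://example.com/search` and the class name `RequestPreview` are placeholders, and no network call is made until `execute()`, `get()` or `post()` is invoked.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class RequestPreview {
    // Configure a POST request but do not send it.
    static Connection.Request build(String url) {
        Connection con = Jsoup.connect(url)
                .data("query", "Java")
                .userAgent("Mozilla")
                .cookie("auth", "token")
                .timeout(60 * 1000)
                .method(Connection.Method.POST);
        return con.request();
    }

    public static void main(String[] args) {
        Connection.Request req = build("http://example.com/search"); // placeholder URL
        System.out.println(req.method() + " " + req.url());
        System.out.println("timeout ms: " + req.timeout());
        System.out.println("cookie auth: " + req.cookie("auth"));
        for (Connection.KeyVal kv : req.data()) {
            System.out.println("param " + kv.key() + "=" + kv.value());
        }
    }
}
```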

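3. Loading from a file

File file = new File(filePath);
// The third argument is the base URI used to resolve relative links, so it should be an absolute URL.
Document doc = Jsoup.parse(file, "UTF-8", "http://www.suse.edu.cn/");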
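The base URI argument matters when the file contains relative links. A minimal sketch, assuming jsoup is on the classpath; the temporary file, the link `/news/1.html` and the class name are invented for the demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileExample {
    // Write a tiny page to a temp file, parse it, and resolve its first link.
    static String resolveFromTempFile(String baseUri) {
        try {
            File tmp = File.createTempFile("jsoup-demo", ".html");
            tmp.deleteOnExit();
            String page = "<a href=\"/news/1.html\">news</a>"; // hypothetical relative link
            Files.write(tmp.toPath(), page.getBytes(StandardCharsets.UTF_8));

            Document doc = Jsoup.parse(tmp, "UTF-8", baseUri);
            // absUrl() combines the base URI with the relative href.
            return doc.select("a").first().absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveFromTempFile("http://www.suse.edu.cn/"));
    }
}
```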

2.1 Setting HTTP headers

Connection con1 = Jsoup.connect(BASIC_URL);
con1.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
// sdch is left out: jsoup decodes gzip and deflate responses only.
con1.header("Accept-Encoding", "gzip,deflate");
con1.header("Referer", "http://www.suse.edu.cn/");
con1.header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6");
con1.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36");
// "(Request-Line)" is a label that debugging tools show for the first line of a request.
// It is not a header; the method and URL are set through connect() and method().
con1.header("Cache-Control", "no-cache");
con1.header("Connection", "Keep-Alive");
con1.header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
// Host takes a host name, not a URL; it is normally derived from the request URL.
con1.header("Host", "www.suse.edu.cn");
// execute() throws IOException, so call it inside a try block.
Connection.Response re = con1.ignoreContentType(true).method(Connection.Method.GET).execute();
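When many headers are involved, keeping them in a map and applying them in a loop keeps the setup readable. This is a sketch, not the article's own code: the class name `HeaderExample` and the reduced header set are assumptions, and the request is only built, never executed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class HeaderExample {
    // Build a GET request with a set of headers, without sending it.
    static Connection.Request build(String url) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        headers.put("Referer", "http://www.suse.edu.cn/");
        headers.put("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6");
        headers.put("Cache-Control", "no-cache");

        Connection con = Jsoup.connect(url)
                .ignoreContentType(true)
                .method(Connection.Method.GET);
        for (Map.Entry<String, String> h : headers.entrySet()) {
            con.header(h.getKey(), h.getValue());
        }
        return con.request();
    }

    public static void main(String[] args) {
        Connection.Request req = build("http://www.suse.edu.cn/");
        System.out.println(req.method());
        System.out.println("Referer: " + req.header("Referer"));
        System.out.println("Cache-Control: " + req.header("Cache-Control"));
    }
}
```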

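3. Modifying data

doc.select("div.comments a").attr("rel", "nofollow");  // add rel="nofollow" to every link in div.comments
doc.select("div.comments a").addClass("mylinkclass");  // add the class mylinkclass to every link in div.comments
doc.select("img").removeAttr("onclick");               // remove the onclick attribute from every image
doc.select("input[type=text]").val("");                // clear every text input

Note: after these changes, call html() on the Document to get the modified HTML document. Calling html() on an Elements object returns only the inner HTML of the matched elements.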
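The four modifications can be tried on a small fragment. A minimal sketch, assuming jsoup is on the classpath; the sample markup and the class name `ModifyExample` are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyExample {
    // Apply the modifications and return the resulting body HTML.
    static String modify(String html) {
        Document doc = Jsoup.parse(html);
        // The setters return the same Elements, so calls can be chained.
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc.body().html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"comments\"><a href=\"http://example.com/\">link</a></div>"
                + "<img src=\"a.png\" onclick=\"track()\">"
                + "<input type=\"text\" value=\"old\">";
        System.out.println(modify(html));
    }
}
```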

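4. Document cleaning: sanitizing untrusted HTML

String safe = Jsoup.clean(unsafe, Whitelist.basic());

The Whitelist class decides which tags and attributes survive the cleaning (newer jsoup releases rename it to Safelist). Its commonly used factory methods are: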

static Whitelist	basic() This whitelist allows a fuller range of text nodes: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, and appropriate attributes.
static Whitelist	basicWithImages() This whitelist allows the same text tags as basic(), and also allows img tags, with appropriate attributes, with src pointing to http or https.
static Whitelist	none() This whitelist allows only text nodes: all HTML will be stripped.
static Whitelist	relaxed() This whitelist allows a full range of text and structural body HTML: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul Links do not have an enforced rel=nofollow attribute, but you can add that if desired.
static Whitelist	simpleText() This whitelist allows only simple text formatting: b, em, i, strong, u.
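The difference between these presets is easiest to see on a hostile fragment. This sketch assumes a jsoup release that ships `org.jsoup.safety.Safelist`, the newer name of the same class; on an older release, replace `Safelist` with `Whitelist`. The class name and input are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    // Keep basic text formatting and links; drop scripts and event handlers.
    static String cleanBasic(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.basic());
    }

    // Strip all markup and keep only the text.
    static String textOnly(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.none());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href=\"http://example.com/\" onclick=\"stealCookies()\">Link</a>"
                + "<script>alert(1)</script></p>";
        System.out.println(cleanBasic(unsafe));
        System.out.println(textOnly(unsafe));
    }
}
```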

5. Extracting HTML content

5.1 DOM parsing

5.1.1 Common methods

Document class

Element	body() Accessor to the document's body element.
Element	createElement(String tagName) Create a new Element, with this document's base uri.
Element	head() Accessor to the document's head element.
String	nodeName() Get the node name of this node.
Document	normalise() Normalise the document.
Element	text(String text) Set the text of the body of this document.
String	title() Get the string contents of the document's title element.
void	title(String title) Set the document's title element.
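A few of these Document methods in action. A minimal sketch, assuming jsoup is on the classpath; the markup and the class name `DocumentMethodsExample` are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DocumentMethodsExample {
    // Parse a page, change its title, and append a newly created element.
    static Document build() {
        Document doc = Jsoup.parse(
                "<html><head><title>Old title</title></head><body><p>Hi</p></body></html>");
        doc.title("New title");                  // replace the <title> contents
        Element note = doc.createElement("div"); // created with the document's base URI
        note.text("added");
        doc.body().appendChild(note);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.title());
        System.out.println(doc.nodeName());
        System.out.println(doc.head().children().size());
        System.out.println(doc.body().children().size());
    }
}
```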

Element class

Element	child(int index) Get a child element of this element, by its 0-based index number.
Elements	children() Get this element's child elements.
String	data() Get the combined data of this element.
List<DataNode>	dataNodes() Get this element's child data nodes.
Map<String,String>	dataset() Get this element's HTML5 custom data attributes.
Integer	elementSiblingIndex() Get the list index of this element in its element sibling list.
Element	empty() Remove all of the element's child nodes.
Element	firstElementSibling() Gets the first element sibling of this element.
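The Element methods listed so far can be exercised on a short list. This is a sketch under assumptions: the markup, the `data-` attributes and the class name `ElementMethodsExample` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ElementMethodsExample {
    static final String HTML = "<ul id=\"list\">"
            + "<li data-id=\"1\" data-kind=\"a\">one</li><li>two</li><li>three</li></ul>"
            + "<script>var x = 1;</script>";

    // Collect a few observations about the list as label=value strings.
    static List<String> facts() {
        Document doc = Jsoup.parse(HTML);
        Element ul = doc.getElementById("list");
        Element second = ul.child(1); // child() is 0-based

        List<String> out = new ArrayList<>();
        out.add("children=" + ul.children().size());
        out.add("second=" + second.text());
        out.add("index=" + second.elementSiblingIndex());
        out.add("first=" + second.firstElementSibling().text());
        out.add("data-id=" + ul.child(0).dataset().get("id")); // keys drop the data- prefix
        out.add("script=" + doc.select("script").first().data());
        ul.empty(); // removes every child node of the list
        out.add("afterEmpty=" + ul.children().size());
        return out;
    }

    public static void main(String[] args) {
        for (String fact : facts()) {
            System.out.println(fact);
        }
    }
}
```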

Getting Elements

Elements	getAllElements() Find all elements under this element (including self, and children of children).
Element	getElementById(String id) Find an element by ID, including or under this element.
Elements	getElementsByAttribute(String key) Find elements that have a named attribute set.
Elements	getElementsByAttributeStarting(String keyPrefix) Find elements that have an attribute name starting with the supplied prefix.
Elements	getElementsByAttributeValue(String key, String value) Find elements that have an attribute with the specific value.
Elements	getElementsByAttributeValueContaining(String key, String match) Find elements that have attributes whose value contains the match string.
Elements	getElementsByAttributeValueEnding(String key, String valueSuffix) Find elements that have attributes that end with the value suffix.
Elements	getElementsByAttributeValueMatching(String key, Pattern pattern) Find elements that have attributes whose values match the supplied regular expression.
Elements	getElementsByAttributeValueMatching(String key, String regex) Find elements that have attributes whose values match the supplied regular expression.
Elements	getElementsByAttributeValueNot(String key, String value) Find elements that either do not have this attribute, or have it with a different value.
Elements	getElementsByAttributeValueStarting(String key, String valuePrefix) Find elements that have attributes that start with the value prefix.
Elements	getElementsByClass(String className) Find elements that have this class, including or under this element.
Elements	getElementsByIndexEquals(int index) Find elements whose sibling index is equal to the supplied index.
Elements	getElementsByIndexGreaterThan(int index) Find elements whose sibling index is greater than the supplied index.
Elements	getElementsByIndexLessThan(int index) Find elements whose sibling index is less than the supplied index.
Elements	getElementsByTag(String tagName) Finds elements, including and recursively under this element, with the specified tag name.
Elements	getElementsContainingOwnText(String searchText) Find elements that directly contain the specified string.
Elements	getElementsContainingText(String searchText) Find elements that contain the specified string.
Elements	getElementsMatchingOwnText(Pattern pattern) Find elements whose own text matches the supplied regular expression.
Elements	getElementsMatchingOwnText(String regex) Find elements whose own text matches the supplied regular expression.
Elements	getElementsMatchingText(Pattern pattern) Find elements whose text matches the supplied regular expression.
Elements	getElementsMatchingText(String regex) Find elements whose text matches the supplied regular expression.
boolean	hasClass(String className) Tests if this element has a class.
int	hashCode()
boolean	hasText() Test if this element has any text content (that is not just whitespace).
String	html() Retrieves the element's inner HTML.
Element	html(String html) Set this element's inner HTML.
String	id() Get the id attribute of this element.
Element	insertChildren(int index, Collection<? extends Node> children) Inserts the given child nodes into this element at the specified index.
boolean	isBlock() Test if this element is a block-level element.
Element	lastElementSibling() Gets the last element sibling of this element
Element	nextElementSibling() Gets the next sibling element of this element.
String	nodeName() Get the node name of this node.
Element	parent() Gets this node's parent node.
Elements	parents() Get this element's parent and ancestors, up to the document root.
Element	prepend(String html) Add inner HTML into this element.
Element	prependChild(Node child) Add a node to the start of this element's children.
Element	prependElement(String tagName) Create a new element by tag name, and add it as the first child.
Element	prependText(String text) Create and prepend a new TextNode to this element.
Element	previousElementSibling() Gets the previous element sibling of this element.
Element	removeClass(String className) Remove a class name from this element's class attribute.
Elements	select(String cssQuery) Find elements that match the Selector CSS query, with this element as the starting context.
Elements	siblingElements() Get sibling elements.
Tag	tag() Get the Tag for this element.
String	tagName() Get the name of the tag for this element.
Element	tagName(String tagName) Change the tag of this element.
String	text() Gets the combined text of this element and all its children.
Element	text(String text) Set the text of this element.
List<TextNode>	textNodes() Get this element's child text nodes.
String	val() Get the value of a form element (input, textarea, etc).
Element	val(String value) Set the value of a form element (input, textarea, etc).
Element	wrap(String html) Wrap the supplied HTML around this element.
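Several of the lookup methods above, side by side with the equivalent CSS selector. A minimal sketch, assuming jsoup is on the classpath; the markup, the link targets and the class name `LookupExample` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LookupExample {
    static final String HTML = "<div id=\"news\">"
            + "<a class=\"item\" href=\"/a.html\">Transfer list notice</a>"
            + "<a class=\"item\" href=\"/b.pdf\">Textbook notice</a>"
            + "<p>other</p></div>";

    // Run a handful of lookups and record the results as label=value strings.
    static List<String> facts() {
        Document doc = Jsoup.parse(HTML);
        Element news = doc.getElementById("news");
        Element first = news.child(0);

        List<String> out = new ArrayList<>();
        out.add("byClass=" + news.getElementsByClass("item").size());
        out.add("byTag=" + news.getElementsByTag("a").size());
        out.add("pdf=" + news.getElementsByAttributeValueEnding("href", ".pdf").first().text());
        // The text search is case-insensitive.
        out.add("ownText=" + news.getElementsContainingOwnText("transfer").size());
        // The same PDF lookup expressed as a CSS query.
        out.add("css=" + doc.select("div#news > a.item[href$=.pdf]").size());
        out.add("next=" + first.nextElementSibling().text());
        out.add("parent=" + first.parent().id());
        out.add("hasClass=" + first.hasClass("item"));
        return out;
    }

    public static void main(String[] args) {
        for (String fact : facts()) {
            System.out.println(fact);
        }
    }
}
```

For anything beyond a single condition, `select()` with a CSS query is usually shorter than chaining the `getElementsBy...` methods.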