1. HTML Basics
1. HTML document structure
<html>
    <head><title>计算机学院</title></head>
    <body>
        <div id="Notice">
            <a href="Article.aspx?t=5&id=9297"><span>关于公布2013-2014学年第2学期转专业学生名单及做好相关工作的通 </span></a>
            <a href="Article.aspx?t=5&id=9296"><span>教材科关于2013-2014学年第2学期领取教材有关事宜的通知 </span></a>
        </div>
    </body>
</html>
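Notes:
1. Element: the <head>, <body>, and <div> tags.
2. Node: the <a> and <span> tags. (In jsoup, Element is a subclass of Node, so <a> and <span> are Elements as well; Node additionally covers text nodes and comments.)
3. SiblingElements: the <head> and <body> tags are sibling Elements of each other.
4. SiblingNode: the first <a> and the second <a> inside the <div> are sibling Nodes of each other.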
The tree for this HTML is as follows:
html
    head
        title
    body
        div#Notice
            a
                span
            a
                span
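The sibling relationships in the notes can be checked directly with jsoup. The following is a minimal sketch, assuming jsoup is on the classpath; the class name `SiblingDemo`, its method names, and the placeholder link texts are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingDemo {
    // Markup modelled on the notice list; the link texts are placeholders.
    static final String HTML =
        "<html><head><title>t</title></head><body>"
        + "<div id=\"Notice\">"
        + "<a href=\"Article.aspx?t=5&amp;id=9297\"><span>first</span></a>"
        + "<a href=\"Article.aspx?t=5&amp;id=9296\"><span>second</span></a>"
        + "</div></body></html>";

    // Tag names of the elements that share a parent with <head>.
    static String siblingTagsOfHead() {
        Document doc = Jsoup.parse(HTML);
        StringBuilder sb = new StringBuilder();
        for (Element e : doc.head().siblingElements()) {
            sb.append(e.tagName()).append(' ');
        }
        return sb.toString().trim();
    }

    // Text of the <a> that follows the first <a> inside div#Notice.
    static String nextLinkText() {
        Document doc = Jsoup.parse(HTML);
        Element firstLink = doc.getElementById("Notice").child(0);
        return firstLink.nextElementSibling().text();
    }

    public static void main(String[] args) {
        System.out.println(siblingTagsOfHead());
        System.out.println(nextLinkText());
    }
}
```

`siblingElements()` leaves out the element it is called on, so for <head> only <body> is reported.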
2. Loading an HTML document
1. Parsing an HTML document from a string
    String html = "<html><head><title>你好</title></head><body><p>我是谁</p></body></html>";
    Document doc = Jsoup.parse(html);
2. Loading an HTML document directly from a URL
    // GET request
    try {
        Document doc = Jsoup.connect(BASIC_URL).get();
    } catch (IOException e) {
        Log.e(tag, e.toString());
    }
    // POST request
    try {
        Document doc = Jsoup.connect(BASIC_URL)
                .data("query", "Java")    // request parameter
                .userAgent("Mozilla")     // set the User-Agent header
                .cookie("auth", "token")  // set a cookie
                .timeout(60 * 1000)       // timeout in milliseconds
                .post();
    } catch (IOException e) {
        Log.e(tag, e.toString());
    }
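3. Loading from a file
    File file = new File(filePath);
    // The third argument is the base URI used to resolve relative links, so it should include the scheme.
    // Jsoup.parse(File, ...) throws IOException if the file cannot be read.
    Document doc = Jsoup.parse(file, "UTF-8", "http://www.suse.edu.cn/");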
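The base URI matters as soon as relative links have to be followed. Below is a small sketch of how a relative `href` can be resolved against it, assuming jsoup is on the classpath; `BaseUriDemo` and `firstAbsoluteLink` are illustrative names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Parses the markup with the given base URI and returns the first link
    // as an absolute URL, or "" when the markup contains no <a> element.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"Article.aspx?t=5&amp;id=9297\">notice</a>";
        System.out.println(firstAbsoluteLink(html, "http://www.suse.edu.cn/"));
    }
}
```

The same resolution applies to documents loaded with `Jsoup.parse(file, charset, baseUri)`; documents fetched with `Jsoup.connect(...)` use the request URL as their base URI.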
2.1 Setting HTTP header information
Connection con1 = Jsoup.connect(BASIC_URL);
con1.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
con1.header("Accept-Encoding", "gzip,deflate,sdch");
con1.header("Referer", "http://www.suse.edu.cn/");
con1.header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6");
con1.header("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36");
// "(Request-Line)" is not an HTTP header; it is how traffic-capture tools label the first line of a request.
// It is therefore not set here. The request method is chosen with method(...) below.
con1.header("Cache-Control", "no-cache");
con1.header("Connection", "Keep-Alive");
con1.header("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
con1.header("Host", "www.suse.edu.cn"); // host name only, without scheme or path
Connection.Response re = con1.ignoreContentType(true).method(Connection.Method.GET).execute();
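Because every setter on `Connection` returns the connection itself, the same configuration can be written as one chain and inspected before anything is sent. The following is a sketch under that assumption; `HeaderDemo` and `prepare` are illustrative names, and the shortened User-Agent and the ten-second timeout are arbitrary choices.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class HeaderDemo {
    // Builds a configured connection without sending the request.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .header("Referer", "http://www.suse.edu.cn/")
                .header("Accept-Language", "zh-CN,zh;q=0.8,en;q=0.6")
                .userAgent("Mozilla/5.0")      // stored as the User-Agent header
                .timeout(10 * 1000)            // milliseconds
                .ignoreContentType(true)
                .method(Connection.Method.GET);
    }

    public static void main(String[] args) {
        Connection con = prepare("http://www.suse.edu.cn/");
        // request() exposes what will be sent once execute() is called.
        System.out.println(con.request().header("Referer"));
        System.out.println(con.request().method());
    }
}
```

Calling `execute()` on the returned connection performs the request and returns a `Connection.Response`; `parse()` on that response yields the `Document`.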
3. Modifying data
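doc.select("div.comments a").attr("rel", "nofollow");  // add rel="nofollow" to every link
doc.select("div.comments a").addClass("mylinkclass");  // add the class mylinkclass to every link
doc.select("img").removeAttr("onclick");               // remove the onclick attribute from every image
doc.select("input[type=text]").val("");                // clear the text in every text input
Note: after the changes, call html() on the Document to get the modified HTML document. (Calling html() on an Elements object returns only the inner HTML of the matched elements.)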
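The four modifications can be tried on a small fragment. This is a minimal sketch, assuming jsoup is on the classpath; the class `ModifyDemo` and the sample markup are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDemo {
    // Applies the modifications from this section and returns the changed document.
    static Document modify(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("div.comments a")
           .attr("rel", "nofollow")      // setters return the same Elements, so calls chain
           .addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        return doc;
    }

    public static void main(String[] args) {
        String html = "<div class=\"comments\"><a href=\"/user\">user</a></div>"
                + "<img src=\"a.png\" onclick=\"track()\">"
                + "<input type=\"text\" value=\"old\">";
        // body().html() prints only the content of <body>; doc.html() prints the whole document.
        System.out.println(modify(html).body().html());
    }
}
```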
4. Document cleaning: removing unwanted content
String safe = Jsoup.clean(unsafe, Whitelist.basic());
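The Whitelist class determines which tags and attributes survive the filtering (jsoup 1.14.1 and later name the same class Safelist). Its commonly used preset methods are:
none(): keeps text only and removes all HTML
simpleText(): keeps simple text formatting: b, em, i, strong, u
basic(): keeps common text tags such as a, b, blockquote, br, code, em, i, li, ol, p, pre, span, strong, ul, and forces rel="nofollow" on links
basicWithImages(): the same as basic(), plus img tags with an http or https src
relaxed(): keeps a wide range of text and structural tags, including headings, div, tables, and img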
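A short sketch of what the basic preset does to untrusted markup. It assumes jsoup 1.14.1 or later, where the class is called `Safelist`; on older releases substitute `Whitelist`, as in the line above. `CleanDemo` and `sanitize` are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    // Keeps only the tags and attributes allowed by the basic preset.
    static String sanitize(String unsafe) {
        return Jsoup.clean(unsafe, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p>Hi <a href=\"http://example.com/\" onclick=\"steal()\">link</a>"
                + "<script>alert(1)</script></p>";
        // The script element and the onclick attribute are dropped;
        // the link is kept and receives rel="nofollow".
        System.out.println(sanitize(unsafe));
    }
}
```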
5. Extracting HTML content
5.1 DOM parsing
5.1.1 Common methods
body()
createElement(String tagName)
head()
nodeName()
normalise()
text(String text)
title()
child(int index)
children()
data()
dataNodes()
dataset()
elementSiblingIndex()
empty()
firstElementSibling()
Getting Elements
getAllElements()
getElementById(String id)
getElementsByAttribute(String key)
getElementsByAttributeStarting(String keyPrefix)
getElementsByAttributeValue(String key, String value)
getElementsByAttributeValueContaining(String key, String match)
getElementsByAttributeValueEnding(String key, String valueSuffix)
getElementsByAttributeValueMatching(String key, Pattern pattern)
getElementsByAttributeValueMatching(String key, String regex)
getElementsByAttributeValueNot(String key, String value)
getElementsByAttributeValueStarting(String key, String valuePrefix)
getElementsByClass(String className)
getElementsByIndexEquals(int index)
getElementsByIndexGreaterThan(int index)
getElementsByIndexLessThan(int index)
getElementsByTag(String tagName)
getElementsContainingOwnText(String searchText)
getElementsContainingText(String searchText)
getElementsMatchingOwnText(Pattern pattern)
getElementsMatchingOwnText(String regex)
getElementsMatchingText(Pattern pattern)
getElementsMatchingText(String regex)
boolean hasClass(String className)
int hashCode()
boolean hasText()
html()
id()
insertChildren(int index, Collection<? extends Node> children)
boolean isBlock()
lastElementSibling()
nextElementSibling()
nodeName()
parent()
parents()
prependChild(Node child)
prependElement(String tagName)
prependText(String text)
previousElementSibling()
removeClass(String className)
select(String cssQuery)
siblingElements()
tag()
tagName()
text()
textNodes()
val()
val(String value)
wrap(String html)
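A few of these methods combined, as a sketch of DOM-style extraction from a notice list like the one in section 1. It assumes jsoup is on the classpath; `DomDemo`, `noticeTitles`, and the sample markup are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomDemo {
    // Returns "href -> text" for every link inside the element with id "Notice".
    static List<String> noticeTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        Element notice = doc.getElementById("Notice"); // null when the id is absent
        if (notice == null) {
            return titles;
        }
        for (Element link : notice.getElementsByTag("a")) {
            // text() gathers the text of the link and its children, trimmed.
            titles.add(link.attr("href") + " -> " + link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        String html = "<div id=\"Notice\">"
                + "<a href=\"a.aspx\"><span>first </span></a>"
                + "<a href=\"b.aspx\"><span>second</span></a></div>";
        for (String t : noticeTitles(html)) {
            System.out.println(t);
        }
    }
}
```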
Selectors
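The `select(String cssQuery)` method listed above accepts CSS-style queries. The sketch below assumes jsoup is on the classpath; `SelectorDemo`, `hrefs`, and the sample markup are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {
    // Returns the href of every element matched by the CSS query.
    static List<String> hrefs(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            out.add(e.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div id=\"Notice\">"
                + "<a href=\"Article.aspx?id=1\">in list</a>"
                + "<a href=\"Other.aspx\">other</a></div>"
                + "<a href=\"Article.aspx?id=2\">outside</a>";
        // Direct <a> children of div#Notice whose href starts with "Article".
        System.out.println(hrefs(html, "div#Notice > a[href^=Article]"));
    }
}
```

In the query, `#Notice` matches by id, `>` restricts the match to direct children, and `[href^=Article]` keeps links whose href starts with that prefix.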
Jsoup network connections and error handling
1. Timeouts
Fix the URL.
For example, if the URL being requested is: http://rwxy.suse.edu.cn/images/Maincn.asp/
change it to: http://rwxy.suse.edu.cn/images/Maincn.asp
Note: the trailing slash "/" has been removed.
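The URL fix can be applied in code before the address is passed to `Jsoup.connect`. The helper below is a hypothetical sketch that uses only the JDK; `UrlFix` and `stripTrailingSlash` are not part of jsoup, and leaving a bare host such as `http://host/` untouched is a design choice made here.

```java
public class UrlFix {
    // Removes one trailing "/" from a URL that points at a page,
    // but keeps the slash when it is the root path of a bare host.
    static String stripTrailingSlash(String url) {
        int schemeEnd = url.indexOf("://");
        int pathStart = url.indexOf('/', schemeEnd < 0 ? 0 : schemeEnd + 3);
        // No path at all, or the only slash is the root path: nothing to strip.
        if (pathStart < 0 || pathStart == url.length() - 1) {
            return url;
        }
        return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingSlash("http://rwxy.suse.edu.cn/images/Maincn.asp/"));
    }
}
```

If a request still times out after the fix, jsoup reports it as a `java.net.SocketTimeoutException`, a subclass of `IOException`, so it can be caught separately from other I/O errors.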