Parsing HTML with jsoup

Latest recommended article published on 2019-10-20 14:39:32

我要爆炸啦

// SimpleDateFormat.parse() converts an input string in a specific format into a Date object

// Jsoup.parse parses an HTML string, e.g. Jsoup.parse("<html><head><title>First parse</title></head>")

// Jsoup.connect fetches and parses the page at a URL, e.g. Jsoup.connect("http://www.baidu.com").get()

1. Parsing methods
(1) Parsing from a string

    String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parse HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
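
As a minimal sketch of what the parsed Document gives you, the class below reads the title and the body text back out of the same HTML string. The class and method names (StringParseDemo, summarize) are made up for illustration; only Jsoup.parse, Document.title(), Document.body() and Element.text() come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StringParseDemo {
    // Parses an HTML string and returns "<title> | <body text>".
    static String summarize(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title() + " | " + doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parse HTML into a doc.</p></body></html>";
        System.out.println(summarize(html)); // prints "First parse | Parse HTML into a doc."
    }
}
```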
         
       
  
  
 
 
(2) Fetching and parsing from a URL

    Document doc = Jsoup.connect("http://example.com/").get();
    String title = doc.title();

A request can also carry parameters, a User-Agent, a cookie and a timeout, and be sent as a POST:

    Document doc = Jsoup.connect("http://example.com")
      .data("query", "Java")
      .userAgent("Mozilla")
      .cookie("auth", "token")
      .timeout(3000)
      .post();
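
The chained calls only configure the request; nothing is sent until get(), post() or execute() is called. The sketch below builds the same request but stops short of sending it, so the settings can be inspected without network access. RequestBuilderDemo and buildRequest are hypothetical names used for illustration.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

class RequestBuilderDemo {
    // Builds (but does not send) a POST request.
    // Calling execute() or post() on the result would hit the network.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .data("query", "Java")      // request parameter
                .userAgent("Mozilla")       // User-Agent header
                .cookie("auth", "token")    // cookie sent with the request
                .timeout(3000)              // timeout in milliseconds
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection.Request req = buildRequest("http://example.com").request();
        System.out.println(req.method() + " " + req.url() + " timeout=" + req.timeout());
    }
}
```

To actually send it, call buildRequest(url).execute().parse(), which should be equivalent to ending the chain with post().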
         
       
  
  
 
 
(3) Parsing from a file

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
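
The third argument is the base URI used to resolve relative links in the file. The sketch below makes this visible by writing a small HTML file to a temporary location (an assumption made so the example is self-contained, instead of relying on /tmp/input.html existing) and resolving its one relative link. FileParseDemo and firstLinkFromFile are illustrative names.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FileParseDemo {
    // Writes a tiny HTML file, parses it with a base URI,
    // and returns the absolute URL of its first link.
    static String firstLinkFromFile() {
        try {
            File input = File.createTempFile("input", ".html");
            input.deleteOnExit();
            Files.write(input.toPath(),
                    "<a href='/docs'>Docs</a>".getBytes(StandardCharsets.UTF_8));
            Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
            return doc.select("a").first().absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLinkFromFile()); // prints "http://example.com/docs"
    }
}
```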
         
       
  
  
 
 
2. Traversing elements with DOM methods
(1) Finding elements

    getElementById(String id)
    getElementsByTag(String tag)
    getElementsByClass(String className)
    getElementsByAttribute(String key)
    siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
    parent(), children(), child(int index)
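
A short sketch of these DOM-style methods on a small made-up document. The HTML and the names DomTraversalDemo and describe are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomTraversalDemo {
    static final String HTML =
            "<div id=\"menu\"><p class=\"item\">One</p><p class=\"item\">Two</p><span>Three</span></div>";

    // Walks the sample document and joins a few results with commas.
    static String describe() {
        Document doc = Jsoup.parse(HTML);
        Element menu = doc.getElementById("menu");          // lookup by id
        Elements items = doc.getElementsByClass("item");    // lookup by class
        Element first = menu.child(0);                      // first child element
        Element next = first.nextElementSibling();          // the second <p>
        Element last = first.lastElementSibling();          // the <span>
        return menu.children().size() + "," + items.size() + ","
                + next.text() + "," + next.parent().id() + "," + last.tagName();
    }

    public static void main(String[] args) {
        System.out.println(describe()); // prints "3,2,Two,menu,span"
    }
}
```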
         
       
  
  
 
 
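(2) Getting element data

    attr(String key) – get the value of the key attribute
    attributes() – get all attributes
    id(), className(), classNames()
    text() – get the text content
    html() – get the element's inner HTML
    outerHtml() – get the HTML including the element itself
    data() – get the content of a <script> or <style> tag
    tag(), tagName()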
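
The accessors above can be tried on a single link and a script element. This is a sketch on invented HTML; ElementDataDemo, describeLink and scriptData are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementDataDemo {
    static final String HTML =
            "<a id=\"home\" class=\"nav main\" href=\"/index.html\"><b>Home</b> page</a>"
          + "<script>var x = 1;</script>";

    // Collects a few accessor results into one pipe-separated string.
    static String describeLink() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.getElementById("home");
        return link.tagName() + "|" + link.id() + "|" + link.className() + "|"
                + link.classNames().size() + "|" + link.attr("href") + "|" + link.text();
    }

    // data() returns the raw content of a <script> or <style> element.
    static String scriptData() {
        return Jsoup.parse(HTML).select("script").first().data();
    }

    public static void main(String[] args) {
        System.out.println(describeLink()); // prints "a|home|nav main|2|/index.html|Home page"
        System.out.println(scriptData());   // prints "var x = 1;"
        Element link = Jsoup.parse(HTML).getElementById("home");
        System.out.println(link.html());      // inner HTML of the link
        System.out.println(link.outerHtml()); // HTML including the <a> tag itself
    }
}
```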
         
       
  
  
 
 
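3. Selector syntax (what sets jsoup apart from other parsers is that you can use jQuery-like selector syntax to find and filter the elements you need)
(1) Basic selectors

    tagname: elements with the given tag
    ns|tag: elements with the given tag in a namespace, e.g. fb|name finds <fb:name>
    #id: elements with the given id
    .class: elements with the given class
    [attribute]: elements that have the given attribute
    [^attr]: elements that have an attribute whose name starts with attr
    [attr=value]: elements that have the given attribute with the given value
    [attr^=value], [attr$=value], [attr*=value]: elements that have the attr attribute with a value that starts with, ends with, or contains value, e.g. [href*=/path/]
    [attr~=regex]: elements that have the attr attribute with a value matching the regular expression
    *: all elements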
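
To see how these selectors behave, the sketch below counts matches for several of them against a small invented document. BasicSelectorDemo and count are hypothetical names; the selectors are passed to Document.select.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectorDemo {
    static final Document DOC = Jsoup.parse(
            "<div id=\"logo\" class=\"head\" data-role=\"banner\">"
          + "<a href=\"/path/a.png\">A</a>"
          + "<a href=\"http://example.org/b.jpg\">B</a>"
          + "<img src=\"c.PNG\">"
          + "</div>");

    // Returns how many elements in the sample document match the selector.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a", "#logo", ".head", "[href]", "[^data-]", "[href^=http]",
            "[href$=.png]", "[href*=/path/]", "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```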
         
       
  
  
 
 
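(2) Selector combinations

    el#id: tag name together with an id
    el.class: tag name together with a class
    el[attr]: tag name together with the name of an attribute it has
    Any combination of the three above, e.g. a[href].highlight
    ancestor child: descendant, e.g. div.content p finds <p> elements anywhere under <div class="content">
    ancestor > child: direct child, e.g. div.content > p finds <p> elements that are direct children of <div class="content">; div.content > * finds all direct children of <div class="content">
    siblingA + siblingB: immediately following sibling, e.g. div.head + div finds a <div> that comes directly after <div class="head">
    siblingA ~ siblingX: following sibling, e.g. h1 ~ p finds <p> elements that come after an <h1> and share its parent
    el, el, el: group several selectors; finds elements that match any one of them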
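
The difference between the descendant, child and sibling combinators is easiest to see on a concrete tree. The following sketch uses invented HTML; CombinatorDemo and texts are illustrative names.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    static final Document DOC = Jsoup.parse(
            "<div class=\"content\"><p>direct</p><section><p>nested</p></section></div>"
          + "<div class=\"head\"></div><div>next</div>"
          + "<h1>T</h1><span>s</span><p>after</p>");

    // Returns the text of every element matching the selector, in document order.
    static List<String> texts(String cssQuery) {
        return DOC.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // [direct, nested]
        System.out.println(texts("div.content > p")); // [direct]
        System.out.println(texts("div.head + div"));  // [next]
        System.out.println(texts("h1 ~ p"));          // [after]
        System.out.println(texts("h1, span"));        // [T, s]
    }
}
```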
         
       
  
  
 
 
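(3) Pseudo selectors (conditional selectors)

    :lt(n): elements whose sibling index is less than n
    :gt(n): elements whose sibling index is greater than n
    :eq(n): elements whose sibling index equals n
    :has(selector): elements that contain an element matching the selector
    :not(selector): elements that do not match the selector
    :contains(text): elements that contain the given text, case-insensitive
    :containsOwn(text): elements that directly contain the given text
    :matches(regex): elements whose text matches the regular expression
    :matchesOwn(regex): elements whose own text matches the regular expression
    Note: the indexes used by these pseudo selectors are 0-based: the first element is at index 0, the second at index 1, and so on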
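
A small sketch of the pseudo selectors, again on invented HTML. It shows in particular that the index is the element's position among its siblings and that :contains ignores case. PseudoSelectorDemo and texts are hypothetical names.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorDemo {
    static final Document DOC = Jsoup.parse(
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div><p>Login here</p></div>"
          + "<div class=\"logo\">x</div>");

    // Returns the text of every element matching the selector.
    static List<String> texts(String cssQuery) {
        return DOC.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(2)"));                // [a, b]
        System.out.println(texts("td:gt(2)"));                // [d]
        System.out.println(texts("td:eq(1)"));                // [b]
        System.out.println(texts("div:has(p)"));              // [Login here]
        System.out.println(texts("div:not(.logo)"));          // [Login here]
        System.out.println(texts("p:contains(login)"));       // [Login here]
        System.out.println(texts("div:containsOwn(login)"));  // [] (the text belongs to the <p>)
        System.out.println(texts("p:matches((?i)login)"));    // [Login here]
    }
}
```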
         
       
  
  
 
 
4. Getting an element's attributes, text and HTML

    Get an attribute value: Node.attr(String key)
    Get an element's text, including the text of its child elements: Element.text()
    Get HTML: Element.html() or Node.outerHtml()
         
       
  
  
 
 
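5. Working with URLs

    Element.attr("href") – get the URL exactly as written in the attribute
    Element.attr("abs:href") or Element.absUrl("href") – get the absolute URL. If the HTML was parsed from a file or a string, supply a base URI, either through the baseUri argument of Jsoup.parse or by calling Node.setBaseUri(String baseUri); otherwise a relative URL resolves to an empty string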
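
The effect of the base URI can be checked by parsing the same string with and without one. This is a sketch; AbsUrlDemo and resolve are illustrative names, and the link is invented.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class AbsUrlDemo {
    // Returns: raw href, absUrl without a base URI, absUrl with one, and the abs: attribute form.
    static String[] resolve() {
        String html = "<a href='/docs/page.html'>Docs</a>";
        Element noBase = Jsoup.parse(html).selectFirst("a");
        Element withBase = Jsoup.parse(html, "http://example.com/").selectFirst("a");
        return new String[] {
            noBase.attr("href"),        // "/docs/page.html"
            noBase.absUrl("href"),      // "" because there is no base URI
            withBase.absUrl("href"),    // "http://example.com/docs/page.html"
            withBase.attr("abs:href")   // same result via the abs: prefix
        };
    }

    public static void main(String[] args) {
        for (String s : resolve()) {
            System.out.println("[" + s + "]");
        }
    }
}
```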
         
       
  
  
 
 

6. Test examples

    li[class=info] a[class=Author] - the space expresses containment, i.e. an a inside an li
    div[class=mod mod-main mod-lmain]:contains(教学反思) - a div that contains the text "教学反思"; useful when several divs share the same class
         
       
  
  
 
 
 
 
  
  
   
   
    
    /*
      previousSibling() returns the node immediately before a tag
      nextSibling() returns the node immediately after a tag
      For example:
      <form id=form1>
      First place: Lily  <br/>
      Second place: Tom   <br/>
      Third place: Peter <br/>
      </form>
    */
    Elements items = doc.select("form[id=form1]");
    Elements prevs = items.select("br");
    for (Element p : prevs) {
        // The text node in front of each <br/> holds one line of the list.
        String prevStr = p.previousSibling().toString().trim();
    }
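
A self-contained variant of the loop above, as a sketch: it collects the text in front of each <br> into a list and checks that the previous sibling really is a text node before reading it. SiblingTextDemo and linesBeforeBreaks are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class SiblingTextDemo {
    // Returns the trimmed text that directly precedes each <br> inside form#form1.
    static List<String> linesBeforeBreaks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element br : doc.select("form#form1 br")) {
            Node prev = br.previousSibling();
            if (prev instanceof TextNode) {   // skip <br> tags preceded by an element or nothing
                lines.add(((TextNode) prev).text().trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<form id=\"form1\">First: Lily <br/>Second: Tom <br/>Third: Peter <br/></form>";
        System.out.println(linesBeforeBreaks(html)); // [First: Lily, Second: Tom, Third: Peter]
    }
}
```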
         
       
  
  
 
 
 
 
  
  
   
   
    
    /*
      The most common case: extracting links
    */
    String itemTag = "div[class=mydiv]";
    String linkTag = "a";
    Elements items = doc.select(itemTag);
    Elements links = items.select(linkTag);
    for (Element l : links) {
        String absHref = l.attr("abs:href"); // absolute URL
        String href = l.attr("href");        // URL as written, possibly relative
        String text = l.text();
        String title = l.attr("title");
    }
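
The same link extraction as a runnable sketch. It parses a string with a base URI instead of fetching a page, which is an assumption made to keep the example offline; LinkExtractDemo and links are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExtractDemo {
    // Returns "text -> absolute URL" for every link inside div.mydiv.
    static List<String> links(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("div.mydiv a[href]")) {
            out.add(link.text() + " -> " + link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<div class=\"mydiv\"><a href=\"/a\" title=\"First\">A</a>"
                + "<a href=\"b.html\">B</a></div><a href=\"/outside\">C</a>";
        // Prints the two links inside div.mydiv; the third link is outside and is skipped.
        System.out.println(links(html, "http://example.com/dir/"));
    }
}
```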
         
       
  
  
 
 
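jsoup is a Java HTML parser that can parse a URL or a piece of HTML text directly. It offers a very convenient API for extracting and manipulating data through DOM methods, CSS selectors and jQuery-like operations.

The main features of jsoup are:
Parse HTML from a URL, a file or a string;
Find and extract data with DOM methods or CSS selectors;
Manipulate HTML elements, attributes and text;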

Parsing with jsoup
Jsoup provides a set of static parse methods that produce a Document object
static Document parse(File in, String charsetName)
static Document parse(File in, String charsetName, String baseUri)
static Document parse(InputStream in, String charsetName, String baseUri)
static Document parse(String html)
static Document parse(String html, String baseUri)
static Document parse(URL url, int timeoutMillis)
static Document parseBodyFragment(String bodyHtml)
static Document parseBodyFragment(String bodyHtml, String baseUri)
Here baseUri is the URL against which relative URLs found in the document are resolved
and charsetName is the character set of the input
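
Most of these overloads appear earlier in the article; parseBodyFragment does not, so here is a minimal sketch of it. It takes a fragment without html or body tags and places it inside the body of a new document. FragmentDemo and bodyOf are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FragmentDemo {
    // Parses an HTML fragment and returns the <body> element that now wraps it.
    static Element bodyOf(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body();
    }

    public static void main(String[] args) {
        Element body = bodyOf("<p>One</p><p>Two</p>");
        System.out.println(body.children().size()); // prints 2
        System.out.println(body.text());            // prints "One Two"
    }
}
```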

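static Connection connect(String url) creates a connection to the given URL (must be http or https)

Connection provides methods for fetching page content
Connection cookie(String name, String value) sets a cookie to send with the request
Connection data(Map<String,String> data) passes request parameters
Connection data(String... keyvals) passes request parameters
Document get() sends the request as a GET and parses the result
Document post() sends the request as a POST and parses the result
Connection userAgent(String userAgent) sets the User-Agent header
Connection header(String name, String value) adds a request header
Connection referrer(String referrer) sets the request referrer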

jsoup offers JavaScript-like methods for getting HTML elements:
getElementById(String id) gets an element by id
getElementsByTag(String tag) gets elements by tag
getElementsByClass(String className) gets elements by class
getElementsByAttribute(String key) gets elements by attribute
It also provides methods for getting sibling elements: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()

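Getting and setting element data
attr(String key) gets an attribute value; attr(String key, String value) sets an attribute value
attributes() gets all attributes
id(), className(), classNames() get the id and class values
text() gets the text
text(String value) sets the text
html() gets the inner HTML
html(String value) sets the inner HTML
outerHtml() gets the outer HTML, including the element itself
data() gets the data content (for example of a <script> or <style> tag)
tag() gets the Tag object and tagName() gets the tag name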

Manipulating HTML elements:
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)
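
A brief sketch of these manipulation methods applied to one container element. The starting HTML and the names ManipulationDemo and build are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulationDemo {
    // Starts from <div id="box"><p>middle</p></div> and adds content around the paragraph.
    static Element build() {
        Document doc = Jsoup.parseBodyFragment("<div id=\"box\"><p>middle</p></div>");
        Element box = doc.getElementById("box");
        box.prepend("<h2>top</h2>");           // parsed as HTML, inserted as first child
        box.append("<p>bottom</p>");           // parsed as HTML, inserted as last child
        box.appendElement("span").text("end"); // creates <span>, then sets its text
        box.attr("title", "demo");             // sets an attribute
        return box;
    }

    public static void main(String[] args) {
        System.out.println(build().outerHtml());
    }
}
```

Note that appendElement returns the new child, so the chained text("end") call applies to the span and not to the container.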

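jsoup also provides jQuery-like selectors
Use selectors to retrieve data
tagname     match by tag name, e.g. a
ns|tag     match a tag in a namespace, e.g. fb|name finds <fb:name> elements
#id     match by element id, e.g. #logo
.class     match by the element's class attribute, e.g. .head
*     match all elements
[attribute]     match by attribute, e.g. [href] finds all elements that have an href attribute
[^attr]     match by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]     match by attribute value, e.g. [width=500] finds all elements whose width attribute is 500
[attr^=value], [attr$=value], [attr*=value]     the attribute value starts with, ends with, or contains value
[attr~=regex]     filter attribute values with a regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
These are the most basic selectors; they can also be combined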

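Combinations
el#id     an element with the given id, e.g. a#logo -> <a id=logo href= … >
el.class     an element with the given class, e.g. div.head -> <div class=head>xxxx</div>
el[attr]     all elements of that tag that define the attribute, e.g. a[href]
Any combination of the three above, e.g. a[href]#logo, a[name].outerlink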

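Besides the basic syntax and its combinations, jsoup also supports filtering elements with expressions
:lt(n)     e.g. td:lt(3) selects td elements whose sibling index is less than 3, i.e. the first three columns
:gt(n)     e.g. div p:gt(2) selects p elements inside a div whose sibling index is greater than 2
:eq(n)     e.g. form input:eq(1) selects input elements inside a form whose sibling index is 1
:has(selector)     e.g. div:has(p) selects div elements that contain a p element
:not(selector)     e.g. div:not(.logo) selects all div elements that do not have class=logo
:contains(text)     elements that contain the given text, case-insensitive, e.g. p:contains(oschina)
:containsOwn(text)     elements whose own text, not that of their descendants, contains the given text
:matches(regex)     filter by text with a regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex)     elements whose own text matches the regular expression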


Example:

package com.javen.Jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest {
    static String url = "http://www.cnblogs.com/zyw-205520/archive/2012/12/20/2826402.html";

    public static void main(String[] args) throws Exception {
        blogBody();
        //article();
        //blog();

        /*
         * Access a URL with the POST method:
         * Document doc = Jsoup.connect("http://www.oschina.net/")
         *     .data("query", "Java")    // request parameter
         *     .userAgent("I'm jsoup")   // set the User-Agent
         *     .cookie("auth", "token")  // set a cookie
         *     .timeout(3000)            // set the connection timeout
         *     .post();
         */

        /*
         * Load an HTML document from a file:
         * File input = new File("D:/test.html");
         * Document doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");
         */
    }

    /**
     * Prints the body of an HTML document parsed from a string and from a URL.
     * @throws IOException if the page cannot be fetched
     */
    private static void blogBody() throws IOException {
        // Parse an HTML document directly from a string
        String html = "<html><head><title> 开源中国社区 </title></head>"
                + "<body><p> 这里是 jsoup 项目的相关文章 </p></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(doc.body());

        // Load an HTML document directly from a URL
        Document doc2 = Jsoup.connect(url).get();
        String body = doc2.body().toString();
        System.out.println(body);
    }

    /**
     * Prints the titles and links of the articles on the blog.
     */
    public static void article() {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.cnblogs.com/zyw-205520/").get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postTitle");
            for (Element element : listDiv) {
                Elements links = element.getElementsByTag("a");
                for (Element link : links) {
                    String linkHref = link.attr("href");
                    String linkText = link.text().trim();
                    System.out.println(linkHref);
                    System.out.println(linkText);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Prints the content of a given blog article.
     */
    public static void blog() {
        Document doc;
        try {
            doc = Jsoup.connect("http://www.cnblogs.com/zyw-205520/archive/2012/12/20/2826402.html").get();
            Elements listDiv = doc.getElementsByAttributeValue("class", "postBody");
            for (Element element : listDiv) {
                System.out.println(element.html());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}
  
  
   
   
  
  
 
 
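Next, using Jsoup on Android to parse web page data asynchronously. Note: it is easy to run into garbled characters (an encoding problem) here.

Configuration: add the following permission to AndroidManifest.xml: <uses-permission android:name="android.permission.INTERNET"></uses-permission>
The layout file: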

   
   
    
    
     
     
    
    
    
    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    xmlns:tools="http://schemas.android.com/tools"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical" >

    <WebView
        android:id="@+id/webView"
        android:layout_width="fill_parent"
        android:layout_height="200dp" />

    <ScrollView
        android:layout_width="wrap_content"
        android:layout_height="wrap_content" >

        <TextView
            android:id="@+id/textView"
            android:layout_width="wrap_content"
            android:layout_height="wrap_content"
            android:text="@string/hello_world" />
    </ScrollView>

</LinearLayout>
    
    
     
     
    
    
   
   

 
The main code that loads the data asynchronously:

   
   
    
    
     