Jsoup Quick Start

Jsoup

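Jsoup is a Java library for parsing HTML, much as an XML parser is used for parsing XML. It is built to work with real-world HTML, including markup that is not well formed. Its selector syntax closely resembles jQuery selectors, and it is flexible and easy to use for extracting the results you need. This tutorial walks through a number of Jsoup examples.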

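What can you do with Jsoup?

  • Scrape and parse HTML from a URL, a file, or a string
  • Find and extract data using DOM traversal or CSS selectors
  • Manipulate HTML elements, attributes, and text
  • Clean user-submitted content against a safe whitelist to prevent XSS attacks
  • Output tidy HTML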
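
As a rough illustration of several items in the list above, the following sketch parses HTML from a string, finds an element with a CSS selector, and changes its text and attributes. The class name `JsoupOverview`, the sample markup, and the `data-edited` attribute are made up for this example; only the Jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupOverview {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
            + "<div id='news'><p class='lead'>First</p><p>Second</p></div>"
            + "</body></html>";

    /** Parses the sample, edits one element, and returns that element's new text. */
    static String editLead() {
        Document doc = Jsoup.parse(HTML);                        // parse HTML from a string
        Element lead = doc.select("div#news > p.lead").first();  // find data with a CSS selector
        lead.text("Updated");                                    // change the element's text
        lead.attr("data-edited", "true");                        // add an attribute
        // Look the element up again through the attribute that was just added.
        return doc.select("p[data-edited=true]").first().text();
    }

    public static void main(String[] args) {
        System.out.println(editLead());
    }
}
```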

Installation: runtime dependency

You can add the Jsoup jar to your project with the Maven dependency below.

```xml
<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>
```

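Main classes in Jsoup

Although the library contains many classes, in most cases the following three are the ones you need to understand well.

1. The org.jsoup.Jsoup class

The Jsoup class is the entry point of any Jsoup program. It provides methods for loading and parsing HTML documents from various sources.

Some important methods of the Jsoup class are: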

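| Method | Description |
| --- | --- |
| static Connection connect(String url) | Creates and returns a connection to the URL. |
| static Document parse(File in, String charsetName) | Parses the file, read with the specified character set, into a document. |
| static Document parse(String html) | Parses the given HTML into a document. |
| static String clean(String bodyHtml, Whitelist whitelist) | Returns safe HTML from the input HTML by parsing it and filtering it through a whitelist of permitted tags and attributes. |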

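2. The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the Jsoup library. Use it for operations that apply to the HTML document as a whole.

For the important methods of the Document class, see http://jsoup.org/apidocs/org/jsoup/nodes/Document.html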
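
To make "document-level operations" a little more concrete, here is a minimal sketch that reads the title, the head element, and the body text of a parsed document. The class name `DocumentBasics` and the helper `inspect` are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentBasics {

    /** Returns the title, the head element's tag name, and the body text of the parsed HTML. */
    static String[] inspect(String html) {
        Document doc = Jsoup.parse(html);
        String title = doc.title();             // text of the <title> element
        String headTag = doc.head().tagName();  // the <head> element of the document
        String bodyText = doc.body().text();    // combined text of everything in <body>
        return new String[] { title, headTag, bodyText };
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        for (String part : inspect(html)) {
            System.out.println(part);
        }
    }
}
```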

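3. The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. With the Element class you can extract data, traverse nodes, and manipulate HTML.

For the important methods of the Element class, see http://jsoup.org/apidocs/org/jsoup/nodes/Element.html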
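
The tag name, attributes, and child nodes mentioned above can each be read directly from an Element. The sketch below walks the children of a list and reports each link it finds; the class name `ElementBasics`, the `menu` id, and the sample markup are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementBasics {

    /** Describes every link inside the element with id "menu" as "tag href text". */
    static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        Element menu = doc.getElementById("menu");   // null when no such element exists
        if (menu == null) {
            return lines;
        }
        for (Element item : menu.children()) {       // direct children, here the <li> elements
            for (Element link : item.getElementsByTag("a")) {
                lines.add(link.tagName() + " " + link.attr("href") + " " + link.text());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<ul id='menu'>"
                + "<li><a href='/home'>Home</a></li>"
                + "<li><a href='/about'>About</a></li>"
                + "</ul>";
        describeLinks(html).forEach(System.out::println);
    }
}
```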

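Examples

Let's now look at some examples of working with HTML documents through the Jsoup API.

1. Loading a document from a URL

Use the Jsoup.connect() method to load HTML from a URL.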

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.IOException
try {
    // Fetch the page over HTTP and parse it into a Document
    Document document = Jsoup.connect("http://www.yiibai.com").get();
    System.out.println(document.title());
} catch (IOException e) {
    e.printStackTrace();
}
```
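
In practice a request often needs a user agent and a timeout. The sketch below shows one way to set both on the Connection returned by Jsoup.connect() before calling get(). The class name `FetchWithOptions`, the user-agent string, and the 5000 ms timeout are illustrative assumptions, not values from the original tutorial.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchWithOptions {

    /** Builds a connection with a user agent and a timeout; no request is sent yet. */
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial sketch)") // placeholder user agent
                .timeout(5000);                                    // timeout in milliseconds
    }

    public static void main(String[] args) {
        try {
            // The HTTP request is only executed when get() is called.
            Document document = configure("http://www.yiibai.com").get();
            System.out.println(document.title());
        } catch (IOException e) {
            System.err.println("Request failed: " + e.getMessage());
        }
    }
}
```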

2. Loading a document from a file

Use the Jsoup.parse() method to load HTML from a file.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try {
    // Parse the file, reading it as UTF-8
    Document document = Jsoup.parse(new File("D:/temp/index.html"), "utf-8");
    System.out.println(document.title());
} catch (IOException e) {
    e.printStackTrace();
}
```
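
Because the path above only exists on the author's machine, a variant that runs anywhere may be useful. This sketch writes HTML to a temporary file and parses it with the same Jsoup.parse(File, String) call; the class name `ParseTempFile` and the file prefix are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseTempFile {

    /** Writes the HTML to a temporary file, parses that file, and returns the page title. */
    static String titleFromFile(String html) throws IOException {
        Path path = Files.createTempFile("jsoup-demo", ".html");
        try {
            Files.write(path, html.getBytes(StandardCharsets.UTF_8));
            // The charset name must match the encoding used to write the file.
            Document document = Jsoup.parse(path.toFile(), "UTF-8");
            return document.title();
        } finally {
            Files.deleteIfExists(path); // clean up the temporary file
        }
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>From disk</title></head><body></body></html>";
        System.out.println(titleFromFile(html));
    }
}
```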

3. Loading a document from a String

Use the Jsoup.parse() method to load HTML from a string.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document
// Jsoup.parse(String) does no I/O and declares no IOException,
// so no try/catch is needed here (catching IOException would not compile).
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document document = Jsoup.parse(html);
System.out.println(document.title());
```
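
Jsoup.parse() also accepts incomplete markup and still builds a full document around it. The following sketch parses a lone, unclosed paragraph and checks that head and body elements are present. The class name `ParseFragment` and the input string are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseFragment {

    /** Parses any HTML string, complete or not, into a Document. */
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document document = parse("<p>Unclosed paragraph");
        // The parser supplies the surrounding html, head and body elements itself.
        System.out.println("has head: " + (document.head() != null));
        System.out.println("has body: " + (document.body() != null));
        System.out.println("paragraph text: " + document.select("p").text());
        System.out.println("title is empty: " + document.title().isEmpty());
    }
}
```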

4. Getting the title from HTML

As the examples above show, call the document.title() method to get the title of an HTML page.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try {
    Document document = Jsoup.parse(new File("C:/Users/xyz/Desktop/yiibai-index.html"), "utf-8");
    // title() returns the text of the <title> element
    System.out.println(document.title());
} catch (IOException e) {
    e.printStackTrace();
}
```

5. Getting the favicon of an HTML page

Assuming the favicon is the first image referenced in the <head> section of the HTML document, you can use the code below.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           java.io.File, java.io.IOException
String favImage = "Not Found";
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // First choice: a <link> in <head> whose href points to an .ico or .png file
    Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null) {
        // Fallback: a <meta itemprop="image"> element
        element = document.head().select("meta[itemprop=image]").first();
        if (element != null) {
            favImage = element.attr("content");
        }
    } else {
        favImage = element.attr("href");
    }
} catch (IOException e) {
    e.printStackTrace();
}
System.out.println(favImage);
```
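
The file-extension match above can also pick up other .png links in the head. As an alternative sketch, and not the approach the original code takes, one could select on the rel attribute instead, since favicons are commonly declared with rel="icon" or rel="shortcut icon". The class name `FaviconByRel` and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FaviconByRel {

    /** Returns the href of the first <link> in <head> whose rel contains "icon", or "Not Found". */
    static String favicon(String html) {
        Document document = Jsoup.parse(html);
        // [rel*=icon] matches rel values containing "icon"; [href] requires an href attribute.
        Element link = document.head().select("link[rel*=icon][href]").first();
        return link != null ? link.attr("href") : "Not Found";
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<link rel='stylesheet' href='/site.css'>"
                + "<link rel='shortcut icon' href='/favicon.ico'>"
                + "</head><body></body></html>";
        System.out.println(favicon(html));
    }
}
```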

6. Getting all links in an HTML page

To get all the links in a web page, use the following code.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File, java.io.IOException
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // Select every <a> element that has an href attribute
    Elements links = document.select("a[href]");
    for (Element link : links) {
        System.out.println("link : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
} catch (IOException e) {
    e.printStackTrace();
}
```
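
Links extracted this way are often relative. When a base URI is supplied at parse time, absUrl("href") resolves each link against it. The sketch below assumes a base URI of http://www.yiibai.com/ purely for illustration; the class name `AbsoluteLinks` and the sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinks {

    /** Returns every link in the HTML as an absolute URL, resolved against baseUri. */
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri); // baseUri is used to resolve relative URLs
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href='/java/'>Java</a> <a href='http://example.org/x'>X</a>";
        absoluteLinks(html, "http://www.yiibai.com/").forEach(System.out::println);
    }
}
```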

7. Getting all images in an HTML page

To get all the images displayed in a web page, use the following code.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File, java.io.IOException
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // Select <img> elements whose src ends in png, jpg, jpeg or gif (case-insensitive)
    Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images) {
        System.out.println("src : " + image.attr("src"));
        System.out.println("height : " + image.attr("height"));
        System.out.println("width : " + image.attr("width"));
        System.out.println("alt : " + image.attr("alt"));
    }
} catch (IOException e) {
    e.printStackTrace();
}
```

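8. Getting the meta information of a URL

Meta information is what search engines such as Google use to determine what a page is about for indexing purposes. It takes the form of tags in the HEAD section of the HTML page. To get the meta information of a web page, use the code below.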

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");

    // get(0) throws IndexOutOfBoundsException if the page has no description meta tag
    String description = document.select("meta[name=description]").get(0).attr("content");
    System.out.println("Meta description : " + description);

    // first() returns null if the page has no keywords meta tag
    String keywords = document.select("meta[name=keywords]").first().attr("content");
    System.out.println("Meta keyword : " + keywords);
} catch (IOException e) {
    e.printStackTrace();
}
```
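
Pages do not always carry both meta tags, in which case the element lookups above fail. One possible way around this is to call attr() on the Elements collection itself, which returns an empty string when nothing matches. The class name `MetaInfo` and the helper `metaContent` are hypothetical names for this sketch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class MetaInfo {

    /** Returns the content of the named meta tag, or an empty string if the tag is absent. */
    static String metaContent(String html, String name) {
        Document document = Jsoup.parse(html);
        // Elements.attr() reads the attribute from the first matching element that has it,
        // and yields "" when no element matches, so no null check is needed.
        return document.select("meta[name=" + name + "]").attr("content");
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name='description' content='Jsoup tutorial'>"
                + "</head><body></body></html>";
        System.out.println("Meta description : " + metaContent(html, "description"));
        System.out.println("Meta keyword : " + metaContent(html, "keywords"));
    }
}
```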

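9. Getting form attributes in an HTML page

Getting the input elements of a form in a web page is straightforward. Find the FORM element by its unique ID, then find all the input elements inside that form.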

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File
// Jsoup.parse(File, String) throws IOException; handle or declare it in the enclosing method.
Document doc = Jsoup.parse(new File("c:/temp/yiibai-index.html"), "utf-8");
// getElementById returns null if the page has no element with this id
Element formElement = doc.getElementById("loginForm");

Elements inputElements = formElement.getElementsByTag("input");
for (Element inputElement : inputElements) {
    String key = inputElement.attr("name");
    String value = inputElement.attr("value");
    System.out.println("Param name: " + key + " \nParam value: " + value);
}
```
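
The same lookup can be wrapped in a helper that tolerates a missing form and returns the parameters as a map. This is only a sketch: the class name `FormParams`, the choice to skip inputs without a name, and the sample markup are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FormParams {

    /** Collects name/value pairs of the named inputs in the form with the given id. */
    static Map<String, String> formParams(String html, String formId) {
        Document doc = Jsoup.parse(html);
        Map<String, String> params = new LinkedHashMap<>(); // keeps document order
        Element form = doc.getElementById(formId);
        if (form == null) {
            return params; // no such form: return an empty map instead of failing
        }
        for (Element input : form.getElementsByTag("input")) {
            String name = input.attr("name");
            if (!name.isEmpty()) {                 // skip inputs without a name, e.g. bare submit buttons
                params.put(name, input.attr("value"));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String html = "<form id='loginForm'>"
                + "<input name='user' value='alice'>"
                + "<input type='password' name='pass'>"
                + "<input type='submit' value='Go'>"
                + "</form>";
        System.out.println(formParams(html, "loginForm"));
    }
}
```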

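10. Updating the attributes or content of elements

Once you have located the elements you want with the approaches above, you can use the Jsoup API to update their attributes or inner HTML. For example, suppose you want to set rel="nofollow" on every link in the document.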

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements,
//           java.io.File, java.io.IOException
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai.com.html"), "utf-8");
    Elements links = document.select("a[href]");
    // Sets rel="nofollow" on every selected link
    links.attr("rel", "nofollow");
} catch (IOException e) {
    e.printStackTrace();
}
```
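
To see that such an update actually changed the document, the result can be queried again. The sketch below sets rel="nofollow" on all links and also replaces the inner HTML of one element; the class name `UpdateElements`, the `intro` id, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class UpdateElements {

    /** Marks every link as nofollow and rewrites the inner HTML of the element with id "intro". */
    static Document update(String html) {
        Document document = Jsoup.parse(html);
        document.select("a[href]").attr("rel", "nofollow"); // attribute update on all matches
        Element intro = document.getElementById("intro");
        if (intro != null) {
            intro.html("<em>Updated</em> text");            // inner HTML update on one element
        }
        return document;
    }

    public static void main(String[] args) {
        String html = "<p id='intro'>Old text</p>"
                + "<a href='/a'>A</a><a href='/b'>B</a>";
        Document document = update(html);
        System.out.println("nofollow links: " + document.select("a[rel=nofollow]").size());
        System.out.println("intro text: " + document.getElementById("intro").text());
    }
}
```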

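11. Sanitizing untrusted HTML (to prevent XSS)

Suppose your application displays HTML fragments submitted by users; for example, users can put HTML into a comment box. Displaying that HTML directly can cause very serious problems: a user could embed a malicious script that redirects other users to a malicious website.

To sanitize such HTML, Jsoup provides the Jsoup.clean() method. It takes a string of HTML and returns cleaned HTML. To do this, Jsoup uses a whitelist filter. The filter works by parsing the input HTML (in a safe, sandboxed environment), then walking the parse tree and letting only known-safe tags and attributes (and values) through to the cleaned output.

It does not use regular expressions, which are not suitable for this task.

The cleaner is useful not only for avoiding XSS but also for limiting the range of elements users can provide: you might allow textual elements such as a and strong, but not structural elements such as div or table.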

<span style="color:#333344"><span style="color:black"><code style="margin-left:0px" class="language-java">String dirtyHTML <span style="color:#a67f59">=</span> <span style="color:#669900">"<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>"</span><span style="color:#999999">;</span>

String cleanHTML <span style="color:#a67f59">=</span> Jsoup<span style="color:#999999">.</span><span style="color:#dd4a68">clean</span><span style="color:#999999">(</span>dirtyHTML<span style="color:#999999">,</span> Whitelist<span style="color:#999999">.</span><span style="color:#dd4a68">basic</span><span style="color:#999999">(</span><span style="color:#999999">)</span><span style="color:#999999">)</span><span style="color:#999999">;</span>

System<span style="color:#999999">.</span>out<span style="color:#999999">.</span><span style="color:#dd4a68">println</span><span style="color:#999999">(</span>cleanHTML<span style="color:#999999">)</span><span style="color:#999999">;</span>
</code></span></span>

Java的

Running it produces the following output:

```html
<p><a href="http://www.yiibai.com/" rel="nofollow">Link</a></p>
```
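
Whitelists can also be built up tag by tag when the presets are too permissive or too strict. The sketch below assumes the jsoup 1.10.2 API pinned in the Maven dependency earlier, where the class is named Whitelist (later jsoup releases renamed it to Safelist). The class name `CleanComment` and the particular set of allowed tags are illustrative choices only.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CleanComment {

    /** Keeps only b, i and a elements (with http/https href) and drops everything else. */
    static String clean(String dirtyHtml) {
        Whitelist allowed = Whitelist.none()              // start from "nothing allowed"
                .addTags("b", "i", "a")
                .addAttributes("a", "href")
                .addProtocols("a", "href", "http", "https");
        return Jsoup.clean(dirtyHtml, allowed);
    }

    /** Reports whether the HTML already satisfies the basic whitelist, without modifying it. */
    static boolean isSafe(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        System.out.println(clean("<b>Hi</b><script>alert(1)</script>"));
        System.out.println(isSafe("<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>"));
    }
}
```

Jsoup.isValid() can be handy for rejecting a submission outright instead of silently rewriting it.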