Learning Jsoup: the clean Method of the Jsoup Class

哭的好伤心

Article link: https://blog.csdn.net/qq_43678748/article/details/86545173

The Jsoup class

1. Class structure

java.lang.Object
    org.jsoup.Jsoup

public class Jsoup extends Object

The Jsoup class belongs to the org.jsoup package and extends Object.

2. Methods

Method Summary
static String	clean(String bodyHtml, String baseUri, Whitelist whitelist) Filters the input HTML against a Whitelist, keeping only the permitted tags and attributes, to guard against malicious code.
static String	clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings) Filters the input HTML against a Whitelist, keeping only the permitted tags and attributes, to guard against malicious code.
static String	clean(String bodyHtml, Whitelist whitelist) Filters the input HTML against a Whitelist, keeping only the permitted tags and attributes, to guard against malicious code.
static Connection	connect(String url) Creates a connection to a URL.
static boolean	isValid(String bodyHtml, Whitelist whitelist) Tests whether the input HTML satisfies the Whitelist's filtering rules.
static Document	parse(File in, String charsetName) Parses the contents of a file into a Document.
static Document	parse(File in, String charsetName, String baseUri) Parses the contents of a file into a Document.
static Document	parse(InputStream in, String charsetName, String baseUri) Reads an input stream and parses it into a Document.
static Document	parse(InputStream in, String charsetName, String baseUri, Parser parser) Reads an input stream and parses it into a Document.
static Document	parse(String html) Parses a string into an HTML Document.
static Document	parse(String html, String baseUri) Parses a string into an HTML Document.
static Document	parse(String html, String baseUri, Parser parser) Parses a string into a Document using the supplied Parser.
static Document	parse(URL url, int timeoutMillis) Fetches a URL and parses it into a Document.
static Document	parseBodyFragment(String bodyHtml) Parses an HTML fragment, treating it as the contents of body.
static Document	parseBodyFragment(String bodyHtml, String baseUri) Parses an HTML fragment, treating it as the contents of body.
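
The summary lists parseBodyFragment, which differs from parse mainly in that the input is treated as the contents of body. The following is a minimal sketch of that behaviour; the class name BodyFragmentSketch and the helper parseFragment are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentSketch {
    // Hypothetical helper: parse a fragment; jsoup wraps it in an html/head/body shell.
    static Document parseFragment(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        // The <p> tag is deliberately left unclosed.
        Document doc = parseFragment("<p>Hello, <b>jsoup</b>");
        // The fragment ends up inside <body>, with the missing end tag supplied.
        System.out.println(doc.body().html());
        // The text of the paragraph is still available through the usual selectors.
        System.out.println(doc.body().select("p").text());
    }
}
```

This is handy when the input is a snippet of user-supplied markup rather than a full page.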

3. Method details

clean

public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist)

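Filters the input HTML against a Whitelist, keeping only the permitted tags and attributes, to guard against malicious code.

Parameters:

bodyHtml – the untrusted HTML fragment

baseUri – the URL used to resolve relative links in the HTML into absolute ones

whitelist – the whitelist of permitted HTML tags and attributes

Returns:

The safe HTML fragment

Explanation:

This method filters the HTML according to the rules in whitelist and keeps only the tags and attributes that whitelist allows. HTML documents usually contain many links, images, references to external scripts, CSS files and so on, and these may use relative paths. jsoup uses the baseUri parameter to resolve such relative paths into absolute URLs. For example, the href of <a href="/photo/2.jpg">photo</a> is rewritten as an absolute URL resolved against baseUri.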

Example:

String html = "<a href='http://www.baidu/' onclick='stealCookies()'> 百度一下，你就知道 </a>";
String doc = Jsoup.clean(html, Whitelist.basic());
// Output: <a href="http://www.baidu/" rel="nofollow"> 百度一下，你就知道 </a>

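Note: Whitelist offers several preset filtering modes: none, basic, simpleText, basicWithImages, and relaxed. For the exact rules of each, see the Whitelist class. In jsoup 1.14.1 and later the same presets are provided by the Safelist class, which replaces Whitelist.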

Related overloads:

public static String clean(String bodyHtml,  Whitelist whitelist)
This overload takes no baseUri, so it cannot convert relative paths into absolute ones.
public static String clean(String bodyHtml,
                           String baseUri,
                           Whitelist whitelist,
                           Document.OutputSettings outputSettings)
Document.OutputSettings: the document's output settings, which control pretty-printing.
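
To make the role of baseUri more concrete, here is a small sketch of relative-link resolution using only java.net.URI from the standard library. It is not jsoup's own implementation, which may differ in detail; the class name BaseUriSketch, the helper resolve, and the example.com URLs are all assumptions chosen for illustration.

```java
import java.net.URI;

public class BaseUriSketch {
    // Hypothetical helper: resolve a possibly relative link against a base URI.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html";
        // Root-relative path: replaces the whole path of the base.
        System.out.println(resolve(base, "/photo/2.jpg"));            // http://example.com/photo/2.jpg
        // Document-relative path: resolved against the base's directory.
        System.out.println(resolve(base, "img/a.png"));               // http://example.com/news/img/a.png
        // Already absolute: returned unchanged.
        System.out.println(resolve(base, "http://other.example/x"));  // http://other.example/x
    }
}
```

Without a base there is nothing to resolve against, which is why the two-argument clean cannot turn a relative href into an absolute one.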

connect

public static Connection connect(String url)

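Creates a connection to a URL.

Parameters:

url – the URL to connect to; the protocol must be http or https.

Returns:

The connection. You can add data, cookies, and headers to it, and set the user-agent, referrer, and method.

Explanation:

Creates a connection to url. Only the http and https protocols are supported. The request can be sent as either GET or POST, and you can supply whatever the request needs, such as data, cookies, userAgent, and method. The request is not sent until get(), post(), or execute() is called.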

Example:

// GET request with a user agent and a query parameter
Document doc = Jsoup.connect("http://example.com")
.userAgent("Mozilla").data("name","jsoup").get();
// POST request carrying a cookie
Document doc2 = Jsoup.connect("http://example.com")
.cookie("auth","token").post();
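
Because the Connection is configured first and only sent later, the request can be inspected before any network traffic happens. The sketch below builds a request and prints its settings without sending it; the class name ConnectSketch and the helper buildRequest are hypothetical, and the 5000 ms timeout is an arbitrary value picked for the example.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectSketch {
    // Hypothetical helper: configure a POST request but do not send it.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .data("name", "jsoup")
                .cookie("auth", "token")
                .timeout(5000) // milliseconds; arbitrary value for the example
                .method(Connection.Method.POST);
    }

    public static void main(String[] args) {
        Connection conn = buildRequest("http://example.com");
        Connection.Request req = conn.request();
        // Nothing has been sent yet; these are just the stored settings.
        System.out.println(req.method() + " " + req.url());
        System.out.println("timeout ms: " + req.timeout());
        System.out.println("auth cookie: " + req.cookie("auth"));
        // Calling conn.execute() here would actually perform the request.
    }
}
```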

isValid

public static boolean isValid(String bodyHtml, Whitelist whitelist)

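Tests whether the input HTML satisfies the Whitelist's filtering rules.

Parameters:

bodyHtml – the HTML to test

whitelist – the whitelist rules to test against

Returns:

true if every tag and attribute in the HTML is permitted by whitelist, that is, if whitelist would not strip anything from bodyHtml; false otherwise.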

Example:

String html = "<a href='http://www.baidu/' onclick='stealCookies()'> 百度一下，你就知道 </a>";
System.out.println(Jsoup.isValid(html, Whitelist.basic()));
// Prints false: Whitelist.basic() would strip part of the HTML, because the onclick attribute is not permitted by Whitelist.basic().

parse

public static Document parse(File in,  String charsetName, String baseUri)    throws IOException

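Parses the contents of a file into a Document.

Parameters:

in – the HTML file

charsetName – the character encoding of the file. If set to null, the encoding is taken from the http-equiv meta tag when one is present; otherwise jsoup falls back to UTF-8, which is usually a safe choice.

baseUri – the URL used to resolve relative links in the HTML into absolute ones.

Returns:

A well-formed (sane) HTML document

Throws:

IOException – if the file cannot be found or read, or if charsetName is invalid.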

Example:

File file = new File("C:/baidu.txt");
Document doc = Jsoup.parse(file, "GBK", "http://www.baidu.com");

Related overloads:

public static Document parse(File in, String charsetName)  throws IOException
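
The charsetName argument matters whenever the file is not UTF-8. As a rough illustration, the sketch below writes a small page to a temporary file in ISO-8859-1 and parses it twice, once with the matching charset and once with UTF-8. The class name FileCharsetSketch, the helper titleWithCharset, and the example.com base URI are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class FileCharsetSketch {
    // Hypothetical helper: write a page encoded as ISO-8859-1 to a temp file,
    // parse it with the given charset name, and return the decoded title.
    static String titleWithCharset(String charsetName) {
        try {
            File file = File.createTempFile("jsoup-demo", ".html");
            try {
                String html = "<html><head><title>Caf\u00e9</title></head><body></body></html>";
                Files.write(file.toPath(), html.getBytes(StandardCharsets.ISO_8859_1));
                Document doc = Jsoup.parse(file, charsetName, "http://example.com/");
                return doc.title();
            } finally {
                file.delete(); // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Matching charset: the accented character survives.
        System.out.println(titleWithCharset("ISO-8859-1"));
        // Mismatched charset: the accented byte cannot be decoded correctly.
        System.out.println(titleWithCharset("UTF-8"));
    }
}
```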

parse

public static Document parse(InputStream in,  String charsetName,String baseUri, Parser parser)    throws IOException

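Reads an input stream and parses it into a Document. You can supply an alternate parser, for example the XML parser instead of the default HTML parser.

Parameters:

in – the input stream. Make sure to close it once parsing has finished.

charsetName – the character encoding of the stream. If set to null, the encoding is taken from the http-equiv meta tag when one is present; otherwise jsoup falls back to UTF-8, which is usually a safe choice.

baseUri – the URL used to resolve relative links in the HTML into absolute ones.

parser – the alternate parser to use

Returns:

A well-formed (sane) HTML document

Throws:

IOException – if the stream cannot be read, or if charsetName is invalid.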

Example:

FileInputStream input = new FileInputStream("C:/baidu.txt");
Document doc = Jsoup.parse(input, "GBK", "http://www.baidu.com", Parser.htmlParser());
System.out.println(doc);
input.close(); // the caller is responsible for closing the stream

Related overloads:

public static Document parse(InputStream in, String charsetName,String baseUri)   throws IOException
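
Since the caller has to close the stream, a try-with-resources block is a convenient way to guarantee that even when parsing throws. The following sketch uses an in-memory stream so that it runs without any file on disk; the class name StreamParseSketch, the helper bodyText, and the example.com base URI are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamParseSketch {
    // Hypothetical helper: parse UTF-8 bytes from a stream and return the body text.
    static String bodyText(byte[] utf8Html) {
        // try-with-resources closes the stream whether or not parsing succeeds.
        try (InputStream in = new ByteArrayInputStream(utf8Html)) {
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
            return doc.body().text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<p>Hello <b>stream</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(bodyText(bytes));
    }
}
```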

parse

public static Document parse(String html,  String baseUri,  Parser parser)

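Parses an HTML string into a Document. You can supply an alternate parser, for example the XML parser instead of the default HTML parser.

Parameters:

html – the HTML string

baseUri – the URL used to resolve relative links in the HTML into absolute ones

parser – the alternate parser to use

Returns:

A well-formed (sane) HTML document

Explanation:

Parses the HTML string into a Document.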

Example:

String html = "<html><head><title>First parse</title></head>"
+ "<body>Parsed HTML into a doc.</body></html>";
// baseUri needs a scheme (http://) to be usable for resolving relative links
Document doc = Jsoup.parse(html, "http://www.baidu.com", Parser.htmlParser());
System.out.println(doc);

Related overloads:

public static Document parse(String html, String baseUri)
public static Document parse(String html)
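
The effect of the parser argument is easiest to see by parsing the same markup twice. In this sketch the HTML parser wraps the input in the usual html/head/body structure, while the XML parser keeps the markup as written. The class name ParserChoiceSketch, the helper parseWith, the sample note markup, and the example.com base URI are assumptions for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class ParserChoiceSketch {
    // Hypothetical helper: parse the markup with whichever parser is supplied.
    static Document parseWith(String markup, Parser parser) {
        return Jsoup.parse(markup, "http://example.com/", parser);
    }

    public static void main(String[] args) {
        String markup = "<note><to>Tove</to></note>";

        Document asHtml = parseWith(markup, Parser.htmlParser());
        Document asXml = parseWith(markup, Parser.xmlParser());

        // The HTML parser normalises the input into a full page with a body element.
        System.out.println("HTML parser body elements: " + asHtml.select("body").size());
        // The XML parser adds no html/head/body scaffolding.
        System.out.println("XML parser body elements: " + asXml.select("body").size());
        // Either way, the original elements can be selected.
        System.out.println(asXml.select("note > to").text());
    }
}
```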

parse

public static Document parse(URL url, int timeoutMillis) throws IOException

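Fetches a URL and parses the returned HTML into a Document. The connect method is normally used instead. The character encoding is determined from the content-type header or the http-equiv meta tag, and otherwise falls back to UTF-8.

Parameters:

url – the URL to fetch (using GET); the protocol must be http or https.

timeoutMillis – the connection and read timeout, in milliseconds. If it is exceeded, an IOException is thrown.

Returns:

The parsed HTML

Explanation:

This method is provided for compatibility; in most cases connect is used in its place.

Throws:

MalformedURLException – if the protocol of the requested URL is not http or https.

HttpStatusException – if the HTTP response status is not OK.

UnsupportedMimeTypeException – if the MIME type of the response is not supported.

SocketTimeoutException – if the connection times out.

IOException – on a connection or read error.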

Example:

URL url = new URL("http://www.baidu.com");
Document doc = Jsoup.parse(url, 500); // 500 ms timeout
System.out.println(doc);
