Web Crawlers (Part 2): Using Jsoup


lightingsui


Category: Web Crawlers

Article link: https://blog.csdn.net/qq_40697120/article/details/102813033


Required dependencies

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.25</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.8.1</version>
</dependency>

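Three ways to parse with Jsoup

URL parsing: pass in a URL object; Jsoup sends the request itself and parses the response.
File parsing: parse an HTML file on disk.
String parsing: parse a string that contains HTML markup directly.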

Parsing from a URL

Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 10000);
String title = doc.getElementsByTag("title").first().text();
System.out.println(title);

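getElementsByTag returns an Elements collection, which behaves like a list, so first() picks out the first match, and text() returns the text inside the tag. A natural question: if Jsoup can fetch a page on its own, why bother with HttpClient (CloseableHttpClient)? In real projects a crawler needs multithreading, connection pooling, proxies and so on, and Jsoup does not support these well. So Jsoup is used here only as an HTML parsing tool, not as the request client.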
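The division of labour described above can be sketched roughly as follows: HttpClient owns the connection pool and the request, and Jsoup only parses the returned HTML. The class name PooledFetchSketch, the helper names, the pool sizes and the timeouts are illustrative assumptions, not values from the original article.

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

// Hypothetical helper class: names and limits are assumptions for illustration.
class PooledFetchSketch {

    // Builds a client backed by a connection pool (sizes and timeouts are arbitrary).
    static CloseableHttpClient newPooledClient() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(50);            // total connections across all hosts
        pool.setDefaultMaxPerRoute(10);  // connections per target host
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(5000)   // milliseconds
                .setSocketTimeout(10000)   // milliseconds
                .build();
        return HttpClients.custom()
                .setConnectionManager(pool)
                .setDefaultRequestConfig(config)
                .build();
    }

    // Jsoup is used purely as a parser here; it never touches the network.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Fetches with HttpClient and parses with Jsoup; returns null on a non-200 status.
    static String fetchTitle(CloseableHttpClient client, String url) throws IOException {
        try (CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            if (response.getStatusLine().getStatusCode() != 200) {
                return null;
            }
            return titleOf(EntityUtils.toString(response.getEntity(), "UTF-8"));
        }
    }
}
```

A crawler would typically create one pooled client, share it between worker threads, and call fetchTitle(client, url) for each page.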

Parsing from a File

File file = new File("C:\\Users\\LightingSui\\Desktop\\baidu.html");

// The second argument is the charset used to decode the file
Document doc = Jsoup.parse(file, "UTF-8");
String title = doc.getElementsByTag("title").first().text();
System.out.println(title);

Here Jsoup is given a file, namely an HTML file, and the result is obtained by parsing that file.
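Because the path above only exists on the author's machine, a portable variant may be easier to try out. The sketch below writes the HTML to a temporary file first and then parses it with Jsoup.parse(File, String). The class and method names are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: round-trips HTML through a temp file to exercise file parsing.
class FileParseSketch {

    static String titleOfHtmlFile(String html) {
        Path tmp = null;
        try {
            // Write the markup to a temporary .html file encoded as UTF-8
            tmp = Files.createTempFile("jsoup-demo", ".html");
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));

            // Parse the file, telling Jsoup which charset to decode it with
            Document doc = Jsoup.parse(tmp.toFile(), "UTF-8");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // Best-effort cleanup of the temporary file
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // nothing useful to do if cleanup fails
                }
            }
        }
    }
}
```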

Parsing from a string

// try-with-resources closes the client and the response even if parsing fails
try (CloseableHttpClient httpClient = HttpClients.createDefault();
     CloseableHttpResponse response = httpClient.execute(new HttpGet("http://www.baidu.com"))) {

    if (response.getStatusLine().getStatusCode() == 200) {
        String content = EntityUtils.toString(response.getEntity(), "UTF-8");
        Document doc = Jsoup.parse(content);
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }
}

HttpClient makes the request and returns the HTML, and Jsoup then parses that HTML to get the result we want.

Using selectors

Single selectors

getElementsByTag(String): get elements by tag name

String title = doc.getElementsByTag("title").next().text();

getElementById(String)

String title = doc.getElementById("myId").text();

getElementsByClass(String)

String title = doc.getElementsByClass("title").first().text();

getElementsByAttribute(String)

Element ele = doc.getElementsByAttribute("src").first();

getElementsByAttributeValue(String key, String value)

Element ele = doc.getElementsByAttributeValue("src", "www.djhfgsd.com/fasfusahf.png").first();
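To try the five lookups without a network connection, they can be run against a small in-memory page. This is only a sketch: the HTML snippet, the class name SingleSelectorSketch and the helper methods are made up for illustration. The Jsoup methods themselves are getElementsByTag, getElementById, getElementsByClass, getElementsByAttribute and getElementsByAttributeValue.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the page content is invented for the example.
class SingleSelectorSketch {

    static final String HTML =
            "<html><head><title>Demo page</title></head><body>"
          + "<div id=\"myId\">by id</div>"
          + "<p class=\"title\">by class</p>"
          + "<img src=\"/a.png\"><img src=\"/b.png\" alt=\"b\">"
          + "</body></html>";

    static Document doc() {
        return Jsoup.parse(HTML);
    }

    // By tag name: returns Elements, so first() picks one
    static String byTag() {
        return doc().getElementsByTag("title").first().text();
    }

    // By id: returns a single Element (or null when nothing matches)
    static String byId() {
        return doc().getElementById("myId").text();
    }

    // By class name
    static String byClass() {
        return doc().getElementsByClass("title").first().text();
    }

    // By attribute presence: counts every element that has a src attribute
    static int byAttribute() {
        return doc().getElementsByAttribute("src").size();
    }

    // By attribute value: finds the image whose src is exactly /b.png
    static String byAttributeValue() {
        Element img = doc().getElementsByAttributeValue("src", "/b.png").first();
        return img.attr("alt");
    }
}
```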

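Those are the five kinds of single selector. Many of the examples above call functions such as next() and first(). That is because every selector except getElementById returns multiple elements (an Elements collection), so a filter is needed to pick out the element we want. The commonly used filter functions are:

first(): the first matched element
last(): the last matched element
prev(): the sibling element immediately before each matched element
prevAll(): all sibling elements before each matched element
parents(): all ancestors of the matched elements, not just the direct parent
next(): the sibling element immediately after each matched element
nextAll(): all sibling elements after each matched element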
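The behaviour of these filters is easiest to see on a tiny list. The sketch below parses a three-item ul and applies each filter. The markup and the class name FilterSketch are assumptions for the example; eq(int) is used to narrow the Elements collection to a single item before navigating from it.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

// Hypothetical demo class operating on <ul><li>a</li><li>b</li><li>c</li></ul>.
class FilterSketch {

    static Elements items() {
        return Jsoup.parse("<ul><li>a</li><li>b</li><li>c</li></ul>").getElementsByTag("li");
    }

    static String firstText() {
        return items().first().text();          // first match
    }

    static String lastText() {
        return items().last().text();           // last match
    }

    static String afterMiddle() {
        return items().eq(1).next().text();     // sibling right after the middle item
    }

    static String beforeMiddle() {
        return items().eq(1).prev().text();     // sibling right before the middle item
    }

    static int followingFirst() {
        return items().eq(0).nextAll().size();  // every sibling after the first item
    }

    static int precedingLast() {
        return items().eq(2).prevAll().size();  // every sibling before the last item
    }

    static String nearestAncestor() {
        // parents() yields all ancestors; the closest one comes first
        return items().eq(0).parents().first().tagName();
    }
}
```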

Next, the text() function: text() returns the text content of a tag. Take the following tag as an example:

<span>输入法</span>

Calling text() on this Element returns the three characters 输入法.

attr(String) returns the value of an attribute on the tag, for example:

<input type="submit" value="百度一下" id="su" class="btn self-btn bg s_btn">

String value = ele.attr("value");

Calling attr("value") on this Element returns the four characters 百度一下.
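A small runnable sketch of text() and attr() follows. The helper names textOf and attrOf and the class TextAttrSketch are hypothetical. The null checks are there because first() and getElementById return null when nothing matches, while attr() on an existing element returns an empty string for an attribute that is not set.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helpers wrapping Element.text() and Element.attr().
class TextAttrSketch {

    // Text content of the first element matching the CSS query, or "" if none matches
    static String textOf(String html, String cssQuery) {
        Element el = Jsoup.parse(html).select(cssQuery).first();
        return el == null ? "" : el.text();
    }

    // Value of the given attribute on the element with the given id, or "" if absent
    static String attrOf(String html, String id, String attribute) {
        Element el = Jsoup.parse(html).getElementById(id);
        return el == null ? "" : el.attr(attribute);
    }
}
```

For instance, textOf("<span>输入法</span>", "span") corresponds to the text() example above, and attrOf(html, "su", "value") corresponds to the attr() example when html holds the input tag shown earlier.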

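Combined selectors

Anyone who has done front-end work should pick this up quickly, because these selectors use the same syntax as CSS selectors.

First, obtain a Document object as before: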

Document doc = Jsoup.parse(new URL("http://www.baidu.com"), 10000);

Then search with the Document's select() method:

Tag selector

Elements span = doc.select("span");

ID selector

Elements span = doc.select("#first-id");

Class selector

Elements span = doc.select(".font-result");

Attribute selector

Elements span = doc.select("[src]");

Attribute-value selector

Elements span = doc.select("[src=www.baidu.com/adasgg.png]");

These selectors can also be combined, for example:

// Select the div element whose id is div-id
Elements eles = doc.select("div#div-id");
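A few more combinations, run against an in-memory page so the results are reproducible. This is a sketch under assumptions: the markup, the class name CombinedSelectorSketch and the chosen queries are invented for illustration. The combinators used (descendant, child ">", and grouping with ",") are standard CSS syntax that select() accepts.

```java
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

// Hypothetical demo class; the page content is invented for the example.
class CombinedSelectorSketch {

    static final String HTML =
            "<div id=\"div-id\" class=\"box\"><span class=\"font-result\">one</span>"
          + "<img src=\"/logo.png\"></div>"
          + "<div class=\"box\"><span>two</span></div>"
          + "<p id=\"other\"><span class=\"font-result\">three</span></p>";

    // Runs any CSS query against the demo page
    static Elements select(String cssQuery) {
        return Jsoup.parse(HTML).select(cssQuery);
    }

    public static void main(String[] args) {
        System.out.println(select("div#div-id").size());              // tag + id
        System.out.println(select("div.box").size());                 // tag + class
        System.out.println(select("div span.font-result").text());    // descendant
        System.out.println(select("div#div-id > img[src]").attr("src")); // child + attribute
        System.out.println(select("div, p").size());                  // grouping
    }
}
```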
