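Basic Usage of Jsoup for Web Crawling

What is Jsoup?

jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It offers a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods, which makes it a common choice for writing simple crawlers.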
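To make the description concrete, here is a minimal sketch that parses an HTML string and reads data from it in both the CSS-selector style and the DOM style. The sample markup, the class name `ParseStringExample`, the CSS class names, and the `example.com` base URI are all assumptions made up for illustration; only the jsoup calls themselves (`Jsoup.parse`, `select`, `getElementsByTag`, `absUrl`) come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ParseStringExample {

    // Hypothetical sample markup; the class names are made up for this sketch.
    static final String HTML =
            "<html><body>"
            + "<div class=\"post\"><a class=\"post-link\" href=\"/posts/1/\">Hello jsoup</a></div>"
            + "<div class=\"post\"><a class=\"post-link\" href=\"/posts/2/\">Second post</a></div>"
            + "</body></html>";

    // CSS-selector style: collect the text of every matching link.
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("div.post > a.post-link")) {
            result.add(link.text());
        }
        return result;
    }

    // DOM style: take the first <a> element and resolve its href
    // against the base URI that was passed to the parser.
    static String firstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element first = doc.getElementsByTag("a").first();
        return first == null ? "" : first.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(titles(HTML));
        System.out.println(firstLink(HTML, "https://example.com/"));
    }
}
```

Parsing a string this way needs no network access, so it is a convenient way to try out selectors before pointing the crawler at a real site.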

Basic usage

Create a new Maven project and add the following dependencies to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpmime</artifactId>
        <version>4.0.1</version>
    </dependency>
    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1.1</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>1.4</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>compile</scope>
    </dependency>
</dependencies>

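Test class

import org.jsoup.Connection;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

// The class name is arbitrary; any JUnit 4 test class works.
public class CrawlerTest {

    @Test
    public void test111() throws Exception {
        // 1. The URL to crawl
        String targetUrl = "https://zhipeng0908.gitee.io";
        // 2. Get a Connection. Step 3 (setting the request headers) happens
        //    inside the CrawlerUtil utility class shown further down.
        Connection connect = CrawlerUtil.getConnection(targetUrl);
        // 4. Execute the request
        Connection.Response response = connect.method(Connection.Method.GET).execute();
        // 5. Process the result: parse the response body into a DOM
        Document document = response.parse();
        // The <body> element
        Element bodyElement = document.body();
        // .post-header is the class name of a div on this page.
        // Elements extends ArrayList<Element>, so it can be iterated directly.
        Elements cardElement = bodyElement.select(".post-header");
        // Extract the text content of each post
        for (Element blog : cardElement) {
            Elements titleElement = blog.select(".post-title");
            String title = titleElement.text();
            Elements timeElement = blog.select(".post-meta > span.post-time > time");
            String time = timeElement.text();
            Elements linkElement = blog.select(".post-title-link");
            // The href is site-relative, so it is prefixed with the site URL below
            String link = linkElement.attr("href");
            System.out.println("Title: " + title + "\t" + "url: " + (targetUrl + link) + "\t" + "Published: " + time);
        }
    }
}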
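Link extraction depends on how a relative `href` is turned into a full URL. Prefixing the site URL works when every `href` starts with `/` and the site URL has no trailing slash, but it produces wrong results for other forms of link. The sketch below shows standard URL resolution using only `java.net.URI`; the class name `LinkResolver` and the `example.com` URLs are hypothetical. jsoup offers the same kind of resolution through `Element.absUrl("href")`, which resolves against the document's base URI.

```java
import java.net.URI;

class LinkResolver {

    // Resolves href against the URL of the page it was found on,
    // following the standard relative-reference rules.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://example.com/blog/";

        // Root-relative link: the path of the page URL is replaced.
        System.out.println(resolve(page, "/2020/05/13/post/"));

        // Path-relative link: appended to the directory of the page URL.
        System.out.println(resolve(page, "archives/"));

        // Absolute link: returned unchanged.
        System.out.println(resolve(page, "https://other.example/x"));

        // Naive concatenation, for comparison: yields a doubled slash
        // and keeps the /blog/ prefix that a browser would drop.
        System.out.println(page + "/2020/05/13/post/");
    }
}
```

The page URL in this sketch ends with a slash on purpose, so that path-relative links resolve against a well-defined directory.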

Utility class

import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class CrawlerUtil {

    public static Connection getConnection(String targetUrl) {
        Connection connect = Jsoup.connect(targetUrl);
        // 3. Set browser-like request headers
        connect.header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9");
        // "br" is left out here: jsoup does not decode Brotli-compressed responses
        connect.header("Accept-Encoding", "gzip, deflate");
        connect.header("Accept-Language", "zh-CN,zh;q=0.9");
        connect.header("Cache-Control", "no-cache");
        connect.header("Connection", "keep-alive");
        connect.header("Cookie", "_ga=GA1.2.2130438396.1588431092; Hm_lvt_ec661610f14acf2457496da3a87d804d=1588840665,1589378478; Hm_lpvt_ec661610f14acf2457496da3a87d804d=1589378528");
        connect.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36");
        return connect;
    }
}

Result

(Screenshot of the console output: one line per blog post, with its title, URL, and publication time.)
