Spring Boot Crawler Code (Douban Books)

Latest recommended article published on 2024-06-11 11:38:44

W_Meng_H

Categories: SpringBoot, Lessons Learned, ElasticSearch

Article link: https://blog.csdn.net/W_Meng_H/article/details/111408984

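I needed test data for my own study and did not feel like fabricating a large amount of it by hand (too lazy, haha), so I crawled some data from Douban Books (thanks, Douban).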

Workflow: crawl Douban Books with jsoup (a Java HTML parsing library), save the results to a local MySQL database, then use Logstash to transfer the MySQL data to Elasticsearch.

Project source: https://github.com/Vmetrio/reptile

jsoup website: https://jsoup.org

Douban Books: https://book.douban.com/latest?icn=index-latestbook-all
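
The "save to local MySQL" step of the workflow is not shown in this post. As a rough illustration only, a plain JDBC version might look like the sketch below. The class name `BookJdbcWriter`, the table name `books`, and its column names are all assumptions chosen to mirror the fields scraped later in the post; the linked repository may persist the data differently.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch only: the table "books" and its columns are assumed, not taken from the project.
class BookJdbcWriter {
    // Parameterized statement: values are bound as parameters, never concatenated into the SQL.
    static final String INSERT_SQL =
        "INSERT INTO books (bookurl, imgurl, bookname, author, detail) VALUES (?, ?, ?, ?, ?)";

    // Inserts one scraped book and returns the number of rows affected.
    static int save(Connection conn, String bookurl, String imgurl,
                    String bookname, String author, String detail) throws SQLException {
        // try-with-resources closes the statement even if the insert fails
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, bookurl);
            ps.setString(2, imgurl);
            ps.setString(3, bookname);
            ps.setString(4, author);
            ps.setString(5, detail);
            return ps.executeUpdate();
        }
    }
}
```

The `Connection` could come from `DriverManager.getConnection(...)` or from a Spring-managed `DataSource`; either way a MySQL JDBC driver would need to be on the classpath.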

1. Create a Spring Boot project and add the jsoup dependency to the pom file

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

2. Analyze the Douban Books HTML page and write the corresponding crawler code

Core code:

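import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// The class name and @RestController are assumed here; the original post shows only the method.
@RestController
public class ReptileController {

    @GetMapping("/reptile")
    public List<Books> index() throws IOException {
        // Page to crawl
        String url = "https://book.douban.com/latest?icn=index-latestbook-all";
        // Fetch and parse the page (30 s timeout); jsoup returns a Document,
        // comparable to the browser's Document object
        Document document = Jsoup.parse(new URL(url), 30000);
        // jsoup offers DOM-style lookups similar to those available in browser JavaScript
        Element content = document.getElementById("content");

        List<Books> booksList = new ArrayList<>();
        // If the page layout changed and "content" is missing, return an empty list
        if (content == null) {
            return booksList;
        }

        // Each <li> inside the content area is treated as one book entry
        Elements items = content.getElementsByTag("li");

        // Read the fields of interest from each entry
        for (Element el : items) {
            String bookurl = el.getElementsByClass("cover").attr("href");
            String imgurl = el.getElementsByTag("img").attr("src");
            String bookname = el.getElementsByTag("h2").eq(0).text();
            String author = el.getElementsByClass("color-gray").eq(0).text();
            String detail = el.getElementsByClass("detail").eq(0).text();

            Books books = new Books();
            books.setBookurl(bookurl);
            books.setImgurl(imgurl);
            books.setBookname(bookname);
            books.setAuthor(author);
            books.setDetail(detail);
            booksList.add(books);
        }
        return booksList;
    }
}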
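
The handler above relies on a `Books` class that the post does not show. A minimal sketch of what it might look like follows; the field names are inferred from the setters called in the handler, and the `String` types are an assumption. The real class in the linked repository may carry extra fields or persistence annotations.

```java
// Hypothetical sketch of the Books entity; fields are inferred from the
// setters used in the crawler handler, and every type is assumed to be String.
class Books {
    private String bookurl;   // link to the book's detail page
    private String imgurl;    // cover image URL
    private String bookname;  // title text taken from the <h2> element
    private String author;    // text of the "color-gray" element
    private String detail;    // text of the "detail" element

    public String getBookurl() { return bookurl; }
    public void setBookurl(String bookurl) { this.bookurl = bookurl; }

    public String getImgurl() { return imgurl; }
    public void setImgurl(String imgurl) { this.imgurl = imgurl; }

    public String getBookname() { return bookname; }
    public void setBookname(String bookname) { this.bookname = bookname; }

    public String getAuthor() { return author; }
    public void setAuthor(String author) { this.author = author; }

    public String getDetail() { return detail; }
    public void setDetail(String detail) { this.detail = detail; }
}
```

In a real project this would normally be a `public` class in its own `Books.java` file. The public getters are what allow Spring to serialize each object to JSON in the `/reptile` response.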

3. Testing
