Building a Web Crawler with Jsoup

爱码～

Category: Tips and Tool Usage | Tags: crawler, java

Article link: https://blog.csdn.net/weixin_47120348/article/details/120238846

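Web crawlers
A web crawler is a program that automatically fetches information from the internet.
Jsoup can download a page's HTML source from its URL. That source contains the site's data, so parsing it yields the information we need.
Development steps
1 Add the jsoup jar to the project
2 Use jsoup to fetch the page's HTML source and convert it into a Document object
3 Use the Document object to get the Element objects you need
4 Read the data out of those Element objects
5 Set up a loop to crawl automatically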

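import java.io.IOException;
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CrawlerDemo {
    // Crawler demo
    public static void main(String[] args) {
        try {
            // Use jsoup to fetch the page's HTML source and convert it into a Document (5000 ms timeout)
            Document parse = Jsoup.parse(new URL("https://pic.netbian.com/"), 5000);
            System.out.println(parse); // print the page source
            // Use the Document to get the Elements we need
            Elements img = parse.getElementsByAttributeValue("alt", "天空小姐姐 黑色唯美裙子 厚涂画风 4k动漫壁纸");
            Elements title = parse.getElementsByAttributeValue("title", "4k壁纸");
            Elements select = parse.select(".w");
            System.out.println("++++++++++++++++++++");
            System.out.println(img);
            System.out.println(title);
            System.out.println(select);
            // Read the data out of the Elements.
            // Elements.attr reads from the first match and returns "" when nothing matched,
            // so it cannot throw IndexOutOfBoundsException the way get(0) can.
            String href = img.attr("src");
            // The second title match is wanted here; guard the index in case the page has changed
            String href1 = title.size() > 1 ? title.get(1).attr("href") : "";
            String text = select.text();
            System.out.println("+++++++++++++++++++++++++++");
            System.out.println("href: " + href);
            System.out.println("href1: " + href1);
            System.out.println("text: " + text);
            System.out.println(href1 + href);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}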
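The demo above covers steps 2 to 4 for a single page. Step 5, the crawl loop, could be sketched roughly as follows. This is only an illustration under stated assumptions: the class `CrawlLoopSketch`, the method `crawl`, the `linksOf` parameter and the page names are all made up here, and the link source is a stub so the loop runs offline.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of the original demo.
class CrawlLoopSketch {

    // Visits pages breadth-first starting at `start`, never visits the same URL twice,
    // and stops once `maxPages` pages have been visited.
    // `linksOf` returns the links found on a given page.
    static List<String> crawl(String start, int maxPages, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already crawled
            }
            for (String link : linksOf.apply(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Stub "site" standing in for real pages; the page names are invented.
        Map<String, List<String>> site = Map.of(
                "/index", List.of("/a", "/b"),
                "/a", List.of("/b", "/index"),
                "/b", List.of("/c"),
                "/c", List.<String>of());
        System.out.println(crawl("/index", 10, url -> site.getOrDefault(url, List.<String>of())));
        // prints [/index, /a, /b, /c]
    }
}
```

In a real crawler, `linksOf` would fetch each page with jsoup and collect the `href` values of its links. The visited set and the page limit keep the loop from running forever on pages that link to each other.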

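Getting to know Jsoup
Jsoup is a toolkit for parsing HTML pages: it parses a page and wraps the result in a Document object. It can also parse XML configuration files.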

// Step 1: the parse() method, which returns a Document object
try {
    // parse() accepts a local file, an HTML string, or a URL
    String path = "com/bjsxt/xml/haha.xml";
    Jsoup.parse(new File(path), "utf-8"); // local file plus its charset; returns a Document
    Jsoup.parse("html"); // an HTML string; returns a Document
    Jsoup.parse(new URL("url"), 1000); // fetch and parse the page within the timeout (ms); returns a Document
} catch (IOException e) {
    e.printStackTrace();
}

// Step 2: use the Document returned by parse() to get the matching tags as Elements objects

Elements a = parse.getElementsByTag("a"); // select tags by tag name
Elements img = parse.getElementsByAttributeValue("alt", "天空小姐姐 黑色唯美裙子 厚涂画风 4k动漫壁纸");
Elements title = parse.getElementsByAttributeValue("title", "4k壁纸"); // select tags by the value of an attribute
Elements select = parse.select(".w"); // select tags with a CSS selector
Element byId = parse.getElementById("id"); // select the single tag with the given id attribute ("id" is a placeholder)
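To try the lookup methods listed above without hitting a live site, one option is to parse a small HTML string. The sketch below assumes jsoup is on the classpath; the class `SelectorSketch`, the method `sample` and the markup are invented for illustration and are not taken from the page the demo scrapes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical example class, not part of the original article.
class SelectorSketch {

    // Made-up markup, loosely shaped like a wallpaper listing.
    static final String HTML =
            "<div id=\"main\" class=\"w\">"
          + "<a href=\"/4kdongman/\" title=\"4k wallpapers\">4K anime</a>"
          + "<img src=\"/uploads/demo.jpg\" alt=\"demo picture\">"
          + "</div>";

    // Parses the sample markup into a Document.
    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        Elements links = doc.getElementsByTag("a");                              // by tag name
        Elements byAlt = doc.getElementsByAttributeValue("alt", "demo picture"); // by attribute value
        Elements byClass = doc.select(".w");                                     // by CSS selector
        Element byId = doc.getElementById("main");                               // by id, null if absent

        System.out.println(links.size());
        System.out.println(byAlt.size());
        System.out.println(byClass.size());
        System.out.println(byId != null);
    }
}
```

The first three lookups return an `Elements` collection that may be empty, while `getElementById` returns a single `Element` or `null`, so the two kinds of result need different checks before use.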

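// Step 3: read the content and attribute values of the selected tags
String href = img.get(0).attr("src"); // value of the src attribute
String href1 = title.get(1).attr("href"); // value of the href attribute
String text = select.text(); // text content of the tags
String html = select.html(); // inner HTML of the tags, markup included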
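Values read with attr("src") or attr("href") are often paths relative to the page, which cannot be downloaded until they are turned into absolute URLs. A minimal sketch of that step, using only `java.net.URI` from the standard library, might look like this. The class `AbsoluteUrlSketch`, the method `toAbsolute` and the example paths are assumptions made for illustration.

```java
import java.net.URI;

// Hypothetical helper, not part of the original article.
class AbsoluteUrlSketch {

    // Resolves an attribute value (relative or absolute) against the URL of the page it came from.
    static String toAbsolute(String pageUrl, String attrValue) {
        return URI.create(pageUrl).resolve(attrValue).toString();
    }

    public static void main(String[] args) {
        // Site-relative path
        System.out.println(toAbsolute("https://pic.netbian.com/", "/uploads/demo.jpg"));
        // prints https://pic.netbian.com/uploads/demo.jpg

        // Path relative to the current directory
        System.out.println(toAbsolute("https://pic.netbian.com/4kdongman/", "index_2.html"));
        // prints https://pic.netbian.com/4kdongman/index_2.html
    }
}
```

Jsoup can do the same job itself through `absUrl` on an element, provided the Document was created with a base URI, as it is when parsing from a URL.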
