A Learning Test: Scraping Data with Jsoup

博客胡

Last modified: 2022-04-26 17:05:08

Tags: java html5

First published: 2022-04-25 11:27:14

Article link: https://blog.csdn.net/weixin_44451527/article/details/124400288

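Jsoup is an HTML parser for Java. It can parse a URL or a string of HTML text directly.

This is for self-study only (never use it commercially, and never use it for anything illegal!).

Without further ado, here is the test code.

I won't go through every parameter in detail. In my test, a GET request to the link returned the complete HTML page, including the hidden input fields (the POST request did not).
ignoreContentType: ignore the content type of the response when parsing it; by default jsoup throws an exception for content types it does not support.
ignoreHttpErrors: ignore HTTP errors, so no exception is thrown when the response has an error status code (that is roughly how I remember it; please check the documentation).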

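    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Placeholder cookie value; replace it with the one from your own session.
    String cookie = "cookevaluesasdsadas";
    Document document = Jsoup.connect("URL to fetch data from")
            .referrer("null")
            // cookie(name, value): "cookieName" is a placeholder for the real cookie name
            .cookie("cookieName", cookie)
            // request parameters; use your own keys and values
            .data("key1", "value1")
            .data("key2", "value2")
            .ignoreContentType(true)   // accept any response content type
            .ignoreHttpErrors(true)    // do not throw on HTTP error status codes
            .timeout(100000)           // timeout in milliseconds
            .maxBodySize(0)            // 0 means no limit on the response body size
            .get();
    // Look up an element by its id attribute.
    Element pageParam = document.getElementById("Id");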
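
Since the GET response carries the hidden input fields, their values can be read straight from the parsed page. The following is only a minimal sketch: the inline HTML is made up to stand in for a real response, and `hiddenValue` is a hypothetical helper, not something from the original test code.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class HiddenFieldExample {

    // Returns the value attribute of the element with the given id,
    // or an empty string when the page has no such element.
    static String hiddenValue(Document page, String id) {
        Element field = page.getElementById(id);
        return field == null ? "" : field.val();
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for what a GET request would return.
        String html = "<form>"
                + "<input type=\"hidden\" id=\"__VIEWSTATE\" value=\"abc123\">"
                + "<input type=\"text\" id=\"txt_s_time\" value=\"2022-04-03\">"
                + "</form>";
        Document page = Jsoup.parse(html);

        System.out.println(hiddenValue(page, "__VIEWSTATE")); // prints abc123
        System.out.println(hiddenValue(page, "missing").isEmpty()); // prints true
    }
}
```

Returning an empty string for a missing id avoids the `NullPointerException` that a direct `getElementById(...).val()` chain would throw when the field is absent.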

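POST requests: when you send a POST, jsoup may not get the content that the page loads with a delay. In that case you can add a dynamic parameter to the request to compensate for the delay. (First call the GET version of the method once, which has no dynamic parameters.)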

    Document test = Jsoup.connect("url")
            // cookie(name, value): "cookieName" is a placeholder for the real cookie name
            .cookie("cookieName", cookie)
            .data("ScriptManager1", "UpdatePanel2|AspNetPager1")
            .data("__EVENTTARGET", "AspNetPager1")
            .data("__EVENTARGUMENT", "3")
            .data("__ASYNCPOST", "true")
            .data("__VIEWSTATEGENERATOR", "9C794D29")
            .data("txt_s_time", "2022-04-03")
            .data("txt_e_time", "2022-04-14")
            // dynamic parameter; getId() is a helper that is not shown here
            .data("__VIEWSTATE", getId())
            //.data("null",get())
            .ignoreContentType(true)
            .ignoreHttpErrors(true)
            .timeout(100000)
            .post();
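
When the same POST has to be repeated with different values, it can help to collect the form fields in one place. The sketch below is only an illustration: `PostbackFields` and `forPage` are hypothetical names, and it assumes that `__EVENTARGUMENT` is the pager's page number and that `__VIEWSTATE` is the only other field that changes between requests. The remaining values are copied from the request above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PostbackFields {

    // Builds the form fields for one pager request.
    // pageNumber and viewState are the assumed dynamic parts;
    // all other values are the fixed ones from the request above.
    static Map<String, String> forPage(int pageNumber, String viewState) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ScriptManager1", "UpdatePanel2|AspNetPager1");
        fields.put("__EVENTTARGET", "AspNetPager1");
        fields.put("__EVENTARGUMENT", String.valueOf(pageNumber));
        fields.put("__ASYNCPOST", "true");
        fields.put("__VIEWSTATEGENERATOR", "9C794D29");
        fields.put("txt_s_time", "2022-04-03");
        fields.put("txt_e_time", "2022-04-14");
        fields.put("__VIEWSTATE", viewState);
        return fields;
    }

    public static void main(String[] args) {
        // "placeholder-viewstate" stands in for a value read from a real page.
        Map<String, String> fields = forPage(3, "placeholder-viewstate");
        fields.forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

jsoup's `Connection.data(Map<String, String>)` accepts such a map, so it could replace the chain of individual `.data(key, value)` calls.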

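Once you have the Document object, you can parse it however you need. Here is a parsing template.

How you parse depends on the HTML you actually get back. That HTML is not generic, but some of HTML's own structures, such as table elements, may well be handled the same way everywhere.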

    String html = "HTML string";
    Document doc = Jsoup.parse(html);

    // Parse the HTML as a body fragment
    Document bodyHtml = Jsoup.parseBodyFragment(html);
    System.out.println(bodyHtml);

    // Text of the first div whose style attribute contains this string
    Elements workElements = doc.select("div[style*=padding-left:5px; width:100px; background-color:#6a9351; color:#fff; float:left;]");
    String workText = workElements.get(0).text();

    // Text of the first link inside the first div whose style contains this string
    Elements hoursElements = doc.select("div[style*=float:right;padding-right:5px;]");
    String hoursText = hoursElements.get(0).select("a").get(0).text();

    // Parse the first table
    Element element = doc.select("table").first();
    // Parse all tables (the loop starts at index 1, so the first table is skipped; start at 0 to include it)
    List<Element> tableList = doc.select("table");
    List<List<String>> stringList = new ArrayList<>();
    for (int i = 1; i < tableList.size(); i++) {
        Elements trElements = tableList.get(i).select("tr");
        // Collects the text of every cell in this table, row after row
        List<String> trList = new ArrayList<>();
        for (Element elem : trElements) {
            Elements tdList = elem.select("td");
            for (Element as : tdList) {
                trList.add(as.text());
            }
        }
        stringList.add(trList);
    }
    System.out.println(stringList);
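
The template above puts all the cells of one table into a single flat list. If you would rather keep the rows separate, a variation like the following may be useful. It is only a sketch: `TableRowsExample` and `rowsOf` are hypothetical names, the inline HTML is made up, and skipping rows without `td` cells (such as header rows) is a design choice of this example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class TableRowsExample {

    // Converts one table into a list of rows, each row a list of cell texts.
    // Rows that contain no td cells (for example header rows built from th) are skipped.
    static List<List<String>> rowsOf(Element table) {
        List<List<String>> rows = new ArrayList<>();
        for (Element tr : table.select("tr")) {
            Elements tds = tr.select("td");
            if (tds.isEmpty()) {
                continue;
            }
            List<String> cells = new ArrayList<>();
            for (Element td : tds) {
                cells.add(td.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up table standing in for real scraped HTML.
        String html = "<table>"
                + "<tr><th>Date</th><th>Hours</th></tr>"
                + "<tr><td>2022-04-03</td><td>8</td></tr>"
                + "<tr><td>2022-04-04</td><td>7.5</td></tr>"
                + "</table>";
        Document doc = Jsoup.parse(html);
        Element table = doc.select("table").first();

        System.out.println(rowsOf(table)); // prints [[2022-04-03, 8], [2022-04-04, 7.5]]
    }
}
```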