A simple example of web scraping in Java with Jsoup

zhangshengqiang168

Category: java  Tags: java scraping

Article link: https://blog.csdn.net/zhangshengqiang168/article/details/104006561

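1. Scraping comes with many restrictions and is a running battle of wits with anti-scraping measures. Every site has to be analyzed first, and this project may no longer be able to fetch data; it only illustrates the approach.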

Project download:

Link: https://pan.baidu.com/s/1jkhT4mJqP_tsDaN2VEJiZw
Extraction code: nsyu

1. pom.xml dependencies

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpcore -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpcore</artifactId>
            <version>4.4.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <!-- fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>

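Data loaded by JS requests

1. Analyze the page, watch the network requests it makes, and look for a pattern.

2. When the data is loaded by a JS request, call the same endpoint directly: the response already contains the data, so all that is left is to parse and analyze it.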

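import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import org.jsoup.Jsoup;
String parentUrl = "https://movie.douban.com/j/search_tags?type=tv&source=";
// The endpoint returns JSON, so read the raw response body instead of parsing it as HTML
String json = Jsoup.connect(parentUrl)
                .ignoreContentType(true).execute().body();
JSONArray tags = JSON.parseObject(json).getJSONArray("tags");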
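The two steps above can be split into a fetch step and a parse step, which makes the parsing testable without touching the network. This is a minimal sketch: the class name `TagFetcher`, its method names, and the sample payload in `main` are illustrative assumptions, and the only thing assumed about the response is that it is a JSON object with a `tags` array, as the snippet above implies.

```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import org.jsoup.Jsoup;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original project.
final class TagFetcher {

    /** Downloads the raw response body; ignoreContentType lets Jsoup accept a JSON response. */
    static String fetchJson(String url) throws IOException {
        return Jsoup.connect(url)
                .ignoreContentType(true)
                .execute()
                .body();
    }

    /** Reads the "tags" array from a JSON object; returns an empty list if the key is absent. */
    static List<String> parseTags(String json) {
        JSONArray tags = JSON.parseObject(json).getJSONArray("tags");
        List<String> result = new ArrayList<>();
        if (tags == null) {
            return result;
        }
        for (int i = 0; i < tags.size(); i++) {
            result.add(tags.getString(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample payload so the parsing step can run offline.
        String sample = "{\"tags\":[\"tagA\",\"tagB\"]}";
        System.out.println(parseTags(sample));
    }
}
```

To run it against the live endpoint, pass the result of `TagFetcher.fetchJson(parentUrl)` to `parseTags`; whether the site still answers that request is not guaranteed.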

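Data embedded in the page

1. Inspecting the page shows that every book is wrapped in an li tag with class=subject-item, so the code to get a book's information is: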

String url = "https://book.douban.com/tag/小说";
Element e = Jsoup.connect(url).get().select(".subject-item").get(0);

select(".subject-item") matches every book with class=subject-item on the page; get(0) then takes the first one.
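The tag in that URL contains non-ASCII characters. One way to avoid depending on how the HTTP client handles them is to percent-encode the tag before building the URL. The sketch below uses only the JDK; the class name `TagUrls` and the method `tagUrl` are hypothetical names introduced for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original project.
final class TagUrls {

    /** Builds the tag page URL with the tag percent-encoded as UTF-8. */
    static String tagUrl(String tag) {
        try {
            // URLEncoder targets form encoding and turns spaces into '+',
            // so convert them to %20 for use in a URL path segment.
            String encoded = URLEncoder.encode(tag, "UTF-8").replace("+", "%20");
            return "https://book.douban.com/tag/" + encoded;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }
}
```

For example, `TagUrls.tagUrl("小说")` yields `https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4`, which can be passed to `Jsoup.connect` in place of the literal URL.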

2. To get the title, note that it sits in the div with class=info under subject-item, inside the a tag of that div's h2 tag, so the code narrows the selection step by step:

Element a2 = e.select(".info").get(0);  // the div with class=info inside the book entry
String title = a2.select("h2").select("a").attr("title"); // book title

3. The title obtained this way is 坏小孩.

4. The other fields can be obtained in the same way.
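The steps above read only the first book. To collect the title of every book on the page, the same selectors can be applied in a loop. The following is a sketch under assumptions: the class name `BookTitles` is hypothetical, and the HTML in `main` is a hand-written fragment that mirrors the structure described above (li.subject-item, then div.info, then h2, then a) so the example runs offline; the real page markup may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original project.
final class BookTitles {

    /** Returns the title attribute of the link inside each .subject-item entry. */
    static List<String> titles(Document doc) {
        List<String> titles = new ArrayList<>();
        for (Element item : doc.select(".subject-item")) {
            // Descendant selector: the a tag under h2 under the .info div.
            Element link = item.selectFirst(".info h2 a");
            if (link != null) {
                titles.add(link.attr("title"));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hand-written fragment imitating the structure described in the text.
        String html = "<ul>"
                + "<li class=\"subject-item\"><div class=\"info\">"
                + "<h2><a title=\"Book One\">Book One</a></h2></div></li>"
                + "<li class=\"subject-item\"><div class=\"info\">"
                + "<h2><a title=\"Book Two\">Book Two</a></h2></div></li>"
                + "</ul>";
        System.out.println(titles(Jsoup.parse(html)));
    }
}
```

Against the live site, the document would come from `Jsoup.connect(url).get()` instead of `Jsoup.parse(html)`. Entries without a matching link are skipped rather than causing an exception, which is a design choice of this sketch.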

Simulating a logged-in browser

1. Log in through the browser and inspect the cookie information

// Cookie names and values copied from the logged-in browser session
Map<String, String> cookies = new HashMap<>();
cookies.put("loc-last-index-location-id", "118254");
cookies.put("dbcl2", "164885001:pu3jCf6Fsls");

2. Send the request with those cookies to simulate a logged-in session

Elements els = Jsoup.connect(url)
         .header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)")
         .cookies(cookies)
         .get().select(".tagCol").select("a");
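Instead of typing each cookie by hand, the whole Cookie request header can be copied from the browser's developer tools and split into the map that `.cookies(...)` expects. This is a minimal sketch using only the JDK; the class name `CookieHeader` is hypothetical, the values in `main` are placeholders, and quoted cookie values are kept exactly as copied.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original project.
final class CookieHeader {

    /** Splits a header such as "a=1; b=2" into name/value pairs, keeping their order. */
    static Map<String, String> parse(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : header.split(";")) {
            // Split on the first '=' only, because cookie values may contain '='.
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // no name/value separator in this fragment
            }
            String name = pair.substring(0, eq).trim();
            String value = pair.substring(eq + 1).trim();
            if (!name.isEmpty()) {
                cookies.put(name, value);
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        // Placeholder values; use the header copied from your own session.
        String header = "loc-last-index-location-id=PLACEHOLDER; dbcl2=PLACEHOLDER";
        System.out.println(parse(header));
    }
}
```

The returned map can be passed directly to `.cookies(cookies)` in the request above. Session cookies expire, so a copied header only works for as long as the site keeps that session valid.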