Java Web Crawler in Practice

Author: 王粤同学

Category: java; tags: java

Article link: https://blog.csdn.net/weixin_44327435/article/details/95897555

Java_spider_hands-on

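Crawler workflow: 1) determine the index-page URL 2) send the request and fetch the data 3) parse the data 4) save the data
The three core modules of a crawler:
1. Sending requests and fetching data: HttpClient
* 1) Obtain an HttpClient object
* 2) Create the request object for the HTTP method (for example HttpGet or HttpPost)
* 3) Set the request parameters and request headers
* 4) Send the request and obtain the response object
* 5) Read the data from the response
* 6) Release resources
2. Parsing data: Jsoup
* Common methods:
* static parse(String html): converts an HTML string into a Document object
* select("selector"): returns the elements that match the selector
* text() / html(): returns the text or the HTML inside the selected element
* attr(String name): returns the value of the attribute with the given name
3. Saving data: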

0. Overview of the whole crawler workflow

0.1 Workflow of the 163 Entertainment crawler

0.2 Workflow of the Tencent Entertainment crawler

1.1 Preparation

1) Create the project: gossip-spider-news (a Maven jar project)

2) Add the pom dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.4</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-jdbc</artifactId>
        <version>4.2.4.RELEASE</version>
    </dependency>
    <dependency>
        <groupId>c3p0</groupId>
        <artifactId>c3p0</artifactId>
        <version>0.9.1.2</version>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.8.1</version>
    </dependency>
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
    </dependency>
</dependencies>

3) Add the utility classes:

1.2 Determine the index-page URL

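Conclusion: the news data is not returned by the synchronous page request. The page loads it in the background through an asynchronous request.

How to find the URL of the asynchronous request:

Index-page URL: https://ent.163.com/special/000380VU/newsdata_index.js?callback=data_callback
Paging URL: https://ent.163.com/special/000380VU/newsdata_index_02.js?callback=data_callback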
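
The two URLs above differ only by the `_02` suffix, which suggests that list pages can be generated rather than hard-coded. The following is a minimal sketch under two assumptions that the article does not confirm: that `_02` denotes the second page, and that later pages continue as `_03`, `_04`, and so on. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: builds list-page URLs for the 163 entertainment feed.
class News163PageUrls {

    private static final String BASE =
            "https://ent.163.com/special/000380VU/newsdata_index";
    private static final String SUFFIX = ".js?callback=data_callback";

    // Page 1 has no numeric suffix; the paging URL in the article ends in "_02".
    // ASSUMPTION: later pages keep the two-digit, zero-padded pattern.
    static String pageUrl(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1: " + page);
        }
        if (page == 1) {
            return BASE + SUFFIX;
        }
        return BASE + String.format(Locale.ROOT, "_%02d", page) + SUFFIX;
    }

    // Returns the URLs of pages 1..count in order.
    static List<String> firstPages(int count) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= count; page++) {
            urls.add(pageUrl(page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : firstPages(3)) {
            System.out.println(url);
        }
    }
}
```

A crawler would normally stop paging as soon as a generated URL no longer returns data, since the number of available pages is not known in advance.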

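1.3 Send the request and fetch the data

public class News163Spider {

    public static void main(String[] args) throws Exception {
        // 1. Determine the index-page URL
        String indexUrl = "https://ent.163.com/special/000380VU/newsdata_index.js?callback=data_callback";
        // 2. Send the request and fetch the data
        // The response is not standard JSON: it is wrapped in a callback, data_callback(...)
        String jsonStr = HttpClientUtils.doGet(indexUrl);
        jsonStr = splitJson(jsonStr);
        System.out.println(jsonStr);

        // 3. Parse the data (JSON):
        /*
         * 1) How many JSON formats are there? Two. Compound formats are
         *    generally regarded as extensions of the first two.
         *      Format 1:  [value1,value2,value3 ...]    --- array
         *      Format 2:  {key1:value1,key2:value2}     --- object
         *      Format 3:  {key1:[value1,value2]}        --- compound
         *      Format 4:  [{},{}]                       --- compound
         *
         * 2) How to tell whether a JSON document is an object or an array?
         *      Look at the outermost symbol: [] means array, {} means object.
         *          [{key:[{key:value}]}] maps back to the type
         *              List<Map<String,List<Map<String,String>>>>
         *
         * 3) JSON is essentially a string; JS and Java represent it with different types:
         *           JS                      Java
         *    []    array                   array / List / Set
         *    {}    object                  JavaBean / Map
         *
         *    Defining an object in JS:  var person = {username:'张三'};   person.username
         */
    }

    // Convert the non-standard JSON (JSONP) into a standard JSON string
    private static String splitJson(String jsonStr) {
        int firstIndex = jsonStr.indexOf("(");
        int lastIndex = jsonStr.lastIndexOf(")");
        // No callback wrapper found: return the input unchanged instead of throwing
        if (firstIndex < 0 || lastIndex <= firstIndex) {
            return jsonStr;
        }
        return jsonStr.substring(firstIndex + 1, lastIndex);
    }
}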
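
The unwrapping step and the "look at the outermost symbol" rule from the comment above can be tried without any network access. This is only a sketch with hypothetical names (`JsonpUtil`, `unwrap`, `outerType`), and the sample response body is made up for illustration. It unwraps only when the text actually ends with `)`, so plain JSON that merely contains parentheses inside a string is left alone.

```java
// Hypothetical helper: unwraps a JSONP response and reports the outer JSON type.
class JsonpUtil {

    // Strips a wrapper of the form callbackName( ... ) or callbackName( ... );
    // Text without such a wrapper is returned trimmed but otherwise unchanged.
    static String unwrap(String body) {
        String s = body.trim();
        if (s.endsWith(";")) {
            s = s.substring(0, s.length() - 1).trim();
        }
        int first = s.indexOf('(');
        if (first < 0 || !s.endsWith(")")) {
            return s; // no callback wrapper
        }
        return s.substring(first + 1, s.length() - 1).trim();
    }

    // Decides by the outermost symbol: '[' is an array, '{' is an object.
    static String outerType(String json) {
        String s = json.trim();
        if (s.startsWith("[")) {
            return "array";
        }
        if (s.startsWith("{")) {
            return "object";
        }
        return "other";
    }

    public static void main(String[] args) {
        // Made-up sample body in the shape of a JSONP response
        String body = "data_callback([{\"title\":\"t1\",\"docurl\":\"u1\"}])";
        String json = unwrap(body);
        System.out.println(json);
        System.out.println(outerType(json));
    }
}
```

Knowing the outer type up front tells you whether to deserialize into a List or into a Map/JavaBean.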

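1.4 Parse the data (JSON)

Parsing the news list page:

    // Requires imports: com.google.gson.Gson, com.google.gson.reflect.TypeToken,
    // java.lang.reflect.Type, java.util.List, java.util.Map
    // Parses the JSON of the list page
    private static void parseJson(String jsonStr) throws Exception {
        // 3.1 Convert the JSON string into the target type
        Gson gson = new Gson();
        Type listType = new TypeToken<List<Map<String, Object>>>() {}.getType();
        List<Map<String, Object>> newsList = gson.fromJson(jsonStr, listType);
        // 3.2 Iterate over the news collection and handle each news entry
        for (Map<String, Object> newsObj : newsList) {
            // A news item has: title, time, source, content, editor, URL
            // 3.2.1 Get the news URL; the detail-page data is fetched from it
            String docUrl = (String) newsObj.get("docurl");
            // Skip entries without a URL and URLs that are not news articles
            if (docUrl == null || docUrl.contains("photoview") || docUrl.contains("v.163.com")) {
                continue;
            }
            //System.out.println(docUrl);
            // 3.2.2 Fetch and parse the news detail page
            parseNewsItem(docUrl);
        }
    }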
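
The URL filter in the loop above can be pulled out into a small predicate so that it is easy to test and extend. This is a sketch, not the author's code: the class name `DocUrlFilter` is hypothetical, the sample URLs in `main` are invented, and only the two markers `photoview` and `v.163.com` come from the article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: decides whether a docurl should be crawled as a news article.
class DocUrlFilter {

    // Markers taken from the article: URLs containing these are skipped.
    private static final List<String> EXCLUDED = Arrays.asList("photoview", "v.163.com");

    static boolean isArticleUrl(String docUrl) {
        // A list entry without a docurl field yields null; treat it as "skip".
        if (docUrl == null || docUrl.trim().isEmpty()) {
            return false;
        }
        for (String marker : EXCLUDED) {
            if (docUrl.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the URLs that pass isArticleUrl, preserving order.
    static List<String> keepArticles(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (isArticleUrl(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Invented sample URLs, for illustration only
        List<String> urls = Arrays.asList(
                "https://ent.163.com/example/article.html",
                "https://ent.163.com/photoview/example.html",
                "https://v.163.com/example.html");
        System.out.println(keepArticles(urls));
    }
}
```

Keeping the markers in one list means a newly discovered non-article URL pattern only needs one extra entry.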

Create the News class (POJO):

// News object
public class News {
    private String id;
    private String title;
    private String time;
    private String source;
    private String content;
    private String editor;
    private String docurl;
    // getters and setters omitted
}

Parsing the content of the news detail page:

    // Parses the news detail page at the given URL
    // (requires imports: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements)
    private static News parseNewsItem(String docUrl) throws Exception {
        // 3.3.1 Send the request and fetch the detail-page HTML
        String html = HttpClientUtils.doGet(docUrl);

        // 3.3.2 Parse the detail page
        Document document = Jsoup.parse(html);

        // 3.3.2.1 Parse the title
        News news = new News();
        Elements h1El = document.select("#epContentLeft h1");
        String title = h1El.text();
        news.setTitle(title);

        // 3.3.2.2 Parse the time
        Elements timeAndSourceEl = document.select(".post_time_source");

        String timeAndSource = timeAndSourceEl.text();

        // The separator starts with a full-width space (U+3000): copy it exactly, or the split will fail
        String[] split = timeAndSource.split("　来源: ");
        news.setTime(split[0]);
        // 3.3.2.3 Parse the source
        news.setSource(split[1]);
        // 3.3.2.4 Parse the body text
        Elements ps = document.select("#endText p");
        String content = ps.text();
        news.setContent(content);
        // 3.3.2.5 Parse the editor
        Elements spanEl = document.select(".ep-editor");
        // Sample text: 责任编辑：陈少杰_b6952
        String edit
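
The time/source split in the method above depends on one exact separator, and `split[1]` throws an ArrayIndexOutOfBoundsException whenever a page lacks it. A more tolerant variant might look like the sketch below. It is an alternative under assumptions, not the author's approach: `TimeSourceSplitter` is a hypothetical name, the sample input is invented, and the regular expression assumes the label is the word for "source" followed by an ASCII or full-width colon, with optional ordinary or full-width spaces around it. Non-ASCII characters are written as escapes so the file compiles under any source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper: splits the text of ".post_time_source" into time and source.
class TimeSourceSplitter {

    // Matches: optional whitespace (including U+3000), the label U+6765 U+6E90 ("source"),
    // a colon (ASCII ':' or full-width U+FF1A), then optional whitespace again.
    private static final Pattern SEPARATOR = Pattern.compile(
            "[\\s\\u3000]*\\u6765\\u6E90[\\s\\u3000]*[:\\uFF1A][\\s\\u3000]*");

    // Returns {time, source}; source is "" when the label is missing.
    static String[] split(String text) {
        String[] parts = SEPARATOR.split(text.trim(), 2);
        String time = parts[0].trim();
        String source = parts.length > 1 ? parts[1].trim() : "";
        return new String[] {time, source};
    }

    public static void main(String[] args) {
        // Invented sample: time, full-width space, label, colon, space, source name
        String sample = "2019-05-12 08:30:00\u3000\u6765\u6E90: NetEase Entertainment";
        String[] result = split(sample);
        System.out.println("time   = " + result[0]);
        System.out.println("source = " + result[1]);
    }
}
```

With this variant a page without a source label yields an empty source instead of aborting the whole crawl.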
