Project Training: Crawling Related Web Pages for Extended Reading (2)

Article link: https://blog.csdn.net/weixin_46346504/article/details/125241241

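Last time I outlined the basic approach. Now that it is implemented, this post shows the code in detail.
To recap, the approach was:
1. Define a few commonly used sites as seed pages, such as Baidu and the CSDN home page, and store them in oldmap.
2. Iterate over oldmap, fetching and parsing the source of each page one at a time:
(a) Extract all valid links from the page source, compare them against both maps to remove duplicates, and add the links that are new to newmap.
(b) Get the title of the current page and check by string matching whether it contains any of the user-defined keywords. If it does, add the page to the result list result. Once the output condition is met (20 entries collected), everything is returned together.
3. When every link in oldmap has been parsed, add the links in newmap to oldmap and clear newmap.
4. Repeat the steps above, parsing the URLs in oldmap that have not been visited yet.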
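
The steps above can be tried out without touching the network. The following is only a minimal sketch: the in-memory `graph` stands in for "download a page and extract its links", and the names `FrontierSketch`, `crawlOrder` and `maxPages` are made up for illustration and are not part of the project code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FrontierSketch {
    // graph maps a page to the links found on it (a stand-in for real fetching and parsing).
    static List<String> crawlOrder(Map<String, List<String>> graph, List<String> seeds, int maxPages) {
        Map<String, Boolean> oldMap = new LinkedHashMap<>(); // URL -> already visited?
        Map<String, Boolean> newMap = new LinkedHashMap<>(); // links discovered in this round
        List<String> visited = new ArrayList<>();
        for (String seed : seeds) {
            oldMap.put(seed, false);
        }
        while (visited.size() < maxPages) {
            for (Map.Entry<String, Boolean> entry : oldMap.entrySet()) {
                if (entry.getValue() || visited.size() >= maxPages) {
                    continue; // already visited, or the page budget is used up
                }
                String url = entry.getKey();
                visited.add(url);
                for (String link : graph.getOrDefault(url, List.of())) {
                    // Deduplicate against both maps before queueing the link.
                    if (!oldMap.containsKey(link) && !newMap.containsKey(link)) {
                        newMap.put(link, false);
                    }
                }
                entry.setValue(true); // updating a value is safe while iterating
            }
            if (newMap.isEmpty()) {
                break; // no new links were found, so there is nothing left to visit
            }
            oldMap.putAll(newMap);
            newMap.clear();
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "d"),
                "c", List.of("a"));
        System.out.println(crawlOrder(graph, List.of("a"), 10)); // prints [a, b, c, d]
    }
}
```

Collecting new links in a separate map, rather than adding them to oldmap directly, avoids modifying the map's key set while it is being iterated.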

The implementation is shown below, with the explanation given as comments:

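import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The enclosing class is not shown in the original post; "RelatedPageCrawler" is a placeholder name.
public class RelatedPageCrawler {
    // Each match adds two entries (title, then URL), so 20 entries correspond to 10 pages.
    private static final int MAX_RESULT_ENTRIES = 20;

    public List<String> getHtmlString(String[] arg) throws IOException {
        // The Boolean value starts as false and is set to true once the URL has been processed,
        // which is how processed URLs are told apart from unprocessed ones.
        Map<String, Boolean> newmap = new HashMap<String, Boolean>();
        Map<String, Boolean> oldmap = new HashMap<String, Boolean>();
        ArrayList<String> result = new ArrayList<String>();
        // A few seed sites
        oldmap.put("https://www.baidu.com", false);
        oldmap.put("https://www.csdn.net", false);
        oldmap.put("https://wenku.baidu.com/", false);
        oldmap.put("https://baike.baidu.com/", false);
        // Stop crawling and return once enough entries have been collected
        while (result.size() < MAX_RESULT_ENTRIES) {
            // Iterate over oldmap through its entry set
            for (Map.Entry<String, Boolean> entry : oldmap.entrySet()) {
                if (result.size() >= MAX_RESULT_ENTRIES) {
                    break;
                }
                if (entry.getValue()) {
                    continue; // already processed
                }
                String url = entry.getKey();
                // Mark the URL as processed up front so that a failing page is not retried forever
                entry.setValue(true);
                try {
                    // Fetch the page and parse it into a DOM tree with jsoup
                    Document doc = Jsoup.connect(url).get();
                    // Get the page title
                    String title = doc.title();
                    System.out.println(title);
                    // Check the title against each keyword in turn
                    for (String keyword : arg) {
                        if (title.contains(keyword)) {
                            result.add(title);
                            result.add(url);
                            break; // one matching keyword is enough; avoids duplicate entries
                        }
                    }
                    // Collect every link on the current page
                    Elements links = doc.getElementsByTag("a");
                    for (Element link : links) {
                        // absUrl resolves relative hrefs against the page URL and returns "" on failure
                        String linkHref = link.absUrl("href");
                        // Deduplicate by key lookup: only links in neither map go into newmap
                        if (linkHref.startsWith("http")
                                && !oldmap.containsKey(linkHref) && !newmap.containsKey(linkHref)) {
                            newmap.put(linkHref, false);
                        }
                    }
                } catch (IllegalArgumentException | UnknownHostException e) {
                    System.out.println("Invalid URL: " + url);
                } catch (SocketTimeoutException e) {
                    System.out.println("Read timed out: " + url);
                } catch (IOException e) {
                    // Covers HttpStatusException, UnsupportedMimeTypeException and SSLException,
                    // which are all subclasses of IOException
                    System.out.println("Failed to fetch " + url + ": " + e.getMessage());
                }
            }
            // An empty newmap means no new links were found, so there is nothing left to crawl
            if (newmap.isEmpty()) {
                break;
            }
            System.out.println("newmap collected " + newmap.size() + " links in total");
            // Move the new links into oldmap for the next round, then clear newmap
            oldmap.putAll(newmap);
            newmap.clear();
        }
        return result;
    }
}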
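
Deduplicating links by string comparison only works well when every link is in the same absolute form, because an `href` may be a relative path, a bare `#fragment`, or a `javascript:` pseudo-link. jsoup can resolve links itself through `link.absUrl("href")`. The sketch below illustrates the same idea with only the standard library; the class `LinkNormalizer` and its `normalize` method are hypothetical helpers, not part of the project code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

class LinkNormalizer {
    // Returns an absolute http(s) URL without its fragment, or empty if the href is unusable.
    static Optional<String> normalize(String baseUrl, String href) {
        if (href == null || href.isBlank()) {
            return Optional.empty();
        }
        try {
            URI base = new URI(baseUrl);
            // Give a bare host such as https://www.baidu.com a root path before resolving.
            if (base.getPath() == null || base.getPath().isEmpty()) {
                base = base.resolve("/");
            }
            URI resolved = base.resolve(href.trim()).normalize();
            String scheme = resolved.getScheme();
            if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
                return Optional.empty(); // e.g. javascript: or mailto: links
            }
            // Drop the fragment so that page#a and page#b count as the same page.
            String text = resolved.toString();
            int hash = text.indexOf('#');
            return Optional.of(hash >= 0 ? text.substring(0, hash) : text);
        } catch (URISyntaxException | IllegalArgumentException e) {
            return Optional.empty(); // malformed base URL or href
        }
    }

    public static void main(String[] args) {
        // prints Optional[https://www.csdn.net/blog/1]
        System.out.println(normalize("https://www.csdn.net/nav/java", "/blog/1#top"));
        // prints Optional.empty
        System.out.println(normalize("https://www.csdn.net/nav/java", "javascript:void(0)"));
    }
}
```

Normalizing before the `containsKey` checks keeps the same page from entering the maps under several different spellings.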

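Exception handling is very important in a crawler. While following links, it is common to hit URLs that are invalid or that restrict access. Requesting such an address throws an exception, and if nothing catches it the program is forced to stop, so always remember to catch these exceptions.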
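
As a rough illustration of this point, the sketch below keeps a crawl loop running when individual pages fail. It assumes a hypothetical `PageFetcher` interface in place of the real download call, and the names `SafeFetch` and `fetchAll` are likewise invented for the example. More specific exceptions are caught first, with `IOException` as the general fallback.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

class SafeFetch {
    // Hypothetical abstraction over the real download call (for example a jsoup request).
    interface PageFetcher {
        String fetchTitle(String url) throws IOException;
    }

    // Fetches every URL, recording failures in errors instead of aborting the loop.
    static List<String> fetchAll(List<String> urls, PageFetcher fetcher, List<String> errors) {
        List<String> titles = new ArrayList<>();
        for (String url : urls) {
            try {
                titles.add(fetcher.fetchTitle(url));
            } catch (IllegalArgumentException e) {
                errors.add("invalid URL: " + url);
            } catch (UnknownHostException e) {
                errors.add("unknown host: " + url);
            } catch (SocketTimeoutException e) {
                errors.add("timed out: " + url);
            } catch (IOException e) {
                // General fallback for any other I/O failure
                errors.add("I/O failure: " + url);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        // A fake fetcher that fails for one host, used only to demonstrate the flow.
        PageFetcher fake = url -> {
            if (url.contains("missing")) {
                throw new UnknownHostException(url);
            }
            return "Title of " + url;
        };
        List<String> titles = fetchAll(
                List.of("https://ok.example", "https://missing.example"), fake, errors);
        System.out.println(titles);
        System.out.println(errors);
    }
}
```

Recording a message for every failure, rather than leaving catch blocks empty, makes it much easier to see afterwards why a page was skipped.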