Java Crawler (3): Getting All the Addresses in a Web Page

零零叁2019

Category: Java crawler (java爬虫)

Article link: https://blog.csdn.net/lhb2019/article/details/79948172

I did not bother with exception handling here. The code is simple, so just read the comments.

package test;

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

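public class GetUrl {
    // Matches a full URL that starts with a scheme, e.g. http://www.xuecheyi.com/Info/List-83.html
    private static final Pattern ABSOLUTE_URL = Pattern.compile("[a-zA-Z]+://[^\\s]*");

    public static List<String> getUrl(String uri) throws Exception {
        List<String> list = new ArrayList<>(); // holds the addresses that are found
        URL url = new URL(uri);
        String protocol = url.getProtocol(); // the protocol, e.g. http
        String host = url.getHost(); // the domain name
        Document doc = Jsoup.connect(uri).get(); // fetch the page and parse the HTML into a DOM
        Elements ele = doc.getElementsByTag("a"); // all <a> tags in the page
        for (Element a : ele) { // walk through them
            String href = a.attr("href");
            /**
             * An href can hold four kinds of values, which have to be told apart:
             * 1. A path only: /citylist.html
             * 2. JavaScript code: javascript:void(0)
             * 3. A full URL: http://www.xuecheyi.com/Info/List-83.html
             * 4. A path without an extension: /Info
             */
            Matcher m = ABSOLUTE_URL.matcher(href);
            if (m.lookingAt()) { // kind 3: the href itself starts with a full URL, e.g. http://jx.xuecheyi.com/member/login/index
                list.add(href);
            } else if (href.startsWith("/")) { // kinds 1 and 4
                /**
                 * /login/ind
                 * 0123456789
                 * The "/" sits at index 0, so the protocol and domain name
                 * must be added in front of the matched path.
                 */
                list.add(protocol + "://" + host + href);
            }
        }
        return list;
    }
}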
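
The two branches above only cover full URLs and paths that start with "/". As a rough sketch of how the remaining cases might be handled, the standard library's java.net.URI can resolve an href against the page address. That also covers relative paths such as List-83.html, protocol-relative links that start with "//", and pages served on a non-default port. LinkResolver and its resolve method are names made up for this illustration and are not part of the original code. Treating only http and https results as links is an assumption as well.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the original article.
class LinkResolver {
    /**
     * Resolves href against the address of the page it was found on.
     * Returns an empty Optional when href is blank, cannot be parsed,
     * or does not resolve to an http(s) address (e.g. javascript:void(0)).
     */
    static Optional<String> resolve(String pageUri, String href) {
        if (href == null || href.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            // URI.resolve keeps absolute hrefs as they are and
            // completes relative ones from the page address.
            URI resolved = new URI(pageUri).resolve(new URI(href.trim()));
            String scheme = resolved.getScheme();
            if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                return Optional.of(resolved.toString());
            }
            return Optional.empty();
        } catch (URISyntaxException e) {
            // An href that is not a valid URI (for example one containing spaces) is skipped.
            return Optional.empty();
        }
    }
}
```

Inside the loop, LinkResolver.resolve(uri, href) could then stand in for both branches, adding the value to the list only when the Optional is present. jsoup itself can also resolve links against the page address through Element.absUrl("href"), which may be the simpler choice when the document was loaded with Jsoup.connect.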

Reading the result from a main method

package test;

import java.util.List;

public class TestPc {
    public static void main(String[] args) throws Exception {
        String uri = "http://www.xuecheyi.com/Info/show-27818.html";
        //Util.DownLoadPage(uri);
        List<String> list = GetUrl.getUrl(uri);
        // print every address that was collected from the page
        for (String link : list) {
            System.out.println(link);
        }
    }
}
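
A page often links to the same address more than once (navigation bars, footers), so the printed list can contain repeats. Below is a small sketch of one way to drop them while keeping the order in which they appeared. LinkFilter and unique are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper for illustration only; not part of the original article.
class LinkFilter {
    /** Returns the links without duplicates, in the order they first appeared. */
    static List<String> unique(List<String> links) {
        // LinkedHashSet discards repeated entries but remembers insertion order.
        return new ArrayList<>(new LinkedHashSet<>(links));
    }
}
```

In the main method above, the loop could then iterate over LinkFilter.unique(list) instead of list.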

Result:
[screenshot of the program output]
