Extracting all the links in a web page

caoxu1987728

Category: Search Engine. Tags: newline filter url string null file

Article link: https://blog.csdn.net/caoxu1987728/article/details/3062494

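This may not look very useful, but it comes in handy for small projects. For example, the extracted links can be handed to Heritrix so that they are all crawled in one pass.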
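Raw href values are often relative (for example news/index.html), and on their own they would not mean much as crawl seeds. The following is a minimal sketch of one way to turn extracted links into absolute, de-duplicated URLs before handing them to a crawler. The class name SeedSketch, the method toSeeds and the base URL are hypothetical and not part of the original code.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the original Extractor classes.
class SeedSketch {
    // Resolves each raw link against baseUrl and keeps only distinct
    // http/https results, in the order they were first seen.
    static Set<String> toSeeds(String baseUrl, List<String> rawLinks) {
        URI base = URI.create(baseUrl);
        Set<String> seeds = new LinkedHashSet<>();
        for (String raw : rawLinks) {
            String link = raw.trim();
            if (link.isEmpty() || link.startsWith("#")) {
                continue; // nothing to crawl for empty or fragment-only links
            }
            try {
                URI resolved = base.resolve(link);
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    seeds.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Malformed link (for example one containing spaces): skip it.
            }
        }
        return seeds;
    }

    public static void main(String[] args) {
        List<String> raw = List.of("news/index.html", "http://example.com/lib",
                                   "news/index.html", "mailto:someone@example.com");
        for (String seed : toSeeds("http://www.example.edu/index.html", raw)) {
            System.out.println(seed);
        }
    }
}
```

The scheme check also drops mailto: and javascript: links, which a page-level link dump typically contains.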

I am only jotting this down briefly here; there is nothing of real substance.

The code is as follows:

 
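 import java.io.BufferedWriter;
 import java.io.File;
 import java.io.FileWriter;
 import java.io.IOException;
 import org.htmlparser.NodeFilter;
 import org.htmlparser.filters.AndFilter;
 import org.htmlparser.filters.HasAttributeFilter;
 import org.htmlparser.filters.NotFilter;
 import org.htmlparser.filters.TagNameFilter;
 import org.htmlparser.tags.LinkTag;
 import org.htmlparser.tags.OptionTag;
 import org.htmlparser.util.NodeList;

 // Method of ExtractHangdian. getOutputPath(), getParser() and NEWLINE are
 // presumably inherited from the Extractor base class, which is not shown here.
 public void extract()
    {
        BufferedWriter bw=null;
        String title="hangdian";

        // <a> tags whose href is not exactly "../"
        NodeFilter  body_filter=new AndFilter(new TagNameFilter("A"),new NotFilter(new HasAttributeFilter("href","../")));
        // <option> tags whose value is not exactly "../"
        NodeFilter  option_filter=new AndFilter(new TagNameFilter("option"),new NotFilter(new HasAttributeFilter("value","../")));
        // Unused URL regex candidates (the second one is incomplete):
        //[a-zA-Z]+://[^\s]*
        //^((https?|ftp|news):)?([a-z]([a-z0-9/-]*[/.。])+([a-z]{2}|

        try
        {
            // The output path is concatenated directly with the file name
            bw= new BufferedWriter(new FileWriter(new File(this
                    .getOutputPath()
                    + title + ".txt")));
            NodeList link_nodes = this.getParser().parse(body_filter);
            for(int i=0;i<link_nodes.size();i++)
            {
                LinkTag node=(LinkTag)link_nodes.elementAt(i);

                String url=node.getAttribute("href");
                if(url!=null)   // an <a> without href would otherwise be written as "null"
                    bw.write(url+NEWLINE);
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }
        // Rewind the parser so the same page can be parsed again with the second filter
        this.getParser().reset();
        try
        {
            NodeList option_nodes = this.getParser().parse(option_filter);
            for(int i=0;i<option_nodes.size();i++)
            {
                OptionTag node=(OptionTag)option_nodes.elementAt(i);

                String url=node.getAttribute("value");
                if(url!=null && bw!=null)   // bw is null if opening the file failed above
                    bw.write(url+NEWLINE);
            }
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

        try
        {
            if(bw!=null)
                bw.close();
        }catch(IOException e)
        {
            e.printStackTrace();
        }
    }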
 

This code analyzes my university's home page.
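For readers without the HTMLParser library at hand, the same idea (collect href values from a tags and value attributes from option tags, skipping "../") can be approximated with the standard library alone. The sketch below deliberately swaps the filters for a regular expression. It is not the approach used above, it only handles quoted attribute values, and regex matching of HTML is inherently rough. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the HTMLParser-based extract() method.
class LinkRegexSketch {
    // Matches href="..." inside <a ...> or value="..." inside <option ...>.
    // Group 1 is the opening quote, group 2 the attribute value.
    private static final Pattern LINK = Pattern.compile(
        "<(?:a\\b[^>]*?\\shref|option\\b[^>]*?\\svalue)\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String url = m.group(2).trim();
            // Mirror the original filters: drop the "../" parent link.
            if (!url.isEmpty() && !url.equals("../")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"news/index.html\">News</a>"
                    + "<a href=\"../\">Up</a>"
                    + "<select><option value='http://example.com/lib'>Library</option></select>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

A real parser such as HTMLParser remains the safer choice for messy pages, for instance ones with unquoted attributes or markup inside comments.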

Add a main() function:

 
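     public static void main(String[]args) throws Exception
    {
        Extractor extractor=new ExtractHangdian();
        // Must end with a path separator: extract() appends the file name directly
        extractor.setOutputPath("where to save the output file");
        // Local copy of the university home page
        File file=new File("where my university's home page is stored");
        traverse(extractor,file);
    }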
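
The traverse method called in main() is not shown in this post. As a rough guess at what such a helper might do, the sketch below walks a file or directory recursively and hands every .html or .htm file to a callback. The class name, the Consumer-based signature and the extension check are all assumptions made for illustration. The real method takes an Extractor and may behave differently.

```java
import java.io.File;
import java.util.Locale;
import java.util.function.Consumer;

// Hypothetical illustration of a recursive traversal helper.
class TraverseSketch {
    // Calls handler once for every HTML file found at or below path.
    static void traverse(File path, Consumer<File> handler) {
        if (path.isDirectory()) {
            File[] children = path.listFiles();
            if (children == null) {
                return; // directory could not be read
            }
            for (File child : children) {
                traverse(child, handler);
            }
        } else if (path.isFile()) {
            String name = path.getName().toLowerCase(Locale.ROOT);
            if (name.endsWith(".html") || name.endsWith(".htm")) {
                handler.accept(path);
            }
        }
    }

    public static void main(String[] args) {
        // Prints every HTML file under the directory given as the first argument.
        File root = new File(args.length > 0 ? args[0] : ".");
        traverse(root, f -> System.out.println(f.getPath()));
    }
}
```

In the original setup the callback would presumably load each file into the extractor and call extract() on it.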