Extracting a Web Page's Main Content with Java, Part 3: HTML to Tree (repost)

liuxinglanyue

Category: Focused Crawler. Tags: HTML, Java, F#

Article link: https://blog.csdn.net/liuxinglanyue/article/details/83774289

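1. Required files

param.txt: holds the path of the web page to extract information from
TestPage: holds the web pages to extract information from
out.txt: receives the extracted page content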

2. Test program

package test;

import java.io.*;
import Source.*;

// Test driver: extracts the main content of a page
public class ETest
{
    public static void main(String[] args)
    {
        // Output file
        String out = "out.txt";
        File outfile = new File(out);
        // Build the HTML tree
        HTML2Tree h2t = new HTML2Tree();
        String file = getFilename();
        h2t.main(file);
        HTree tree = h2t.getTree();
        // Allowed standard deviation (threshold)
        double th = 0.79;
        // Select the main content block
        ChooseBlock cb = new ChooseBlock(th);
        // Get the main content
        String str = cb.getContent(tree);
        if (str == null)
        {
            System.out.println("File is empty");
            System.exit(1);
        }
        // try-with-resources closes the writer even if writing fails
        try (PrintWriter p = new PrintWriter(new BufferedWriter(new FileWriter(outfile))))
        {
            p.println(str);
        }
        catch (IOException e)
        {
            System.out.println(e);
            System.exit(1);
        }
    }

    // Returns the name of the page file to process:
    // the first non-blank line of param.txt, or "" if there is none
    private static String getFilename()
    {
        String file = "";
        File f = new File("param.txt");
        // try-with-resources closes the reader, which the original never did
        try (BufferedReader fis = new BufferedReader(new FileReader(f)))
        {
            String s;
            while ((s = fis.readLine()) != null)
            {
                // Skip blank lines; trim stray whitespace around the path
                if (!s.trim().isEmpty())
                {
                    file = s.trim();
                    break;
                }
            }
        }
        catch (IOException e)
        {
            System.out.println(e);
            System.exit(1);
        }
        return file;
    }
}
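
The HTML2Tree and HTree classes from the Source package are not listed in this post, so their internals are unknown here. As a rough illustration of what an "HTML to tree" step can look like, the following is a minimal sketch that builds a tree with a stack of open elements. The names SimpleHtmlTree, Node and parse are hypothetical and are not the author's API. The regex tokenizer is also a simplifying assumption: it does not handle script or style bodies, or a ">" inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the HTML2Tree/HTree classes used by ETest.
final class SimpleHtmlTree {

    // A tree node: an element (tag != null) or a text leaf (tag == null).
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }

        // Concatenated text of this node and all of its descendants.
        String textContent() {
            if (tag == null) return text;
            StringBuilder sb = new StringBuilder();
            for (Node c : children) sb.append(c.textContent());
            return sb.toString();
        }
    }

    // Elements that never take a closing tag (a partial list, for the sketch).
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "link", "meta");

    // Comment or declaration (no groups), a tag (group 1 = "/" when closing,
    // group 2 = tag name), or a run of text (group 3).
    private static final Pattern TOKEN = Pattern.compile(
            "<!--.*?-->|<![^>]*>|<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)",
            Pattern.DOTALL);

    static Node parse(String html) {
        Node root = new Node("#root", null);
        Deque<Node> open = new ArrayDeque<>();   // stack of currently open elements
        open.push(root);
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {            // text run
                String text = m.group(3).trim();
                if (!text.isEmpty()) open.peek().children.add(new Node(null, text));
            } else if (m.group(2) != null) {     // opening or closing tag
                String name = m.group(2).toLowerCase();
                if (m.group(1).isEmpty()) {
                    Node el = new Node(name, null);
                    open.peek().children.add(el);
                    if (!VOID_TAGS.contains(name)) open.push(el);
                } else {
                    close(open, name);
                }
            }                                    // comments and declarations are skipped
        }
        return root;
    }

    // Pops up to and including the nearest open element named `name`;
    // a closing tag with no matching open element is ignored.
    private static void close(Deque<Node> open, String name) {
        boolean found = false;
        for (Node n : open) {
            if (name.equals(n.tag)) { found = true; break; }
        }
        if (!found) return;
        Node popped;
        do {
            popped = open.pop();
        } while (!name.equals(popped.tag));
    }

    public static void main(String[] args) {
        Node root = parse("<html><body><div><p>Hello</p><br><p>World</p></div></body></html>");
        System.out.println(root.textContent());
    }
}
```

A block-selection step such as ChooseBlock could then walk a tree like this and compare subtrees, for example by the amount of text each one holds. How the original ChooseBlock applies its 0.79 threshold is not shown in this post.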
