Downloading a Web Page and Reading Its Content in Java

Latest recommended article published on 2024-08-18 03:08:17

马如林

Categories: JavaEE and related; Information Retrieval and Filtering; Web; Search Engines
Tags: java string thread download windows exception

Article link: https://blog.csdn.net/longronglin/article/details/2527921

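Summary (automatically generated by CSDN): This post describes how to download a web page with Java and shows how to read the downloaded HTML content, including handling robots.txt files and the rules that apply to different search-engine crawlers. It also touches on thread safety, exception handling, and use on Windows.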
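The `WebSpider` and `com.util.file.Files` classes used in this post are the author's own and are not shown here. As a rough illustration of the same two steps (download to a file, then read the content back), here is a minimal sketch that uses only JDK classes. The class name `PageFetchSketch`, its method names, the 10-second timeouts, and the choice of UTF-8 are all assumptions made for this example and are not the author's implementation. Note that `Files` below is `java.nio.file.Files`, not the author's helper.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not the author's WebSpider.
class PageFetchSketch {

    /** Downloads the resource at webAddress and stores it in destFile. */
    static void download(String webAddress, Path destFile) throws IOException {
        URLConnection conn = URI.create(webAddress).toURL().openConnection();
        conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, destFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Reads a stream line by line, ending every line with '\n'. */
    static String readContent(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /** Reads a previously downloaded file, assuming it is UTF-8 encoded. */
    static String readFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return readContent(in, StandardCharsets.UTF_8);
        }
    }
}
```

A call such as `PageFetchSketch.download("http://www.w3c.org/robots.txt", dest)` followed by `PageFetchSketch.readFile(dest)` would mirror the flow of the post. A real crawler would more likely take the charset from the response's `Content-Type` header than hard-code UTF-8.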

Once a page has been downloaded, the next step is to read its content:

package com.core.crawl;

import java.io.IOException;

import com.util.file.Files;

public class Crawl {

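    /**
     * Entry point: configures a WebSpider with a page address and a local destination file.
     *
     * @param args command-line arguments (unused)
     * @throws IOException
     * @throws InterruptedException
     */
    public static void main(String[] args) throws IOException, InterruptedException {

        // Record the start time, presumably to measure how long the crawl takes.
        long begin = System.currentTimeMillis();
        //WebSpider spider2 = new WebSpider();
        WebSpider spider1 = new WebSpider();
        // Address of the page to download.
        spider1.setWebAddress("http://www.w3c.org/robots.txt");
        // Local file that receives the downloaded content.
        spider1.setDestFile(Files.getSysPath() + "/" + "robots.");

        //spider2.setWebAddress("http://blog.csdn.net/longronglin");
        //spider2.setDestFile(Files.getSysPath() + "/"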