Writing a Small Web Page Fetching Tool in Java

codingxm

Article link: https://blog.csdn.net/MINWH/article/details/5570374

Strictly speaking this is not a crawler: it only reads URLs from an XML file and then downloads the content of each page.

A few points worth noting, for future reference:

1. Global proxy. Once it is set, every URL connection goes through this proxy, so FileUtils.copyURLToFile can be called directly:

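import java.net.Authenticator;
import java.net.PasswordAuthentication;

// Route all java.net HTTP connections through one authenticated proxy
private void initProxy(String host, int port,
		final String username, final String password) {
	// Supplies the credentials whenever the proxy asks for authentication
	Authenticator.setDefault(new Authenticator() {
		@Override
		protected PasswordAuthentication getPasswordAuthentication() {
			return new PasswordAuthentication(username,
						password.toCharArray());
		}
	});
	// http.proxyHost and http.proxyPort are the documented JDK properties
	System.setProperty("http.proxyHost", host);
	System.setProperty("http.proxyPort",
			Integer.toString(port));
	// Kept from the original; these two are not among the documented
	// JDK networking properties
	System.setProperty("http.proxyType", "4");
	System.setProperty("http.proxySet", "true");
}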
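If a process-wide proxy is too broad, the standard library also accepts a proxy for a single connection. The sketch below shows that alternative and is not the approach used in this tool. The class name PerConnectionProxySketch, the helper names, and the host proxy.example.com are made up for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

final class PerConnectionProxySketch {

    // Describes an HTTP proxy at the given host and port.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Prepares a connection that will be routed through the given proxy.
    // No network traffic happens until the connection is actually used.
    static URLConnection open(String url, Proxy proxy) throws IOException {
        return new URL(url).openConnection(proxy);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical proxy address; replace with a real one.
        Proxy proxy = httpProxy("proxy.example.com", 8080);
        URLConnection connection = open("http://example.com/", proxy);
        System.out.println("Prepared " + connection.getURL() + " via " + proxy.address());
    }
}
```

Only connections opened this way use the proxy, so FileUtils.copyURLToFile would not pick it up; that is the trade-off against the global setting.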

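2. Setting the User-Agent. Some sites reject Java as a client, so the User-Agent has to be set through URLConnection to imitate a browser. That rules out FileUtils.copyURLToFile, but copying its source and making a small change is enough: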

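import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.net.URLConnection;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

// Open the connection and present a browser-like User-Agent
URLConnection httpConnection = new URL(url).openConnection();
httpConnection.setRequestProperty("User-Agent",
	"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
File dest = new File(path);
// try-with-resources (Java 7+) closes both streams even if the copy fails
try (InputStream input = httpConnection.getInputStream();
		OutputStream output = FileUtils.openOutputStream(dest)) {
	IOUtils.copy(input, output);
}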

3. Reading a very large XML file with dom4j. Use the event-based mode to avoid running out of heap space:

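import org.dom4j.Element;
import org.dom4j.ElementHandler;
import org.dom4j.ElementPath;

// reader is an org.dom4j.io.SAXReader; register the handler before calling reader.read(...)
reader.addHandler("/RDF/ExternalPage", new ElementHandler() {
	public void onStart(ElementPath path) {
		// Nothing to do when the element opens
	}

	public void onEnd(ElementPath path) {
		// The element has been fully parsed at this point
		Element node = path.getCurrent();
		// Process the node here
		// Key step: detach the node from the in-memory tree to free memory
		node.detach();
	}
});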

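4. Multithreading. The thread pool java.util.concurrent.ThreadPoolExecutor, added in JDK 1.5, is convenient and practical. The one thing to remember is to call ThreadPoolExecutor.shutdown() after all tasks have been submitted; otherwise the pool's idle worker threads keep the JVM from exiting.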
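A minimal sketch of that pattern, assuming a fixed pool size and using a placeholder in place of the real download. The class name FetchPoolSketch and the methods fetch and fetchAll are hypothetical, not part of the original tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class FetchPoolSketch {

    // Hypothetical stand-in for the real download; it only returns a label.
    static String fetch(String url) {
        return "fetched " + url;
    }

    // Runs fetch() for every URL on a fixed-size pool and returns the
    // results in the same order as the input list.
    static List<String> fetchAll(List<String> urls, int threads) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        List<Future<String>> futures = new ArrayList<>();
        try {
            for (final String url : urls) {
                Callable<String> task = () -> fetch(url);
                futures.add(pool.submit(task));
            }
        } finally {
            // Stop accepting new tasks; tasks already queued still run.
            pool.shutdown();
        }
        List<String> results = new ArrayList<>();
        try {
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            // Wait for the worker threads to wind down.
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        for (String line : fetchAll(urls, 2)) {
            System.out.println(line);
        }
    }
}
```

Calling shutdown() in a finally block means the pool is released even if submitting a task throws. awaitTermination is optional here because every Future has already been read, but it makes the end of the run explicit.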
