Batch-Downloading PDFs with Java

Latest recommended article published on 2023-05-12 11:52:43

loveyu0428

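The Internet is built on countless standards and specifications. The RFCs published by the IETF (Internet Engineering Task Force) alone number in the thousands (http://ietfreport.isoc.org/rfc/PDF/) and cover HTTP, URIs, and much more. While studying recently I kept running into references to various RFCs in textbooks, and downloading them one at a time was tedious. Since I was learning the relevant material anyway, I wrote a program that downloads the six-thousand-plus PDF documents from the site. I am sharing it here.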

Overall approach:

First, parse the name of each RFC file out of the RFC listing page (http://ietfreport.isoc.org/rfc/PDF/) with a regular expression and store the names in a String array.
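The parsing step can be sketched on its own as follows. This is a minimal illustration, not the program's actual code: the class name `RfcNameExtractor`, the method `extractPdfNames`, and the sample HTML lines are all hypothetical, and the pattern assumes each listing entry shows the file name as link text right after a `>`. Collecting into a `List` also avoids having to know the number of entries in advance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RfcNameExtractor {

    // Assumed listing format (hypothetical): the file name appears as link text,
    // e.g. <li><a href="rfc1341.pdf"> rfc1341.pdf</a></li>
    private static final Pattern PDF_NAME = Pattern.compile(">\\s*([^<>\\s]+\\.pdf)");

    /** Returns the PDF file names found in the given lines, in page order. */
    static List<String> extractPdfNames(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = PDF_NAME.matcher(line);
            if (matcher.find()) {
                names.add(matcher.group(1)); // group 1 is the bare file name
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "<h1>Index of /rfc/PDF</h1>",
                "<li><a href=\"rfc1341.pdf\"> rfc1341.pdf</a></li>",
                "<li><a href=\"rfc2616.pdf\"> rfc2616.pdf</a></li>");
        System.out.println(extractPdfNames(sample)); // prints [rfc1341.pdf, rfc2616.pdf]
    }
}
```

Lines that contain no `.pdf` link text, such as the page header, are simply skipped, so no line offsets need to be counted by hand.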

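Then iterate over the array and, for each name, use the FileUtils.copyURLToFile() method from Apache Commons IO to download the PDF and save it in the RFC directory. (For the Commons IO project, see its official site; the address is at the end of this article.)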
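For readers who want to see roughly what the copy step involves, here is a JDK-only sketch. It is not the Commons IO implementation; the class `UrlToFile` and its `copy` method are hypothetical names, and the single `timeoutMillis` parameter is an assumption added for illustration. Commons IO itself also offers a `copyURLToFile(URL, File, int, int)` overload that takes connection and read timeouts, which may be worth using for a long batch run.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class UrlToFile {

    /**
     * Copies the bytes behind source into destination, creating parent
     * directories as needed and overwriting an existing file.
     */
    static void copy(URL source, Path destination, int timeoutMillis) throws IOException {
        Path parent = destination.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        URLConnection connection = source.openConnection();
        connection.setConnectTimeout(timeoutMillis); // give up if the server cannot be reached
        connection.setReadTimeout(timeoutMillis);    // give up if the transfer stalls
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A call might look like `UrlToFile.copy(new URL(source), Paths.get("RFC", name), 10_000)`, where `source` and `name` stand for the values built in the main loop.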

The code can probably be improved in many places; corrections are welcome.

Source code

The complete program is as follows:

import java.io.File;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;

/**
 * Downloads RFCs
 * 
 * @author bingduanLin
 * 
 */
public class DownloadPdfs {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		String url = "http://ietfreport.isoc.org/rfc/PDF";
		String[] list = readList(url);
		System.out.println("Starting download, please be patient");
		for (String u : list) {
			String source = url + "/" + u;
			String des = "RFC" + File.separator + u;
			downloadAndSave(source, des);
		}
		System.out.println("Finished; check the output above for any failed downloads.");
		

	}

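	/**
	 * Reads the list of file names from the web page.
	 * 
	 * @param urlString URL of the RFC listing page
	 * @return the file names parsed from the page
	 */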
	public static String[] readList(String urlString) {

		String[] lists = new String[6734];
		try {
			URL url = new URL(urlString);
			Scanner scanner = new Scanner(url.openStream());
			int i = 0;
			int up = 6960; // hard-coded: 226 leading lines to skip + 6734 entries
			while (scanner.hasNextLine() && i < up) {
				String line = scanner.nextLine();
				if (i >= 226) {
					lists[i - 226] = dealString(line);
				}
				i++;
			}
			scanner.close();
		} catch (MalformedURLException e) {
			System.out.println("Malformed URL, please check it");
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}
		System.out.println(lists[0]);
		System.out.println(lists[6733]);
		return lists;

	}

	/**
	 * Extracts the file name from an <li> element of the page.
	 * 
	 * @param source
	 *            the <li> element to process
	 * @return the extracted file name
	 */
	public static String dealString(String source) {
		String result = source;
		// Match '>' followed by one whitespace character and a name ending in ".pdf".
		// The dot is escaped so that it matches only a literal '.'.
		Pattern pattern = Pattern.compile(">\\s.+\\.pdf");
		Matcher matcher = pattern.matcher(source);
		while (matcher.find()) {
			result = source.substring(matcher.start(), matcher.end());
		}
		// Drop the leading '>' and the whitespace character.
		result = result.substring(2);
		return result;
	}

	/**
	 * @param source
	 *            the url of PDF to be downloaded
	 * @param destination
	 *            the destination to be saved
	 */
	public static void downloadAndSave(String source, String destination) {
		try {
			URL url = new URL(source); // e.g. "http://ietfreport.isoc.org/rfc/PDF/rfc1341.pdf"
			File file = new File(destination); // e.g. "rfc1341.pdf"
			FileUtils.copyURLToFile(url, file);
			System.out.println(source + " downloaded");
		} catch (MalformedURLException e) {
			System.out.println("Malformed URL, please check it");
			e.printStackTrace();
		} catch (IOException e) {
			System.out.println("I/O error");
			e.printStackTrace();
		}
	}

}

Notes:

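Downloading several thousand files takes a few minutes. Many values in the program are hard-coded; suggestions for improvement are welcome.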
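One possible optimization, offered only as a sketch, is to run the downloads on a small thread pool instead of one after another. The class `ParallelRunner`, the method `forEachInParallel`, and the thread count are hypothetical choices made for illustration; whether parallel requests actually help, and how many the server tolerates, would need to be checked.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class ParallelRunner {

    /** Runs task once per name on a fixed pool of threads and waits for completion. */
    static void forEachInParallel(List<String> names, int threads, Consumer<String> task)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String name : names) {
            pool.submit(() -> task.accept(name));
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        pool.awaitTermination(1, TimeUnit.HOURS); // upper bound on the total wait
    }
}
```

With the program above, a call could look like `ParallelRunner.forEachInParallel(names, 4, n -> DownloadPdfs.downloadAndSave(url + "/" + n, "RFC" + File.separator + n))`, where `names` is the parsed list of file names. Because `downloadAndSave` catches and prints its own exceptions, a failed file does not stop the others.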

If the page on the IETF site changes, the code must be modified before it will run.

References:

1. Several discussions on Stack Overflow:

http://stackoverflow.com/questions/921262/how-to-download-and-save-a-file-from-internet-using-java

http://stackoverflow.com/questions/1378238/downloaded-pdf-with-java-is-corrupt

2. Apache Commons IO:

http://commons.apache.org/io/download_io.cgi

Original work; please credit the source when reposting.
