Converting a URL to a PDF File in Java

Latest recommended article published on 2024-07-26 16:25:12

youz1976


Category: java. Tags: url, pdf, blog test, java

Article link: https://blog.csdn.net/u014520797/article/details/50958810


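1. Use Java to convert the blog page at a given URL into a PDF file.

2. The blog page used for testing is: http://blog.csdn.net/u014520797/article/details/50944998

3. Test result:

4. Project code structure:

5. Part of the code: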

// Requires jsoup on the classpath: import org.jsoup.Jsoup;
// Returns {title, category, post date, article content wrapped as an XHTML document}.
public static String[] extractBlogInfo(String blogURL) throws Exception {
		String[] info = new String[4];
		// A plain Jsoup.connect(blogURL).get() fails with:
		//   Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://blog.csdn.net/u014520797/article/details/50944998/
		// Crawling a site too fast gets you blocked, so the request has to imitate a human visitor
		// (expect roughly one page every few seconds). Here that means a browser User-Agent and a timeout.
		// Reference: http://blog.sina.com.cn/s/blog_664fdc7e0102vesz.html
		org.jsoup.nodes.Document doc = Jsoup.connect(blogURL)
				.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31")
				.timeout(10000)
				.get();

		// Note: first() returns null when a selector matches nothing on the page.
		// info[0]: article title
		org.jsoup.nodes.Element e_title = doc.select("span.link_title").first();
		info[0] = e_title.text();

		// info[1]: category, with the trailing "作者同类文章X" link text stripped
		org.jsoup.nodes.Element category_r = doc.select("div.category_r").first();
		info[1] = category_r.text().replace("作者同类文章X", "");

		// info[2]: post date
		org.jsoup.nodes.Element e_date = doc.select("span.link_postdate").first();
		info[2] = e_date.text();

		// info[3]: article body, wrapped in an XHTML document with a UTF-8 declaration
		// and the SimSun font so that Chinese text is displayed
		org.jsoup.nodes.Element entry = doc.select("div.article_content").first();
		info[3] = formatContentTag(entry);
		info[3] = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
				+ "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
				+ "<html xmlns=\"http://www.w3.org/1999/xhtml\">  "
				+ "<head>  "
				+ "<style>  "
				+ "body{  "
				+ "font-family:SimSun;  "
				+ "font-size:14px;  "
				+ "}  "
				+ "</style>  "
				+ "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta></head><body>" + info[3] + "</body></html>";

		System.out.println("info.toString():" + info[0] + ",\n" + info[1] + ",\n" + info[2] + ",\n" + info[3] + ",\n");
		return info;
	}

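6. A plain org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get(); cannot be used here: it fails with HTTP 403, and crawling a site too fast gets you blocked. The request has to imitate a human visitor, which in this code means sending a browser User-Agent and setting a timeout. At that pace, expect roughly one page every few seconds.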
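The "one page every few seconds" pacing can be enforced in code rather than left to chance. The following is a minimal sketch using only the JDK; the class name RequestPacer and its methods are hypothetical and are not part of the original project or of jsoup.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: enforces a minimum interval between successive requests. */
final class RequestPacer {
    private final long minIntervalNanos;
    private long lastRequestNanos;
    private boolean first = true;

    public RequestPacer(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Time still to wait, given how long ago the previous request was made. */
    public static long remainingNanos(long elapsedNanos, long minIntervalNanos) {
        return Math.max(0L, minIntervalNanos - elapsedNanos);
    }

    /** Blocks until the minimum interval has passed since the previous call. */
    public synchronized void await() {
        if (!first) {
            long elapsed = System.nanoTime() - lastRequestNanos;
            long wait = remainingNanos(elapsed, minIntervalNanos);
            if (wait > 0) {
                try {
                    TimeUnit.NANOSECONDS.sleep(wait);
                } catch (InterruptedException e) {
                    // Preserve the interrupt for the caller instead of swallowing it.
                    Thread.currentThread().interrupt();
                }
            }
        }
        first = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

One way to use it: create a single new RequestPacer(3000) and call its await() method immediately before each Jsoup.connect(...) call, so the first page is fetched at once and later pages are spaced at least three seconds apart. The 3000 ms value is an assumption; the original text only says "a few seconds".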

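7. The article content must be wrapped in the following markup, which declares UTF-8 encoding and sets the SimSun font. Without it, Chinese text is not displayed.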

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style>
body{
font-family:SimSun;
font-size:14px;
}
</style>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></meta></head><body><!-- article content, i.e. info[3] --></body></html>
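Building this wrapper inline makes extractBlogInfo harder to read, so it can be pulled out into a small helper. This is only a sketch: the class XhtmlWrapper and its wrap method are hypothetical names, and it assumes the article fragment passed in is already well-formed XHTML (in the original code that is presumably the job of formatContentTag).

```java
/** Hypothetical helper: wraps an XHTML fragment in a full document with UTF-8 and a CJK-capable font. */
final class XhtmlWrapper {
    private XhtmlWrapper() {}

    /**
     * @param bodyHtml   well-formed XHTML fragment to place inside <body>
     * @param fontFamily CSS font family covering the text, e.g. "SimSun" for Chinese
     */
    public static String wrap(String bodyHtml, String fontFamily) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<head>\n"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta>\n"
                + "<style>body{font-family:" + fontFamily + ";font-size:14px;}</style>\n"
                + "</head>\n"
                + "<body>" + bodyHtml + "</body>\n"
                + "</html>";
    }
}
```

With this in place, the assignment in extractBlogInfo could be reduced to info[3] = XhtmlWrapper.wrap(formatContentTag(entry), "SimSun");. The named font most likely also has to be available to whatever PDF renderer consumes the document, which is not shown in this excerpt.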

8. Code download: http://download.csdn.net/detail/u014520797/9469285