1. Converting a blog page at a given URL into a PDF file with Java
2. Blog page used for testing: http://blog.csdn.net/u014520797/article/details/50944998
3. Test result:
4. Project code structure:
5. Selected code:
import org.jsoup.Jsoup;

// Returns {title, category, post date, article body wrapped as an XHTML document}.
public static String[] extractBlogInfo(String blogURL) throws Exception {
    String[] info = new String[4];
    // A bare Jsoup.connect(blogURL).get() fails with:
    // Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://blog.csdn.net/u014520797/article/details/50944998/
    // A site may also block a client that crawls too fast, so imitate a human visitor:
    // send a browser User-Agent and fetch roughly one page every few seconds.
    // Reference: http://blog.sina.com.cn/s/blog_664fdc7e0102vesz.html
    org.jsoup.nodes.Document doc = Jsoup.connect(blogURL)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31")
            .timeout(10000) // milliseconds
            .get();

    // Note: first() returns null when a selector matches nothing (for example if the page layout changes).
    org.jsoup.nodes.Element e_title = doc.select("span.link_title").first();
    info[0] = e_title.text();

    // Element.after(String) inserts HTML into the DOM and returns the same element,
    // so the original after("label").after("span") calls only mutated the page.
    // Read the text directly and strip the "作者同类文章X" widget label.
    org.jsoup.nodes.Element category_r = doc.select("div.category_r").first();
    info[1] = category_r.text().replace("作者同类文章X", "");

    org.jsoup.nodes.Element e_date = doc.select("span.link_postdate").first();
    info[2] = e_date.text();

    // formatContentTag is a helper defined elsewhere in the project.
    org.jsoup.nodes.Element entry = doc.select("div.article_content").first();
    info[3] = formatContentTag(entry);

    // Wrap the article body in an XHTML document that declares UTF-8 and the SimSun font,
    // so that Chinese text is displayed (see item 7).
    info[3] = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"> "
            + "<head> "
            + "<style> "
            + "body{ "
            + "font-family:SimSun; "
            + "font-size:14px; "
            + "} "
            + "</style> "
            + "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta></head><body>" + info[3] + "</body></html>";

    System.out.println("info.toString():" + info[0] + ",\n" + info[1] + ",\n" + info[2] + ",\n" + info[3] + ",\n");
    return info;
}
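6. Do not use the bare call org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get(); it fails with HTTP status 403, apparently because the site rejects requests that do not look like they come from a browser. Set a browser User-Agent and a timeout on the connection instead, as in the code in item 5. A site may also block a client that crawls it too quickly, so when fetching many pages, crawl the way a person would browse, at roughly one page every few seconds.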
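The pacing described in item 6 could be sketched as below. RequestThrottle and awaitTurn are hypothetical names introduced only for illustration; they are not part of the original project or of jsoup, and the interval is an assumption you would tune for the site being crawled. The sketch uses only the Java standard library.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: enforces a minimum pause between successive requests. */
class RequestThrottle {
    private final long minIntervalNanos;
    private long lastRequestNanos;
    private boolean firstCall = true;

    RequestThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Blocks until at least the minimum interval has passed since the previous call. */
    synchronized void awaitTurn() {
        if (!firstCall) {
            long waitUntil = lastRequestNanos + minIntervalNanos;
            long remaining;
            // Loop in case sleep returns slightly early.
            while ((remaining = waitUntil - System.nanoTime()) > 0) {
                try {
                    TimeUnit.NANOSECONDS.sleep(remaining);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while pacing requests", e);
                }
            }
        }
        firstCall = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

A crawler would create one instance, for example new RequestThrottle(3000) for about one page every three seconds, and call awaitTurn() immediately before each Jsoup.connect(...).get() call.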
7. The article HTML must be wrapped in the following markup; otherwise Chinese text cannot be displayed.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style>
body{
font-family:SimSun;
font-size:14px;
}
</style>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></meta>
</head>
<body>
<!-- article HTML (info[3]) goes here -->
</body>
</html>
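The wrapper from item 7 could also be factored into a small reusable method rather than concatenated inline. This is only a sketch: XhtmlWrapper and wrapAsXhtml are hypothetical names that do not appear in the original project, and making the font family a parameter is an assumption added for illustration.

```java
/** Hypothetical helper: wraps an HTML fragment in the XHTML document from item 7. */
class XhtmlWrapper {

    /**
     * @param bodyHtml   the article HTML fragment (already well-formed XHTML)
     * @param fontFamily a font that contains Chinese glyphs, e.g. "SimSun"
     */
    static String wrapAsXhtml(String bodyHtml, String fontFamily) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        sb.append("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" ")
          .append("\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">");
        sb.append("<html xmlns=\"http://www.w3.org/1999/xhtml\">");
        sb.append("<head>");
        // The body font must be able to render Chinese characters.
        sb.append("<style>body{font-family:").append(fontFamily).append(";font-size:14px;}</style>");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta>");
        sb.append("</head>");
        sb.append("<body>").append(bodyHtml).append("</body>");
        sb.append("</html>");
        return sb.toString();
    }
}
```

With this in place, the last assignment in extractBlogInfo could become info[3] = XhtmlWrapper.wrapAsXhtml(info[3], "SimSun");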
8. Code download: http://download.csdn.net/detail/u014520797/9469285