Java: Converting a URL to a PDF File

Latest recommended article published 2024-08-18 21:31:07

杨妹妹

Tags: java, URL to PDF

Article link: https://blog.csdn.net/weixin_30670053/article/details/114709086

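This article shows how to fetch a web page in Java with the Jsoup library and convert it to a PDF file. The steps are: set a User-Agent so the request is not rejected, process the page content so that Chinese text renders correctly, and generate the PDF. A code sample and a download link are provided.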
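The User-Agent and pacing ideas can also be illustrated without Jsoup. The following is a minimal sketch that uses only the JDK's `java.net.http` client (Java 11 or later). The class name `PoliteFetcher`, the method names, and the delay parameter are assumptions made for illustration and are not part of the article's project. Only the User-Agent string and the 10-second timeout are taken from the article's own Jsoup call.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: fetches pages with a browser-like User-Agent and a pause between requests.
final class PoliteFetcher {
    // Same User-Agent value as in the article's Jsoup call.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 "
            + "(KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31";

    // Builds a GET request that carries the User-Agent header and a 10-second timeout.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Fetches each URL in order, sleeping delayMillis between consecutive requests.
    static List<String> fetchAll(List<String> urls, long delayMillis) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow http -> https redirects
                .build();
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            if (i > 0) {
                Thread.sleep(delayMillis); // pause so the crawl does not hit the site too quickly
            }
            HttpResponse<String> response =
                    client.send(buildRequest(urls.get(i)), HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 200) {
                throw new IllegalStateException(
                        "HTTP " + response.statusCode() + " for " + urls.get(i));
            }
            pages.add(response.body());
        }
        return pages;
    }
}
```

A call such as `PoliteFetcher.fetchAll(urls, 3000)` would wait three seconds between pages. A suitable delay depends on the target site, and a browser-like User-Agent does not guarantee that the site will accept the request.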


1. Convert a blog page at a given URL into a PDF file with Java.

2. The blog page used for testing: http://blog.csdn.net/u014520797/article/details/50944998

3. Test result:

4. Project code structure:

5. Partial code:

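import org.jsoup.Jsoup;

// Returns {title, category, post date, content wrapped as an XHTML document}.
public static String[] extractBlogInfo(String blogURL) throws Exception {
    String[] info = new String[4];

    // A plain request fails with:
    //   Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL.
    //   Status=403, URL=http://blog.csdn.net/u014520797/article/details/50944998/
    // org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get();

    // Crawling a site too quickly gets the client blocked, so the crawler should behave
    // like a human visitor, which means roughly one page every few seconds.
    // See http://blog.sina.com.cn/s/blog_664fdc7e0102vesz.html
    org.jsoup.nodes.Document doc = Jsoup.connect(blogURL)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31")
            .timeout(10000)
            .get();

    // Title
    org.jsoup.nodes.Element e_title = doc.select("span.link_title").first();
    info[0] = e_title.text();

    // Category
    org.jsoup.nodes.Element category_r = doc.select("div.category_r").first();
    info[1] = category_r.after("label").after("span").text().replace("作者同类文章X", "");

    // Post date
    org.jsoup.nodes.Element e_date = doc.select("span.link_postdate").first();
    info[2] = e_date.text();

    // Article body
    org.jsoup.nodes.Element entry = doc.select("div.article_content").first();
    info[3] = formatContentTag(entry);

    // Wrap the body in an XHTML document. The tags inside these string literals were
    // lost from the published listing; the DOCTYPE is recoverable, while the html/body
    // wrapper is an assumed minimal reconstruction.
    info[3] = ""
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title></title></head>"
            + "<body>" + info[3] + "</body></html>";

    return info;
}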
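The listing ends by wrapping the extracted content in an XHTML 1.0 Transitional document. The article does not show which PDF library consumes that string, but XML-based renderers generally need well-formed input, so it can help to check the wrapped document before rendering. The sketch below uses only JDK classes. The names `XhtmlWrapper`, `wrap`, `escape`, and `isWellFormed` are hypothetical, and the `SimSun` font rule is an assumption about how Chinese text might be kept readable; it only helps if the renderer has that font available.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps an HTML fragment as XHTML and checks that the result parses as XML.
final class XhtmlWrapper {
    static final String DOCTYPE =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">";

    // Builds a complete XHTML document around an already well-formed body fragment.
    static String wrap(String title, String bodyFragment) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + DOCTYPE
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>" + escape(title) + "</title>"
                // Assumption: a CJK-capable font so Chinese characters are not dropped.
                + "<style type=\"text/css\">body { font-family: SimSun; }</style></head>"
                + "<body>" + bodyFragment + "</body></html>";
    }

    // Escapes the characters that would break XML text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Returns true when the document parses as XML, false otherwise.
    static boolean isWellFormed(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Resolve the external DTD to an empty stream so no network access is needed.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            builder.parse(new InputSource(new StringReader(xhtml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Because the DTD is replaced with an empty stream, named entities such as `&nbsp;` are reported as errors by this check. Content scraped from real pages usually needs such entities replaced with numeric references, and void tags like `<br>` closed, before it passes.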