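Scenario: the page structure is fairly simple. The goal is only to scrape the novel's text and save it locally, and no anti-scraping measures were encountered.
The following dependencies are used: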
<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
Page source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)" />
<meta name="description" content="第十一章 末代皇帝&最后一个克格勃(3)-龙族3·黑月之潮(中)" />
<!--[if lt IE 9]>
<script src=/css3-mediaqueries.js></script>
<![endif]-->
<link rel="stylesheet" type="text/css" media="screen and (max-width: 900px)" href="/wap.css" />
<link rel="stylesheet" type="text/css" media="screen and (min-width: 900px)" href="/dcy.css" />
<link rel="alternate" type="application/rss+xml" href="http://www.********.cc/longzu3heiyuezhichaozhong/feed.asp?cmt=371" title="Comments Feed for 第十一章 末代皇帝&最后一个克格勃(3)" />
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/script/common.js" type="text/javascript"></script>
<script src="http://www.********.cc/longzu3heiyuezhichaozhong/function/c_html_js_add.asp" type="text/javascript"></script>
</head>
<body><div class="v"><h1 align="center" class="STYLE1">龙族3·黑月之潮(中)</h1></div>
<div class="site clearfix"><span style="float:right;"> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/" >返回首页</a></span><a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龙族3·黑月之潮(中)</a> > 第十一章 末代皇帝&最后一个克格勃(3)</div>
<div class="chaptertitle clearfix">
<h1>第十一章 末代皇帝&最后一个克格勃(3)</h1>
</div>
<div id="p_adtop" class="clearfix">
<div id="p_ad_t1"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t2"><script language="javascript" type="text/javascript" src="/ad1.js"></script></div>
<div id="p_ad_t4"></div>
</div>
<div class="bookcontent clearfix" id="BookText"> 御神刀斩落,带着大片的弧光。橘正宗血光飞溅,战栗着倒地。<br/><br/> 怀刃插在地上,橘正宗用来握刀的右手五指尽落,因此他没能把怀剑插进自己的肚子里。<br/><br/> 源稚生面无表情地收刀回鞘,从怀里抽出手帕沿着断指根部扎紧来止血。他的刀术极精,一刀斩断橘正宗的五指,却还留下短短的指根来止血。<br/><br/> <br/><br/> 1937年12月,南京被攻克,之后的六个星期中。城里有三十万平民被屠杀。南京城里西方桥民的证词是审判战犯的关键证据,一位法国天主教堂的修女说,日军甚至冲进西方教堂开设的育婴堂。强暴藏身在里面的中国女人。老嬷嬷让中国女人们穿上修女的衣服,秘密地带他们出城。他们在江边被日本军队拦截,藤原胜少校发现他们都是假修女,于是所有女人都遭到了强暴,反抗者被用刺刀刨开了肚子。没有遭到侵害的只有带队的那位老嬷嬷,但她目睹了那血腥残酷的一幕后无法忍受,于是开枪自杀。死前她诅咒说神会惩罚罪人,用雷电用火焰……”<br/><br/> 【THEEND】<br/><br/><div id="p_ad_t3"><script language="javascript" type="text/javascript" src="/xm.js"></script></div></div>
<!--content-->
<div id="p_ad_b1" class="clearfix">
</div>
<div class="bottomlink clearfix">
<div class="linkbtn clearfix"> <h2><a href="http://www.********.cc/longzu3heiyuezhichaozhong/370.html"><span>(快捷键:←)上一页</span></a> <a href="http://www.********.cc/longzu3heiyuezhichaozhong/"><span>返回章节目录(快捷键:回车)</span></a> <a href=""><span>下一页(快捷键:→)</span></a></h2> </div>
</div>
<div class="bottomlink clearfix">
<div style="display:none;" id="divAjaxComment"></div>
<div class="post" id="divCommentPost">
<p class="posttop"><a name="comment">发表评论:</a></p>
<form id="frmSumbit" target="_self" method="post" action="http://www.********.cc/longzu3heiyuezhichaozhong/cmd.asp?act=cmt&key=32c3ee99" >
<input type="hidden" name="inpId" id="inpId" value="371" />
<input type="hidden" name="inpArticle" id="inpArticle" value="" />
<input type="hidden" name="inpLocation" id="inpLocation" value="" />
<p><input type="text" name="inpName" id="inpName" class="text" value="" size="28" tabindex="1" /> <label for="inpName">名称(必填)</label></p>
<p><input type="text" name="inpEmail" id="inpEmail" class="text" value="" size="28" tabindex="2" /> <label for="inpEmail">邮箱(可以不填写)</label></p>
<!--<p><input type="text" name="inpHomePage" id="inpHomePage" class="text" value="" size="28" tabindex="3" /> <label for="inpHomePage">网站链接</label></p>-->
<p><label for="txaArticle">正文(留言最长字数:1000)</label></p>
<p>
<textarea name="txaArticle" id="txaArticle" onchange="GetActiveText(this.id);" onclick="GetActiveText(this.id);" onfocus="GetActiveText(this.id);" class="text" cols="50" rows="4" tabindex="5" style="width:80%;resize:none;" ></textarea>
</p>
<p><input name="btnSumbit" type="submit" tabindex="6" value="提交" onclick="JavaScript:return VerifyMessage()" class="button" /> <input type="checkbox" name="chkRemember" value="1" id="chkRemember" /> <label for="chkRemember">记住我,下次回复时不用重新输入个人信息</label></p>
<script language="JavaScript" type="text/javascript">objActive="txaArticle";ExportUbbFrame();</script>
</form>
<p class="postbottom">◎欢迎参与讨论,请在这里发表您的看法、交流您的观点。</p>
<script language="JavaScript" type="text/javascript">LoadRememberInfo();</script>
</div>
</div>
<div id="p_ad_b2" class="clearfix">
</div>
<!--页脚-->
<div class="footer clearfix"> <span class="page-comment">
</span> <span class="fright">
<div id="pagebottom">
</div>
</span> <span class="fleft gray-link"></script>Copyright 2015-2017 <a href="http://www.********.cc/longzu3heiyuezhichaozhong/">龙族3·黑月之潮(中)</a> all rights reserved <script language="javascript" type="text/javascript" src="//js.users.51.la/19241152.js"></script>
</span></div>
<div id="allbottom">
</div>
</body>
</html>
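In the page above, the chapter text lives in the `div` with id `BookText`, and paragraphs are separated by `<br/><br/>`. As a rough illustration of how that markup maps to paragraphs, here is a minimal sketch that uses only the JDK and regular expressions. It is not what the scraper in this post does (that one relies on Jsoup), and the class name `ParagraphSplitter`, the method `splitParagraphs`, and the sample string are all made up for this example. Regex-based tag stripping is a simplification and would not hold up on arbitrary HTML.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original scraper.
class ParagraphSplitter {

    // Splits the inner HTML of the content div on runs of <br>, <br/> or <br />
    // and strips any remaining tags (for example the trailing ad <div>/<script>).
    static List<String> splitParagraphs(String innerHtml) {
        List<String> paragraphs = new ArrayList<>();
        for (String piece : innerHtml.split("(?i)(\\s*<br\\s*/?>\\s*)+")) {
            String text = piece
                    .replaceAll("(?is)<script.*?</script>", "") // drop script elements entirely
                    .replaceAll("<[^>]+>", "")                  // drop any other tags
                    .trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample shaped like the BookText div above.
        String sample = " Paragraph one.<br/><br/> Paragraph two.<br/><br/> <br/><br/>"
                + "<div id=\"p_ad_t3\"><script src=\"/xm.js\"></script></div>";
        for (String paragraph : splitParagraphs(sample)) {
            System.out.println(paragraph);
        }
    }
}
```

Empty pieces (such as the blank `<br/><br/> <br/><br/>` run and the ad container) are dropped, so only real paragraphs remain.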
The site's address is not shown and is masked with asterisks. The code follows.
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/* Scrape a novel from the site */
public class CaptureDemo {

    public static void main(String[] args) {
        // Every chapter is appended to the same output file
        File file = new File("E:\\龙族3-黑月之潮(下).txt");
        for (int page = 345; page <= 360; page++) {
            String url = "http://www.********.cc/longzu3heiyuezhichaoxia/" + page + ".html";
            String bookContent = getBookContent(url);
            if (bookContent == null) {
                // Nothing was extracted; skip instead of writing "null" to the file
                System.out.println(url + " failed, skipped.");
                continue;
            }
            System.out.println(bookContent);
            saveToLocal(bookContent, file);
            System.out.println(url + " is over.");
        }
    }

    // Save the data to a local file
    private static String saveToLocal(String bookContent, File file) {
        // Append if the file exists, create it otherwise.
        // try-with-resources closes the writer even when the write fails.
        try (Writer writer = new OutputStreamWriter(
                new FileOutputStream(file, true), StandardCharsets.UTF_8)) {
            writer.write(bookContent);
            return "success";
        } catch (IOException e) {
            e.printStackTrace();
        }
        return "failed";
    }

    // Fetch the target content
    private static String getBookContent(String url) {
        // Fetch the page; both the client and the response are closed automatically
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            // Print the response status
            System.out.println(response.getStatusLine());
            // Get the response entity
            HttpEntity entity = response.getEntity();
            if (entity == null) {
                return null;
            }
            // Read the page body as UTF-8
            String html = EntityUtils.toString(entity, StandardCharsets.UTF_8);
            // Parse the page with Jsoup
            Document document = Jsoup.parse(html);
            // Chapter body
            Element bookText = document.getElementById("BookText");
            if (bookText == null) {
                return null;
            }
            // Chapter title
            Elements chaptertitle = document.getElementsByClass("chaptertitle");
            String headTitle = chaptertitle.text();
            // Replace spaces with newlines so each paragraph starts on its own line
            String content = bookText.text().replaceAll(" ", "\n");
            return "\n" + headTitle + "\n" + content + "\n\n";
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}
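The scraper relies on opening the output file in append mode, so that each chapter lands after the previous one. A minimal, JDK-only sketch of that append step is shown below, assuming UTF-8 output. The class `ChapterAppender`, its methods, and the temp-file demo are hypothetical names introduced here for illustration. It uses `java.nio.file.Files.write` with `CREATE` and `APPEND` rather than the writer used in the code above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original scraper.
class ChapterAppender {

    // Appends one chapter (title + body) to the file, creating the file if needed.
    static void appendChapter(Path file, String title, String body) {
        String block = "\n" + title + "\n" + body + "\n\n";
        try {
            Files.write(file, block.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes two made-up chapters to a temporary file and returns the file content.
    static String demo() {
        try {
            Path file = Files.createTempFile("novel-demo", ".txt");
            appendChapter(file, "Chapter 1", "first body");
            appendChapter(file, "Chapter 2", "second body");
            String content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            Files.delete(file);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Writing with an explicit charset avoids depending on the platform default encoding, which matters for Chinese text on Windows.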
Recorded for learning purposes only.