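Recently the queen of the house got hooked on 《延禧宫略》 (Story of Yanxi Palace) and asked me to download a copy for her. A quick search turned up a ready-made copy online: http://www.pingyaoji.com/yanxigonglue
I happened to be working on web crawlers lately, so I scraped every chapter and assembled them into a single txt file. You can download it from CSDN (https://download.csdn.net/download/zhaohuakai/10604236) or from Baidu Cloud (https://pan.baidu.com/s/1Bnc0HUtljY0jyqEwhpfzgw, password gk1k).
The program uses jsoup. The basic code is as follows: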
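import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The class name is arbitrary; the original snippet did not show one.
public class YanxiCrawler {
    public static void main(String[] args) throws IOException {
        // Fetch the table-of-contents page with a 5-second timeout.
        Document doc = Jsoup.parse(new URL("http://www.pingyaoji.com/yanxigonglue/"), 5000);
        List<Element> ls_li = doc.getElementsByTag("li");
        // try-with-resources closes the file even if a later request throws.
        try (FileWriter writer = new FileWriter("D:/延禧宫略.txt")) {
            boolean findChapOne = false;
            for (Element ele_li : ls_li) {
                // Skip every <li> before the one containing "第一章" (Chapter One).
                if (ele_li.toString().contains("第一章")) {
                    findChapOne = true;
                }
                if (!findChapOne) {
                    continue;
                }
                // Skip list items without a link instead of failing on get(0).
                List<Element> ls_a = ele_li.getElementsByTag("a");
                if (ls_a.isEmpty()) {
                    continue;
                }
                Element ele_a = ls_a.get(0);
                // The href values are assumed to be site-relative paths.
                String urlEachChap = "http://www.pingyaoji.com" + ele_a.attr("href");
                String eachTitle = ele_a.text();
                writer.write("\r\n" + eachTitle + "\r\n");
                // Fetch the chapter page; its body is the first element with class "post".
                Document docChap = Jsoup.parse(new URL(urlEachChap), 5000);
                Element eleChapDetail = docChap.getElementsByClass("post").get(0);
                List<Element> ls_p = eleChapDetail.getElementsByTag("p");
                for (Element ele_para : ls_p) {
                    String para = ele_para.toString();
                    // Stop at the first paragraph whose HTML contains <b>.
                    if (para.contains("<b>")) {
                        break;
                    }
                    // Also stop at a short paragraph (under 15 characters of HTML) containing a space.
                    if (para.contains(" ") && para.length() < 15) {
                        break;
                    }
                    // Strip spaces and literal '|' characters, then remove the HTML tags.
                    para = para.replaceAll("[ | ]", "").replaceAll("<.*?>", "");
                    writer.write(para + "\r\n");
                }
                writer.flush();
            }
        }
    }
}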
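
The two text-processing steps in the loop above (ignoring list entries until the first chapter appears, and stripping tags and padding from a paragraph) can be tried without any network access. Below is a minimal sketch that uses only the standard library. The class ChapterText and its methods clean and fromMarker are hypothetical names, not part of the original program, and the handling of &nbsp; and full-width spaces is an assumption about how the site pads its paragraphs. The sample strings are ASCII only to keep the example portable; removing all spaces suits Chinese text, not English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; none of these names appear in the original program.
class ChapterText {

    // Remove HTML tags, then &nbsp; entities, then ordinary, non-breaking
    // and full-width spaces (assumed forms of padding).
    static String clean(String paragraphHtml) {
        return paragraphHtml
                .replaceAll("<.*?>", "")
                .replace("&nbsp;", "")
                .replaceAll("[ \\u00A0\\u3000]", "");
    }

    // Keep only the entries from the first one containing the marker onwards,
    // mirroring the findChapOne flag in the crawler loop.
    static List<String> fromMarker(List<String> entries, String marker) {
        List<String> kept = new ArrayList<>();
        boolean found = false;
        for (String entry : entries) {
            if (entry.contains(marker)) {
                found = true;
            }
            if (found) {
                kept.add(entry);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>&nbsp; Chapter <b>One</b></p>"));
        System.out.println(fromMarker(Arrays.asList("Home", "Chapter 1", "Chapter 2"), "Chapter 1"));
    }
}
```

Keeping these steps as small pure functions makes it easy to adjust the cleaning rules when the site's markup changes, without re-downloading every chapter to check the result.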