Web Crawler (Part 2): Novels

看那远处的行人

Category: Study Notes. Tag: web crawler

Article link: https://blog.csdn.net/MurphySecret/article/details/100074051

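Logic:
Get the chapter title and create a file named after it
Get the chapter text and write it to the file
Get the URL of the next chapter, move on to it, and repeat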

Static variables

	public static final String WORKSPACE = "/test"; // directory where the files are saved
	public static File textFile; // file for the current chapter

Create the folder

	File directory = new File(WORKSPACE);
	if (!directory.exists() && !directory.isDirectory()) {
		directory.mkdir();
	}

File.exists() checks whether the path exists.
File.isDirectory() checks whether the path is a directory.
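
File.mkdir() creates only the last segment of the path and returns false when a parent folder is missing. As a small sketch (not the author's code), Files.createDirectories creates any missing parents and does nothing if the directory is already there. The class name WorkspaceDirs and the method ensureDirectory are made up for this illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post.
class WorkspaceDirs {

    // Creates the directory (and any missing parents) if it does not exist yet.
    // Throws an IOException if the path exists but is a regular file.
    static Path ensureDirectory(String path) throws IOException {
        return Files.createDirectories(Paths.get(path));
    }

    public static void main(String[] args) throws IOException {
        // A temporary folder stands in for WORKSPACE here.
        Path base = Files.createTempDirectory("crawler-demo");
        Path dir = ensureDirectory(base.resolve("novels/test").toString());
        System.out.println(Files.isDirectory(dir)); // prints "true"
    }
}
```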

Connect to the page with Jsoup and get the Document object for the URL

	public static Document getDocument(String url) {
		boolean flag = false; // true when the request failed and has to be retried
		Document document = null;
		do {
			try {
				document = Jsoup.connect(url).userAgent(
						"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31")
						.timeout(5000).get();
				flag = false;
			} catch (IOException e) {
				e.printStackTrace();
				flag = true; // note: this retries indefinitely until the request succeeds
			}
		} while (flag);
		return document;
	}
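
The loop above keeps retrying for as long as the request fails, so a URL that never succeeds would block the crawler. One possible variation, sketched here under assumptions, is to cap the number of attempts and pause between them. Retry and withRetries are hypothetical names, and the attempt count and delay are arbitrary.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the original post.
class Retry {

    // Runs the task up to maxAttempts times, sleeping delayMillis after each failed attempt
    // except the last. Only IOException triggers a retry; the last one is rethrown
    // if every attempt fails.
    static <T> T withRetries(Callable<T> task, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

With Jsoup on the classpath, the request could then be wrapped as Retry.withRetries(() -> Jsoup.connect(url).timeout(5000).get(), 3, 1000).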

Create the file-writing method

	public static void writeTxtFile(File file, String text) {
		// Open in append mode; try-with-resources closes the writer even if write() fails
		try (BufferedWriter writer = new BufferedWriter(new FileWriter(file, true))) {
			writer.write(text);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
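
FileWriter(File, boolean) encodes text with the JVM's default charset, which on some systems is not UTF-8, and that matters for Chinese chapter text. A minimal sketch of the same append operation with an explicit UTF-8 encoding follows; Utf8Appender and append are hypothetical names, not the author's.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of the original post.
class Utf8Appender {

    // Appends text to the file as UTF-8, creating the file if it does not exist.
    static void append(Path file, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(text);
        }
    }
}
```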

URL to crawl:
https://www.bqg3.com/12_12602/100907523.html
Following the logic above, we have to start from the link to the first chapter

	String url = "https://www.bqg3.com/12_12602/100907523.html";

Get the document

	Document document = getDocument(url);

If we print the document, we can see that it is an HTML page.
The chapter text is inside the div whose id is content.

Get the title

	String title = document.title();

Get the text

	String text = document.select("#content").text();

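The title we get back is rather long:
"第一章神格_圣堂小说在线阅读：第一章神格-笔趣阁"
so we cut it down

	title = title.split("：")[1]; // keep the part after the full-width colon
	title = title.substring(0, title.length() - 4); // drop the 4-character suffix "-笔趣阁"

After trimming we get
title = "第一章神格"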
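
The split and substring above rely on the title containing a full-width colon and ending in a suffix of exactly four characters. A slightly more defensive sketch, assuming only that the chapter name sits between the last "：" and the last "-", could look like this. ChapterTitles and extract are hypothetical names.

```java
// Hypothetical helper, not part of the original post.
class ChapterTitles {

    // Returns the part of the page title between the last full-width colon
    // and the last hyphen after it. If there is no colon, the whole title is used;
    // if there is no hyphen, nothing is cut off at the end.
    static String extract(String pageTitle) {
        String separator = "：";
        int colon = pageTitle.lastIndexOf(separator);
        String tail = colon >= 0 ? pageTitle.substring(colon + separator.length()) : pageTitle;
        int hyphen = tail.lastIndexOf("-");
        String chapter = hyphen > 0 ? tail.substring(0, hyphen) : tail;
        return chapter.trim();
    }

    public static void main(String[] args) {
        // prints "第一章神格"
        System.out.println(extract("第一章神格_圣堂小说在线阅读：第一章神格-笔趣阁"));
    }
}
```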

Create the txt file

	textFile = new File(WORKSPACE + "/" + title + ".txt");
	try {
		if (textFile.exists()) { // if the file already exists
			textFile.delete(); // delete it first
		}
		textFile.createNewFile(); // then create it
	} catch (IOException e) {
		e.printStackTrace();
	}
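
Because the file name comes straight from the page title, a title containing a character such as "?" or "/" could make file creation fail on some systems. The following is only a sketch of one way to guard against that; FileNames and sanitize are hypothetical names, and the replacement character and the "untitled" fallback are arbitrary choices.

```java
// Hypothetical helper, not part of the original post.
class FileNames {

    // Replaces characters that are not allowed in file names on common
    // file systems (\ / : * ? " < > |) and control characters with '_'.
    // Falls back to "untitled" if nothing usable is left.
    static String sanitize(String title) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|\\p{Cntrl}]", "_").trim();
        return cleaned.isEmpty() ? "untitled" : cleaned;
    }
}
```

The file could then be created as new File(WORKSPACE + "/" + FileNames.sanitize(title) + ".txt").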

Write the chapter text

	writeTxtFile(textFile, text);

In the same way, inspecting the document shows that
the link to the next chapter is inside the div whose class is bottem2

	Elements nextdiv = document.select(".bottem2");

This gives us the HTML of that div.

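Splitting on "下一章" and then on the arrow "→" leads us to the next chapter's URL.
\u2192 is the Unicode escape for the arrow.

	String nexturl = nextdiv.toString();
	String[] nextdivurls = nexturl.split("下一章"); // [0] is everything before the link text
	nextdivurls = nextdivurls[0].split("\u2192"); // split that part at the arrow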

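Take the last element of the array

	nexturl = nextdivurls[1];

This gives

	<a href="/12_12602/100907525.html">

Trim it once more

	// strip the leading <a href=" (with its surrounding whitespace) and the trailing ">
	nexturl = nexturl.substring(12, nexturl.length() - 2);

This gives

	/12_12602/100907525.html

Prepend the site's address to get the next chapter's URL

	url = "https://www.bqg3.com" + nexturl;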
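
The fixed offsets in substring(12, ...) depend on the exact whitespace in the markup, so they can break if the page layout changes slightly. As an alternative sketch that uses only the JDK, a regular expression can pick out the href of the anchor whose text contains "下一章", and URI.resolve can turn the relative path into an absolute URL. This assumes the link has the shape <a href="...">下一章</a>; NextChapterLink and find are hypothetical names. Jsoup's own selectors could do the same job; this version avoids them only so that it runs without dependencies.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
class NextChapterLink {

    // Matches <a ... href="X" ...>TEXT</a> where TEXT contains no nested tags;
    // group 1 is X, group 2 is TEXT.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\s[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a>");

    // Returns the absolute URL of the first anchor whose text contains "下一章", if any.
    static Optional<String> find(String navHtml, String pageUrl) {
        Matcher m = ANCHOR.matcher(navHtml);
        while (m.find()) {
            if (m.group(2).contains("下一章")) {
                // Resolve a relative href such as /12_12602/100907525.html against the page URL.
                return Optional.of(URI.create(pageUrl).resolve(m.group(1)).toString());
            }
        }
        return Optional.empty();
    }
}
```

Returning an Optional also gives the caller a clear signal for when there is no next chapter to follow.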

Wrap these steps in a method and call it recursively
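
Recursing once per chapter adds a stack frame for every chapter and needs some condition to stop at the end of the book. The same traversal can also be written as a loop. The sketch below assumes a hypothetical function processChapter that handles one page and returns the next URL, or an empty Optional when there is none; ChapterWalker, walk and the maxChapters safety cap are likewise made up for illustration.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical driver, not part of the original post.
class ChapterWalker {

    // Calls processChapter for the start URL, then for each URL it returns,
    // until it returns Optional.empty() or maxChapters pages have been processed.
    // Returns the number of chapters processed.
    static int walk(String startUrl,
                    Function<String, Optional<String>> processChapter,
                    int maxChapters) {
        int processed = 0;
        Optional<String> next = Optional.of(startUrl);
        while (next.isPresent() && processed < maxChapters) {
            next = processChapter.apply(next.get());
            processed++;
        }
        return processed;
    }
}
```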

Complete code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SText {
	public static int page = 1;
	public static final String WORKSPACE = "/Users/oneway/Winter";
	/**
	 * File for the chapter currently being written
	 */
	public static File textFile;

	public static void main(String[] args) {
		File directory = new File(WORKSPACE);
		if (!directory.exists() && !directory.isDirectory()) {
			directory.mkdir();
		}
		getText("https://www.bqg3.com/12_12602/100907523.html");
	}

	public static void getText(String url) {
		System.out.println("Crawling chapter " + page);
		Document document = getDocument(url);
		String text = document.select("#content").text();
		if (text.length() > 60) { // guard: substring would throw on very short text
			text = text.substring(30, text.length() - 30); // trim the text to drop some ads
		}

		// Create the output file
		String title = document.title();
		title = title.split("：")[1];
		title = title.substring(0, title.length() - 4);
		textFile = new File(WORKSPACE + "/" + title + ".txt");
		try {
			if (textFile.exists()) { // if the file already exists
				textFile.delete(); // delete it first
			}
			textFile.createNewFile(); // then create it
		} catch (IOException e) {
			e.printStackTrace();
		}

		writeTxtFile(textFile, text);

		Elements nextdiv = document.select(".bottem2");
		String nexturl = nextdiv.toString();
		if (!nexturl.contains("下一章")) {
			return; // no next-chapter link: stop
		}
		String[] nextdivurls = nexturl.split("下一章");
		nextdivurls = nextdivurls[0].split("\u2192");
		if (nextdivurls.length < 2) {
			return; // unexpected markup: stop instead of throwing
		}
		nexturl = nextdivurls[1];
		nexturl = nexturl.substring(12, nexturl.length() - 2);
		url = "https://www.bqg3.com" + nexturl;
		page++;
		// Move on to the next chapter
		getText(url);
	}

	/**
	 * Write the content
	 *
	 * @param file
	 *            the file to write to
	 * @param text
	 *            the text to write
	 */
	public static void writeTxtFile(File file, String text) {
		// Open in append mode; try-with-resources closes the writer even if write() fails
		try (BufferedWriter writer = new BufferedWriter(new FileWriter(file, true))) {
			writer.write(text);
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Get the Document object for a URL
	 *
	 * @param url
	 * @return document
	 */
	public static Document getDocument(String url) {
		boolean flag = false; // true when the request failed and has to be retried
		Document document = null;
		do {
			try {
				document = Jsoup.connect(url).userAgent(
						"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31")
						.timeout(5000).get();
				flag = false;
			} catch (IOException e) {
				e.printStackTrace();
				flag = true; // note: this retries indefinitely until the request succeeds
			}
		} while (flag);
		return document;
	}
}
