Purpose
To download novels from Biquge (笔趣阁), this article uses the following strategy: a pooled HTTP client for batch fetching and downloading, combined with a thread pool and asynchronous file writes, to improve throughput. The implementation is in Java; detailed steps and code follow.
Implementation steps
- Prepare the environment:
- Make sure the network connection is stable.
- HTTP request pool:
- Use Apache HttpClient with a PoolingHttpClientConnectionManager so connections are pooled and reused across batch requests.
- Thread pool:
- Use a fixed-size ExecutorService from java.util.concurrent to download pages concurrently.
- Asynchronous writes:
- Use CompletableFuture.runAsync to write chapter files off the request threads, improving I/O efficiency.
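Before the full crawler, the three pieces above can be sketched end-to-end with a dependency-free toy (the chapter contents, file names, and class name here are invented for illustration; real downloads would be HTTP requests):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PipelineSketch {
    // Stand-in for an HTTP GET: in the real crawler this fetches a chapter page.
    static String downloadChapter(int id) {
        return "Chapter " + id + " content";
    }

    // Asynchronous write, mirroring the CompletableFuture approach in the crawler.
    static CompletableFuture<Void> writeAsync(Path file, String content) {
        return CompletableFuture.runAsync(() -> {
            try {
                Files.writeString(file, content, StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // thread pool for downloads
        Path dir = Files.createTempDirectory("novel");
        List<CompletableFuture<Void>> writes = Collections.synchronizedList(new ArrayList<>());
        List<Future<?>> tasks = new ArrayList<>();
        for (int i = 1; i <= 8; i++) {
            final int id = i;
            // Download on the pool, then hand the content to an async writer.
            tasks.add(pool.submit(() ->
                    writes.add(writeAsync(dir.resolve(id + ".txt"), downloadChapter(id)))));
        }
        for (Future<?> t : tasks) t.get(); // wait for all downloads
        CompletableFuture.allOf(writes.toArray(new CompletableFuture[0])).join(); // wait for writes
        pool.shutdown();
        try (var files = Files.list(dir)) {
            System.out.println(files.count()); // number of chapter files written
        }
    }
}
```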
Dependency setup
Maven dependencies (pom.xml)
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<!-- POM model version -->
<modelVersion>4.0.0</modelVersion>
<!-- Project group ID -->
<groupId>org.example</groupId>
<!-- Project artifact ID -->
<artifactId>WXMNWikiPC</artifactId>
<!-- Project version -->
<version>1.0-SNAPSHOT</version>
<!-- Maven build properties -->
<properties>
<!-- Compiler source version -->
<maven.compiler.source>17</maven.compiler.source>
<!-- Compiler target version -->
<maven.compiler.target>17</maven.compiler.target>
<!-- Source encoding for the build -->
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<!-- Project dependencies -->
<dependencies>
<!-- Jsoup, for HTML parsing -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2</version>
</dependency>
<!-- SQLite JDBC driver -->
<dependency>
<groupId>org.xerial</groupId>
<artifactId>sqlite-jdbc</artifactId>
<version>3.46.0.0</version>
</dependency>
<!-- Apache HttpClient -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
</dependency>
<!-- Apache POI -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>4.1.2</version>
</dependency>
<!-- Apache POI OOXML -->
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>4.1.2</version>
</dependency>
</dependencies>
<!-- Build plugin configuration -->
<build>
<plugins>
<!-- Compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<!-- Compiler source version -->
<source>17</source>
<!-- Compiler target version -->
<target>17</target>
</configuration>
</plugin>
<!-- Packaging plugin -->
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<!-- Build a jar that bundles all dependencies -->
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<!-- Set the main class -->
<manifest>
<mainClass>org.example.xspc.WebCrawler</mainClass>
<!-- Replace with your own main class -->
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<!-- Unique execution ID -->
<phase>package</phase>
<!-- Bind to the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- Kotlin plugin -->
<plugin>
<groupId>org.jetbrains.kotlin</groupId>
<artifactId>kotlin-maven-plugin</artifactId>
<version>1.7.10</version>
<executions>
<execution>
<id>compile</id>
<phase>process-sources</phase>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<sourceDirs>
<!-- Source directories -->
<sourceDir>src/main/java</sourceDir>
<sourceDir>target/generated-sources/annotations</sourceDir>
</sourceDirs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
POM walkthrough:
- Basic project information:
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>WXMNWikiPC</artifactId>
<version>1.0-SNAPSHOT</version>
- modelVersion: the version of the POM model.
- groupId: the project's group ID.
- artifactId: the project's artifact ID.
- version: the project's version number.
- Build properties:
<properties>
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
- maven.compiler.source and maven.compiler.target: the source and target versions for Java compilation.
- project.build.sourceEncoding: the encoding of the source files.
- Dependency management:
<dependencies>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.17.2</version>
</dependency>
<dependency>
<groupId>org.xerial</groupId>
<artifactId>sqlite-jdbc</artifactId>
<version>3.46.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.5.13</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>4.1.2</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>4.1.2</version>
</dependency>
</dependencies>
- Each dependency entry pulls in the corresponding library: jsoup, sqlite-jdbc, httpclient, and poi. Note that the crawler code below uses only jsoup and httpclient; the SQLite and POI dependencies are declared but not exercised by this example.
- Plugin configuration:
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>17</source>
<target>17</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>org.example.xspc.WebCrawler</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.jetbrains.kotlin</groupId>
<artifactId>kotlin-maven-plugin</artifactId>
<version>1.7.10</version>
<executions>
<execution>
<id>compile</id>
<phase>process-sources</phase>
<goals>
<goal>compile</goal>
</goals>
<configuration>
<sourceDirs>
<sourceDir>src/main/java</sourceDir>
<sourceDir>target/generated-sources/annotations</sourceDir>
</sourceDirs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
- maven-compiler-plugin: sets the compiler source and target versions.
- maven-assembly-plugin: builds a jar that bundles all dependencies and specifies the main class.
- kotlin-maven-plugin: configures the Kotlin compiler and its source directories.
Explanation
- Collect book links: the ljcx method fetches the site's front page, parses it with Jsoup, and collects links to book pages.
- Collect chapter links: the getWebPageSourceCode method fetches each book page and gathers the chapter URLs it links to.
- Download chapter content: the getXSZJNR method fetches a chapter page, extracts the title and body, and hands the text off for writing.
- Asynchronous file write: the writeToFileAsync method writes chapter content to disk on a background thread via CompletableFuture.
- Main method: main builds the pooled HTTP client, submits the crawl and download tasks to a fixed thread pool, and waits for them to finish.
Implementation code
package org.example.xspc;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
public class WebCrawler {
private static final Logger logger = LoggerFactory.getLogger(WebCrawler.class);
private static final String CRAWLED_URLS_FILE = "crawled_urls.txt";
// These collections are mutated from multiple worker threads, so use thread-safe variants.
private static Set<String> crawledUrls = ConcurrentHashMap.newKeySet();
static Map<String, String> TitleAndUrl = new ConcurrentHashMap<>();
static Map<String, String> SjURLAndZJUrl = new ConcurrentHashMap<>();
static List<String> links = Collections.synchronizedList(new ArrayList<>());
static List<String> NoteNRLinks = Collections.synchronizedList(new ArrayList<>());
static CloseableHttpClient httpClient;
public static void main(String[] args) {
loadCrawledUrls();
int numThreads = 8;
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
connectionManager.setMaxTotal(200);
connectionManager.setDefaultMaxPerRoute(20);
httpClient = HttpClients.custom()
.setConnectionManager(connectionManager)
.build();
ljcx();
createFoldersFromTitles();
List<Future<?>> futures1 = new ArrayList<>();
for (String link : links) {
futures1.add(executor.submit(() -> getWebPageSourceCode(link)));
}
try {
for (Future<?> future : futures1) {
future.get();
}
} catch (Exception e) {
e.printStackTrace();
}
List<Future<?>> futures2 = new ArrayList<>();
for (String ZjNr : NoteNRLinks) {
futures2.add(executor.submit(() -> getXSZJNR(ZjNr)));
}
try {
for (Future<?> future : futures2) {
future.get();
}
} catch (Exception e) {
e.printStackTrace();
}
executor.shutdown();
try {
if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
executor.shutdownNow();
}
} catch (InterruptedException e) {
executor.shutdownNow();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
// Wait for pending asynchronous writes (they run on the common pool, whose threads are daemons) before the JVM exits.
ForkJoinPool.commonPool().awaitQuiescence(60, TimeUnit.SECONDS);
saveCrawledUrls();
}
private static void loadCrawledUrls() {
try (BufferedReader reader = new BufferedReader(new FileReader(CRAWLED_URLS_FILE))) {
String url;
while ((url = reader.readLine()) != null) {
crawledUrls.add(url);
}
} catch (IOException e) {
logger.error("Failed to load crawled URLs from file: " + CRAWLED_URLS_FILE, e);
}
}
private static void saveCrawledUrls() {
try (BufferedWriter writer = new BufferedWriter(new FileWriter(CRAWLED_URLS_FILE))) {
for (String url : crawledUrls) {
writer.write(url);
writer.newLine();
}
} catch (IOException e) {
logger.error("Failed to save crawled URLs to file: " + CRAWLED_URLS_FILE, e);
}
}
private static void recordCrawledUrl(String url) {
crawledUrls.add(url);
}
private static void createFoldersFromTitles() {
for (String title : TitleAndUrl.values()) {
String folderPath = "E:\\小说下载\\" + title;
File folder = new File(folderPath);
if (!folder.exists()) {
if (folder.mkdirs()) {
logger.info("Created folder: " + folderPath);
} else {
logger.error("Failed to create folder: " + folderPath);
}
}
}
}
private static void ljcx() {
String url = "https://www.bqgui.cc/";
try {
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Elements linksOnPage = document.select("a[href]");
for (Element link : linksOnPage) {
String href = link.attr("href");
if (href.contains("/book") && !isHtmlEnding(href)) {
href = "https://www.bqgui.cc" + href;
links.add(href);
String text = link.text();
TitleAndUrl.put(href, text);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static HttpResponse getHttpResponse(String url) throws IOException {
HttpGet httpGet = new HttpGet(url);
HttpResponse response = httpClient.execute(httpGet);
return response;
}
public static void getWebPageSourceCode(String url) {
try {
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Elements linksOnPage = document.select("a[href]");
for (Element link : linksOnPage) {
String href = "https://www.bqgui.cc" + link.attr("href");
if (href.contains(url) && isHtmlEnding(href)) {
NoteNRLinks.add(href);
SjURLAndZJUrl.put(href, url);
logger.info("Added link: " + href);
}
}
} catch (IOException e) {
logger.error("Failed to get web page source code for URL: " + url, e);
}
}
public static void getXSZJNR(String url) {
// Skip chapters already downloaded in a previous run.
if (crawledUrls.contains(url)) {
return;
}
try {
int lastSlashIndex = url.lastIndexOf("/");
int dotHtmlIndex = url.indexOf(".html");
String number = url.substring(lastSlashIndex + 1, dotHtmlIndex);
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Element h1Element = document.select("h1.wap_none").first();
Element divElement = document.select("#chaptercontent").first();
// Guard against pages whose layout differs from the expected chapter structure.
if (h1Element == null || divElement == null) {
logger.warn("Unexpected page structure, skipping: " + url);
return;
}
String h1Text = h1Element.text();
logger.info("Chapter title: " + h1Text);
String htmlContentWithFormat = divElement.html();
String localWJZ = TitleAndUrl.get(SjURLAndZJUrl.get(url));
String folderPath = "E:\\小说下载\\" + localWJZ + "\\" + number + h1Text + ".txt";
writeToFileAsync(folderPath, htmlContentWithFormat)
.thenRun(() -> logger.info("Async write completed: " + folderPath))
.exceptionally(e -> {
logger.error("Async write error: " + e.getMessage(), e);
return null;
});
recordCrawledUrl(url);
} catch (Exception e) {
logger.error("Failed to get novel content for URL: " + url, e);
}
}
private static CompletableFuture<Void> writeToFileAsync(String filePath, String content) {
return CompletableFuture.runAsync(() -> {
// Write with an explicit UTF-8 charset so chapter text is not garbled by the platform default encoding.
try (BufferedWriter writer = new BufferedWriter(
new FileWriter(filePath, java.nio.charset.StandardCharsets.UTF_8))) {
writer.write(content);
logger.info("Content written asynchronously to file: " + filePath);
} catch (IOException e) {
logger.error("Error during asynchronous file write: " + filePath, e);
}
});
}
public static boolean isHtmlEnding(String url) {
return url.toLowerCase().endsWith(".html");
}
}
Code walkthrough
- Import the required libraries:
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.*;
import java.util.*;
import java.util.concurrent.*;
- Class definition and static fields:
public class WebCrawler {
private static final Logger logger = LoggerFactory.getLogger(WebCrawler.class);
private static final String CRAWLED_URLS_FILE = "crawled_urls.txt";
// These collections are mutated from multiple worker threads, so use thread-safe variants.
private static Set<String> crawledUrls = ConcurrentHashMap.newKeySet();
static Map<String, String> TitleAndUrl = new ConcurrentHashMap<>();
static Map<String, String> SjURLAndZJUrl = new ConcurrentHashMap<>();
static List<String> links = Collections.synchronizedList(new ArrayList<>());
static List<String> NoteNRLinks = Collections.synchronizedList(new ArrayList<>());
static CloseableHttpClient httpClient;
- Main method:
public static void main(String[] args) {
loadCrawledUrls();
int numThreads = 8;
ExecutorService executor = Executors.newFixedThreadPool(numThreads);
PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager();
connectionManager.setMaxTotal(200);
connectionManager.setDefaultMaxPerRoute(20);
httpClient = HttpClients.custom()
.setConnectionManager(connectionManager)
.build();
ljcx();
createFoldersFromTitles();
List<Future<?>> futures1 = new ArrayList<>();
for (String link : links) {
futures1.add(executor.submit(() -> getWebPageSourceCode(link)));
}
try {
for (Future<?> future : futures1) {
future.get();
}
} catch (Exception e) {
e.printStackTrace();
}
List<Future<?>> futures2 = new ArrayList<>();
for (String ZjNr : NoteNRLinks) {
futures2.add(executor.submit(() -> getXSZJNR(ZjNr)));
}
try {
for (Future<?> future : futures2) {
future.get();
}
} catch (Exception e) {
e.printStackTrace();
}
executor.shutdown();
try {
if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
executor.shutdownNow();
}
} catch (InterruptedException e) {
executor.shutdownNow();
}
try {
httpClient.close();
} catch (IOException e) {
e.printStackTrace();
}
// Wait for pending asynchronous writes (they run on the common pool, whose threads are daemons) before the JVM exits.
ForkJoinPool.commonPool().awaitQuiescence(60, TimeUnit.SECONDS);
saveCrawledUrls();
}
- Load previously crawled URLs:
private static void loadCrawledUrls() {
try (BufferedReader reader = new BufferedReader(new FileReader(CRAWLED_URLS_FILE))) {
String url;
while ((url = reader.readLine()) != null) {
crawledUrls.add(url);
}
} catch (IOException e) {
logger.error("Failed to load crawled URLs from file: " + CRAWLED_URLS_FILE, e);
}
}
- Save the crawled URLs:
private static void saveCrawledUrls() {
try (BufferedWriter writer = new BufferedWriter(new FileWriter(CRAWLED_URLS_FILE))) {
for (String url : crawledUrls) {
writer.write(url);
writer.newLine();
}
} catch (IOException e) {
logger.error("Failed to save crawled URLs to file: " + CRAWLED_URLS_FILE, e);
}
}
- Record a crawled URL:
private static void recordCrawledUrl(String url) {
crawledUrls.add(url);
}
- Create folders from book titles:
private static void createFoldersFromTitles() {
for (String title : TitleAndUrl.values()) {
String folderPath = "E:\\小说下载\\" + title;
File folder = new File(folderPath);
if (!folder.exists()) {
if (folder.mkdirs()) {
logger.info("Created folder: " + folderPath);
} else {
logger.error("Failed to create folder: " + folderPath);
}
}
}
}
- Collect and store book links:
private static void ljcx() {
String url = "https://www.bqgui.cc/";
try {
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Elements linksOnPage = document.select("a[href]");
for (Element link : linksOnPage) {
String href = link.attr("href");
if (href.contains("/book") && !isHtmlEnding(href)) {
href = "https://www.bqgui.cc" + href;
links.add(href);
String text = link.text();
TitleAndUrl.put(href, text);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
- Get an HTTP response:
private static HttpResponse getHttpResponse(String url) throws IOException {
HttpGet httpGet = new HttpGet(url);
HttpResponse response = httpClient.execute(httpGet);
return response;
}
- Fetch page source and collect chapter links:
public static void getWebPageSourceCode(String url) {
try {
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Elements linksOnPage = document.select("a[href]");
for (Element link : linksOnPage) {
String href = "https://www.bqgui.cc" + link.attr("href");
if (href.contains(url) && isHtmlEnding(href)) {
NoteNRLinks.add(href);
SjURLAndZJUrl.put(href, url);
logger.info("Added link: " + href);
}
}
} catch (IOException e) {
logger.error("Failed to get web page source code for URL: " + url, e);
}
}
- Download chapter content:
public static void getXSZJNR(String url) {
// Skip chapters already downloaded in a previous run.
if (crawledUrls.contains(url)) {
return;
}
try {
int lastSlashIndex = url.lastIndexOf("/");
int dotHtmlIndex = url.indexOf(".html");
String number = url.substring(lastSlashIndex + 1, dotHtmlIndex);
HttpResponse response = getHttpResponse(url);
String content = EntityUtils.toString(response.getEntity());
Document document = Jsoup.parse(content);
Element h1Element = document.select("h1.wap_none").first();
Element divElement = document.select("#chaptercontent").first();
// Guard against pages whose layout differs from the expected chapter structure.
if (h1Element == null || divElement == null) {
logger.warn("Unexpected page structure, skipping: " + url);
return;
}
String h1Text = h1Element.text();
logger.info("Chapter title: " + h1Text);
String htmlContentWithFormat = divElement.html();
String localWJZ = TitleAndUrl.get(SjURLAndZJUrl.get(url));
String folderPath = "E:\\小说下载\\" + localWJZ + "\\" + number + h1Text + ".txt";
writeToFileAsync(folderPath, htmlContentWithFormat)
.thenRun(() -> logger.info("Async write completed: " + folderPath))
.exceptionally(e -> {
logger.error("Async write error: " + e.getMessage(), e);
return null;
});
recordCrawledUrl(url);
} catch (Exception e) {
logger.error("Failed to get novel content for URL: " + url, e);
}
}
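The thenRun/exceptionally chain attached to the write future above can be observed in isolation with a toy task (the class name and the simulated "disk full" failure are invented for illustration):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncChainDemo {
    public static void main(String[] args) {
        AtomicReference<String> status = new AtomicReference<>();
        // Success path: runAsync completes normally, thenRun fires, exceptionally is skipped.
        CompletableFuture.runAsync(() -> { /* pretend to write a file */ })
                .thenRun(() -> status.set("completed"))
                .exceptionally(e -> { status.set("failed"); return null; })
                .join();
        System.out.println(status.get());
        // Failure path: the exception skips thenRun and lands in exceptionally,
        // wrapped in a CompletionException (hence getCause()).
        CompletableFuture.runAsync(() -> { throw new RuntimeException("disk full"); })
                .thenRun(() -> status.set("completed"))
                .exceptionally(e -> { status.set("failed: " + e.getCause().getMessage()); return null; })
                .join();
        System.out.println(status.get());
    }
}
```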
- Asynchronous file write:
private static CompletableFuture<Void> writeToFileAsync(String filePath, String content) {
return CompletableFuture.runAsync(() -> {
// Write with an explicit UTF-8 charset so chapter text is not garbled by the platform default encoding.
try (BufferedWriter writer = new BufferedWriter(
new FileWriter(filePath, java.nio.charset.StandardCharsets.UTF_8))) {
writer.write(content);
logger.info("Content written asynchronously to file: " + filePath);
} catch (IOException e) {
logger.error("Error during asynchronous file write: " + filePath, e);
}
});
}
- Check whether a URL ends with .html:
public static boolean isHtmlEnding(String url) {
return url.toLowerCase().endsWith(".html");
}
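The small pieces of URL logic used throughout the crawler can be exercised on their own (the class name and sample URLs below are illustrative):

```java
public class UrlHelpers {
    // Mirrors WebCrawler.isHtmlEnding: chapter pages end in .html, book index pages do not.
    static boolean isHtmlEnding(String url) {
        return url.toLowerCase().endsWith(".html");
    }

    // Mirrors the chapter-number extraction at the top of getXSZJNR.
    static String chapterNumber(String url) {
        int lastSlash = url.lastIndexOf('/');
        int dotHtml = url.indexOf(".html");
        return url.substring(lastSlash + 1, dotHtml);
    }

    public static void main(String[] args) {
        System.out.println(isHtmlEnding("https://www.bqgui.cc/book/1/5.html")); // chapter page
        System.out.println(isHtmlEnding("https://www.bqgui.cc/book/1/"));       // book index page
        System.out.println(chapterNumber("https://www.bqgui.cc/book/1/5.html"));
    }
}
```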
API notes:
- HttpClients.custom(): creates a customizable HttpClient builder whose settings are supplied through chained calls.
- PoolingHttpClientConnectionManager: manages a pool of HTTP connections so they can be reused, which improves performance.
- HttpGet: represents an HTTP GET request.
- HttpResponse: represents the server's response, including the status code, headers, and body.
- EntityUtils.toString(): converts the entity of an HttpResponse into a string.
- Jsoup.parse(): parses HTML content into a Document for DOM-style manipulation.
- Document.select(): selects a collection of elements from the Document using CSS-style selector syntax.
- Logger: provides logging at several levels (info, error, and so on).
Caveats
- Anti-crawling measures:
- Sites like Biquge usually have anti-bot protection; leave reasonable intervals between requests.
- Disguise requests as browser traffic by setting browser-like request headers.
- Error handling:
- In real use, add handling for failed requests, network errors, and similar conditions.
- Legality and compliance:
- Make sure you comply with the target site's terms of use and with applicable law; do not illegally download or redistribute novel content.
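As a dependency-free sketch of the first two points, the JDK's built-in java.net.http client is used below to show a browser-like User-Agent header plus a randomized delay between requests; with Apache HttpClient as used above, the analogous call is httpGet.setHeader("User-Agent", ...) before executing the request. The User-Agent string and delay values are illustrative:

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class PoliteRequest {
    public static void main(String[] args) throws Exception {
        // Build a request that presents itself as a regular browser.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.bqgui.cc/"))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        System.out.println(request.headers().firstValue("User-Agent").orElse(""));
        // Space requests out with a small randomized delay to avoid tripping anti-bot protection.
        Thread.sleep(500 + (long) (Math.random() * 500));
        System.out.println("ok");
    }
}
```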
With these steps, novels can be downloaded from Biquge efficiently and stored as local files.