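Got something that's making you miserable? Tell us about it so the rest of us can have a laugh!
(This installment has only one section, numbered to continue from the previous one, and it is a long one.)
0x04 Refactor! Refactor!
What is Java's best-known and most notorious sore spot? The enormous JRE? The badly designed Date API? No!
It is, of course, its extremely rigorous and extremely verbose architectural style.
Seen this way, Java is a testing ground for many other languages, programming ideas and program architectures, and at the same time a synthesis of them all, so jokes at its expense are common.
Long story short: after several iterations had finally turned CrawlerTest into a complete mess, I resolved to do the thing every programmer is proudest of (read: gets a splitting headache from): refactoring.
The steps in all of the crawler experiments so far turn out to be very similar (expanded from the first post in this series):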
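1. Push one or more start pages onto the queue (for example http://www.sfls.cn)
2. Take a link off the queue (originally the queue built into ExecutorService, now a TreeSet)
3. Check whether the page has already been crawled (originally MySQL, now an in-memory HashSet)
4. If it has not, insert it into the crawl record so it is not crawled again (now a HashSet)
5. Fetch the page content and parse it (HttpClient + Jsoup)
6. Process the parsed page (for example, print its link and title)
7. Iterate over the <a> tags and extract the absolute URL of each href attribute (Jsoup)
8. Decide whether the URL should be crawled further (for example, whether it is under the given domain)
9. If so, push the URL onto the queue (now a TreeSet)
10. If the queue is not empty, go back to step 2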
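This more detailed list shows that, once the parts shared by every crawler are set aside, the remaining logic falls into three parts:
- The record of crawled URLs (steps 3 and 4), backed by a database or an in-memory HashSet. It prevents pages that were already crawled from being crawled again.
- The queue of pending URLs (steps 2 and 9), backed by the ExecutorService's internal task queue or by a TreeSet (which deduplicates). It tracks pages that are due to be crawled but have not been reached yet.
- The page-handling logic (steps 1, 6 and 8), which decides the crawler's starting points, its path, and what is output for each crawled page.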
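To make the flow concrete, the ten steps can be sketched as a single-threaded loop. This is only an illustration and not the project's code: the `LINKS` map is a made-up link graph standing in for HttpClient + Jsoup, and the class name, the `crawl` method and the `example` URLs are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlLoopSketch {
    // Hypothetical stand-in for the web: page URL -> links found on that page.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("http://a.example/",
                Arrays.asList("http://a.example/1", "http://b.example/"));
        LINKS.put("http://a.example/1",
                Arrays.asList("http://a.example/", "http://a.example/2"));
    }

    static List<String> crawl(String root) {
        Queue<String> queue = new ArrayDeque<>();  // pending-URL queue
        Set<String> record = new HashSet<>();      // crawled-URL record
        List<String> visited = new ArrayList<>();  // what page handling "outputs"
        queue.offer(root);                                     // step 1
        String url;
        while ((url = queue.poll()) != null) {                 // steps 2 and 10
            if (!record.add(url)) continue;                    // steps 3 and 4
            List<String> links = LINKS.getOrDefault(
                    url, Collections.<String>emptyList());     // step 5 (fake fetch)
            visited.add(url);                                  // step 6
            for (String link : links) {                        // step 7
                if (link.startsWith("http://a.example/"))      // step 8
                    queue.offer(link);                         // step 9
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // prints [http://a.example/, http://a.example/1, http://a.example/2]
        System.out.println(crawl("http://a.example/"));
    }
}
```

The three parts named above map onto `record`, `queue`, and the pieces marked as steps 1, 6 and 8; everything else is the shared skeleton.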
Interface design
We can therefore extract three interfaces:
ICrawlerRecord.java:
package com.std4453.crawlerlab.crawler;
/**
*
*/
public interface ICrawlerRecord {
/**
* @param url url to check
* @return whether the url is already crawled
*/
boolean pageCrawled(String url);
/**
* Insert the given url to the crawler's Record
* @param url url to insert
*/
void insertPage(String url);
}
ITaskQueue.java:
package com.std4453.crawlerlab.crawler;
/**
*
*/
public interface ITaskQueue {
/**
* Submit page to task queue, returns {@code true} if page is successfully
* submitted, which will control the increment of {@code submittedCounter}
*
* @param url
* url to submit
*
* @return whether the page is successfully submitted
*/
boolean submitPage(String url);
/**
* Retrieve ( and remove ) a url from the task queue, returns {@code null} if the
* queue is empty.
*
* @return url from task queue
*/
String pollTask();
}
ICrawlerRules.java:
package com.std4453.crawlerlab.crawler;
import org.jsoup.nodes.Document;
import java.util.List;
/**
*
*/
public interface ICrawlerRules {
/**
* @return the pages from which the crawler starts.
*/
List<String> rootPages();
/**
* Called when crawler reaches the given page.
*
* @param url
* address of the given page.
* @param doc
* parsed document of the page.
*/
void pageCrawled(String url, Document doc);
/**
* Whether the crawler should continue from source page (specified by {@code from})
* to target page (specified by {@code to}).
*
* @param from
* absolute url of source page
* @param to
* absolute url of target page
*
* @return whether the crawler should continue
*/
boolean shouldContinue(String from, String to);
}
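Notice also that all three parts need some degree of setup and cleanup (opening and closing database connections, opening and closing files, and so on). So we add one event before the crawl starts and one after it ends, extract the pair into an interface of its own, and have the other three interfaces extend it: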
ICrawlerStateListener:
package com.std4453.crawlerlab.crawler;
/**
*
*/
public interface ICrawlerStateListener {
/**
* Called before the actual crawl starts
*/
void beforeCrawl();
/**
* Called after the whole crawl has stopped
*/
void afterCrawl();
}
public interface ICrawlerRecord extends ICrawlerStateListener {
public interface ITaskQueue extends ICrawlerStateListener {
public interface ICrawlerRules extends ICrawlerStateListener {
With these extracted, the old CrawlerTest can climb out of its ball-of-mud swamp, so it naturally gets a new name as well, GenericCrawler:
public class GenericCrawler {
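Decoupling CrawlerMonitor
The next step is to give the crawler's monitor a proper standing. The old CrawlerMonitor was coupled quite tightly to CrawlerTest (now GenericCrawler); a listener for crawler events can decouple the two: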
ICrawlerActionListener:
package com.std4453.crawlerlab.crawler;
/**
*
*/
public interface ICrawlerActionListener {
/**
* Called when a new page is successfully submitted to the task queue.
*
* @param url
* the submitted url
*/
void onSubmit(String url);
/**
* Called when parsing of a page is completed.
*
* @param url
* url of the completed page
*/
void onComplete(String url);
/**
* Called when a new distinct page is found ( before being inserted into the
* crawler's record ).
*
* @param url
* url of the new distinct page
*/
void onDistinctPage(String url);
}
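To show what the decoupling buys, here is a sketch of a second listener that could sit next to the monitor. It is not part of the original project: `HostCountListener`, its `count` method and the `/news` path are made up for illustration, and the interface is repeated only so that the snippet compiles on its own.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

class HostCountDemo {
    public static void main(String[] args) {
        HostCountListener listener = new HostCountListener();
        listener.onDistinctPage("http://www.sfls.cn/");
        listener.onDistinctPage("http://www.sfls.cn/news");
        listener.onDistinctPage("not a url");
        System.out.println(listener.count("www.sfls.cn")); // 2
        System.out.println(listener.count("(unknown)"));   // 1
    }
}

// Same three methods as the interface above, repeated for self-containment.
interface ICrawlerActionListener {
    void onSubmit(String url);
    void onComplete(String url);
    void onDistinctPage(String url);
}

// Hypothetical listener: counts distinct pages per host.
class HostCountListener implements ICrawlerActionListener {
    // Worker threads call the listener concurrently, hence the concurrent map.
    private final Map<String, LongAdder> perHost = new ConcurrentHashMap<>();

    @Override public void onSubmit(String url) { }

    @Override public void onComplete(String url) { }

    @Override
    public void onDistinctPage(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null; // malformed URL
        }
        String key = host == null ? "(unknown)" : host;
        perHost.computeIfAbsent(key, h -> new LongAdder()).increment();
    }

    long count(String host) {
        LongAdder adder = perHost.get(host);
        return adder == null ? 0 : adder.sum();
    }
}
```

Because the crawler only knows the interface, a listener like this could be attached without touching the crawler itself.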
CrawlerMonitor can also shed its servant status as an inner class and become a proper public class in its own right:
package com.std4453.crawlerlab.crawler;
import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
/**
* A monitor that regularly reports the state of the crawler.
*/
public class CrawlerMonitor implements ICrawlerActionListener, ICrawlerStateListener {
private class MonitorReportTask implements Runnable {
@Override
public void run() {
int reportId = CrawlerMonitor.this.monitorReportCounter.incrementAndGet();
int[] time = CrawlerMonitor.getTime();
System.out.printf("[%02d:%02d:%02d] Monitor report #%d:\n",
time[0], time[1], time[2], reportId);
// memory usage
Runtime runtime = Runtime.getRuntime();
long freeMem = runtime.freeMemory();
long totalMem = runtime.totalMemory();
long usedMem = totalMem - freeMem;
double usedMemMB = usedMem / 1048576d;
double totalMemMB = totalMem / 1048576d;
double usedMemPercentage = usedMemMB / totalMemMB * 100;
System.out.printf("Memory usage: %.2fM used (%.1f%% of %.2fM total)\n",
usedMemMB, usedMemPercentage, totalMemMB);
// crawler counters
long submitted = CrawlerMonitor.this.submittedCounter.get();
long completed = CrawlerMonitor.this.completedCounter.get();
long waiting = submitted - completed;
long distinct = CrawlerMonitor.this.distinctCounter.get();
System.out.printf("Crawler counters: %d distinct, " +
"%d submitted, %d completed, %d waiting\n",
distinct, submitted, completed, waiting);
}
}
private AtomicLong submittedCounter, completedCounter, distinctCounter;
private AtomicInteger monitorReportCounter;
private ScheduledExecutorService monitorService;
@Override
public void beforeCrawl() {
this.submittedCounter = new AtomicLong();
this.completedCounter = new AtomicLong();
this.distinctCounter = new AtomicLong();
this.monitorReportCounter = new AtomicInteger();
this.monitorService = Executors.newSingleThreadScheduledExecutor();
this.monitorService.scheduleAtFixedRate(new MonitorReportTask(),
5, 30, TimeUnit.SECONDS);
int[] time = CrawlerMonitor.getTime();
System.out.printf("[%02d:%02d:%02d] Crawler monitor started.\n",
time[0], time[1], time[2]);
}
@Override
public void afterCrawl() {
this.monitorService.shutdown();
try {
this.monitorService.awaitTermination(10, TimeUnit.SECONDS);
} catch (InterruptedException e) {
// restore the interrupt flag so callers can still observe it
Thread.currentThread().interrupt();
}
int[] time = CrawlerMonitor.getTime();
System.out.printf("[%02d:%02d:%02d] Crawler monitor terminated.\n",
time[0], time[1], time[2]);
long completed = this.completedCounter.get();
long distinct = this.distinctCounter.get();
System.out.printf("Total crawler counters: %d distinct, %d completed\n",
distinct, completed);
this.submittedCounter = null;
this.completedCounter = null;
this.distinctCounter = null;
this.monitorReportCounter = null;
}
@Override
public void onSubmit(String url) {
this.submittedCounter.incrementAndGet();
}
@Override
public void onComplete(String url) {
this.completedCounter.incrementAndGet();
}
@Override
public void onDistinctPage(String url) {
this.distinctCounter.incrementAndGet();
}
private static int[] getTime() {
Calendar calendar = Calendar.getInstance();
int hours = calendar.get(Calendar.HOUR_OF_DAY);
int minutes = calendar.get(Calendar.MINUTE);
int seconds = calendar.get(Calendar.SECOND);
return new int[]{hours, minutes, seconds};
}
}
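Since the Date API came in for criticism at the top of this post: on Java 8 and later, the `getTime()` helper could arguably be replaced with `java.time`. A minimal sketch, with `TimestampSketch` and `stamp` as made-up names; the monitor above does not use this.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

class TimestampSketch {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is fine.
    private static final DateTimeFormatter CLOCK =
            DateTimeFormatter.ofPattern("HH:mm:ss");

    // Formats a time as the "[HH:mm:ss]" prefix used in the monitor's output.
    static String stamp(LocalTime time) {
        return "[" + time.format(CLOCK) + "]";
    }

    public static void main(String[] args) {
        System.out.println(stamp(LocalTime.now()) + " Crawler monitor started.");
        System.out.println(stamp(LocalTime.of(9, 5, 3))); // [09:05:03]
    }
}
```

This would remove the `int[]` round trip and the three `%02d` placeholders from every `printf` call.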
ICrawlerRecord and ITaskQueue implementations
Now for the implementations of ICrawlerRecord and ITaskQueue.
The earliest, database-backed version, CrawlerRecordDB.java:
package com.std4453.crawlerlab.crawler;
import com.std4453.crawlerlab.concurrent.PreparedStatementFactory;
import com.std4453.crawlerlab.db.DB;
import org.apache.commons.pool2.impl.GenericObjectPool;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
/**
* Implementation of {@link ICrawlerRecord} using a database.
*/
public class CrawlerRecordDB implements ICrawlerRecord {
private DB db;
private GenericObjectPool<PreparedStatement> selectStmtPool;
private GenericObjectPool<PreparedStatement> insertStmtPool;
public CrawlerRecordDB(DB db) {
this.db = db;
}
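@Override
public boolean pageCrawled(String url) {
PreparedStatement selectStmt = null;
try {
selectStmt = this.selectStmtPool.borrowObject();
selectStmt.setString(1, url);
// try-with-resources closes the ResultSet even if next() throws
try (ResultSet result = selectStmt.executeQuery()) {
return result.next();
}
} catch (Exception e) {
e.printStackTrace();
// on failure, report the page as crawled so it is skipped
return true;
} finally {
// always return the statement, otherwise the pool leaks on errors
if (selectStmt != null) this.selectStmtPool.returnObject(selectStmt);
}
}
@Override
public void insertPage(String url) {
PreparedStatement insertStmt = null;
try {
insertStmt = this.insertStmtPool.borrowObject();
insertStmt.setString(1, url);
insertStmt.execute();
} catch (Exception e) {
e.printStackTrace();
} finally {
// always return the statement, otherwise the pool leaks on errors
if (insertStmt != null) this.insertStmtPool.returnObject(insertStmt);
}
}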
@Override
public void beforeCrawl() {
try {
this.db.runSQL2("TRUNCATE Record;");
String selectSQL = "SELECT * FROM Record WHERE URL = ?;";
PreparedStatementFactory selectFactory =
new PreparedStatementFactory(this.db, selectSQL);
this.selectStmtPool = new GenericObjectPool<>(selectFactory);
String insertSQL = "INSERT INTO Record (URL) VALUES (?);";
PreparedStatementFactory insertFactory =
new PreparedStatementFactory(this.db, insertSQL);
this.insertStmtPool = new GenericObjectPool<>(insertFactory);
System.out.println("DB initialized.");
} catch (SQLException e) {
e.printStackTrace();
}
}
@Override
public void afterCrawl() {
this.db.close();
}
}
The HashSet version currently in use, CrawlerRecordSet.java:
package com.std4453.crawlerlab.crawler;
import java.util.HashSet;
import java.util.Set;
/**
* Implementation of {@link ICrawlerRecord} using a {@link HashSet} as storage.
*/
public class CrawlerRecordSet implements ICrawlerRecord {
private Set<String> urlRecord;
@Override
public void beforeCrawl() {
this.urlRecord = new HashSet<>();
}
@Override
public void afterCrawl() {
this.urlRecord = null;
}
@Override
public synchronized boolean pageCrawled(String url) {
return this.urlRecord.contains(url);
}
@Override
public synchronized void insertPage(String url) {
this.urlRecord.add(url);
}
}
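One thing worth noting about this interface: `pageCrawled` and `insertPage` are each synchronized, but a caller that invokes one after the other is not atomic as a whole, so two workers could in principle both see "not crawled" for the same URL. A possible alternative, which is not what the project does, is a single check-and-insert operation on a concurrent set. `AtomicRecordSketch`, `markCrawled` and `race` are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class AtomicRecordSketch {
    // Concurrent set view backed by ConcurrentHashMap (Java 8+).
    private final Set<String> urls = ConcurrentHashMap.newKeySet();

    /** Returns true only for the first caller that marks this URL. */
    boolean markCrawled(String url) {
        return urls.add(url); // add() is atomic: check and insert in one step
    }

    // Lets several threads race to mark the same URL; returns how many "won".
    static int race(int threadCount) throws InterruptedException {
        AtomicRecordSketch record = new AtomicRecordSketch();
        AtomicInteger winners = new AtomicInteger();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threads.length; ++i) {
            threads[i] = new Thread(() -> {
                if (record.markCrawled("http://www.sfls.cn/"))
                    winners.incrementAndGet();
            });
        }
        for (Thread thread : threads) thread.start();
        for (Thread thread : threads) thread.join();
        return winners.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("winners = " + race(8)); // winners = 1
    }
}
```

Adopting this would mean changing the ICrawlerRecord interface, so treat it as a design note and not as a drop-in replacement.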
The fallback ArrayDeque version, TaskQueueQueue.java:
package com.std4453.crawlerlab.crawler;
import java.util.ArrayDeque;
import java.util.Queue;
/**
* Implementation of {@link ITaskQueue} using an {@link ArrayDeque} as storage.
*/
public class TaskQueueQueue implements ITaskQueue {
private Queue<String> queue;
@Override
public boolean submitPage(String url) {
synchronized (this) {
this.queue.offer(url);
return true;
}
}
@Override
public String pollTask() {
synchronized (this) {
return this.queue.poll();
}
}
@Override
public void beforeCrawl() {
this.queue = new ArrayDeque<>();
}
@Override
public void afterCrawl() {
this.queue = null;
}
}
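The TreeSet version currently in use, TaskQueueSet.java, deduplicates tasks at the moment they are submitted, which keeps fewer tasks piled up in the queue and lowers both time and space costs: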
package com.std4453.crawlerlab.crawler;
import java.util.SortedSet;
import java.util.TreeSet;
/**
* Implementation of {@link ITaskQueue} using {@link TreeSet} as storage.
*/
public class TaskQueueSet implements ITaskQueue {
private SortedSet<String> data;
@Override
public void beforeCrawl() {
this.data = new TreeSet<>();
}
@Override
public void afterCrawl() {
this.data = null;
}
@Override
public synchronized boolean submitPage(String url) {
return this.data.add(url);
}
@Override
public synchronized String pollTask() {
if (this.data.isEmpty()) return null;
String first = this.data.first();
this.data.remove(first);
return first;
}
}
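A small demonstration of the two TreeSet properties this class relies on: `add` returns `false` for an element that is already present, and elements come out in sorted (lexicographic) order, not in submission order. The class name, the `submitThenDrain` helper and the `/a` and `/b` paths are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class TreeSetQueueDemo {
    // Submits every URL, then polls until empty; returns the polling order.
    static List<String> submitThenDrain(String... urls) {
        TreeSet<String> data = new TreeSet<>();
        for (String url : urls) {
            boolean added = data.add(url); // false when the URL is already queued
            System.out.println("submit " + url + " -> " + added);
        }
        List<String> order = new ArrayList<>();
        String next;
        // pollFirst() removes and returns the smallest element, or null if empty
        while ((next = data.pollFirst()) != null) order.add(next);
        return order;
    }

    public static void main(String[] args) {
        // prints [http://www.sfls.cn/a, http://www.sfls.cn/b]
        System.out.println(submitThenDrain(
                "http://www.sfls.cn/b",
                "http://www.sfls.cn/a",
                "http://www.sfls.cn/b"));
    }
}
```

So with TaskQueueSet the crawl order is alphabetical by URL, unlike the FIFO order of the ArrayDeque version.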
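The CrawlerConfig class
Most crawlers share the same ICrawlerRecord and ITaskQueue (these mainly determine performance and should be iterated on internally); only the ICrawlerRules differ. So the chosen ICrawlerRecord and ITaskQueue, along with a few other settings, are gathered into a configuration class called CrawlerConfig, and a GenericCrawler is constructed from a CrawlerConfig instance plus its own ICrawlerRules: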
package com.std4453.crawlerlab.crawler;
/**
* Config class for {@link GenericCrawler}
*/
public class CrawlerConfig {
public int numThreads = 16;
public boolean startMonitorThread = true;
public ICrawlerRecord crawlerRecord;
public ITaskQueue taskQueue;
public CrawlerConfig(ICrawlerRecord crawlerRecord, ITaskQueue taskQueue) {
this.crawlerRecord = crawlerRecord;
this.taskQueue = taskQueue;
}
public void setNumThreads(int numThreads) {
this.numThreads = numThreads;
}
public void setStartMonitorThread(boolean startMonitorThread) {
this.startMonitorThread = startMonitorThread;
}
}
A static factory method then supplies the default ITaskQueue and ICrawlerRecord:
public static CrawlerConfig defaultConfig() {
return new CrawlerConfig(new CrawlerRecordSet(), new TaskQueueSet());
}
Applied to GenericCrawler.java, which also reflects many of the earlier changes (some inessential parts are omitted, and comments have been added):
package com.std4453.crawlerlab.crawler;
// imports omitted
public class GenericCrawler {
private class CrawlerThread extends Thread {
// omitted, same as in the previous post
}
private ICrawlerRules rules;
private CrawlerConfig config; // configuration
private InverseSemaphore semaphore; // task-completion counter (see part two of this series)
// event listener lists
private List<ICrawlerStateListener> stateListeners;
private List<ICrawlerActionListener> actionListeners;
public GenericCrawler(ICrawlerRules rules, CrawlerConfig config) {
this.rules = rules;
this.config = config == null ? CrawlerConfig.defaultConfig() : config;
this.semaphore = new InverseSemaphore();
this.actionListeners = new Vector<>();
this.stateListeners = new Vector<>();
// register the internal event listeners
this.listen(this.config.crawlerRecord);
this.listen(this.config.taskQueue);
this.listen(this.rules);
if (this.config.startMonitorThread) {
CrawlerMonitor monitor = new CrawlerMonitor();
this.listen((ICrawlerActionListener) monitor);
this.listen((ICrawlerStateListener) monitor);
}
}
public GenericCrawler(ICrawlerRules rules) {
// use the default config
this(rules, CrawlerConfig.defaultConfig());
}
public void run() throws InterruptedException {
this.stateListeners.forEach(ICrawlerStateListener::beforeCrawl);
// the pile of monitor setup code that used to live here is gone after decoupling
CrawlerThread[] workers = new CrawlerThread[this.config.numThreads];
for (int i = 0; i < workers.length; ++i)
workers[i] = new CrawlerThread();
this.rules.rootPages().forEach(this::submit);
for (Thread thread : workers) thread.start();
System.out.println("Crawler started.");
// await() on the semaphore counter beats polling with sleep() by a mile
this.semaphore.await();
for (CrawlerThread thread : workers) thread.shutdown();
for (Thread thread : workers) thread.join();
this.stateListeners.forEach(ICrawlerStateListener::afterCrawl);
System.out.println("Crawler terminated.");
}
// also usable from outside to add event listeners
public void listen(ICrawlerActionListener actionListener) {
this.actionListeners.add(actionListener);
}
public void listen(ICrawlerStateListener stateListener) {
this.stateListeners.add(stateListener);
}
private void processPage(String url) {
// relies heavily on the interface methods
try {
if (this.config.crawlerRecord.pageCrawled(url)) return;
for (ICrawlerActionListener actionListener : this.actionListeners)
actionListener.onDistinctPage(url);
this.config.crawlerRecord.insertPage(url);
// fetch page
Document doc = this.parse(url);
this.rules.pageCrawled(url, doc);
// crawl
Elements links = doc.select("a[href]");
for (Element link : links) {
String href = link.attr("abs:href");
if (this.rules.shouldContinue(url, href))
this.submit(href);
}
} catch (IOException | IllegalArgumentException e) {
System.err.println("Unable to fetch url: " + url + " - " + e.getMessage());
} catch (Exception e) {
e.printStackTrace();
} finally {
for (ICrawlerActionListener actionListener : this.actionListeners)
actionListener.onComplete(url);
this.semaphore.onComplete();
}
}
private Document parse(String url) throws IOException {
// same as before, omitted
}
private void submit(final String url) {
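// Before submitting, check whether the page has already been crawled.
// Most of the program's time goes to the network, so this step costs
// very little, and a few extra database SELECTs are a fair price for a
// shorter queue and lower memory use. Besides, the CrawlerRecordSet in
// use now is backed by a HashSet, where string membership checks are
// very fast, so there is even less to worry about.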
boolean succeed = !this.config.crawlerRecord.pageCrawled(url) &&
this.config.taskQueue.submitPage(url);
if (succeed) {
for (ICrawlerActionListener actionListener : this.actionListeners)
actionListener.onSubmit(url);
this.semaphore.onSubmit();
}
}
}
Here is InverseSemaphore, slightly modified from the original:
package com.std4453.crawlerlab.concurrent;
/**
*
*/
public class InverseSemaphore {
private int value = 0;
private final Object lock = new Object();
public InverseSemaphore() {
}
public void onSubmit() {
synchronized (lock) {
++value;
}
}
public void onComplete() {
synchronized (lock) {
--value;
if (value == 0)
lock.notifyAll();
}
}
public void await() throws InterruptedException {
synchronized (lock) {
while (value > 0) lock.wait();
}
}
}
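To illustrate how the counter is meant to be used, here is a sketch with a plain thread pool in place of the crawler's worker threads. `InverseSemaphoreDemo` and `runTasks` are hypothetical names, and the class body is repeated from the listing above only so that the snippet compiles on its own.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class InverseSemaphoreDemo {
    // Runs taskCount trivial tasks and returns how many finished before await() returned.
    static int runTasks(int taskCount) throws InterruptedException {
        InverseSemaphore semaphore = new InverseSemaphore();
        AtomicInteger done = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < taskCount; ++i) {
                semaphore.onSubmit();             // count the task before it can finish
                pool.execute(() -> {
                    try {
                        done.incrementAndGet();   // stand-in for processPage()
                    } finally {
                        semaphore.onComplete();   // always balance onSubmit()
                    }
                });
            }
            semaphore.await();                    // blocks until the count is back to 0
            return done.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runTasks(100)); // 100
    }
}

// Same logic as the listing above.
class InverseSemaphore {
    private int value = 0;
    private final Object lock = new Object();

    public void onSubmit() {
        synchronized (lock) { ++value; }
    }

    public void onComplete() {
        synchronized (lock) {
            --value;
            if (value == 0) lock.notifyAll();
        }
    }

    public void await() throws InterruptedException {
        synchronized (lock) {
            while (value > 0) lock.wait();
        }
    }
}
```

The important detail is the ordering: `onSubmit()` runs before the task is handed over, and `onComplete()` sits in a `finally` block, which mirrors what `submit` and `processPage` do in GenericCrawler.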
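That more or less wraps up this round of refactoring. Here is a look at all the packages and classes so far:
[Image: listing of the project's packages and classes]
Test code
The program entry point and test class, CrawlerTest.java: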
package com.std4453.crawlerlab.tests;
import com.std4453.crawlerlab.crawler.GenericCrawler;
import com.std4453.crawlerlab.crawler.ICrawlerRules;
import org.jsoup.nodes.Document;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
/**
*
*/
public class CrawlerTest {
public static void main(String[] args) throws Exception {
ICrawlerRules rules = new ICrawlerRules() {
private PrintWriter outputWriter;
@Override
public List<String> rootPages() {
List<String> list = new ArrayList<>();
list.add("http://www.sfls.cn/");
return list;
}
@Override
public void pageCrawled(String url, Document doc) {
this.outputWriter.println(url + " | " + doc.title());
}
@Override
public boolean shouldContinue(String from, String to) {
try {
return new URL(to).getHost().contains("www.sfls.cn");
} catch (Exception e) {
return false;
}
}
@Override
public void beforeCrawl() {
try {
File outputFile = new File("output.txt");
FileOutputStream outputStream = new FileOutputStream(outputFile);
this.outputWriter = new PrintWriter(outputStream);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
@Override
public void afterCrawl() {
this.outputWriter.close();
}
};
GenericCrawler crawler = new GenericCrawler(rules);
crawler.run();
}
}
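A note on `shouldContinue`: `getHost().contains("www.sfls.cn")` also accepts any host that merely contains that string. If a stricter rule is wanted, an exact comparison is one option. The sketch below is an assumption about what "stricter" could look like and not the author's code; `HostRuleSketch`, `sameHost` and the `evil.example` host are made up.

```java
import java.net.URI;

class HostRuleSketch {
    // True only when the URL parses and its host equals expectedHost exactly.
    static boolean sameHost(String url, String expectedHost) {
        try {
            String host = URI.create(url).getHost();
            // getHost() is null for opaque URIs such as mailto: links
            return host != null && host.equalsIgnoreCase(expectedHost);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
    }

    public static void main(String[] args) {
        String target = "www.sfls.cn";
        System.out.println(sameHost("http://www.sfls.cn/index.html", target));     // true
        System.out.println(sameHost("http://www.sfls.cn.evil.example/", target));  // false
        System.out.println(sameHost("mailto:someone@example.com", target));        // false
        System.out.println(sameHost("not a url", target));                         // false
    }
}
```

Whether subdomains should count is a policy decision for the rules object, which is exactly the kind of choice ICrawlerRules is meant to isolate.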
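The code is simple, so I have left out comments. The output looks just like what was shown at the end of yesterday's post, with nothing that needs special explanation; feel free to try it on your own machine.
Summary
This installment has only one section because refactoring is painstaking work: there is a lot of code and the post ran long, so one section is all there is. Tomorrow's post will probably cover further optimization and bug fixing, a bit of a grab bag, so thanks in advance for your support.
(I have plans this evening, so I am stopping here for the afternoon. Please forgive me.)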