Java Web Crawler Development Notes (4)


Anything making you unhappy? Say it out loud so everyone can have a good laugh!

(This installment has only one section, its numbering continuing from the previous post, and it is rather long.)

0x04 Refactor! Refactor!

What is Java's best-known and most notorious flaw? The gigantic JRE? The badly designed Date API? No!
It is, of course, its immensely rigorous yet immensely verbose architectural style.
Seen from this angle, Java is a testing ground for many other languages, programming paradigms, and program architectures, and at the same time an amalgam of them all, which is why you often see rants like this:

(meme image: "no explanation needed")

Alright, long story short: after several iterations finally succeeded in turning CrawlerTest into a complete mess, I resolutely decided to do the thing every code monkey takes the most pride (read: splitting headaches) in: refactoring.

Notice that the steps of all the crawler experiments so far have been very similar (this list expands on the one from the first post in this series; a runnable sketch follows the list):

  1. Push one or more start pages into the queue (e.g. http://www.sfls.cn)
  2. Take a link out of the queue (originally the ExecutorService's built-in queue, now a TreeSet)
  3. Check whether the page has already been crawled (originally MySQL, now an in-memory HashSet)
  4. If not crawled yet, insert it into the crawl record so it is not crawled twice (now a HashSet)
  5. Fetch the page content and parse it (HttpClient + Jsoup)
  6. Process the parsed page (e.g. print link + title)
  7. Iterate over the <a> tags and extract the absolute url from each href attribute (Jsoup)
  8. Decide whether the url should be crawled further (e.g. whether it is under a given domain)
  9. If so, push the url into the queue (now a TreeSet)
  10. If the queue is not empty, go to step 2
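Before diving into the refactor, here is a minimal single-threaded sketch of these ten steps, for orientation only: it uses Jsoup's built-in fetcher instead of HttpClient for brevity, the class name and domain filter are placeholders, and the real GenericCrawler below is multithreaded and interface-driven.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class MiniCrawler {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(); // steps 2 & 9: urls to crawl
        Set<String> record = new HashSet<>();     // steps 3 & 4: crawled urls
        queue.offer("http://www.sfls.cn/");       // step 1: start page
        String url;
        while ((url = queue.poll()) != null) {    // steps 2 & 10
            if (!record.add(url)) continue;       // steps 3 & 4 in one call
            try {
                Document doc = Jsoup.connect(url).get();        // step 5
                System.out.println(url + " | " + doc.title());  // step 6
                for (Element link : doc.select("a[href]")) {    // step 7
                    String href = link.attr("abs:href");
                    if (href.contains("www.sfls.cn"))           // step 8
                        queue.offer(href);                      // step 9
                }
            } catch (Exception e) {
                System.err.println("Unable to fetch url: " + url);
            }
        }
    }
}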

This more detailed list shows that, apart from the parts every crawler shares, the remaining execution logic splits into three parts:

  1. The record of crawled urls (steps 3 & 4): a database or an in-memory HashSet; it keeps already-crawled pages from being crawled again.
  2. The queue of urls to crawl (steps 2 & 9): the ExecutorService's internal task queue or a TreeSet (deduplicated); it tracks pages that are scheduled but not yet crawled.
  3. The page-handling logic (steps 1, 6 & 8): it determines the crawler's starting points, its path, and the output produced from each crawled page.

Interface design

We can therefore extract three interfaces:

ICrawlerRecord.java:

package com.std4453.crawlerlab.crawler;

/**
 * Record of already-crawled urls; keeps pages from being crawled twice.
 */
public interface ICrawlerRecord {
    /**
     * @param url url to check
     * @return whether the url is already crawled
     */
    boolean pageCrawled(String url);

    /**
     * Inserts the given url into the crawler's record.
     *
     * @param url url to insert
     */
    void insertPage(String url);
}

ITaskQueue.java:

package com.std4453.crawlerlab.crawler;

/**
 * Queue of urls that have been discovered but not yet crawled.
 */
public interface ITaskQueue {
    /**
     * Submits a page to the task queue, returning {@code true} if the page was
     * successfully submitted; this controls whether {@code submittedCounter} is
     * incremented.
     *
     * @param url
     *      url to submit
     *
     * @return whether the page is successfully submitted
     */
    boolean submitPage(String url);

    /**
     * Retrieves (and removes) a url from the task queue, returning {@code null}
     * if the queue is empty.
     *
     * @return url from task queue
     */
    String pollTask();
}

ICrawlerRules.java:

package com.std4453.crawlerlab.crawler;

import org.jsoup.nodes.Document;

import java.util.List;

/**
 * Page-handling logic of a crawler: where it starts, which links it follows,
 * and what it does with each crawled page.
 */
public interface ICrawlerRules {
    /**
     * @return the pages from which the crawler starts.
     */
    List<String> rootPages();

    /**
     * Called when crawler reaches the given page.
     *
     * @param url
     *      address of the given page.
     * @param doc
     *      parsed document of the page.
     */
    void pageCrawled(String url, Document doc);

    /**
     * Whether the crawler should continue from source page (specified by {@code from})
     * to target page (specified by {@code to}).
     *
     * @param from
     *      absolute url of source page
     * @param to
     *      absolute url of target page
     *
     * @return whether the crawler should continue
     */
    boolean shouldContinue(String from, String to);
}

Noting also that all three parts need some degree of setup and teardown (opening/closing a database connection, opening/closing files, and so on), we add one hook before the crawl starts and one after it ends, extract them into a fourth interface, and let the other three interfaces extend it:
ICrawlerStateListener.java:

package com.std4453.crawlerlab.crawler;

/**
 * Lifecycle hooks of the crawler, letting implementations set up and tear down
 * their environment (database connections, files, etc.).
 */
public interface ICrawlerStateListener {
    /**
     * Called before the actual crawl starts.
     */
    void beforeCrawl();

    /**
     * Called after the whole crawl has stopped.
     */
    void afterCrawl();
}
public interface ICrawlerRecord extends ICrawlerStateListener {
public interface ITaskQueue extends ICrawlerStateListener {
public interface ICrawlerRules extends ICrawlerStateListener {

With these extracted, the old CrawlerTest can climb out of the big-ball-of-mud swamp, and naturally deserves a new name: GenericCrawler:

public class GenericCrawler {

Decoupling CrawlerMonitor

The next step is to give the crawler's monitor its rightful place. The old CrawlerMonitor was tightly coupled to CrawlerTest (now GenericCrawler); a listener interface for crawler events breaks that coupling:
ICrawlerActionListener.java:

package com.std4453.crawlerlab.crawler;

/**
 * Listener for the crawler's per-page events.
 */
public interface ICrawlerActionListener {
    /**
     * Called when a new page is successfully submitted to the task queue.
     *
     * @param url
     *      the submitted url
     */
    void onSubmit(String url);

    /**
     * Called when parsing of a page is completed.
     *
     * @param url
     *      url of the completed page
     */
    void onComplete(String url);

    /**
     * Called when a new distinct page is found (before being inserted into the
     * crawler's record).
     *
     * @param url
     *      url of the new distinct page
     */
    void onDistinctPage(String url);
}

CrawlerMonitor can now shed its servant status as an inner class and become a proper public class:

package com.std4453.crawlerlab.crawler;

import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * A monitor that regularly reports the state of the crawler.
 */
public class CrawlerMonitor implements ICrawlerActionListener, ICrawlerStateListener {
    private class MonitorReportTask implements Runnable {
        @Override
        public void run() {
            int reportId = CrawlerMonitor.this.monitorReportCounter.incrementAndGet();
            int[] time = CrawlerMonitor.getTime();
            System.out.printf("[%02d:%02d:%02d] Monitor report #%d:\n",
                    time[0], time[1], time[2], reportId);

            // memory usage
            Runtime runtime = Runtime.getRuntime();
            long freeMem = runtime.freeMemory();
            long totalMem = runtime.totalMemory();
            long usedMem = totalMem - freeMem;
            double usedMemMB = usedMem / 1048576d;
            double totalMemMB = totalMem / 1048576d;
            double usedMemPercentage = usedMemMB / totalMemMB * 100;
            System.out.printf("Memory usage: %.2fM used (%.1f%% of %.2fM total)\n",
                    usedMemMB, usedMemPercentage, totalMemMB);

            // crawler counters
            long submitted = CrawlerMonitor.this.submittedCounter.get();
            long completed = CrawlerMonitor.this.completedCounter.get();
            long waiting = submitted - completed;
            long distinct = CrawlerMonitor.this.distinctCounter.get();
            System.out.printf("Crawler counters: %d distinct, " +
                            "%d submitted, %d completed, %d waiting\n",
                    distinct, submitted, completed, waiting);
        }
    }

    private AtomicLong submittedCounter, completedCounter, distinctCounter;
    private AtomicInteger monitorReportCounter;
    private ScheduledExecutorService monitorService;

    @Override
    public void beforeCrawl() {
        this.submittedCounter = new AtomicLong();
        this.completedCounter = new AtomicLong();
        this.distinctCounter = new AtomicLong();
        this.monitorReportCounter = new AtomicInteger();

        this.monitorService = Executors.newSingleThreadScheduledExecutor();
        this.monitorService.scheduleAtFixedRate(new MonitorReportTask(),
                5, 30, TimeUnit.SECONDS);

        int[] time = CrawlerMonitor.getTime();
        System.out.printf("[%02d:%02d:%02d] Crawler monitor started.\n",
                time[0], time[1], time[2]);
    }

    @Override
    public void afterCrawl() {
        this.monitorService.shutdown();
        try {
            this.monitorService.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ignored) {
        }

        int[] time = CrawlerMonitor.getTime();
        System.out.printf("[%02d:%02d:%02d] Crawler monitor terminated.\n",
                time[0], time[1], time[2]);

        long completed = this.completedCounter.get();
        long distinct = this.distinctCounter.get();
        System.out.printf("Total crawler counters: %d distinct, %d completed\n",
                distinct, completed);

        this.submittedCounter = null;
        this.completedCounter = null;
        this.distinctCounter = null;
        this.monitorReportCounter = null;
    }

    @Override
    public void onSubmit(String url) {
        this.submittedCounter.incrementAndGet();
    }

    @Override
    public void onComplete(String url) {
        this.completedCounter.incrementAndGet();
    }

    @Override
    public void onDistinctPage(String url) {
        this.distinctCounter.incrementAndGet();
    }

    private static int[] getTime() {
        Calendar calendar = Calendar.getInstance();
        int hours = calendar.get(Calendar.HOUR_OF_DAY);
        int minutes = calendar.get(Calendar.MINUTE);
        int seconds = calendar.get(Calendar.SECOND);
        return new int[]{hours, minutes, seconds};
    }
}

ICrawlerRecord and ITaskQueue implementations

Now for the implementations of ICrawlerRecord and ITaskQueue.
The original database-backed version, CrawlerRecordDB.java:

package com.std4453.crawlerlab.crawler;

import com.std4453.crawlerlab.concurrent.PreparedStatementFactory;
import com.std4453.crawlerlab.db.DB;
import org.apache.commons.pool2.impl.GenericObjectPool;

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Implementation of {@link ICrawlerRecord} using a database.
 */
public class CrawlerRecordDB implements ICrawlerRecord {
    private DB db;
    private GenericObjectPool<PreparedStatement> selectStmtPool;
    private GenericObjectPool<PreparedStatement> insertStmtPool;

    public CrawlerRecordDB(DB db) {
        this.db = db;
    }

    @Override
    public boolean pageCrawled(String url) {
        try {
            PreparedStatement selectStmt = this.selectStmtPool.borrowObject();
            selectStmt.setString(1, url);
            ResultSet result = selectStmt.executeQuery();
            boolean hasNext = result.next();
            result.close();
            this.selectStmtPool.returnObject(selectStmt);

            return hasNext;
        } catch (Exception e) {
            e.printStackTrace();
            return true;
        }
    }

    @Override
    public void insertPage(String url) {
        try {
            PreparedStatement insertStmt = this.insertStmtPool.borrowObject();
            insertStmt.setString(1, url);
            insertStmt.execute();
            this.insertStmtPool.returnObject(insertStmt);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void beforeCrawl() {
        try {
            this.db.runSQL2("TRUNCATE Record;");

            String selectSQL = "SELECT * FROM Record WHERE URL = ?;";
            PreparedStatementFactory selectFactory =
                    new PreparedStatementFactory(this.db, selectSQL);
            this.selectStmtPool = new GenericObjectPool<>(selectFactory);
            String insertSQL = "INSERT INTO Record (URL) VALUES (?);";
            PreparedStatementFactory insertFactory =
                    new PreparedStatementFactory(this.db, insertSQL);
            this.insertStmtPool = new GenericObjectPool<>(insertFactory);

            System.out.println("DB initialized.");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void afterCrawl() {
        this.db.close();
    }
}
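One detail worth noting: when the database fails, pageCrawled() returns true, so the crawler errs on the side of skipping a page rather than re-crawling the whole site with no record being kept.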

The HashSet version currently in use, CrawlerRecordSet.java:

package com.std4453.crawlerlab.crawler;

import java.util.HashSet;
import java.util.Set;

/**
 * Implementation of {@link ICrawlerRecord} using a {@link HashSet} as storage.
 */
public class CrawlerRecordSet implements ICrawlerRecord {
    private Set<String> urlRecord;

    @Override
    public void beforeCrawl() {
        this.urlRecord = new HashSet<>();
    }

    @Override
    public void afterCrawl() {
        this.urlRecord = null;
    }

    @Override
    public synchronized boolean pageCrawled(String url) {
        return this.urlRecord.contains(url);
    }

    @Override
    public synchronized void insertPage(String url) {
        this.urlRecord.add(url);
    }
}
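As a side note, a lock-free variant could back the record with a concurrent set instead of synchronized methods. A hypothetical alternative (not part of this series' code; requires Java 8):

package com.std4453.crawlerlab.crawler;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical lock-free alternative to {@link CrawlerRecordSet}, backed by a
 * concurrent set so readers and writers do not contend on a single lock.
 */
public class CrawlerRecordConcurrentSet implements ICrawlerRecord {
    private Set<String> urlRecord;

    @Override
    public void beforeCrawl() {
        this.urlRecord = ConcurrentHashMap.newKeySet();
    }

    @Override
    public void afterCrawl() {
        this.urlRecord = null;
    }

    @Override
    public boolean pageCrawled(String url) {
        return this.urlRecord.contains(url);
    }

    @Override
    public void insertPage(String url) {
        this.urlRecord.add(url);
    }
}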

The backup ArrayDeque version, TaskQueueQueue.java:

package com.std4453.crawlerlab.crawler;

import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Implementation of {@link ITaskQueue} using an {@link ArrayDeque} as storage.
 */
public class TaskQueueQueue implements ITaskQueue {
    private Queue<String> queue;

    @Override
    public boolean submitPage(String url) {
        synchronized (this) {
            this.queue.offer(url);
            return true;
        }
    }

    @Override
    public String pollTask() {
        synchronized (this) {
            return this.queue.poll();
        }
    }

    @Override
    public void beforeCrawl() {
        this.queue = new ArrayDeque<>();
    }

    @Override
    public void afterCrawl() {
        this.queue = null;
    }
}

The TreeSet version currently in use, TaskQueueSet.java, which deduplicates new tasks right at submission, keeping the queue shorter and cutting both time and space costs:

package com.std4453.crawlerlab.crawler;

import java.util.SortedSet;
import java.util.TreeSet;

/**
 * Implementation of {@link ITaskQueue} using {@link TreeSet} as storage.
 */
public class TaskQueueSet implements ITaskQueue {
    private SortedSet<String> data;

    @Override
    public void beforeCrawl() {
        this.data = new TreeSet<>();
    }

    @Override
    public void afterCrawl() {
        this.data = null;
    }

    @Override
    public synchronized boolean submitPage(String url) {
        return this.data.add(url);
    }

    @Override
    public synchronized String pollTask() {
        if (this.data.isEmpty()) return null;
        String first = this.data.first();
        this.data.remove(first);
        return first;
    }
}
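Note that with a TreeSet the queue is no longer FIFO: pollTask() always returns the lexicographically smallest url in the set. For a crawler that does not care about visit order this is harmless, and in exchange submitPage() gets deduplication for free, since TreeSet.add() returns false for a url that is already queued.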

The CrawlerConfig configuration class

Since most crawlers share the same ICrawlerRecord and ITaskQueue (they mainly determine performance and should be iterated on internally) and only the ICrawlerRules differs, I decided to gather the chosen ICrawlerRecord, ITaskQueue, and a few other settings into a config class called CrawlerConfig; a GenericCrawler is then constructed from a CrawlerConfig instance plus a different ICrawlerRules:

package com.std4453.crawlerlab.crawler;

/**
 * Config class for {@link GenericCrawler}
 */
public class CrawlerConfig {
    public int numThreads = 16;
    public boolean startMonitorThread = true;
    public ICrawlerRecord crawlerRecord;
    public ITaskQueue taskQueue;

    public CrawlerConfig(ICrawlerRecord crawlerRecord, ITaskQueue taskQueue) {
        this.crawlerRecord = crawlerRecord;
        this.taskQueue = taskQueue;
    }

    public void setNumThreads(int numThreads) {
        this.numThreads = numThreads;
    }

    public void setStartMonitorThread(boolean startMonitorThread) {
        this.startMonitorThread = startMonitorThread;
    }
}

We also provide a static method that returns a config with the default ITaskQueue and ICrawlerRecord:

    public static CrawlerConfig defaultConfig() {
        return new CrawlerConfig(new CrawlerRecordSet(), new TaskQueueSet());
    }
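With the config in place, swapping implementations is a one-liner. For example, going back to the database-backed record while keeping the TreeSet queue might look like this (a hypothetical wiring; db and rules are assumed to be constructed already, as in the earlier posts):

// Hypothetical wiring: db and rules are assumed to exist, and the enclosing
// method is assumed to declare throws InterruptedException for run().
CrawlerConfig config = new CrawlerConfig(new CrawlerRecordDB(db), new TaskQueueSet());
config.setNumThreads(8);             // fewer worker threads
config.setStartMonitorThread(false); // no periodic monitor reports
GenericCrawler crawler = new GenericCrawler(rules, config);
crawler.run();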

Applied to GenericCrawler.java (which also reflects many of the earlier changes; some unessential parts are omitted, and comments added):

package com.std4453.crawlerlab.crawler;

// imports omitted

public class GenericCrawler {
    private class CrawlerThread extends Thread {
        // omitted, same as in the previous post
    }

    private ICrawlerRules rules;
    private CrawlerConfig config; // the config class
    private InverseSemaphore semaphore; // task completion counter (see part 2 of this series)
    // event listener lists
    private List<ICrawlerStateListener> stateListeners;
    private List<ICrawlerActionListener> actionListeners;

    public GenericCrawler(ICrawlerRules rules, CrawlerConfig config) {
        this.rules = rules;
        this.config = config == null ? CrawlerConfig.defaultConfig() : config;

        this.semaphore = new InverseSemaphore();
        this.actionListeners = new Vector<>();
        this.stateListeners = new Vector<>();

        // register the built-in event listeners
        this.listen(this.config.crawlerRecord);
        this.listen(this.config.taskQueue);
        this.listen(this.rules);
        if (this.config.startMonitorThread) {
            CrawlerMonitor monitor = new CrawlerMonitor();
            this.listen((ICrawlerActionListener) monitor);
            this.listen((ICrawlerStateListener) monitor);
        }
    }

    public GenericCrawler(ICrawlerRules rules) {
        // use the default config
        this(rules, CrawlerConfig.defaultConfig());
    }

    public void run() throws InterruptedException {
        this.stateListeners.forEach(ICrawlerStateListener::beforeCrawl);
        // the pile of monitor-initialization code that used to live here
        // is no longer needed after the decoupling

        CrawlerThread[] workers = new CrawlerThread[this.config.numThreads];
        for (int i = 0; i < workers.length; ++i)
            workers[i] = new CrawlerThread();
        this.rules.rootPages().forEach(this::submit);
        for (Thread thread : workers) thread.start();
        System.out.println("Crawler started.");
        // the semaphore's await() beats endless sleep()-polling by a mile
        this.semaphore.await();
        for (CrawlerThread thread : workers) thread.shutdown();
        for (Thread thread : workers) thread.join();

        this.stateListeners.forEach(ICrawlerStateListener::afterCrawl);
        System.out.println("Crawler terminated.");
    }

    // can also be called externally to register event listeners
    public void listen(ICrawlerActionListener actionListener) {
        this.actionListeners.add(actionListener);
    }

    public void listen(ICrawlerStateListener stateListener) {
        this.stateListeners.add(stateListener);
    }

    private void processPage(String url) {
        // heavy use of the extracted interfaces here
        try {
            if (this.config.crawlerRecord.pageCrawled(url)) return;
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onDistinctPage(url);
            this.config.crawlerRecord.insertPage(url);

            // fetch page
            Document doc = this.parse(url);
            this.rules.pageCrawled(url, doc);

            // crawl
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String href = link.attr("abs:href");
                if (this.rules.shouldContinue(url, href))
                    this.submit(href);
            }
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Unable to fetch url: " + url + " - " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onComplete(url);
            this.semaphore.onComplete();
        }
    }

    private Document parse(String url) throws IOException {
        // same as before, omitted
    }

    private void submit(final String url) {
        // We check whether a url has already been crawled before trying to
        // submit it. Since the program spends most of its time waiting on the
        // network, this check is cheap, so a few extra database selects are a
        // fair price for a shorter queue and lower memory usage. Besides, the
        // CrawlerRecordSet in use is backed by a HashSet, where string
        // membership tests are very fast, so there is even less to worry about.
        boolean succeed = !this.config.crawlerRecord.pageCrawled(url) &&
                this.config.taskQueue.submitPage(url);
        if (succeed) {
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onSubmit(url);
            this.semaphore.onSubmit();
        }
    }
}
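Since listen() is public, external code can hook into the same events. For instance, a hypothetical listener that just logs every newly discovered page:

// Hypothetical external listener; crawler is an existing GenericCrawler.
crawler.listen(new ICrawlerActionListener() {
    @Override
    public void onSubmit(String url) {
        // not interested
    }

    @Override
    public void onComplete(String url) {
        // not interested
    }

    @Override
    public void onDistinctPage(String url) {
        System.out.println("New page: " + url);
    }
});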

Attached is the InverseSemaphore, slightly modified compared to the original:

package com.std4453.crawlerlab.concurrent;

/**
 * A counter that works opposite to a semaphore: it counts up on submit, down
 * on completion, and await() blocks until it returns to zero.
 */
public class InverseSemaphore {
    private int value = 0;
    private final Object lock = new Object();

    public InverseSemaphore() {
    }

    public void onSubmit() {
        synchronized (lock) {
            ++value;
        }
    }

    public void onComplete() {
        synchronized (lock) {
            --value;
            if (value == 0)
                lock.notifyAll();
        }
    }

    public void await() throws InterruptedException {
        synchronized (lock) {
            while (value > 0) lock.wait();
        }
    }
}
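To see the counter in action outside the crawler, here is a tiny hypothetical demo; note that onSubmit() is called before the task is handed to the pool, mirroring the order used by GenericCrawler's submit():

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InverseSemaphoreDemo {
    public static void main(String[] args) throws InterruptedException {
        InverseSemaphore semaphore = new InverseSemaphore();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; ++i) {
            semaphore.onSubmit(); // count up before the task can finish
            pool.submit(() -> {
                // ... do some work here ...
                semaphore.onComplete(); // count down when done
            });
        }
        semaphore.await(); // blocks until all 10 tasks have completed
        pool.shutdown();
        System.out.println("All tasks done.");
    }
}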

With that, this round of refactoring is just about done. Here is the list of all packages and classes so far:

com.std4453.crawlerlab.concurrent
    InverseSemaphore
    PreparedStatementFactory
com.std4453.crawlerlab.crawler
    CrawlerConfig
    CrawlerMonitor
    CrawlerRecordDB
    CrawlerRecordSet
    GenericCrawler
    ICrawlerActionListener
    ICrawlerRecord
    ICrawlerRules
    ICrawlerStateListener
    ITaskQueue
    TaskQueueQueue
    TaskQueueSet
com.std4453.crawlerlab.db
    DB
com.std4453.crawlerlab.tests
    CrawlerTest

Test code

The program entry point and test class, CrawlerTest.java:

package com.std4453.crawlerlab.tests;

import com.std4453.crawlerlab.crawler.GenericCrawler;
import com.std4453.crawlerlab.crawler.ICrawlerRules;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Program entry point and test class: crawls www.sfls.cn and writes
 * url + title pairs to output.txt.
 */
public class CrawlerTest {
    public static void main(String[] args) throws Exception {
        ICrawlerRules rules = new ICrawlerRules() {
            private PrintWriter outputWriter;

            @Override
            public List<String> rootPages() {
                List<String> list = new ArrayList<>();
                list.add("http://www.sfls.cn/");
                return list;
            }

            @Override
            public void pageCrawled(String url, Document doc) {
                this.outputWriter.println(url + " | " + doc.title());
            }

            @Override
            public boolean shouldContinue(String from, String to) {
                try {
                    return new URL(to).getHost().contains("www.sfls.cn");
                } catch (Exception e) {
                    return false;
                }
            }

            @Override
            public void beforeCrawl() {
                try {
                    File outputFile = new File("output.txt");
                    FileOutputStream outputStream = new FileOutputStream(outputFile);
                    this.outputWriter = new PrintWriter(outputStream);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            }

            @Override
            public void afterCrawl() {
                this.outputWriter.close();
            }
        };

        GenericCrawler crawler = new GenericCrawler(rules);
        crawler.run();
    }
}

The code is simple, so I won't annotate it. The output looks just like the one at the end of yesterday's post; there is nothing special to explain, and you are welcome to try it on your own machine.

Summary

This installment has only one section because refactoring is exhausting work: there is a lot of code, the post ran long, and so it ends after this single section. Tomorrow will probably cover further optimization and bug-fixing, a mixed bag; thanks for your continued support.
(I have things to do tonight, so I'm putting down the pen this early in the afternoon; please bear with me.)
