Java Web Crawler Development Notes (4)


Anything making you unhappy? Say it out loud so everyone can have a good laugh!

(This installment has only one section, its numbering continuing from the previous post, and it is rather long.)

0x04 Refactor! Refactor!

What is Java's best-known and most notorious flaw? The gigantic JRE? The badly designed Date API? No!
It is, of course, its immensely rigorous yet immensely verbose architectural style.
Seen from this angle, Java is a testing ground for many other languages, programming paradigms, and program architectures, and at the same time an amalgam of them all, which is why you often see rants like this:

(meme image: "no explanation needed")

Alright, long story short: after several iterations finally succeeded in turning CrawlerTest into a complete mess, I resolutely decided to do the thing every code monkey takes the most pride (read: splitting headaches) in: refactoring.

Notice that the steps of all the crawler experiments so far have been very similar (this list expands on the one from the first post in this series; a runnable sketch follows the list):

  1. Push one or more start pages into the queue (e.g. http://www.sfls.cn)
  2. Take a link out of the queue (originally the ExecutorService's built-in queue, now a TreeSet)
  3. Check whether the page has already been crawled (originally MySQL, now an in-memory HashSet)
  4. If not crawled yet, insert it into the crawl record so it is not crawled twice (now a HashSet)
  5. Fetch the page content and parse it (HttpClient + Jsoup)
  6. Process the parsed page (e.g. print link + title)
  7. Iterate over the <a> tags and extract the absolute url from each href attribute (Jsoup)
  8. Decide whether the url should be crawled further (e.g. whether it is under a given domain)
  9. If so, push the url into the queue (now a TreeSet)
  10. If the queue is not empty, go to step 2
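Before diving into the refactor, here is a minimal single-threaded sketch of these ten steps, for orientation only: it uses Jsoup's built-in fetcher instead of HttpClient for brevity, the class name and domain filter are placeholders, and the real GenericCrawler below is multithreaded and interface-driven.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class MiniCrawler {
    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>(); // steps 2 & 9: urls to crawl
        Set<String> record = new HashSet<>();     // steps 3 & 4: crawled urls
        queue.offer("http://www.sfls.cn/");       // step 1: start page
        String url;
        while ((url = queue.poll()) != null) {    // steps 2 & 10
            if (!record.add(url)) continue;       // steps 3 & 4 in one call
            try {
                Document doc = Jsoup.connect(url).get();        // step 5
                System.out.println(url + " | " + doc.title());  // step 6
                for (Element link : doc.select("a[href]")) {    // step 7
                    String href = link.attr("abs:href");
                    if (href.contains("www.sfls.cn"))           // step 8
                        queue.offer(href);                      // step 9
                }
            } catch (Exception e) {
                System.err.println("Unable to fetch url: " + url);
            }
        }
    }
}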

This more detailed list shows that, apart from the parts every crawler shares, the remaining execution logic splits into three parts:

  1. The record of crawled urls (steps 3 & 4): a database or an in-memory HashSet; it keeps already-crawled pages from being crawled again.
  2. The queue of urls to crawl (steps 2 & 9): the ExecutorService's internal task queue or a TreeSet (deduplicated); it tracks pages that are scheduled but not yet crawled.
  3. The page-handling logic (steps 1, 6 & 8): it determines the crawler's starting points, its path, and the output produced from each crawled page.

Interface design

We can therefore extract three interfaces:

ICrawlerRecord.java:

package com.std4453.crawlerlab.crawler;

/**
 * Record of already-crawled urls; keeps pages from being crawled twice.
 */
public interface ICrawlerRecord {
    /**
     * @param url url to check
     * @return whether the url is already crawled
     */
    boolean pageCrawled(String url);

    /**
     * Inserts the given url into the crawler's record.
     *
     * @param url url to insert
     */
    void insertPage(String url);
}

ITaskQueue.java:

package com.std4453.crawlerlab.crawler;

/**
 * Queue of urls that have been discovered but not yet crawled.
 */
public interface ITaskQueue {
    /**
     * Submits a page to the task queue, returning {@code true} if the page was
     * successfully submitted; this controls whether {@code submittedCounter} is
     * incremented.
     *
     * @param url
     *      url to submit
     *
     * @return whether the page is successfully submitted
     */
    boolean submitPage(String url);

    /**
     * Retrieves (and removes) a url from the task queue, returning {@code null}
     * if the queue is empty.
     *
     * @return url from task queue
     */
    String pollTask();
}

ICrawlerRules.java:

package com.std4453.crawlerlab.crawler;

import org.jsoup.nodes.Document;

import java.util.List;

/**
 * Page-handling logic of a crawler: where it starts, which links it follows,
 * and what it does with each crawled page.
 */
public interface ICrawlerRules {
    /**
     * @return the pages from which the crawler starts.
     */
    List<String> rootPages();

    /**
     * Called when crawler reaches the given page.
     *
     * @param url
     *      address of the given page.
     * @param doc
     *      parsed document of the page.
     */
    void pageCrawled(String url, Document doc);

    /**
     * Whether the crawler should continue from source page (specified by {@code from})
     * to target page (specified by {@code to}).
     *
     * @param from
     *      absolute url of source page
     * @param to
     *      absolute url of target page
     *
     * @return whether the crawler should continue
     */
    boolean shouldContinue(String from, String to);
}

Noting also that all three parts need some degree of setup and teardown (opening/closing a database connection, opening/closing files, and so on), we add one hook before the crawl starts and one after it ends, extract them into a fourth interface, and let the other three interfaces extend it:
ICrawlerStateListener.java:

package com.std4453.crawlerlab.crawler;

/**
 * Lifecycle hooks of the crawler, letting implementations set up and tear down
 * their environment (database connections, files, etc.).
 */
public interface ICrawlerStateListener {
    /**
     * Called before the actual crawl starts.
     */
    void beforeCrawl();

    /**
     * Called after the whole crawl has stopped.
     */
    void afterCrawl();
}
public interface ICrawlerRecord extends ICrawlerStateListener {
public interface ITaskQueue extends ICrawlerStateListener {
public interface ICrawlerRules extends ICrawlerStateListener {

With these extracted, the old CrawlerTest can climb out of the big-ball-of-mud swamp, and naturally deserves a new name: GenericCrawler:

public class GenericCrawler {

Decoupling CrawlerMonitor

The next step is to give the crawler's monitor its rightful place. The old CrawlerMonitor was tightly coupled to CrawlerTest (now GenericCrawler); a listener interface for crawler events breaks that coupling:
ICrawlerActionListener.java:

package com.std4453.crawlerlab.crawler;

/**
 * Listener for the crawler's per-page events.
 */
public interface ICrawlerActionListener {
    /**
     * Called when a new page is successfully submitted to the task queue.
     *
     * @param url
     *      the submitted url
     */
    void onSubmit(String url);

    /**
     * Called when parsing of a page is completed.
     *
     * @param url
     *      url of the completed page
     */
    void onComplete(String url);

    /**
     * Called when a new distinct page is found (before being inserted into the
     * crawler's record).
     *
     * @param url
     *      url of the new distinct page
     */
    void onDistinctPage(String url);
}

CrawlerMonitor can now shed its servant status as an inner class and become a proper public class:

package com.std4453.crawlerlab.crawler;

import java.util.Calendar;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/**
 * A monitor that regularly reports the state of the crawler.
 */
public class CrawlerMonitor implements ICrawlerActionListener, ICrawlerStateListener {
    private class MonitorReportTask implements Runnable {
        @Override
        public void run() {
            int reportId = CrawlerMonitor.this.monitorReportCounter.incrementAndGet();
            int[] time = CrawlerMonitor.getTime();
            System.out.printf("[%02d:%02d:%02d] Monitor report #%d:\n",
                    time[0], time[1], time[2], reportId);

            // memory usage
            Runtime runtime = Runtime.getRuntime();
            long freeMem = runtime.freeMemory();
            long totalMem = runtime.totalMemory();
            long usedMem = totalMem - freeMem;
            double usedMemMB = usedMem / 1048576d;
            double totalMemMB = totalMem / 1048576d;
            double usedMemPercentage = usedMemMB / totalMemMB * 100;
            System.out.printf("Memory usage: %.2fM used (%.1f%% of %.2fM total)\n",
                    usedMemMB, usedMemPercentage, totalMemMB);

            // crawler counters
            long submitted = CrawlerMonitor.this.submittedCounter.get();
            long completed = CrawlerMonitor.this.completedCounter.get();
            long waiting = submitted - completed;
            long distinct = CrawlerMonitor.this.distinctCounter.get();
            System.out.printf("Crawler counters: %d distinct, " +
                            "%d submitted, %d completed, %d waiting\n",
                    distinct, submitted, completed, waiting);
        }
    }

    private AtomicLong submittedCounter, completedCounter, distinctCounter;
    private AtomicInteger monitorReportCounter;
    private ScheduledExecutorService monitorService;

    @Override
    public void beforeCrawl() {
        this.submittedCounter = new AtomicLong();
        this.completedCounter = new AtomicLong();
        this.distinctCounter = new AtomicLong();
        this.monitorReportCounter = new AtomicInteger();

        this.monitorService = Executors.newSingleThreadScheduledExecutor();
        this.monitorService.scheduleAtFixedRate(new MonitorReportTask(),
                5, 30, TimeUnit.SECONDS);

        int[] time = CrawlerMonitor.getTime();
        System.out.printf("[%02d:%02d:%02d] Crawler monitor started.\n",
                time[0], time[1], time[2]);
    }

    @Override
    public void afterCrawl() {
        this.monitorService.shutdown();
        try {
            this.monitorService.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException ignored) {
        }

        int[] time = CrawlerMonitor.getTime();
        System.out.printf("[%02d:%02d:%02d] Crawler monitor terminated.\n",
                time[0], time[1], time[2]);

        long completed = this.completedCounter.get();
        long distinct = this.distinctCounter.get();
        System.out.printf("Total crawler counters: %d distinct, %d completed\n",
                distinct, completed);

        this.submittedCounter = null;
        this.completedCounter = null;
        this.distinctCounter = null;
        this.monitorReportCounter = null;
    }

    @Override
    public void onSubmit(String url) {
        this.submittedCounter.incrementAndGet();
    }

    @Override
    public void onComplete(String url) {
        this.completedCounter.incrementAndGet();
    }

    @Override
    public void onDistinctPage(String url) {
        this.distinctCounter.incrementAndGet();
    }

    private static int[] getTime() {
        Calendar calendar = Calendar.getInstance();
        int hours = calendar.get(Calendar.HOUR_OF_DAY);
        int minutes = calendar.get(Calendar.MINUTE);
        int seconds = calendar.get(Calendar.SECOND);
        return new int[]{hours, minutes, seconds};
    }
}

ICrawlerRecord and ITaskQueue implementations

Now for the implementations of ICrawlerRecord and ITaskQueue.
The original database-backed version, CrawlerRecordDB.java:

package com.std4453.crawlerlab.crawler;

import com.std4453.crawlerlab.concurrent.PreparedStatementFactory;
import com.std4453.crawlerlab.db.DB;
import org.apache.commons.pool2.impl.GenericObjectPool;

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/**
 * Implementation of {@link ICrawlerRecord} using a database.
 */
public class CrawlerRecordDB implements ICrawlerRecord {
    private DB db;
    private GenericObjectPool<PreparedStatement> selectStmtPool;
    private GenericObjectPool<PreparedStatement> insertStmtPool;

    public CrawlerRecordDB(DB db) {
        this.db = db;
    }

    @Override
    public boolean pageCrawled(String url) {
        try {
            PreparedStatement selectStmt = this.selectStmtPool.borrowObject();
            selectStmt.setString(1, url);
            ResultSet result = selectStmt.executeQuery();
            boolean hasNext = result.next();
            result.close();
            this.selectStmtPool.returnObject(selectStmt);

            return hasNext;
        } catch (Exception e) {
            e.printStackTrace();
            return true;
        }
    }

    @Override
    public void insertPage(String url) {
        try {
            PreparedStatement insertStmt = this.insertStmtPool.borrowObject();
            insertStmt.setString(1, url);
            insertStmt.execute();
            this.insertStmtPool.returnObject(insertStmt);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void beforeCrawl() {
        try {
            this.db.runSQL2("TRUNCATE Record;");

            String selectSQL = "SELECT * FROM Record WHERE URL = ?;";
            PreparedStatementFactory selectFactory =
                    new PreparedStatementFactory(this.db, selectSQL);
            this.selectStmtPool = new GenericObjectPool<>(selectFactory);
            String insertSQL = "INSERT INTO Record (URL) VALUES (?);";
            PreparedStatementFactory insertFactory =
                    new PreparedStatementFactory(this.db, insertSQL);
            this.insertStmtPool = new GenericObjectPool<>(insertFactory);

            System.out.println("DB initialized.");
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }

    @Override
    public void afterCrawl() {
        this.db.close();
    }
}
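One detail worth noting: when the database fails, pageCrawled() returns true, so the crawler errs on the side of skipping a page rather than re-crawling the whole site with no record being kept.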

The HashSet version currently in use, CrawlerRecordSet.java:

package com.std4453.crawlerlab.crawler;

import java.util.HashSet;
import java.util.Set;

/**
 * Implementation of {@link ICrawlerRecord} using a {@link HashSet} as storage.
 */
public class CrawlerRecordSet implements ICrawlerRecord {
    private Set<String> urlRecord;

    @Override
    public void beforeCrawl() {
        this.urlRecord = new HashSet<>();
    }

    @Override
    public void afterCrawl() {
        this.urlRecord = null;
    }

    @Override
    public synchronized boolean pageCrawled(String url) {
        return this.urlRecord.contains(url);
    }

    @Override
    public synchronized void insertPage(String url) {
        this.urlRecord.add(url);
    }
}
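As a side note, a lock-free variant could back the record with a concurrent set instead of synchronized methods. A hypothetical alternative (not part of this series' code; requires Java 8):

package com.std4453.crawlerlab.crawler;

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical lock-free alternative to {@link CrawlerRecordSet}, backed by a
 * concurrent set so readers and writers do not contend on a single lock.
 */
public class CrawlerRecordConcurrentSet implements ICrawlerRecord {
    private Set<String> urlRecord;

    @Override
    public void beforeCrawl() {
        this.urlRecord = ConcurrentHashMap.newKeySet();
    }

    @Override
    public void afterCrawl() {
        this.urlRecord = null;
    }

    @Override
    public boolean pageCrawled(String url) {
        return this.urlRecord.contains(url);
    }

    @Override
    public void insertPage(String url) {
        this.urlRecord.add(url);
    }
}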

The backup ArrayDeque version, TaskQueueQueue.java:

package com.std4453.crawlerlab.crawler;

import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Implementation of {@link ITaskQueue} using an {@link ArrayDeque} as storage.
 */
public class TaskQueueQueue implements ITaskQueue {
    private Queue<String> queue;

    @Override
    public boolean submitPage(String url) {
        synchronized (this) {
            this.queue.offer(url);
            return true;
        }
    }

    @Override
    public String pollTask() {
        synchronized (this) {
            return this.queue.poll();
        }
    }

    @Override
    public void beforeCrawl() {
        this.queue = new ArrayDeque<>();
    }

    @Override
    public void afterCrawl() {
        this.queue = null;
    }
}

The TreeSet version currently in use, TaskQueueSet.java, which deduplicates new tasks right at submission, keeping the queue shorter and cutting both time and space costs:

package com.std4453.crawlerlab.crawler;

import java.util.SortedSet;
import java.util.TreeSet;

/**
 * Implementation of {@link ITaskQueue} using {@link TreeSet} as storage.
 */
public class TaskQueueSet implements ITaskQueue {
    private SortedSet<String> data;

    @Override
    public void beforeCrawl() {
        this.data = new TreeSet<>();
    }

    @Override
    public void afterCrawl() {
        this.data = null;
    }

    @Override
    public synchronized boolean submitPage(String url) {
        return this.data.add(url);
    }

    @Override
    public synchronized String pollTask() {
        if (this.data.isEmpty()) return null;
        String first = this.data.first();
        this.data.remove(first);
        return first;
    }
}
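Note that with a TreeSet the queue is no longer FIFO: pollTask() always returns the lexicographically smallest url in the set. For a crawler that does not care about visit order this is harmless, and in exchange submitPage() gets deduplication for free, since TreeSet.add() returns false for a url that is already queued.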

The CrawlerConfig configuration class

Since most crawlers share the same ICrawlerRecord and ITaskQueue (they mainly determine performance and should be iterated on internally) and only the ICrawlerRules differs, I decided to gather the chosen ICrawlerRecord, ITaskQueue, and a few other settings into a config class called CrawlerConfig; a GenericCrawler is then constructed from a CrawlerConfig instance plus a different ICrawlerRules:

package com.std4453.crawlerlab.crawler;

/**
 * Config class for {@link GenericCrawler}
 */
public class CrawlerConfig {
    public int numThreads = 16;
    public boolean startMonitorThread = true;
    public ICrawlerRecord crawlerRecord;
    public ITaskQueue taskQueue;

    public CrawlerConfig(ICrawlerRecord crawlerRecord, ITaskQueue taskQueue) {
        this.crawlerRecord = crawlerRecord;
        this.taskQueue = taskQueue;
    }

    public void setNumThreads(int numThreads) {
        this.numThreads = numThreads;
    }

    public void setStartMonitorThread(boolean startMonitorThread) {
        this.startMonitorThread = startMonitorThread;
    }
}

We also provide a static method that returns a config with the default ITaskQueue and ICrawlerRecord:

    public static CrawlerConfig defaultConfig() {
        return new CrawlerConfig(new CrawlerRecordSet(), new TaskQueueSet());
    }
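With the config in place, swapping implementations is a one-liner. For example, going back to the database-backed record while keeping the TreeSet queue might look like this (a hypothetical wiring; db and rules are assumed to be constructed already, as in the earlier posts):

// Hypothetical wiring: db and rules are assumed to exist, and the enclosing
// method is assumed to declare throws InterruptedException for run().
CrawlerConfig config = new CrawlerConfig(new CrawlerRecordDB(db), new TaskQueueSet());
config.setNumThreads(8);             // fewer worker threads
config.setStartMonitorThread(false); // no periodic monitor reports
GenericCrawler crawler = new GenericCrawler(rules, config);
crawler.run();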

Applied to GenericCrawler.java (which also reflects many of the earlier changes; some unessential parts are omitted, and comments added):

package com.std4453.crawlerlab.crawler;

// imports omitted

public class GenericCrawler {
    private class CrawlerThread extends Thread {
        // omitted, same as in the previous post
    }

    private ICrawlerRules rules;
    private CrawlerConfig config; // the config class
    private InverseSemaphore semaphore; // task completion counter (see part 2 of this series)
    // event listener lists
    private List<ICrawlerStateListener> stateListeners;
    private List<ICrawlerActionListener> actionListeners;

    public GenericCrawler(ICrawlerRules rules, CrawlerConfig config) {
        this.rules = rules;
        this.config = config == null ? CrawlerConfig.defaultConfig() : config;

        this.semaphore = new InverseSemaphore();
        this.actionListeners = new Vector<>();
        this.stateListeners = new Vector<>();

        // register the built-in event listeners
        this.listen(this.config.crawlerRecord);
        this.listen(this.config.taskQueue);
        this.listen(this.rules);
        if (this.config.startMonitorThread) {
            CrawlerMonitor monitor = new CrawlerMonitor();
            this.listen((ICrawlerActionListener) monitor);
            this.listen((ICrawlerStateListener) monitor);
        }
    }

    public GenericCrawler(ICrawlerRules rules) {
        // use the default config
        this(rules, CrawlerConfig.defaultConfig());
    }

    public void run() throws InterruptedException {
        this.stateListeners.forEach(ICrawlerStateListener::beforeCrawl);
        // the pile of monitor-initialization code that used to live here
        // is no longer needed after the decoupling

        CrawlerThread[] workers = new CrawlerThread[this.config.numThreads];
        for (int i = 0; i < workers.length; ++i)
            workers[i] = new CrawlerThread();
        this.rules.rootPages().forEach(this::submit);
        for (Thread thread : workers) thread.start();
        System.out.println("Crawler started.");
        // the semaphore's await() beats endless sleep()-polling by a mile
        this.semaphore.await();
        for (CrawlerThread thread : workers) thread.shutdown();
        for (Thread thread : workers) thread.join();

        this.stateListeners.forEach(ICrawlerStateListener::afterCrawl);
        System.out.println("Crawler terminated.");
    }

    // can also be called externally to register event listeners
    public void listen(ICrawlerActionListener actionListener) {
        this.actionListeners.add(actionListener);
    }

    public void listen(ICrawlerStateListener stateListener) {
        this.stateListeners.add(stateListener);
    }

    private void processPage(String url) {
        // heavy use of the extracted interfaces here
        try {
            if (this.config.crawlerRecord.pageCrawled(url)) return;
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onDistinctPage(url);
            this.config.crawlerRecord.insertPage(url);

            // fetch page
            Document doc = this.parse(url);
            this.rules.pageCrawled(url, doc);

            // crawl
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String href = link.attr("abs:href");
                if (this.rules.shouldContinue(url, href))
                    this.submit(href);
            }
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Unable to fetch url: " + url + " - " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onComplete(url);
            this.semaphore.onComplete();
        }
    }

    private Document parse(String url) throws IOException {
        // same as before, omitted
    }

    private void submit(final String url) {
        // We check whether a url has already been crawled before trying to
        // submit it. Since the program spends most of its time waiting on the
        // network, this check is cheap, so a few extra database selects are a
        // fair price for a shorter queue and lower memory usage. Besides, the
        // CrawlerRecordSet in use is backed by a HashSet, where string
        // membership tests are very fast, so there is even less to worry about.
        boolean succeed = !this.config.crawlerRecord.pageCrawled(url) &&
                this.config.taskQueue.submitPage(url);
        if (succeed) {
            for (ICrawlerActionListener actionListener : this.actionListeners)
                actionListener.onSubmit(url);
            this.semaphore.onSubmit();
        }
    }
}
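Since listen() is public, external code can hook into the same events. For instance, a hypothetical listener that just logs every newly discovered page:

// Hypothetical external listener; crawler is an existing GenericCrawler.
crawler.listen(new ICrawlerActionListener() {
    @Override
    public void onSubmit(String url) {
        // not interested
    }

    @Override
    public void onComplete(String url) {
        // not interested
    }

    @Override
    public void onDistinctPage(String url) {
        System.out.println("New page: " + url);
    }
});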

Attached is the InverseSemaphore, slightly modified compared to the original:

package com.std4453.crawlerlab.concurrent;

/**
 * A counter that works opposite to a semaphore: it counts up on submit, down
 * on completion, and await() blocks until it returns to zero.
 */
public class InverseSemaphore {
    private int value = 0;
    private final Object lock = new Object();

    public InverseSemaphore() {
    }

    public void onSubmit() {
        synchronized (lock) {
            ++value;
        }
    }

    public void onComplete() {
        synchronized (lock) {
            --value;
            if (value == 0)
                lock.notifyAll();
        }
    }

    public void await() throws InterruptedException {
        synchronized (lock) {
            while (value > 0) lock.wait();
        }
    }
}
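To see the counter in action outside the crawler, here is a tiny hypothetical demo; note that onSubmit() is called before the task is handed to the pool, mirroring the order used by GenericCrawler's submit():

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InverseSemaphoreDemo {
    public static void main(String[] args) throws InterruptedException {
        InverseSemaphore semaphore = new InverseSemaphore();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; ++i) {
            semaphore.onSubmit(); // count up before the task can finish
            pool.submit(() -> {
                // ... do some work here ...
                semaphore.onComplete(); // count down when done
            });
        }
        semaphore.await(); // blocks until all 10 tasks have completed
        pool.shutdown();
        System.out.println("All tasks done.");
    }
}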

With that, this round of refactoring is just about done. Here is the list of all packages and classes so far:

com.std4453.crawlerlab.concurrent
    InverseSemaphore
    PreparedStatementFactory
com.std4453.crawlerlab.crawler
    CrawlerConfig
    CrawlerMonitor
    CrawlerRecordDB
    CrawlerRecordSet
    GenericCrawler
    ICrawlerActionListener
    ICrawlerRecord
    ICrawlerRules
    ICrawlerStateListener
    ITaskQueue
    TaskQueueQueue
    TaskQueueSet
com.std4453.crawlerlab.db
    DB
com.std4453.crawlerlab.tests
    CrawlerTest

Test code

The program entry point and test class, CrawlerTest.java:

package com.std4453.crawlerlab.tests;

import com.std4453.crawlerlab.crawler.GenericCrawler;
import com.std4453.crawlerlab.crawler.ICrawlerRules;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

/**
 * Program entry point and test class: crawls www.sfls.cn and writes
 * url + title pairs to output.txt.
 */
public class CrawlerTest {
    public static void main(String[] args) throws Exception {
        ICrawlerRules rules = new ICrawlerRules() {
            private PrintWriter outputWriter;

            @Override
            public List<String> rootPages() {
                List<String> list = new ArrayList<>();
                list.add("http://www.sfls.cn/");
                return list;
            }

            @Override
            public void pageCrawled(String url, Document doc) {
                this.outputWriter.println(url + " | " + doc.title());
            }

            @Override
            public boolean shouldContinue(String from, String to) {
                try {
                    return new URL(to).getHost().contains("www.sfls.cn");
                } catch (Exception e) {
                    return false;
                }
            }

            @Override
            public void beforeCrawl() {
                try {
                    File outputFile = new File("output.txt");
                    FileOutputStream outputStream = new FileOutputStream(outputFile);
                    this.outputWriter = new PrintWriter(outputStream);
                } catch (FileNotFoundException e) {
                    e.printStackTrace();
                }
            }

            @Override
            public void afterCrawl() {
                this.outputWriter.close();
            }
        };

        GenericCrawler crawler = new GenericCrawler(rules);
        crawler.run();
    }
}

The code is simple, so I won't annotate it. The output looks just like the one at the end of yesterday's post; there is nothing special to explain, and you are welcome to try it on your own machine.

Summary

This installment has only one section because refactoring is exhausting work: there is a lot of code, the post ran long, and so it ends after this single section. Tomorrow will probably cover further optimization and bug-fixing, a mixed bag; thanks for your continued support.
(I have things to do tonight, so I'm putting down the pen this early in the afternoon; please bear with me.)
