Java Web Crawler Development Notes (3)

Troubles come every year, and this year more than most.

0x01 OOM and Object Pooling

First up: the code ran overnight, and when I got up in the morning it had thrown an OOM:

    java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:964)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:897)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:886)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:860)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:877)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:873)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:443)
    at com.mysql.jdbc.PreparedStatement.getInstance(PreparedStatement.java:761)
    at com.mysql.jdbc.ConnectionImpl.clientPrepareStatement(ConnectionImpl.java:1471)
    at com.mysql.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:4167)
    at com.mysql.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:4071)
    at com.mysql.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:4078)
    at com.std4453.crawerlab.main.CrawlerTest.processPage(CrawlerTest.java:92)
    at com.std4453.crawerlab.main.CrawlerTest.lambda$submit$0(CrawlerTest.java:136)
    at com.std4453.crawerlab.main.CrawlerTest$$Lambda$1/1586270964.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Countermeasure: enlarge the JVM heap and run again with -Xms512M -Xmx1024M.
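A quick way to confirm the new limits actually took effect (a throwaway check of mine, not part of the crawler) is to print the heap bounds at startup via the Runtime API:

    // with -Xmx1024M, maxMemory() should report roughly 1024M
    Runtime rt = Runtime.getRuntime();
    System.out.println("Heap: " + rt.totalMemory() / 1048576 + "M committed, "
            + rt.maxMemory() / 1048576 + "M max");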
Besides, almost every OOM blew up while creating a new PreparedStatement. Checking the javadoc, it turns out that a PreparedStatement, while stateful, is not single-use: it can be reused. In other words, there's no need to prepare a new PreparedStatement on every processPage() call; it's enough to set up an object pool and share it among the ExecutorService's threads.
As it happens, I had just translated an article on implementing object pools in Java; following its advice, I added the commons-pool2 dependency and modified the code as follows:
In the CrawlerTest class definition, add:

    private GenericObjectPool<PreparedStatement> insertStmtPool;

In the //before part of the run() method, add:

    String insertSQL = "INSERT INTO record (URL) VALUES (?);";
    PreparedStatementFactory factory = new PreparedStatementFactory(this.db, insertSQL);
    GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();
    this.insertStmtPool = new GenericObjectPool<>(factory, poolConfig);
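One caveat: GenericObjectPoolConfig caps the pool at 8 objects by default, so with more worker threads than that, borrowObject() will block once the pool is exhausted. Tuning it is optional; the values below are my own guesses, not from the original code:

    poolConfig.setMaxTotal(16); // default is 8; roughly match the worker thread count
    poolConfig.setBlockWhenExhausted(true); // wait for a free statement instead of failing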

And in processPage():

    PreparedStatement statement = this.insertStmtPool.borrowObject();
    try {
        // execute ...
    } finally {
        this.insertStmtPool.returnObject(statement); // return even if execution throws
    }

Plus the self-evident PreparedStatementFactory class:

package com.std4453.crawerlab.main;

import com.std4453.crawlerlab.db.DB;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;

import java.sql.PreparedStatement;
import java.sql.Statement;

public class PreparedStatementFactory extends BasePooledObjectFactory<PreparedStatement> {
    private DB db;
    private String sql;

    public PreparedStatementFactory(DB db, String sql) {
        this.db = db;
        this.sql = sql;
    }

    @Override
    public PreparedStatement create() throws Exception {
        return this.db.connection.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
    }

    @Override
    public PooledObject<PreparedStatement> wrap(PreparedStatement obj) {
        return new DefaultPooledObject<>(obj);
    }
}
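One detail the factory above skips: when the pool evicts or invalidates a pooled statement it calls the factory's destroyObject(), whose BasePooledObjectFactory default is a no-op, so discarded statements would never be closed. A small override (my addition, not in the original class) takes care of that:

    @Override
    public void destroyObject(PooledObject<PreparedStatement> p) throws Exception {
        // close the underlying statement when the pool discards it
        p.getObject().close();
    }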

Also, the earlier SELECT SQL was built by direct string concatenation, which can cause problems (in fact, it throws an error whenever a URL contains a single quote '), so it too needs to become a PreparedStatement:

    // in the CrawlerTest definition
    private GenericObjectPool<PreparedStatement> selectStmtPool;

    // in the run() method
    String selectSQL = "SELECT * FROM record WHERE URL = ?";
    PreparedStatementFactory selectFactory = new PreparedStatementFactory(this.db, selectSQL);
    this.selectStmtPool = new GenericObjectPool<>(selectFactory);

    // in processPage()
    PreparedStatement selectStmt = this.selectStmtPool.borrowObject();
    try {
        selectStmt.setString(1, url);
        ResultSet result = selectStmt.executeQuery();
        boolean hasNext = result.next();
        result.close(); // remember to close the ResultSet
    } finally {
        this.selectStmtPool.returnObject(selectStmt);
    }

0x02 Big Brother is watching you

To get a fuller and clearer picture of how the program is doing, I decided to add a monitor thread that prints some status information to the console at fixed intervals:
Definitions in CrawlerTest:

    private AtomicLong submittedCounter, completedCounter, distinctCounter;
    private AtomicInteger monitorReportCounter;

In run():

    ScheduledExecutorService monitor = null;
    // counters
    this.submittedCounter = new AtomicLong(0);
    this.completedCounter = new AtomicLong(0);
    this.distinctCounter = new AtomicLong(0);
    this.monitorReportCounter = new AtomicInteger(0);
    monitor = Executors.newSingleThreadScheduledExecutor();
    monitor.scheduleAtFixedRate(new CrawlerMonitor(), 30, 30, TimeUnit.SECONDS); // report every 30 seconds
    System.out.println("Monitor thread started.");

    // ...

    monitor.shutdown();
    monitor.awaitTermination(10, TimeUnit.MINUTES);
    System.out.println("Monitor thread terminated.");

The inner class CrawlerMonitor:

    /**
     * A monitor that regularly reports the state of the crawler.
     */
    private class CrawlerMonitor implements Runnable {
        @Override
        public void run() {
            // timestamp + report sequence number
            int reportId = CrawlerTest.this.monitorReportCounter.incrementAndGet();
            Calendar calendar = Calendar.getInstance();
            int hours = calendar.get(Calendar.HOUR_OF_DAY);
            int minutes = calendar.get(Calendar.MINUTE);
            int seconds = calendar.get(Calendar.SECOND);
            System.out.println(String.format("[%02d:%02d:%02d] Crawler report #%d:", hours, minutes, seconds, reportId));

            // memory usage
            Runtime runtime = Runtime.getRuntime();
            long freeMem = runtime.freeMemory();
            long totalMem = runtime.totalMemory();
            long usedMem = totalMem - freeMem;
            double usedMemMB = usedMem / 1048576d;
            double totalMemMB = totalMem / 1048576d;
            double usedMemPercentage = usedMemMB / totalMemMB * 100;
            System.out.println(String.format("Memory usage: %.2fM used (%.1f%% of %.2fM total)", usedMemMB, usedMemPercentage, totalMemMB));

            // crawler counters
            long submitted = CrawlerTest.this.submittedCounter.get();
            long completed = CrawlerTest.this.completedCounter.get();
            long waiting = submitted - completed;
            long distinct = CrawlerTest.this.distinctCounter.get();
            System.out.println(String.format("Crawler counters: %d distinct, " + "%d submitted, %d completed, %d waiting", distinct, submitted, completed, waiting));
        }
    }

The monitor thread starts when the crawler starts and stops when it stops. Sample output:

[18:40:24] Crawler report #28:
Memory usage: 131.77M used (94.1% of 140.00M total)
Crawler counters: 2591 distinct, 345645 submitted, 35213 completed, 310432 waiting
[18:49:54] Crawler report #47:
Memory usage: 115.40M used (83.6% of 138.00M total)
Crawler counters: 4202 distinct, 522416 submitted, 110677 completed, 411739 waiting

This gives a reasonably clear view of the program's memory usage, the number of pages crawled, the length of the accumulated task queue, and so on, and also allows a rough estimate of the crawl speed.
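For example, a rough pages-per-second figure could be derived from the delta of the completed counter between two reports. This is a hypothetical extension of CrawlerMonitor, not something my monitor currently does:

    // extra field in CrawlerMonitor: completed count at the previous report
    private long lastCompleted = 0;

    // at the end of run(), after reading the counters:
    double pagesPerSec = (completed - this.lastCompleted) / 30d; // 30 s between reports
    this.lastCompleted = completed;
    System.out.println(String.format("Crawl speed: %.1f pages/s", pagesPerSec));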

0x03 Performance Tuning (1)

Look at the data the monitor printed earlier:

Crawler counters: 4202 distinct, 522416 submitted, 110677 completed, 411739 waiting
Crawler counters: 7546 distinct, 835781 submitted, 332357 completed, 503424 waiting

The target was a small personal blog (https://www.phodal.com): the whole site comes to a bit over 7000 pages, yet the submitted counter reads over 800,000. In theory, only links meeting certain conditions (i.e. under the phodal.com domain) get submitted to the task queue, so how do 7000-odd pages produce close to a million internal links?
A simple calculation (835781 submitted / 7546 distinct ≈ 110) shows just how heavy the cross-referencing between these pages is: each page carries on average over a hundred in-site links. (This is especially pronounced on blog sites, where every post is indexed several ways over: the front-page listing, multiple tags, the monthly archive; summed up, it's no surprise each post is linked that many times.) And these duplicate links cause two serious problems:

  1. Out of memory. After fixing the PreparedStatement OOM, I ran the program under a profiler and found that the ExecutorService's task queue (the RunnableWrapper objects) accounted for 30% of the heap, most of them duplicates.
  2. Painfully slow. In the final stage of the crawl, when nearly every page on the site has been visited, the distinct counter grows very slowly while the waiting count stays enormous (hundreds of thousands), so the crawler spends a long tail of its runtime re-fetching pages it covered long ago.

Moreover, with only about 7000 pages, assuming an average URL length of 60 characters, the memory footprint is no more than a few hundred KB (7000 × 60 chars × 2 bytes/char ≈ 820 KB of raw character data), so deduplicating with a HashSet is plenty fast: no database needed at all. (Fine, I admit I reached for a database in the first place purely because that initial article misled me. The sane approach is to start in memory and only bring in a database when memory can't cope. A database does have its merits, such as persisting progress or coordinating distributed crawlers, but at this scale none of that is needed yet.)
So I decided to drop the database and rewrite the code.
The first version is rather ugly: it abandons the InverseSemaphore and the ExecutorService in favor of a hand-rolled worker thread (an inner class):

private class CrawlerThread extends Thread {
    // initialized at construction time so that shutdown() is safe to call
    // even before the thread has started running
    private AtomicBoolean shutdownRequested = new AtomicBoolean(false);

    public void shutdown() {
        this.shutdownRequested.set(true);
    }

    @Override
    public void run() {
        while (!this.shutdownRequested.get()) {
            String url = CrawlerTest.this.pollTask();
            if (url == null) {
                // task queue momentarily empty: back off briefly
                try {
                    Thread.sleep(500);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            } else {
                CrawlerTest.this.processPage(url);
            }
        }
    }
}

Inside CrawlerTest (the field definitions sit alongside run() and the other methods):

    private Set<String> crawled;
    private SortedSet<String> queue;
    private PrintWriter outputWriter;

    public void run() throws SQLException, IOException, InterruptedException {
        this.crawled = new HashSet<>();
        this.queue = new TreeSet<>();
        try {
            File outputFile = new File("output.txt");
            FileOutputStream outputStream = new FileOutputStream(outputFile);
            this.outputWriter = new PrintWriter(outputStream);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }

        CrawlerThread[] workers = new CrawlerThread[10];
        for (int i = 0; i < workers.length; ++i)
            workers[i] = new CrawlerThread();

        ScheduledExecutorService monitor = null;
        this.submittedCounter = new AtomicLong(0);
        this.completedCounter = new AtomicLong(0);
        this.distinctCounter = new AtomicLong(0);
        if (this.config.startMonitorThread) {
            this.monitorReportCounter = new AtomicInteger(0);
            monitor = Executors.newSingleThreadScheduledExecutor();
            monitor.scheduleAtFixedRate(new CrawlerMonitor(), 30, 30, TimeUnit.SECONDS);
            System.out.println("Monitor thread started.");
        }

        this.submit(this.startPage());
        System.out.println("Crawler started.");
        for (Thread thread : workers) thread.start();

        while (true) {
            try {
                long completed = this.completedCounter.get();
                long submitted = this.submittedCounter.get();
                if (completed != submitted)
                    Thread.sleep(500);
                else break;
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        for (CrawlerThread thread : workers)
            thread.shutdown();
        for (Thread thread : workers)
            thread.join();

        if (monitor != null) {
            monitor.shutdown();
            monitor.awaitTermination(10, TimeUnit.MINUTES);
            System.out.println("Monitor thread terminated.");
        }

        this.outputWriter.close();

        System.out.println("Crawler terminated.");
    }

    private void submit(String url) {
        if (this.crawled.contains(url))
            return;
        if (this.queue.add(url))
            this.submittedCounter.incrementAndGet();
    }

    private void complete() {
        this.completedCounter.incrementAndGet();
    }

    private void processPage(String url) {
        try {
            // add() returns false if the URL is already present, making the
            // check-and-mark step atomic across worker threads
            if (!this.crawled.add(url)) return;
            this.distinctCounter.incrementAndGet();

            // fetch page
            Document doc = this.parse(url);
            this.outputWriter.println(url + " | " + doc.title());

            // crawl
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                String href = link.attr("abs:href");
                if (this.rules.shouldContinue(url, href))
                    this.submit(href);
            }
        } catch (IOException | IllegalArgumentException e) {
            System.err.println("Unable to fetch url: " + url + " - " + e.getMessage());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            this.complete();
        }
    }
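One method the listing never shows is pollTask(), which the worker threads call. Here is a minimal reconstruction of mine (assuming the queue field above): pop and return the first URL in the queue, or null when it's empty:

    private String pollTask() {
        // first() + remove() is a compound action, so lock the set wrapper
        synchronized (this.queue) {
            if (this.queue.isEmpty()) return null;
            String url = this.queue.first();
            this.queue.remove(url);
            return url;
        }
    }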

After this change, the crawl runs roughly 1.5~3× faster.
(It seems I may have DDoSed the phodal site in the process? Or maybe his gateway banned me. Either way: sorry, sorry.)
(One gripe: 翁天信's personal blog is a trap. Comment-page URLs beyond the last page (currently 52) don't return 404; instead they render a default page that still carries a "next page" link, so the crawler loops forever...)
Incidentally, my alma mater SFLS (上外附中) makes a nice site for a beginner (like me) to crawl: the full crawl of the www.sfls.cn domain took two minutes on my machine, covering 5592 pure-HTML pages (apparently all static, hence the speed).
Console output:

Monitor thread started.
Crawler started.
[00:55:21] Crawler report #1:
Memory usage: 17.63M used (2.9% of 601.50M total)
Crawler counters: 1938 distinct, 2000 submitted, 1922 completed, 78 waiting
[00:55:51] Crawler report #2:
Memory usage: 39.20M used (6.9% of 572.00M total)
Crawler counters: 4118 distinct, 4182 submitted, 4102 completed, 80 waiting
[00:56:21] Crawler report #3:
Memory usage: 42.82M used (7.9% of 543.00M total)
Crawler counters: 5949 distinct, 5999 submitted, 5933 completed, 66 waiting
Monitor thread terminated.
Crawler terminated.

Process finished with exit code 0

Summary

Another day, another session running until 1 a.m. I'll continue the write-up tomorrow.
Web crawling really is a field worth learning and researching; after these few days I can only say I've scratched the surface and barely made it through the door. As for the many, many other things crawlers can do, that remains for further study and discovery.
As the saying goes: live and learn.
(Calling it here; I'm dead tired.)
