Notes on a Code Design for Multithreaded File Reading and URL Crawling

Latest recommended article published on 2022-03-18 10:58:11

zhiwei0701

Category: Crawler Design. Tags: crawler design

Article link: https://blog.csdn.net/tao_jiayun/article/details/89943811


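I recently joined a new company, and in the first two weeks I took in several times more than I usually do, so it is worth writing some of it down.
My first task was to pull tens of millions of URLs from a HAWQ table into a file and have a program crawl each URL. What gets crawled is simple: article titles, summaries and the like. The hard part is the crawl speed for tens of millions of URLs and how the file is read: the program has to be fast while keeping the crawl success rate steadily above 80%.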

Design approach

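A few points need to be clear up front:

Tens of millions of URLs cannot all be read from the file into memory at once.
The design has three steps: (after the URLs are exported from HAWQ to a file) read the file -> crawl the URLs -> write the results to a file (which is later loaded back into HAWQ). Put simply, the three steps run on three different kinds of threads that depend on one another.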
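The three-stage shape can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `PipelineSketch`, the `EOF` marker, and the `"fetched:"` stand-in for crawling are all hypothetical, and the end-of-stream marker (a "poison pill") is a swapped-in technique; the code in this article uses boolean flags instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class PipelineSketch {
    // Hypothetical end-of-stream marker; the article's code uses flags instead.
    private static final String EOF = "\u0000EOF";

    static List<String> run(List<String> inputLines) {
        BlockingQueue<String> readQueue = new LinkedBlockingQueue<>(100);
        BlockingQueue<String> writeQueue = new LinkedBlockingQueue<>(100);
        List<String> output = new ArrayList<>();

        // Stage 1: stands in for reading the input file line by line.
        Thread reader = new Thread(() -> {
            try {
                for (String line : inputLines) {
                    readQueue.put(line);
                }
                readQueue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: stands in for crawling; here it only tags the line.
        Thread crawler = new Thread(() -> {
            try {
                String line;
                while (!(line = readQueue.take()).equals(EOF)) {
                    writeQueue.put("fetched:" + line);
                }
                writeQueue.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: stands in for writing the output file.
        Thread writer = new Thread(() -> {
            try {
                String item;
                while (!(item = writeQueue.take()).equals(EOF)) {
                    output.add(item);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        crawler.start();
        writer.start();
        try {
            reader.join();
            crawler.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("u1", "u2", "u3")));
    }
}
```

With a single crawler stage the order of the input is preserved; with several crawler threads, as in the real program, the output order is no longer guaranteed.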

Implementation

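The code uses LinkedBlockingQueue as the buffer queue for both file reading and file writing. Records are read from the file 100 at a time; once read, they are picked up by the crawler threads, and the file is written while crawling is still in progress.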
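How the queue keeps memory bounded can be shown with a small sketch. It assumes the queue is created with a capacity (100 here, matching the batch size mentioned above); the queue declaration is not shown in this article, so the capacity, the class name `BoundedQueueSketch`, and the `"url-"` items are assumptions. With a bounded queue, `put()` blocks while the queue is full, so the reader can never get more than `capacity` items ahead of the consumers.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class BoundedQueueSketch {
    /** Returns {number of items consumed, largest queue size observed by the producer}. */
    static int[] run(int total, int capacity) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);
        AtomicInteger maxSeen = new AtomicInteger();
        AtomicInteger consumed = new AtomicInteger();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    queue.put("url-" + i); // blocks while the queue is full
                    maxSeen.accumulateAndGet(queue.size(), Math::max);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < total; i++) {
                    queue.take(); // blocks while the queue is empty
                    consumed.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {consumed.get(), maxSeen.get()};
    }

    public static void main(String[] args) {
        int[] result = run(10_000, 100);
        System.out.println("consumed=" + result[0] + ", max queue size seen=" + result[1]);
    }
}
```

However large `total` is, the observed queue size never exceeds the capacity, which is what makes it safe to stream a file of tens of millions of lines.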
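The number of crawler threads is configurable, but every one of them must exit cleanly once crawling is done, so a reliable mechanism is needed. A CountDownLatch serves as the synchronization tool here: right after the program starts, a waiting thread calls await() and blocks; each crawler thread calls countDown() when it finishes, decrementing the count by one. When all crawler threads have done so, the count reaches 0, await() returns, and the waiting thread resumes its work.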
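The latch handshake described above looks roughly like this in isolation. It is a minimal sketch: the class name `LatchSketch` and the counter that stands in for the crawl loop are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class LatchSketch {
    /** Starts the given number of workers and returns how many finished before await() returned. */
    static int run(int workers) {
        CountDownLatch latch = new CountDownLatch(workers);
        AtomicInteger finished = new AtomicInteger();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    finished.incrementAndGet(); // stand-in for the crawl loop
                } finally {
                    latch.countDown(); // one decrement per worker
                }
            }).start();
        }

        try {
            latch.await(); // blocks until the count reaches 0
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished.get();
    }

    public static void main(String[] args) {
        System.out.println("workers finished: " + run(8));
    }
}
```

Because every `countDown()` happens before `await()` returns, the waiting thread is guaranteed to see the work of all workers once it resumes.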
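File reading, the crawler threads, and file writing all need a safe exit mechanism. File reading and file writing each have their own flag (the flags must be volatile, or otherwise safely published, so that other threads see the update). The writer may exit only when three conditions all hold: the file has been fully read, every crawler thread has exited, and the queue is empty. The concrete implementation follows. The way of thinking behind it is worth recording and learning from: be rigorous and consider every case.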
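The exit condition for the consuming side can be sketched on its own. The names here (`DrainLoopSketch`, `producerDone`, the `"row-"` items) are hypothetical, and an `AtomicBoolean` is used in place of a volatile field purely so the example fits in one method. The key point is the order: the producer sets the flag only after its last `put()`, so once the consumer sees the flag set and the queue empty, nothing can be lost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class DrainLoopSketch {
    static List<String> run(int items) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        AtomicBoolean producerDone = new AtomicBoolean(false);
        List<String> written = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < items; i++) {
                    queue.put("row-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                producerDone.set(true); // set only after the last put()
            }
        });

        Thread writer = new Thread(() -> {
            try {
                // Keep going while more data may arrive or data is still queued.
                while (!producerDone.get() || !queue.isEmpty()) {
                    // Timed poll: wakes up periodically to re-check the flag
                    // without spinning on an empty queue.
                    String row = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (row != null) {
                        written.add(row);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        writer.start();
        try {
            producer.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println("rows written: " + run(1000).size());
    }
}
```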
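One point deserves special attention: every exception must be either handled or rethrown. Otherwise a thread may die abnormally without ever calling countDown(), the latch never reaches 0, and the main thread can never exit.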
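One common way to guard against this is to put `countDown()` in a `finally` block, so the latch is released even when a worker fails. The sketch below simulates failures in half of the workers; the class name `FailSafeWorkerSketch`, the simulated exception, and the 5-second timeout are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class FailSafeWorkerSketch {
    /** Returns true if every worker counted down, including the ones that failed. */
    static boolean allWorkersReported(int workers) {
        CountDownLatch latch = new CountDownLatch(workers);

        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    if (id % 2 == 0) {
                        // Simulated failure, e.g. an unexpected runtime error while crawling.
                        throw new IllegalStateException("simulated failure in worker " + id);
                    }
                } catch (RuntimeException e) {
                    System.err.println("worker failed: " + e.getMessage());
                } finally {
                    latch.countDown(); // runs on success and on failure
                }
            }).start();
        }

        try {
            // A timeout keeps the caller from waiting forever if something still goes wrong.
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("all workers reported: " + allWorkersReported(4));
    }
}
```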

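File-reading thread:

    /**
     * File-reading thread: streams the input file line by line into urlReadQueue.
     *
     * @return thread
     */
    private Thread readFileThread() {
        return new Thread(() -> {
            File file = new File(inputFile);
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String tempString;
                while ((tempString = reader.readLine()) != null) {
                    // put() blocks if the queue is bounded and currently full
                    urlReadQueue.put(tempString);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                LOG.warn("Interrupted while reading file {}", inputFile, e);
            } catch (IOException e) {
                LOG.warn("Failed to read file {}", inputFile, e);
            } finally {
                // always set the flag; otherwise the crawler threads would wait forever
                readFileFinished = true;
            }
        });
    }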

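Crawler thread:

    /**
     * Crawler thread: takes lines from urlReadQueue, downloads each URL and
     * puts the page content into urlWriteQueue.
     * Besides the HttpClient 4.x and Jsoup imports, this needs
     * org.apache.http.util.EntityUtils and java.util.concurrent.TimeUnit.
     *
     * @return thread
     */
    private Thread crawlerThread() {
        return new Thread(() -> {
            RequestConfig requestConfig = RequestConfig.custom()
                    .setSocketTimeout(1000)
                    .setConnectionRequestTimeout(1000)
                    .setConnectTimeout(1000).build();
            // try-with-resources closes the client when the thread finishes
            try (CloseableHttpClient client = HttpClients.custom()
                    .setDefaultSocketConfig(SocketConfig
                            .custom()
                            .setSoTimeout(2000)
                            .build())
                    .setDefaultRequestConfig(requestConfig)
                    .build()) {
                // keep going while the file is still being read or the read queue is not empty
                while (!readFileFinished || !urlReadQueue.isEmpty()) {
                    // timed poll avoids spinning on an empty queue
                    String readString = urlReadQueue.poll(100, TimeUnit.MILLISECONDS);
                    if (readString == null) continue;
                    // each line is tab-separated; the URL is the second field
                    String[] fields = readString.split("\t");
                    if (fields.length < 2) {
                        LOG.info("Failed to split url {}", readString);
                        continue;
                    }
                    String crawlUrl = fields[1];
                    try {
                        HttpGet get = new HttpGet(crawlUrl);
                        get.setConfig(requestConfig);
                        try (CloseableHttpResponse response = client.execute(get)) {
                            // read the response body as a string (UTF-8 if no charset is declared)
                            String content = EntityUtils.toString(response.getEntity(), "UTF-8");
                            // parse the HTML; title, summary, etc. can be extracted from the document here
                            Document document = Jsoup.parse(content);
                            urlWriteQueue.put(content);
                            LOG.info("result is {}", content);
                        }
                    } catch (InterruptedException e) {
                        throw e; // handled by the outer catch
                    } catch (Exception e) {
                        // a single failed URL must not kill the thread
                        LOG.info("Download url {} failed {}", crawlUrl, e.getMessage());
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                LOG.info("Crawler thread interrupted", e);
            } catch (IOException e) {
                LOG.warn("Failed to close http client", e);
            } finally {
                // decrement the latch when the thread finishes, even after a failure
                LOG.info("Thread exit");
                countDownLatch.countDown();
            }
        });
    }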

File-writing thread:

    /**
     * File-writing thread: drains urlWriteQueue into the output file.
     * Needs java.util.concurrent.TimeUnit for the timed poll.
     *
     * @return thread
     */
    private Thread writeFileThread() {
        return new Thread(() -> {
            File file = new File(outputFile);
            // FileWriter creates the file if it does not exist yet
            try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(file.getAbsoluteFile()))) {
                // exit only after all crawler threads have exited and the write queue is empty
                while (!crawlerThreadExit || !urlWriteQueue.isEmpty()) {
                    // timed poll avoids spinning on an empty queue
                    String content = urlWriteQueue.poll(100, TimeUnit.MILLISECONDS);
                    if (content != null) {
                        bufferedWriter.write(content);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                LOG.warn("Writer thread interrupted", e);
            } catch (IOException e) {
                LOG.warn("Write file failed", e);
            }
        });
    }

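Thread that waits for all the other threads to finish:

    /**
     * Waits for all crawler threads to finish, then signals the writer.
     *
     * @return thread
     */
    private Thread waitThread() {
        return new Thread(() -> {
            try {
                // blocks until every crawler thread has called countDown()
                countDownLatch.await();
                crawlerThreadExit = true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                LOG.info("Interrupt!", e);
            }
        });
    }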
