Niocchi - Java crawl library implementing synchronous I/O multiplexing

Niocchi is a Java crawler library implementing synchronous I/O multiplexing.
This type of implementation allows crawling tens of thousands of hosts in parallel on a single low-end server. Niocchi has been designed for big search engines that need to crawl massive amounts of data, but it can also be used to write no-frills crawlers. It is currently used in production by Enormo and Vitalprix.

javadoc

Index

  1. Introduction
  2. Requirements
  3. License
  4. Package organization
  5. Architecture
  6. Usage
  7. Caveats
  8. To Do
  9. Download
  10. Change history
  11. About the authors

Introduction

Most Java crawling libraries use the standard Java IO package.
That means crawling N documents in parallel requires at least N running
threads. Even if each thread consumes few resources while fetching
content, this approach becomes costly when crawling at large scale. In
contrast, synchronous I/O multiplexing with the NIO package introduced
in Java 1.4 allows many documents to be crawled in parallel from a
single thread, as the sketch below illustrates.
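As an illustration only, here is a minimal, self-contained example of that mechanism using the standard java.nio Selector. It is not Niocchi code; the hosts and the bare HTTP HEAD requests are placeholders, and error handling, redirects and HTTPS are omitted.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;
    import java.nio.charset.StandardCharsets;
    import java.util.Iterator;

    public class MultiplexedFetch {
        public static void main(String[] args) throws IOException {
            String[] hosts = { "example.com", "example.org", "example.net" };
            Selector selector = Selector.open();

            // Register one non-blocking connection per host; no extra threads are created.
            for (String host : hosts) {
                SocketChannel channel = SocketChannel.open();
                channel.configureBlocking(false);
                channel.connect(new InetSocketAddress(host, 80));
                channel.register(selector, SelectionKey.OP_CONNECT, host);
            }

            ByteBuffer buffer = ByteBuffer.allocate(8192);
            int open = hosts.length;

            // A single thread services every connection as it becomes ready.
            while (open > 0) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel channel = (SocketChannel) key.channel();
                    String host = (String) key.attachment();
                    if (key.isConnectable() && channel.finishConnect()) {
                        // Connected: send a minimal HTTP request, then wait for data.
                        String request = "HEAD / HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
                        channel.write(ByteBuffer.wrap(request.getBytes(StandardCharsets.US_ASCII)));
                        key.interestOps(SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buffer.clear();
                        int n = channel.read(buffer);
                        if (n == -1) {
                            channel.close();
                            open--;
                        } else {
                            System.out.println(host + ": read " + n + " bytes");
                        }
                    }
                }
            }
        }
    }

With N hosts this stays at one thread, whereas a blocking-IO crawler would need roughly N threads to achieve the same parallelism.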

Requirements

Niocchi requires Java 1.5 or above.

License

This software is licensed under the Apache license version 2.0.

Package organization

  • org.niocchi.core holds the library itself.
  • org.niocchi.gc holds an implementation example of a very simple crawler that reads the URL to crawl from a file and saves the crawled documents.
  • org.niocchi.monitor holds a utility thread that can be used by the crawler to provide real-time information through a telnet connection.
  • org.niocchi.rc holds an implementation example of a RedirectionController.
  • org.niocchi.resources holds a few implementation examples of the Resource and ResourceCreator classes.
  • org.niocchi.urlpools holds a few implementation examples of the URLPool class.

Architecture

  • A Query encapsulates a URL and implements methods to check its
    status after being crawled.
  • A Resource holds the crawled content and implements methods to
    save it.
  • Each Query is associated with a Resource. To crawl one URL, one
    Resource needs to be taken from the pool of resources. Once the URL is
    crawled and its content processed, the Resource is returned to the
    pool. The number of available Resources is fixed and controls how many
    URLs can be crawled in parallel at any time. This number is set through
    the ResourcePool constructor.
  • When a Query is crawled, its associated Resource will be
    processed by one of the workers.
  • The URLPool acts as a source of URLs to crawl into which the
    crawler taps. It is an interface that must be implemented to provide
    URLs to the crawler.
  • The crawler has been designed as "active", meaning it consumes
    URLs from the URLPool, as opposed to being "passive" and waiting to be
    given URLs. When the crawler starts, it gets URLs to crawl from the
    URLPool until all resources are consumed, hasNextQuery() returns false
    or getNextQuery() returns null. Each time a Query is crawled and
    processed and its Resource returned to the ResourcePool, the crawler
    requests more URLs to crawl from the URLPool until all resources are
    consumed, hasNextQuery() returns false or getNextQuery() returns null.
    If all URLs have been crawled and no more are immediately available to
    crawl, the crawler rechecks every second for available URLs to
    crawl.
  • When a Query has been crawled, it is put into a FIFO of queries
    to be processed. One of the Workers takes it and processes the
    content of its associated Resource. The work is done in the
    processResource() method. The Query is then returned to the URLPool,
    which can examine the crawl status and the result of the processing.
    Lastly, the Query's associated Resource is returned to the Resource
    pool. The sketch after this list illustrates this flow.
  • In order not to block during host name resolution, the Crawler
    uses two additional threads. ResolverQueue resolves the URLs coming from
    the URLPool and RedirectionResolverQueue resolves the URLs obtained from
    redirections.
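
The following self-contained sketch mirrors the crawl/process flow described above using local stand-in classes. It is not the Niocchi API: the real Query, Resource, ResourcePool, URLPool and Worker classes live in org.niocchi.core and have their own signatures. Here a Semaphore plays the role of the resource pool and a BlockingQueue plays the role of the processing FIFO.

    import java.util.Arrays;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Semaphore;

    public class CrawlFlowSketch {

        // Local stand-in; NOT org.niocchi.core.Query.
        static class Query {
            final String url;
            String content;                         // filled by the "crawler"
            Query(String url) { this.url = url; }
        }

        public static void main(String[] args) throws InterruptedException {
            // Fixed number of "Resources": at most 2 URLs in flight at any time.
            final Semaphore resourcePool = new Semaphore(2);

            // FIFO of crawled queries waiting to be processed by a worker.
            final BlockingQueue<Query> crawledQueue = new ArrayBlockingQueue<Query>(16);

            // One worker thread plays the role of a Worker calling processResource().
            Thread worker = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            Query query = crawledQueue.take();
                            if (query.url == null) break;           // poison pill: shut down
                            System.out.println("processed " + query.url + " -> " + query.content);
                            resourcePool.release();                 // return the "Resource" to the pool
                        }
                    } catch (InterruptedException ignored) { }
                }
            });
            worker.start();

            // The "URLPool": the crawler actively consumes URLs from it.
            for (String url : Arrays.asList("http://a.example/", "http://b.example/")) {
                resourcePool.acquire();                             // take a "Resource" before crawling
                Query query = new Query(url);
                query.content = "<html>stub for " + url + "</html>"; // stand-in for the actual fetch
                crawledQueue.put(query);                            // hand over to the worker FIFO
            }

            crawledQueue.put(new Query(null));                      // stop the worker
            worker.join();
        }
    }

The Semaphore bound is what caps how many URLs are in flight at once, just as the ResourcePool size does in the library.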