最近一直在写爬虫,于是写了以下一个简单的线程池。在这,抛砖引玉,忘大家多多指点。
项目开始初,在查阅javaeye论坛中,曾看到一句这样的提示性设计:
线程是被动,由别人(监工)分配,激发
有一个主线程--“监工”:负责查看任务队列中是否有任务,如果有,取出一个任务,设置到一个“空闲”的线程中,并notify该线程。(http://www.iteye.com/topic/104432)
在这,我也运用这个思想,详见下面的具体实现。
先来看池程池类:
package jk.spider.core.task.threading;
import jk.spider.core.task.WorkerTask;
import jk.spider.core.task.dispatch.DispatcherTask;
import org.apache.log4j.Logger;
/**
* 线程池
* @author kqy
* @date 2008-12-30
* @version 2.0
*/
public class WorkerThreadPool extends ThreadGroup {
protected static final Logger log = Logger.getLogger(WorkerThreadPool.class);
protected DispatcherThread dispatcherThread;
protected WorkerThread[] pool;
protected int poolSize;
public WorkerThreadPool(String poolName, String threadName, int poolSize) {
super(poolName);
this.poolSize = poolSize;
//事件分发线程
dispatcherThread = new DispatcherThread(this, threadName + " dispathcer");
pool = new WorkerThread[poolSize];
for (int i = 0; i < poolSize; i++) {
pool[i] = new WorkerThread(this, threadName, i);
synchronized (this) {
try {
//启动线程,随即让其休眠
pool[i].start();
wait();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
}
/**
* 分配工作任务至线程池,线程池将选择工作线程执行任务
* 首先检查池程池中是否有空闲的工作线程,
* 如果存在,唤醒该线程; 否,休眠,直到有空闲线程时唤醒
* @param task
*/
public synchronized void assign(WorkerTask task) {
while (true) {
for (int i = 0; i < poolSize; i++) {
if (pool[i].isAvailable()) { //判断该线程是否可以分配任务
pool[i].assign(task); //唤醒该线程
return;
}
}
try {
wait();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
/**
* 分配分发任务至线程池
*
* @param task
*/
public void assignGroupTask(DispatcherTask task) {
dispatcherThread.assign(task);
}
/**
* 关闭线程池
*
*/
public void stopAll() {
for (int i = 0; i < pool.length; i++) {
WorkerThread thread = pool[i];
thread.stopRunning();
}
}
public int getSize() {
return poolSize;
}
}
下面再看工作线程:
package jk.spider.core.task.threading;
import jk.spider.core.task.WorkerTask;
import org.apache.log4j.Logger;
/**
* 工作线程
* @author kqy
* @date 2008-12-30
* @version 2.0
*/
public class WorkerThread extends Thread {
protected static final Logger log = Logger.getLogger(WorkerThread.class);
//空闲的线程
public static final int WORKERTHREAD_IDLE = 0;
//阻塞的
public static final int WORKERTHREAD_BLOCKED = 1;
//繁忙的
public static final int WORKERHTREAD_BUSY = 2;
protected int state;
//是否有工作任务分配至该线程
protected boolean assigned;
//该线程是否被激活,正在运行
protected boolean running;
protected WorkerThreadPool pool;
protected WorkerTask task;
public WorkerThread(WorkerThreadPool pool, String name, int i) {
super(pool, name + " " + i);
this.pool = pool;
running = false;
assigned = false;
state = WORKERTHREAD_IDLE;
}
/**
* 判断是否还可分配任务至该线程
* @return
*/
public boolean isAvailable() {
return (!assigned) && running;
}
public boolean isOccupied() {
return assigned;
}
/**
* 分配一个新的任务,并告知这个线程不接受任何新的任务
* @param task
*/
public synchronized void assign(WorkerTask task) {
if(!running) {
throw new RuntimeException("THREAD NOT RUNNING, CANNOT ASSIGN TASK !!!");
}
if(assigned) {
throw new RuntimeException("THREAD ALREADY ASSIGNED !!!");
}
this.task = task;
assigned = true;
notify();
}
public int getStates() {
return state;
}
public synchronized void run() {
running = true;
log.info("Worker thread ( " + this.getName() + " ) born...");
synchronized(pool) {
pool.notify(); //唤醒下一个线程
}
while(running) {
if(assigned) {
state = WORKERTHREAD_BLOCKED;
task.prepare(); //前期准备
state = WORKERHTREAD_BUSY;
try {
task.execute();//执行任务
} catch (Exception e) {
log.fatal("PANIC! Task " + task + " threw an excpetion!", e);
}
synchronized(pool) {
assigned = false;
task = null;
state = WORKERTHREAD_IDLE;
pool.notify(); //唤醒池程池中待分配的任务
this.notify();
}
}
try {
wait();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
log.info("Worker thread (" + this.getName() + ") dying");
}
/**
* 关闭所有线程
*/
public synchronized void stopRunning() {
if( !running ) {
throw new RuntimeException ("THREAD NOT RUNNING - CANNOT STOP !");
}
if ( assigned ) {
try {
this.wait();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
running = false;
notify();
}
}
现还缺一个监工,负责分配,激发线程,这个就很简单的,主要是调用WorkerThreadPool 中的assign方法。
package jk.spider.core.task.threading;
import jk.spider.core.task.dispatch.DispatcherTask;
public class DispatcherThread extends Thread {
protected DispatcherTask task;
public DispatcherThread(ThreadGroup group, String name) {
super(group, name);
}
public void assign(DispatcherTask task) {
this.task = task;
start();
}
public void run() {
synchronized(task) {
task.execute();
task.notify();
}
}
}
DispatcherTask 类,抽象类,因为在爬虫,我设计了任务分发器,都是从DispatcherTask中继承下来,在这,具体看负责分发抓取的任务分发类。
package jk.spider.core.task.dispatch;
import jk.spider.core.SpiderController;
import jk.spider.core.task.Task;
import jk.spider.core.task.WorkerTask;
import jk.spider.core.task.threading.WorkerThreadPool;
public abstract class DispatcherTask implements WorkerTask {
protected SpiderController controller;
protected WorkerThreadPool pool;
protected boolean running;
public DispatcherTask(SpiderController controller, WorkerThreadPool pool) {
this.controller = controller;
this.pool = pool;
this.running = true;
}
public SpiderController getSpiderController() {
return this.controller;
}
public void shutdown() {
this.running = false;
}
}
DispatchSpiderTasks.java 具体分发任务,唤醒线程,化被动通知线程池执行为主动激发线程池中线程处理
package jk.spider.core.task.dispatch;
import jk.spider.core.SpiderController;
import jk.spider.core.task.WorkerTask;
import jk.spider.core.task.threading.WorkerThreadPool;
import org.apache.log4j.Logger;
public class DispatchSpiderTasks extends DispatcherTask {
protected static final Logger log = Logger.getLogger(DispatchSpiderTasks.class);
protected WorkerThreadPool spiders;
public DispatchSpiderTasks(SpiderController controller, WorkerThreadPool spiders) {
super(controller, spiders);
}
public int getType() {
return WorkerTask.WORKERTASK_SPIDERTASK;
}
public void prepare() { }
public void execute() {
log.info("Spider task dispatcher running ...");
while(running) {
try {
//从Scheduler中得到一个任务,队列是使用JDK1.5中的LinkedBlockingQueue
spiders.assign(controller.getContent().getSpiderTask());
} catch (InterruptedException e) {
log.warn("DispatchSpiderTasks InterruptedException -> ", e);
running = false;
}
}
log.info("Spider task dispatcher dying ...");
}
}
Task.java 这个类,大家一看就会明白,我就不细说了。
package jk.spider.core.task;
/**
* Spider Task 提供统一接口,并将任务添加至Scheduler
* 由线程池从Scheduler中取任务
* @author kqy
*
*/
public interface Task {
/**
* 执行任务,线程池将会调用该方法执行任务
*/
public void execute();
}
写了一大篇幅,而现在JDK1.5以上都提供了线程池的实现,使用起来更加方法,但为了更扎实自己对线程的理解,在参考了一些资料自己现实该简单的线程池,收获不少,不再被其中的wait(),notify(),notifyAll()搞得晕头转向。
实践才是真理啊...
经过一段时间,爬虫也已完工。其中慢慢的不断改善,已具体爬虫必备的良好扩展及可配置。
现回顾,总结,又是一次学习,发现其中很多的设计是那么的傻。
水平未到家,继续需努力...