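When using multiple threads, the question is how many to start. I usually decide by looking at the number of CPU cores, the resource each stage consumes, and whether that stage is the bottleneck.
The example below roughly illustrates my thinking.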
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Demo {
BlockingQueue<String> urlQueue = new ArrayBlockingQueue<String>(1024);
BlockingQueue<Html> htmlQueue = new ArrayBlockingQueue<Html>(1024);
BlockingQueue<Meta> metaQueue = new ArrayBlockingQueue<Meta>(1024);
public void execute() throws InterruptedException {
new Thread(new QueryThread()).start();
Thread[] spiders = new Thread[5];
for (int x = 0; x < spiders.length; x++) {
spiders[x] = new Thread(new SpiderThread());
spiders[x].start();
}
Thread[] parsers = new Thread[5];
for (int x = 0; x < parsers.length; x++) {
parsers[x] = new Thread(new ParserThread());
parsers[x].start();
}
Thread[] writers = new Thread[3];
for (int x = 0; x < writers.length; x++) {
writers[x] = new Thread(new WriteThread());
writers[x].start();
}
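// wait for the Spider threads to finish
for (int x = 0; x < spiders.length; x++) {
spiders[x].join();
}
// put the end signal into the htmlQueue communication queue
putEmptySingeleToHtmlQueue();
// wait for the Parser threads to finish
for (int x = 0; x < parsers.length; x++) {
parsers[x].join();
}
// put the end signal into the metaQueue communication queue
putEmptySingeleToMetaQueue();
// wait for the Writer threads to finish
for (int x = 0; x < writers.length; x++) {
writers[x].join();
}
// all Writer threads have finished, so the program is done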
}
private void putEmptySingeleToMetaQueue() throws InterruptedException {
Meta meta = new Meta();
meta.setEmpty(true);
metaQueue.put(meta);
}
private void putEmptySingeleToHtmlQueue() throws InterruptedException {
Html empty = new Html();
empty.setEmpty(true);
htmlQueue.put(empty);
}
class QueryThread implements Runnable{
@Override
public void run() {
try {
String url = null;
while ((url = getUrl()) != null) {
if (url.length() == 0) {
continue;
}
urlQueue.put(url);
}
urlQueue.put("");
} catch (InterruptedException e) {
// restore the interrupt flag instead of swallowing the interruption
Thread.currentThread().interrupt();
}
}
private String getUrl() {
// placeholder: return the next URL to crawl, or null when none are left
return null;
}
}
class SpiderThread implements Runnable {
@Override
public void run() {
try {
while (true) {
String url = urlQueue.take();
if (url.length() == 0) {
// got the end signal: put it back and stop this thread,
// so that the other spider threads can receive it too
urlQueue.put(url);
break;
}
Html html = crawl(url);
if (html == null) {
// crawl failed: handle or log the failure, then move on to the next URL
continue;
}
htmlQueue.put(html);
}
} catch (InterruptedException e) {
// restore the interrupt flag instead of swallowing the interruption
Thread.currentThread().interrupt();
}
}
private Html crawl(String url) {
// placeholder: download the page, or return null if the download fails
return null;
}
}
class ParserThread implements Runnable {
@Override
public void run() {
try {
while (true) {
Html take = htmlQueue.take();
if (take.isEmpty()) {
htmlQueue.put(take);
break;
}
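Meta meta = translate(take);
if (meta == null) {
// parse failed: skip this page, because BlockingQueue rejects null elements
continue;
}
metaQueue.put(meta);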
}
} catch (InterruptedException e) {
// restore the interrupt flag instead of swallowing the interruption
Thread.currentThread().interrupt();
}
}
private Meta translate(Html take) {
//parse data
return null;
}
}
class WriteThread implements Runnable {
@Override
public void run() {
try {
while (true) {
Meta take = metaQueue.take();
if (take.isEmpty()) {
metaQueue.put(take);
break;
}
write(take);
}
} catch (InterruptedException e) {
// restore the interrupt flag instead of swallowing the interruption
Thread.currentThread().interrupt();
}
}
private void write(Meta take) {
// write data
}
}
class Html {
private boolean empty;
public boolean isEmpty() {
return empty;
}
public void setEmpty(boolean empty) {
this.empty = empty;
}
}
class Meta {
private boolean empty;
public boolean isEmpty() {
return empty;
}
public void setEmpty(boolean empty) {
this.empty = empty;
}
}
}
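This is a simple crawler that fetches and parses pages. It has four main stages: 1. reading the URLs, 2. crawling the pages, 3. parsing the data, 4. writing the data to disk.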
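The shutdown handshake used in the listing above (a worker that takes the end signal puts it back before exiting, so that its siblings see it too) can be isolated in a small sketch. The class name `EndSignalDemo` and the method `runWorkers` are hypothetical names introduced only for this illustration, and the empty string plays the role of the end signal, as it does in `urlQueue`.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the original Demo listing.
final class EndSignalDemo {
    // The empty string is the end signal, so real items must be non-empty.
    static final String END = "";

    /** Starts workerCount consumers, feeds them the items, and returns how many items were processed. */
    static int runWorkers(int workerCount, String... items) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(items.length + 1);
        AtomicInteger processed = new AtomicInteger();

        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String item = queue.take();
                        if (item.isEmpty()) {
                            // Put the end signal back so the next worker can see it, then stop.
                            queue.put(item);
                            return;
                        }
                        processed.incrementAndGet(); // stands in for the real work
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }

        try {
            for (String item : items) {
                queue.put(item);
            }
            queue.put(END); // a single end signal is enough for all workers
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(3, "a", "b", "c", "d")); // prints 4
    }
}
```

Because the queue is FIFO and the end signal is enqueued last, every real item is taken before any worker sees the signal. Only one signal ever circulates, which is why the producer does not need to know how many consumers exist.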
First I work out which resource each stage consumes.
1. Reading the URLs: the resource consumed is the disk (read throughput). This stage will not be the bottleneck.
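2. Crawling the pages: the resource consumed is the network. Compared with stage 1 this is a major bottleneck, which is why a single thread is enough for stage 1. As for how many threads to start for crawling itself, more is generally better, but you also have to consider how much load the target site can handle and raise the count in moderation.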
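3. Parsing the data: the resource consumed is CPU, so here I look at the number of CPU cores and usually start as many threads as there are cores. Suppose we start two more. With more threads than cores, two parser threads will inevitably compete for the same core. Because parsing is CPU-bound, some parser thread is then always waiting for a core, and the extra context switching lowers efficiency. So for CPU-bound work, try to keep the thread count from exceeding the number of CPU cores.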
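4. Writing the data to disk: the resource consumed is the disk (write throughput). The thread count here depends on how fast the crawling stage produces data. In general I would not start many threads, because disk write speed is itself a bottleneck and extra threads do little to improve efficiency.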
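The sizing reasoning in the list above can be sketched in code. This is only an illustration under stated assumptions: `ThreadSizing` and its methods are hypothetical names, the `cores * (1 + wait/compute)` estimate for I/O-bound stages is a common rule of thumb rather than something taken from the listing, and the wait ratio and per-site cap are made-up figures you would have to measure yourself.

```java
// Hypothetical helper, not part of the original Demo listing.
final class ThreadSizing {
    private ThreadSizing() {}

    /** CPU-bound stage (parsing): one thread per core, and never fewer than one. */
    static int cpuBoundThreads(int cores) {
        return Math.max(1, cores);
    }

    /**
     * I/O-bound stage (crawling): threads spend most of their time waiting on the network,
     * so more threads than cores are useful. The estimate cores * (1 + wait/compute) is a
     * rule of thumb (assumption), capped by what the target site can tolerate.
     */
    static int ioBoundThreads(int cores, double waitToComputeRatio, int maxAllowedBySite) {
        int estimate = (int) Math.ceil(cores * (1 + waitToComputeRatio));
        return Math.max(1, Math.min(estimate, maxAllowedBySite));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("parser threads: " + cpuBoundThreads(cores));
        // Assumed figures: about 9x more waiting than computing, site tolerates 32 connections.
        System.out.println("spider threads: " + ioBoundThreads(cores, 9.0, 32));
    }
}
```

The resulting numbers could replace the hard-coded array sizes `5`, `5` and `3` in `execute()`. The writer count is left out on purpose, since it depends on the measured disk throughput rather than on the core count.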