1 To stop larbin from crawling external sites (anything outside the sites in the startURL list), enable noExternalLinks in larbin.conf.
# do you want to follow external links?
noExternalLinks
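
As a rough illustration of what this restriction amounts to, the sketch below keeps a link only if its host is one of the start hosts. This is not larbin's own code: hostOf, isInternalLink and the example hosts are hypothetical names made up for this note, and the parsing is deliberately simplistic (no user-info, no case folding).

```cpp
#include <set>
#include <string>

// Hypothetical helper: extract the host part of an absolute URL such as
// "http://example.org:8080/path". Not part of larbin.
std::string hostOf(const std::string &url) {
    std::size_t start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    // The host ends at the first port, path, query or fragment delimiter.
    std::size_t end = url.find_first_of("/:?#", start);
    return url.substr(start, end == std::string::npos ? std::string::npos
                                                      : end - start);
}

// Hypothetical filter: true if the link stays on one of the start hosts,
// which is the behaviour this note attributes to noExternalLinks.
bool isInternalLink(const std::string &url,
                    const std::set<std::string> &startHosts) {
    return startHosts.count(hostOf(url)) > 0;
}
```

With startHosts = {"example.org"}, a link to http://example.org/a/b would be kept and a link to http://other.example.com/ would be dropped.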
2 How to improve the crawling speed
[Quoted from] http://hi.baidu.com/hustwk/blog/item/fd3325dde12598dc8c1029ef.html
1. Set waitDuration to 1 in larbin.conf. This gives up on being polite ^_^, but most sites can in fact tolerate a value of 1;
2. Change maxUrlsBySite in types.h to 254;
3. Modify the code in main.cc as follows:
// see if we should read again urls in fifowait
if ((global::now % 30) == 0) {
    global::readPriorityWait = global::URLsPriorityWait->getLength();
    global::readWait = global::URLsDiskWait->getLength();
}
if ((global::now % 30) == 15) {
    global::readPriorityWait = 0;
    global::readWait = 0;
}
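
The timing in step 3 can be tried out in isolation with a small sketch like the one below. It assumes global::now counts seconds; WaitState and updateWaitReads are hypothetical stand-ins for larbin's globals, and the queue lengths are passed in as plain numbers rather than read from URLsPriorityWait and URLsDiskWait.

```cpp
// Hypothetical stand-in for the two counters held in larbin's global state.
struct WaitState {
    unsigned readPriorityWait = 0;
    unsigned readWait = 0;
};

// Apply the two checks from main.cc for one clock value (assumed seconds).
// priorityLen and diskLen play the role of the getLength() results.
void updateWaitReads(long now, unsigned priorityLen, unsigned diskLen,
                     WaitState &s) {
    if ((now % 30) == 0) {
        // Start of each 30-second cycle: load the current queue lengths.
        s.readPriorityWait = priorityLen;
        s.readWait = diskLen;
    }
    if ((now % 30) == 15) {
        // Halfway through the cycle: reset both counters.
        s.readPriorityWait = 0;
        s.readWait = 0;
    }
}
```

Calling updateWaitReads once per second shows the counters being loaded at seconds 0, 30, 60, ... and reset at seconds 15, 45, 75, ...; between those points they keep whatever value they last received.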