For an analysis of addSeed(url), see this article:
http://www.programering.com/a/MTM1UzMwATQ.html
controller.start(MyCrawler.class, numberOfCrawlers);
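controller is an instance of crawler4j's controller class, and its start method is the entry point for running a crawl.
MyCrawler is the crawler class. It defines the rules for which URLs to fetch, parses pages, persists data, and so on. It extends WebCrawler, which implements Runnable, so crawlers can run on multiple threads: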
public class WebCrawler implements Runnable
Now let's step into controller.start to see how it is implemented. Only the first part is shown here; the omitted second half is the monitoring thread.
public <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers) {
    this.start(_c, numberOfCrawlers, true);
}
protected <T extends WebCrawler> void start(final Class<T> _c, final int numberOfCrawlers, boolean isBlocking) {
    try {
        finished = false;
        crawlersLocalData.clear();
        final List<Thread> threads = new ArrayList<>();
        final List<T> crawlers = new ArrayList<>();
        for (int i = 1; i <= numberOfCrawlers; i++) {
            T crawler = _c.newInstance();
            Thread thread = new Thread(crawler, "Crawler " + i); // name the crawler thread
            crawler.setThread(thread); // bind the thread to the crawler
            crawler.init(i, this); // initialize the crawler: its index, its controller, etc.
            thread.start(); // start the crawler
            crawlers.add(crawler); // add the crawler to the crawler list
            threads.add(thread); // add the crawler's thread to the thread list
            logger.info("Crawler " + i + " started."); // log the start
        }
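This is a generic method: T must be a subclass of WebCrawler. The for loop creates, initializes, and starts each crawler. Next, let's look at the logic that thread.start() runs, which is the crawler's run method.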
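The bounded-generic startup pattern described above can be tried out without crawler4j. The following is a minimal sketch, not crawler4j code: Worker, PrintingWorker, WorkerStarter, and startAll are hypothetical names invented for illustration, and it uses getDeclaredConstructor().newInstance() where the excerpt uses Class.newInstance().

```java
import java.util.ArrayList;
import java.util.List;

class WorkerStarter {
    // T is bounded by Worker, so passing a non-Worker class fails at compile time.
    static <T extends Worker> List<Thread> startAll(Class<T> type, int count) {
        List<Thread> threads = new ArrayList<>();
        try {
            for (int i = 1; i <= count; i++) {
                // Create one instance per thread through the no-arg constructor.
                T worker = type.getDeclaredConstructor().newInstance();
                Thread thread = new Thread(worker, "Worker " + i);
                worker.init(i); // initialize before the thread starts running
                thread.start();
                threads.add(thread);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Worker needs an accessible no-arg constructor", e);
        }
        return threads;
    }

    public static void main(String[] args) throws InterruptedException {
        for (Thread t : startAll(PrintingWorker.class, 3)) {
            t.join(); // wait for every worker to finish
        }
    }
}

// Hypothetical stand-in for WebCrawler: a Runnable with an init hook.
abstract class Worker implements Runnable {
    protected int id;

    void init(int id) {
        this.id = id;
    }
}

// Hypothetical subclass, playing the role of MyCrawler.
class PrintingWorker extends Worker {
    @Override
    public void run() {
        System.out.println(Thread.currentThread().getName() + " running, id=" + id);
    }
}
```

Because each worker is built reflectively from its Class object, the subclass must have a no-argument constructor, which is presumably why a crawler class is handed to start as MyCrawler.class rather than as an instance. Back in crawler4j itself, the run method that each crawler thread executes is: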
public void run() {
    onStart();
    while (true) { // infinite loop; the only exits are the two returns below
        List<WebURL> assignedURLs = new ArrayList<>(50);
        isWaitingForNewURLs = true;
        frontier.getNextURLs(50, assignedURLs); // fetch up to 50 URLs into the list
        isWaitingForNewURLs = false;
        if (assignedURLs.size() == 0) {
            if (frontier.isFinished()) {
                return; // exit condition
            }
            try {
                Thread.sleep(3000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        } else {
            for (WebURL curURL : assignedURLs) {
                if (curURL != null) {
                    processPage(curURL);
                    frontier.setProcessed(curURL);
                }
                if (myController.isShuttingDown()) {
                    logger.info("Exiting because of controller shutdown.");
                    return; // exit condition
                }
            }
        }
    }
}
This thread keeps pulling URLs from the queue, up to 50 at a time, and calls processPage on each URL in the batch. Next, let's look at what processPage does.
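The batch-pulling behavior can be illustrated in isolation. This is a minimal sketch under stated assumptions, not crawler4j's implementation: BatchLoopSketch and drainInBatches are hypothetical names, a plain BlockingQueue of strings stands in for the frontier, and the loop treats an empty queue as "finished" instead of sleeping and re-checking as run() does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BatchLoopSketch {
    // Drains the queue in batches of at most batchSize and returns the size of each batch.
    static List<Integer> drainInBatches(BlockingQueue<String> frontier, int batchSize) {
        List<Integer> batchSizes = new ArrayList<>();
        while (true) {
            List<String> batch = new ArrayList<>(batchSize);
            frontier.drainTo(batch, batchSize); // takes up to batchSize items without blocking
            if (batch.isEmpty()) {
                return batchSizes; // simplification: an empty queue ends the loop
            }
            for (String url : batch) {
                // A real crawler would fetch and parse url here.
            }
            batchSizes.add(batch.size());
        }
    }

    public static void main(String[] args) {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        for (int i = 0; i < 120; i++) {
            frontier.add("http://example.com/" + i); // hypothetical URLs
        }
        System.out.println(drainInBatches(frontier, 50)); // prints [50, 50, 20]
    }
}
```

Pulling a batch at a time means each worker touches the shared queue once per 50 URLs rather than once per URL. crawler4j's own processPage, abridged, is shown next: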
private void processPage(WebURL curURL) {
    if (curURL == null) {
        return;
    }
    PageFetchResult fetchResult = null;
    try {
        //...........................
        try {
            visit(page); // call visit, which the user defines in their own subclass
        } catch (Exception e) {
            logger.error("Exception while running the visit method. Message: '" + e.getMessage() + "' at " + e.getStackTrace()[0]);
        }
    } catch (Exception e) {
        logger.error(e.getMessage() + ", while processing: " + curURL.getURL());
    } finally {
        if (fetchResult != null) {
            fetchResult.discardContentIfNotConsumed();
        }
    }
}
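The inner try/catch around visit means that an exception thrown by user code is logged and the crawler moves on to the next URL. A minimal sketch of that hook pattern follows; PageProcessor, StrictProcessor, HookDemo, and process are hypothetical names, pages are plain strings, and errors are collected in a list rather than written to a logger.

```java
import java.util.ArrayList;
import java.util.List;

class HookDemo {
    public static void main(String[] args) {
        StrictProcessor processor = new StrictProcessor();
        for (String page : new String[] {"a", "", "b"}) {
            processor.process(page); // the failing page does not stop the loop
        }
        System.out.println(processor.visited);       // prints [a, b]
        System.out.println(processor.errors.size()); // prints 1
    }
}

// Hypothetical framework-side base class, playing the role of WebCrawler.
abstract class PageProcessor {
    final List<String> errors = new ArrayList<>();

    // Hook that subclasses override, playing the role of visit(page).
    protected abstract void visit(String page);

    // Calls the user hook and contains any exception it throws.
    final void process(String page) {
        try {
            visit(page);
        } catch (Exception e) {
            errors.add("visit failed for '" + page + "': " + e.getMessage());
        }
    }
}

// Hypothetical user subclass whose hook rejects empty pages.
class StrictProcessor extends PageProcessor {
    final List<String> visited = new ArrayList<>();

    @Override
    protected void visit(String page) {
        if (page.isEmpty()) {
            throw new IllegalArgumentException("empty page");
        }
        visited.add(page);
    }
}
```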