1.Programming a Spider in Java
English original: http://www.developer.com/java/other/article.php/1573761
Chinese translation: http://blog.csdn.net/shuidao/archive/2007/09/05/1772512.aspx
2.Steps for configuring Heritrix 1.14.3 in MyEclipse
http://blog.163.com/caixinbao1/blog/static/161494162009730103718497/
3.Heritrix-related articles (JVM option used: -Xmx512m)
http://www.cnblogs.com/hustcat/category/139956.html
http://atwo.iteye.com/blog/216960
4.Heritrix home page:http://crawler.archive.org/
Heritrix developer manual:http://crawler.archive.org/articles/developer_manual/index.html
Heritrix user manual:http://crawler.archive.org/articles/user_manual/index.html
Heritrix usage summary:http://www.ruanko.com:9090/uchome/space.php?uid=871&do=blog&id=5773
Starting Heritrix programmatically:http://www.soidc.net/discuss/1/040101/00/615080_1.html
http://lucenebook.spaces.live.com/
http://www.iteye.com/topic/141272
Heritrix Yahoo group:http://tech.groups.yahoo.com/group/archive-crawler/
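Problem where options cannot be added:
In Eclipse's Run Dialog, open the Classpath tab and select User Entries; an Advanced button then appears on the right. Choose Add External Folder and add your conf directory. After trying again, the functions on the Modules page work normally.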
5.WSDL document downloads
http://www.webservicex.net/WCF/Default.aspx
6.Collected search engine resources:http://wind-bell.iteye.com/blog/81504
package my.processor;

import java.util.logging.Logger;

import org.archive.crawler.datamodel.CandidateURI;
import org.archive.crawler.postprocessor.FrontierScheduler;

// Replacement for the default FrontierScheduler that drops unwanted links
// (images, .doc files and .html pages) before they reach the frontier.
public class FrontierWsdlOnly extends FrontierScheduler
{
    final static Logger logger = Logger.getLogger(FrontierWsdlOnly.class.getName());

    public FrontierWsdlOnly(String name) {
        super(name);
    }

    // Called for each candidate link; returning early means the URI is never scheduled.
    protected void schedule(CandidateURI caUri) {
        String url = caUri.toString();
        // The suffix checks are case-sensitive, so "a.JPG" would still be scheduled.
        if (url.endsWith(".jpg")
                || url.endsWith(".gif")
                || url.endsWith(".doc")
                || url.endsWith(".html")
                || url.contains("/images/"))
        {
            return;
        }
        getController().getFrontier().schedule(caUri);
    }
}
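The skip rule in schedule can be tried without Heritrix on the classpath. The sketch below is only an illustration: UrlFilterSketch and shouldSchedule are hypothetical names that are not part of Heritrix, and the regular expression is one possible way to express the same case-sensitive checks as the endsWith/contains chain above.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Heritrix class): mirrors the skip rule
// used in FrontierWsdlOnly.schedule so it can be tested on plain strings.
final class UrlFilterSketch {

    // Matches a URL that ends in .jpg/.gif/.doc/.html or contains "/images/".
    // Case-sensitive on purpose, to behave like the endsWith/contains checks.
    private static final Pattern SKIP =
            Pattern.compile("\\.(?:jpg|gif|doc|html)\\z|/images/");

    private UrlFilterSketch() {
    }

    // Returns true when the URL would be passed on to the frontier.
    static boolean shouldSchedule(String url) {
        return !SKIP.matcher(url).find();
    }

    public static void main(String[] args) {
        // Example URLs are made up for illustration.
        String[] samples = {
            "http://example.com/service?wsdl",
            "http://example.com/logo.jpg",
            "http://example.com/images/icon.png",
            "http://example.com/index.html"
        };
        for (String url : samples) {
            System.out.println(url + " -> schedule? " + shouldSchedule(url));
        }
    }
}
```

Keeping the rule in a small static method like this makes it easy to adjust the list of skipped suffixes and check the effect before rebuilding the crawler.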
Important: be sure to add the conf directory from 1.12.1-src, not the conf directory from 1.12.1.