Today I got Heritrix set up in MyEclipse 7.5, and it can now be launched from inside MyEclipse.
The main steps are as follows:
1. Download heritrix-1.14.4.zip and heritrix-1.14.4-src.zip, and extract them to heritrix-1.14.4 and heritrix-1.14.4-src respectively;
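2. Create a new, empty Java Project named Heritrix (location: %MYECLIPSE_HOME%/workspace/Heritrix); (Note: when creating a project, Eclipse offers two layouts. The first uses the project folder itself as the source root, so the code does not have to go into a src directory. The second, which is the default, generates separate src and bin folders. For the Heritrix project the root directory must be the source directory, otherwise the sources would have to be placed under src. So choose the first option, as in the figure below.)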
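3. Copy the org and st folders from heritrix-1.14.4-src/src/java/ into the Heritrix root directory;
Copy the webapps folder from heritrix-1.14.4/src into the Heritrix root directory;
Copy the lib directory from heritrix-1.14.4-src into the Heritrix root directory;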
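4. Extract heritrix-1.14.4.jar (in the heritrix-1.14.4 directory) into a folder named heritrix_jar, then copy the three folders modules, profiles and selftest, together with arcMetaheaderBody.xsl, heritrix.properties and jndi.properties, from heritrix_jar into the Heritrix root directory;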
5. In the Heritrix project, go to Properties -> Java Build Path -> Libraries -> Add External JARs and add the JAR files from F:/Heritrix/heritrix-1.14.4-src/lib;
6. Open Heritrix/heritrix.properties, find "heritrix.cmdline.admin =", and change it to "heritrix.cmdline.admin = admin:admin";
7. Locate the org.archive.crawler package and run the main method in Heritrix.java (Run As -> Java Application). The console prints the following:
.515 EVENT Starting Jetty/4.2.23
11:49:17.578 WARN!! Delete existing temp dir C:/DOCUME~1/ADMINI~1/LOCALS~1/Temp/Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/F:/workspace/Heritrix/webapps/admin.war!/]
11:49:17.796 EVENT Started WebApplicationContext[/,Heritrix Console]
11:49:17.890 EVENT Started SocketListener on 127.0.0.1:8080
11:49:17.890 EVENT Started org.mortbay.jetty.Server@1113708
2010-06-01 11:49:17.968 SEVERE thread-10 org.archive.util.ArchiveUtils.<clinit>() TLD list unavailable
java.lang.NullPointerException
at java.io.Reader.<init>(Unknown Source)
at java.io.InputStreamReader.<init>(Unknown Source)
at org.archive.util.ArchiveUtils.<clinit>(ArchiveUtils.java:759)
at org.archive.crawler.settings.CrawlSettingsSAXHandler$DateHandler.endElement(CrawlSettingsSAXHandler.java:385)
at org.archive.crawler.settings.CrawlSettingsSAXHandler.endElement(CrawlSettingsSAXHandler.java:248)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:298)
at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:339)
at org.archive.crawler.settings.SettingsHandler.initialize(SettingsHandler.java:130)
at org.archive.crawler.settings.XMLSettingsHandler.initialize(XMLSettingsHandler.java:124)
at org.archive.crawler.admin.CrawlJobHandler.loadProfile(CrawlJobHandler.java:385)
at org.archive.crawler.admin.CrawlJobHandler.loadProfiles(CrawlJobHandler.java:348)
at org.archive.crawler.admin.CrawlJobHandler.<init>(CrawlJobHandler.java:217)
at org.archive.crawler.admin.CrawlJobHandler.<init>(CrawlJobHandler.java:186)
at org.archive.crawler.Heritrix.<init>(Heritrix.java:405)
at org.archive.crawler.Heritrix.<init>(Heritrix.java:393)
at org.archive.crawler.Heritrix.doCmdLineArgs(Heritrix.java:718)
at org.archive.crawler.Heritrix.main(Heritrix.java:556)
Heritrix version: 1.14.4
8. Enter http://127.0.0.1:8080 in the IE address bar, then log in with the username and password admin / admin.
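Steps 3 to 6 are easy to get slightly wrong, so here is a minimal sketch of a preflight check that could be run before launching Heritrix.java. The class `HeritrixPreflight` and its method names are hypothetical helpers written for this note, not part of Heritrix. It only checks that the entries listed above exist in the project root and that `heritrix.cmdline.admin` looks like `user:password`. It also assumes a newer JDK than the one MyEclipse 7.5 shipped with, since it uses `java.nio.file`.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not part of Heritrix): checks the layout built in steps 3 to 6. */
public class HeritrixPreflight {

    // Entries that steps 3 and 4 copy into the project root.
    static final String[] REQUIRED = {
        "org", "st", "webapps", "lib",
        "modules", "profiles", "selftest",
        "arcMetaheaderBody.xsl", "heritrix.properties", "jndi.properties"
    };

    /** Returns the names from REQUIRED that do not exist under projectRoot. */
    public static List<String> missingEntries(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.exists(projectRoot.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Returns true if heritrix.cmdline.admin has the form user:password (step 6). */
    public static boolean adminConfigured(Path propertiesFile) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(propertiesFile, StandardCharsets.ISO_8859_1)) {
            props.load(in);
        }
        String admin = props.getProperty("heritrix.cmdline.admin", "").trim();
        int colon = admin.indexOf(':');
        // Require a non-empty user name before the colon and a non-empty password after it.
        return colon > 0 && colon < admin.length() - 1;
    }

    public static void main(String[] args) throws IOException {
        // The project root is passed as the first argument; defaults to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingEntries(root);
        System.out.println("Missing entries: " + missing);
        if (!missing.contains("heritrix.properties")) {
            System.out.println("Admin login configured: "
                    + adminConfigured(root.resolve("heritrix.properties")));
        }
    }
}
```

Run it with the project root as the argument, for example F:/workspace/Heritrix. An empty "Missing entries" list and "Admin login configured: true" mean the copy steps and the properties edit are in place.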
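I don't know why startup produces this pile of exceptions, but crawl jobs can still be started and websites crawled normally. The trace itself only shows that the error is logged as "TLD list unavailable" from the static initializer of org.archive.util.ArchiveUtils, where an InputStreamReader is constructed.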
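One possible reading of the trace, offered as a guess rather than a confirmed diagnosis: a NullPointerException thrown from `java.io.Reader.<init>` via `InputStreamReader.<init>` is what you get when the stream passed in is null, and `Class.getResourceAsStream` returns null when a resource is not on the classpath. The sketch below reproduces that pattern in isolation. `MissingResourceDemo` and the resource name `no-such-resource.txt` are hypothetical. The assumption that Heritrix 1.14.4 looks for a TLD list file (reports for this version usually name `tlds-alpha-by-domain.txt`) that the copy steps above do not bring into the project is not verified here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/** Hypothetical demo (not Heritrix code): a missing classpath resource leads to an NPE of the same shape. */
public class MissingResourceDemo {

    /** Returns true if the named resource can be found relative to the given class. */
    public static boolean resourceVisible(Class<?> anchor, String name) {
        return anchor.getResource(name) != null;
    }

    /** Tries to wrap the resource in a reader and reports what happened. */
    public static String tryOpen(Class<?> anchor, String name) {
        // getResourceAsStream returns null (it does not throw) when the resource is missing.
        InputStream in = anchor.getResourceAsStream(name);
        try (InputStreamReader reader = new InputStreamReader(in)) {
            return "opened";
        } catch (NullPointerException e) {
            // Thrown by the Reader constructor when the stream is null.
            return "NullPointerException";
        } catch (IOException e) {
            return "IOException on close";
        }
    }

    public static void main(String[] args) {
        // Assumption: the resource name is illustrative; substitute the file you suspect is missing.
        String name = args.length > 0 ? args[0] : "no-such-resource.txt";
        System.out.println("visible: " + resourceVisible(MissingResourceDemo.class, name));
        System.out.println("result:  " + tryOpen(MissingResourceDemo.class, name));
    }
}
```

If this reading is right, checking whether the expected resource sits next to the compiled org.archive.util classes would be the first thing to try. It would also explain why the crawler keeps working, since the error is caught and logged rather than propagated.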
Later I downloaded Heritrix 1.14.3 and configured it in MyEclipse following the same steps. It starts normally, with no exception in the console:
07:12:17.984 EVENT Starting Jetty/4.2.23
07:12:18.031 WARN!! Delete existing temp dir C:/DOCUME~1/ADMINI~1/LOCALS~1/Temp/Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/F:/workspace/Heritrix3/webapps/admin.war!/]
07:12:18.234 EVENT Started WebApplicationContext[/,Heritrix Console]
07:12:18.312 EVENT Started SocketListener on 127.0.0.1:8080
07:12:18.312 EVENT Started org.mortbay.jetty.Server@133f1d7
Heritrix version: 1.14.3
So perhaps the exception at startup comes from a problem in version 1.14.4 itself...