Configuring Heritrix 1.14.4 in MyEclipse 7.5

Today I got Heritrix configured in MyEclipse 7.5, and it can now be launched from inside MyEclipse.

The main steps are as follows:

1. Download heritrix-1.14.4.zip and heritrix-1.14.4-src.zip, and extract them to heritrix-1.14.4 and heritrix-1.14.4-src respectively.

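2. Create a new, empty Java Project named Heritrix (located at %MYECLIPSE_HOME%/workspace/Heritrix). Note: when creating a project, Eclipse offers two layout options (see figure). With the first option the project root itself is the source folder, so the code does not have to go under src; the second option is the default and generates separate bin and src folders. For the Heritrix project, choose the first option so that the project root is the source directory; otherwise the source would have to be placed under src.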

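3. Copy the org and st folders from heritrix-1.14.4-src/src/java/ into the Heritrix project root;

        copy the webapps folder from heritrix-1.14.4/src into the Heritrix project root;

        copy the lib directory from heritrix-1.14.4-src into the Heritrix project root;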

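4. Extract heritrix-1.14.4.jar (in the heritrix-1.14.4 directory) into a folder named heritrix_jar. From heritrix_jar, copy the three folders modules, profiles, and selftest, together with the files arcMetaheaderBody.xsl, heritrix.properties, and jndi.properties, into the Heritrix project root;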

5. In the Heritrix project, open Properties -> Java Build Path -> Libraries -> Add External JARs and add the JARs under F:/Heritrix/heritrix-1.14.4-src/lib

6. Open Heritrix/heritrix.properties, find "heritrix.cmdline.admin =", and change it to "heritrix.cmdline.admin = admin:admin";

7. In the org.archive.crawler package, run the main method of Heritrix.java (Run As -> Java Application). The console prints the following:

.515 EVENT  Starting Jetty/4.2.23

11:49:17.578 WARN!! Delete existing temp dir C:/DOCUME~1/ADMINI~1/LOCALS~1/Temp/Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/F:/workspace/Heritrix/webapps/admin.war!/]

11:49:17.796 EVENT  Started WebApplicationContext[/,Heritrix Console]

11:49:17.890 EVENT  Started SocketListener on 127.0.0.1:8080

11:49:17.890 EVENT  Started org.mortbay.jetty.Server@1113708

2010-06-01 11:49:17.968 严重 thread-10 org.archive.util.ArchiveUtils.<clinit>() TLD list unavailable

java.lang.NullPointerException

      at java.io.Reader.<init>(Unknown Source)

      at java.io.InputStreamReader.<init>(Unknown Source)

      at org.archive.util.ArchiveUtils.<clinit>(ArchiveUtils.java:759)

      at org.archive.crawler.settings.CrawlSettingsSAXHandler$DateHandler.endElement(CrawlSettingsSAXHandler.java:385)

      at org.archive.crawler.settings.CrawlSettingsSAXHandler.endElement(CrawlSettingsSAXHandler.java:248)

      at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)

      at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)

      at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)

      at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)

      at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)

      at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)

      at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)

      at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)

      at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)

      at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)

      at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:298)

      at org.archive.crawler.settings.XMLSettingsHandler.readSettingsObject(XMLSettingsHandler.java:339)

      at org.archive.crawler.settings.SettingsHandler.initialize(SettingsHandler.java:130)

      at org.archive.crawler.settings.XMLSettingsHandler.initialize(XMLSettingsHandler.java:124)

      at org.archive.crawler.admin.CrawlJobHandler.loadProfile(CrawlJobHandler.java:385)

      at org.archive.crawler.admin.CrawlJobHandler.loadProfiles(CrawlJobHandler.java:348)

      at org.archive.crawler.admin.CrawlJobHandler.<init>(CrawlJobHandler.java:217)

      at org.archive.crawler.admin.CrawlJobHandler.<init>(CrawlJobHandler.java:186)

      at org.archive.crawler.Heritrix.<init>(Heritrix.java:405)

      at org.archive.crawler.Heritrix.<init>(Heritrix.java:393)

      at org.archive.crawler.Heritrix.doCmdLineArgs(Heritrix.java:718)

      at org.archive.crawler.Heritrix.main(Heritrix.java:556)

Heritrix version: 1.14.4

8. In IE, enter http://127.0.0.1:8080 in the address bar, then log in with username admin and password admin.
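The edit in step 6 can also be scripted, which helps when the project is set up more than once. The following is only a sketch: the class name AdminLoginPatcher and the default file path are made up for illustration, it handles only the "key = value" form shown in step 6, and it assumes the helper itself runs on Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix.
class AdminLoginPatcher {
    // Property key taken from step 6.
    static final String KEY = "heritrix.cmdline.admin";

    /** Returns a copy of the lines in which the KEY line reads "KEY = value". */
    static List<String> setAdminLogin(List<String> lines, String value) {
        List<String> out = new ArrayList<>();
        boolean replaced = false;
        for (String line : lines) {
            String trimmed = line.trim();
            // Match the exact key followed by '=', ignoring comment lines.
            boolean isKeyLine = !trimmed.startsWith("#")
                    && trimmed.startsWith(KEY)
                    && trimmed.substring(KEY.length()).trim().startsWith("=");
            if (isKeyLine) {
                out.add(KEY + " = " + value);
                replaced = true;
            } else {
                out.add(line); // keep comments and other settings untouched
            }
        }
        if (!replaced) {
            out.add(KEY + " = " + value); // key was absent, so append it
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Default path is an assumption: run from the Heritrix project root.
        Path file = Paths.get(args.length > 0 ? args[0] : "heritrix.properties");
        List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
        Files.write(file, setAdminLogin(lines, "admin:admin"), StandardCharsets.ISO_8859_1);
    }
}
```

Rewriting the file line by line, rather than loading and storing it with java.util.Properties, keeps the comments and ordering of the original file.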

I don't know why startup prints this pile of exceptions, but crawl jobs still start and crawl sites normally.
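One reading of the trace: the top frames (Reader.<init> called from InputStreamReader.<init> inside ArchiveUtils.<clinit>, logged as "TLD list unavailable") match what Java produces when a null stream is wrapped in a reader, and a null stream is what a classpath lookup returns for a resource that is not there. That is an inference from the trace, not something checked against the Heritrix source. The sketch below reproduces only the general mechanism; the class name and the resource name are hypothetical.

```java
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical demo class, unrelated to the Heritrix code base.
class MissingResourceDemo {
    // Hypothetical resource name, chosen so that the lookup fails.
    static final String MISSING = "no-such-tld-list.txt";

    /** Returns true if wrapping a missing classpath resource raises NullPointerException. */
    static boolean missingResourceThrowsNpe() {
        // getResourceAsStream returns null (it does not throw) when the resource is absent.
        InputStream in = MissingResourceDemo.class.getResourceAsStream(MISSING);
        if (in != null) {
            return false; // the resource unexpectedly exists, so nothing to demonstrate
        }
        try {
            new InputStreamReader(in); // the Reader constructor rejects a null argument
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE from missing resource: " + missingResourceThrowsNpe());
    }
}
```

If this is what happens here, the likely remedy is to find the data file that ArchiveUtils expects and place it on the project's classpath under the matching package path.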

 

 

Later I downloaded Heritrix 1.14.3 and configured it in MyEclipse following the same steps. It starts without any exception:

07:12:17.984 EVENT  Starting Jetty/4.2.23

07:12:18.031 WARN!! Delete existing temp dir C:/DOCUME~1/ADMINI~1/LOCALS~1/Temp/Jetty_127_0_0_1_8080__ for WebApplicationContext[/,jar:file:/F:/workspace/Heritrix3/webapps/admin.war!/]

07:12:18.234 EVENT  Started WebApplicationContext[/,Heritrix Console]

07:12:18.312 EVENT  Started SocketListener on 127.0.0.1:8080

07:12:18.312 EVENT  Started org.mortbay.jetty.Server@133f1d7

Heritrix version: 1.14.3

Perhaps something is wrong with the 1.14.4 release, which would explain the exception at startup...
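One way to test that guess is to compare what each release's jar contains with what ended up in the project. Step 4 copies only a handful of named items out of the jar, so any other non-class file inside it would be missing from the workspace. The sketch below lists such files. The class name JarResourceLister is made up, and the default jar path is an assumption based on step 4.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical diagnostic helper, not part of Heritrix.
class JarResourceLister {
    /** Lists jar entries that are neither directories, .class files, nor META-INF content. */
    static List<String> listResources(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String name = entry.getName();
                if (!entry.isDirectory()
                        && !name.endsWith(".class")
                        && !name.startsWith("META-INF/")) {
                    names.add(name);
                }
            }
        }
        Collections.sort(names); // stable order makes two listings easy to diff
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Default path is an assumption based on step 4; pass another jar as the first argument.
        String jarPath = args.length > 0 ? args[0] : "heritrix-1.14.4/heritrix-1.14.4.jar";
        for (String name : listResources(jarPath)) {
            System.out.println(name);
        }
    }
}
```

Running it against both the 1.14.3 and 1.14.4 jars and diffing the output would show whether 1.14.4 packages a resource that the copy steps above do not bring into the project.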

 
