Running Nutch 1.0 in Eclipse

Run Nutch in Eclipse on Linux and Windows (Nutch version 1.0)

Tested with

  • Nutch release 1.0
  • Eclipse 3.3
  • Java 1.6
  • Ubuntu (should work on most platforms though)
  • Windows XP

Steps

For Windows Users

If you are running Windows (tested on Windows XP), you must first install Cygwin.

Download Cygwin from http://www.cygwin.com/setup.exe

Installation instructions for Cygwin are widely available online, so I omit them here.

After installing Cygwin, follow the rest of these steps.

Install Nutch
  • Grab a fresh release of Nutch 1.0 - http://lucene.apache.org/nutch/version_control.html
  • Set the NUTCH_HOME environment variable to the directory where you downloaded Nutch 1.0.
  • Set the NUTCH_JAVA_HOME environment variable to your JDK 1.6 installation directory.
  • Do not build Nutch yet. Make sure there are no .project or .classpath files in the Nutch directory.
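Before moving on, it can help to confirm that both variables are visible to Java. The following is only a minimal sketch: the class name NutchEnvCheck and its check helper are made up for this illustration and are not part of Nutch. It reports whether NUTCH_HOME and NUTCH_JAVA_HOME are set and point to existing directories.

```java
import java.io.File;

// Hypothetical helper, not part of Nutch: verifies the two environment
// variables described in the steps above.
class NutchEnvCheck {

    // Returns a description of the problem, or null if the value
    // names an existing directory.
    static String check(String name, String value) {
        if (value == null || value.trim().length() == 0) {
            return name + " is not set";
        }
        if (!new File(value).isDirectory()) {
            return name + " does not point to a directory: " + value;
        }
        return null;
    }

    public static void main(String[] args) {
        String[] names = {"NUTCH_HOME", "NUTCH_JAVA_HOME"};
        for (String name : names) {
            String problem = check(name, System.getenv(name));
            System.out.println(problem == null ? name + " looks OK" : problem);
        }
    }
}
```

Note that Eclipse only sees environment variables that were set before it was started, so restart Eclipse after changing them.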
Create a new java project in Eclipse
  • File > New > Project > Java project > click Next
  • Name the project (Nutch for instance)
  • Select "Create project from existing source" and use the location where you downloaded nutch-1.0
  • Click on Next, and wait while Eclipse is scanning the folders
  • Add the folder "conf" to the classpath (third tab and then add class folder)
  • Go to "Order and Export" tab, find the entry for added "conf" folder and move it to the top.
  • Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
  • Set output dir to "tmp_build", create it if necessary
  • DO NOT add "build" to classpath
Configure Nutch
  1. Open the $NUTCH_HOME/conf/nutch-site.xml file and add the following content to it:

 

<configuration>
        <property>
                <name>http.agent.name</name>
                <value>my nutch agent</value>
        </property>

        <property>
                <name>http.agent.version</name>
                <value>1.0</value>
        </property>

        <property>
                <name>plugin.folders</name>
                <value>E:/nutch-1.0/src/plugin</value>
        </property>
</configuration>

 

Note: Here I set the value of "plugin.folders" to an absolute path; you can also use a relative path.

  2. Optionally, you may also set the http.agent.url and http.agent.email properties.

  3. Make sure Nutch is configured correctly before testing it in Eclipse.
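One possible way to sanity-check the configuration is to read nutch-site.xml with the JDK's built-in XML parser and look at the properties set above. This is a rough sketch, not how Nutch itself loads its configuration: the class NutchSiteCheck and its methods are hypothetical, and the plugin.folders check assumes a single absolute path as in the example above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads <property> entries
// from a Hadoop-style configuration file.
class NutchSiteCheck {

    // Parses the stream and returns a name -> value map of all properties.
    static Map<String, String> readProperties(InputStream in) {
        Map<String, String> props = new HashMap<String, String>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            NodeList nodes = doc.getElementsByTagName("property");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element property = (Element) nodes.item(i);
                String name = childText(property, "name");
                if (name != null) {
                    props.put(name, childText(property, "value"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse configuration", e);
        }
        return props;
    }

    // Returns the trimmed text of the first child element with the given tag,
    // or null if there is no such element.
    static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Path is relative to the working directory unless given as an argument.
        String path = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        InputStream in = new FileInputStream(path);
        try {
            Map<String, String> props = readProperties(in);
            String agent = props.get("http.agent.name");
            System.out.println(agent == null || agent.length() == 0
                    ? "http.agent.name is missing or empty"
                    : "http.agent.name = " + agent);
            String plugins = props.get("plugin.folders");
            if (plugins != null) {
                // Assumes one absolute path; relative paths are not resolved here.
                System.out.println("plugin.folders exists: "
                        + new File(plugins).isDirectory());
            }
        } finally {
            in.close();
        }
    }
}
```

Run it from the Nutch directory, or pass the path to nutch-site.xml as the first argument.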

 

Missing org.farng and com.etranslate

Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path: first refresh the workspace by pressing F5, then right-click the project folder > Build Path > Configure Build Path..., select the Libraries tab, click "Add Jars...", and add each .jar file individually.

Build Nutch

If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.

Create Eclipse launcher

Go to Run > Open Run Dialog..., choose the right project name, and set the main class to:

org.apache.nutch.crawl.Crawl

On the Arguments tab, set Program arguments to:

urls -dir crawl -depth 3 -topN 50 -threads 10

Here, "urls" is the directory that holds the list of URLs we want to crawl, and the options mean:

  • -dir dir names the directory to put the crawl in.
  • -threads threads determines the number of threads that will fetch in parallel.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

 

In VM arguments:

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
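The "urls" directory given as the first program argument has to exist and contain at least one text file listing the start URLs, one per line. As a small sketch of preparing it, the class below creates the directory and writes a seed file. SeedListWriter and the file name seed.txt are made up for this example, and the URL is only a placeholder. With Eclipse's default working directory, the relative path "urls" resolves against the project root.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;

// Hypothetical helper, not part of Nutch: prepares the seed directory
// that is passed to the crawl as its first argument.
class SeedListWriter {

    // Creates the directory if needed, writes one URL per line
    // into <dir>/seed.txt, and returns that file.
    static File writeSeeds(File dir, String... urls) {
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IllegalStateException("Cannot create " + dir);
        }
        File seedFile = new File(dir, "seed.txt"); // file name is an arbitrary choice
        try {
            PrintWriter out = new PrintWriter(seedFile, "UTF-8");
            try {
                for (String url : urls) {
                    out.println(url);
                }
            } finally {
                out.close();
            }
        } catch (IOException e) {
            throw new IllegalStateException("Cannot write " + seedFile, e);
        }
        return seedFile;
    }

    public static void main(String[] args) {
        // Placeholder start URL; replace it with the site you want to crawl.
        File seedFile = writeSeeds(new File("urls"), "http://lucene.apache.org/nutch/");
        System.out.println("Wrote " + seedFile.getPath());
    }
}
```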

Java Heap Size problem

If you find a line similar to this in hadoop.log:

2009-05-09 14:03:09,640 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space

You should increase the amount of memory available to applications launched from Eclipse.

Set it in:

Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments

I've set mine to

-Xms5m -Xmx150m 

-Xms sets the initial (minimum) heap size for running applications, and -Xmx sets the maximum.
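To check that these settings actually reach the launched program, one option is to run a tiny class from the same Eclipse project and print what the JVM reports. The sketch below is an illustration only (HeapCheck is a made-up name). The value from Runtime.maxMemory() may come out slightly below the -Xmx figure, depending on the JVM.

```java
// Hypothetical helper, not part of Nutch: prints the heap limit and the
// logging properties the JVM was started with.
class HeapCheck {

    // Converts a byte count to whole megabytes (rounded down).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap (MB): " + toMegabytes(maxHeap));
        // Values passed with -D in the VM arguments show up as system properties;
        // they print as null if the launcher did not set them.
        System.out.println("hadoop.log.dir = " + System.getProperty("hadoop.log.dir"));
        System.out.println("hadoop.log.file = " + System.getProperty("hadoop.log.file"));
    }
}
```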

 

References:

http://wiki.apache.org/nutch/RunNutchInEclipse0.9

http://wiki.apache.org/nutch/NutchTutorial

 

 
