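Out of the box, running a Nutch crawl on Windows requires Cygwin. There is no way around this, because Nutch is currently driven only by shell scripts, which is something of a nuisance for Windows developers. Java applications are usually deployed on Unix machines, but for the many developers who write Java on Windows, running a Nutch crawl directly in the Windows environment, without installing extra tools such as Cygwin, saves effort. Below, the nutch shell script from the bin directory of nutch-0.7.1 is converted into an Ant script. Place the Ant script in the nutch-0.7.1 directory and run it from there; you should of course adjust some of the script's elements to your own needs.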
<project name="nutch-crawl" default="crawl" basedir=".">
  <property name="lib.dir" location="lib" />
  <property name="conf.dir" location="conf" />
  <!-- Classpath: the Nutch jar, everything under lib, the current directory, and conf -->
  <path id="project.classpath">
    <fileset dir="." includes="nutch-*.jar" />
    <fileset dir="${lib.dir}" />
    <pathelement path="." />
    <pathelement path="${conf.dir}" />
  </path>
  <target name="crawl">
    <echo>crawling starting...</echo>
    <property name="JVM.extra.args" value="-Xmx1000m" />
    <!-- fork="true" is needed so that the jvmarg settings take effect -->
    <java classname="org.apache.nutch.tools.CrawlTool" classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}" />
      <!-- Crawl arguments: root URLs, output directory, depth, thread count -->
      <arg value="e:/nutch-0.7.1/urls" />
      <arg value="-dir" />
      <arg value="e:/xxcrawled" />
      <arg value="-depth" />
      <arg value="2" />
      <arg value="-threads" />
      <arg value="10" />
    </java>
    <echo>crawling finished...</echo>
  </target>
</project>
Note the <arg> elements in the code above: set them according to your own requirements.
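Instead of editing the <arg> values in place each time, one option is to lift them into Ant properties and override them on the command line. The following is a minimal sketch of that idea, not part of the original script: the property names crawl.urls, crawl.dir, crawl.depth and crawl.threads are made up for illustration, and the default locations (urls and crawled under the Nutch directory) are assumptions you should change to match your setup.

```xml
<project name="nutch-crawl" default="crawl" basedir=".">
  <property name="lib.dir" location="lib" />
  <property name="conf.dir" location="conf" />

  <!-- Hypothetical property names with assumed defaults.
       Ant properties are immutable, so a value given with -D on the
       command line takes precedence over these definitions. -->
  <property name="crawl.urls" location="urls" />
  <property name="crawl.dir" location="crawled" />
  <property name="crawl.depth" value="2" />
  <property name="crawl.threads" value="10" />
  <property name="JVM.extra.args" value="-Xmx1000m" />

  <path id="project.classpath">
    <fileset dir="." includes="nutch-*.jar" />
    <fileset dir="${lib.dir}" />
    <pathelement path="." />
    <pathelement path="${conf.dir}" />
  </path>

  <target name="crawl">
    <echo>crawling ${crawl.urls} into ${crawl.dir}...</echo>
    <java classname="org.apache.nutch.tools.CrawlTool"
          classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}" />
      <arg value="${crawl.urls}" />
      <arg value="-dir" />
      <arg value="${crawl.dir}" />
      <arg value="-depth" />
      <arg value="${crawl.depth}" />
      <arg value="-threads" />
      <arg value="${crawl.threads}" />
    </java>
    <echo>crawling finished...</echo>
  </target>
</project>
```

With this variant, a deeper crawl into a different directory could be started with something like `ant -Dcrawl.depth=3 -Dcrawl.dir=e:/xxcrawled2 crawl`, leaving the build file itself unchanged.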