<project name="nutch-crawl" default="crawl" basedir=".">
  <property name="lib.dir" location="lib"/>
  <property name="conf.dir" location="conf"/>
  <property name="urls.dir" location="urls"/>

  <!-- Entries are searched in the order listed: conf.dir comes before the nutch jar -->
  <path id="project.classpath">
    <fileset dir="${lib.dir}"/>
    <pathelement path="${conf.dir}"/>
    <fileset dir="." includes="nutch-*.jar"/>
  </path>

  <target name="crawl">
    <echo>crawling starting...</echo>
    <property name="JVM.extra.args" value="-Xmx1000m"/>
    <!-- Run the Nutch crawler in a forked JVM -->
    <java classname="org.apache.nutch.crawl.Crawl" classpathref="project.classpath" fork="true">
      <jvmarg line="${JVM.extra.args}"/>
      <!-- directory containing the seed URL files -->
      <arg value="${urls.dir}"/>
      <!-- directory the crawl output is written to -->
      <arg value="-dir"/>
      <arg value="e:/xxcrawled20"/>
      <!-- link depth to follow from the seed pages -->
      <arg value="-depth"/>
      <arg value="2"/>
      <!-- number of fetcher threads -->
      <arg value="-threads"/>
      <arg value="10"/>
    </java>
    <echo>crawling finished...</echo>
  </target>
</project>
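Note that <pathelement path="${conf.dir}"/> must come before <fileset dir="." includes="nutch-*.jar"/>. Otherwise the empty nutch-site.xml inside the jar will take precedence over the nutch-site.xml you edited in the conf directory.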
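To confirm the order in which Ant hands these entries to the JVM, the resolved classpath can be printed one entry per line. The following is only a minimal sketch and is not part of the original build file: the target name show-classpath and the property name classpath.text are hypothetical, and it assumes the target is added inside the same <project> element that defines project.classpath.

```xml
<!-- Hypothetical helper target: prints project.classpath in resolution order -->
<target name="show-classpath">
  <!-- pathconvert flattens the path into a string, one entry per line -->
  <pathconvert property="classpath.text" refid="project.classpath" pathsep="${line.separator}"/>
  <echo>${classpath.text}</echo>
</target>
```

Running ant show-classpath should then list the conf directory above the nutch-*.jar entry if the ordering is as intended.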
<property>
<name>searcher.dir</name>
<value>E:\xxcrawled2</value>
</property>
Posted @ 2006-08-23 23:49:00