一直留意Lucene,Nutch的进展,最近这两个项目都发展得非常快,Lucne已发展到 2.1,Nutch已发展到 0.9,改进了很多,令人欣喜。
今天小试了一下Nutch-0.9,笔记如下:
1、解压Nutch包,在Nutch根目录下建目录urls,里面建一些包含URL的文本如urlt.txt,一行一个URL,内容如:http://www.blogjava.net
http://www.iteye.com/
2、修改conf目录下的 crawl-urlfilter.txt,片断如下:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://www.blogjava.net/
+^http://www.iteye.com/
+^http://lucene.apache.org/
3、修改conf目录下的 nutch-site.xml,内容如下:
<?
xml version="1.0"
?>
<?
xml-stylesheet type="text/xsl" href="configuration.xsl"
?>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<!--
Put site-specific property overrides in this file.
-->
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
configuration
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
property
>
<
name
>
http.agent.name
</
name
>
<
value
>
Nutch
</
value
>
<
description
>
HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
NOTE: You should also check other related properties:
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
and set their values appropriately.
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
</
description
>
</
property
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
property
>
<
name
>
http.robots.agents
</
name
>
<
value
>
Nutch,*
</
value
>
<
description
>
The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</
description
>
</
property
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
property
>
<
name
>
http.agent.description
</
name
>
<
value
>
Nutch Search Engineer
</
value
>
<
description
>
Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</
description
>
</
property
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
property
>
<
name
>
http.agent.url
</
name
>
<
value
>
http://lucene.apache.org/nutch/bot.html
</
value
>
<
description
>
A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</
description
>
</
property
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
property
>
<
name
>
http.agent.email
</
name
>
<
value
>
nutch-agent@lucene.apache.org
</
value
>
<
description
>
An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</
description
>
</
property
>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
</
configuration
>
注意:在nutch-0.9.jar里面已包含nutch-site.xml, conf目录下的文件都复制过到classpath根下,如果是在WEB环境下运行classpath下的nutch-site.xml会优先加载,如果在在Application环境运行,应把如上nutch-site.xml打入到nutch-0.9.jar包里,否则,上面的一些属性为空不能运行。
4、在Windows下运行Nutch,很简单,只要你能执行Crawl这个类就行,写一个Ant脚本放在Nuthc的根目录下执行它就OK,内容如下:
<
project
name
="nutch-crawl"
default
="crawl"
basedir
="."
>
<
property
name
="lib.dir"
location
="lib"
/>
<
property
name
="conf.dir"
location
="conf"
/>
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
<
path
id
="project.classpath"
>
<
fileset
dir
="."
includes
="nutch-*.jar"
/>
<
fileset
dir
="lib"
/>
<
pathelement
path
="."
/>
<
pathelement
path
="${conf.dir}"
/>
</
path
>
<
target
name
="crawl"
>
<
echo
>
crwaling starting
</
echo
>
<
property
name
="JVM.extra.args"
value
="-Xmx512m"
/>
<
java
classname
="org.apache.nutch.crawl.Crawl"
classpathref
="project.classpath"
fork
="true"
>
<
jvmarg
line
="${JVM.extra.args}"
/>
<
arg
value
="C:/dev-tools/nutch-0.9/urls"
/>
<
arg
value
="-dir"
/>
<
arg
value
="C:/dev-tools/nutch-0.9/crawl"
/>
<
arg
value
="-depth"
/>
<
arg
value
="3"
/>
<
arg
value
="-threads"
/>
<
arg
value
="15"
/>
</
java
>
<
echo
>
crwaling finished
</
echo
>
</
target
>
</
project
>
至此,如无意外,Nutch已经欢快地运行起来,最后在crawl目录下你会发现你想要的东西,Enjoy it!
今天小试了一下Nutch-0.9,笔记如下:
1、解压Nutch包,在Nutch根目录下建目录urls,里面建一些包含URL的文本如urlt.txt,一行一个URL,内容如:http://www.blogjava.net
http://www.iteye.com/
2、修改conf目录下的 crawl-urlfilter.txt,片断如下:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://www.blogjava.net/
+^http://www.iteye.com/
+^http://lucene.apache.org/
3、修改conf目录下的 nutch-site.xml,内容如下:
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
注意:在nutch-0.9.jar里面已包含nutch-site.xml, conf目录下的文件都复制过到classpath根下,如果是在WEB环境下运行classpath下的nutch-site.xml会优先加载,如果在在Application环境运行,应把如上nutch-site.xml打入到nutch-0.9.jar包里,否则,上面的一些属性为空不能运行。
4、在Windows下运行Nutch,很简单,只要你能执行Crawl这个类就行,写一个Ant脚本放在Nuthc的根目录下执行它就OK,内容如下:
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![dot.gif](https://i-blog.csdnimg.cn/blog_migrate/9b8a8a44dd1c74ae49c20a7cd451974e.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![dot.gif](https://i-blog.csdnimg.cn/blog_migrate/9b8a8a44dd1c74ae49c20a7cd451974e.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
![None.gif](https://i-blog.csdnimg.cn/blog_migrate/4f1150b881333f12a311ae9ef34da474.gif)
至此,如无意外,Nutch已经欢快地运行起来,最后在crawl目录下你会发现你想要的东西,Enjoy it!
![114022.html](https://i-blog.csdnimg.cn/blog_migrate/7a7d777747974c1fd645f45b40b1c972.jpeg)