Nutch is a complete open-source full-text search package. It is built on top of Lucene Java and adds a number of web-specific features,
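such as a web crawler, a link-graph database, and parsers for HTML and other document formats.
[b][size=large]Downloading Nutch[/size][/b]
1. Choose a directory to install Nutch in; I install it directly under /home/admin.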
[code="java"]# cd /home/admin/[/code]
2. Download nutch-1.0:
[code="java"]# wget "http://labs.xiaonei.com/apache-mirror/lucene/nutch/nutch-1.0.tar.gz"[/code]
3. Extract nutch-1.0.tar.gz and create a symlink:
[code="java"]# tar -zxf nutch-1.0.tar.gz
# ln -s nutch-1.0 nutch[/code]
Listing of the Nutch entries under /home/admin:
[code="java"]# ll|grep 'nutch'
lrwxrwxrwx 1 root root 9 01-12 14:57 nutch -> nutch-1.0
drwxr-xr-x 9 root root 4096 2009-03-24 nutch-1.0
-rw-r--r-- 1 root root 86557549 2009-03-28 nutch-1.0.tar.gz[/code]
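[size=large][b]Configuring the crawler[/b][/size]
1. Create a urls directory under /home/admin/nutch and, inside it, a file named taizhou.txt. The crawl target is a website in Taizhou (many large sites block unidentified crawlers like this one, which is why I ended up choosing taizhou.com).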
[code="java"]
# mkdir /home/admin/nutch/urls;touch /home/admin/nutch/urls/taizhou.txt
.....
# cat /home/admin/nutch/urls/taizhou.txt
http://www.taizhou.com[/code]
2. Edit conf/crawl-urlfilter.txt and replace "MY.DOMAIN.NAME" with "taizhou.com", so that the line reads:
+^http://([a-z0-9]*\.)*taizhou.com/
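To get a feel for what this rule accepts, the pattern can be tried against a few URLs outside of Nutch. The sketch below is only an approximation: it uses `grep -E` rather than the Java regex engine that Nutch uses, and the helper name `matches_filter` and the sample URLs are made up for illustration. The leading `+` in crawl-urlfilter.txt marks the rule as an include rule and is not part of the regular expression.

```shell
#!/bin/sh
# Pattern from conf/crawl-urlfilter.txt, without the leading "+" (the include marker).
PATTERN='^http://([a-z0-9]*\.)*taizhou.com/'

# Hypothetical helper: succeeds (exit status 0) when the URL matches the pattern.
matches_filter() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Sample URLs are assumptions chosen only to show one accept and one reject case.
for url in "http://www.taizhou.com/" "http://news.taizhou.com/index.html" "http://www.example.com/"; do
    if matches_filter "$url"; then
        echo "accept $url"
    else
        echo "reject $url"
    fi
done
```

Note that the dots in taizhou.com are not escaped, so the pattern is slightly looser than it looks; writing taizhou\.com would make it stricter.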
3. Edit conf/nutch-site.xml to configure the HTTP header information the crawler sends. Only some of the properties are shown here:
[code="java"]# cat nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>8qiu-spider</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>this is a crawler of 8qiu</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>www.8qiu.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>javalover@yeah.net</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
</configuration>[/code]
4. Start the crawler:
/home/admin/nutch/bin/nutch crawl /home/admin/nutch/urls/ -dir /home/admin/nutch/crawl -depth 3 -topN 100
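The same command can be kept in a small script so that the depth and page limit are easy to adjust between runs. This is only a sketch built from the paths used above; the variable names are my own. In the `crawl` command, `-dir` is the output directory, `-depth` is the link depth to follow from the seed pages, and `-topN` caps the number of pages fetched at each level.

```shell
#!/bin/sh
# Paths and limits taken from the command above, kept in variables for easy tuning.
NUTCH_HOME=/home/admin/nutch
SEED_DIR="$NUTCH_HOME/urls/"     # directory holding taizhou.txt
CRAWL_DIR="$NUTCH_HOME/crawl"    # output directory (-dir)
DEPTH=3                          # link depth to follow from the seed pages
TOPN=100                         # maximum number of pages fetched per level

crawl_cmd="$NUTCH_HOME/bin/nutch crawl $SEED_DIR -dir $CRAWL_DIR -depth $DEPTH -topN $TOPN"

# Print the command for review; run it only when the Nutch launcher is actually present.
echo "$crawl_cmd"
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
    $crawl_cmd
fi
```

Starting with small values for `DEPTH` and `TOPN`, as here, keeps a first test crawl short.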
[size=large][b]Installing the web runtime environment[/b][/size]
1. Install Tomcat; my Tomcat directory is /usr/local/tomcat.
2. Move the nutch-1.0 war file into the webapps directory:
mv nutch-1.0.war /usr/local/tomcat/webapps/
3. Start Tomcat:
[code="java"]# /usr/local/tomcat/bin/startup.sh
Using CATALINA_BASE: /usr/local/tomcat
Using CATALINA_HOME: /usr/local/tomcat
Using CATALINA_TMPDIR: /usr/local/tomcat/temp
Using JRE_HOME: /usr/local/jdk1.6.0_10[/code]
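Be sure to run the startup command above from within /home/admin/nutch; otherwise it will not find the /home/admin/nutch/crawl directory.
Once startup has finished, check the Tomcat log: /usr/local/tomcat/logs/catalina.out
If everything is fine, you can open http://192.168.110.12:8080/nutch-1.0/search.jsp and get search results.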
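Because the search page depends on the crawl directory being found from the directory where Tomcat was started, a small pre-start check may save a round of debugging. The following is a rough sketch using the paths from this post; the helper name `has_crawl_dir` is hypothetical.

```shell
#!/bin/sh
NUTCH_HOME=/home/admin/nutch
TOMCAT_HOME=/usr/local/tomcat

# Hypothetical helper: succeeds when the given directory contains a "crawl" subdirectory.
has_crawl_dir() {
    [ -d "$1/crawl" ]
}

if has_crawl_dir "$NUTCH_HOME"; then
    # Start Tomcat with the Nutch directory as the working directory.
    (cd "$NUTCH_HOME" && "$TOMCAT_HOME/bin/startup.sh")
else
    echo "no crawl directory under $NUTCH_HOME; run the crawl first" >&2
fi
```

The subshell around `cd` leaves the caller's own working directory unchanged.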