1. Software to install
(1) JDK 1.6
(2) Cygwin
(3) Nutch 1.0
(4) Tomcat 6.0
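2. Installation process
First, JDK 1.6. Install it the same way you would for writing Java code in a plain text editor, which means the environment variables have to be configured:
PATH, JAVA_HOME, and CLASSPATH all need to be set.
My configuration is as follows: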
JAVA_HOME=C:\Program Files\Java\jdk1.6.0_06
Path=;%JAVA_HOME%\bin;
CLASSPATH=.;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar
Note: be sure to use JDK 1.6. Nutch 1.0 appears to have been built for 1.6 (my own guess), because running it on 1.5 fails with a class version mismatch error:
-----------------------------------------------------------------------------------------------------------------------------------
java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Exception in thread "main"
-----------------------------------------------------------------------------------------------------------------------------------
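A JVM reports this error when a class was compiled for a newer Java release than the one running it, so it can be worth checking the Java version in the Cygwin shell before starting Nutch. The following is only a minimal sketch: the helper name java_version_ok is made up for illustration, and it assumes the version string has the usual form printed by java -version, such as 1.6.0_06.

```shell
# Hypothetical helper: succeed if a Java version string is 1.6 or newer.
# Handles both the old "1.x.y_zz" scheme and the newer "N.x.y" scheme.
java_version_ok() {
  version="$1"
  major="${version%%.*}"   # text before the first dot
  rest="${version#*.}"     # text after the first dot
  minor="${rest%%.*}"      # second numeric component
  if [ "$major" -gt 1 ] 2>/dev/null; then
    return 0
  fi
  [ "$major" = "1" ] && [ "$minor" -ge 6 ] 2>/dev/null
}

# Usage (assumes java is on PATH; java -version prints to stderr):
#   v=$(java -version 2>&1 | sed -n '1s/.*"\(.*\)".*/\1/p')
#   java_version_ok "$v" || echo "Nutch 1.0 needs JDK 1.6 or newer, found $v"
```

If the check fails, fix JAVA_HOME and Path first, then run Nutch again.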
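3. Install Cygwin
Cygwin provides a Linux-like environment on Windows, so Linux shell scripts can be run there.
For the installation process, see: http://www.wangchao.net.cn/bbsdetail_1759714.html
I did not actually follow that guide when I installed it; I chose "Install from Internet". Installing it should not be a problem.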
4. Install Tomcat 6.0
Set the TOMCAT_HOME environment variable to C:\Program Files\Apache Software Foundation\Tomcat 6.0
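5. Install Nutch 1.0
(1) Download the Nutch package.
(2) Put the package nutch-1.0.tar.gz in the root of the Cygwin installation directory.
Open the Cygwin shortcut, change to the root directory, and run dir; you should see nutch-1.0.tar.gz.
(3) Run tar zxvf nutch-1.0.tar.gz to unpack it; this creates a nutch-1.0 folder under the root directory.
(4) Rename that folder: mv nutch-1.0 nutch
(5) In the nutch directory, create a urls directory, then create a file named url (no file extension) inside it and write in the URL you want to crawl, for example: http://www.jlu.edu.cn/ (the trailing / must not be omitted)
(6) Open nutch\conf\crawl-urlfilter.txt.
Change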
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
to
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*jlu.edu.cn/
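(7) Edit nutch/conf/nutch-default.xml
Change the corresponding entries in the file to look like the listing below. You can of course adapt the values to your own situation. For example, http.agent.name MUST NOT be empty, so just put something there. For http.robots.agents, the requirement is to put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. http.agent.url is only advertised in the User-Agent header, so any URL will do. For http.agent.email, the description says: An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming. Any address will do.
Some people say the same changes can be made in nutch/conf/nutch-site.xml instead, and Nutch will be configured just as well. I have not tried it, but it should work: as the file names suggest, the default file holds the defaults and the site file holds site-specific settings, so editing the site file is the more flexible approach.
----------------------------------------nutch-default.xml contents---------------------------------------------------------------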
<property>
<name>http.agent.name</name>
<value>guoliqiang</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>guoliqiang,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.robots.403.allow</name>
<value>true</value>
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
</property>
<property>
<name>http.agent.description</name>
<value>jlu</value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value>www.baidu.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value>guoliqiang2006@126.com</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
-----------------------------------------------------------------------------------------------------------------------------------
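Steps (5) and (6) above can also be done from the Cygwin shell instead of by hand. The following is a rough sketch only: the function name prepare_crawl and its arguments are made up for illustration, and it assumes crawl-urlfilter.txt still contains the unmodified MY.DOMAIN.NAME rule.

```shell
# Hypothetical helper covering steps (5) and (6):
# write the seed file and point the URL filter at one domain.
prepare_crawl() {
  nutch_dir="$1"   # e.g. /nutch
  seed_url="$2"    # e.g. http://www.jlu.edu.cn/ (keep the trailing /)
  domain="$3"      # e.g. jlu.edu.cn

  # Step (5): urls/url, a plain file without an extension.
  mkdir -p "$nutch_dir/urls"
  printf '%s\n' "$seed_url" > "$nutch_dir/urls/url"

  # Step (6): replace the placeholder only on the rule line (it starts with +),
  # leaving the comment line above it untouched.
  filter="$nutch_dir/conf/crawl-urlfilter.txt"
  sed "/^+/s/MY\.DOMAIN\.NAME/$domain/" "$filter" > "$filter.tmp" &&
    mv "$filter.tmp" "$filter"
}

# Usage:
#   prepare_crawl /nutch http://www.jlu.edu.cn/ jlu.edu.cn
```

Writing to a temporary file and renaming it avoids relying on sed's in-place option.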
6. Crawl with Nutch
Change into the nutch directory:
$ sh ./bin/nutch crawl urls -dir mydir -depth 2 -threads 4 -topN 50 >& ./log.txt
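crawl: tells Nutch to run the main method of its crawl command.
urls: the directory containing the url file that lists the URLs to crawl.
-dir mydir: where the crawl results are stored.
-depth 2: the number of crawl rounds, also called the depth, though "rounds" seems the more fitting word. For testing, setting it to 1 is recommended.
-threads 4: the number of concurrent fetch threads (threads, not processes), set to 4 here.
-topN 50: the maximum number of pages fetched in each round, not per site.
>& ./log.txt: redirects both standard output and standard error to log.txt; if an error occurs you can find it there.
Note that the mydir directory must not already exist when you start the crawl, otherwise the crawl fails.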
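Since the crawl fails when the output directory is already there, a small guard in front of the command saves a wasted run. This is just a sketch; crawl_dir_is_free is a made-up helper name, not part of Nutch.

```shell
# Hypothetical guard: fail if the crawl output directory already exists.
crawl_dir_is_free() {
  if [ -e "$1" ]; then
    echo "refusing to crawl: $1 already exists" >&2
    return 1
  fi
  return 0
}

# Usage, run from the nutch directory (same command as in this section):
#   crawl_dir_is_free mydir &&
#     sh ./bin/nutch crawl urls -dir mydir -depth 2 -threads 4 -topN 50 >& ./log.txt
```

If the guard fails, delete or rename the old mydir first, or pick a different -dir value.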
7. Configure Tomcat
(1) Rename nutch-1.0.war to nutch.war and copy it into the Tomcat 6.0\webapps directory.
(2) Start Tomcat, wait until nutch.war has been unpacked, then open nutch\WEB-INF\classes\nutch-site.xml
and change it to:
<nutch-conf><property><name>searcher.dir</name> <value> C:\cygwin\nutch\mydir </value></property></nutch-conf>
(3) Under Tomcat 6.0\webapps\nutch\zh\include, create header.jsp. Its content is a copy of header.html,
but with the following line added at the top:
<%@ page contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>
In Tomcat 6.0\webapps\nutch\search.jsp, find the header include and change it to:
<jsp:include page="<%= language + "/include/header.jsp"%>"/>
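While you are there, comment out the focus call in the following JavaScript (keep the closing brace outside the comment):
function queryfocus() {
// search.query.focus();
}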
(4) In Tomcat 6.0\conf\server.xml, find the following element and change it to:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75" enableLookups="false" redirectPort="8443" acceptCount="100" debug="0" connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8" useBodyEncodingForURI="true" />
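(5) Restart Tomcat and visit http://localhost:8080/nutch/ to see the search home page. Search supports Chinese, including word segmentation.
Alternatively, put the contents of the nutch directory under webapps/ROOT, and the site can be reached directly at http://localhost:8080/.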
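Step (1) above can be done from the Cygwin shell as well. A minimal sketch, assuming you know where the WAR file and Tomcat's webapps directory are; deploy_nutch_war is a made-up helper name, and both paths in the usage line are examples, not fixed locations.

```shell
# Hypothetical helper: copy the Nutch WAR into Tomcat's webapps as nutch.war.
deploy_nutch_war() {
  war="$1"       # path to nutch-1.0.war
  webapps="$2"   # Tomcat's webapps directory (may contain spaces)
  [ -f "$war" ] || { echo "WAR not found: $war" >&2; return 1; }
  [ -d "$webapps" ] || { echo "webapps directory not found: $webapps" >&2; return 1; }
  cp "$war" "$webapps/nutch.war"
}

# Usage (example paths only):
#   deploy_nutch_war /nutch/nutch-1.0.war \
#     "/cygdrive/c/Program Files/Apache Software Foundation/Tomcat 6.0/webapps"
```

The quotes around the webapps path matter because the default Tomcat location contains spaces.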
Note that the following error may appear when Tomcat starts:
-----------------------------------------------------------------------------------------------------------------------------------
2009-04-09 17:09:02,984 INFO NutchBean - creating new bean
2009-04-09 17:09:03,093 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:89)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,125 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:50)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,125 INFO SearchBean - opening merged index in C:/cygwin/nutch/mydir/index
2009-04-09 17:09:03,140 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:70)
at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:58)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,156 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:191)
at org.apache.nutch.searcher.IndexSearcher.getDirectory(IndexSearcher.java:84)
at org.apache.nutch.searcher.IndexSearcher.<init>(IndexSearcher.java:71)
at org.apache.nutch.searcher.LuceneSearchBean.init(LuceneSearchBean.java:58)
at org.apache.nutch.searcher.LuceneSearchBean.<init>(LuceneSearchBean.java:51)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:102)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,203 INFO PluginRepository - Plugins: looking in: C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\nutch\WEB-INF\classes\plugins
2009-04-09 17:09:03,312 INFO PluginRepository - Plugin Auto-activation mode: [true]
2009-04-09 17:09:03,312 INFO PluginRepository - Registered Plugins:
2009-04-09 17:09:03,312 INFO PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Query Filter (query-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Html Parse Plug-in (parse-html)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Indexing Filter (index-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - Site Query Filter (query-site)
2009-04-09 17:09:03,312 INFO PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-04-09 17:09:03,312 INFO PluginRepository - HTTP Framework (lib-http)
2009-04-09 17:09:03,312 INFO PluginRepository - Text Parse Plug-in (parse-text)
2009-04-09 17:09:03,312 INFO PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Filter (urlfilter-regex)
2009-04-09 17:09:03,312 INFO PluginRepository - Http Protocol Plug-in (protocol-http)
2009-04-09 17:09:03,312 INFO PluginRepository - XML Response Writer Plug-in (response-xml)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-04-09 17:09:03,312 INFO PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-04-09 17:09:03,312 INFO PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-04-09 17:09:03,312 INFO PluginRepository - Anchor Indexing Filter (index-anchor)
2009-04-09 17:09:03,312 INFO PluginRepository - JavaScript Parser (parse-js)
2009-04-09 17:09:03,312 INFO PluginRepository - URL Query Filter (query-url)
2009-04-09 17:09:03,312 INFO PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-04-09 17:09:03,312 INFO PluginRepository - JSON Response Writer Plug-in (response-json)
2009-04-09 17:09:03,312 INFO PluginRepository - Registered Extension-Points:
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Field Filter (org.apache.nutch.indexer.field.FieldFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Search Results Response Writer (org.apache.nutch.searcher.response.ResponseWriter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-04-09 17:09:03,312 INFO PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-04-09 17:09:03,312 INFO PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-04-09 17:09:03,343 INFO Configuration - found resource common-terms.utf8 at file:/C:/Program%20Files/Apache%20Software%20Foundation/Tomcat%206.0/webapps/nutch/WEB-INF/classes/common-terms.utf8
2009-04-09 17:09:03,359 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.searcher.FetchedSegments.<init>(FetchedSegments.java:204)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:110)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
2009-04-09 17:09:03,375 INFO SummarizerFactory - Using the first summarizer extension found: Basic Summarizer
2009-04-09 17:09:03,375 WARN FileSystem - uri=file:///
javax.security.auth.login.LoginException: Login failed: Cannot run program "whoami": CreateProcess error=2, ?????????
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
at org.apache.hadoop.security.UserGroupInformation.login(UserGroupInformation.java:67)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1438)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1376)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
at org.apache.nutch.crawl.LinkDbReader.init(LinkDbReader.java:59)
at org.apache.nutch.crawl.LinkDbReader.<init>(LinkDbReader.java:55)
at org.apache.nutch.searcher.LinkDbInlinks.<init>(LinkDbInlinks.java:42)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:113)
at org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:77)
at org.apache.nutch.searcher.NutchBean$NutchBeanConstructor.contextInitialized(NutchBean.java:425)
at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:3843)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4342)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:525)
at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:926)
at org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:889)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:492)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1149)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:117)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:719)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at org.apache.catalina.core.StandardService.start(StandardService.java:516)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:578)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
-----------------------------------------------------------------------------------------------------------------------------------
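It means that the whoami command cannot be run, because you are on Windows rather than Linux. The solution is to use Cygwin: add C:\cygwin\bin to the PATH environment variable, then restart Tomcat.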
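To double-check that the Windows PATH really contains the Cygwin bin directory, you can test the PATH string itself. The sketch below is only an illustration: win_path_has_dir is a made-up helper, and it assumes you paste in the value printed by echo %PATH% in a plain Windows command prompt.

```shell
# Hypothetical helper: succeed if a Windows-style PATH string
# (entries separated by ;) contains the given directory.
# The comparison ignores case and one trailing backslash per entry.
win_path_has_dir() {
  wanted=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "$1" |
    tr ';' '\n' |
    tr '[:upper:]' '[:lower:]' |
    sed 's/\\$//' |
    grep -qxF "$wanted"
}

# Usage (the PATH value here is an example):
#   win_path_has_dir 'C:\WINDOWS\system32;C:\cygwin\bin' 'C:\cygwin\bin' \
#     && echo "Cygwin bin is on PATH"
```

The check is done on the string rather than on the current shell's PATH because Tomcat starts with the Windows environment, not the Cygwin one.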
I also tried nutch-0.9, which hit a very troublesome error, so I switched straight to 1.0:
---------------------------------------------------------------------------------------------------------------------------------
2007-06-09 12:37:28,187 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2007-06-09 12:37:28,281 INFO indexer.Indexer - Optimizing index.
2007-06-09 12:37:28,421 INFO indexer.Indexer - Indexer: done
2007-06-09 12:37:28,421 INFO indexer.DeleteDuplicates - Dedup: starting
2007-06-09 12:37:28,453 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: mydir/indexes
2007-06-09 12:37:28,750 WARN mapred.LocalJobRunner - job_hlqfpx
java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
-----------------------------------------------------------------------------------------------------------------------------------
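For the cause, see: http://blog.sina.com.cn/s/blog_537c07f6010009t9.html
I did try things myself, but this is the most authoritative explanation I have seen; those who claim it is a configuration problem are simply talking nonsense.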