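I downloaded Nutch 1.2 from the official site, set it up on Windows, and then deployed nutch-1.2.war to Tomcat.
After a series of configuration steps the crawl succeeded: running the command [bin/nutch org.apache.nutch.searcher.NutchBean <keyword>] showed
how many URLs had been crawled, yet searching from the Tomcat page returned no results for the same keyword.
So I checked the Tomcat logs, which turned out to contain the following error: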
---------------------------------------------------------------------
org.apache.catalina.core.StandardWrapperValve invoke
重大: Servlet.service() for servlet [jsp] in context with path [/nutch-1.2] threw exception [java.io.FileNotFoundException: The requested resource (/nutch-1.2/include/header.html) is not available] with root cause
java.io.FileNotFoundException: The requested resource (/nutch-1.2/include/header.html) is not available
at org.apache.catalina.servlets.DefaultServlet.serveResource(DefaultServlet.java:776)
at org.apache.catalina.servlets.DefaultServlet.doGet(DefaultServlet.java:411)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:621)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.ApplicationDispatcher.invoke(ApplicationDispatcher.java:749)
at org.apache.catalina.core.ApplicationDispatcher.doInclude(ApplicationDispatcher.java:605)
at org.apache.catalina.core.ApplicationDispatcher.include(ApplicationDispatcher.java:544)
at org.apache.jasper.runtime.JspRuntimeLibrary.include(JspRuntimeLibrary.java:954)
at org.apache.jsp.search_jsp._jspService(search_jsp.java:242)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:432)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:390)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:334)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:728)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:305)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:502)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2430)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2419)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
---------------------------------------------------------------------
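The log says /nutch-1.2/include/header.html cannot be found. Inspecting the files inside nutch-1.2.war, I found that "/include/header.html" is not
placed in the root directory; instead there is a copy under each language folder, i.e. /nutch-1.2/<language>/include/header.html.
So I simply copied "nutch-1.2\en\include\header.html" into the "/nutch-1.2/include/" directory,
and after that, entering a keyword on the Tomcat page returned the crawled results.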
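The same copy can be scripted. Below is a minimal Python sketch, assuming the WAR has already been unpacked under Tomcat's webapps directory. The path C:\tomcat\webapps\nutch-1.2 and both function names are placeholders of my own, not anything shipped with Nutch or Tomcat, so adjust them to your install.

```python
import shutil
from pathlib import Path


def find_header_files(webapp_root):
    """Return every include/header.html found under the webapp, sorted."""
    root = Path(webapp_root)
    return sorted(p for p in root.rglob("header.html") if p.parent.name == "include")


def copy_language_header(webapp_root, lang="en"):
    """Copy <lang>/include/header.html to include/header.html at the webapp root."""
    root = Path(webapp_root)
    source = root / lang / "include" / "header.html"
    if not source.is_file():
        raise FileNotFoundError(f"no header.html for language {lang!r}: {source}")
    target_dir = root / "include"
    # The root-level include/ folder may not exist yet, so create it if needed.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "header.html"
    shutil.copyfile(source, target)
    return target


if __name__ == "__main__":
    # Hypothetical location of the unpacked webapp; adjust to your Tomcat install.
    webapp = Path(r"C:\tomcat\webapps\nutch-1.2")
    if webapp.is_dir():
        for path in find_header_files(webapp):
            print("found:", path)
        print("copied to:", copy_language_header(webapp))
    else:
        print("webapp directory not found:", webapp)
```

Passing a different language code, for example copy_language_header(webapp, lang="zh"), would use that language's header instead, provided the folder exists in your copy of the webapp.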
Nutch 1.2 download:
apache-nutch-1.2-bin.zip
http://archive.apache.org/dist/nutch/