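First, Nutch can crawl several sites at once; you only need to list them in the URL list. While doing this I ran into a problem. The error output was as follows: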
070517 140633 fetch of http://www.21cn.com/ failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
070517 140633 fetch of http://www.21cn.com/ failed with: java.lang.NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser
Exception in thread "fetcher0" java.lang.NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:230)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:213)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:156)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:254)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:204)
070517 140635 fetch of http://www.gdsq.org.cn/ failed with: java.lang.Exception: java.net.SocketTimeoutException: Read timed out
070517 140635 fetch of http://www.gdsq.org.cn/ failed with: java.lang.NoClassDefFoundError: org/cyberneko/html/parsers/DOMFragmentParser
Exception in thread "fetcher3" java.lang.NoClassDefFoundError: org/cyberneko/htm
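Root cause: some of the classes in the build folder had been compiled on Windows, and they were not overwritten when I rebuilt with ant on Linux, which is why this error appeared.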
Fix: delete the build folder and rebuild with ant. That's all it takes.
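
To confirm whether the rebuild actually put the missing class back on the classpath, a small check like the one below may help. It is only a sketch: the class name `NekoClasspathCheck` and the helper `isClassAvailable` are made up for illustration and are not part of Nutch. The only thing taken from the log above is the class name `org.cyberneko.html.parsers.DOMFragmentParser`.

```java
public class NekoClasspathCheck {

    /**
     * Hypothetical helper (not a Nutch API): returns true if the named class
     * can be located and loaded by this class's class loader.
     */
    public static boolean isClassAvailable(String className) {
        try {
            // initialize=false: only locate and load the class,
            // do not run its static initializers.
            Class.forName(className, false, NekoClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class is not on the classpath.
            // LinkageError covers NoClassDefFoundError and similar load failures.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the stack trace in the log above.
        String name = "org.cyberneko.html.parsers.DOMFragmentParser";
        if (isClassAvailable(name)) {
            System.out.println(name + " is available");
        } else {
            System.out.println(name + " is MISSING from the classpath");
        }
    }
}
```

Run it with the same classpath the fetcher uses (how that classpath is assembled depends on your setup, so treat that part as an assumption). If it still reports the class as missing after the rebuild, the stale build folder was probably not the only problem.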