Nutch installation and configuration

samttsch

Article link: https://blog.csdn.net/samttsch/article/details/4269614

1 Cygwin

Cygwin is a free UNIX-like subsystem that runs on Windows. It is implemented through a DLL (dynamic link library) and in effect emulates a UNIX environment on Windows.

Installation: http://cppforever.spaces.live.com/Blog/cns!9DB38506DFDBED76!998.entry (online installer)

http://119.147.41.16/down?cid=AEF05C0F232AB1BE526E0A0DAAA50CDE7E9BB909&t=2&fmt=&usrinput=cygwin&dt=2002000&ps=0_0&rt=0kbs&plt=0 (direct download)

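2 Nutch: a web search engine

Nutch has two main parts: the crawler and the searcher. The crawler fetches pages from the web and builds an index of them. The searcher uses that index to look up the user's query terms and produce search results. The index is the only interface between the two, so apart from the index they are loosely coupled.

Introduction to Nutch: http://baike.baidu.com/view/46642.htm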

Extract Nutch to drive D:.

Start Cygwin, then: cd /cygdrive/d/nutch

crawl-urlfilter.txt (accept only URLs under csdn.net):

+^http://([a-z0-9]*\.)*csdn.net/
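The filter line can be sanity-checked before a crawl is started. The sketch below is only an approximation: it assumes the dot after the host-name group is escaped with a backslash, it uses grep -E rather than the Java regex engine Nutch itself uses, and url_allowed is a hypothetical helper name, not part of Nutch. The leading + of the rule (meaning "accept") is dropped because it is not part of the regular expression.

```shell
# Pattern from crawl-urlfilter.txt without the leading '+' (which means "accept").
pattern='^http://([a-z0-9]*\.)*csdn.net/'

# url_allowed is a hypothetical helper: it succeeds when the URL matches the pattern.
url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Try two csdn.net URLs and one unrelated URL.
for url in http://www.csdn.net/ http://blog.csdn.net/samttsch http://www.example.com/; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

With this pattern the two csdn.net URLs are accepted and the example.com URL is rejected.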

nutch-site.xml (set the crawler's agent name):

    <property>
    <name>http.agent.name</name>
    <value>csdn.net</value>
    <description>csdn.net</description>
    </property>

mkdir urls

echo http://www.csdn.net/ > urls/csdn

export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_07"

bin/nutch crawl urls -dir csdn -depth 3 -threads 4 >& crawl.log

The generated index files are written to nutch/csdn.

(Luke can be used to inspect the index data: http://www.getopt.org/luke/)
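Before opening the index, it can help to confirm that the crawl actually produced its output. The sketch below assumes the usual Nutch 1.0 one-step crawl layout (crawldb, linkdb, segments, indexes and index under the -dir directory); that list is an assumption about the release in use, and missing_crawl_dirs is a hypothetical helper, not a Nutch command.

```shell
# Subdirectories a Nutch 1.0 one-step crawl is expected to create (assumption).
expected_dirs="crawldb linkdb segments indexes index"

# missing_crawl_dirs is a hypothetical helper: it prints every expected
# subdirectory that is absent from the crawl directory given as $1.
missing_crawl_dirs() {
  for d in $expected_dirs; do
    [ -d "$1/$d" ] || echo "$d"
  done
}

# Usage: run from the Nutch directory after the crawl has finished.
missing=$(missing_crawl_dirs csdn)
if [ -z "$missing" ]; then
  echo "crawl output looks complete"
else
  echo "missing:" $missing
fi
```

If anything is reported missing, crawl.log is the first place to look for the cause.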

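Copy nutch-1.0.war into the servlet container's webapps directory.

Edit WEB-INF/classes/nutch-site.xml in the deployed web application and add the following property, which tells the searcher which index directory to use:

<property>
<name>searcher.dir</name>
<value>D:/nutch/csdn</value>
</property>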

Reference: http://billion318.spaces.live.com/blog/cns!36A80F7663730159!109.entry

4 Using Paoding for Chinese word segmentation

Add the properties files from paoding.jar to the Nutch classpath.

Edit paoding-dic-home.properties:

paoding.dic.home=D:/paoding/dic (note the forward slashes)

Use PaodingAnalyzer in Nutch's analyzer class, NutchDocumentAnalyzer:

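private static Analyzer PAODING_ANALYZER;

public NutchDocumentAnalyzer(Configuration conf) {
    // Create the Paoding analyzer when the Nutch analyzer is constructed
    PAODING_ANALYZER = new PaodingAnalyzer();
}

public TokenStream tokenStream(String fieldName, Reader reader) {
    // Use Paoding for every field instead of the default per-field analyzers
    Analyzer analyzer = PAODING_ANALYZER;
    return analyzer.tokenStream(fieldName, reader);
}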

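5 Searching pdf|doc|xls|ppt|txt files

By default, the content of txt files is searchable.

pdf|doc|xls|ppt need additional configuration:

parse-plugins.xml: specifies which parser handles each file type

nutch-default.xml:

<name>plugin.includes</name> <!-- which plugins, and therefore which file types, are enabled -->

<name>http.content.limit</name> <!-- maximum size of the content that is downloaded -->

<value>-1</value> <!-- no limit; with a limit, large files are truncated and parsing may fail with a "parse incomplete" exception -->

regex-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)

crawl-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)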
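The effect of editing the suffix rule can be illustrated outside Nutch. In the sketch below, both patterns are abbreviated stand-ins for the real "-" rule in the filter files (the actual line lists many more extensions), grep -E only approximates the Java regex engine, and is_skipped is a hypothetical helper.

```shell
# Abbreviated suffix rules (assumption: the real rule lists many more extensions).
before='\.(gif|jpg|png|css|zip|ppt|xls|gz|exe)$'
after='\.(gif|jpg|png|css|zip|gz|exe)$'

# is_skipped is a hypothetical helper: it succeeds when URL $2 matches suffix
# pattern $1, i.e. when a "-" rule with that pattern would drop the URL.
is_skipped() {
  printf '%s\n' "$2" | grep -Eq "$1"
}

for url in http://www.csdn.net/a.ppt http://www.csdn.net/b.xls http://www.csdn.net/c.gif; do
  if is_skipped "$before" "$url"; then b=skip; else b=keep; fi
  if is_skipped "$after" "$url"; then a=skip; else a=keep; fi
  echo "$url before=$b after=$a"
done
```

With ppt and xls removed from the pattern, the .ppt and .xls URLs are no longer dropped, while the .gif URL is still skipped.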

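6 Running crawl from Windows

Set the environment variable NUTCH_HOME=D:/nutch

PATH: append d:/cygwin/bin;d:/cygwin/usr/bin; to the existing %PATH%

nutch.bat crawl d:/nutch/urls -dir d:/nutch/csdn -depth 2 -topN 5000 (the paths here must be absolute; the first argument is the directory holding the seed URL list, and the -dir directory receives the crawl output)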
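Because the Windows command line needs absolute drive-letter paths while the Cygwin steps use /cygdrive paths, a small conversion helper can reduce mistakes. The sketch below is a plain sed rewrite under the assumption that paths follow the /cygdrive/<drive>/... form used earlier; to_windows_path is a hypothetical name. On a real Cygwin installation, cygpath -m performs this kind of conversion.

```shell
# to_windows_path is a hypothetical helper: it rewrites a /cygdrive/<drive>/...
# path into the <drive>:/... form used on the nutch.bat command line.
# Paths that do not start with /cygdrive/<drive> are printed unchanged.
to_windows_path() {
  printf '%s\n' "$1" | sed -E 's#^/cygdrive/([a-zA-Z])(/|$)#\1:/#'
}

to_windows_path /cygdrive/d/nutch/csdn
to_windows_path "/cygdrive/c/Program Files/Java/jdk1.6.0_07"
```

The first call prints d:/nutch/csdn and the second prints c:/Program Files/Java/jdk1.6.0_07.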

(The nutch.bat script below is based on http://wangxuliangboy.javaeye.com/blog/279552)

@echo off
set JAVA_HEAP_MAX="-Xmx512M"

if not "%1"=="" goto INIT
:echoMSG
echo Title: Nutch
echo Usage: nutch COMMAND
echo where COMMAND is one of:
echo   crawl             one-step crawler for intranets
echo   inject            inject new urls into the database
echo   generate          generate new segments to fetch
echo   fetchlist         print the fetchlist of a segment
echo   fetch             fetch a segment's pages
echo   parse             parse a segment's pages
echo   index             run the indexer on a segment's fetcher output
echo   merge             merge several segment indexes
echo   dedup             remove duplicates from a set of segment indexes
echo   updatedb          update db from segments after fetching
echo   updatesegs        update segments with link data from the db
echo   mergesegs         merge multiple segments into a single segment
echo   analyze           adjust database link-analysis scoring
echo   segread           read, fix and dump segment data
echo   segslice          append, join and slice segment data
echo   server            run a search server
echo   namenode          run the NDFS namenode
echo   datanode          run an NDFS datanode
echo   ndfs              run an NDFS admin client
echo   jobtracker        run the MapReduce job Tracker node
echo   tasktracker       run a MapReduce task Tracker node
echo or
echo   CLASSNAME         run the class named CLASSNAME
echo Most commands print help when invoked w/o parameters.
goto end
:INIT
set NUTCH_HOME=%NUTCH_HOME%
if "%NUTCH_HOME%"=="" echo NUTCH_HOME environment variable NOT FOUND
set CLASSPATH=%NUTCH_HOME%;%NUTCH_HOME%/conf;%NUTCH_HOME%/plugin;%NUTCH_HOME%/lib
@echo @echo off>setclasspath.bat
for %%i in (%NUTCH_HOME%/nutch-*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
for %%i in (%NUTCH_HOME%/lib/*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
goto EXEC
:EXEC
call setclasspath
if "%1" == "crawl" set CLASS=org.apache.nutch.crawl.Crawl
if "%1" == "inject" set CLASS=org.apache.nutch.crawl.Injector
if "%1" == "generate" set CLASS=org.apache.nutch.crawl.Generator
if "%1" == "fetchlist" set CLASS=org.apache.nutch.pagedb.FetchListEntry
if "%1" == "fetch" set CLASS=org.apache.nutch.fetcher.Fetcher

if "%1" == "fetch2" set CLASS=org.apache.nutch.fetcher.Fetcher2
if "%1" == "convdb" set CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
if "%1" == "parse" set CLASS=org.apache.nutch.parse.ParseSegment
if "%1" == "index" set CLASS=org.apache.nutch.indexer.Indexer
if "%1" == "merge" set CLASS=org.apache.nutch.indexer.IndexMerger
if "%1" == "dedup" set CLASS=org.apache.nutch.indexer.DeleteDuplicates
if "%1" == "updatedb" set CLASS=org.apache.nutch.crawl.CrawlDb
if "%1" == "mergesegs" set CLASS=org.apache.nutch.segment.SegmentMerger
if "%1" == "readdb" set CLASS=org.apache.nutch.crawl.CrawlDbReader
if "%1" == "segread" (
echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
set CLASS=org.apache.nutch.segment.SegmentReader
)
if "%1" == "server" set CLASS=org.apache.nutch.searcher.DistributedSearch$Server
echo %CLASSPATH%
call "%JAVA_HOME%/bin/java" %JAVA_HEAP_MAX% -classpath "%CLASSPATH%" %CLASS% %2 %3 %4 %5 %6 %7 %8 %9
:end