Installing and Configuring Nutch

1 Cygwin

   Cygwin is a free UNIX-like environment that runs on Windows, implemented through a DLL (dynamic-link library). In effect, it emulates a UNIX system on Windows.

Installation: http://cppforever.spaces.live.com/Blog/cns!9DB38506DFDBED76!998.entry (online installer)

http://119.147.41.16/down?cid=AEF05C0F232AB1BE526E0A0DAAA50CDE7E9BB909&t=2&fmt=&usrinput=cygwin&dt=2002000&ps=0_0&rt=0kbs&plt=0

(direct download)

 

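2 Nutch: a web search engine

   Nutch has two main parts: the crawler and the searcher. The crawler fetches pages from the web and builds an index over them. The searcher uses that index to look up the user's query terms and produce search results. The index is the only interface between the two, so apart from the index itself they are very loosely coupled.

Introduction to Nutch: http://baike.baidu.com/view/46642.htm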

 

3 Installing Nutch and running a crawl

 Extract Nutch to the D: drive.

 Start Cygwin, then: cd /cygdrive/d/nutch

 In conf/crawl-urlfilter.txt, accept URLs in the csdn.net domain:

    +^http://([a-z0-9]*\.)*csdn.net/
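
To get a feel for which URLs this rule lets through, here is a minimal sketch that approximates it with grep -E. It assumes the intended rule is +^http://([a-z0-9]*\.)*csdn.net/ (with an escaped dot), that the leading + only means "accept", and that grep's extended regex behaves like Nutch's Java regex for this simple pattern. The function name matches_filter and the sample URLs are illustrative only.

```shell
#!/bin/sh
# Approximate the accept rule from crawl-urlfilter.txt with grep -E.
# Assumption: the rule is  +^http://([a-z0-9]*\.)*csdn.net/
# (the leading "+" marks an accept rule and is not part of the regex).
FILTER_REGEX='^http://([a-z0-9]*\.)*csdn.net/'

# Returns 0 if the URL matches the rule, 1 otherwise.
matches_filter() {
  printf '%s\n' "$1" | grep -Eq "$FILTER_REGEX"
}

# Sample URLs (hypothetical) to show what is accepted and rejected.
for url in "http://www.csdn.net/" "http://blog.csdn.net/a/b" "http://www.example.com/"; do
  if matches_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Any host outside csdn.net falls through to the later rules in the file, which in a typical setup reject it.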

 In conf/nutch-site.xml, set the crawler's agent name:

    <property>
    <name>http.agent.name</name>
    <value>csdn.net</value>
    <description>csdn.net</description>
    </property>

mkdir urls

echo http://www.csdn.net/ > urls/csdn

export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_07"

 

bin/nutch crawl urls -dir csdn -depth 3 -threads 4 >& crawl.log

   The generated index files are written to nutch/csdn.

   (You can use Luke to inspect the index: http://www.getopt.org/luke/)
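
Before opening the index in Luke, it can help to confirm that the crawl finished and wrote everything it should. The sketch below assumes a Nutch 1.0 one-step crawl creates the sub-directories crawldb, linkdb, segments, indexes and index under the -dir target. The function name check_crawl_dir is hypothetical.

```shell
#!/bin/sh
# Check that a crawl directory contains the sub-directories a Nutch 1.0
# one-step crawl is expected to create (assumption: crawldb, linkdb,
# segments, indexes, index).
check_crawl_dir() {
  dir=$1
  missing=0
  for sub in crawldb linkdb segments indexes index; do
    if [ ! -d "$dir/$sub" ]; then
      echo "missing: $dir/$sub"
      missing=1
    fi
  done
  # Exit status 0 when everything is present, 1 otherwise.
  return "$missing"
}

# Hypothetical usage, run from /cygdrive/d/nutch after the crawl finishes:
# check_crawl_dir csdn && echo "crawl output looks complete"
```

If a directory is reported missing, crawl.log is the first place to look for the failing step.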

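Copy nutch-1.0.war into the webapps directory of your servlet container.

Edit WEB-INF/classes/nutch-site.xml inside the deployed web application and add:

   <property>
   <name>searcher.dir</name>
   <value>D:/nutch/csdn</value>
   </property>

   This specifies the crawl directory whose index is used for searching.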

Reference: http://billion318.spaces.live.com/blog/cns!36A80F7663730159!109.entry

4 Using Paoding for Chinese word segmentation

  Add the .properties files from paoding.jar to the Nutch classpath.

  Edit paoding-dic-home.properties:

     paoding.dic.home=D:/paoding/dic  (note the forward slashes)

  Use PaodingAnalyzer in the Nutch analyzer class NutchDocumentAnalyzer:

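     // requires: import net.paoding.analysis.analyzer.PaodingAnalyzer;
     private static Analyzer PAODING_ANALYZER;

     public NutchDocumentAnalyzer(Configuration conf) {
          PAODING_ANALYZER = new PaodingAnalyzer();
     }

     public TokenStream tokenStream(String fieldName, Reader reader) {
          // Route every field through Paoding instead of the default analyzers
          Analyzer analyzer = PAODING_ANALYZER;
          return analyzer.tokenStream(fieldName, reader);
     }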

5 Searching pdf|doc|xls|ppt|txt files

   By default, txt content is searchable.

   pdf|doc|xls|ppt require extra configuration:

   parse-plugins.xml: specifies which parser handles each file type

   nutch-default.xml:

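     <name>plugin.includes</name> // specifies which plugins to load, and therefore which file types can be parsed

     <value>.....|parse-(text|html|js|pdf|msexcel|mspowerpoint|msword)|...

     <name>http.content.limit</name> // maximum size of fetched content, in bytes

     <value>-1</value> // -1 means no limit; otherwise large files are truncated and parsing may fail with a "parse incomplete" exception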

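   regex-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)

   crawl-urlfilter.txt: remove ppt and xls from the list of filtered suffixes (pdf and doc are not filtered by default)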
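
The suffix edit can be done by hand, or sketched with sed as below. This assumes the filter files contain a single exclusion rule of the form -\.(gif|...|ppt|...|xls|...)$, where ppt and xls are neither the first alternative nor followed by anything other than | or ). The sample rule is shortened for illustration and the function name strip_office_suffixes is hypothetical, so check the real line in your own files before applying anything like this.

```shell
#!/bin/sh
# Remove the "ppt" and "xls" alternatives from a suffix-exclusion rule
# read on stdin. Each expression drops "|ppt" or "|xls" only when it is
# followed by "|" or ")", so longer suffixes such as "pptx" are untouched.
strip_office_suffixes() {
  sed -E -e 's/\|ppt(\||\))/\1/' -e 's/\|xls(\||\))/\1/'
}

# Shortened sample rule (an assumption, not the full default line).
printf '%s\n' '-\.(gif|zip|ppt|mpg|xls|gz|exe)$' | strip_office_suffixes
```

The same function can be applied to both regex-urlfilter.txt and crawl-urlfilter.txt, writing to a temporary file and reviewing the result before replacing the original.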

6 Running crawl from Windows

   Set the environment variable: NUTCH_HOME=D:/nutch

   PATH: append d:/cygwin/bin;d:/cygwin/usr/bin; to %PATH%

   nutch.bat crawl d:/nutch/csdn -dir d:/nutch/csdn -depth 2 -topN 5000 (absolute paths are required here)

   (This .bat file is based on http://wangxuliangboy.javaeye.com/blog/279552)

@echo off
set JAVA_HEAP_MAX="-Xmx512M"

if not "%1"=="" goto INIT
:echoMSG
  echo Title: Nutch
  echo Usage: nutch COMMAND
  echo where COMMAND is one of:
  echo   crawl             one-step crawler for intranets
  echo   inject            inject new urls into the database
  echo   generate          generate new segments to fetch
  echo   fetchlist         print the fetchlist of a segment
  echo   fetch             fetch a segment's pages
  echo   parse             parse a segment's pages
  echo   index             run the indexer on a segment's fetcher output
  echo   merge             merge several segment indexes
  echo   dedup             remove duplicates from a set of segment indexes
  echo   updatedb          update db from segments after fetching
  echo   updatesegs        update segments with link data from the db
  echo   mergesegs         merge multiple segments into a single segment
  echo   analyze           adjust database link-analysis scoring
  echo   segread           read, fix and dump segment data
  echo   segslice          append, join and slice segment data
  echo   server            run a search server
  echo   namenode          run the NDFS namenode
  echo   datanode          run an NDFS datanode
  echo   ndfs              run an NDFS admin client
  echo   jobtracker        run the MapReduce job Tracker node
  echo   tasktracker       run a MapReduce task Tracker node
  echo  or
  echo   CLASSNAME         run the class named CLASSNAME
  echo Most commands print help when invoked w/o parameters.
  goto end
:INIT 
  set NUTCH_HOME=%NUTCH_HOME%
  if "%NUTCH_HOME%"=="" (echo NUTCH_HOME is not set & goto end)
  set CLASSPATH=%NUTCH_HOME%;%NUTCH_HOME%/conf;%NUTCH_HOME%/plugin;%NUTCH_HOME%/lib
  @echo @echo off>setclasspath.bat
  for %%i in (%NUTCH_HOME%/nutch-*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
  for %%i in (%NUTCH_HOME%/lib/*.jar) do @echo set CLASSPATH=%%CLASSPATH%%;%%i>>setclasspath.bat
  goto EXEC
:EXEC
  call setclasspath
  if  "%1" == "crawl" set CLASS=org.apache.nutch.crawl.Crawl
  if  "%1" == "inject" set CLASS=org.apache.nutch.crawl.Injector
  if  "%1" == "generate" set CLASS=org.apache.nutch.crawl.Generator
  if  "%1" == "fetchlist" set CLASS=org.apache.nutch.pagedb.FetchListEntry
  if  "%1" == "fetch" set CLASS=org.apache.nutch.fetcher.Fetcher

  if  "%1" == "fetch2" set CLASS=org.apache.nutch.fetcher.Fetcher2
  if  "%1" == "convdb" set CLASS=org.apache.nutch.tools.compat.CrawlDbConverter
  if  "%1" == "parse" set CLASS=org.apache.nutch.parse.ParseSegment
  if  "%1" == "index" set CLASS=org.apache.nutch.indexer.Indexer
  if  "%1" == "merge" set CLASS=org.apache.nutch.indexer.IndexMerger
  if  "%1" == "dedup" set CLASS=org.apache.nutch.indexer.DeleteDuplicates
  if  "%1" == "updatedb" set CLASS=org.apache.nutch.crawl.CrawlDb
  if  "%1" == "mergesegs" set CLASS=org.apache.nutch.segment.SegmentMerger
  if  "%1" == "readdb" set CLASS=org.apache.nutch.crawl.CrawlDbReader
  if  "%1" == "segread" echo [DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead.
  if  "%1" == "segread" set CLASS=org.apache.nutch.segment.SegmentReader
  if  "%1" == "server" set CLASS=org.apache.nutch.searcher.DistributedSearch$Server
  echo %CLASSPATH%
  call "%JAVA_HOME%/bin/java" %JAVA_HEAP_MAX% -classpath "%CLASSPATH%" %CLASS% %2 %3 %4 %5 %6 %7 %8 %9
:end
