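Having just read through the crawl script and gotten a rough understanding of how Nutch 2.3 executes, I went back to see what the nutch script itself can do.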
> $ bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
-inject inject new urls into the database
-hostinject creates or updates an existing host table from a text file
-generate generate new batches to fetch from crawl db
-fetch fetch URLs marked during generate
-parse parse URLs marked during fetch
-updatedb update web table after parsing
-updatehostdb update host table after parsing
-readdb read/dump records from page database
-readhostdb display entries from the hostDB
-index run the plugin-based indexer on parsed batches
-elasticindex run the elasticsearch indexer - DEPRECATED use the index command instead
-solrindex run the solr indexer on parsed batches - DEPRECATED use the index command instead
-solrdedup remove duplicates from solr
-solrclean remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
-clean remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
-parsechecker check the parser for a given url
-indexchecker check the indexing filters for a given url
-plugin load a plugin and run one of its classes main()
-nutchserver run a (local) Nutch server on a user defined port
-webapp run a local Nutch web application
-junit runs the given JUnit test
or
-CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
It does three things:
I. Crawling
1. Specify crawl seeds: inject, hostinject
2. Crawl: generate, fetch, parse
3. Update the crawl database: updatedb, updatehostdb
4. Index: index, elasticindex, solrindex
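The four steps above can be chained into a single crawl round. The following is only a minimal sketch, not the actual crawl script: the `seeds` directory, the `-topN` value, and the `-all` argument are my assumptions, and the `run` and `crawl_round` helpers are made up for illustration. By default it only prints the commands (dry run); run each command without parameters first to check its real usage.

```shell
#!/bin/sh
# Sketch of one crawl round built from the commands listed above.
# ASSUMPTIONS (not from the usage output): the seeds directory, the -topN
# value and the -all argument. Verify them against each command's own help.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch script
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it
# and stop at the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$NUTCH $*"
    else
        "$NUTCH" "$@" || exit 1
    fi
}

# Hypothetical helper: one inject/generate/fetch/parse/updatedb/index pass.
crawl_round() {
    seed_dir="$1"
    top_n="$2"
    run inject "$seed_dir"        # 1. seed the database with new URLs
    run generate -topN "$top_n"   # 2. generate a batch of URLs to fetch
    run fetch -all                #    fetch the URLs marked during generate
    run parse -all                #    parse the URLs marked during fetch
    run updatedb -all             # 3. update the web table after parsing
    run index -all                # 4. run the plugin-based indexer
}

crawl_round "${1:-seeds}" "${2:-50}"
```

Saved under a name of your choice, it takes the seed directory and the batch size as its two arguments; set `DRY_RUN=0` once the printed commands look right.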
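II. Front-end search service
1. Web services: nutchserver, webapp
III. Tools
1. Inspect the crawl database: readdb, readhostdb
2. Load and run plugins: plugin
3. Check parsing and indexing: parsechecker, indexchecker
4. Index cleanup: solrdedup, solrclean, clean
5. Others: junit, CLASSNAME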
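As a rough illustration of the inspection tools, the sketch below prints a few typical invocations. The `-stats` argument and the example URL are my assumptions rather than something shown in the usage output, and `run` and `check_crawl` are hypothetical helpers. It is a dry run by default, so nothing is executed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: inspect the crawl database and test parsing/indexing for one URL.
# ASSUMPTIONS: the -stats argument and the example URL are illustrative;
# run each command without parameters to see its actual usage.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch script
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = execute them

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$NUTCH $*"
    else
        "$NUTCH" "$@" || exit 1
    fi
}

# Hypothetical helper: database summary plus parser and indexer checks.
check_crawl() {
    url="$1"
    run readdb -stats         # read summary records from the page database
    run parsechecker "$url"   # check the parser for the given URL
    run indexchecker "$url"   # check the indexing filters for the given URL
}

check_crawl "${1:-http://example.com/}"
```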
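◆ That roughly covers it. If you want to build a search service, the commands above are what Nutch can do for you. Of course, extending it through plugins to cover more requirements is a very good option.
◆ From an architectural standpoint, though, I personally think the front-end search service that Nutch provides is redundant; perhaps it is best used simply as a tester.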