The nutch script

Latest recommended article published 2015-11-06 12:49:00

cadany

Category: 01_NUTCH

Article link: https://blog.csdn.net/cadany/article/details/44537793

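Having just read through the crawl script and gotten a rough understanding of how Nutch 2.3 runs a crawl, I am now going back to see what the nutch script itself can do.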

$ bin/nutch

Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
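
The last entry in the usage text suggests a simple dispatch rule: a known COMMAND is mapped to a Java class, and anything else is taken to be a class name itself. Below is a minimal sketch of that idea, not the real bin/nutch. The function name resolve_class is hypothetical, and the class names in the table are my assumptions about Nutch 2.x, so check them against your own copy of the script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a COMMAND -> Java class dispatcher.
# This is NOT the real bin/nutch; the class names are assumptions.
resolve_class() {
  case "$1" in
    inject)   echo "org.apache.nutch.crawl.InjectorJob" ;;
    generate) echo "org.apache.nutch.crawl.GeneratorJob" ;;
    fetch)    echo "org.apache.nutch.fetcher.FetcherJob" ;;
    parse)    echo "org.apache.nutch.parse.ParserJob" ;;
    updatedb) echo "org.apache.nutch.crawl.DbUpdaterJob" ;;
    *)        echo "$1" ;;  # fallback: treat the argument as CLASSNAME
  esac
}

# Known command: prints the mapped class name.
resolve_class inject
# Unknown command: passed through unchanged as a class name
# (com.example.MyTool is a made-up name for illustration).
resolve_class com.example.MyTool
```

With a rule like this, adding a new subcommand only takes one more line in the case table, while arbitrary classes stay reachable through the fallback.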