The Nutch readseg Command Explained

readseg is an alias for the class org.apache.nutch.segment.SegmentReader.

This class is similar to readdb, but it dumps the contents of a segment. It can be used in three ways:

1st Usage: bin/nutch readseg -dump <segment_dir> <output> [general options] 

-dump: Dumps the contents of the segment at <segment_dir> as a text file to <output>.

[general options]: General options are provided below.
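
As a minimal sketch of the first usage, the snippet below assembles a -dump invocation and prints it. The segment path and output name are hypothetical placeholders, not values from a real crawl, and the command is only executed when a bin/nutch launcher exists in the current directory.

```shell
# Hypothetical paths: substitute your own segment directory and output location.
SEGMENT="crawl/segments/20240101000000"
OUTPUT="segdump"

# Assemble the documented invocation: readseg -dump <segment_dir> <output>
DUMP_CMD="bin/nutch readseg -dump $SEGMENT $OUTPUT"
echo "$DUMP_CMD"

# Run it only when a Nutch launcher is present in the current directory.
if [ -x bin/nutch ]; then
  $DUMP_CMD
fi
```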

2nd Usage: bin/nutch readseg -list (<segment_dir1> ... | -dir <segments>) [general options] 

-list: This argument lists a synopsis of the specified segments, or of all segments in the directory <segments>, and prints their details to System.out.

<segment_dir1> ...: A list of paths to the individual segment directories to process.

-dir <segments>: A path to a directory that contains multiple segments.

[general options]: General options are provided below.
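
The two forms of -list can be sketched as follows. The directory crawl/segments and the timestamp-style segment names are assumptions made for illustration; the commands are printed first and only run if bin/nutch is available.

```shell
# Hypothetical parent directory holding several segments.
SEGMENTS_DIR="crawl/segments"

# Form 1: name individual segment directories explicitly.
LIST_SOME="bin/nutch readseg -list $SEGMENTS_DIR/20240101000000 $SEGMENTS_DIR/20240102000000"

# Form 2: point -dir at the parent directory to cover every segment in it.
LIST_ALL="bin/nutch readseg -list -dir $SEGMENTS_DIR"

echo "$LIST_SOME"
echo "$LIST_ALL"

# Run the -dir form only when a Nutch launcher is present.
if [ -x bin/nutch ]; then
  $LIST_ALL
fi
```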

3rd Usage: bin/nutch readseg -get <segment_dir> <keyValue> [general options] 

-get: This argument gets a specified record from a segment and prints it to System.out.

<segment_dir>: Path to the segment directory.

<keyValue>: The value of the key (URL) of the record we wish to retrieve. N.B. It is essential to put double quotes around strings that contain spaces.

[general options]: General options are provided below.
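
The quoting note for <keyValue> can be illustrated with a small sketch. The URL and segment path are hypothetical, and count_args is a helper invented here only to show how the shell splits an unquoted key at the space.

```shell
# Hypothetical segment path and URL key; the key deliberately contains a space.
SEGMENT="crawl/segments/20240101000000"
KEY="http://example.com/some page.html"

# Illustrative helper (not part of Nutch): reports how many arguments it received.
count_args() { echo "$#"; }

QUOTED=$(count_args "$KEY")   # quoted: the key stays a single argument
UNQUOTED=$(count_args $KEY)   # unquoted: the shell splits it at the space
echo "quoted=$QUOTED unquoted=$UNQUOTED"

# The actual lookup, with the key quoted, run only if a Nutch launcher is present.
if [ -x bin/nutch ]; then
  bin/nutch readseg -get "$SEGMENT" "$KEY"
fi
```

Without the quotes, readseg would receive the key as two separate arguments and could not match the intended record.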

General options:

  • -nocontent: Ignores the content directory, so its contents are not output.

  • -nofetch: Ignores the crawl_fetch directory, so its contents are not output.

  • -nogenerate: Ignores the crawl_generate directory, so its contents are not output.

  • -noparse: Ignores the crawl_parse directory, so its contents are not output.

  • -noparsedata: Ignores the parse_data directory, so its contents are not output.

  • -noparsetext: Ignores the parse_text directory, so its contents are not output.
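
As a sketch of how these options combine, the snippet below builds a -dump command that skips every directory except parse_text, so only the parsed text would be written. The paths are hypothetical, and the command is executed only if bin/nutch is found.

```shell
# Hypothetical paths: substitute your own segment directory and output location.
SEGMENT="crawl/segments/20240101000000"
OUTPUT="segdump_text"

# Skip everything except parse_text by passing all the other -no* options.
SKIP="-nocontent -nofetch -nogenerate -noparse -noparsedata"
TEXT_ONLY_CMD="bin/nutch readseg -dump $SEGMENT $OUTPUT $SKIP"
echo "$TEXT_ONLY_CMD"

# Run it only when a Nutch launcher is present in the current directory.
if [ -x bin/nutch ]; then
  $TEXT_ONLY_CMD
fi
```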
