readseg is an alias for the class org.apache.nutch.segment.SegmentReader.
This class is similar to readdb, except that it reads out the contents of a segment. It can be used in three ways:
1st Usage: bin/nutch readseg -dump <segment_dir> <output> [general options]
-dump: Dumps the content of the segment at <segment_dir> as a text file to <output>.
[general options]: The general options are described at the end of this page.
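As a minimal sketch of the first usage, the following shell snippet assembles a -dump invocation and prints it as a dry run. The segment path and output name are hypothetical, and the real call is left commented out because it assumes a working Nutch installation run from its home directory.

```shell
# Hypothetical paths; replace them with your own crawl layout.
SEGMENT_DIR="crawl/segments/20240101000000"
OUTPUT="segdump"

# Assemble the -dump command described above.
DUMP_CMD="bin/nutch readseg -dump $SEGMENT_DIR $OUTPUT"

# Dry run: print the command instead of executing it.
echo "$DUMP_CMD"

# With Nutch installed, run it for real:
# bin/nutch readseg -dump "$SEGMENT_DIR" "$OUTPUT"
```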
2nd Usage: bin/nutch readseg -list (<segment_dir1> ... | -dir <segments>) [general options]
-list: Lists a synopsis of the segments in the specified directories, or of all segments in the directory <segments>, and prints their details to System.out.
<segment_dir1> ...: A list of paths to the individual segment directories to process.
-dir <segments>: The path to a directory that contains multiple segments.
[general options]: The general options are described at the end of this page.
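The two forms of -list can be sketched as follows. This is a dry run that only prints the commands; the directory name crawl/segments and the two timestamped segment names are hypothetical placeholders.

```shell
# Hypothetical parent directory that holds several segments.
SEGMENTS="crawl/segments"

# Form 1: list every segment found under one parent directory.
LIST_ALL="bin/nutch readseg -list -dir $SEGMENTS"

# Form 2: list only the segments named explicitly.
LIST_SOME="bin/nutch readseg -list $SEGMENTS/20240101000000 $SEGMENTS/20240102000000"

# Dry run: print both commands instead of executing them.
echo "$LIST_ALL"
echo "$LIST_SOME"
```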
3rd Usage: bin/nutch readseg -get <segment_dir> <keyValue> [general options]
-get: Gets a specified record from a segment and prints it to System.out.
<segment_dir>: The path to the segment directory.
<keyValue>: The value of the key (the URL) of the record we wish to retrieve. N.B. A string that contains spaces must be enclosed in double quotes.
[general options]: The general options are described at the end of this page.
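The quoting note for <keyValue> can be illustrated with a small shell sketch. The helper count_args is hypothetical and only counts the arguments it receives; the segment path and the URL are made up for the example. The real -get call is commented out because it assumes Nutch is installed.

```shell
# Hypothetical segment path and a URL key that contains a space.
SEGMENT_DIR="crawl/segments/20240101000000"
KEY="http://example.com/my page.html"

# Hypothetical helper: report how many arguments a command would receive.
count_args() { echo "$#"; }

# Quoted: the URL stays one argument.
QUOTED=$(count_args readseg -get "$SEGMENT_DIR" "$KEY")

# Unquoted: the shell splits the URL at the space into two arguments.
UNQUOTED=$(count_args readseg -get $SEGMENT_DIR $KEY)

echo "quoted: $QUOTED arguments, unquoted: $UNQUOTED arguments"

# With Nutch installed, the quoted form is the one to run:
# bin/nutch readseg -get "$SEGMENT_DIR" "$KEY"
```

Without the quotes the key would reach the command as two separate arguments, so the lookup would not use the intended URL.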
General options:
-nocontent: Ignores the content directory; its contents are not output.
-nofetch: Ignores the crawl_fetch directory; its contents are not output.
-nogenerate: Ignores the crawl_generate directory; its contents are not output.
-noparse: Ignores the crawl_parse directory; its contents are not output.
-noparsedata: Ignores the parse_data directory; its contents are not output.
-noparsetext: Ignores the parse_text directory; its contents are not output.
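As a rough sketch of how the general options combine with -dump, the snippet below builds the -no... flags from a list of segment parts to skip and prints the resulting command as a dry run. The variable names, the segment path, and the output name are hypothetical; only the flags themselves come from the list above.

```shell
# Hypothetical paths; replace them with your own crawl layout.
SEGMENT_DIR="crawl/segments/20240101000000"
OUTPUT="segdump"

# Segment parts to leave out of the dump (each maps to a -no<part> flag).
SKIP="content parsetext"

# Build the option string, e.g. " -nocontent -noparsetext".
OPTS=""
for part in $SKIP; do
  OPTS="$OPTS -no$part"
done

# Assemble and print the full command instead of executing it.
DUMP_CMD="bin/nutch readseg -dump $SEGMENT_DIR $OUTPUT$OPTS"
echo "$DUMP_CMD"
```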