This tool takes several segments and merges their data together. Only the latest version of each piece of data is retained.
Optionally, you can apply the current URLFilters to remove prohibited URLs.
Also, it's possible to slice the resulting segment into chunks of fixed size.
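To make the slicing idea concrete, here is a minimal sketch of cutting a list of records into chunks of a fixed size. It is not the tool's own code: the class SliceSketch, the method slice, and the sample record names are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from the Nutch source.
final class SliceSketch {

    // Splits records into consecutive chunks holding at most sliceSize items each.
    static <T> List<List<T>> slice(List<T> records, int sliceSize) {
        if (sliceSize <= 0) {
            throw new IllegalArgumentException("sliceSize must be positive");
        }
        List<List<T>> slices = new ArrayList<>();
        for (int start = 0; start < records.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, records.size());
            // Copy the sub-list so each slice is independent of the input list.
            slices.add(new ArrayList<>(records.subList(start, end)));
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("u1", "u2", "u3", "u4", "u5");
        System.out.println(slice(records, 2)); // prints [[u1, u2], [u3, u4], [u5]]
    }
}
```

Every chunk has the fixed size except possibly the last one, which holds whatever remains.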
Important Notes
Which parts are merged?
It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed, but one of them is unfetched, the tool falls back to merging just the fetchlists and skips all other data from all segments.
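The "lowest common set" can be pictured as a set intersection over the parts each segment contains. The following is a minimal sketch under that assumption, not the tool's actual implementation: the class CommonPartsSketch and the method commonParts are hypothetical, and the part names are the subdirectory names commonly found in Nutch segments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not taken from the Nutch source.
final class CommonPartsSketch {

    // Returns the parts that are present in every one of the given segments.
    static Set<String> commonParts(List<Set<String>> segments) {
        if (segments.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> common = new TreeSet<>(segments.get(0));
        for (Set<String> parts : segments) {
            common.retainAll(parts); // keep only what this segment also has
        }
        return common;
    }

    public static void main(String[] args) {
        // A fetched and parsed segment (assumed directory layout).
        Set<String> parsed = new HashSet<>(Arrays.asList(
                "crawl_generate", "crawl_fetch", "content",
                "crawl_parse", "parse_data", "parse_text"));
        // An unfetched segment holds only its fetchlist.
        Set<String> unfetched = new HashSet<>(Arrays.asList("crawl_generate"));

        // Two complete segments plus one unfetched segment.
        System.out.println(commonParts(Arrays.asList(parsed, parsed, unfetched)));
        // prints [crawl_generate]
    }
}
```

A single unfetched segment is enough to reduce the common set to the fetchlist, which matches the fallback described above.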
Merging fetchlists
Merging segments that contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn't ensure that fetchlist parts for each map task are disjoint.
Duplicate content
Merging segments removes older content whenever possible (see below). However, this is NOT the same as de-duplication, which in addition removes identical content found at different URLs. In other words, running DeleteDuplicates is still necessary.
For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names will prevail. It follows then that it is extremely important that segments be named in an increasing lexicographic order as their creation time increases.
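The "higher segment name wins" rule can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the classes LatestBySegmentNameSketch and Entry, the method keepLatest, and the sample URLs and segment names are all hypothetical. The sketch compares names with String.compareTo, which orders letters by character code and may therefore differ from the letter order quoted above; for timestamp-style names of equal length made only of digits, the result is the same either way.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not taken from the Nutch source.
final class LatestBySegmentNameSketch {

    // One piece of data for a URL, tagged with the segment it came from.
    static final class Entry {
        final String url;
        final String segment;
        final String data;

        Entry(String url, String segment, String data) {
            this.url = url;
            this.segment = segment;
            this.data = data;
        }
    }

    // For each URL, keeps the entry whose segment name compares highest.
    static Map<String, Entry> keepLatest(List<Entry> entries) {
        Map<String, Entry> latest = new TreeMap<>();
        for (Entry e : entries) {
            Entry current = latest.get(e.url);
            // The segment name acts as the timestamp (assumption: String.compareTo order).
            if (current == null || e.segment.compareTo(current.segment) > 0) {
                latest.put(e.url, e);
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("http://example.com/a", "20090101000000", "old page"),
                new Entry("http://example.com/a", "20090215000000", "new page"),
                // Same content under a different URL: merging keeps it.
                new Entry("http://example.com/b", "20090101000000", "new page"));

        for (Entry e : keepLatest(entries).values()) {
            System.out.println(e.url + " <- " + e.segment + ": " + e.data);
        }
    }
}
```

Both URLs survive even though they end up with identical content, which is why a separate de-duplication step is still needed after merging.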
Merging and indexes
The merged segment gets a different name. Since Indexer embeds segment names in indexes, any indexes originally created for the input segments will NOT work with the merged segment. Newly created merged segments need to be indexed afresh. This tool doesn't use existing indexes in any way, so if you plan to merge segments you don't have to index them prior to merging.
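Translation:
This tool merges the data of several segments. Where entries overlap, only the latest version of the data is kept.
In addition, you can use URLFilters to filter out prohibited URLs during the merge. The resulting segment can also be split into multiple chunks.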
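Important notes:
Which parts are merged?
The tool cannot merge segments that are at different stages of processing (for example, one unfetched segment, one fetched but not parsed, and one fetched and parsed). So before merging, it determines the smallest set of data common to all segments, and only that common set is merged. This can produce unexpected results: for example, if many segments have been fetched and parsed but a single segment has not been fetched, the tool falls back to merging only the fetchlists and ignores all other data in every segment.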
Merging fetchlists
In one sentence: merging segments that contain only fetchlists is not recommended.
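Duplicate content
Merging segments removes older versions of content wherever possible. However, this is not the same as de-duplication in Nutch, which also removes identical content found at different URLs. (This tool only does the simpler job of keeping the newest version for each URL during the merge.) In other words, running DeleteDuplicates is still necessary.
For data such as ParseText it is not possible to know which version is older. The tool therefore uses segment names as timestamps for all input data. Segment names are compared in lexicographic order, and data from the newest segment wins. This is why it is important that segment names increase lexicographically with creation time.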
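Merging and indexes
The merged segment gets a different name. Because the indexer embeds segment names in the index files, indexes originally created for the input segments will not match the merged segment. The newly merged segment has to be indexed again. This tool does not touch existing indexes at all, so if you plan to merge segments there is no need to index them before merging.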