Nutch+HBase

Latest recommended article published on 2024-10-17 22:17:12

rongrong0206

Category: Search Engines/hadoop. Tags: hbase parsing structure access url methods

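Just as we were fretting over Nutch's architecture, the Nutch developers delivered nutchbase. Some simple tests of mine show that, with minor modifications, it runs on hadoop 0.20.1 and hbase 0.20.2.
Its advantage is obvious: a sound architecture.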

Here is how the developer describes it, quoted from JIRA:
http://issues.apache.org/jira/browse/NUTCH-650

A) Why integrate with hbase?

All your data in a central location
No more segment/crawldb/linkdb merges.
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration.
A much simpler data model. If you want to update a small part of a single record, you currently have to write an MR job that reads the relevant directory, changes the single record, removes the old directory and renames the new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language, may be easier for people to understand and use.

B) Design

Design is actually rather straightforward.

We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns.
So now most jobs just take the name of the table as input.
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates.
URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
CrawlDatum statuses are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
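
The row wrapper and the reversed-host row key described above can be sketched together as follows. This is a minimal Python illustration, not the nutchbase Java code: the dict standing in for an HBase row, the method names get, put and to_batch_update, and the handling of URLs without an explicit port are all assumptions made for the example. Only the class names and the example URL come from the text.

```python
from urllib.parse import urlsplit


def reverse_url(url):
    """Build a row key with the host labels reversed.

    Assumes an absolute URL with a non-empty path; when no port is given,
    the port field is simply omitted (an assumption of this sketch).
    """
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    key = host + ":" + parts.scheme
    if parts.port is not None:
        key += ":" + str(parts.port)
    key += parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def unreverse_url(key):
    """Invert reverse_url for keys produced by it."""
    slash = key.find("/")
    if slash == -1:
        slash = len(key)
    fields = key[:slash].split(":")  # [reversed host, scheme, optional port]
    host = ".".join(reversed(fields[0].split(".")))
    url = fields[1] + "://" + host
    if len(fields) == 3:
        url += ":" + fields[2]
    return url + key[slash:]


class ImmutableRowPart:
    """Read-only view over one stored row (a plain dict stands in for RowResult)."""

    def __init__(self, row_key, stored):
        self.row_key = row_key
        self._stored = dict(stored)

    def get(self, column):
        return self._stored.get(column)


class RowPart(ImmutableRowPart):
    """Adds setters; pending updates shadow the stored values on read."""

    def __init__(self, row_key, stored):
        super().__init__(row_key, stored)
        self._updates = {}

    def get(self, column):
        # Check the pending updates first, then fall back to the stored row.
        if column in self._updates:
            return self._updates[column]
        return super().get(column)

    def put(self, column, value):
        self._updates[column] = value

    def to_batch_update(self):
        # Only the changed columns are sent back to the table.
        return dict(self._updates)
```

Because the keys begin with the reversed host, sorting them lexicographically places pages from the same domain next to each other, which is the locality property the design relies on.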
Jobs:

Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated, it marks the URL with a TMP_FETCH_MARK. (Marking a URL is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark.
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row.
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker.
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase.
ParseTable: Similar to original Parser. Outlinks are stored "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers.
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully.
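
The marker hand-off between the jobs listed above can be sketched with an in-memory table. This is an illustrative Python model under stated assumptions, not the nutchbase implementation: the table is a dict keyed by plain URL, the "status:" column is the one named in the text, and every other column name, the marker column names, the status values, and the download and extract_links callables are hypothetical. The max-per-host limit and the indexer are left out.

```python
STATUS = "status:"            # column named in the text
INJECT_MARK = "mark:inject"   # hypothetical marker columns
FETCH_MARK = "mark:fetch"     # plays the role of TMP_FETCH_MARK
FETCHED_MARK = "mark:fetched"
PARSED_MARK = "mark:parsed"
MARKS = (INJECT_MARK, FETCH_MARK, FETCHED_MARK, PARSED_MARK)


def inject(table, urls):
    for url in urls:                 # first job: mark the injected URLs
        table.setdefault(url, {})[INJECT_MARK] = "y"
    for row in table.values():       # second job: initialise rows that are new
        if INJECT_MARK in row and STATUS not in row:
            row[STATUS] = "unfetched"


def generate(table, top_n):
    picked = 0
    for row in table.values():
        if picked < top_n and row.get(STATUS) == "unfetched":
            row[FETCH_MARK] = "y"
            picked += 1


def fetch(table, download):
    for url, row in table.items():
        if FETCH_MARK not in row:
            continue                 # skip rows the generator did not select
        content = download(url)
        if content is not None:
            row["content:"] = content
            row[STATUS] = "fetched"
            row[FETCHED_MARK] = "y"


def parse(table, extract_links):
    for row in table.values():
        if FETCHED_MARK not in row:
            continue
        # One column per link found in the page; the value is the anchor text.
        for link, anchor in extract_links(row["content:"]).items():
            row["outlinks:" + link] = anchor
        row[PARSED_MARK] = "y"


def update(table):
    discovered = []
    for url, row in table.items():
        if PARSED_MARK in row:
            for column, anchor in row.items():
                if column.startswith("outlinks:"):
                    discovered.append((column[len("outlinks:"):], url, anchor))
    for target, source, anchor in discovered:
        row = table.setdefault(target, {})       # updatedb: add newly found URLs
        row.setdefault(STATUS, "unfetched")
        row["inlinks:" + source] = anchor        # invertlinks: record the inlink
    for row in table.values():                   # finally, clear all markers
        for mark in MARKS:
            row.pop(mark, None)
```

Running inject, generate, fetch, parse and update in that order on an empty dict completes one crawl cycle. Each job touches only the rows carrying the previous job's marker, so no intermediate data set has to be copied from one job to the next.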