Nutch+HBase

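Just as we were fretting over Nutch's architecture, the Nutch developers delivered nutchbase. Some simple tests of mine show that, with minor modifications, it runs on Hadoop 0.20.1 and HBase 0.20.2.
Its advantage is obvious: a sound architecture.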

Here is how the developer describes it, quoted from JIRA:
http://issues.apache.org/jira/browse/NUTCH-650 


A) Why integrate with hbase? 

All your data in a central location 
No more segment/crawldb/linkdb merges. 
No more "missing" data in a job. There are a lot of places where we copy data from one structure to another just so that it is available in a later job. For example, during parsing we don't have access to a URL's fetch status. So we copy fetch status into content metadata. This will no longer be necessary with hbase integration. 
A much simpler data model. If you want to update a small part in a single record, now you have to write a MR job that reads the relevant directory, change the single record, remove old directory and rename new directory. With hbase, you can just update that record. Also, hbase gives us access to Yahoo! Pig, which I think, with its SQL-ish language may be easier for people to understand and use. 
B) Design 
Design is actually rather straightforward. 

We store everything (fetch time, status, content, parsed text, outlinks, inlinks, etc.) in hbase. I have written a small utility class that creates "webtable" with necessary columns. 
So now most jobs just take the name of the table as input. 
There are two main classes for interfacing with hbase. ImmutableRowPart wraps around a RowResult and has helper getters (getStatus(), getContent(), etc.). RowPart is similar to ImmutableRowPart but also has setters. The idea is that RowPart also wraps RowResult but also keeps a list of updates done to that row. So when getSomething is called, it first checks if Something is already updated (if so then returns the updated version) or returns from RowResult. RowPart can also create a BatchUpdate from its list of updates. 
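The read-through behaviour described for RowPart can be sketched as follows. This is only an illustration under stated assumptions, not the nutchbase code: a plain Map of strings stands in for HBase's RowResult, the class names ImmutableRow and RowPart inside RowOverlaySketch are hypothetical, and pendingUpdates() stands in for the step that would build a BatchUpdate. Only the "status:" column name comes from the quoted text.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class RowOverlaySketch {

    /** Read-only view of a fetched row; the Map is a stand-in for RowResult. */
    static class ImmutableRow {
        protected final Map<String, String> stored;

        ImmutableRow(Map<String, String> stored) {
            // Copy defensively so later changes by the caller do not leak in.
            this.stored = Collections.unmodifiableMap(new LinkedHashMap<>(stored));
        }

        String get(String column) {
            return stored.get(column);
        }

        String getStatus() {
            return get("status:");
        }
    }

    /** Adds setters: writes go to an overlay, reads consult the overlay first. */
    static class RowPart extends ImmutableRow {
        private final Map<String, String> updates = new LinkedHashMap<>();

        RowPart(Map<String, String> stored) {
            super(stored);
        }

        @Override
        String get(String column) {
            // Return the updated value if there is one, else the stored value.
            return updates.containsKey(column) ? updates.get(column) : stored.get(column);
        }

        void setStatus(String status) {
            updates.put("status:", status);
        }

        /** Only the changed columns; the original would turn these into a BatchUpdate. */
        Map<String, String> pendingUpdates() {
            return Collections.unmodifiableMap(updates);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fetched = new LinkedHashMap<>();
        fetched.put("status:", "unfetched");

        RowPart row = new RowPart(fetched);
        row.setStatus("fetched");

        System.out.println(row.getStatus());      // value from the overlay
        System.out.println(row.pendingUpdates()); // only the changed column
    }
}
```

Keeping the updates separate from the fetched row means that only the changed columns need to be written back, which fits the point made earlier about updating a single record in place.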
URLs are stored in reversed host order. For example, http://bar.foo.com:8983/to/index.html?a=b becomes com.foo.bar:http:8983/to/index.html?a=b. This way, URLs from the same tld/host/domain are stored closer to each other. TableUtil has methods for reversing and unreversing URLs.
CrawlDatum Status-es are simplified. Since everything is in a central location now, there is no point in having separate DB and FETCH statuses.
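The reversed-host row key described above can be reproduced with a few lines of Java. This is a minimal sketch, not TableUtil itself: the class and method names (UrlReversalSketch, reverseUrl, unreverseUrl) are hypothetical, and it assumes absolute http(s) URLs with a plain host name (no IPv6 literals, no user info) and a path that starts with "/".

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class UrlReversalSketch {

    /** Reverses dot-separated host labels: bar.foo.com -> com.foo.bar. */
    static String reverseHost(String host) {
        List<String> labels = new ArrayList<>(List.of(host.split("\\.")));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    /** Builds the key: reversed host, ':' scheme, optional ':' port, then path and query. */
    static String reverseUrl(String url) {
        URI uri = URI.create(url);
        StringBuilder key = new StringBuilder(reverseHost(uri.getHost()));
        key.append(':').append(uri.getScheme());
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        key.append(uri.getRawPath());
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    /** Inverse of reverseUrl, under the same assumptions. */
    static String unreverseUrl(String key) {
        int pathStart = key.indexOf('/');
        String head = pathStart < 0 ? key : key.substring(0, pathStart);
        String rest = pathStart < 0 ? "" : key.substring(pathStart);
        String[] fields = head.split(":"); // reversed host, scheme, optional port
        StringBuilder url = new StringBuilder(fields[1])
                .append("://")
                .append(reverseHost(fields[0]));
        if (fields.length > 2) {
            url.append(':').append(fields[2]);
        }
        return url.append(rest).toString();
    }

    public static void main(String[] args) {
        // prints com.foo.bar:http:8983/to/index.html?a=b
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));

        // Sorting the keys groups pages of the same domain together.
        List<String> keys = new ArrayList<>();
        keys.add(reverseUrl("http://bar.foo.com/x"));
        keys.add(reverseUrl("http://www.example.org/"));
        keys.add(reverseUrl("http://baz.foo.com/y"));
        Collections.sort(keys);
        keys.forEach(System.out::println);
    }
}
```

Because HBase keeps rows sorted by key, keys built this way place bar.foo.com and baz.foo.com next to each other, which is the locality the quoted design relies on.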
Jobs: 

Each job marks rows so that the next job knows which rows to read. For example, if GeneratorHbase decides that a URL should be generated it marks the URL with a TMP_FETCH_MARK (Marking a url is simply creating a special metadata field.) When FetcherHbase runs, it skips over anything without this special mark. 
InjectorHbase: First, a job runs where injected urls are marked. Then in the next job, if a row has the mark but nothing else (here, I assumed that if a row has "status:" column, that it already exists), InjectorHbase initializes the row. 
GeneratorHbase: Supports max-per-host configuration and topN. Marks generated urls with a marker. 
FetcherHbase: Very similar to original Fetcher. Marks urls successfully fetched. Skips over URLs not marked by GeneratorHbase.
ParseTable: Similar to original Parser. Outlinks are stored as "outlinks:<fromUrl>" -> "anchor".
UpdateTable: Does updatedb's and invertlink's job. Also clears any markers. 
IndexerHbase: Indexes the entire table. Skips over URLs not parsed successfully. 
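The way marks hand rows from one job to the next can be illustrated with an in-memory table. This is a rough sketch under assumptions, not the nutchbase jobs: a sorted Map of rows stands in for the webtable, the column names "meta:TMP_FETCH_MARK", "meta:FETCHED_MARK" and "content:" are hypothetical (only the mark name TMP_FETCH_MARK appears in the quoted text), and the generator simply takes the first topN unmarked rows instead of applying scoring or max-per-host limits.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class MarkerPipelineSketch {
    // Mark name from the quoted text; the "meta:" prefix is an assumption.
    static final String TMP_FETCH_MARK = "meta:TMP_FETCH_MARK";
    // Hypothetical name for the mark the fetcher leaves on success.
    static final String FETCHED_MARK = "meta:FETCHED_MARK";

    /** Generator step: marks up to topN rows that are not yet marked. */
    static int generate(Map<String, Map<String, String>> table, int topN) {
        int marked = 0;
        for (Map<String, String> row : table.values()) {
            if (marked >= topN) {
                break;
            }
            if (!row.containsKey(TMP_FETCH_MARK)) {
                row.put(TMP_FETCH_MARK, "1");
                marked++;
            }
        }
        return marked;
    }

    /** Fetcher step: skips every row the generator did not mark. */
    static int fetch(Map<String, Map<String, String>> table) {
        int fetched = 0;
        for (Map.Entry<String, Map<String, String>> entry : table.entrySet()) {
            Map<String, String> row = entry.getValue();
            if (!row.containsKey(TMP_FETCH_MARK)) {
                continue;
            }
            row.put("content:", "placeholder content for " + entry.getKey());
            row.put(FETCHED_MARK, "1");
            fetched++;
        }
        return fetched;
    }

    /** Update step: clears the marks so the next cycle starts clean. */
    static void clearMarks(Map<String, Map<String, String>> table) {
        for (Map<String, String> row : table.values()) {
            row.remove(TMP_FETCH_MARK);
            row.remove(FETCHED_MARK);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = new TreeMap<>();
        table.put("com.foo.bar:http/a", new HashMap<>());
        table.put("com.foo.bar:http/b", new HashMap<>());
        table.put("org.example:http/", new HashMap<>());

        System.out.println("generated: " + generate(table, 2));
        System.out.println("fetched:   " + fetch(table));
        clearMarks(table);
        System.out.println(table);
    }
}
```

Each step reads the whole table and filters on a mark rather than receiving a separate input directory, which mirrors the statement that most jobs just take the table name as input.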

