Bigtable / HBase study notes

1. Read the HBase architecture page: http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture
First read the Bigtable paper: http://labs.google.com/papers/bigtable.html

• Want asynchronous processes to be continuously updating different pieces of data
  – Want access to most current data at any time
• Need to support:
  – Very high read/write rates (millions of ops per second)
  – Efficient scans over all or interesting subsets of data
  – Efficient joins of large one-to-one and one-to-many datasets

But how?

2. Key concepts:
Column-oriented? What is the key point?
Write-optimized: column aggregation.

http://heart.korea.ac.kr/trac/wiki/ReadablePapers

The row/column space is sparse.
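The paper defines the data model as a sparse, distributed, persistent, multi-dimensional sorted map: (row, column, timestamp) -> value. A minimal single-machine sketch in Java, just to make the shape concrete (the nested TreeMap layout is my illustration, not how Bigtable or HBase is actually implemented):

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the Bigtable data model from the paper:
// (row:string, column:string, time:int64) -> value.
// Sparse: a row stores only the columns actually written to it.
public class SparseTableSketch {
    // row key -> (column "family:qualifier" -> (timestamp -> value))
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> table =
            new TreeMap<>();

    public void put(String row, String column, long ts, byte[] value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             // timestamps sorted newest-first, mirroring versioned cells
             .computeIfAbsent(column, c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Most recent version of one cell, or null if the cell was never written.
    public byte[] getLatest(String row, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> r = table.get(row);
        if (r == null) return null;
        NavigableMap<Long, byte[]> versions = r.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }
}
```

Because rows are kept sorted by key, range scans over contiguous row keys stay cheap, which matters for the schema questions further down.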

3. http://torrez.us/archives/2005/10/24/407/ on Bigtable as an RDF store; it seems good. Copied here so it isn't forgotten:

This is excellent news for the Semantic Web. Google is building the RDF database we’ve been trying to build and to this date even though conceptually we are on the right track, our implementations do not scale in ways that would even match standard relational models today. Thus, making it very hard for real systems to adopt RDF as their platform today. However, all of this is going to change with BigTable, but let’s pay attention to the details in the description and a summary from Andrew Hitchcock.

* Storing and managing very large amounts of structured data
* Row/column space can be sparse
* Columns are in the form of “family:optional_qualifier”. RDF Properties, Yeah!
* Columns have type information
* Because of the design of the system, columns are easy to create (and are created implicitly)
* Column families can be split into locality groups (Ontologies!)

Why do I think this is an RDF database? Well, in case you might not know, one of the problems with existing relational database models is that they are not flexible enough. If a company like Amazon starts carrying a new type of product with attributes not currently built into their systems, they have to jump through hoops to recreate the tables that store and manage product information. RDF, as an extensible description framework, answers this problem, because it allows a resource to have an unlimited number of properties associated with it. However, when we implement RDF stores atop existing RDBMS, we begin to use a row for each new property/attribute that we would like to store about the resource, thus making it sub-optimal for joins and other operations.

Here is where BigTable comes in, because its row/column space can be sparse (not all rows/resources contain all the same properties) and columns can be easily created with very little cost. Additionally, you can maintain a locality for families of properties, which we called Ontologies, so if we wanted all properties about a blog entry, we could get them fast enough (i.e. a locality for all Atom metadata columns).

Anyways, I have to get back to my school work, but I hope that everyone sees what I’m seeing and further analyze this talk with more attention to the technical details. I think that better times are coming for the SW and we’ll be soon enjoying a whole new class of semantic services on the Internet. One final note or maybe a whole separate post will be Bosworth’s comments on how we should be limiting our SQL queries in order to gain the performance we need in RDF databases.
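To make the post's idea concrete, here is a hypothetical layout for RDF triples on top of the SparseTableSketch from section 2: subject URI as row key, "family:predicate" as the column, object as the value. All names here are illustrative, not any real store's schema:

```java
// Hypothetical RDF-on-Bigtable layout (illustration only):
//   row key = subject URI, column = "family:predicate", value = object.
public class RdfLayoutDemo {
    public static void main(String[] args) {
        SparseTableSketch store = new SparseTableSketch();
        long now = System.currentTimeMillis();
        String subject = "http://example.org/post/1";

        // Each RDF property is just another sparse column; adding a new
        // property needs no schema migration.
        store.put(subject, "atom:title", now, "BigTable".getBytes());
        store.put(subject, "atom:author", now, "someone".getBytes());

        // The post's "locality groups as Ontologies": keep all "atom:*"
        // columns in one locality group, so fetching a resource's Atom
        // metadata touches one group of columns.
        System.out.println(new String(store.getLatest(subject, "atom:title")));
    }
}
```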

4. For the crawl itself it seems suitable; for the later processing stages it seems even better.

5. Look at how Pig expresses MapReduce queries (a rough sketch of what such a query compiles to follows below).
Then look at how to design a schema for Bigtable. Is Bigtable suitable here?
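On the Pig question: Pig Latin scripts compile down to Hadoop MapReduce jobs. A rough Java sketch of the kind of job a Pig "GROUP urls BY host; COUNT" would turn into, assuming one URL per line of input (the class name and file layout are mine):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Roughly the job a Pig group-by-host-and-count compiles to:
// map emits (host, 1), combine/reduce sum the ones.
public class UrlsPerHost {

    public static class HostMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text host = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            try {
                String h = new URI(line.toString().trim()).getHost();
                if (h != null) {            // skip relative URLs
                    host.set(h);
                    ctx.write(host, ONE);
                }
            } catch (java.net.URISyntaxException e) {
                // ignore unparsable lines
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text host, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(host, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "urls-per-host");
        job.setJarByClass(UrlsPerHost.class);
        job.setMapperClass(HostMapper.class);
        job.setCombinerClass(SumReducer.class);   // safe: sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```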

6. How to get all URLs of one host? What should the schema be? Is a secondary index needed? Maybe yes, if it turns out to be necessary. A row-key sketch follows below.
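The Bigtable paper's Webtable example answers exactly this: store pages under row keys whose hostname components are reversed (e.g. com.cnn.www), so all pages from one host sort adjacently and a prefix range scan retrieves them without any secondary index. A small sketch of that key scheme:

```java
import java.net.URI;

// Row-key scheme from the Bigtable paper's Webtable example: reverse the
// hostname components so all pages of one host (and one domain) sort
// together; "all URLs of one host" is then a prefix range scan.
public class RowKeys {

    // "http://www.example.com/a/b" -> "com.example.www/a/b"
    public static String toRowKey(URI url) {
        String[] parts = url.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            key.append(parts[i]);
            if (i > 0) key.append('.');
        }
        return key.append(url.getPath()).toString();
    }

    public static void main(String[] args) throws Exception {
        String prefix = toRowKey(new URI("http://www.example.com/"));
        // Scanning rows in [prefix, prefix-successor) enumerates the host's pages.
        System.out.println("scan prefix: " + prefix);  // com.example.www/
    }
}
```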
http://www.infoq.com/news/2008/04/hbase-interview

Where HBase fits in, from the interview:

The M/R paradigm applies well to batch processing of data. How does Hadoop apply in a more transaction/single request based paradigm?

MapReduce (both Google's and Hadoop's) is ideal for processing huge amounts of data with sizes that would not fit in a traditional database. Neither is appropriate for transaction/single request processing. While HBase uses HDFS from Hadoop Core, it doesn't use MapReduce in its common operations.

However, HBase does support efficient random accesses, so it can be used for some of the transactional elements of your business. You will take a raw performance hit over something like MySQL, but you get the benefit of very good scaling characteristics as your transactional throughput grows.
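A minimal sketch of that random-access path using the HBase client API. Note this is the HTable/Put/Get API of later 0.90-era releases, not the client API HBase had at the time of this 2008 interview, and the table and column names are examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Single-row random access in HBase: no MapReduce involved, the client
// talks directly to the region server holding the row.
public class RandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");  // table name is an example
        try {
            // Write one cell: row "com.example.www/", family "contents".
            Put put = new Put(Bytes.toBytes("com.example.www/"));
            put.add(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read it back by key: a point get, not a scan and not a job.
            Result r = table.get(new Get(Bytes.toBytes("com.example.www/")));
            byte[] html = r.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        } finally {
            table.close();
        }
    }
}
```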