Relational Database Management Systems (RDBMS):
Around since the 1970s
Countless examples in which they actually do make sense
The dawn of Big Data:
Previously: data sources were ignored because there was no cost-effective way to store everything; one option was to prune, retaining only the data for the last N days.
Today: store everything! Pruned data cannot provide the base needed to build useful mathematical models.
Batch processing
Hadoop and MapReduce:
Excels at storing semi-structured and unstructured data
Data interpretation takes place at analysis time (“schema on read”)
Flexibility in data classification
Batch Processing: A complement to RDBMS:
A scalable sink for data; processing is launched when the time is right
Optimized for large file storage
Optimized for “streaming” access
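To make the three points above concrete, here is a minimal sketch of “streaming” access to a large file stored in HDFS, using the standard Hadoop FileSystem API. The file path and the tab-separated record format are hypothetical; note that each line is interpreted only here, at analysis time, not when it was stored.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamLogs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open a (hypothetical) large raw log file and read it sequentially,
        // front to back -- the access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(new Path("/data/raw/access.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // No schema was imposed at write time: the raw line is
                // interpreted only now, at analysis time.
                String[] fields = line.split("\t");
                // ... feed fields into the batch computation ...
            }
        }
    }
}
```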
Random access:
Users need to “interact” with the data, especially the data “crunched” by a MapReduce job
This is historically where RDBMS excel: random access for structured data.
Column-Oriented Databases
Data layout:
(1). Save their data grouped by columns
(2). Subsequent column values are stored contiguously on disk
(3). This is substantially different from traditional RDBMS, which store data row by row (see the layout sketch below)
Specialized databases for specific workloads:
(1). Reduced I/O
(2). Better suited to compression -> efficient use of I/O bandwidth
Indeed, column values are often similar and differ little from row to row
(3). Real-time access to data
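A minimal sketch of the difference between the two layouts; the three-record table (id, country, clicks) is invented for illustration. Grouping values by column places similar values next to each other, which is why column stores compress well and can read only the columns a query needs.

```java
public class LayoutDemo {
    // Three logical records of a hypothetical table: (id, country, clicks)
    static int[]    id      = {1, 2, 3};
    static String[] country = {"IT", "IT", "FR"};
    static int[]    clicks  = {10, 12, 11};

    public static void main(String[] args) {
        // Row-oriented layout: all values of ONE record are contiguous.
        // On disk: 1,IT,10 | 2,IT,12 | 3,FR,11
        for (int i = 0; i < id.length; i++) {
            System.out.print(id[i] + "," + country[i] + "," + clicks[i] + " | ");
        }
        System.out.println();

        // Column-oriented layout: all values of ONE column are contiguous.
        // On disk: 1,2,3 | IT,IT,FR | 10,12,11
        // Adjacent values are similar (IT,IT), so they compress well, and a
        // query that only needs `clicks` reads just that column's bytes.
        System.out.println(joinInts(id) + " | "
                + String.join(",", country) + " | " + joinInts(clicks));
    }

    static String joinInts(int[] xs) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < xs.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(xs[i]);
        }
        return sb.toString();
    }
}
```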
Important NOTE:
(1). HBase is not a column-oriented DB in the typical sense
(2). HBase uses an on-disk column-oriented storage format
(3). Provides key-based access to a specific cell of data, or to a sequential range of cells.
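A minimal sketch of both access patterns with the HBase client API (2.x style). The table name shorturl, the column family data, and the row keys are assumptions made for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("shorturl"))) {

            // Key-based access to a specific cell.
            Get get = new Get(Bytes.toBytes("a5col"));
            get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("url"));
            Result result = table.get(get);
            byte[] url = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("url"));
            System.out.println(Bytes.toString(url));

            // Sequential range of cells: scan between two row keys.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("a5col"))
                    .withStopRow(Bytes.toBytes("a5con"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    // process each row in the key range, in key order
                }
            }
        }
    }
}
```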
The Problem with RDBMS
RDBMS are still relevant
(1). Persistence layer for frontend applications
(2). Store relational data
(3). Work well for a limited number of records
Example: Hush
(1). Used throughout this course
(2). URL shortener service
Let’s see the “scalability story” of such a service
(1). Assumption: service must run with a reasonable budget
(2). A few thousand users: use a LAMP stack
Normalize data
Use foreign keys
Use indexes
Find all short URLs for a given user
JOIN user and shorturl tables
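A sketch of that query over a normalized schema, issued through plain JDBC. The table and column names (user, shorturl, the user_id foreign key) are assumptions about the Hush schema; an index on shorturl.user_id is what keeps this JOIN cheap at small scale.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class UserUrls {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/hush", "hush", "secret")) {
            // shorturl.user_id is a foreign key into user.id.
            String sql = "SELECT s.short_id, s.url "
                       + "FROM shorturl s JOIN user u ON s.user_id = u.id "
                       + "WHERE u.username = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "someuser");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }
}
```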
Stored Procedures
(1). Consistently update data from multiple clients
(2). The underlying DB system guarantees coherency
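A sketch of the idea, assuming a hypothetical MySQL stored procedure add_click that increments a click counter server-side: every client calls the same routine, so the update logic lives in one place and the DB keeps the data coherent.

```java
import java.sql.CallableStatement;
import java.sql.Connection;

public class ClickCounter {
    // The procedure is assumed to be defined on the server, e.g.:
    //   CREATE PROCEDURE add_click(IN p_short_id VARCHAR(8))
    //     UPDATE shorturl SET clicks = clicks + 1 WHERE short_id = p_short_id;
    public static void addClick(Connection conn, String shortId) throws Exception {
        try (CallableStatement cs = conn.prepareCall("{call add_click(?)}")) {
            cs.setString(1, shortId);
            cs.execute();  // all clients run the same server-side logic
        }
    }
}
```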
Transactions
(1). Make sure you can update tables in an atomic fashion
(2). RDBMS -> Strong Consistency (ACID properties)
(3). Referential Integrity
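A minimal JDBC sketch of an atomic two-table update (the schema is again hypothetical): either the new short URL row and the owner's counter update both commit, or neither does. The foreign key from shorturl.user_id to user.id is what referential integrity protects.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class AtomicUpdate {
    public static void createShortUrl(Connection conn, String shortId,
                                      String url, long userId) throws Exception {
        conn.setAutoCommit(false);  // start a transaction
        try (PreparedStatement insertUrl = conn.prepareStatement(
                 "INSERT INTO shorturl (short_id, url, user_id) VALUES (?, ?, ?)");
             PreparedStatement bumpCount = conn.prepareStatement(
                 "UPDATE user SET url_count = url_count + 1 WHERE id = ?")) {
            insertUrl.setString(1, shortId);
            insertUrl.setString(2, url);
            insertUrl.setLong(3, userId);
            insertUrl.executeUpdate();

            bumpCount.setLong(1, userId);
            bumpCount.executeUpdate();

            conn.commit();            // both changes become visible together
        } catch (Exception e) {
            conn.rollback();          // ... or neither is applied
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```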
Scaling up to tens of thousands of users
(1). Increasing pressure on the database server
(2). Adding more application servers is easy: they keep their shared state in the same central DB
(3). CPU and I/O start to become a problem on the DB server
Master-Slave architecture
(1). Add DB servers so that READS can be served in parallel
(2). The master DB takes all the WRITES (which are few in the Hush application)
(3). Slave DBs replicate the master DB and serve all READS (but you need a load balancer)
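A minimal sketch of the read/write split as seen from the application; the JDBC URLs are placeholders, and the random pick over slaves is a trivial stand-in for a real load balancer.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class ReplicatedDb {
    private final String masterUrl = "jdbc:mysql://master/hush";
    private final List<String> slaveUrls = List.of(
            "jdbc:mysql://slave1/hush",
            "jdbc:mysql://slave2/hush");

    // All WRITES go to the single master.
    Connection writeConnection() throws Exception {
        return DriverManager.getConnection(masterUrl);
    }

    // READS are spread across the slaves, which replicate the master.
    Connection readConnection() throws Exception {
        int i = ThreadLocalRandom.current().nextInt(slaveUrls.size());
        return DriverManager.getConnection(slaveUrls.get(i));
    }
}
```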
Scaling up to hundreds of thousands
(1). READS are still the bottleneck
(2). Slave servers begin to fall short in serving client requests.
Caching
(1). Add a caching layer, e.g. Memcached or Redis
(2). Offload READS to a fast in-memory system
You lose consistency guarantees
Cache invalidation is critical for keeping the DB and the caching layer consistent (see the cache-aside sketch after this list)
(3). WRITES are the bottleneck
(4). The master DB is hit too hard by WRITE load
(5). Vertical scalability: beef up your master server
This becomes costly, as you may also have to replace your RDBMS
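A sketch of the cache-aside pattern behind points (1) and (2), including the invalidation step that keeps the DB and the cache from diverging. A ConcurrentHashMap stands in for Memcached or Redis, and UrlStore is a hypothetical DAO over the RDBMS.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedUrlLookup {
    // Stand-in for Memcached/Redis; in production this is a remote client.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final UrlStore db;  // hypothetical DAO wrapping the RDBMS

    public CachedUrlLookup(UrlStore db) { this.db = db; }

    // Cache-aside READ: serve from memory when possible, else hit the DB.
    public String resolve(String shortId) {
        String url = cache.get(shortId);
        if (url == null) {
            url = db.lookupUrl(shortId);  // slow path: the READ reaches the DB
            if (url != null) cache.put(shortId, url);
        }
        return url;
    }

    // WRITE path: update the DB, then invalidate the stale cache entry.
    // Skipping the invalidation is exactly how cache and DB diverge.
    public void update(String shortId, String newUrl) {
        db.updateUrl(shortId, newUrl);
        cache.remove(shortId);
    }

    interface UrlStore {
        String lookupUrl(String shortId);
        void updateUrl(String shortId, String newUrl);
    }
}
```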
SQL JOINs become a bottleneck
(1). Schema de-normalization (see the sketch after this list)
(2). Cease using stored procedures, as they become slow and eat up a lot of server CPU
(3). Materialized views (they speed up READS)
(4). Drop secondary indexes, as they slow down WRITES
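A sketch of what de-normalization buys on the hot READ path, assuming a hypothetical username column copied into each shorturl row (e.g. added with ALTER TABLE shorturl ADD COLUMN username VARCHAR(64)): the lookup becomes a single-table query with no JOIN.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DenormalizedLookup {
    // Before: SELECT ... FROM shorturl s JOIN user u ON s.user_id = u.id
    // After de-normalization the username lives in the shorturl row itself.
    public static String[] resolve(Connection conn, String shortId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT url, username FROM shorturl WHERE short_id = ?")) {
            ps.setString(1, shortId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) return null;
                // Trade-off: a username change now touches many rows, and the
                // application, not the schema, keeps the copies in sync.
                return new String[] { rs.getString(1), rs.getString(2) };
            }
        }
    }
}
```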
What if your application needs to further scale up?
Vertical scalability vs. horizontal scalability
Sharding
Partition your data across multiple databases
Essentially, you break your tables horizontally and ship the pieces to different servers (see the routing sketch after this list)
This is done using fixed boundaries
Re-sharding to achieve load-balancing
This is an operational nightmare
Re-sharding takes a huge toll on I/O resources
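A minimal sketch of hash-based sharding with fixed boundaries; the shard URLs are placeholders. The routing rule also shows why re-sharding hurts: adding one more shard changes userId mod N for almost every key, so rows must be moved between servers in bulk.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;

public class ShardRouter {
    // Fixed boundaries: each shard permanently owns a slice of the key space.
    private final List<String> shardUrls = List.of(
            "jdbc:mysql://shard0/hush",
            "jdbc:mysql://shard1/hush",
            "jdbc:mysql://shard2/hush");

    // Horizontal partitioning: every user's rows live on exactly one shard,
    // chosen by hashing the user id modulo the (fixed) number of shards.
    Connection shardFor(long userId) throws Exception {
        int shard = (int) Math.floorMod(userId, (long) shardUrls.size());
        return DriverManager.getConnection(shardUrls.get(shard));
    }
}
```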