Translation of HBase: The Definitive Guide, Chapter 1: Introduction

Chapter 1. Introduction
Before we start looking into all the moving parts of HBase, let us
pause to think about why there was a need to come up with yet
another storage architecture. Relational database management
systems (RDBMS) have been around since the early 1970s, and have
helped countless companies and organizations to implement their
solution to given problems. And they are equally helpful today. There
are many use-cases for which the relational model makes perfect
sense. Yet there also seem to be specific problems that do not fit this

model very well. [4]


The Dawn of Big Data
We live in an era in where we are all connected over the Internet and
expect to find results instantaneously, whether the question concerns
the best turkey recipe or what to buy mom for her birthday. We also
expect the results to be useful and tailored to our needs.
Because of this, companies have become focused on delivering more
targeted information, such as recommendations or online ads, and
their ability to do so directly influences their success as a business.
Systems like Hadoop [5] now enable them to gather and process
petabytes of data, and the need to collect even more data continues
to increase with, for example, the development of new machine
learning algorithms.

Where previously companies had the liberty to ignore certain data
sources because there was no cost-effective way to store all that
information, they now are likely to lose out to the competition. There
is an increasing need to store and analyze every data point they
generate. The results then feed directly back into their e-commerce
platforms and may generate even more data.


In the past, the only option to retain all the collected data was to
prune it, for example, to retain only the last N days. While this is a
viable approach in the short term, it forfeits the opportunities that
having all the data, which may span months or years, offers: you can
build mathematical models that span the entire time range, or amend an
algorithm to perform better and rerun it with all the previous data.


Dr. Ralph Kimball, for example, states [6] that
"Data assets are [a] major component of the balance sheet, replacing
traditional physical assets of the 20th century"
and that there is a
"Widespread recognition of the value of data even beyond traditional
enterprise boundaries"
Google and Amazon are prominent examples of companies that
realized the value of data and started developing solutions to fit their
needs.

In a series of technical publications Google, for instance,
described a scalable storage and processing system, based on
commodity hardware. These ideas were then implemented outside of
Google as part of the open-source Hadoop project: HDFS and
MapReduce.



Hadoop excels at storing data of arbitrary, semi- or even unstructured
formats, since it lets you decide how to interpret the data at analysis
time, allowing you to change the way you classify the data at any time:
once you have updated the algorithms you simply run the analysis
again.
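This "schema on read" idea can be sketched in a few lines of Python. This is a toy illustration only; the log lines and the two analysis passes are invented for the example, not taken from Hadoop itself:

```python
# Toy sketch of "schema on read": raw records are stored untouched and
# only interpreted at analysis time, so the interpretation can change later.
raw = ["2024-01-01 login alice", "2024-01-01 login bob", "2024-01-02 buy alice"]

def analyze_v1(lines):
    """First pass over the raw data: count events per day."""
    counts = {}
    for line in lines:
        day, _, _ = line.split()
        counts[day] = counts.get(day, 0) + 1
    return counts

def analyze_v2(lines):
    """Later pass over the SAME raw data: count events per user instead.
    No re-ingestion is needed; only the analysis code changed."""
    counts = {}
    for line in lines:
        _, _, user = line.split()
        counts[user] = counts.get(user, 0) + 1
    return counts
```

Because the raw lines were never forced into a fixed schema at write time, updating the algorithm and rerunning it over all the stored data is all it takes.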
Hadoop also complements existing database systems of almost any
kind. It offers a limitless pool into which one can sink data and still pull
out what is needed when the time is right. It is optimized for large file
storage and batch-oriented, streaming access. This makes analysis
easy and fast, but users also need access to the final data, not in
batch mode but using random access - this is akin to a full table scan
versus using indexes in a database system.
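The contrast between the two access patterns can be sketched as follows. This is a toy Python illustration of full-scan versus indexed lookup, not the Hadoop or HBase API, and the record layout is invented for the example:

```python
# Toy illustration: batch (streaming) access vs. random access.
# A "table" of user records stored as a list of (key, value) pairs.
records = [(f"user{i:04d}", {"clicks": i % 7}) for i in range(10_000)]

# Batch/streaming style: touch every record exactly once (like a full
# table scan or a MapReduce job) to compute an aggregate.
total_clicks = sum(value["clicks"] for _, value in records)

# Random-access style: an index (here a dict) lets us jump straight to
# one record by key, without reading everything else.
index = dict(records)
one_user = index["user0042"]  # O(1) lookup instead of a full scan
```

Hadoop is built for the first style; serving individual records to users requires the second, which is the gap HBase fills.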

We are used to querying databases when it comes to random access
for structured data. RDBMSs are the most prominent, but there are
also quite a few specialized variations and implementations, like
object-oriented databases. Most RDBMSs strive to implement Codd's
12 rules [7], which force them to comply with very rigid requirements.
The architecture used underneath is well researched and has not
changed significantly in quite some time. The recent advent of
different approaches, like column-oriented or massively parallel
processing (MPP) databases, has shown that we can rethink the
technology to fit specific workloads, but most solutions still implement
all or the majority of Codd's 12 rules in an attempt to not break with
tradition.

Column-Oriented Databases
Column-oriented databases save their data grouped by columns.
Subsequent column values are stored contiguously on disk. This
differs from the usual row-oriented approach of traditional databases,
which store entire rows contiguously - see Figure 1.1,
“Column-oriented storage layouts differ from the row-oriented ones”,
for a visualization of the different physical layouts.
The reason to store values on a per column basis instead is based on
the assumption that for specific queries not all of them are needed.

Especially in analytical databases this is often the case and therefore
they are good candidates for this different storage schema.
Reduced IO is one of the primary reasons for this new layout but it
offers additional advantages playing into the same category: since
the values of one column are often very similar in nature or even vary
only slightly between logical rows they are often much better suited
for compression than the heterogeneous values of a row-oriented
record structure: most compression algorithms only look at a finite
window.
Specialized algorithms, for example delta and/or prefix compression,
selected based on the type of the column (i.e. on the data stored) can
yield huge improvements in compression ratios. Better ratios result in
more efficient bandwidth usage in return.
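The two encodings named above can be sketched as follows. These are simplified toy implementations for illustration, not the actual codecs any particular database ships:

```python
def delta_encode(values):
    """Store the first value, then only the differences between
    neighbors. Works well for sorted or slowly changing numeric
    columns, where the deltas are small."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def prefix_encode(strings):
    """Store how many leading characters each string shares with its
    predecessor, plus only the differing suffix. Good for columns of
    highly similar values such as keys."""
    encoded, prev = [], ""
    for s in strings:
        n = 0
        while n < min(len(prev), len(s)) and prev[n] == s[n]:
            n += 1
        encoded.append((n, s[n:]))
        prev = s
    return encoded

timestamps = [1000, 1003, 1004, 1010]           # nearly monotonic column
assert delta_decode(delta_encode(timestamps)) == timestamps

keys = ["user_0001", "user_0002", "user_0010"]  # highly similar values
compact = prefix_encode(keys)                   # suffixes shrink sharply
```

Because every value in the column is of the same type and usually close to its neighbor, these per-column encodings achieve ratios a generic row-oriented compressor rarely reaches.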


Note though that HBase is not a column-oriented database in the
typical RDBMS sense, but utilizes an on-disk column storage format.
This is also where the majority of similarities end, because although
HBase stores data on disk in a column-oriented format, it is distinctly
different from traditional columnar databases: whereas columnar
databases excel at providing real-time analytical access to data, HBase
excels at providing key-based access to a specific cell of data, or a
sequential range of cells.
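The access pattern HBase favors can be sketched with a plain sorted map. This is a toy Python model; `get` and `scan` here only mimic the spirit of HBase's operations, not its actual API:

```python
import bisect

# A toy "table": a sorted map from row key to cell value.
store = {
    "row-001": "a", "row-002": "b", "row-005": "c", "row-009": "d",
}
sorted_keys = sorted(store)

def get(key):
    """Fetch a single cell by its exact key."""
    return store.get(key)

def scan(start, stop):
    """Return cells whose keys fall in [start, stop), in key order -
    a sequential range of cells, like an HBase scan."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return [(k, store[k]) for k in sorted_keys[lo:hi]]
```

Both operations locate data by key rather than by scanning whole columns, which is the distinction the paragraph above draws against analytical columnar databases.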




