
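HBase (Google Bigtable)
Average article quality score: 87
HBase is a distributed, column-oriented, open-source database designed to store very large volumes of data and to provide fast read/write access to it. It runs on top of the Hadoop Distributed File System (HDFS) and offers distributed storage capabilities similar to Google's Bigtable. HBase provides high reliability, high performance, scalability, and high availability, and it supports large-scale data storage and processing.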
Bol5261
Begin here!
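HBase is a column-family-based NoSQL database and an open-source implementation of Google's Bigtable model
HBase partitions table data into regions, which are managed by RegionServers. A RegionServer is a separate process, typically co-located with an HDFS DataNode rather than being part of HDFS itself, and RegionServers run on different nodes of the cluster, which distributes both storage and processing. Regions are delimited by row key order, so rows with similar keys are stored together, which reduces cross-region communication overhead. Region splitting is another key feature: when a region grows beyond a configured size threshold, HBase automatically splits it into smaller regions so that data stays evenly distributed across the cluster.
Original · 2024-08-14 22:02:44 · 593 views · 0 comments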
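The row-key-based partitioning described in this entry can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HBase's implementation: the split points in `REGION_START_KEYS` and the helper names `find_region` and `split_region` are hypothetical, and a real cluster tracks region boundaries in its own metadata.

```python
from bisect import bisect_right, insort

# Hypothetical start keys, kept sorted. Region i covers the half-open range
# [start_keys[i], start_keys[i + 1]); the last region has no upper bound.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]


def find_region(start_keys, row_key):
    """Return the index of the region whose key range contains row_key."""
    # Python compares bytes lexicographically, byte by byte.
    return bisect_right(start_keys, row_key) - 1


def split_region(start_keys, split_key):
    """Return a new start-key list with one region split at split_key."""
    if split_key in start_keys:
        raise ValueError("split key is already a region boundary")
    new_keys = list(start_keys)
    insort(new_keys, split_key)  # keeps the list sorted
    return new_keys
```

Because the boundaries are sorted, a lookup is a binary search, and adjacent row keys land in the same region until a split inserts a new boundary between them.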
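HBase is an open-source, distributed, scalable, high-capacity, highly reliable storage system used mainly for loosely structured (unstructured and semi-structured) data
It is a component of the Apache Hadoop ecosystem, designed to handle very large tables that can have billions of rows and millions of columns. The HBase data model is a sparse, distributed, persistent, multi-versioned, sorted map. From Java, HBase is usually accessed through the HBase Java API, which lets you create tables and insert, query, update, and delete data. ... an object used to manage HBase tables, for example to create, delete, and disable them.
Original · 2024-03-21 17:54:07 · 220 views · 0 comments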
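The "sparse, multi-versioned, sorted map" data model mentioned in this entry can be illustrated with a small in-memory sketch. The class `TinyTable` and its methods are hypothetical names invented for this illustration; the sketch ignores distribution and persistence and only models rows, columns, and timestamped versions.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> column -> versions sorted newest first."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        # Sparse: a row only stores the columns that were actually written.
        self._rows = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        versions = self._rows[row].setdefault(column, [])
        # A write with an existing timestamp replaces that version.
        versions[:] = [tv for tv in versions if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column):
        """Return the newest value of a cell, or None if it was never set."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start=b"", stop=None):
        """Yield (row, newest values) in row-key order over [start, stop)."""
        for row in sorted(self._rows):
            if row < start:
                continue
            if stop is not None and row >= stop:
                break
            yield row, {c: v[0][1] for c, v in self._rows[row].items()}
```

A scan returns rows in key order regardless of insertion order, which is the property that range queries rely on.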
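Pig is a data-flow language and execution environment for big-data processing, developed by the Apache Software Foundation
In addition, Pig Latin, the language used to write Pig scripts, supports user-defined functions and complex data transformations, so users can process data flexibly according to their own needs. The language takes its name from the English word game Pig Latin and is otherwise unrelated to it. In the word game, the first consonant of a word is moved to the end of the word and "ay" is appended. For example, to convert "hello", move the first consonant "h" to the end to get "elloh", then append "ay" to get "ellohay". The game makes English words harder to process and understand, which is also what makes it fun and creative.
Original · 2024-01-31 08:38:14 · 784 views · 0 comments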
Hive is a Hadoop-based data warehouse infrastructure that provides a SQL-like query language called HiveQL
Hive is a data warehouse tool built on Hadoop. Its SQL-like query language, HiveQL, lets users work with large-scale distributed data much as they would with a relational database. By mapping structured data onto Hadoop, Hive makes it convenient to store, query, and analyze data at scale.
Original · 2024-01-31 08:35:09 · 589 views · 0 comments
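YARN (Yet Another Resource Negotiator) is a key component of the Apache Hadoop ecosystem
Its main goal is to provide a scalable, efficient cluster resource management framework for large-scale data processing applications. The core idea of YARN is to divide cluster resources (such as memory and CPU) into containers and to allocate those containers to applications. A container encapsulates the resources an application needs in order to run. The ApplicationMaster supervises a single application: it communicates with the ResourceManager to obtain the resources it needs and monitors the application's execution. The ResourceManager is the main scheduler for the whole cluster: it receives resource requests from applications and allocates resources among them.
Original · 2024-01-31 08:31:57 · 1044 views · 0 comments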
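Queries in HBase are usually run from the HBase Shell or through a programming-language API (such as Java or Python)
With a programming-language API, you use the corresponding HBase client library to run queries. This is a simple Java code example that shows how to perform a single-row query with the HBase Java API. These examples cover only basic query operations; the HBase Shell also offers more advanced query features, such as filtering by timestamp and querying with regular expressions, and more complex queries can be built to suit actual requirements and the HBase data model.
Original · 2024-01-11 09:44:20 · 1083 views · 3 comments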
Time To Live (TTL)
ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time...
Repost · 2020-04-29 17:47:52 · 300 views · 0 comments
Keeping Deleted Cells
By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time ran...
Repost · 2020-04-29 17:47:42 · 233 views · 1 comment
Secondary Indexes and Alternate Query Paths
This section could also be titled “what if my table rowkey looks like this but I also want to query my table like that.” A common example on the dist-list is where a row-key is of the format “user-tim...
Repost · 2020-04-29 17:47:32 · 152 views · 0 comments
Schema Design Case Studies
The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, ...
Repost · 2020-04-29 17:47:23 · 111 views · 0 comments
Variable Length or Fixed Length Rowkeys?
It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is a and the event type is e1 then the resulting rowkey would be quite small. However, what if the ingeste...
Repost · 2020-04-29 17:47:14 · 176 views · 0 comments
Case Study - Log Data and Timeseries Data on Steroids
This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, an...
Repost · 2020-04-29 17:47:06 · 145 views · 0 comments
Case Study - Customer/Order
Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type. The Customer record type would include...
Repost · 2020-04-29 17:46:47 · 255 views · 0 comments
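In database design, choosing between a single table and multiple tables is an important decision
The choice depends on the application's requirements, the complexity of the data, and the expected performance. The article covers the advantages and disadvantages of single-table and multi-table designs and the scenarios each one suits. A single-table design stores all related data in one table and uses a field to distinguish between record types. A multi-table design spreads the data across several tables that are linked through foreign-key relationships. In practice, either design can be chosen, or the two can be combined, after weighing the specific requirements.
Original · 2020-04-29 17:46:39 · 404 views · 0 comments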
Order Object Design
Now we need to address how to model the Order object. Assume that the class structure is as follows: Order (an Order can have multiple ShippingLocations), LineItem (a ShippingLocation can have multiple...
Repost · 2020-04-29 17:46:29 · 248 views · 0 comments
Case Study - "Tall/Wide/Middle" Schema Design Smackdown
This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must c...
Repost · 2020-04-29 17:46:20 · 173 views · 0 comments
Supported Datatypes
HBase supports a “bytes-in/bytes-out” interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, o...
Repost · 2020-04-29 17:48:03 · 249 views · 0 comments
Maximum Number of Versions
The maximum number of row versions to store is configured per column family via HColumnDescriptor. The default for max versions is 1. This is an important parameter because as described in Data Model ...
Repost · 2020-04-29 17:48:12 · 258 views · 0 comments
Relationship Between RowKeys and Region Splits
If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. As an example of why this is important, consider the example of using displa...
Repost · 2020-04-29 17:48:19 · 164 views · 0 comments
Reverse Timestamps
Reverse Scan API: HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available...
Repost · 2020-04-29 17:48:25 · 378 views · 0 comments
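The excerpt is cut off before it reaches the technique named in the title, so the following is a sketch of the commonly described approach rather than the article's own code: append `Long.MAX_VALUE - timestamp` to the key so that an ordinary forward scan returns the newest entry first. The helper name `reverse_ts_key` and the prefix `user42#` are hypothetical.

```python
import struct

LONG_MAX = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def reverse_ts_key(prefix, timestamp_ms):
    """Build a row key whose byte order is newest-first within a prefix."""
    if not 0 <= timestamp_ms <= LONG_MAX:
        raise ValueError("timestamp out of range")
    reversed_ts = LONG_MAX - timestamp_ms
    # Big-endian, fixed width: byte-wise order matches numeric order
    # for non-negative values.
    return prefix + struct.pack(">q", reversed_ts)


keys = [reverse_ts_key(b"user42#", ts) for ts in (1000, 2000, 3000)]
newest_first = sorted(keys)  # what a forward scan over sorted keys would see
```

The fixed-width big-endian encoding matters: a decimal string of the reversed value would only sort correctly if it were zero-padded to a constant length.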
Byte Patterns
A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String (presuming a byte per character) you need nearly 3x t...
Repost · 2020-04-29 17:48:40 · 223 views · 0 comments
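The size difference in this excerpt is easy to check. The sketch below packs the largest unsigned 64-bit value as eight binary bytes and as ASCII digits; the function name `encoded_sizes` is a hypothetical helper for this illustration.

```python
import struct


def encoded_sizes(value):
    """Return (binary_bytes, text_bytes) for an unsigned 64-bit value."""
    as_long = struct.pack(">Q", value)     # fixed 8-byte big-endian encoding
    as_text = str(value).encode("ascii")   # one byte per decimal digit
    return len(as_long), len(as_text)


binary_size, text_size = encoded_sizes(18_446_744_073_709_551_615)
ratio = text_size / binary_size  # 20 digits versus 8 bytes
```

For this value the text form takes 20 bytes against 8, a factor of 2.5, which is the "nearly 3x" the excerpt refers to.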
Try to minimize row and column sizes
In HBase, values are always freighted with their coordinates; as a cell value passes through the system, it’ll be accompanied by its row, column name, and timestamp - always. If your rows and column n...
Repost · 2020-04-29 17:48:32 · 219 views · 0 comments
Monotonically Increasing Row Keys/Timeseries Data
In the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step with all cl...
Repost · 2020-04-29 17:48:45 · 281 views · 0 comments
Hotspotting
Rows in HBase are sorted lexicographically by row key. This design optimizes for scans, allowing you to store related rows, or rows that will be read together, near each other. However, poorly designe...
Repost · 2020-04-29 17:49:00 · 179 views · 0 comments
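One widely used way to avoid the hotspotting this entry introduces is salting: prefixing each row key with a small, deterministic bucket number so that sequential keys spread across several key ranges. The excerpt is truncated before it names any remedy, so this is a sketch of that general technique under stated assumptions; `salted_key`, the bucket count, and the use of CRC32 are illustrative choices, not something taken from the article.

```python
import zlib


def salted_key(row_key, buckets=4):
    """Prefix row_key with a one-byte bucket derived from the key itself."""
    if not 1 <= buckets <= 256:
        raise ValueError("buckets must fit in one byte")
    # Deterministic: the same row key always maps to the same bucket,
    # so a reader can recompute the prefix for a point lookup.
    salt = zlib.crc32(row_key) % buckets
    return bytes([salt]) + row_key


def scan_prefixes(buckets=4):
    """A range scan must be issued once per bucket prefix."""
    return [bytes([b]) for b in range(buckets)]
```

The trade-off is on the read side: point lookups stay cheap, but a scan over a contiguous range of original keys has to fan out over every bucket prefix and merge the results.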
On the number of column families
HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region...
Repost · 2020-04-29 17:48:52 · 130 views · 0 comments
Table Schema Rules Of Thumb
There are many different data sets, with different access patterns and service-level expectations. Therefore, these rules of thumb are only an overview. Read the rest of this chapter to get more detai...
Repost · 2020-04-29 17:49:06 · 202 views · 0 comments
Schema Creation
HBase schemas can be created or updated using the Apache HBase Shell or by using Admin in the Java API. Tables must be disabled when making ColumnFamily modifications, for example: Configuration c...
Repost · 2020-04-29 17:49:13 · 204 views · 0 comments
HBase and Schema Design
A good introduction to the strengths and weaknesses of modelling on the various non-RDBMS datastores is to be found in Ian Varley’s Master's thesis, No Relation: The Mixed Blessings of Non-Relational Databa...
Repost · 2020-04-29 17:49:19 · 216 views · 0 comments
Case Study - Customer/Order
Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type. The Customer record type would include...
Repost · 2020-04-29 17:32:01 · 329 views · 0 comments
Case Study - Log Data and Timeseries Data on Steroids
This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, an...
Repost · 2020-04-29 17:32:09 · 139 views · 0 comments
Case Study - Log Data and Timeseries Data
Assume that the following data elements are being collected: Hostname, Timestamp, Log event, Value/message. We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these...
Repost · 2020-04-29 17:32:15 · 269 views · 0 comments
Schema Design Case Studies
The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, ...
Repost · 2020-04-29 17:32:23 · 148 views · 0 comments
Filter Query
Depending on the case, it may be appropriate to use Client Request Filters. In this case, no secondary index is created. However, don’t try a full-scan on a large table like this from an application (...
Repost · 2020-04-29 17:32:30 · 161 views · 0 comments
Secondary Indexes and Alternate Query Paths
This section could also be titled “what if my table rowkey looks like this but I also want to query my table like that.” A common example on the dist-list is where a row-key is of the format “user-tim...
Repost · 2020-04-29 17:32:41 · 141 views · 0 comments
Keeping Deleted Cells
By default, delete markers extend back to the beginning of time. Therefore, Get or Scan operations will not see a deleted cell (row or column), even when the Get or Scan operation indicates a time ran...
Repost · 2020-04-29 17:32:50 · 341 views · 0 comments
Time To Live (TTL)
ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time...
Repost · 2020-04-29 17:32:58 · 443 views · 0 comments
Supported Datatypes
HBase supports a “bytes-in/bytes-out” interface via Put and Result, so anything that can be converted to an array of bytes can be stored as a value. Input could be strings, numbers, complex objects, o...
Repost · 2020-04-29 17:33:05 · 126 views · 0 comments
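In the Hadoop ecosystem, data model design is critical to storing and retrieving data efficiently
Schema design is a key part of handling large-scale data in Hadoop. A sound schema can significantly improve storage and query efficiency. The case studies and best practices above offer practical guidance for Hadoop schema design. For more detailed information, see ...
Original · 2020-04-29 17:33:22 · 315 views · 0 comments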
Secondary Indexes and Alternate Query Paths
This section could also be titled “what if my table rowkey looks like this but I also want to query my table like that.” A common example on the dist-list is where a row-key is of the format “user-tim...
Repost · 2020-04-29 17:33:30 · 199 views · 0 comments
Joins
If you have multiple tables, don’t forget to factor in the potential for Joins into the schema design. Time To Live (TTL): ColumnFamilies can set a TTL length in seconds, and HBase will automatical...
Repost · 2020-04-29 17:33:40 · 206 views · 0 comments