HBase Bulk Loading

宝哥大数据

Article link: https://blog.csdn.net/wuxintdrh/article/details/69489801

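Reference: the official guide at http://hbase.apache.org/book.html#arch.bulk.load
HBase offers several ways to load data into a table.
The most straightforward are:

using the TableOutputFormat class from a MapReduce job, or
using the normal client API.

However, these are not always the most efficient methods.

The bulk load feature uses a MapReduce job to write table data in HBase's internal data format, and then loads the generated StoreFiles directly into a running cluster. Compared with going through the HBase API, bulk loading uses less CPU and network resources.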

2. Bulk Load Limitations

As bulk loading bypasses the write path, the WAL doesn’t get written to as part of the process. Replication works by reading the WAL files so it won’t see the bulk loaded data – and the same goes for the edits that use `Put.setDurability(SKIP_WAL)`. One way to handle that is to ship the raw files or the HFiles to the other cluster and do the other processing there.

3. Bulk Load Architecture

The HBase bulk load process consists of two main steps.

3.1. Preparing data via a MapReduce job

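The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat2. This output format writes data in HBase's internal storage format so that it can later be loaded into the cluster very efficiently.
To work efficiently, HFileOutputFormat2 must be configured so that each output HFile fits within a single region. To achieve this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.
HFileOutputFormat2 includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the current region boundaries of a table.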
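
To make the partitioning idea concrete, here is a minimal sketch in plain Java of how row keys can be routed to disjoint key ranges that mirror region boundaries. It is not the Hadoop TotalOrderPartitioner itself: the class RegionRangePartitioner and its methods are hypothetical names, and String comparison stands in for the byte-wise row key comparison HBase actually uses.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class; not part of HBase or Hadoop.
class RegionRangePartitioner {
    // Maps each region's start key to its partition index.
    private final TreeMap<String, Integer> startKeyToPartition = new TreeMap<>();

    // splitPoints are the sorted start keys of every region except the first,
    // whose start key is the empty string.
    RegionRangePartitioner(List<String> splitPoints) {
        startKeyToPartition.put("", 0);
        int index = 1;
        for (String splitPoint : splitPoints) {
            startKeyToPartition.put(splitPoint, index++);
        }
    }

    // Returns the partition whose key range [startKey, nextStartKey) contains rowKey.
    int getPartition(String rowKey) {
        return startKeyToPartition.floorEntry(rowKey).getValue();
    }

    // One partition (and therefore one reducer) per region.
    int numPartitions() {
        return startKeyToPartition.size();
    }

    public static void main(String[] args) {
        RegionRangePartitioner partitioner =
                new RegionRangePartitioner(List.of("g", "n", "t"));
        for (String key : List.of("apple", "g", "melon", "zebra")) {
            System.out.println(key + " -> partition " + partitioner.getPartition(key));
        }
    }
}
```

With the assumed split points g, n and t, the keys apple, g, melon and zebra land in partitions 0, 1, 1 and 3. Because every reducer receives exactly one region's key range, each output file can fit within a single region.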

3.2. Completing the data load

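After a data import has been prepared, either by using the importtsv tool with the "importtsv.bulk.output" option or by some other MapReduce job using HFileOutputFormat, the completebulkload tool is used to import the data into the running cluster. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate RegionServer, which adopts the HFile, moves it into its storage directory, and makes the data available to clients.

If the region boundaries have changed during bulk load preparation, or between the preparation and completion steps, the completebulkload utility automatically splits the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through other means.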
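
The splitting behaviour described above can be pictured with a small plain-Java sketch. It only illustrates the idea of cutting a sorted file at a new region boundary; BoundarySplitSketch and splitAt are hypothetical names, a list of row keys stands in for an HFile, and this is not how the completebulkload tool is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of HBase.
class BoundarySplitSketch {
    // Splits sorted row keys into a bottom piece (keys < splitKey) and a top
    // piece (keys >= splitKey). A region covers [startKey, endKey), so the
    // split key itself belongs to the top piece.
    static List<List<String>> splitAt(List<String> sortedRowKeys, String splitKey) {
        List<String> bottom = new ArrayList<>();
        List<String> top = new ArrayList<>();
        for (String rowKey : sortedRowKeys) {
            if (rowKey.compareTo(splitKey) < 0) {
                bottom.add(rowKey);
            } else {
                top.add(rowKey);
            }
        }
        return List.of(bottom, top);
    }

    public static void main(String[] args) {
        // A file prepared for one region that has since been split at "m".
        List<String> preparedFile = List.of("a", "c", "m", "q", "x");
        System.out.println(splitAt(preparedFile, "m"));
    }
}
```

Every key in the prepared file has to be read and rewritten into one of the two pieces, which is why loading soon after preparation, before regions have a chance to split, avoids extra work.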

$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

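The -c config-file option can be used to specify a file containing the appropriate hbase parameters (e.g., hbase-site.xml) if they are not already supplied on the CLASSPATH. In addition, the CLASSPATH must contain the directory that holds the zookeeper configuration file if zookeeper is not managed by HBase.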

Note: If the target table does not already exist in HBase, this tool will create the table automatically.

For more information about the referenced utilities, see ImportTsv and CompleteBulkLoad.

See How-to: Use HBase Bulk Loading, and Why for a recent blog on current state of bulk loading.

4. How-to: Use HBase Bulk Loading, and Why

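Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.

This blog post introduces the basic concepts of the bulk loading feature, presents two use cases, and proposes two examples.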

4.1. Overview of Bulk Loading

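If you have any of these symptoms, bulk loading is probably the right choice for you:

You needed to tweak your MemStores to use most of the memory.
You needed to either use bigger WALs or bypass them entirely.
Your compaction and flush queues are in the hundreds.
Your GC is out of control because your inserts range in the MBs.
Your latency goes out of your SLA when you import data.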

Most of those symptoms are commonly referred to as “growing pains.” Using bulk loading can help you avoid them.

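In HBase, bulk loading is the process of preparing and loading HFiles (HBase's own file format) directly into the RegionServers, thus bypassing the write path and avoiding those issues entirely. This process is similar to ETL and looks like this:

1. Extract the data from a source, typically text files or another database. HBase does not manage this part of the process. In other words, you cannot tell HBase to prepare HFiles by reading directly from MySQL; you have to do it by your own means. For example, you could run mysqldump on a table and upload the resulting files to HDFS, or just grab your Apache HTTP log files. In any case, your data needs to be in HDFS before the next step.

2. Transform the data into HFiles. This step requires a MapReduce job, and for most input types you will have to write the Mapper yourself. The job will need to emit the row key as the Key, and either a KeyValue, a Put, or a Delete as the Value. The Reducer is handled by HBase; you configure it using HFileOutputFormat.configureIncrementalLoad() and it does the following: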

Inspects the table to configure a total order partitioner
Uploads the partitions file to the cluster and adds it to the DistributedCache
Sets the number of reduce tasks to match the current number of regions
Sets the output key/value class to match HFileOutputFormat’s requirements
Sets the reducer up to perform the appropriate sorting (either KeyValueSortReducer or PutSortReducer)
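
At this stage, one HFile is created per region in the output folder. Keep in mind that the input data is almost completely rewritten, so you will need available disk space of at least twice the size of the original data set. For example, for a 100GB mysqldump you should have at least 200GB of available disk space in HDFS. You can delete the dump file at the end of the process.
In our own run, one HFile was created per region in the output folder: we arranged for every two minutes' worth of files to fall into one region, and the output directory held 30 regions in total.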
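
As a rough picture of what the Mapper in step 2 produces, the following plain-Java sketch turns one tab-separated input line into a row key plus its cells, ordered by column name in the way the sorting reducer needs them. It does not use the HBase API: TsvToCellsSketch, the column names cf:status and cf:bytes, and the sample line are all assumptions made for illustration, and a sorted map stands in for a Put.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; a TreeMap stands in for an HBase Put.
class TsvToCellsSketch {
    // Assumed column names for the fields that follow the row key.
    static final String[] COLUMNS = {"cf:status", "cf:bytes"};

    // The first tab-separated field is used as the row key (the map output key).
    static String rowKey(String line) {
        return line.split("\t", -1)[0];
    }

    // The remaining fields become cells, sorted by column name
    // (the map output value).
    static Map<String, String> cells(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != COLUMNS.length + 1) {
            throw new IllegalArgumentException("unexpected field count: " + fields.length);
        }
        Map<String, String> sortedCells = new TreeMap<>();
        for (int i = 0; i < COLUMNS.length; i++) {
            sortedCells.put(COLUMNS[i], fields[i + 1]);
        }
        return sortedCells;
    }

    public static void main(String[] args) {
        String line = "row1\t200\t512";
        System.out.println(rowKey(line) + " -> " + cells(line));
    }
}
```

In a real job the Mapper would emit the row key and a Put built from these cells, and the reducer configured by configureIncrementalLoad() would take care of the final ordering.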

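3. Load the files into HBase by telling the RegionServers where to find them. This is the easiest step. It requires using LoadIncrementalHFiles (more commonly known as the completebulkload tool); given a URL that locates the files in HDFS, it loads each file into the relevant region via the RegionServer that serves it. If a region was split after the files were created, the tool automatically splits the HFile according to the new boundaries. This process is not very efficient, so if your table is currently being written to by other processes, it is best to load the files as soon as the transform step is done.

Here is an illustration of this process. The data flows from the original source to HDFS, where the RegionServers simply move the files to their regions' directories.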

[Figure: data flow of the bulk load process]

5. Use Cases

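Original dataset load: All users migrating from another datastore should consider this use case. First, you have to go through the exercise of designing the table schema and then create the table itself, pre-split. The split points have to take into account the row key distribution and the number of RegionServers. I recommend reading my colleague Lars George's presentation on advanced schema design for any serious use case.

The advantage here is that writing the files directly is much faster than going through the RegionServer's write path (writing to both the MemStore and the WAL) and then eventually flushing, compacting, and so on. It also means you don't have to tune your cluster for a write-heavy workload and then tune it again for your normal workload.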
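
For the pre-splitting mentioned in this use case, a small plain-Java sketch can show one way split points might be derived. It assumes that row keys start with a uniformly distributed two-digit hexadecimal prefix (for example, from a hash), which is an assumption of this example and not something the original text prescribes; PreSplitSketch and hexSplitPoints are hypothetical names, and the region count would typically be chosen with the number of RegionServers in mind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of HBase.
class PreSplitSketch {
    // Returns numRegions - 1 evenly spaced split points over the two-digit
    // hex prefixes 00..ff, assuming keys are uniformly distributed over them.
    static List<String> hexSplitPoints(int numRegions) {
        if (numRegions < 1 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be between 1 and 256");
        }
        List<String> splitPoints = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            splitPoints.add(String.format("%02x", i * 256 / numRegions));
        }
        return splitPoints;
    }

    public static void main(String[] args) {
        // Four regions need three split points.
        System.out.println(hexSplitPoints(4));
    }
}
```

For four regions this yields the split points 40, 80 and c0. If the real key distribution is skewed, evenly spaced points like these would leave some regions much larger than others, which is why the row key distribution matters when choosing them.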

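Incremental load: Let's say that you have some dataset currently being served by HBase, but now you need to import more data in batch from a third party, or you have a nightly job that generates a few gigabytes that you need to insert. It's probably not as large as the dataset that HBase is already serving, but it might affect your latency's 95th percentile. Going through the normal write path has the adverse effect of triggering more flushes and compactions during the import than normal. This additional IO stress will compete with your latency-sensitive queries.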
