HBase Bulkload

License: CC 4.0 BY-SA

Tags: #Hbase

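HBase bulk load is an efficient way to import data. It has two steps: preparing the data with a MapReduce job, and completing the load with completebulkload. A MapReduce job first uses HFileOutputFormat to write store files in HBase's internal format; the completebulkload tool then loads those files into a running cluster, which reduces CPU and network usage. The ImportTsv tool can import TSV data and prepare StoreFiles for a bulk load.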

Bulkload

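HBase offers several ways to load data into a table. The most straightforward are to write from a MapReduce job or to use the normal client API, but neither is particularly efficient.
The bulk load feature uses a MapReduce job to write table data in HBase's internal data format, and the resulting store files can then be loaded directly into a running cluster. Bulk loading uses less CPU and network resources than simply going through the HBase API.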

Bulk load architecture
The HBase bulk load process consists of two main steps.

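Preparing data via a MapReduce job
The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using HFileOutputFormat. This output format writes data in HBase's internal storage format, so that it can later be loaded into the cluster very efficiently.
To work efficiently, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table. HFileOutputFormat includes a convenience function, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.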
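The partitioning idea can be illustrated without a cluster. The following is a minimal sketch in plain Java, not HBase's or Hadoop's actual implementation: the class RegionPartitionSketch, its methods, and the split points "g" and "p" are all hypothetical names and values chosen for illustration. It assumes region boundaries are given as sorted start keys and that row keys are compared as unsigned bytes in lexicographic order.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: mimics how a total-order partitioner
// routes each row key to the partition that matches one region's key range.
public class RegionPartitionSketch {

    // splitPoints holds the sorted start keys of regions 1..n;
    // region 0 implicitly starts at the empty key.
    // Returns the index of the region whose range [start, nextStart) contains rowKey.
    public static int regionFor(byte[] rowKey, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            // Unsigned lexicographic comparison of the raw bytes.
            if (Arrays.compareUnsigned(rowKey, splitPoints[mid]) >= 0) {
                lo = mid + 1;   // key is at or past this boundary
            } else {
                hi = mid;       // key is before this boundary
            }
        }
        return lo;              // number of boundaries <= rowKey
    }

    // Groups row keys into one bucket per region, so each bucket could be
    // written out as a file that fits inside a single region.
    public static List<List<String>> partition(List<String> rowKeys, String[] splits) {
        byte[][] splitBytes = new byte[splits.length][];
        for (int i = 0; i < splits.length; i++) {
            splitBytes[i] = splits[i].getBytes(StandardCharsets.UTF_8);
        }
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i <= splits.length; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : rowKeys) {
            int region = regionFor(key.getBytes(StandardCharsets.UTF_8), splitBytes);
            buckets.get(region).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[] splits = {"g", "p"};   // assumed boundaries: three regions
        List<String> keys = List.of("apple", "grape", "kiwi", "peach", "zebra", "g");
        // prints [[apple], [grape, kiwi, g], [peach, zebra]]
        System.out.println(partition(keys, splits));
    }
}
```

A key equal to a boundary lands in the region that starts at that boundary, because a region's start key is inclusive. In a real job this routing is handled by TotalOrderPartitioner once configureIncrementalLoad() has set it up, so code like this would not normally be written by hand.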

Completing the data load
After the data has been prepared using HFileOutputFormat, it is loaded into the cluster using completebulkload. This command-line tool iterates through the prepared data files and, for each one, determines the region the file belongs to. It then contacts the appropriate Region Server, which adopts the HFile, moving it into its storage directory and making the data available to clients.
If the region boundaries have changed during the course of bulk load preparation, or between the preparation and completion steps, the completebulkload utility will automatically split the data files into pieces corresponding to the new boundaries. This process is not optimally efficient, so users should take care to minimize the delay between preparing a bulk load and importing it into the cluster, especially if other clients are simultaneously loading data through o