HBase Write Path, Read Path, and Compaction

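HBase is an open-source, distributed, scalable big-data store. It is part of the Apache Hadoop ecosystem and is used to store sparse unstructured and semi-structured data. HBase is modeled on Google's Bigtable and provides highly reliable, high-performance, column-oriented storage with real-time read and write access.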



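Write path

  1. Client request: The client does not send writes through the HBase Master. It finds the RegionServer that hosts the target Region by looking up the hbase:meta table (whose location it obtains from ZooKeeper), caches that location, and sends the write RPC (remote procedure call) directly to that RegionServer.

  2. Region location: The Master assigns Regions to RegionServers and balances load, but it is not on the data path and does not forward requests. If a Region has moved, the client's cached location is stale; the client refreshes it from hbase:meta and retries.

  3. WAL write: To make the write durable, the RegionServer appends the edit to the WAL (Write-Ahead Log). The WAL is HBase's failure-recovery mechanism: if a RegionServer crashes, the edits that had not yet been flushed are replayed from the WAL. In recent HBase versions the WAL append happens before the MemStore update.

  4. MemStore write: The RegionServer then applies the edit to the MemStore of the target Region. The MemStore is a sorted in-memory write buffer (one per column family per Region) that keeps cells ordered by RowKey.

  5. Flush: When a MemStore reaches the configured limit (hbase.hregion.memstore.flush.size, 128 MB by default), the RegionServer flushes it to HDFS (Hadoop Distributed File System) as a new HFile. The flush runs in the background so that it does not block incoming writes.

  6. Client acknowledgment: Once the edit is in both the WAL and the MemStore, the RegionServer acknowledges the write to the client. The acknowledgment does not wait for the data to be flushed to HDFS.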
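
The write steps above can be sketched with a toy model. This is only an illustration of the ordering (log, then memory, then flush), not HBase's implementation: the class `ToyRegion`, its methods, and the cell-count threshold are all hypothetical, and the flush is triggered by the caller rather than by a background thread.

```python
class ToyRegion:
    """Toy model of one column family in one Region (hypothetical, not an HBase class)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log of edits, in arrival order
        self.memstore = {}   # in-memory buffer: row key -> value
        self.hfiles = []     # flushed files, oldest first, each sorted by row key
        self.flush_threshold = flush_threshold  # assumed limit, counted in cells

    def put(self, row_key, value):
        # Record the edit in the log first so it survives a crash.
        self.wal.append((row_key, value))
        # Then apply it to the in-memory buffer.
        self.memstore[row_key] = value
        # Acknowledge without waiting for any flush.
        return True

    def needs_flush(self):
        return len(self.memstore) >= self.flush_threshold

    def flush(self):
        # Write the buffer out as a file sorted by row key, then start fresh.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        # Simplification: flushed edits are dropped from the log all at once.
        self.wal = []

    def recover(self):
        # After a crash, rebuild the in-memory buffer by replaying the log.
        self.memstore = {}
        for row_key, value in self.wal:
            self.memstore[row_key] = value


region = ToyRegion(flush_threshold=3)
for key, value in [("row3", "c"), ("row1", "a"), ("row2", "b"), ("row9", "z")]:
    region.put(key, value)
    if region.needs_flush():
        region.flush()

region.memstore = {}  # simulate a crash that loses memory
region.recover()      # the unflushed edit comes back from the log
print(region.hfiles)
print(region.memstore)
```

The three rows written before the flush end up in one sorted file, while the fourth row exists only in the log and the buffer, which is why replaying the log is enough to restore it.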


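Read path

  1. Client request: As with writes, the client does not send reads through the HBase Master. It looks up the RegionServer hosting the target Region in hbase:meta (using its cached location when available) and sends the read RPC directly to that RegionServer.

  2. Region location: The Master only manages Region assignment; it does not forward read requests. If the cached location is stale because the Region has moved, the client refreshes it and retries.

  3. MemStore lookup: The RegionServer checks the MemStore of the target Region. This step is required because the most recent writes may exist only in the MemStore and not yet in any HFile.

  4. HFile lookup: The RegionServer also consults the HFiles on HDFS, serving blocks from the BlockCache when they are cached, and merges what it finds with the MemStore contents so that the newest version of each cell wins. Because each HFile is sorted by RowKey and carries a block index (and optionally a Bloom filter), HBase can skip files that cannot contain the row and locate the target data efficiently.

  5. Return data: The RegionServer returns the merged result to the client.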
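
The lookup-and-merge part of the read steps can be sketched as follows. This is a simplified illustration under stated assumptions, not HBase's scanner code: `make_hfile`, `lookup`, and `read_latest` are hypothetical helpers, each toy file holds at most one version per row, and the binary search stands in for the block index.

```python
import bisect


def make_hfile(cells):
    """Build a toy HFile: (row, timestamp, value) cells sorted by row, plus a key index."""
    ordered = sorted(cells)
    return {"keys": [row for row, _, _ in ordered], "cells": ordered}


def lookup(hfile, row_key):
    """Binary-search one toy HFile; return (timestamp, value) or None."""
    i = bisect.bisect_left(hfile["keys"], row_key)
    if i < len(hfile["keys"]) and hfile["keys"][i] == row_key:
        _, ts, value = hfile["cells"][i]
        return ts, value
    return None


def read_latest(row_key, memstore, hfiles):
    """Collect candidates from the MemStore and every HFile; the newest timestamp wins."""
    candidates = []
    if row_key in memstore:
        candidates.append(memstore[row_key])
    for hfile in hfiles:
        hit = lookup(hfile, row_key)
        if hit is not None:
            candidates.append(hit)
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])[1]


# MemStore maps row key -> (timestamp, value); the HFiles are older flushes.
memstore = {"row2": (30, "b-new")}
hfiles = [
    make_hfile([("row1", 10, "a"), ("row2", 10, "b-old")]),
    make_hfile([("row1", 20, "a2"), ("row3", 20, "c")]),
]

for key in ["row1", "row2", "row3", "row4"]:
    print(key, read_latest(key, memstore, hfiles))
```

Note that "row1" is absent from the MemStore and present in two files, so the result depends on comparing versions across files rather than on stopping at the first hit.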



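Compaction

  1. Select files to compact: HBase chooses which HFiles to merge according to a compaction policy that considers factors such as file size and age. A minor compaction merges a subset of a Store's HFiles; a major compaction merges all of them into a single HFile.

  2. Create a new HFile: The RegionServer creates a new HFile to hold the merged data.

  3. Merge data: The RegionServer reads the selected HFiles and writes their cells into the new HFile in RowKey order. During the merge, HBase drops versions beyond the configured maximum and cells whose TTL has expired. Delete markers, together with the cells they mask, are removed during a major compaction.

  4. Replace old files: When the merge completes, the new HFile takes the place of the HFiles it was built from. The Store's list of active files is updated so that subsequent reads use the new file.

  5. Delete old files: The replaced HFiles are no longer used for reads. They are moved aside (archived) and removed later by a background cleanup process.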
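
The merge step of a compaction can be sketched as a streaming merge of already-sorted files. This is a toy illustration, not HBase's compactor: the function `compact` and its parameters are hypothetical, a value of `None` stands in for a delete marker, timestamps and TTL are plain integers, and only one version per row is kept.

```python
import heapq


def compact(hfiles, now, ttl, major=False):
    """Merge sorted toy HFiles of (row, timestamp, value) cells into one sorted list.

    Keeps the newest version of each row, drops cells older than ttl, and,
    for a major compaction, drops delete markers (value is None).
    """
    # Each input is sorted by row; merge so the newest version of a row comes first.
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    output = []
    last_row = None
    for row, ts, value in merged:
        if row == last_row:
            continue  # an older version of a row that was already handled
        last_row = row
        if now - ts > ttl:
            continue  # expired cell
        if value is None and major:
            continue  # delete marker: dropped along with the cells it masks
        output.append((row, ts, value))
    return output


older = [("row1", 100, "a"), ("row2", 100, "b"), ("row3", 100, "c")]
newer = [("row2", 200, "b2"), ("row3", 200, None), ("row4", 10, "stale")]

minor_result = compact([older, newer], now=300, ttl=250, major=False)
major_result = compact([older, newer], now=300, ttl=250, major=True)
print(minor_result)
print(major_result)
```

In this sketch the minor result still carries the delete marker for "row3" because files outside the selection might hold cells it masks, while the major result, which sees every file, can drop it.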










