Working with External Data Sources in Spark: The HBase Data Source

Latest recommended article published 2021-05-18 11:35:08

大数据老人家i

Category: Spark. Tags: spark, hadoop

Original link: https://blog.csdn.net/lz6363/article/details/109016389

Contents

HBase Sink (writing)
HBase Source (reading)

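Overview
Spark can read data from and write data to HBase tables. Under the hood it relies on TableInputFormat and TableOutputFormat, exactly as MapReduce does when it is integrated with HBase: reads go through an InputFormat and writes go through an OutputFormat.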

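HBase Sink (writing)

Overview
Write the results computed in Spark to HBase.

Note:

When MapReduce writes to an HBase table it uses TableReducer, whose OutputFormat is TableOutputFormat; the output records have Key ImmutableBytesWritable and Value Put.
To write from Spark, convert the RDD to type RDD[(ImmutableBytesWritable, Put)] and then call saveAsNewAPIHadoopFile to save the data to the HBase table.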

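Example
Requirement: save the word-count results to an HBase table.

The HBase client needs the ZooKeeper address it depends on and the name of the target table. Both are passed as property values set on a Configuration object.
HBase table design:

Table name: htb_wordcount
Rowkey: word
Column family: info
Column: count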
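TableOutputFormat writes into a table that already exists, so htb_wordcount with the info column family has to be created before the Spark job runs. The following is a minimal sketch of doing that from Scala rather than from the HBase shell. It assumes the HBase 2.x client API (TableDescriptorBuilder and ColumnFamilyDescriptorBuilder); the object name CreateWordCountTable is hypothetical, and the ZooKeeper settings are the ones used elsewhere in this article.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}

// Hypothetical helper, assuming an HBase 2.x client on the classpath
object CreateWordCountTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "node1.itcast.cn")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("zookeeper.znode.parent", "/hbase")

    val connection = ConnectionFactory.createConnection(conf)
    try {
      val admin = connection.getAdmin
      try {
        val tableName = TableName.valueOf("htb_wordcount")
        // Create the table only when it is missing, with a single family "info"
        if (!admin.tableExists(tableName)) {
          val descriptor = TableDescriptorBuilder
            .newBuilder(tableName)
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
            .build()
          admin.createTable(descriptor)
        }
      } finally admin.close()
    } finally connection.close()
  }
}
```

On an HBase 1.x client the descriptor classes differ, so this sketch would need to be adapted there.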

Code

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object Write {
  def main(args: Array[String]): Unit = {
    // Step 0: start HBase
    // Step 1: create the SparkContext
    val sparkConf: SparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      .setMaster("local[2]")
    val sc: SparkContext = new SparkContext(sparkConf)

    // Step 2: build an RDD that simulates the computed word counts
    val list = List(("hadoop", 2342), ("hive", 1213), ("Spark", 12134))
    val resultRDD: RDD[(String, Int)] = sc.parallelize(list)

    // Step 3: write to HBase with saveAsNewAPIHadoopFile (requires a key/value RDD)
    /**
     * HBase table design:
     * Table name: htb_wordcount
     * Rowkey: word
     * Column family: info
     * Column: count
     */

    val putsRDD: RDD[(ImmutableBytesWritable, Put)] = resultRDD.mapPartitions(iter => {
      iter.map { case (word, count) =>
        // 3.1 Create the Put, using the word as the rowkey
        val put = new Put(Bytes.toBytes(word))
        // 3.2 Add the column (family, qualifier, value); the count is stored as its string form
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("count"), Bytes.toBytes(count.toString))
        // 3.3 Return the pair so the RDD has type RDD[(ImmutableBytesWritable, Put)]
        (new ImmutableBytesWritable(put.getRow), put)
      }
    })

    // Build the HBase client configuration
    val conf: Configuration = HBaseConfiguration.create()
    // ZooKeeper connection properties
    conf.set("hbase.zookeeper.quorum", "node1.itcast.cn")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("zookeeper.znode.parent", "/hbase")
    // Name of the HBase table to write to
    conf.set(TableOutputFormat.OUTPUT_TABLE, "htb_wordcount")

    /*
    def saveAsNewAPIHadoopFile(
    path: String, // output path
    keyClass: Class[_], // key type
    valueClass: Class[_], // value type
    outputFormatClass: Class[_ <: NewOutputFormat[_, _]], // OutputFormat implementation
    conf: Configuration = self.context.hadoopConfiguration // configuration
    ): Unit
    */
    // Save the RDD to HBase. The path argument is required by the signature;
    // TableOutputFormat writes the rows to the table named in conf.
    putsRDD.saveAsNewAPIHadoopFile(
      "datas/spark/htb-output-" + System.nanoTime(),
      classOf[ImmutableBytesWritable],
      classOf[Put],
      classOf[TableOutputFormat[ImmutableBytesWritable]],
      conf
    )

    // The application is done; release resources
    sc.stop()
  }
}
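The call above has to pass a file path even though the rows end up in an HBase table. A commonly used alternative is saveAsNewAPIHadoopDataset, which takes only a Configuration that carries the output format and key/value classes. The sketch below is one way to set that up, not the article's original method; the object name WriteWithDataset and the helper toPut are hypothetical, while the table layout and host follow the example above.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; a sketch of the path-free variant
object WriteWithDataset {
  // Turn one (word, count) pair into the key/value pair TableOutputFormat expects
  def toPut(word: String, count: Int): (ImmutableBytesWritable, Put) = {
    val put = new Put(Bytes.toBytes(word))
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("count"), Bytes.toBytes(count.toString))
    (new ImmutableBytesWritable(put.getRow), put)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("WriteWithDataset").setMaster("local[2]"))
    try {
      val conf = HBaseConfiguration.create()
      conf.set("hbase.zookeeper.quorum", "node1.itcast.cn")
      conf.set("hbase.zookeeper.property.clientPort", "2181")
      conf.set("zookeeper.znode.parent", "/hbase")
      conf.set(TableOutputFormat.OUTPUT_TABLE, "htb_wordcount")

      // The Job is used only as a typed builder for the output settings;
      // Job.getInstance copies conf, so read the result back via getConfiguration
      val job = Job.getInstance(conf)
      job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
      job.setOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setOutputValueClass(classOf[Put])

      sc.parallelize(List(("hadoop", 2342), ("hive", 1213), ("Spark", 12134)))
        .map { case (word, count) => toPut(word, count) }
        .saveAsNewAPIHadoopDataset(job.getConfiguration)
    } finally sc.stop()
  }
}
```

Some Spark releases have been reported to insist on an output directory even for table-backed output formats. If that happens, the path-based saveAsNewAPIHadoopFile call remains a workable fallback.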

The written rows can then be checked in HBase, for example by running scan 'htb_wordcount' in the HBase shell.
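The write code stores each count with Bytes.toBytes(value.toString), so the cell holds the UTF-8 digits of the number and is readable when the table is inspected. Bytes.toBytes applied to an Int would instead store four big-endian bytes. The sketch below reproduces both layouts with the JDK only, so it runs without HBase; the object CountEncoding and its methods are hypothetical names used for illustration.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical illustration of the two byte layouts, using only the JDK
object CountEncoding {
  // Same bytes as Bytes.toBytes(count.toString): the UTF-8 digits of the number
  def asStringBytes(count: Int): Array[Byte] =
    count.toString.getBytes(StandardCharsets.UTF_8)

  // Same layout as Bytes.toBytes(count) for an Int: 4 bytes, big-endian
  def asIntBytes(count: Int): Array[Byte] =
    ByteBuffer.allocate(4).putInt(count).array()

  // Render bytes as space-separated two-digit hex values
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  def main(args: Array[String]): Unit = {
    println(hex(asStringBytes(2342))) // prints 32 33 34 32
    println(hex(asIntBytes(2342)))    // prints 00 00 09 26
  }
}
```

Whichever layout is written, the reader has to decode the cell the same way: a value stored as a string must be read back with Bytes.toString and then parsed.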

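HBase Source (reading)

When MapReduce reads data from an HBase table it uses TableMapper, whose InputFormat is TableInputFormat. Spark reads the same data as records with Key ImmutableBytesWritable and Value Result.

Reading from an HBase table likewise requires the ZooKeeper address and the table name, both set as properties on a Configuration object; the table name goes under TableInputFormat.INPUT_TABLE.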
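In addition, the data that is read is wrapped in an RDD whose Key and Value types are ImmutableBytesWritable and Result. Neither type supports Java serialization, so processing the data can fail with a serialization exception. Configure the Spark application to use Kryo serialization, which also performs better than Java serialization, by setting spark.serializer on the SparkConf object.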
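A minimal sketch of such a SparkConf follows. Beyond switching the serializer, it also registers the two HBase classes with Kryo through SparkConf.registerKryoClasses, which is optional and an addition made here for illustration; the object name KryoConf and the method build are hypothetical.

```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.spark.SparkConf

// Hypothetical helper that builds a Kryo-enabled SparkConf for HBase reads
object KryoConf {
  def build(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .setMaster("local[2]")
      // Replace the default Java serializer with Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: register the classes that will travel between executors and driver
      .registerKryoClasses(Array(classOf[ImmutableBytesWritable], classOf[Result]))
}
```

Registration is not needed for correctness; it mainly lets Kryo avoid writing full class names alongside each serialized object.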
Example
Requirement: read the word-count results back from the HBase table.

Code

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration}
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Read {
  def main(args: Array[String]): Unit = {
    // Step 1: create the SparkContext
    val sparkConf: SparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName.stripSuffix("$"))
      .setMaster("local[2]")
      // Use Kryo serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc: SparkContext = new SparkContext(sparkConf)

    // Step 2: build the HBase client configuration
    val conf: Configuration = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "node1.itcast.cn")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("zookeeper.znode.parent", "/hbase")
    // Name of the table to read
    conf.set(TableInputFormat.INPUT_TABLE, "htb_wordcount")

    /*
      def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
        conf: Configuration = hadoopConfiguration,
        fClass: Class[F],
        kClass: Class[K],
        vClass: Class[V]
      ): RDD[(K, V)]
    */

    val resultRDD: RDD[(ImmutableBytesWritable, Result)] = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result]
    )
    // Total number of rows
    println(s"Total rows = ${resultRDD.count()}")

    // Show the first five rows
    resultRDD
      .take(5)
      .foreach { case (_, result) =>
        // Rowkey
        println(s"RowKey = ${Bytes.toString(result.getRow)}")
        // Each HBase row is wrapped in a Result; walk its cells to get every column value
        result.rawCells().foreach { cell =>
          val cf = Bytes.toString(CellUtil.cloneFamily(cell))
          val column = Bytes.toString(CellUtil.cloneQualifier(cell))
          val value = Bytes.toString(CellUtil.cloneValue(cell))
          val version = cell.getTimestamp
          println(s"\t $cf:$column = $value, version = $version")
        }
      }

    // The application is done; release resources
    sc.stop()
  }
}
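For further processing it is usually more convenient to turn each Result into plain Scala values on the executors, so that later stages no longer carry HBase types at all. The sketch below decodes the info:count column of each row back into a (word, count) pair. It assumes the count was written as a decimal string, as in the write example; the object WordCountRows and its methods are hypothetical names.

```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical helper for decoding rows of htb_wordcount
object WordCountRows {
  private val Family = Bytes.toBytes("info")
  private val Qualifier = Bytes.toBytes("count")

  // Decode one row; returns None when the info:count cell is absent.
  // toInt throws NumberFormatException if the cell does not hold decimal digits.
  def toWordCount(result: Result): Option[(String, Int)] =
    Option(result.getValue(Family, Qualifier))
      .map(bytes => (Bytes.toString(result.getRow), Bytes.toString(bytes).toInt))

  // Convert the RDD returned by newAPIHadoopRDD into plain (word, count) pairs
  def toPairs(rows: RDD[(ImmutableBytesWritable, Result)]): RDD[(String, Int)] =
    rows.flatMap { case (_, result) => toWordCount(result).iterator }
}
```

With resultRDD from the code above, WordCountRows.toPairs(resultRDD) gives an RDD[(String, Int)] that can be sorted, joined, or collected like any other pair RDD.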
