Spark reading and writing HBase

Environment
scala -> 2.11.12
spark -> 2.2.0
hbase -> 1.3.0  (Note: with the HBase 2.0 client jars the writes do not land in the table, yet no error is reported)
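
The versions above can be pinned in the build file. A minimal build.sbt sketch, assuming the standard hbase-client / hbase-common / hbase-server artifacts (the exact dependency set is an assumption; adjust it to your cluster):

// build.sbt (sketch) -- dependency pins matching the versions above
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "2.2.0",
  "org.apache.spark" %% "spark-sql"    % "2.2.0",
  "org.apache.spark" %% "spark-hive"   % "2.2.0",   // needed because enableHiveSupport() is used below
  "org.apache.hbase" %  "hbase-client" % "1.3.0",
  "org.apache.hbase" %  "hbase-common" % "1.3.0",
  "org.apache.hbase" %  "hbase-server" % "1.3.0"    // provides TableInputFormat / TableOutputFormat
)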

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

/**
  * Read and write HBase directly from Spark; tested.
  * @Author: stsahana
  * @Date: 2019-8-21 18:27
  **/
object HbaseDemo {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .enableHiveSupport()
      .appName("habseDemo")
      .master("local[2]")
      .config("executor.memory", "2G")
      .config("total.executor.cores", "2")
      .config("spark.hadoop.validateOutputSpecs", false)
      .getOrCreate()

    read(spark)
    write(spark)
    read(spark)
  }


  def read(spark: SparkSession): DataFrame = {
    val sc = spark.sparkContext

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.zookeeper.quorum", "localhost") //设置zooKeeper集群地址,也可以通过将hbase-site.xml导入classpath,但是建议在程序里这样设置
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181") //设置zookeeper连接端口,默认2181
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "Contacts")

    // Read the table into an RDD; TableInputFormat comes from the org.apache.hadoop.hbase.mapreduce package
    val hBaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])
    import spark.implicits._

    val results = hBaseRDD.map(r => (
      Bytes.toString(r._2.getRow),
      Bytes.toString(r._2.getValue("Personal".getBytes, "Name".getBytes)),
      Bytes.toString(r._2.getValue("Office".getBytes, "Address".getBytes)),
      Bytes.toString(r._2.getValue("Personal".getBytes, "a".getBytes)),
      Bytes.toString(r._2.getValue("Personal".getBytes, "b".getBytes)),
      Bytes.toString(r._2.getValue("Personal".getBytes, "c".getBytes))
    )).toDF("row", "B", "C", "d", "e", "f")
    results.show(20, false)
    results
  }

  def write(spark: SparkSession) = {

    val sc = spark.sparkContext
    import spark.implicits._

    val tableName = "Contacts"
    //create configuration object
    val conf: Configuration = HBaseConfiguration.create()
    //set zookeeper information
    conf.set("hbase.zookeeper.quorum", "localhost");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    //setup job object
    val job: Job = Job.getInstance(conf)
    //define outputformat class
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
    //add table name to configuration
    job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName)


    // Sample rows: column "a" becomes the row key, "b" and "c" become Personal:b and Personal:c
    val indataRDD = sc.makeRDD(Array("3,jack,15", "4,Lily,16", "5,mike,16"))
    val df = indataRDD.map(_.split(",")).map(a => (a(0), a(1), a(2))).toDF("a", "b", "c")
    df.show()
    val prepareHBaseToLoad: RDD[(ImmutableBytesWritable, Put)] =
      df.rdd.map(row => rowToPut(row: Row))
    try {
      prepareHBaseToLoad.saveAsNewAPIHadoopDataset(job.getConfiguration())
    } catch {
      // Handle the spurious "null string" exception that saveAsNewAPIHadoopDataset throws on Spark 2.2
      case e: Exception =>
        if (e.getMessage.equals("Can not create a Path from a null string")) {
          // Known Spark 2.2 bug: the data is written to HBase, but
          // java.lang.IllegalArgumentException: Can not create a Path from a null string
          // is still thrown from org.apache.hadoop.fs.Path.checkPathArg
          println("saveAsNewAPIHadoopDataset - exception caused by a Spark 2.2 bug - data is saved in HBase but the exception is still thrown")
        } else {
          throw e
        }
    }

  }

  def rowToPut(row: Row): (ImmutableBytesWritable, Put) = {
    val fieldNames = row.schema.fieldNames
    // The first column is used as the HBase row key
    val rowkey: String = row.getAs[String](fieldNames(0))
    val put = new Put(Bytes.toBytes(rowkey))
    // Add the remaining columns to the Put, all under the "Personal" column family,
    // using the DataFrame field name as the column qualifier
    for (field <- 1 until fieldNames.length) {
      put.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes(fieldNames(field)), Bytes.toBytes(row.getAs[String](fieldNames(field))))
    }
    // Return the assembled (row key, Put) pair
    (new ImmutableBytesWritable(Bytes.toBytes(rowkey)), put)
  }
}

/** Test table
  * create "Contacts", { NAME => "Personal", VERSIONS=>5},{ NAME =>"Office",VERSIONS=>2}
  * put 'Contacts', '1', 'Personal:Name', 'John Dole'
  * put 'Contacts', '1', 'Personal:Phone', '1-234-000-0001'
  * put 'Contacts', '1', 'Office:Phone', '1-234-000-0002'
  * put 'Contacts', '1', 'Office:Address', '1111 San Gabriel Dr.'
  * put 'Contacts', '2', 'Personal:Name', 'Calvin Raji'
  * put 'Contacts', '2', 'Personal:Phone', '123-555-0191'
  * put 'Contacts', '2', 'Office:Phone', '123-555-0191'
  * put 'Contacts', '2', 'Office:Address', '5415 San Gabriel Dr.'
  *
  */
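
Before running the job, it can help to confirm from code that the table above is reachable. A minimal connectivity-check sketch using the plain HBase 1.x client API (the quorum address and table name follow the example above; adjust as needed):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

object CheckContactsTable {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    // Open a connection and ask the Admin API whether the "Contacts" table exists
    val connection = ConnectionFactory.createConnection(conf)
    try {
      val exists = connection.getAdmin.tableExists(TableName.valueOf("Contacts"))
      println(s"Contacts table exists: $exists")
    } finally {
      connection.close()
    }
  }
}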

  1. Error 1
     Link: https://issues.apache.org/jira/browse/HBASE-20295
Exception in thread "main" java.lang.NullPointerException
	at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:122)
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
	at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
	at org.apache.hadoop.hbase.mapreduce.TableOutputFormat.checkOutputSpecs(TableOutputFormat.java:177)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1099)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
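
This NPE comes from TableOutputFormat.checkOutputSpecs, which Spark calls while validating the output specification before saveAsNewAPIHadoopDataset. Besides upgrading to an HBase release that contains the HBASE-20295 fix, a common workaround (already used in the SparkSession setup above) is to skip that validation. A sketch:

// Workaround sketch: disable Spark's output-spec validation so that
// TableOutputFormat.checkOutputSpecs (where the NPE originates) is never called.
val spark = SparkSession
  .builder
  .appName("HbaseDemo")
  .config("spark.hadoop.validateOutputSpecs", "false")
  .getOrCreate()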