Bulk import from Hive into HBase: single put, batched put, MapReduce, BulkLoad

Article link: https://blog.csdn.net/qq_22473611/article/details/98503979

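1. Overview

Ways to import Hive data into HBase:

We often need to load large volumes of data into HBase, and there are several ways to do it:

1. Create a Hive table mapped onto an HBase table, so that operating on the Hive table operates on the HBase table directly. This is useful for importing relational data into HBase for initialization, but it does not work well with Hive partitions or HBase pre-split regions. For an ordinary Hive table it is fine.

2. Process the Hive data with Spark SQL, then call the HBase API and insert the data with put, either one put at a time or in batches.

3. Use MapReduce to read the data from HDFS and have TableOutputFormat generate Put objects in the reduce phase and write them to HBase (this can be seen as calling the HBase API from multiple threads).

Approaches 2 and 3 are not very efficient, however, because HBase frequently performs flush, compact and split operations during the load, which consumes considerable CPU and network resources and puts heavy pressure on the RegionServers.

4. Use BulkLoad: generate HFiles and then load them into HBase. BulkLoad runs a MapReduce job that writes the data directly as HFiles, the internal storage format of an HBase table, and then loads the resulting StoreFiles onto the appropriate nodes of the cluster. This path needs no flush, compact or split, does not occupy region resources and does not generate massive write I/O, so it needs less CPU and network. For an initial data load it greatly improves write throughput and reduces the write pressure on the RegionServers.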

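2. How the HBase API inserts data into HBase

Every client write is an RPC request. The data is sent to a RegionServer, where by default it is first written to the WAL (Write Ahead Log, also called the HLog) and then to the MemStore of the target region. When the MemStore fills up it is flushed to an HFile, and such a flush can briefly block user writes.

When the number of StoreFiles reaches a threshold, a compaction is triggered that merges several StoreFiles into one larger StoreFile. This involves a lot of disk I/O and network traffic.

When a single StoreFile grows beyond a threshold, the region is split and the large StoreFile is divided in two.

This path is inefficient for large writes (frequent flush, split and compact operations consume disk I/O), and it also affects the stability of the HBase nodes (long GC pauses and slow responses can cause a node to time out and exit, setting off a chain reaction).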
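To get a feel for how often the write path described above flushes and compacts, here is a toy model in plain Java. It is only a sketch: the class name `WritePathModel`, the method `simulate` and all the numbers are made up for illustration and are not HBase defaults, and real HBase compaction and split policies are considerably more involved.

```java
// Toy model of the put write path: every write lands in the MemStore,
// a full MemStore is flushed to a new StoreFile, and too many StoreFiles
// trigger a compaction that merges them into one.
// All names and thresholds here are hypothetical, not HBase settings.
class WritePathModel {

    /** Returns {flushes, compactions} for a stream of equally sized writes. */
    static int[] simulate(int writes, int bytesPerWrite,
                          int memStoreLimitBytes, int compactionThreshold) {
        int memStoreBytes = 0;
        int storeFiles = 0;
        int flushes = 0;
        int compactions = 0;
        for (int i = 0; i < writes; i++) {
            memStoreBytes += bytesPerWrite;            // WAL append, then MemStore insert
            if (memStoreBytes >= memStoreLimitBytes) { // MemStore full: flush to a StoreFile
                memStoreBytes = 0;
                storeFiles++;
                flushes++;
                if (storeFiles >= compactionThreshold) { // merge the StoreFiles into one
                    storeFiles = 1;
                    compactions++;
                }
            }
        }
        return new int[] {flushes, compactions};
    }

    public static void main(String[] args) {
        int[] result = simulate(1000, 100, 10_000, 3);
        System.out.println("flushes=" + result[0] + ", compactions=" + result[1]);
    }
}
```

Even in this simplified model the flush and compaction counts grow in step with the number of writes, and that is the overhead BulkLoad avoids by writing HFiles directly.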

Before any bulk insert, first make sure the client connects to the right HBase cluster:

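    // Fields shared by the snippets below (old HTable/HBaseAdmin client API)
    static Configuration conf;
    static HBaseAdmin ha;
    static HTable htable;

    static {
        conf = HBaseConfiguration.create();
        // Connection settings for HBase
        // Port that ZooKeeper exposes to clients
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("hbase.zookeeper.quorum", "192.168.137.138,192.168.137.139");
        conf.set("hbase.master", "192.168.10.138:60000");
    }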

Then create the table schema:

    public static void createTable(String tableName){
        try {
            ha = new HBaseAdmin(conf);
            if(ha.tableExists(tableName)){
                ha.disableTable(tableName);
                ha.deleteTable(tableName);
            }
            // Define the table schema
            HTableDescriptor hd =new HTableDescriptor(tableName);
            // Add the column families
            hd.addFamily(new HColumnDescriptor("family1".getBytes()));
            hd.addFamily(new HColumnDescriptor("family2".getBytes()));
            ha.createTable(hd);
        } catch (Exception e) {
            System.out.println(e);
        }
    }

With that in place, we can start inserting data.

1. Single put

// Insert one cell: row key, column family, column name, value, target table name
    public static void insertData(String rowkey, String cf, String column, String content, String tableName)
            throws IOException {
        htable = new HTable(conf, tableName);
        try {
            Put put = new Put(rowkey.getBytes());
            put.add(cf.getBytes(), column.getBytes(), content.getBytes());
            htable.put(put);
        } finally {
            htable.close(); // release the table once the write is done
        }
    }

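This is the slowest way to insert data in bulk. It is better suited to online workloads that insert single records at runtime, such as message logs or processing records, where the htable object is released right after the write. Each put is one RPC request.

2. Batched put

Here every Put object is added to a List, and the whole List is written in one call. Compared with single puts this noticeably improves throughput. It is typically used when the data volume is somewhat larger, since batching reduces the number of requests.
Call the method from the main method with the relevant parameters to insert data in batches with put.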

public static void insertData(String rowkey, String cf, String column, String content, String tableName)
            throws IOException {
        htable = new HTable(conf, tableName);
        List<Put> list = new ArrayList<Put>();
        Put put = new Put(rowkey.getBytes());
        put.add(cf.getBytes(), column.getBytes(), content.getBytes());
        list.add(put);
        htable.put(list); // the whole list is submitted in one call
    }

public static void main(String[] args) throws IOException {
        String tableName = "insertTest";
        createTable(tableName);
        try {
            htable = new HTable(conf, tableName);
            List<Put> list = new ArrayList<Put>();
            for (int i = 0; i < 10; i++) {
                String rowkey = UUID.randomUUID().toString();
                // Column families cannot be added on the fly, so only columns are added dynamically
                Put put = new Put(rowkey.getBytes());
                String content = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date());
                put.add("family1".getBytes(), "column".getBytes(), content.getBytes());
                list.add(put);
            }
            htable.put(list);
            htable.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
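With larger inputs you would usually not put everything into one giant List but submit it in fixed-size chunks. The following is a minimal sketch of that chunking in plain Java. `PutBatcher` and `toBatches` are hypothetical names, and plain strings stand in for the Put objects so that the example runs without an HBase cluster.

```java
import java.util.ArrayList;
import java.util.List;

// Splits a list into fixed-size batches. In the HBase case each batch
// would be passed to one htable.put(List<Put>) call.
// PutBatcher and toBatches are illustrative names, not HBase API.
class PutBatcher {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> toBatches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rowKeys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rowKeys.add("row-" + i); // stand-ins for Put objects
        }
        List<List<String>> batches = toBatches(rowKeys, 4);
        // prints "3 batches instead of 10 single puts"
        System.out.println(batches.size() + " batches instead of " + rowKeys.size() + " single puts");
    }
}
```

The batch size is a trade-off: bigger batches mean fewer submissions but more client memory per call, so it is worth tuning against your own cluster.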

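The put-based inserts above are not suited to large volumes, so they were only tested on a self-built cluster. The two approaches below were run on the company cluster.

2.3.1 Ordinary insert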

val rdd = sc.textFile("/data/produce/2015/2015-03-01.log")
val data = rdd.map(_.split("@")).map { x =>
  (x(0) + x(1), x(2))
}
data.foreachPartition { x =>
  // One HBase client per partition
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "data")
  conf.set("hbase.zookeeper.quorum", "slave5,slave6,slave7")
  conf.set("hbase.zookeeper.property.clientPort", "2181")
  conf.addResource("/home/hadoop/data/lib/hbase-site.xml")
  val table = new HTable(conf, "data")
  // Buffer puts on the client instead of sending each one immediately
  table.setAutoFlush(false, false)
  table.setWriteBufferSize(3 * 1024 * 1024)
  x.foreach { y =>
    val put = new Put(Bytes.toBytes(y._1))
    put.add(Bytes.toBytes("v"), Bytes.toBytes("value"), Bytes.toBytes(y._2))
    table.put(put)
  }
  // Flush whatever is still buffered once the partition is done
  table.flushCommits()
  table.close()
}

Execution time: 7.6 min

2.3.2 Bulkload

val conf = HBaseConfiguration.create()
val tableName = "data1"
val table = new HTable(conf, tableName)
conf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

val job = Job.getInstance(conf)
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
HFileOutputFormat.configureIncrementalLoad(job, table)

val rdd = sc.textFile("/data/produce/2015/2015-03-01.log")
  .map(_.split("@"))
  .map { x =>
    // Prefix the row key with the first 3 hex characters of its MD5
    (DigestUtils.md5Hex(x(0) + x(1)).substring(0, 3) + x(0) + x(1), x(2))
  }
  // HFiles must be written in row key order
  .sortBy(x => x._1)
  .map { x =>
    val kv: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("v"), Bytes.toBytes("value"), Bytes.toBytes(x._2 + ""))
    // Key each record by its row key
    (new ImmutableBytesWritable(Bytes.toBytes(x._1)), kv)
  }
// Phase 1: write the HFiles
rdd.saveAsNewAPIHadoopFile("/tmp/data1", classOf[ImmutableBytesWritable], classOf[KeyValue], classOf[HFileOutputFormat], job.getConfiguration())
// Phase 2: load the HFiles into the table
val bulkLoader = new LoadIncrementalHFiles(conf)
bulkLoader.doBulkLoad(new Path("/tmp/data1"), table)

Execution time: 7 s

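The comparison shows that importing with bulkload takes far less time than the ordinary insert: 7.6 min versus 7 s, a speedup of more than 60x. I did not test with larger data sets, but I expect the improvement to remain substantial, and I strongly recommend using BulkLoad for bulk imports into HBase.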
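The BulkLoad job above prefixes each row key with the first three hex characters of its MD5 digest. A common reason for such a prefix is to spread otherwise sequential keys across regions, although the original code does not say so explicitly. Here is the same key construction as a small sketch with only the JDK (the Scala code uses commons-codec's `DigestUtils.md5Hex`; the names `SaltedRowKey`, `md5Hex` and `salt` below are just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Builds a "salted" row key: first 3 hex chars of MD5(key) + key.
// Class and method names are illustrative, not part of HBase.
class SaltedRowKey {

    /** Lower-case hex MD5 of the UTF-8 bytes of s. */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Prefixes the key with the first three hex characters of its MD5. */
    static String salt(String key) {
        return md5Hex(key).substring(0, 3) + key;
    }

    public static void main(String[] args) {
        // MD5("abc") = 900150983cd24fb0d6963f7d28e17f72, so this prints "900abc"
        System.out.println(salt("abc"));
    }
}
```

Because the prefix is derived from the key itself, a reader who knows the original key can recompute the salted key for a point lookup. Range scans over the original key order are no longer possible, though.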

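3. Bulk import with BulkLoad

BulkLoad involves two phases:

1. Transform phase: use MapReduce to turn the data on HDFS into HFiles, HBase's underlying storage files.

2. Load phase: use the BulkLoad tool provided by HBase to load the generated HFiles into the regions of the HBase table.

Note that the data must already be on HDFS before the bulk load runs, because the MapReduce job reads its input from HDFS. If the raw data takes up 100 GB on HDFS, reserve more than 200 GB of HDFS space, because the generated HFiles are first written to a temporary directory on HDFS as well.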

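4. When to use bulkload and when to use put

• bulkload is suitable when:

– A large amount of data is loaded into HBase in one go.

– The reliability requirements for the load are modest and no WAL files need to be written.

– Loading large amounts of data with put has become slow and queries are slowing down as well.

– Each newly generated HFile is close to the HDFS block size.

• put is suitable when:

– The data loaded into a single region each time is smaller than half the HDFS block size.

– The data must be loaded in real time.

– The load does not cause a sharp drop in user query performance.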

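5. Shell steps for a Bulkload import:

1. Import the data into HDFS

2. Create the table and the import template file

3. Run the command that generates the HFiles

4. Run the command that loads the HFiles into HBase

5. Summary

1. This article uses the bulkload method from the hbase-spark package to generate HFiles and then loads the generated files into an HBase table.

2. When data is imported this way, loading the HFiles into the table briefly takes the table out of service (the table is disabled before the load and enabled again once it completes).

3. The Spark job must be submitted as the hbase user.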

eg 1： hdfs2Hbase

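import java.net.URI

import org.apache.commons.codec.digest.DigestUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants, KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ListBuffer

object hdfsBulkLoad2Hbase {
  /**
    * Bulk insert with multiple columns
    */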
  def insertWithBulkLoadWithMulti(): Unit = {

    val sparkSession = SparkSession.builder().appName("insertWithBulkLoad").master("local[4]").getOrCreate()
    val sc = sparkSession.sparkContext
    val rdd = sc.textFile("hdfs://node1:9000/v2120/a.txt")
      .map(_.split(","))
      .map(x => (DigestUtils.md5Hex(x(0)).substring(0, 3) + x(0), x(1), x(2)))
      .sortBy(_._1)
      .flatMap(x => {
        val listBuffer = new ListBuffer[(ImmutableBytesWritable, KeyValue)]
        val kv1: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("cf1"), Bytes.toBytes("name"), Bytes.toBytes(x._2 + ""))
        val kv2: KeyValue = new KeyValue(Bytes.toBytes(x._1), Bytes.toBytes("cf1"), Bytes.toBytes("age"), Bytes.toBytes(x._3 + ""))
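        // Within a row, emit cells in column-name order: "age" before "name"
        listBuffer.append((new ImmutableBytesWritable(Bytes.toBytes(x._1)), kv2))
        listBuffer.append((new ImmutableBytesWritable(Bytes.toBytes(x._1)), kv1))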
        listBuffer
      }
      )
    val tableName = "test"
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.addResource("hbase-site.xml") // with this file on the classpath, the settings below are optional
    hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "192.168.187.201")
    hbaseConf.set(HConstants.ZOOKEEPER_CLIENT_PORT, "2181")
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)

    val conn = ConnectionFactory.createConnection(hbaseConf)
    val admin = conn.getAdmin
    val table = conn.getTable(TableName.valueOf(tableName))
    // Get the region distribution of the HBase table
    //val regionLocator = conn.getRegionLocator(TableName.valueOf(tableName))

    val job = Job.getInstance(hbaseConf)
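    // Set the job's output format; the values written to the files are KeyValues.
    // Most important: set the output key class. Since we are generating HFiles, the output key must be ImmutableBytesWritable.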
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    job.setOutputFormatClass(classOf[HFileOutputFormat2])
    //HFileOutputFormat2.configureIncrementalLoad(job, table, conn.getRegionLocator(TableName.valueOf(tableName)))
    HFileOutputFormat2.configureIncrementalLoadMap(job, table)

    // With multiple columns, cells must be ordered alphabetically by column name
    val hFileOutput = "hdfs://node1:9000/test"
    isFileExist(hFileOutput, sc)

    rdd.saveAsNewAPIHadoopFile(hFileOutput,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)
    val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
    //bulkLoader.doBulkLoad(new Path(hFileOutput), admin, table, regionLocator)
    bulkLoader.doBulkLoad(new Path(hFileOutput), table.asInstanceOf[HTable])
  }

  /**
    * Delete the HDFS path if it already exists
    */
  def isFileExist(filePath: String, sc: SparkContext): Unit = {
    val output = new Path(filePath)
    val hdfs = FileSystem.get(new URI(filePath), new Configuration)
    if (hdfs.exists(output)) {
      hdfs.delete(output, true)
    }
  }
}

spark-submit --master yarn \
--conf spark.yarn.tokens.hbase.enabled=true \
--class com.dounine.hbase.BulkLoad \
--executor-memory 2G \
--num-executors 2 \
--driver-memory 2G \
--executor-cores 2 build/libs/hbase-data-insert-1.0.0-SNAPSHOT-all.jar
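Example 1 notes that the cells of a row have to be emitted in column-name order, which is why "age" is appended before "name". More generally, an HFile expects its cells sorted by row key, then column family, then qualifier, compared byte by byte. The sketch below shows that ordering with JDK classes only. `CellOrder` and its methods are hypothetical, each cell is a plain `{row, family, qualifier, value}` string array rather than a real KeyValue, the sample values are made up, and timestamp and type ordering are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sorts {row, family, qualifier, value} arrays the way cells must be
// ordered before they are written to an HFile (timestamps ignored).
// Names and data layout are illustrative, not HBase API.
class CellOrder {

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Compares row, then family, then qualifier. */
    static final Comparator<String[]> ROW_FAMILY_QUALIFIER = (x, y) -> {
        for (int i = 0; i < 3; i++) {
            int c = compareBytes(x[i].getBytes(StandardCharsets.UTF_8),
                                 y[i].getBytes(StandardCharsets.UTF_8));
            if (c != 0) {
                return c;
            }
        }
        return 0;
    };

    /** Returns a sorted copy and leaves the input untouched. */
    static List<String[]> sorted(List<String[]> cells) {
        List<String[]> copy = new ArrayList<>(cells);
        copy.sort(ROW_FAMILY_QUALIFIER);
        return copy;
    }

    public static void main(String[] args) {
        List<String[]> cells = new ArrayList<>();
        cells.add(new String[] {"row2", "cf1", "age", "30"});
        cells.add(new String[] {"row1", "cf1", "name", "Tom"});
        cells.add(new String[] {"row1", "cf1", "age", "20"});
        for (String[] cell : sorted(cells)) {
            System.out.println(String.join("/", cell));
        }
    }
}
```

In the Spark jobs in this article, the same effect comes from `sortBy` on the row key plus emitting each row's cells in qualifier order inside `flatMap`.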

eg2：Hive2Hbase

https://www.aboutyun.com/forum.php?mod=viewthread&tid=25037

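import java.io.IOException

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ClusterConnection, ConnectionFactory, HRegionLocator, Put}
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.spark.{HBaseContext, KeyFamilyQualifier}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

import scala.collection.mutable

/**
  * package: com.cloudera.hbase
  * describe: Import Hive data into HBase with BulkLoad
  */
object Hive2HBase {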

  def main(args: Array[String]) {

    // rowKey field name, ZooKeeper quorum and client port, target HBase table
    val rowKeyField = "id"
    val quorum = "cdh01.fayson.com,cdh02.fayson.com,cdh03.fayson.com"
    val clientPort = "2181"
    val hBaseTempTable = "ods_user_hbase"

    val sparkConf = new SparkConf().setAppName("Hive2HBase")
    val sc = new SparkContext(sparkConf)

    val hiveContext = new HiveContext(sc)
    // Read the data from the Hive table
    val datahiveDF = hiveContext.sql(s"select * from ods_user")

    // Column names of the table
    var fields = datahiveDF.columns

    // Remove the rowKey field, wherever it appears in the column list
    fields = fields.filterNot(_ == rowKeyField)

    val hBaseConf = HBaseConfiguration.create()
    hBaseConf.set("hbase.zookeeper.quorum", quorum)
    hBaseConf.set("hbase.zookeeper.property.clientPort", clientPort)

    // Create the temporary HBase table if it does not exist
    creteHTable(hBaseTempTable, hBaseConf)

    val hbaseContext = new HBaseContext(sc, hBaseConf)

    // Convert the DataFrame into the RDD format that bulkload needs
    val rddnew = datahiveDF.rdd.map(row => {
      val rowKey = row.getAs[String](rowKeyField)

      fields.map(field => {
        val fieldValue = row.getAs[String](field)
        (Bytes.toBytes(rowKey), Array((Bytes.toBytes("info"), Bytes.toBytes(field), Bytes.toBytes(fieldValue))))
      })
    }).flatMap(array => {
      (array)
    })

    // Generate the HFiles with HBaseContext's bulkLoad
    hbaseContext.bulkLoad[Put](rddnew.map(record => {
      val put = new Put(record._1)
      record._2.foreach((putValue) => put.addColumn(putValue._1, putValue._2, putValue._3))
      put
    }), TableName.valueOf(hBaseTempTable), (t : Put) => putForLoad(t), "/tmp/bulkload")

    val conn = ConnectionFactory.createConnection(hBaseConf)
    val hbTableName = TableName.valueOf(hBaseTempTable.getBytes())
    val regionLocator = new HRegionLocator(hbTableName, classOf[ClusterConnection].cast(conn))
    val realTable = conn.getTable(hbTableName)
    HFileOutputFormat2.configureIncrementalLoad(Job.getInstance(), realTable, regionLocator)

    // bulk load start
    val loader = new LoadIncrementalHFiles(hBaseConf)
    val admin = conn.getAdmin()
    loader.doBulkLoad(new Path("/tmp/bulkload"),admin,realTable,regionLocator)

    sc.stop()
  }

  /**
    * Create the HBase table
    * @param tableName table name
    */
  def creteHTable(tableName: String, hBaseConf : Configuration) = {
    val connection = ConnectionFactory.createConnection(hBaseConf)
    val hBaseTableName = TableName.valueOf(tableName)
    val admin = connection.getAdmin
    if (!admin.tableExists(hBaseTableName)) {
      val tableDesc = new HTableDescriptor(hBaseTableName)
      tableDesc.addFamily(new HColumnDescriptor("info".getBytes))
      admin.createTable(tableDesc)
    }
    connection.close()
  }

  /**
    * Prepare the Put object for bulkload function.
  */
  @throws(classOf[IOException])
  @throws(classOf[InterruptedException])
  def putForLoad(put: Put): Iterator[(KeyFamilyQualifier, Array[Byte])] = {
    val ret: mutable.MutableList[(KeyFamilyQualifier, Array[Byte])] = mutable.MutableList()
    import scala.collection.JavaConversions._
    for (cells <- put.getFamilyCellMap.entrySet().iterator()) {
      val family = cells.getKey
      for (value <- cells.getValue) {
        val kfq = new KeyFamilyQualifier(CellUtil.cloneRow(value), family, CellUtil.cloneQualifier(value))
        ret.+=((kfq, CellUtil.cloneValue(value)))
      }
    }
    ret.iterator
  }
}

1. Upload the compiled spark-demo-1.0-SNAPSHOT.jar to the server and submit it with spark-submit:

export HADOOP_USER_NAME=hbase
spark-submit --class com.cloudera.hbase.Hive2HBase \
--master yarn-client \
--driver-cores 1 \
--driver-memory 2g \
--executor-cores 1 \
--executor-memory 2g \
spark-demo-1.0-SNAPSHOT.jar

eg3：file2Hbase

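import java.text.SimpleDateFormat
import java.util.Date

import org.apache.commons.codec.digest.DigestUtils
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

object file2Hbase{
  def main(args: Array[String]) = {
    // Create the SparkContext with the default configuration
    //val sc = new SparkContext(new SparkConf())
    val sc = new SparkContext("local", "app name")
    // HBase column family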
    val columnFamily1 = "f1"
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.property.clientPort", "2181")
    conf.set("hbase.zookeeper.quorum", "120.27.111.55")
    val res1=sc.textFile("file:///E:/BaiduYunDownload/data1").map(x =>
      x.replaceAll("<|>", "")
    ).distinct();
    val res2=res1.filter(x=>
      x.contains("REC")
    )
   val sourceRDD= res2.flatMap(x=>
    {
      val arg0=x.split(",")
      val arg1=arg0.map(y=>
      y.replaceFirst("=",",")
      ).filter(s=>
        s.split(",").length>1
        )
      //arg0(10).replaceFirst("=",",").split(",")(0).contat(arg0(10).replaceFirst("=",",").split(",")(0))
     // val key1=Bytes.toBytes(arg0(11).replaceFirst("=",",").split(",")(0).concat(arg0(17).replaceFirst("=",",").split(",")(1)));
      val sdf = new SimpleDateFormat("yyyyMMdd")
      val date=(Long.MaxValue-sdf.parse(arg0(11).replaceFirst("=",",").split(",")(1)).getTime).toString
      val key=DigestUtils.md5Hex(date).concat(arg0(17).replaceFirst("=",",").split(",")(1));
      // println(arg0(11).replaceFirst("=",",").split(",")(1).concat(arg0(17).replaceFirst("=",",").split(",")(1)))
     val arg2=arg1.map(z=>
        (key,(columnFamily1,z.split(",")(0), z.split(",")(1)))
      ).sorted
      arg2
     // arg0.
    }
    )
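    // Order by row key, then column name, so cells reach the HFile writer in sorted order
    val source=sourceRDD.sortBy(x => (x._1, x._2._2))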
    source.foreach(println)
    val date = new Date().getTime
    val rdd = source.map(x => {
      // Convert the RDD into the format HFile output needs: the HFile key is an ImmutableBytesWritable,
      // so each RDD record uses an ImmutableBytesWritable instance as its key
      // and a KeyValue instance as its value
      // rowkey
      val rowKey = x._1
      val family = x._2._1
      val colum = x._2._2
      val value = x._2._3
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)), new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes(family), Bytes.toBytes(colum), date,Bytes.toBytes(value)))
    })
    rdd.foreach(print)
    // Temporary path for the generated HFiles
    val stagingFolder = "file:///E:/BaiduYunDownload/data12"
    // Write the RDD out to that directory as HFiles
    rdd.saveAsNewAPIHadoopFile(stagingFolder,
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      conf)
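    // Once this has run, stagingFolder contains the generated HFiles
    // Now load the HFiles into HBase; everything below is plain HBase API
    val load = new LoadIncrementalHFiles(conf)
    // HBase table name
    val tableName = "output_table"
    // Create the HBase connection from the configuration; the cluster is located through the ZooKeeper settings above
    val conn = ConnectionFactory.createConnection(conf)
    // Get the table by name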
    val table: Table = conn.getTable(TableName.valueOf(tableName))
    //print(table.getTableDescriptor()+"eeeeeeeeeeeeeeeeeeeeeeeeeeeeee")
    try {
      // Get the region distribution of the HBase table
      // val regionLocator = conn.getRegionLocator(TableName.valueOf(tableName))
      // Create a Hadoop MapReduce job
      val job = Job.getInstance(conf)
      // Set the job name
      job.setJobName("DumpFile")
      // Most important: set the output key class. Since we are generating HFiles, the output key must be ImmutableBytesWritable
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      // The values written to the files are KeyValues
      job.setMapOutputValueClass(classOf[KeyValue])
      // Configure HFileOutputFormat2
      //HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)
      HFileOutputFormat2.configureIncrementalLoadMap(job, table)
      // Start the load
      load.doBulkLoad(new Path(stagingFolder), table.asInstanceOf[HTable])
    } finally {
      table.close()
      conn.close()
    }
  }
}
