Let's start with a problem:
java.io.IOException: Added a key not lexically larger than previous. Current cell = M00000006/info:age/1563723718005/Put/vlen=4/seqid=0, lastCell = M00000006/info:name/1563723718005/Put/vlen=2/seqid=0
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.checkKey(HFileWriterImpl.java:245)
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.append(HFileWriterImpl.java:731)
at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:234)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:337)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:230)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/07/21 23:41:58 ERROR Utils: Aborting task
The exception is thrown because the Cells were not sorted when we generated the HFile. Recall that when data is written through HBase's Put API, the Cells first go into the RegionServer's MemStore; the MemStore keeps its KeyValues sorted in memory (part of HBase's LSM-tree design), and when it fills up or the flush interval expires it writes them out as an HFile. In other words, the HFiles that HBase generates itself already contain KeyValues in sorted order, so any HFile we generate ourselves must guarantee the same ordering.
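To make the rule concrete, here is a minimal sketch (the column values are made up; the row key and qualifiers are taken from the error message) that asks HBase's CellComparator which of the two offending cells must come first:

import org.apache.hadoop.hbase.{CellComparator, KeyValue}
import org.apache.hadoop.hbase.util.Bytes

object CellOrderDemo {
  def main(args: Array[String]): Unit = {
    val row = Bytes.toBytes("M00000006") // row key from the error message
    val family = Bytes.toBytes("info")
    // The values (30, "Li") are placeholders; only row/family/qualifier matter for ordering.
    val age = new KeyValue(row, family, Bytes.toBytes("age"), Bytes.toBytes(30))
    val name = new KeyValue(row, family, Bytes.toBytes("name"), Bytes.toBytes("Li"))
    // CellComparator orders cells by row, then family, then qualifier (then timestamp/type).
    // This prints a negative number: "age" sorts before "name" within the same row.
    println(CellComparator.getInstance().compare(age, name))
  }
}

The result is negative: within one row, cells are ordered by family and then qualifier, so info:age must be written before info:name. The failing job wrote name first and age second, which is exactly what HFileWriterImpl.checkKey rejected.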
So how do we guarantee the ordering of our KeyValues?
We can borrow from the CellSortReducer class that HBase provides. It lives in the hbase-mapreduce module and is what HBase's own MapReduce API uses when generating HFiles:
/**
* Emits sorted Cells.
* Reads in all Cells from passed Iterator, sorts them, then emits
* Cells in sorted order. If lots of columns per row, it will use lots of
* memory sorting.
* @see HFileOutputFormat2
*/
@InterfaceAudience.Public
public class CellSortReducer
extends Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
protected void reduce(ImmutableBytesWritable row, Iterable<Cell> kvs,
Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell>.Context context)
throws java.io.IOException, InterruptedException {
TreeSet<Cell> map = new TreeSet<>(CellComparator.getInstance());
for (Cell kv : kvs) {
try {
map.add(PrivateCellUtil.deepClone(kv));
} catch (CloneNotSupportedException e) {
throw new IOException(e);
}
}
context.setStatus("Read " + map.getClass());
int index = 0;
for (Cell kv: map) {
context.write(row, new MapReduceExtendedCell(kv));
if (++index % 100 == 0) context.setStatus("Wrote " + index);
}
}
}
From the CellSortReducer source we can see that, for KeyValues sharing the same RowKey, HBase sorts them by putting them into a TreeSet driven by the CellComparator (CellComparatorImpl) comparator.
So when we generate HFiles from Spark, we can borrow the same idea.
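Translated to Scala, the per-row part of that idea is only a few lines; a minimal sketch (the full program below embeds the same logic in its buildKeyValueCells method):

import java.util
import org.apache.hadoop.hbase.{Cell, CellComparator}
import scala.collection.JavaConversions

object RowCellSorter {
  // Sort the cells of a single row the way CellSortReducer does:
  // a java.util.TreeSet driven by HBase's own CellComparator.
  def sortCellsOfRow(cells: Seq[Cell]): List[Cell] = {
    val sorted = new util.TreeSet[Cell](CellComparator.getInstance)
    cells.foreach(sorted.add)
    JavaConversions.asScalaSet(sorted).toList
  }
}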
Generating HFiles with Spark and loading the data with BulkLoad
The complete process is as follows:
package com.ljy.spark
import java.io.Closeable
import java.util
import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.{Cell, CellComparator, CellUtil, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConversions
object SparkGenHFile {
import org.apache.hadoop.hbase.util.Bytes
private val FAMILY = Bytes.toBytes("info")
private val COL_NAME = Bytes.toBytes("name")
private val COL_AGE = Bytes.toBytes("age")
private val COL_GENDER = Bytes.toBytes("gender")
private val COL_ADDRESS = Bytes.toBytes("address")
private val COL_INCOME = Bytes.toBytes("income")
private val COL_JOB = Bytes.toBytes("job")
private val COL_JOINEDYM = Bytes.toBytes("joined")
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.setAppName("spark-gen-hfile")
.setMaster("local[*]")
sparkConf.registerKryoClasses(Array(
classOf[ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.KeyValue],
classOf[Array[org.apache.hadoop.hbase.io.ImmutableBytesWritable]],
Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
Class.forName("scala.reflect.ClassTag$$anon$1")
))
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
val dataPath = "hdfs://vhb1:8020/user/hbase/bulkload/user"
val rdd = spark.sparkContext.textFile(dataPath)
.flatMap(line => {
val fields = line.split("\t")
val key = new ImmutableBytesWritable(Bytes.toBytes(fields(0)))
val cells = buildKeyValueCells(fields)
cells.map((key, _))
})
val tableName = "sparkhfile"
val hbaseConf = ConfigurationFactory.getHBaseConf
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(hbaseConf)
var conn: Connection = null
var table: Table = null
var fs: FileSystem = null
fs = FileSystem.get(hbaseConf)
val hfileDir = new Path(fs.getWorkingDirectory, "hfile-dir")
val hfile = new Path(hfileDir, System.currentTimeMillis() + "")
try {
conn = ConnectionFactory.createConnection(hbaseConf)
table = conn.getTable(TableName.valueOf(tableName))
HFileOutputFormat2.configureIncrementalLoadMap(job, table.getDescriptor)
// Generate the HFiles
rdd.sortByKey() // sort by row key
.saveAsNewAPIHadoopFile(hfile.toString, classOf[ImmutableBytesWritable], classOf[Cell], classOf[HFileOutputFormat2], job.getConfiguration)
spark.stop()
// Bulk-load the generated HFiles into the HBase table
new LoadIncrementalHFiles(hbaseConf).run(Array(hfile.toString /* directory containing the generated HFiles */, tableName /* target table, which must be created in advance */))
println("hfile: " + hfile.toString)
} finally {
fs.delete(hfileDir, true)
close(conn, table, fs)
}
}
def buildKeyValueCells(fields: Array[String]): List[Cell] = {
val rowKey = Bytes.toBytes(fields(0))
val name = new KeyValue(rowKey, FAMILY, COL_NAME, Bytes.toBytes(fields(1)))
val age = new KeyValue(rowKey, FAMILY, COL_AGE, Bytes.toBytes(fields(2).toInt))
val gender = new KeyValue(rowKey, FAMILY, COL_GENDER, Bytes.toBytes(fields(3)))
val address = new KeyValue(rowKey, FAMILY, COL_ADDRESS, Bytes.toBytes(fields(4)))
val income = new KeyValue(rowKey, FAMILY, COL_INCOME, Bytes.toBytes(fields(5).toDouble))
val job = new KeyValue(rowKey, FAMILY, COL_JOB, Bytes.toBytes(fields(6)))
val joined = new KeyValue(rowKey, FAMILY, COL_JOINEDYM, Bytes.toBytes(fields(7)))
// Follow the approach of CellSortReducer in hbase-mapreduce
val set = new util.TreeSet[KeyValue](CellComparator.getInstance)
util.Collections.addAll(set, name, age, gender, address, income, job, joined)
// Convert the Java TreeSet into a Scala collection
JavaConversions.asScalaSet(set).toList
}
/**
* Sorting could also be implemented in plain Scala, but to avoid reinventing the wheel
* we just use the comparator HBase provides, so this hand-written Ordering is no longer used.
*/
@Deprecated
class KeyValueOrder extends Ordering[KeyValue] {
override def compare(x: KeyValue, y: KeyValue): Int = {
val xRow = CellUtil.cloneRow(x)
val yRow = CellUtil.cloneRow(y)
var com = Bytes.compareTo(xRow, yRow)
if (com != 0) return com
val xf = CellUtil.cloneFamily(x)
val yf = CellUtil.cloneFamily(y)
com = Bytes.compareTo(xf, yf)
if (com != 0) return com
val xq = CellUtil.cloneQualifier(x)
val yq = CellUtil.cloneQualifier(y)
com = Bytes.compareTo(xq, yq)
if (com != 0) return com
val xv = CellUtil.cloneValue(x)
val yv = CellUtil.cloneValue(y)
Bytes.compareTo(xv, yv)
}
}
def close(closes: Closeable*): Unit = {
for (elem <- closes) {
if (elem != null) {
elem.close()
}
}
}
}
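After the job finishes, a quick sanity check confirms that the bulk-loaded rows are visible. A minimal sketch, assuming the table created above and the row key that appears in the error log (M00000006):

import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.hbase.{CellUtil, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object VerifyBulkLoad {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(ConfigurationFactory.getHBaseConf)
    val table = conn.getTable(TableName.valueOf("sparkhfile"))
    try {
      val result = table.get(new Get(Bytes.toBytes("M00000006")))
      // age/income were written with Bytes.toBytes(Int/Double), so print only the
      // qualifier names instead of decoding every value as a String.
      result.rawCells().foreach(c => println(Bytes.toString(CellUtil.cloneQualifier(c))))
    } finally {
      table.close()
      conn.close()
    }
  }
}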
For convenience, I extracted the Configuration used above into a small factory class.
The code is as follows:
package com.ljy.common;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
public class ConfigurationFactory {
public static Configuration getHBaseConf() {
final Configuration conf = getConf();
conf.set("hbase.rootdir", "hdfs://vhb1:8020/hbase2");
conf.set("hbase.zookeeper.quorum", "vhb1,vhb2,vhb3");
conf.set("hbase.zookeeper.property.clientPort", "2181");
conf.set("zookeeper.znode.parent", "/hbase");
return HBaseConfiguration.create(conf);
}
public static Configuration getConf() {
final Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://vhb1:8020");
return conf;
}
}
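As an aside, if hbase-site.xml and core-site.xml are already on the job's classpath, the hard-coded values above are not needed; a hedged alternative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

object ClasspathHBaseConf {
  // HBaseConfiguration.create() merges hbase-default.xml with any hbase-site.xml
  // found on the classpath, so the quorum/rootdir settings need not be hard-coded
  // when those files are packaged with the job.
  def get: Configuration = HBaseConfiguration.create()
}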
The pom dependencies are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.ljy</groupId>
<artifactId>spark-hbase</artifactId>
<version>1.0</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-mapreduce</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
</project>