Let's start with a problem:
java.io.IOException: Added a key not lexically larger than previous. Current cell = M00000006/info:age/1563723718005/Put/vlen=4/seqid=0, lastCell = M00000006/info:name/1563723718005/Put/vlen=2/seqid=0
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.checkKey(HFileWriterImpl.java:245)
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.append(HFileWriterImpl.java:731)
at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:234)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:337)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:230)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/07/21 23:41:58 ERROR Utils: Aborting task
The exception is thrown because the Cells were not sorted when we generated the HFile. Recall that when data is written through HBase's Put API, the Cells first go into the RegionServer's MemStore; the MemStore keeps its KeyValues sorted in memory (part of HBase's LSM-tree design), and when it fills up or the flush interval expires it writes them out as an HFile. In other words, the HFiles that HBase generates itself already contain KeyValues in sorted order, so any HFile we generate ourselves must guarantee the same ordering.
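To make the rule concrete, here is a minimal sketch (the column values are made up; the row key and qualifiers are taken from the error message) that asks HBase's CellComparator which of the two offending cells must come first:

import org.apache.hadoop.hbase.{CellComparator, KeyValue}
import org.apache.hadoop.hbase.util.Bytes

object CellOrderDemo {
  def main(args: Array[String]): Unit = {
    val row = Bytes.toBytes("M00000006") // row key from the error message
    val family = Bytes.toBytes("info")
    // The values (30, "Li") are placeholders; only row/family/qualifier matter for ordering.
    val age = new KeyValue(row, family, Bytes.toBytes("age"), Bytes.toBytes(30))
    val name = new KeyValue(row, family, Bytes.toBytes("name"), Bytes.toBytes("Li"))
    // CellComparator orders cells by row, then family, then qualifier (then timestamp/type).
    // This prints a negative number: "age" sorts before "name" within the same row.
    println(CellComparator.getInstance().compare(age, name))
  }
}

The result is negative: within one row, cells are ordered by family and then qualifier, so info:age must be written before info:name. The failing job wrote name first and age second, which is exactly what HFileWriterImpl.checkKey rejected.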
So how do we guarantee the ordering of our KeyValues?
We can borrow from the CellSortReducer class that HBase provides. It lives in the hbase-mapreduce module and is what HBase's own MapReduce API uses when generating HFiles:
/**
* Emits sorted Cells.
* Reads in all Cells from passed Iterator, sorts them, then emits
* Cells in sorted order. If lots of columns per row, it will use lots of
* memory sorting.
* @see HFileOutputFormat2
*/
@InterfaceAudience.Public
public class CellSortReducer
extends Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
protected void reduce(ImmutableBytesWritable row, Iterable<Cell> kvs,
Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell>.Context context)
throws java.io.IOException, InterruptedException {
TreeSet<Cell> map = new TreeSet<>(CellComparator.getInstance());
for (Cell kv : kvs) {
try {
map.add(PrivateCellUtil.deepClone(kv));
} catch (CloneNotSupportedException e) {
throw new IOException(e);
}
}
context.setStatus("Read " + map.getClass());
int index = 0;
for (Cell kv: map) {
context.write(row, new MapReduceExtendedCell(kv));
if (++index % 100 == 0) context.setStatus("Wrote " + index);
}
}
}
From the CellSortReducer source we can see that, for KeyValues sharing the same RowKey, HBase sorts them by putting them into a TreeSet driven by the CellComparator (CellComparatorImpl) comparator.
So when we generate HFiles from Spark, we can borrow the same idea.
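Translated to Scala, the per-row part of that idea is only a few lines; a minimal sketch (the full program below embeds the same logic in its buildKeyValueCells method):

import java.util
import org.apache.hadoop.hbase.{Cell, CellComparator}
import scala.collection.JavaConversions

object RowCellSorter {
  // Sort the cells of a single row the way CellSortReducer does:
  // a java.util.TreeSet driven by HBase's own CellComparator.
  def sortCellsOfRow(cells: Seq[Cell]): List[Cell] = {
    val sorted = new util.TreeSet[Cell](CellComparator.getInstance)
    cells.foreach(sorted.add)
    JavaConversions.asScalaSet(sorted).toList
  }
}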
Generating HFiles with Spark and loading the data with BulkLoad
The complete process is as follows:
package com.ljy.spark
import java.io.Closeable
import java.util
import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.{Cell, CellComparator, CellUtil, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConversions
object SparkGenHFile {
import org.apache.hadoop.hbase.util.Bytes
private val FAMILY = Bytes.toBytes("info")
private val COL_NAME = Bytes.toBytes("name")
private val COL_AGE = Bytes.toBytes("age")
private val COL_GENDER = Bytes.toBytes("gender")
private val COL_ADDRESS = Bytes.toBytes("address")
private val COL_INCOME = Bytes.toBytes("income")
private val COL_JOB = Bytes.toBytes("job")
private val COL_JOINEDYM = Bytes.toBytes("joined")
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.setAppName("spark-gen-hfile")
.setMaster("local[*]")
sparkConf.registerKryoClasses(Array(
classOf[ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.KeyValue],
classOf[Array[org.apache.hadoop.hbase.io.ImmutableBytesWritable]],
Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
Class.forName("scala.reflect.ClassTag$$anon$1")
))
val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()
val dataPath = "hdfs://vhb1:8020/user/hbase/bulkload/user"
val rdd = spark.sparkContext.textFile(dataPath)
.flatMap(line => {
val fields = line.split("\t")
val key = new ImmutableBytesWritable(Bytes.toBytes(fields(0)))
val cells = buildKeyValueCells(fields)
cells.map((key, _))
})
val tableName = "sparkhfile"
val hbaseConf = ConfigurationFactory.getHBaseConf
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(hbaseConf)
var conn: Connection = null
var table: Table = null
var fs: FileSystem = null
fs = FileSystem.get(hbaseConf)
val hfileDir = new Path(fs.getWorkingDirectory, "hfile-dir")
val hfile = new Path(hfileDir, System.currentTimeMillis() + "")
try {
conn = ConnectionFactory.createConnection(hbaseConf)
table = conn.getTable(TableName.valueOf(tableName))
HFileOutputFormat2.configureIncrementalLoadMap(job, table.getDescriptor)
// Generate the HFiles
rdd.sortByKey() // sort by row key
.saveAsNewAPIHadoopFile(hfile.toString, classOf[ImmutableBytesWritable], classOf[Cell], classOf[HFileOutputFormat2], job.getConfiguration)
spark.stop()
// Bulk-load the generated HFiles into the HBase table
new LoadIncrementalHFiles(hbaseConf).run(Array(hfile.toString /* directory containing the generated HFiles */, tableName /* target table, which must be created in advance */))
println("hfile: " + hfile.toString)
} finally {
fs.delete(hfileDir, true)
close(conn, table, fs)
}
}
def buildKeyValueCells(fields: Array[String]): List[Cell] = {
val rowKey = Bytes.toBytes(fields(0))
val name = new KeyValue(rowKey, FAMILY, COL_NAME, Bytes.toBytes(fields(1)))
val age = new KeyValue(rowKey, FAMILY, COL_AGE, Bytes.toBytes(fields(2).toInt))
val gender = new KeyValue(rowKey, FAMILY, COL_GENDER, Bytes.toBytes(fields(3)))
val address = new KeyValue(rowKey, FAMILY, COL_ADDRESS, Bytes.toBytes(fields(4)))
val income = new KeyValue(rowKey, FAMILY, COL_INCOME, Bytes.toBytes(fields(5).toDouble))
val job = new KeyValue(rowKey, FAMILY, COL_JOB, Bytes.toBytes(fields(6)))
val joined = new KeyValue(rowKey, FAMILY, COL_JOINEDYM, Bytes.toBytes(fields(7)))
// Follow the approach of CellSortReducer in hbase-mapreduce
val set = new util.TreeSet[KeyValue](CellComparator.getInstance)
util.Collections.addAll(set, name, age, gender, address, income, job, joined)
// Convert the Java TreeSet into a Scala collection
JavaConversions.asScalaSet(set).toList
}
/**
* Sorting could also be implemented in plain Scala, but to avoid reinventing the wheel
* we just use the comparator HBase provides, so this hand-written Ordering is no longer used.
*/
@Deprecated
class KeyValueOrder extends Ordering[KeyValue] {
override def compare(x: KeyValue, y: KeyValue): Int = {
val xRow = CellUtil.cloneRow(x)
val yRow = CellUtil.cloneRow(y)
var com = Bytes.compareTo(xRow, yRow)
if (com != 0) return com
val xf = CellUtil.cloneFamily(x)
val yf = CellUtil.cloneFamily(y)
com = Bytes.compareTo(xf, yf)
if (com != 0) return com
val xq = CellUtil.cloneQualifier(x)
val yq = CellUtil.cloneQualifier(y)
com = Bytes.compareTo(xq, yq)
if (com != 0) return com
val xv = CellUtil.cloneValue(x)
val yv = CellUtil.cloneValue(y)
Bytes.compareTo(xv, yv)
}
}
def close(closes: Closeable*): Unit = {
for (elem <- closes) {
if (elem != null) {
elem.close()
}
}
}
}
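After the job finishes, a quick sanity check confirms that the bulk-loaded rows are visible. A minimal sketch, assuming the table created above and the row key that appears in the error log (M00000006):

import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.hbase.{CellUtil, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object VerifyBulkLoad {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(ConfigurationFactory.getHBaseConf)
    val table = conn.getTable(TableName.valueOf("sparkhfile"))
    try {
      val result = table.get(new Get(Bytes.toBytes("M00000006")))
      // age/income were written with Bytes.toBytes(Int/Double), so print only the
      // qualifier names instead of decoding every value as a String.
      result.rawCells().foreach(c => println(Bytes.toString(CellUtil.cloneQualifier(c))))
    } finally {
      table.close()
      conn.close()
    }
  }
}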
For convenience, I extracted the Configuration used above into a small factory class.
The code is as follows:
package com.ljy.common;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
public class ConfigurationFactory {
public static Configuration getHBaseConf() {
final Configuration conf = getConf();
conf.set("hbase.rootdir", "hdfs://vhb1:8020/hbase2");
conf.set("hbase.zookeeper.quorum", "vhb1,vhb2,vhb3");
conf.set("hbase.zookeeper.property.clientPort", "2181");
conf.set("zookeeper.znode.parent", "/hbase");
return HBaseConfiguration.create(conf);
}
public static Configuration getConf() {
final Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://vhb1:8020");
return conf;
}
}
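As an aside, if hbase-site.xml and core-site.xml are already on the job's classpath, the hard-coded values above are not needed; a hedged alternative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

object ClasspathHBaseConf {
  // HBaseConfiguration.create() merges hbase-default.xml with any hbase-site.xml
  // found on the classpath, so the quorum/rootdir settings need not be hard-coded
  // when those files are packaged with the job.
  def get: Configuration = HBaseConfiguration.create()
}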
The pom dependencies are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.ljy</groupId>
<artifactId>spark-hbase</artifactId>
<version>1.0</version>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-mapreduce</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>2.1.0</version>
</dependency>
</dependencies>
</project>