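Writing to HBase from Spark via Bulk Load
Background
The project uses HBase: historical data is stored in HBase and the latest data in Hive. The code uses Spark to import data into HBase through Bulk Load. I had never used it this way before, so I wrote this article to summarize what I learned about loading data into HBase from Spark with Bulk Load.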
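Bulk Load
Advantage: high efficiency. Reason: HBase keeps its data on HDFS in a specific file format (HFile). Bulk Load takes advantage of this by generating persistent HFile files directly on HDFS and then loading them into the table, which makes importing very large volumes of data fast. The files are produced by an MR job, so the import does not occupy Region resources and does not generate massive write I/O, and it therefore needs less CPU and network resources.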
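How Bulk Load works
Bulk Load is implemented with an MR job. The job writes the data directly as files in HBase's internal HFile format, and these data files are then loaded straight into the running cluster. Compared with writing through the HBase API, Bulk Load uses less CPU and network resources.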
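The ordinary Put approach
The put approach writes data into HBase one record at a time. It is much less efficient than Bulk Load and is shown here only for comparison.
Check whether the table exists -> create the table -> build the Put object -> put -> query -> delete
Put in Java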
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

import java.io.IOException;

public class TestHBaseApi {

    @Test
    public void test() throws IOException {
        // 1. Add the dependency jars (declared through Maven)
        // Get the configuration
        Configuration conf = HBaseConfiguration.create();
        // Add the ZooKeeper settings
        conf.set("hbase.zookeeper.quorum", "hadoop102");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        // Open a connection to HBase
        Connection conn = ConnectionFactory.createConnection(conf);
        // Print the connection
        System.out.println(conn);

        // Work with HBase
        // Get the admin object
        Admin admin = conn.getAdmin();
        // Check whether the table exists
        TableName tableName = TableName.valueOf("student1");
        boolean b = admin.tableExists(tableName);
        if (!b) { // The table does not exist, so create it
            // Create the table -- shell: create 'student1', 'info'
            HTableDescriptor tableDescriptor = new HTableDescriptor(tableName);
            // Add the column family
            HColumnDescriptor hColumnDescriptor = new HColumnDescriptor("info");
            tableDescriptor.addFamily(hColumnDescriptor);
            admin.createTable(tableDescriptor);
        } else { // The table already exists
            // To drop the existing table instead, disable it first and then delete it.
            // disableTable blocks until the table is disabled, so the delete is safe:
            // admin.disableTable(tableName);
            // admin.deleteTable(tableName);
            System.out.println("Table already exists");
        }

        // Add data -- shell: put 'student1', '1001', 'info:name', 'zhangsan'
        // Get the table object before adding data
        Table table = conn.getTable(tableName);
        String rowkey = "1001";
        Put put = new Put(Bytes.toBytes(rowkey)); // Bytes encodes and decodes strings as UTF-8
        byte[] infos = Bytes.toBytes("info");
        byte[] names = Bytes.toBytes("name");
        byte[] value = Bytes.toBytes("zhangsan");
        put.addColumn(infos, names, value);
        table.put(put); // Accepts a single Put, or a list of Puts in one call
        System.out.println("Data saved...");

        // Query the data
        Get get = new Get(Bytes.toBytes(rowkey));
        Result result = table.get(get);
        // Iterate over the cells of the result
        for (Cell cell : result.rawCells()) {
            System.out.println(Bytes.toString(CellUtil.cloneValue(cell)));
        }

        // Delete the data
        Delete delete = new Delete(Bytes.toBytes(rowkey));
        table.delete(delete);
        System.out.println("Deleted...");

        // Release the resources
        table.close();
        admin.close();
        conn.close();
    }
}
Importing data with Bulk Load
Steps:
- Generate the HFile files
- Import them with doBulkLoad()
Data preparation
The sample data is a JSON string
{ "id":"1" , "title":"title1" }
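To try the job with more than one record, the input file needs one such JSON object per line, since the Spark code reads it with textFile and parses each line separately. The following is only a minimal sketch for producing test data: the class name SampleDataGenerator, its method names, and the record count are made up for illustration, and the default file name simply mirrors news_profile_data.txt from the job's dataSourcePath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project: writes sample
// records in the same shape as the example line above, one per line.
public class SampleDataGenerator {

    // Build one JSON line for the given id, e.g. { "id":"1" , "title":"title1" }
    public static String toJsonLine(int id) {
        return "{ \"id\":\"" + id + "\" , \"title\":\"title" + id + "\" }";
    }

    // Build lines for ids 1..count
    public static List<String> buildLines(int count) {
        List<String> lines = new ArrayList<>();
        for (int id = 1; id <= count; id++) {
            lines.add(toJsonLine(id));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Output path is an assumption; pass another path as the first argument
        Path out = Paths.get(args.length > 0 ? args[0] : "news_profile_data.txt");
        Files.write(out, buildLines(5), StandardCharsets.UTF_8);
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

The ids here are plain integers written as strings, the same as in the sample record.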
Maven dependencies
Some of these are not needed; trim them as you see fit.
<properties>
<spark.version>2.1.1</spark.version>
<hbase.version>1.3.1</hbase.version>
<scala.main.version>2.11</scala.main.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-common</artifactId>
<version>${hbase.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-yarn_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<!-- JDBC driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.27</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>0.11.0.2</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-server</artifactId>
<version>1.3.1</version>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.24</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-hadoop-compat</artifactId>
<version>1.3.1</version>
</dependency>
</dependencies>
Complete code
package HbaseBulk

import com.alibaba.fastjson.JSON
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object BulkLoad {

  val zookeeperQuorum = "hadoop102:2181"
  val dataSourcePath = "file:///D:/news_profile_data.txt"
  val hdfsRootPath = "hdfs://hadoop102:9000/bulk"
  val hFilePath = "hdfs://hadoop102:9000/bulk/hfile/"
  val tableName = "news"
  val familyName = "cf1"
  val qualifierName = "title"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName(s"${this.getClass.getSimpleName}").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val hadoopConf = new Configuration()
    hadoopConf.set("fs.defaultFS", hdfsRootPath)
    val fileSystem = FileSystem.get(hadoopConf)
    val hbaseConf = HBaseConfiguration.create(hadoopConf)
    hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, zookeeperQuorum)
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    val hbaseConn = ConnectionFactory.createConnection(hbaseConf)
    val admin = hbaseConn.getAdmin

    // 0. Prepare the environment the program runs in
    // If the HBase table does not exist, create it
    if (!admin.tableExists(TableName.valueOf(tableName))) {
      val desc = new HTableDescriptor(TableName.valueOf(tableName))
      val hcd = new HColumnDescriptor(familyName)
      desc.addFamily(hcd)
      admin.createTable(desc)
    }
    // If the HFile output path already exists, delete it
    if (fileSystem.exists(new Path(hFilePath))) {
      fileSystem.delete(new Path(hFilePath), true)
    }

    // 1. Clean the data that goes into the HFiles. The row keys must be sorted,
    //    otherwise the write fails with:
    //    java.io.IOException: Added a key not lexically larger than previous.
    val data = sc.textFile(dataSourcePath)
      .map(jsonStr => {
        // Record-processing logic: parse one JSON line into (rowkey, title)
        val jsonObject = JSON.parseObject(jsonStr)
        val rowkey = jsonObject.get("id").toString.trim
        val title = jsonObject.get("title").toString.trim
        (rowkey, title)
      })
      .sortByKey()
      .map(tuple => {
        val kv = new KeyValue(Bytes.toBytes(tuple._1), Bytes.toBytes(familyName), Bytes.toBytes(qualifierName), Bytes.toBytes(tuple._2))
        (new ImmutableBytesWritable(Bytes.toBytes(tuple._1)), kv)
      })

    // 2. Save the HFiles on HDFS
    val table = hbaseConn.getTable(TableName.valueOf(tableName))
    val job = Job.getInstance(hbaseConf)
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    HFileOutputFormat2.configureIncrementalLoadMap(job, table)
    job.getConfiguration.set("mapred.output.dir", hFilePath)
    data.saveAsNewAPIHadoopDataset(job.getConfiguration)

    // 3. Bulk load the HFiles into HBase
    val bulkLoader = new LoadIncrementalHFiles(hbaseConf)
    val regionLocator = hbaseConn.getRegionLocator(TableName.valueOf(tableName))
    bulkLoader.doBulkLoad(new Path(hFilePath), admin, table, regionLocator)

    // Release the resources
    regionLocator.close()
    table.close()
    admin.close()
    hbaseConn.close()
    fileSystem.close()
    sc.stop()
  }
}
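The sorting requirement in step 1 is easy to trip over, because HBase orders row keys by their raw bytes rather than numerically. The sketch below illustrates that ordering in plain Java with no HBase dependency. The class RowKeyOrder and its methods are hypothetical names, and compareBytes is a re-implementation written for illustration, not HBase's own comparator. Zero-padding the ids is one common workaround and is an assumption here, not something the job above does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of byte-wise row key ordering.
public class RowKeyOrder {

    // Unsigned lexicographic comparison of two byte arrays
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // Equal prefix: the shorter key sorts first
        return a.length - b.length;
    }

    // Sort ids by the order of their UTF-8 bytes
    public static List<String> sortAsRowKeys(List<String> ids) {
        List<String> sorted = new ArrayList<>(ids);
        sorted.sort((l, r) -> compareBytes(
                l.getBytes(StandardCharsets.UTF_8),
                r.getBytes(StandardCharsets.UTF_8)));
        return sorted;
    }

    // Left-pad a numeric id with zeros so byte order matches numeric order
    public static String pad(long id, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", id);
    }

    public static void main(String[] args) {
        // prints [1, 10, 2]: "10" sorts before "2" byte-wise
        System.out.println(sortAsRowKeys(Arrays.asList("2", "10", "1")));
        // prints [0001, 0002, 0010]
        System.out.println(sortAsRowKeys(Arrays.asList(pad(2, 4), pad(10, 4), pad(1, 4))));
    }
}
```

For plain ASCII ids such as the sample data, sorting the String keys with sortByKey() gives the same order as this byte-wise comparison, which is why the job works as written. If numeric order matters when scanning the table, padding the ids before building the KeyValue is one option to consider.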
Result in HBase
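Summary
Bulk Load generates the HFile files directly on HDFS. It does not occupy Region resources and does not produce heavy I/O, so it needs comparatively little CPU and network resources.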
Other
Deleting a table in HBase
1. First disable it: disable 'news'
2. Then drop it: drop 'news'
References:
https://www.iteblog.com/archives/1889.html
https://www.iteblog.com/archives/1891.html
https://www.cnblogs.com/smartloli/p/9501887.html
https://www.2cto.com/net/201710/692437.html
https://www.jianshu.com/p/b6c5a5ba30af