Reading and Writing HBase Data with Spark RDD Programming

晓之以理的喵~~

Last modified: 2023-03-13 15:29:17

First published: 2023-03-13 15:28:56

Article link: https://blog.csdn.net/weixin_42011858/article/details/129489531

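HBase is a highly reliable, high-performance, column-oriented, scalable distributed database, used mainly to store loosely structured data, both unstructured and semi-structured. Spark can read data from and write data to HBase.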

1. Create an HBase Table

Start Hadoop's HDFS:

$ cd /usr/local/hadoop
$ ./sbin/start-dfs.sh

Start HBase:

$ cd /usr/local/hbase
$ ./bin/start-hbase.sh   # start HBase
$ ./bin/hbase shell      # start the HBase shell
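Before entering the HBase shell, it can help to confirm that the daemons actually started. The following is a minimal sketch, assuming the JDK's jps tool is on the PATH; has_daemon is a hypothetical helper name, not part of Hadoop or HBase. It reads jps-style output ("<pid> <ClassName>") and checks whether a given process name appears.

```shell
#!/bin/sh
# has_daemon is a hypothetical helper (not part of Hadoop or HBase).
# It reads jps-style lines ("<pid> <ClassName>") from standard input and
# exits with status 0 if a process with the given class name is listed,
# and with status 1 otherwise.
has_daemon() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}
```

A possible use is `jps | has_daemon NameNode && echo "NameNode is running"` after starting HDFS, and `jps | has_daemon HMaster && echo "HMaster is running"` after starting HBase. Which other processes show up depends on how Hadoop and HBase are configured.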

Create a student table. When saving the data to HBase, use id as the row key, info as the column family, and name, gender and age as the columns.

hbase> create 'student','info'
# First, enter the record for the first student in the student table
hbase> put 'student','1','info:name','Xueqian'
hbase> put 'student','1','info:gender','F'
hbase> put 'student','1','info:age','23'
# Then enter the record for the second student in the student table
hbase> put 'student','2','info:name','Weiliang'
hbase> put 'student','2','info:gender','M'
hbase> put 'student','2','info:age','24'
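Typing the put commands by hand becomes tedious once there are more than a few records. As a rough sketch, a small shell function can print the same three put statements for one student. The name gen_student_puts is hypothetical and not part of HBase, and the sketch assumes that none of the values contains a single quote.

```shell
#!/bin/sh
# gen_student_puts is a hypothetical helper (not part of HBase).
# It prints the three put statements for one student record, in the same
# form as the commands typed into the HBase shell above.
# Arguments: row key (id), name, gender, age.
# Assumption: none of the values contains a single quote.
gen_student_puts() {
    id="$1"; name="$2"; gender="$3"; age="$4"
    printf "put 'student','%s','info:name','%s'\n"   "$id" "$name"
    printf "put 'student','%s','info:gender','%s'\n" "$id" "$gender"
    printf "put 'student','%s','info:age','%s'\n"    "$id" "$age"
}
```

For example, `gen_student_puts 1 Xueqian F 23` prints the three put lines for the first student. The output could be saved to a file and passed to ./bin/hbase shell as a command file; afterwards, running scan 'student' in the HBase shell shows whether the rows were stored.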

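2. Configure Spark

Copy some of the jar files from the lib directory under the HBase installation directory into the Spark installation directory; the program needs these jars when it is compiled and run. The files to copy are: every jar file whose name starts with hbase, guava-12.0.1.jar, htrace-core-3.1.0-incubating.jar, and protobuf-java-2.5.0.jar.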

$ cd  /usr/local/spark/jars
$ mkdir  hbase 
$ cd  hbase 
$ cp  /usr/local/hbase/lib/hbase*.jar  ./ 
$ cp  /usr/local/hbase/lib/guava-12.0.1.jar  ./ 
$ cp  /usr/local/hbase/lib/htrace-core-3.1.0-incubating.jar  ./ 
$ cp  /usr/local/hbase/lib/protobuf-java-2.5.0.jar  ./
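A missing jar usually only shows up later as a class-not-found error, so it may be worth checking the copy right away. The following is a minimal sketch; check_hbase_jars is a hypothetical helper, not part of Spark or HBase, and the jar versions are the ones named above, which may differ in other HBase releases.

```shell
#!/bin/sh
# check_hbase_jars is a hypothetical helper (not part of Spark or HBase).
# It prints one "missing: ..." line for each expected jar that is absent
# from the given directory. It returns 0 when all jars are present and
# 1 otherwise.
check_hbase_jars() {
    dir="$1"
    missing=0
    # At least one jar whose name starts with "hbase" should exist.
    if ! ls "$dir"/hbase*.jar >/dev/null 2>&1; then
        echo "missing: hbase*.jar"
        missing=1
    fi
    # Versions below are taken from the text; adjust for other releases.
    for jar in guava-12.0.1.jar \
               htrace-core-3.1.0-incubating.jar \
               protobuf-java-2.5.0.jar; do
        if [ ! -f "$dir/$jar" ]; then
            echo "missing: $jar"
            missing=1
        fi
    done
    return "$missing"
}
```

With the directory layout used here, the call would be `check_hbase_jars /usr/local/spark/jars/hbase`; no output and a zero exit status mean that all the listed jars were found.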

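3. Write a Program to Read HBase Data

To read from HBase, Spark uses the newAPIHadoopRDD API provided by SparkContext, which loads the contents of a table into Spark as an RDD.
Create a new source file named SparkOperateHBase.scala: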

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase._
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.mapreduce.TableInputFormat  
import org.apache