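Spark jobs often read from and write to external data sources; common ones include HDFS, HBase, JDBC, Redis, and Kafka. Since these are routine Spark operations, this post collects them into a simple demo for reference in later development.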
1.1 Maven dependencies
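The Hadoop and HBase dependencies are required; choose the versions that match your environment. The CDH builds used here are published in Cloudera's Maven repository rather than Maven Central, so that repository has to be declared in the project as well.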
<properties>
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    <hbase.version>1.2.0-cdh5.7.0</hbase.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-common</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
    </dependency>
</dependencies>
1.2 HBaseUtils
For convenience, we write an HBaseUtils class that handles the basic setup: it builds the Configuration, applies the ZooKeeper settings, and returns HBaseAdmin and HTable instances.
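As a point of comparison, here is a minimal sketch of the same setup using the Connection-based API. In HBase 1.x clients the HBaseAdmin and HTable constructors are marked deprecated in favour of ConnectionFactory, so this may be a useful alternative. The object name `HBaseConnectionSketch`, the ZooKeeper address `localhost:2181`, and the table name `demo_table` are assumptions made for illustration only, and the code needs a reachable HBase cluster to run.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory}

// Hypothetical example object; not part of the original HBaseUtils class.
object HBaseConnectionSketch {
  def main(args: Array[String]): Unit = {
    // Same two ZooKeeper settings that HBaseUtils applies.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "localhost")          // assumed host
    conf.set("hbase.zookeeper.property.clientPort", "2181")  // assumed port

    // A Connection is heavyweight; create it once and close it when done.
    val connection: Connection = ConnectionFactory.createConnection(conf)
    try {
      // Admin is lightweight and obtained from the shared connection.
      val admin = connection.getAdmin
      try {
        val name = TableName.valueOf("demo_table")           // assumed table name
        println(s"table exists: ${admin.tableExists(name)}")
      } finally admin.close()
    } finally connection.close()
  }
}
```

The utility class in this post works with HBaseAdmin and HTable directly, as shown next.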
package com.bupt.Hbase

import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor, TableName}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.{HBaseAdmin, HTable}

object HBaseUtils {

  /**
   * Build an HBase Configuration that points at the given ZooKeeper ensemble.
   * @param quorum    ZooKeeper quorum (comma-separated host list)
   * @param port      ZooKeeper client port
   * @param tableName HBase table name (not used in this method body)
   */
  def getHBaseConfiguration(quorum: String, port: String, tableName: String): Configuration = {
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", quorum)
    conf.set("hbase.zookeeper.property.clientPort", port)
    conf
  }
  /**
   * Return or create HBaseA