Environment setup:
Add the HBase jars and the spark-examples jar to Spark's runtime classpath.
1. Download the spark-examples jar from https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001
2. Place the downloaded spark-examples jar in HBase's lib directory. This walkthrough uses a CDH cluster, where HBase's lib directory is /opt/cloudera/parcels/CDH/lib/hbase/lib.
3. Add the following line to spark-env.sh:
export SPARK_DIST_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/lib/*
Then restart Spark.
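The setup steps above can be sketched as shell commands. The paths assume the CDH parcel layout described here, and the jar filename is assumed from the Maven artifact coordinates; adjust both for your cluster:

```shell
# 1. Download the spark-examples jar (contains the Python converters) from the
#    Maven repository page linked above, then:

# 2. Copy it into HBase's lib directory (CDH parcel path)
cp spark-examples_2.11-1.6.0-typesafe-001.jar /opt/cloudera/parcels/CDH/lib/hbase/lib/

# 3. Put HBase's jars on Spark's classpath (append to spark-env.sh)
echo 'export SPARK_DIST_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/lib/*' \
  >> "$SPARK_HOME/conf/spark-env.sh"
```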
Create hbasetest.py with the following code:
from pyspark.sql import SparkSession

def hbasetest():
    spark = SparkSession.builder.appName('SparkHBaseRDD').getOrCreate()
    sc = spark.sparkContext
    tablename = 'test'
    # Minimal Hadoop job configuration: the HBase table to scan.
    conf = {"hbase.mapreduce.inputtable": tablename}
    # Converters from spark-examples that turn the HBase key/value classes
    # into plain strings usable from Python.
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print(k, v)

if __name__ == '__main__':
    hbasetest()
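Each value produced by HBaseResultToStringConverter is a string holding one JSON object per HBase cell, newline-joined when a row has several cells (this matches the hbase_inputformat.py example shipped with spark-examples; verify against your jar's version). A small sketch of post-processing the collected values, using a hypothetical sample string:

```python
import json

def parse_cells(value):
    """Split one converter value into a list of dicts, one per HBase cell."""
    return [json.loads(cell) for cell in value.splitlines()]

# Hypothetical sample value for a row with two cells in family 'cf':
sample = '\n'.join([
    '{"row": "row1", "columnFamily": "cf", "qualifier": "a", "value": "1"}',
    '{"row": "row1", "columnFamily": "cf", "qualifier": "b", "value": "2"}',
])

for cell in parse_cells(sample):
    print(cell["row"], cell["qualifier"], cell["value"])
```

In the job itself this would typically run distributed, e.g. `hbase_rdd.flatMapValues(lambda v: v.splitlines()).mapValues(json.loads)`, rather than after collect().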
Submit the job with: spark-submit --master local /tmp/hbasetest.py. The output looks like: