一、Create the HBase table
In the hbase shell, create the test_table table with the following command:
hbase> create 'test_table','info'
Connecting the hbase shell to a specified cluster
By default, the hbase shell startup script uses the configuration directory under $HBASE_HOME; you can override this location to connect to a different cluster.
Create a separate directory containing an hbase-site.xml file, set the hbase.zookeeper.quorum property in it to point to the other cluster, and then start the shell like this:
HBASE_CONF_DIR="/<your-other-config-dir>/" hbase shell
Note that you must specify a whole directory, not just the hbase-site.xml file.
二、Configure Spark
1、Configure the Spark environment
Copy a number of jar files from HBase's lib directory into Spark; these are the jars that need to be on the classpath when programming against HBase. The files to copy are: all jar files whose names start with hbase, plus guava-12.0.1.jar, htrace-core-3.1.0-incubating.jar, and protobuf-java-2.5.0.jar:
cd /usr/local/spark/jars
mkdir hbase
cd hbase
cp /usr/local/hbase/lib/hbase*.jar ./
cp /usr/local/hbase/lib/guava-12.0.1.jar ./
cp /usr/local/hbase/lib/htrace-core-3.1.0-incubating.jar ./
cp /usr/local/hbase/lib/protobuf-java-2.5.0.jar ./
Note: Spark 2.0 no longer ships the converter jar that turns HBase data into a format Python can read, so it has to be downloaded separately.
Download the spark-example-1.6.0.jar package.
Copy the downloaded jar into the hbase folder under Spark's jars directory:
mkdir -p /usr/local/spark/jars/hbase/
mv ~/下载/spark-examples* /usr/local/spark/jars/hbase/
Then edit Spark's spark-env.sh file so that Spark knows where to find the HBase-related jar files:
cd /usr/local/spark/conf
vim spark-env.sh
Once spark-env.sh is open, add the following line at the top of the file:
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/jars/hbase/*
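To confirm the classpath change took effect, one quick check is to ask the driver JVM to load an HBase class. This is only a sketch for debugging: it relies on pyspark's internal _jvm handle, which is not a public API.
# Run after restarting Spark with the new spark-env.sh. If the HBase jars are
# on the classpath, Class.forName succeeds; otherwise py4j raises an error.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("classpath_check").getOrCreate()
jvm = spark.sparkContext._jvm  # internal py4j gateway, fine for a quick check
print(jvm.java.lang.Class.forName("org.apache.hadoop.hbase.client.Result"))
spark.stop()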
2、Alternatively, pass the HBase-related jars with the --jars option when submitting the job. The shell substitution in the --jars line below expands every jar in the given directory and joins the paths with commas, which is the format --jars expects:
spark-submit --master yarn \
  --deploy-mode client \
  --queue queue_name \
  --driver-memory 4g \
  --executor-memory 15g \
  --num-executors 40 \
  --executor-cores 8 \
  --conf "spark.yarn.am.memory=2g" \
  --conf "spark.core.connection.ack.wait.timeout=600" \
  --conf "spark.driver.maxResultSize=1g" \
  --conf "spark.kryoserializer.buffer.max=1g" \
  --conf "spark.yarn.executor.memoryOverhead=4096" \
  --jars $(files=(/path/to/hbase/jars/*.jar); IFS=,; echo "${files[*]}") \
  spark_hbase_demo.py
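If you prefer to keep the jar list out of the launch command, the same thing can be done from inside the script by setting the standard spark.jars option before the session is created. A minimal sketch, assuming the jars were copied to /usr/local/spark/jars/hbase/ as above:
# Sketch: build the comma-separated jar list in Python and pass it via
# spark.jars instead of the --jars flag.
import glob
from pyspark.sql import SparkSession

hbase_jars = ",".join(glob.glob("/usr/local/spark/jars/hbase/*.jar"))
spark = (SparkSession.builder
         .appName("spark_hbase_demo")
         .config("spark.jars", hbase_jars)
         .getOrCreate())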
三、Read/write code
spark_hbase_demo.py:
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession


class SparkHbase(object):
    def __init__(self, zk_host, table_name):
        self.zk_host = zk_host
        self.table_name = table_name
        # Converters from the spark-examples jar that translate between
        # HBase's Java types and strings that Python can handle.
        self.read_key_conv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
        self.read_value_conv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
        self.input_format_class = "org.apache.hadoop.hbase.mapreduce.TableInputFormat"
        self.key_class = "org.apache.hadoop.hbase.io.ImmutableBytesWritable"
        self.value_class = "org.apache.hadoop.hbase.client.Result"
        self.read_conf = {"hbase.zookeeper.quorum": self.zk_host,
                          "hbase.mapreduce.inputtable": self.table_name}
        self.write_key_conv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
        self.write_value_conv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
        self.write_conf = {"hbase.zookeeper.quorum": self.zk_host,
                           "hbase.mapred.outputtable": self.table_name,
                           "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
                           "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                           "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

    def test_read(self):
        """Read the whole table as an RDD of (rowkey, cell string) pairs and print it."""
        spark = SparkSession.builder.appName("test_read_hbase").getOrCreate()
        sc = spark.sparkContext
        hbase_rdd = sc.newAPIHadoopRDD(self.input_format_class, self.key_class, self.value_class,
                                       keyConverter=self.read_key_conv,
                                       valueConverter=self.read_value_conv,
                                       conf=self.read_conf)
        output = hbase_rdd.collect()
        print(output)
        spark.stop()

    def test_write(self):
        """Write a few test rows into the table."""
        spark = SparkSession.builder \
            .appName("test_write_hbase").getOrCreate()
        sc = spark.sparkContext
        test_data = ['1,info,name,Rongcheng', '2,info,name,Guanhua']
        # Format expected by the write converters:
        # (rowkey, [rowkey, column family, column name, value])
        sc.parallelize(test_data).map(lambda x: (x.split(',')[0], x.split(','))) \
            .saveAsNewAPIHadoopDataset(conf=self.write_conf,
                                       keyConverter=self.write_key_conv,
                                       valueConverter=self.write_value_conv)
        spark.stop()


if __name__ == "__main__":
    zk_host = "localhost"
    table_name = "test_table"
    spark_hbase = SparkHbase(zk_host, table_name)
    spark_hbase.test_write()
    spark_hbase.test_read()
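Submit the script with the spark-submit command from section 二. The value side of each record returned by test_read comes from HBaseResultToStringConverter, which in the spark-examples jar used here renders each cell as a small JSON object, one per line. A minimal parsing sketch, assuming that output format:
# Turn one converter string into a list of cell dicts. The field names
# ("row", "columnFamily", "qualifier", "value", ...) are those emitted by the
# spark-examples converter; check your jar's version if they differ.
import json

def parse_result(value_str):
    return [json.loads(line) for line in value_str.split("\n")]

# Inside test_read this could be applied as: hbase_rdd.mapValues(parse_result)
sample = '{"row":"1","columnFamily":"info","qualifier":"name","value":"Rongcheng"}'
print(parse_result(sample))  # [{'row': '1', 'columnFamily': 'info', ...}]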