1. Create an ./hbase directory under spark/jars, then copy the following jar packages into it:
cd /usr/local/spark/jars
mkdir hbase
cd hbase
cp /usr/local/hbase/lib/hbase*.jar ./
cp /usr/local/hbase/lib/guava-12.0.1.jar ./
cp /usr/local/hbase/lib/htrace-core-3.1.0-incubating.jar ./
cp /usr/local/hbase/lib/protobuf-java-2.5.0.jar ./
cp $HIVE_HOME/lib/hive-hbase-handler-0.13.1.jar ./
Then distribute this hbase directory to the same path on every node.
2. Edit spark-env.sh and add the following line (on every node):
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark/jars/hbase/*
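With the classpath set, one quick way to check that Spark can actually see the HBase classes is to ask the driver JVM to load TableInputFormat. A minimal sketch; sc._jvm is PySpark's internal py4j handle, and the 'local' master here is only for the check:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster('local').setAppName('ClasspathCheck')
sc = SparkContext(conf=conf)
# raises an error (wrapping a ClassNotFoundException) if the HBase jars
# are not picked up from SPARK_DIST_CLASSPATH
sc._jvm.java.lang.Class.forName("org.apache.hadoop.hbase.mapreduce.TableInputFormat")
print("HBase classes are visible to Spark")
sc.stop()

Note that the two pythonconverters classes used in the next step ship in the Spark examples jar, which also needs to be reachable on the classpath.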
3. Read the HBase table from PySpark:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setMaster('spark://master:7077').setAppName('SparkTest_Transformation')
sc = SparkContext(conf=conf)  # connect to the Spark cluster

host = 'localhost'
table = 'hbase_sql'
# use a distinct name so the SparkConf object above is not shadowed
hbase_conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=hbase_conf)
hbase_rdd.cache()  # cache before the actions so the HBase scan runs only once
count = hbase_rdd.count()
print(count)
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
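Each value string produced by HBaseResultToStringConverter holds one JSON object per cell, separated by newlines, at least in the converter shipped with recent Spark examples; a sketch of turning the records into Python dicts (the field names follow that converter):

import json

cells = hbase_rdd.flatMapValues(lambda v: v.split('\n')).mapValues(json.loads)
for rowkey, cell in cells.collect():
    print(rowkey, cell.get('qualifier'), cell.get('value'))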
A full-table scan through Spark like this feels slow, so it seems suited to batch statistical analysis. What about real-time queries?
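For real-time point lookups, it is usually better to query HBase directly by rowkey instead of scanning through Spark. A minimal sketch using the third-party happybase client over HBase's Thrift server (start it first with hbase thrift start; the rowkey below is a placeholder):

import happybase

# assumes the HBase Thrift server is listening on localhost:9090 (the default)
connection = happybase.Connection('localhost')
table = connection.table('hbase_sql')

# point lookup by rowkey: no full-table scan involved
row = table.row(b'row1')
for column, value in row.items():
    print(column, value)

connection.close()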