Preface
Recently I tried using pyspark 2.4.3 to read from and write to HBase 2.1, and ran into a few pitfalls along the way. I'm sharing them here.
Test procedure
The Linux environment had CDH 6 installed, with HBase 2.1 and Spark 2.2.0. A Python 3.5 virtual environment was created with Anaconda, and pyspark 2.4.3 was installed via pip. Start the pyspark shell and run the following Python code:
# Only SparkContext and SparkConf are needed for this HBase example;
# the Kafka/streaming imports from the original snippet are unused here.
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "0").set("spark.kryoserializer.buffer.max", "2040mb")
# Stop the SparkContext that the pyspark shell created automatically,
# then rebuild it with the configuration above.
sc.stop()
sc = SparkContext(appName='HBaseInputFormat', conf=conf)
host = "10.210.110.24,10.210.110.129,10.210.110.130"
table = 'leo01'
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
# Note: this rebinds the name `conf` (previously the SparkConf) to the
# Hadoop job configuration dict used when writing to HBase.
conf = {
    "hbase.zookeeper.quorum": host,
    "hbase.mapred.outputtable": table,
    "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
    "mapreduce.output.fileoutputformat.outputdir": "/tmp"}
# Rows to write, as CSV strings: rowkey, column family, qualifier, value
rawData = ['3,course,a100,200','4,course,chinese,90']
print
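The StringListToPutConverter configured above expects each RDD record as a pair of (rowkey, [rowkey, columnFamily, qualifier, value]), following the shape used in Spark's own hbase_outputformat example. As a minimal sketch (the helper name to_put_record is my own, not part of any API), the CSV strings in rawData can be reshaped like this before writing:

```python
def to_put_record(line):
    """Split 'rowkey,family,qualifier,value' into the shape expected
    by StringListToPutConverter: (rowkey, [rowkey, family, qualifier, value])."""
    parts = line.split(",")
    return (parts[0], parts)

rawData = ['3,course,a100,200', '4,course,chinese,90']
records = [to_put_record(line) for line in rawData]
print(records[0])  # ('3', ['3', 'course', 'a100', '200'])
```

On a cluster where the example converters are on the classpath, the write itself would then be sc.parallelize(rawData).map(to_put_record).saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv), which requires a running HBase and the spark-examples jar, so it is not shown as runnable code here.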