If anything here is wrong, feel free to leave a comment at any time. Thanks!
1. Reading Kudu with PySpark (works when the job is submitted on Linux)
# Start the PySpark shell from the Linux command line with the Kudu connector jar:
#   pyspark --jars /home/zwshi/kudu-spark2_2.11-1.6.0.jar
import pyspark
sqlContext = pyspark.sql.SQLContext(sc)  # create the SQL context from the shell's SparkContext `sc`
df = sqlContext.read.format('org.apache.kudu.spark.kudu').option('kudu.master', '127.0.0.1:7051').option('kudu.table', 'impala::test_db.test_table').load()  # read the Kudu table
df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', '127.0.0.1:7051').option('kudu.table', 'impala::test_db.test_table').mode('append').save()  # write to the Kudu table
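Where to download kudu-spark2_2.11: https://github.com/asarraf/KuduPyspark
When PySpark connects to Kudu, no operations beyond the read and append shown above can be performed on Kudu; when plain Python connects to Kudu, the usual insert, delete, update, and query operations are available.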
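The read and write calls above repeat the same two connector options. A minimal sketch of factoring them out follows. The helper names `kudu_options`, `read_kudu`, and `append_to_kudu` are hypothetical, and the sketch assumes the table was created through Impala (hence the `impala::` prefix) and that the session passed in was started with the kudu-spark jar.

```python
KUDU_FORMAT = 'org.apache.kudu.spark.kudu'  # data source name used in the calls above


def kudu_options(master, database, table):
    """Build the option dict for the Kudu data source (hypothetical helper)."""
    # Assumption: the table was created through Impala, so Kudu knows it
    # under the name "impala::<db>.<table>".
    return {
        'kudu.master': master,
        'kudu.table': 'impala::{}.{}'.format(database, table),
    }


def read_kudu(session, master, database, table):
    """Read a Kudu table into a DataFrame; `session` is a SparkSession or SQLContext."""
    opts = kudu_options(master, database, table)
    return session.read.format(KUDU_FORMAT).options(**opts).load()


def append_to_kudu(df, master, database, table):
    """Append the rows of `df` to an existing Kudu table."""
    opts = kudu_options(master, database, table)
    df.write.format(KUDU_FORMAT).options(**opts).mode('append').save()
```

In the pyspark shell this might be called as `df = read_kudu(sqlContext, '127.0.0.1:7051', 'test_db', 'test_table')`. Passing the options as a dict through `options(**opts)` also avoids mistyping one of the chained `.option(...)` calls.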
1.1 (2018.10.25) Addendum: when debugging a Kudu read on Windows from PyCharm using the method above, the following error is raised:
The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx-----
# coding:utf-8
import sys
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
reload(sys)  # Python 2 only: sys must be reloaded before setdefaultencoding can be called
sys.setdefaultencoding('utf-8')  # set the default encoding to 'utf-8'
os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars /home/zwshi/kudu-spark2_2.11-1.2.0-cdh5.10.2.jar pyspark-shell'
conf = SparkConf().setAppName("xxx").set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.read.format('org.apache.kudu.spark.kudu').option('kudu.master','127.0.0.1:7051').option('kudu.table', 'impala::test_db.test_table').load()
df.show()
Solution: set the following environment variables and the error goes away.
os.environ["YARN_CONF_DIR"] = '/xx/xxx/conf.cloudera.yarn2'
os.environ["HADOOP_CONF_DIR"] = '/xx/hadoop/conf'
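These variables are most likely only picked up when the JVM is launched, so it seems safest to set them before `SparkContext` is constructed. The sketch below gathers the settings from this section into one function. `prepare_spark_env` is a hypothetical name, and the paths in the usage note are the placeholders used above, not real locations.

```python
import os


def prepare_spark_env(jars, yarn_conf_dir, hadoop_conf_dir, environ=None):
    """Set the variables PySpark reads at launch (hypothetical helper).

    Intended to be called before the SparkContext is created.
    Pass a plain dict as `environ` to try it out without touching os.environ.
    """
    env = os.environ if environ is None else environ
    # --jars takes a comma-separated list; "pyspark-shell" must come last.
    env['PYSPARK_SUBMIT_ARGS'] = '--jars {} pyspark-shell'.format(','.join(jars))
    env['YARN_CONF_DIR'] = yarn_conf_dir
    env['HADOOP_CONF_DIR'] = hadoop_conf_dir
    return env
```

For the script in this section, that would be `prepare_spark_env(['/home/zwshi/kudu-spark2_2.11-1.2.0-cdh5.10.2.jar'], '/xx/xxx/conf.cloudera.yarn2', '/xx/hadoop/conf')`, placed ahead of `sc = SparkContext(conf=conf)`.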
2. Reading Kudu with Python (via Impala)
from impala.dbapi import connect
def fun2(row, is_test=True):
    # `row` is one tuple of column values; both branches currently point at the same host
    host = '192.168.0.1' if is_test else '192.168.0.1'
    conn = connect(host=host, port=25001, timeout=3600)
    cursor = conn.cursor()
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    cursor.execute(sql, row)
    cursor.close()
    conn.close()
data = [('0' , '1', "a", '23.0','a'), ('1','3', "C", '-23.0','a'), ('2','3', "A", '-21.0','a'), ('3','2', "B", '-19.0','a') ]
rdd = sc.parallelize(data)  # `sc` is a SparkContext created as in section 1
rdd2 = rdd.map(lambda x : fun2(x))
rdd2.collect()  # collect() forces the lazy map to run, which performs the inserts
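The job above opens a new Impala connection for every row. A possible variation is to insert a whole batch over one connection. It is sketched here under the assumption that the cursor follows the Python DB-API and accepts the same `%s` placeholders as above. `insert_rows` is a hypothetical helper that takes an already-open connection.

```python
INSERT_SQL = ('INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) '
              'values(%s,%s,%s,%s,%s)')


def insert_rows(conn, rows):
    """Insert every tuple in `rows` through one DB-API connection (hypothetical helper).

    Returns the number of rows handed to the cursor.
    """
    rows = list(rows)  # also accepts an iterator, e.g. one Spark partition
    cursor = conn.cursor()
    try:
        for row in rows:
            cursor.execute(INSERT_SQL, row)
    finally:
        cursor.close()  # release the cursor even if an insert fails
    return len(rows)
```

With Spark, one way to use this would be to open the connection inside a function passed to `rdd.foreachPartition`, call `insert_rows(conn, partition)`, and close the connection afterwards, so that each partition shares a single connection.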
References:
Introduction to Kudu and ways of operating on it: https://www.jianshu.com/p/d91761c63a45
The various ways of working with Kudu: https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
Kudu documentation home page: https://kudu.apache.org/docs/index.html