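While building a real-time data warehouse recently, I used two components: Spark Streaming and Kudu. Documentation was painfully scarce, and it took a fair amount of trial and error to get everything working, so I'm recording the pitfalls I ran into along the way.
First, create a Kudu table through Impala: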
create table kudu_appbind_test(
  md5 string,
  userid string,
  datetime_ string,
  time_ string,
  cardno string,
  flag string,
  cardtype string,
  primary key(md5, userid, datetime_)
)
stored as kudu;
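The statement above has no PARTITION BY clause, so, as far as I understand, the table is not spread across multiple tablets. For anything beyond a small test table, a hash-partitioned variant may be worth considering. The following is only a sketch: the table name kudu_appbind_test_hashed, the choice of md5 as the hash column, and the partition count of 4 are all assumptions for illustration, and older Impala releases used a different keyword for this clause.

```sql
-- Hypothetical variant of the table above with hash partitioning.
-- The table name, hash column, and partition count are illustrative only.
create table kudu_appbind_test_hashed(
  md5 string,
  userid string,
  datetime_ string,
  time_ string,
  cardno string,
  flag string,
  cardtype string,
  primary key(md5, userid, datetime_)
)
-- Partition columns must be part of the primary key.
partition by hash(md5) partitions 4
stored as kudu;
```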
Choosing dependencies
See the official Kudu documentation: https://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark
The documentation calls out several key points:
- Use the kudu-spark_2.10 artifact if using Spark with Scala 2.10. Note that Spark 1 is no longer supported in Kudu starting from version 1.6.0. So in order to use Spark 1 integrated with Kudu, version 1.5.0 is the latest to go to.
- Use kudu-spark2_2.11 artifact if using Spark 2 with Scala 2.11.
- kudu-spark versions 1.8.0 and below have slightly different syntax.
- Spark 2.2+ requires Java 8 at runtime even though Kudu Spark 2.x integration is Java 7 compatible. Spark 2.2 is the default dependency version as of Kudu 1.5.0.
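I'm using Spark 2.4.0, Scala 2.11, and Kudu 1.8.0, so the right choice is the kudu-spark2_2.11 artifact at version 1.8.0 (kudu-spark2_2.11-1.8.0.jar). The Maven configuration is as follows: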
<!-- https://mvnrepository.com/artifact/org.apache.kudu/kudu-spark2 -->
<dependency>
<groupId