Lessons from a Delta Lake 0.8.0 pitfall: when adopting a new version of a new framework, follow the community early and often...


1. The data lake "three musketeers"


    Anyone following big data trends knows that data lake storage engines have developed rapidly over the past few years, giving rise to the so-called "three musketeers" of data lakes: Delta Lake, Iceberg, and Hudi. All three bring a wealth of new capabilities to the data lake: ACID transactions, record-level inserts, updates, and deletes, unified storage for batch and streaming, scalable metadata management, schema enforcement and schema evolution, multi-version access via time travel, and more.
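    As a quick illustration of one of those capabilities, here is a minimal Scala sketch of reading a Delta table with time travel; the table path, version number, and timestamp are made up for the example:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("delta-time-travel-demo").getOrCreate()

// Read the current snapshot of a Delta table (the path is illustrative)
val latest = spark.read.format("delta").load("/delta/events")

// Time travel: read the same table as of an earlier version ...
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/delta/events")

// ... or as of a point in time
val asOf = spark.read.format("delta")
  .option("timestampAsOf", "2021-04-01")
  .load("/delta/events")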

 

    These capabilities address many of the stubborn pain points that enterprises run into when putting big data into production, which is exactly why so many companies are actively evaluating and trialing a data lake framework that fits their needs. My company has been doing this kind of evaluation and experimentation since 2020.

2. Our data lake adoption strategy

    There is plenty of material online comparing the three technically, so I will not rehash it here, but a few points are worth calling out:

  • In terms of compatibility and maturity with the Spark execution engine, Delta Lake is currently the best of the three (hardly surprising, since it was born at Databricks);

  • In terms of integration maturity with the Flink execution engine, Iceberg and Hudi are currently ahead, and Delta has not yet put much effort into this area. (My impression is that Flink is more popular in China than abroad, and that the Iceberg and Hudi communities have a higher share of Chinese contributors than Delta Lake does, which may be related to promotion by domestic vendors.)

  • Architecturally, Iceberg is widely regarded as the most elegant and forward-looking of the three, and is often held up against Hive for comparison.

 

    Since most of our big data workloads run on Spark (for a few latency-critical scenarios we have also begun exploring and rolling out Flink), and since Delta Lake, like Spark, comes from Databricks, it enjoys strong backing and a broad user base (from DBR 8.0 onward, new tables default to the Delta format), has the best Spark compatibility, and stores data in the open Parquet format, which keeps the door open for other execution engines later on. We therefore decided to adopt Delta Lake first. Once Iceberg matures further we may bring it in as well; the two can coexist without conflict. (All three ship as lightweight JARs, so the technical cost of adopting any of them is low.)


3. Exploring Delta Lake adoption at our company

    Like most companies, our business developers prefer a pure SQL style of development to keep things simple. The good news is that starting with version 0.7.0, Delta Lake supports pure SQL operations, both DDL and DML, on the Spark 3.x line.

    So with Spark 3.x, HiveExternalCatalog, and Delta Lake, we can stand up a modern lakehouse architecture and write business logic in pure SQL. Everyone loves SQL, so why not?!
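    For reference, here is a minimal Scala sketch of how such a session can be wired up programmatically. The two Delta settings are the same ones passed on the command line in the appendix below; the application name and the sample statements are illustrative:

import org.apache.spark.sql.SparkSession

// Build a session that stores table metadata in the Hive metastore
// (HiveExternalCatalog) and enables Delta Lake's SQL extensions.
val spark = SparkSession.builder()
  .appName("delta-lakehouse-demo")
  .enableHiveSupport()
  .config("spark.sql.extensions",
          "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Pure SQL DDL and DML against a Delta table
spark.sql("CREATE TABLE IF NOT EXISTS events (date DATE, eventId STRING) USING DELTA")
spark.sql("INSERT INTO events VALUES (current_date(), 'eventid1')")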

    

    Following the official documentation, I first paired delta-core_2.12-0.7.0.jar with spark-3.0.2-bin-hadoop2.7-hive1.2 and, through spark-shell and spark-sql, verified the various Scala APIs as well as SQL DDL and DML. Everything went smoothly.

Note: the compatibility between Delta Lake releases and Spark versions is documented in the official compatibility matrix (screenshot omitted).

Note: the Scala versions that the various Delta Lake and Spark releases are built against match up as follows (see the dependency sketch after this list):

  • Spark 3.x prebuilt binaries all use Scala 2.12;

  • Spark 2.x prebuilt binaries come in both Scala 2.12 and Scala 2.11 flavors;

  • Delta 0.7.0 and later is published only for Scala 2.12 (since it targets Spark 3.x);

  • Before 0.7.0, Delta was published for both Scala 2.11 and Scala 2.12.
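As a concrete example, here is a minimal build.sbt sketch for the combination used in the appendix below (Scala 2.12, Spark 3.0.x, Delta 0.8.0); the exact patch versions are illustrative, and "provided" assumes the Spark jars come from the cluster or the prebuilt distribution:

// build.sbt -- a minimal sketch; versions are the ones discussed in this post
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Spark itself is normally provided by the cluster / prebuilt distribution
  "org.apache.spark" %% "spark-sql"  % "3.0.2" % "provided",
  // %% appends the Scala binary version, resolving to delta-core_2.12
  "io.delta"         %% "delta-core" % "0.8.0"
)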


4. The Delta Lake 0.8.0 pitfall

On 2021/02/05, Delta Lake released version 0.8.0.

    Then on 2021/03/02, Spark released version 3.1.1 (note that the first official release in the Spark 3.1 line is 3.1.1, not 3.1.0).

    Seeing all the feature improvements in Spark 3.1 and Delta 0.8, I could not wait to pair delta-core_2.12-0.8.0.jar with spark-3.1.1-bin-hadoop2.7 and rerun the same verification tests, Scala APIs plus SQL DDL and DML, through spark-shell and spark-sql. And then the trouble started.

    As shown below, both SQL UPDATE and INSERT fail with java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>:

21/04/02 14:05:14 ERROR thriftserver.SparkSQLDriver: Failed in [insert into events values (current_date(),'eventid1','eventtype1','data1')]
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
    at org.apache.spark.sql.delta.DeltaAnalysis.$anonfun$resolveQueryColumns$1(DeltaAnalysis.scala:204)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    at scala.collection.immutable.List.map(List.scala:298)
    at org.apache.spark.sql.delta.DeltaAnalysis.org$apache$spark$sql$delta$DeltaAnalysis$$resolveQueryColumns(DeltaAnalysis.scala:192)
    at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:64)
    at org.apache.spark.sql.delta.DeltaAnalysis$$anonfun$apply$1.applyOrElse(DeltaAnalysis.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:221)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
    at org.apache.spark.sql.delta.DeltaAnalysis.apply(DeltaAnalysis.scala:61)
    at org.apache.spark.sql.delta.DeltaAnalysis.apply(DeltaAnalysis.scala:54)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
    at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
    at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
    at scala.collection.immutable.List.foldLeft(List.scala:89)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
    at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:98)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:67)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:381)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:500)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:494)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:494)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:284)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


(Note: the binary distribution spark-3.1.1-bin-hadoop2.7 published on the Spark website bundles Hive 2.3.7, unlike spark-3.0.2-bin-hadoop2.7-hive1.2, which bundles Hive 1.2.1. Keep this in mind.)

    I went back and rechecked the Spark compatibility notes on the Delta website, and on the face of it Delta 0.8.0 should support Spark 3.1.1.

    So I spent more than a day trying everything I could think of: different DDL and DML statements (Hive's STORED AS syntax, Spark's USING syntax, Delta tables with different column names and types, different INSERT and UPDATE statements, and so on) as well as different Scala update APIs. Nothing fixed it.

    Finally, a Google search turned up a link (the delta-io/delta GitHub issue listed in the references below) describing exactly the same symptom I was seeing.

    The issue thread explains the root cause: when Delta Lake 0.8.0 was released, Spark 3.1.1 had not yet been officially published, so compatibility between the two could not be tested. Delta planned to ship a patch release on top of 0.8.0 to address this compatibility problem with Spark 3.1.1; all we had to do was wait a little while.
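    For the curious, the underlying mechanics are a classic binary-compatibility break, which the NoSuchMethodError signature above makes visible: Delta 0.8.0 was compiled against Spark 3.0's five-argument Alias constructor, and Spark 3.1 added another (defaulted) parameter to that constructor, so the exact bytecode signature Delta invokes no longer exists in the spark-3.1.1 jars even though the source-level API looks unchanged. The sketch below uses a made-up Widget class rather than Spark's real Alias to show the general pattern:

// "Library v1" -- what the caller was compiled against:
case class Widget(name: String)(val tag: Option[String])

// "Library v2" -- source-compatible thanks to the default argument, but the
// old Widget.<init>(String, Option) constructor is gone from the bytecode:
// case class Widget(name: String)(val tag: Option[String],
//                                 val extra: Seq[String] = Nil)

object BinaryCompatDemo {
  def main(args: Array[String]): Unit = {
    // Compiled against v1, this call emits Widget.<init>(String, Option).
    // Run the same class file against v2 jars and the JVM throws
    // java.lang.NoSuchMethodError, just as Delta's DeltaAnalysis rule did when
    // it invoked Spark 3.0's Alias constructor on a Spark 3.1.1 runtime.
    println(new Widget("w")(Some("t")).name)
  }
}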


5. How to avoid pitfalls when adopting a new version of a new framework

    In hindsight, that link is precisely where the Delta project tracks community-reported issues! After going around in circles and burning a great deal of time and energy, I found that the details of the problem, and what to do about it, had been described on the project's own site all along.

    If, before adopting a new version of a framework, we spent a bit more time following the community, checking the tracker for known, community-reported issues (most projects use Jira; Delta Lake uses GitHub issues) or joining the day-to-day discussion channels (mailing lists or Slack), we would take far fewer detours and step on far fewer mines.

6. Appendix: commands used for compatibility testing

Example commands for using Delta Lake through spark-shell and spark-sql:

/opt/spark-3.0.2-bin-hadoop2.7-hive1.2/bin/spark-shell --verbose --jars /opt/delta-core_2.12-0.8.0.jar --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"


/opt/spark-3.0.2-bin-hadoop2.7-hive1.2/bin/spark-sql --verbose --jars /opt/delta-core_2.12-0.8.0.jar --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

The SQL DDL and DML test statements:

-- DDL: create a Delta table in the metastore
CREATE OR REPLACE TABLE events (
  date DATE,
  eventId STRING,
  eventType STRING,
  data STRING)
USING DELTA;

-- DDL: create a Parquet table in the metastore
CREATE TABLE events_parquet (date DATE, eventId STRING, eventType STRING, data STRING) USING parquet;

-- DML: insert records into the Parquet table in the metastore
insert into events_parquet values (current_date(), 'eventid1', 'eventtype1', 'data1');

-- DML: insert records into the Delta table in the metastore
insert into events values (current_date(), 'eventid1', 'eventtype1', 'data1');

-- DML: update records in the Parquet table in the metastore; this should fail
UPDATE events_parquet SET eventType = 'eventtypeT' WHERE eventType = 'eventtype1';

-- DML: update records in the Delta table in the metastore; this should succeed
UPDATE events SET eventType = 'eventtypeT' WHERE eventType = 'eventtype1';


The Scala API test commands:

// spark-shell, Scala API
sql("insert into delta.`/delta/events` select * from events_parquet").show

import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, "/data/events/")

// predicate and update expressions using SQL formatted strings
deltaTable.updateExpr(
  "eventType = 'clck'",
  Map("eventType" -> "'click'"))

import org.apache.spark.sql.functions._
import spark.implicits._

// predicate using Spark SQL functions and implicits
deltaTable.update(
  col("eventType") === "clck",
  Map("eventType" -> lit("click")))

7. References

  • https://docs.delta.io/latest/index.html

  • https://github.com/delta-io/delta

  • https://github.com/delta-io/delta/issues/594

  • http://spark.apache.org/news/index.html

  • http://spark.apache.org/downloads.html

  • https://databricks.com/blog/2021/03/02/introducing-apache-spark-3-1.html

