Spark writing to a specified external table: how to insert a Spark Structured Streaming DataFrame into a Hive external table/location?

A question on Spark Structured Streaming integration with a Hive table.

I have been trying out some Spark Structured Streaming examples.

Here is my example:

val spark = SparkSession.builder().appName("StatsAnalyzer")
  .enableHiveSupport()
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
  .getOrCreate()

// Register the dataframe as a Hive table
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta")
csvDF.createOrReplaceTempView("updates")
val query = spark.sql("insert into table_abcd select * from updates")
query.writeStream.start()

As you can see, in the last step, while writing the DataFrame to the HDFS location, the data is not getting inserted into the existing directory (my existing directory already holds some old data partitioned by "age").

I am getting:

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()

Can you help me understand why I am not able to insert data into the existing directory at the HDFS location? Or is there any other way I can do an "insert into" operation on the Hive table?

Looking for a solution

Solution

Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.

scala> println(spark.version)
2.4.0

scala> val sq = spark.readStream.format("rate").load

scala> :type sq
org.apache.spark.sql.DataFrame

scala> assert(sq.isStreaming)

scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
  ... 49 elided

If a target system (aka sink) is not supported, you could use the foreach and foreachBatch operations (highlighting mine):

The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.

I think foreachBatch is your best bet.

import org.apache.spark.sql.DataFrame

sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
  // do whatever you want with your input DataFrame
  // incl. writing to Hive
  // I simply decided to print out the rows to the console
  ds.show
}.start
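For the original question, that batch-level hook is where the Hive insert can go: inside foreachBatch the micro-batch is a plain DataFrame, so the ordinary batch writer applies. The following is only a sketch under the question's assumptions (a Hive table table_abcd with columns name and age, partitioned by "age", and dynamic partitioning enabled as in the asker's config); I have not run it against that setup.

import org.apache.spark.sql.DataFrame

// Sketch only: assumes table_abcd already exists in the Hive metastore with a
// (name, age) layout partitioned by "age", as described in the question.
csvDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .mode("append")           // keep the old data already in the directory
      .insertInto("table_abcd") // write each micro-batch into the existing Hive table
  }
  .start()

Note that insertInto matches columns by position rather than by name, so the streaming DataFrame's column order has to line up with the table definition; saveAsTable would be an alternative if you want Spark to create and own the table metadata instead.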

There is also the Apache Hive Warehouse Connector, which I've never worked with, but it seems like it may be of some help.
