spark conf - spark.sql.catalogImplementation

Does Spark SQL use Hive Metastore?


I am developing a Spark SQL application and I have a few questions:

  1. I have read that Spark SQL uses a Hive metastore under the covers. Is this true? I am talking about a pure Spark SQL application that does not explicitly connect to any Hive installation.
  2. I am starting a Spark SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark SQL is much faster than Hive, so I don't see any reason to use it. But am I correct?

apache-spark hive apache-spark-sql


edited Feb 18 '19 at 9:36 by Jacek Laskowski

asked May 9 '17 at 15:35 by user1888243

  • Spark bootstraps a pseudo-Metastore (an embedded Derby DB) for internal use, and optionally uses an actual Hive Metastore to read/write persistent Hadoop data. That does not mean Spark uses the Hive I/O libs, just the Hive metadata. – Samson Scharfrichter May 9 '17 at 19:19

  • Hello @SamsonScharfrichter, I have a case with an inconsistency between DESCRIBE DETAIL table and the data I retrieve from the Hive metastore DB. The case is described here. Do you know how a Delta table's location is determined? – abiratsis Mar 10 at 9:35


2 Answers


I read that Spark-SQL uses Hive metastore under the cover? Is this true? I'm talking about a pure Spark-SQL application that does not explicitly connect to any Hive installation.

Spark SQL does not have to use a Hive metastore under the covers: it defaults to an in-memory, non-Hive catalog, unless you are in spark-shell, which does the opposite.

The external catalog implementation is controlled by the internal spark.sql.catalogImplementation property, which can be one of two values: hive or in-memory.

Use the SparkSession to find out which catalog is in use.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> spark.version
res0: String = 2.4.0

scala> :type spark.sharedState.externalCatalog
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener

scala> println(spark.sharedState.externalCatalog.unwrapped)
org.apache.spark.sql.hive.HiveExternalCatalog@49d5b651

Please note that I used spark-shell, which starts a Hive-aware SparkSession by default (hence the HiveExternalCatalog above); to turn that off, start it with --conf spark.sql.catalogImplementation=in-memory.
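In a standalone application (rather than spark-shell), the same choice is made when building the SparkSession. A minimal sketch, assuming a local master for illustration:

```scala
import org.apache.spark.sql.SparkSession

// With enableHiveSupport() the session sets spark.sql.catalogImplementation
// to "hive"; omit that line and a plain application keeps the default
// in-memory catalog.
val spark = SparkSession.builder()
  .appName("catalog-demo")
  .master("local[*]")          // illustration only; normally set via spark-submit
  .enableHiveSupport()         // omit to keep the in-memory catalog
  .getOrCreate()

// Inspect which external catalog implementation ended up in use
println(spark.sharedState.externalCatalog.unwrapped.getClass.getName)
```

Note that enableHiveSupport() requires Spark to have been built with Hive support and the Hive classes to be on the classpath; otherwise the builder throws at getOrCreate().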

I am starting a Spark-SQL application, and have no need to use Hive. Is there any reason to use Hive? From what I understand Spark-SQL is much faster than Hive; so, I don't see any reason to use Hive.

That's a very interesting question that can have different answers (some even primarily opinion-based, so we have to be extra careful and follow the StackOverflow rules).

Is there any reason to use Hive?

No.

But... if you want to use the cost-based optimizer, a very recent feature of Spark 2.2, you may want to consider Hive: ANALYZE TABLE for cost statistics can be fairly expensive, so computing the statistics once for tables that are used over and over again across different Spark application runs can give a performance boost.

Please note that Spark SQL without Hive can do this too, with one limitation: the local default metastore supports only single-user access, so reusing the metadata across Spark applications submitted at the same time won't work.
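As a sketch of the statistics workflow mentioned above (the table and column names here are made up for illustration):

```scala
// Compute cost-based-optimizer statistics once, so that later queries
// (possibly from other applications, when a shared Hive metastore is used)
// can reuse them instead of recomputing.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")                  // table-level stats
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS id")   // column-level stats

// The collected statistics appear in the extended table description
spark.sql("DESCRIBE EXTENDED sales").show(truncate = false)
```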

I don't see any reason to use Hive.

I wrote a blog post, Why is Spark SQL so obsessed with Hive?! (after just a single day with Hive), where I asked a similar question, and to my surprise it is only now (almost a year after I posted it on Apr 9, 2016) that I think I understand why the concept of a Hive metastore is so important, especially in multi-user Spark notebook environments.

Hive itself is just a data warehouse on HDFS, so it is not much use if you've got Spark SQL, but there are still some concepts Hive has done fairly well that are of much use in Spark SQL (until it fully stands on its own feet with a Hive-like metastore).


edited Feb 18 '19 at 9:36

answered Jan 9 '18 at 15:45 by Jacek Laskowski

  • You seem to be contradicting yourself when you say Spark does not use the Hive Metastore by default: jaceklaskowski.gitbooks.io/mastering-spark-sql/… . Also, here it says 'Spark will create a default local Hive metastore (using Derby) for you.' spark.apache.org/docs/latest/… – Gadam Oct 12 '19 at 4:47

  • It will only use it if your SparkSession was created with enableHiveSupport, so it's a deliberate choice, not a default setting. Spark SQL may or may not use Hive; it's up to the Spark dev. – Jacek Laskowski Oct 12 '19 at 9:33

  • @JacekLaskowski, it seems this behavior is different now. As of Spark 2.4.5, spark-shell starts the SparkSession with in-memory as the catalog. How can I configure it to use "hive" instead? – Nag Jul 27 at 18:57

  • A simple command such as spark-shell --conf spark.sql.catalogImplementation=hive didn't help – Nag Jul 27 at 18:57



It will connect to a Hive Metastore, or instantiate one if none is found, when you initialize a HiveContext() object or a spark-shell.

The main reason to use Hive is if you are reading HDFS data in from Hive's managed tables or if you want the convenience of selecting from external tables.

Remember that Hive is simply a lens for reading and writing HDFS files and not an execution engine in and of itself.
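As a sketch of that use case, assuming a Hive-enabled session (the database, table, and path names below are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical Hive-enabled session reading a Hive-managed table and
// selecting from an external table over existing HDFS files.
val spark = SparkSession.builder()
  .appName("hive-tables")
  .enableHiveSupport()
  .getOrCreate()

// Read a Hive managed table registered in the metastore
val orders = spark.table("warehouse.orders")

// Register an external table over existing HDFS files, then select from it
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS warehouse.logs (ts STRING, msg STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 'hdfs:///data/logs'""".stripMargin)

spark.sql("SELECT msg FROM warehouse.logs LIMIT 10").show()
```

Either way, the metastore only supplies table locations and schemas; Spark itself does the reading and the query execution.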


answered Jan 9 '18 at 16:02 by

Paul Back


  • Does it do the same for SQLContext object? Where is metadata managed for tables created with SQLContext or SparkSession.sql? – Gadam Oct 12 '19 at 4:49
