```
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: jdbc. Please find packages at http://spark.apache.org/third-party-projects.html
```
The question is almost a duplicate of *Why does format("kafka") fail with "Failed to find data source: kafka." with uber-jar?*, the difference being that the other OP used Apache Maven to create an uber-jar while here it's about sbt (the sbt-assembly plugin's configuration, to be precise).
The short name (aka alias) of a data source, e.g. `jdbc` or `kafka`, is only available if the corresponding `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file registers a `DataSourceRegister`.
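Under the hood this is Java's `ServiceLoader` mechanism. As a rough sketch (not Spark's exact code), alias resolution boils down to scanning all such registrations on the classpath and matching the requested alias against each provider's `shortName()`:

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Scan every META-INF/services registration on the classpath and
// match the requested alias against each provider's shortName().
val alias = "jdbc"
val providers = ServiceLoader.load(classOf[DataSourceRegister]).asScala
providers.find(_.shortName().equalsIgnoreCase(alias)) match {
  case Some(provider) => println(s"Alias '$alias' -> ${provider.getClass.getName}")
  case None           => println(s"Failed to find data source: $alias")
}
```

If the service files never make it into the uber-jar, the loader finds nothing and you get the exception above.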
For the `jdbc` alias to work, Spark SQL uses a `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file with the following entry (there are others):

```
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
```

That's what ties the `jdbc` alias to the data source.
And you've excluded it from the uber-jar with the following `assemblyMergeStrategy`:

```scala
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
```
Note the `case PathList("META-INF", xs @ _*)` case, which simply applies `MergeStrategy.discard` to every `META-INF` entry. That's the root cause.
Just to check that the "infrastructure" is available and you could use the `jdbc` data source by its fully-qualified name (not the alias), try this:

```scala
spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  load("jdbc:postgresql://localhost/testdb")
```
You will see other problems due to missing options like `url`, but...we're digressing.
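For completeness, a minimal sketch of what a read with the required options could look like (the database URL and `dbtable` value here are hypothetical; adjust for your setup):

```scala
// The url and dbtable below are placeholders for your own database.
val jdbcDF = spark.read.
  format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider").
  option("url", "jdbc:postgresql://localhost/testdb").
  option("dbtable", "people").
  load()
```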
A solution is to `MergeStrategy.concat` all `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` files (that would create an uber-jar with all data sources, incl. the `jdbc` data source):

```scala
case "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister" => MergeStrategy.concat
```
The `kafka` data source is an external module and is not available to Spark applications by default. You have to define it as a dependency in your `pom.xml` (as you have done), but that is just the very first step towards having it in your Spark application.
```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
```
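Since this question is about sbt, the equivalent in `build.sbt` would be:

```scala
// %% appends the Scala binary version (_2.11) to the artifact name.
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
```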
With that dependency you have to decide whether you want to create a so-called uber-jar that has all the dependencies bundled together (which results in a fairly big jar file and makes submission take longer) or to use the `--packages` (or the less flexible `--jars`) option to add the dependency at `spark-submit` time.

(There are other options, like storing the required jars on Hadoop HDFS or using Hadoop distribution-specific ways of defining dependencies for Spark applications, but let's keep things simple.)

I'd recommend using `--packages` first, and only once that works should you consider the other options.
Use `spark-submit --packages` to include the spark-sql-kafka-0-10 module as follows:

```
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0
```

Include the other command-line options as you wish.
Uber-Jar Approach
Including all the dependencies in a so-called uber-jar may not always work due to how `META-INF` directories are handled. For the `kafka` data source (and data sources in general) to work, you have to ensure that the `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` files of all the data sources are merged, not dropped by `replace`, `first`, or whatever other strategy you use.
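One way to verify the merge worked is to read the service file straight out of the assembled jar (the jar path below is hypothetical; adjust it to your build):

```scala
import java.util.jar.JarFile
import scala.io.Source

// Hypothetical path to your assembled uber-jar.
val jar = new JarFile("target/scala-2.11/myapp-assembly.jar")
val name = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"
val entry = jar.getEntry(name)
// After a successful concat you should see one provider class per line,
// e.g. JdbcRelationProvider and KafkaSourceProvider among others.
Source.fromInputStream(jar.getInputStream(entry)).getLines().foreach(println)
```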
The `kafka` data source uses its own `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file that registers `org.apache.spark.sql.kafka010.KafkaSourceProvider` as the data source provider for the `kafka` format.
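Once that registration makes it into the uber-jar, the `kafka` alias resolves as expected. A quick sketch of a check (the broker address and topic name are placeholders):

```scala
// The bootstrap servers and topic below are placeholders.
val records = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "localhost:9092").
  option("subscribe", "mytopic").
  load()
```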