How Spark SQL reads a database through JDBC

sf_www

Published 2023-07-21 16:06:19


1. This article traces what happens when a database is read with sparkSession.read().format("jdbc").load().

sparkSession.read().format("jdbc")
        .option("driver", "xxx")
        .option("url", "xxx")
        .option("user", "xxx")
        .option("password", "xxx")
        .option("dbtable", "dbtable")
        .option("fetchsize", 100)
        .option("numPartitions", 1)
        .option("customSchema", "ID DECIMAL(38,0),NAME STRING")
        .load().show();

Notes on the options:

dbtable: either a table name or a query. A query must be wrapped in parentheses and given a table alias, for example:

(select x1,x2... from xxx ) a
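The parentheses and the alias are needed because the dbtable value is placed after FROM in the SQL that gets sent to the database, so a query has to look like a derived table. The sketch below illustrates this under assumptions: the helper names asDerivedTable and selectFrom, the users table, and the exact statement shape are all hypothetical and are not Spark APIs.

```java
// Illustrative sketch only: why a query used as "dbtable" needs parentheses and an alias.
final class DbtableSketch {

    // Wrap a query so it can stand wherever a plain table name is expected.
    static String asDerivedTable(String query, String alias) {
        return "(" + query.trim() + ") " + alias;
    }

    // Assumed shape of a statement built around the dbtable value (not Spark's code).
    static String selectFrom(String dbtable, String columns) {
        return "SELECT " + columns + " FROM " + dbtable;
    }

    public static void main(String[] args) {
        String dbtable = asDerivedTable("select id, name from users where id > 10", "a");
        // prints: SELECT id,name FROM (select id, name from users where id > 10) a
        System.out.println(selectFrom(dbtable, "id,name"));
    }
}
```

Without the alias, many databases reject a subquery in the FROM clause, which is why the string passed to option("dbtable", ...) has to carry both.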

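customSchema:

A custom schema to use when reading data through the JDBC connector, for example id DECIMAL(38, 0), name STRING. You can also specify only some of the columns and let the rest use the default type mapping, for example id DECIMAL(38, 0). Column names must match the corresponding column names of the JDBC table. This lets you choose the Spark SQL data type for a column instead of accepting the default. This option applies only to reads.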

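numPartitions:

The maximum number of partitions that can be used for parallelism when reading and writing tables. It also determines the maximum number of concurrent JDBC connections.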

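fetchsize:

Determines how many rows are fetched per round trip. It can improve performance with JDBC drivers whose default fetch size is low (for example, Oracle with 10 rows). This option applies only to reads.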

2. Walking through the source code

1) DataFrameReader.format

/**
   * Specifies the input data source format.
   *
   * @since 1.4.0
   */
  def format(source: String): DataFrameReader = {
    this.source = source
    this
  }

2) DataFrameReader.option

/**
   * Adds an input option for the underlying data source.
   *
   * You can set the following option(s):
   * <ul>
   * <li>`timeZone` (default session local timezone): sets the string that indicates a timezone
   * to be used to parse timestamps in the JSON/CSV datasources or partition values.</li>
   * </ul>
   *
   * @since 1.4.0
   */
  def option(key: String, value: String): DataFrameReader = {
    this.extraOptions += (key -> value)
    this
  }
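Both format and option only record state and return this, which is what makes the chained call style possible; nothing touches the database yet. A minimal Java sketch of that pattern follows. ReaderSketch and its methods are hypothetical stand-ins, and the case-insensitive view is only meant to mirror the caseInsensitiveOptions name that appears later in DataSource.resolveRelation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the builder pattern used by DataFrameReader.
final class ReaderSketch {
    private String source;                                   // set by format(...)
    private final Map<String, String> extraOptions = new LinkedHashMap<>();

    ReaderSketch format(String source) {
        this.source = source;                                // just remember the format name
        return this;                                         // return this to allow chaining
    }

    ReaderSketch option(String key, String value) {
        extraOptions.put(key, value);                        // options are stored as strings
        return this;
    }

    // Numeric overload: converts to a string and delegates.
    ReaderSketch option(String key, long value) {
        return option(key, Long.toString(value));
    }

    String source() {
        return source;
    }

    // Case-insensitive view of the options, so "fetchsize" and "fetchSize" are the same key.
    Map<String, String> caseInsensitiveOptions() {
        Map<String, String> view = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        view.putAll(extraOptions);
        return view;
    }

    public static void main(String[] args) {
        ReaderSketch reader = new ReaderSketch()
            .format("jdbc")
            .option("dbtable", "dbtable")
            .option("fetchsize", 100);
        System.out.println(reader.source());                                   // prints: jdbc
        System.out.println(reader.caseInsensitiveOptions().get("fetchSize"));  // prints: 100
    }
}
```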

3) DataFrameReader.load

/**
   * Loads input in as a `DataFrame`, for data sources that don't require a path (e.g. external
   * key-value stores).
   *
   * @since 1.4.0
   */
  def load(): DataFrame = {
    load(Seq.empty: _*) // force invocation of `load(...varargs...)`
  }

/**
   * Loads input in as a `DataFrame`, for data sources that support multiple paths.
   * Only works if the source is a HadoopFsRelationProvider.
   *
   * @since 1.6.0
   */
  @scala.annotation.varargs
  def load(paths: String*): DataFrame = {
    if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
      throw new AnalysisException("Hive data source can only be used with tables, you can not " +
        "read files of Hive data source directly.")
    }

    val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
    if (classOf[DataSourceV2].isAssignableFrom(cls)) {
      val ds = cls.newInstance()
      val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
        ds = ds.asInstanceOf[DataSourceV2],
        conf = sparkSession.sessionState.conf)
      val options = new DataSourceOptions((sessionOptions ++ extraOptions).asJava)

      // Streaming also uses the data source V2 API. So it may be that the data source implements
      // v2, but has no v2 implementation for batch reads. In that case, we fall back to loading
      // the dataframe as a v1 source.
      val reader = (ds, userSpecifiedSchema) match {
        case (ds: ReadSupportWithSchema, Some(schema)) =>
          ds.createReader(schema, options)

        case (ds: ReadSupport, None) =>
          ds.createReader(options)

        case (ds: ReadSupportWithSchema, None) =>
          throw new AnalysisException(s"A schema needs to be specified when using $ds.")

        case (ds: ReadSupport, Some(schema)) =>
          val reader = ds.createReader(options)
          if (reader.readSchema() != schema) {
            throw new AnalysisException(s"$ds does not allow user-specified schemas.")
          }
          reader

        case _ => null // fall back to v1
      }

      if (reader == null) {
        loadV1Source(paths: _*)
      } else {
        Dataset.ofRows(sparkSession, DataSourceV2Relation(reader))
      }
    } else {
      loadV1Source(paths: _*)
    }
  }

Look at this line:

val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)

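The source passed in is jdbc, so lookupDataSource reaches the branch that matches registered providers by short name: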

serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {

The class finally returned is therefore org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.

if (classOf[DataSourceV2].isAssignableFrom(cls)) evaluates to false, so execution goes straight to loadV1Source(paths: _*).
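The short-name match can be pictured with a small Java sketch. It is a simplification under assumptions: the Registered interface, the two provider classes, and the fixed list are hypothetical, and a plain list stands in for the service loader. The real lookup does more than this, so treat the sketch as an illustration of the filter step only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of resolving a data source class from its short name.
final class ProviderLookupSketch {

    interface Registered {
        String shortName();
    }

    static final class JdbcProvider implements Registered {
        public String shortName() { return "jdbc"; }
    }

    static final class CsvProvider implements Registered {
        public String shortName() { return "csv"; }
    }

    // Stand-in for the providers a service loader would discover.
    static final List<Registered> REGISTERED =
        Arrays.asList(new JdbcProvider(), new CsvProvider());

    static Class<?> lookup(String provider) {
        // Same idea as filter(_.shortName().equalsIgnoreCase(provider1)).
        List<Registered> matches = REGISTERED.stream()
            .filter(p -> p.shortName().equalsIgnoreCase(provider))
            .collect(Collectors.toList());
        if (matches.size() == 1) {
            return matches.get(0).getClass();
        }
        if (matches.isEmpty()) {
            throw new IllegalArgumentException("Failed to find data source: " + provider);
        }
        throw new IllegalArgumentException("Multiple sources found for " + provider);
    }

    public static void main(String[] args) {
        // The match ignores case, so "JDBC" and "jdbc" resolve to the same class.
        System.out.println(lookup("JDBC").getSimpleName()); // prints: JdbcProvider
    }
}
```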

4) DataFrameReader.loadV1Source

private def loadV1Source(paths: String*) = {
    // Code path for data source v1.
    sparkSession.baseRelationToDataFrame(
      DataSource.apply(
        sparkSession,
        paths = paths,
        userSpecifiedSchema = userSpecifiedSchema,
        className = source,
        options = extraOptions.toMap).resolveRelation())
  }

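Here paths is empty, and userSpecifiedSchema is None as well, because the example never calls schema(...); that method is typically called when reading files such as CSV or Parquet. Next, look at DataSource.resolveRelation.

5) DataSource.resolveRelation

This method creates a resolved BaseRelation that can be used to read data from or write data into this DataSource.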

/**
   * Create a resolved [[BaseRelation]] that can be used to read data from or write data into this
   * [[DataSource]]
   *
   * @param checkFilesExist Whether to confirm that the files exist when generating the
   *                        non-streaming file based datasource. StructuredStreaming jobs already
   *                        list file existence, and when generating incremental jobs, the batch
   *                        is considered as a non-streaming file based data source. Since we know
   *                        that files already exist, we don't need to check them again.
   */
  def resolveRelation(checkFilesExist: Boolean = true): BaseRelation = {
    val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
      // TODO: Throw when too much is given.
      case (dataSource: SchemaRelationProvider, Some(schema)) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
      case (dataSource: RelationProvider, None) =>
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
      case (_: SchemaRelationProvider, None) =>
        throw new AnalysisException(s"A schema needs to be specified when using $className.")
      case (dataSource: RelationProvider, Some(schema)) =>
        val baseRelation =
          dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
        if (baseRelation.schema != schema) {
          throw new AnalysisException(s"$className does not allow user-specified schemas.")
        }
        baseRelation

      // We are reading from the results of a streaming query. Load files from the metadata log
      // instead of listing them using HDFS APIs.
      case (format: FileFormat, _)
          if FileStreamSink.hasMetadata(
            caseInsensitiveOptions.get("path").toSeq ++ paths,
            sparkSession.sessionState.newHadoopConf()) =>
        val basePath = new Path((caseInsensitiveOptions.get("path").toSeq ++ paths).head)
        val tempFileCatalog = new MetadataLogFileIndex(sparkSession, basePath, None)
        val fileCatalog = if (userSpecifiedSchema.nonEmpty) {
          val partitionSchema = combineInferredAndUserSpecifiedPartitionSchema(tempFileCatalog)
          new MetadataLogFileIndex(sparkSession, basePath, Option(partitionSchema))
        } else {
          tempFileCatalog
        }
        val dataSchema = userSpecifiedSchema.orElse {
          format.inferSchema(
            sparkSession,
            caseInsensitiveOptions,
            fileCatalog.allFiles())
        }.getOrElse {
          throw new AnalysisException(
            s"Unable to infer schema for $format at ${fileCatalog.allFiles().mkString(",")}. " +
                "It must be specified manually")
        }

        HadoopFsRelation(
          fileCatalog,
          partitionSchema = fileCatalog.partitionSchema,
          dataSchema = dataSchema,
          bucketSpec = None,
          format,
          caseInsensitiveOptions)(sparkSession)

      // This is a non-streaming file based datasource.
      case (format: FileFormat, _) =>
        val allPaths = caseInsensitiveOptions.get("path") ++ paths
        val hadoopConf = sparkSession.sessionState.newHadoopConf()
        val globbedPaths = allPaths.flatMap(
          DataSource.checkAndGlobPathIfNecessary(hadoopConf, _, checkFilesExist)).toArray

        val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
        val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)

        val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
            catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
          val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
          new CatalogFileIndex(
            sparkSession,
            catalogTable.get,
            catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
        } else {
          new InMemoryFileIndex(
            sparkSession, globbedPaths, options, Some(partitionSchema), fileStatusCache)
        }

        HadoopFsRelation(
          fileCatalog,
          partitionSchema = partitionSchema,
          dataSchema = dataSchema.asNullable,
          bucketSpec = bucketSpec,
          format,
          caseInsensitiveOptions)(sparkSession)

      case _ =>
        throw new AnalysisException(
          s"$className is not a valid Spark SQL Data Source.")
    }

    relation match {
      case hs: HadoopFsRelation =>
        SchemaUtils.checkColumnNameDuplication(
          hs.dataSchema.map(_.name),
          "in the data schema",
          equality)
        SchemaUtils.checkColumnNameDuplication(
          hs.partitionSchema.map(_.name),
          "in the partition schema",
          equality)
      case _ =>
        SchemaUtils.checkColumnNameDuplication(
          relation.schema.map(_.name),
          "in the data schema",
          equality)
    }

    relation
  }

providingClass.newInstance() returns a JdbcRelationProvider, so the match takes the second case:

case (dataSource: RelationProvider, None) =>
dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

Execution then moves on to JdbcRelationProvider.createRelation.
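The first four cases of that match depend only on the provider's type and on whether a user schema is present. The Java sketch below mirrors that dispatch order with instanceof checks. Everything in it is a hypothetical stand-in (the two interfaces, the Relation class, JdbcLikeProvider, and the schema strings); it is meant to show why a provider without a user schema lands in the second case, not to reproduce Spark's code.

```java
import java.util.Optional;

// Hypothetical sketch of the (provider, userSpecifiedSchema) dispatch.
final class RelationDispatchSketch {

    static final class Relation {
        final String schema;
        Relation(String schema) { this.schema = schema; }
    }

    interface RelationProvider {
        Relation createRelation();
    }

    interface SchemaRelationProvider {
        Relation createRelation(String schema);
    }

    // A provider that, like the JDBC one in this walkthrough, derives its own schema.
    static final class JdbcLikeProvider implements RelationProvider {
        public Relation createRelation() { return new Relation("ID DECIMAL(38,0),NAME STRING"); }
    }

    static Relation resolve(Object provider, Optional<String> userSchema) {
        // Case 1: provider accepts a schema and one was given.
        if (provider instanceof SchemaRelationProvider && userSchema.isPresent()) {
            return ((SchemaRelationProvider) provider).createRelation(userSchema.get());
        }
        // Case 2: provider needs no schema and none was given.
        if (provider instanceof RelationProvider && !userSchema.isPresent()) {
            return ((RelationProvider) provider).createRelation();
        }
        // Case 3: provider requires a schema but none was given.
        if (provider instanceof SchemaRelationProvider) {
            throw new IllegalStateException("A schema needs to be specified");
        }
        // Case 4: schema given to a provider that derives its own; they must agree.
        if (provider instanceof RelationProvider) {
            Relation relation = ((RelationProvider) provider).createRelation();
            if (!relation.schema.equals(userSchema.get())) {
                throw new IllegalStateException("does not allow user-specified schemas");
            }
            return relation;
        }
        throw new IllegalStateException("not a valid data source");
    }

    public static void main(String[] args) {
        Relation relation = resolve(new JdbcLikeProvider(), Optional.empty());
        System.out.println(relation.schema); // prints: ID DECIMAL(38,0),NAME STRING
    }
}
```

The fourth case is the reason a mismatching schema passed through schema(...) would be rejected for such a provider, while the customSchema option travels as an ordinary option instead.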

6) JdbcRelationProvider.createRelation

override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    import JDBCOptions._

    val jdbcOptions = new JDBCOptions(parameters)
    val partitionColumn = jdbcOptions.partitionColumn
    val lowerBound = jdbcOptions.lowerBound
    val upperBound = jdbcOptions.upperBound
    val numPartitions = jdbcOptions.numPartitions

    val partitionInfo = if (partitionColumn.isEmpty) {
      assert(lowerBound.isEmpty && upperBound.isEmpty, "When 'partitionColumn' is not specified, " +
        s"'$JDBC_LOWER_BOUND' and '$JDBC_UPPER_BOUND' are expected to be empty")
      null
    } else {
      assert(lowerBound.nonEmpty && upperBound.nonEmpty && numPartitions.nonEmpty,
        s"When 'partitionColumn' is specified, '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', and " +
          s"'$JDBC_NUM_PARTITIONS' are also required")
      JDBCPartitioningInfo(
        partitionColumn.get, lowerBound.get, upperBound.get, numPartitions.get)
    }
    val parts = JDBCRelation.columnPartition(partitionInfo)
    JDBCRelation(parts, jdbcOptions)(sqlContext.sparkSession)
  }

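This is the core of the JDBC path, and the overall logic is simple: it parses the options, builds the partition information (null when no partitionColumn is given), and returns a JDBCRelation. Back in DataSource.resolveRelation, the column names are checked for duplicates and the relation is returned. Finally, sparkSession.baseRelationToDataFrame turns it into a DataFrame.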
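To make the role of partitionColumn, lowerBound, upperBound and numPartitions more concrete, here is a simplified Java sketch of a stride-based split into per-partition WHERE clauses. It reflects my reading of the idea behind JDBCRelation.columnPartition rather than a copy of it: the class and method names are hypothetical, the clause format is an assumption, and an empty string is used to mean "no predicate".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical sketch of splitting a numeric range into partition predicates.
final class PartitionSketch {

    static List<String> whereClauses(String column, long lower, long upper, int requested) {
        if (lower > upper) {
            throw new IllegalArgumentException("lowerBound is larger than upperBound");
        }
        List<String> clauses = new ArrayList<>();
        // With one partition (or an empty range) the whole table is read by a single query.
        if (requested <= 1 || lower == upper) {
            clauses.add("");
            return clauses;
        }
        // Never create more partitions than there are distinct values in the range.
        long n = Math.min(requested, upper - lower);
        long stride = upper / n - lower / n;
        long current = lower;
        for (long i = 0; i < n; i++) {
            String lowerPart = (i != 0) ? column + " >= " + current : null;
            current += stride;
            String upperPart = (i != n - 1) ? column + " < " + current : null;
            if (upperPart == null) {
                // Last partition is open-ended (or the only one, with no predicate at all).
                clauses.add(lowerPart == null ? "" : lowerPart);
            } else if (lowerPart == null) {
                // First partition also picks up rows where the column is NULL.
                clauses.add(upperPart + " or " + column + " is null");
            } else {
                clauses.add(lowerPart + " AND " + upperPart);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // prints four clauses: id < 25 or id is null / id >= 25 AND id < 50 /
        // id >= 50 AND id < 75 / id >= 75
        whereClauses("id", 0, 100, 4).forEach(System.out::println);
    }
}
```

Note that the bounds only decide how the range is cut; the first and last clauses are open-ended, so rows outside the bounds are still read. With numPartitions set to 1, as in the example at the top, there is a single partition and no predicate.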

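7) DataFrame.show

Whether you call show or write the DataFrame somewhere else, execution eventually calls the buildScan method of JDBCRelation, and that is where the database is actually read. JDBCRelation implements PrunedFilteredScan, which is worth reading next.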
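A pruned, filtered scan receives the columns the query needs and the filters that can be pushed down, and a JDBC source can fold both into the SQL it sends. The Java sketch below shows one plausible way to assemble such a statement. It is an assumption-laden illustration: buildQuery is a hypothetical helper, filters are taken as ready-made SQL strings, and identifier quoting and dialect handling are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of turning pruned columns and pushed filters into one SQL statement.
final class ScanQuerySketch {

    static String buildQuery(String dbtable,
                             List<String> requiredColumns,
                             List<String> pushedFilters,
                             String partitionClause) {
        // If no column is needed (for example a count), select a constant instead.
        String columns = requiredColumns.isEmpty() ? "1" : String.join(",", requiredColumns);

        // Combine pushed-down filters with this partition's own predicate, if any.
        List<String> predicates = new ArrayList<>(pushedFilters);
        if (partitionClause != null && !partitionClause.isEmpty()) {
            predicates.add(partitionClause);
        }
        String where = predicates.isEmpty()
            ? ""
            : " WHERE " + predicates.stream()
                .map(p -> "(" + p + ")")
                .collect(Collectors.joining(" AND "));

        return "SELECT " + columns + " FROM " + dbtable + where;
    }

    public static void main(String[] args) {
        // prints: SELECT ID,NAME FROM (select id, name from users) a WHERE (ID > 10)
        System.out.println(buildQuery(
            "(select id, name from users) a",
            Arrays.asList("ID", "NAME"),
            Collections.singletonList("ID > 10"),
            ""));
    }
}
```

Pushing the column list and predicates into the statement means the database does the pruning and filtering, so less data crosses the JDBC connection.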