For Spark external data sources, these are the classes to understand first (their method signatures are sketched right after this list):
BaseRelation: defines the schema of the data and represents the data as an RDD[Row]
RelationProvider: a provider of relations; it creates the BaseRelation
TableScan: reads the data and builds rows; returns all of the data (a full scan)
PrunedScan: column pruning
PrunedFilteredScan: column pruning plus filter pushdown
InsertableRelation: a relation that data can be written back into
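For orientation, these are the actual method signatures the traits declare in org.apache.spark.sql.sources (Spark 2.x); each buildScan variant receives progressively more pushdown information:

trait RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}

trait TableScan {
  def buildScan(): RDD[Row]
}

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}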
InsertableRelation comes with the following assumptions, which are worth noting:
/**
 * A BaseRelation that can be used to insert data into it through the insert method.
 * If overwrite in insert method is true, the old data in the relation should be overwritten with
 * the new data. If overwrite in insert method is false, the new data should be appended.
 *
 * InsertableRelation has the following three assumptions.
 *   1. It assumes that the data (Rows in the DataFrame) provided to the insert method
 *      exactly matches the ordinal of fields in the schema of the BaseRelation.
 *   2. It assumes that the schema of this relation will not be changed.
 *      Even if the insert method updates the schema (e.g. a relation of JSON or Parquet data may
 *      have a schema update after an insert operation), the new schema will not be used.
 *   3. It assumes that fields of the data provided in the insert method are nullable.
 *      If a data source needs to check the actual nullability of a field, it needs to do it in
 *      the insert method.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
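As a minimal, hypothetical sketch of honoring those assumptions (the class name TextInsertableRelation and the CSV output are made up for illustration, not a real Spark class):

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation that writes rows back out as CSV files under `path`.
class TextInsertableRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with InsertableRelation {

  override def schema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)

  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Assumption 1: columns arrive by ordinal, so write them positionally.
    // Assumption 3: treat every field as nullable unless checked right here.
    val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
    data.write.mode(mode).csv(path)
  }
}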
Let's take a quick look at the built-in JDBCRelation:
private[sql] case class JDBCRelation(
    parts: Array[Partition], jdbcOptions: JDBCOptions)(@transient val sparkSession: SparkSession)
  extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
It extends BaseRelation, which is an abstract class, so it has to implement BaseRelation's abstract members:
override def sqlContext: SQLContext = sparkSession.sqlContext

override val needConversion: Boolean = false

override val schema: StructType = {
  val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
  jdbcOptions.customSchema match {
    case Some(customSchema) => JdbcUtils.getCustomSchema(
      tableSchema, customSchema, sparkSession.sessionState.conf.resolver)
    case None => tableSchema
  }
}
…
Now let's implement a simple text data source of our own. First, the provider:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

/**
 * @Author: lih
 * @Date: 2019/8/2 11:40 PM
 * @Version 1.0
 */
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the caller does not supply a schema.
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  // Called when the caller supplies a schema via .schema(...).
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {
    parameters.get("path") match {
      case Some(p) => new TextDataSourceRelation(sqlContext, p, schema)
      case _ => throw new IllegalArgumentException("path is required ...")
    }
  }
}
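As a side note, if you want callers to use a short name instead of the full class name, Spark provides the DataSourceRegister trait. A sketch (the short name "text-demo" is made up), which additionally requires a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister resource file containing the line com.ztgx.datasourse.DefaultSource:

import org.apache.spark.sql.sources.DataSourceRegister

// Mixing this in lets callers write .format("text-demo").
class DefaultSource extends RelationProvider with SchemaRelationProvider with DataSourceRegister {
  override def shortName(): String = "text-demo"
  // ...createRelation implementations exactly as above...
}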
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

/**
 * @Author: lih
 * @Date: 2019/8/2 11:41 PM
 * @Version 1.0
 */
class TextDataSourceRelation(override val sqlContext: SQLContext,
                             path: String,
                             userSchema: StructType)
  extends BaseRelation with TableScan with Logging with Serializable {

  // Use the caller-supplied schema if there is one; otherwise fall back to a default.
  override def schema: StructType = {
    if (userSchema != null) {
      userSchema
    } else {
      StructType(
        StructField("id", LongType, false) ::
        StructField("name", StringType, false) ::
        StructField("gender", StringType, false) ::
        StructField("salar", LongType, false) ::
        StructField("comm", LongType, false) :: Nil
      )
    }
  }
  // TableScan: read everything, parse each line, and emit Rows.
  override def buildScan(): RDD[Row] = {
    // Logged at ERROR on purpose so it is visible regardless of the log level.
    logError("this is ruozedata custom buildScan...")

    // wholeTextFiles yields (fileName, fileContent) pairs; keep only the content.
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(_._2)
    val schemaField = schema.fields

    // Combine the raw lines with the schema: split each line, translate the
    // gender code into a readable value, then cast every column to its type.
    val rows = rdd.map(fileContent => {
      val lines = fileContent.split("\n")
      val data = lines.map(_.split(",").map(x => x.trim)).toSeq
      val result = data.map(x => x.zipWithIndex.map {
        case (value, index) =>
          val columnName = schemaField(index).name
          castTo(if (columnName.equalsIgnoreCase("gender")) {
            if (value == "0") {
              "男"
            } else if (value == "1") {
              "女"
            } else {
              "未知"
            }
          } else {
            value
          }, schemaField(index).dataType)
      })
      result.map(x => Row.fromSeq(x))
    })

    // Each file produced a Seq[Row]; flatten into a single RDD[Row].
    rows.flatMap(x => x)
  }
  // Cast a raw string value to the DataType declared in the schema.
  def castTo(value: String, dataType: DataType): Any = {
    dataType match {
      case _: DoubleType => value.toDouble
      case _: LongType => value.toLong
      case _: StringType => value
      case other => throw new IllegalArgumentException(s"Unsupported data type: $other")
    }
  }
}
Data:
1,li,0,100000,2000
2,zhang,1,20000,23223
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName).master("local[8]")
      .getOrCreate()

    // format() takes our provider class; the "path" given to load() lands in
    // the parameters map passed to createRelation.
    val df = spark
      .read
      .format("com.ztgx.datasourse.DefaultSource")
      .load("file:///Users/mac/Desktop/1.txt")

    df.show()
    spark.stop()
  }
}
Result:
+---+-----+------+------+-----+
| id| name|gender| salar| comm|
+---+-----+------+------+-----+
| 1| li| 男|100000| 2000|
| 2|zhang| 女| 20000|23223|
+---+-----+------+------+-----+
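The TableScan version above always materializes every column. To support column pruning we would implement PrunedScan instead; here is a rough, untested sketch (the class name PrunedTextRelation is made up, and the type casting from castTo above is omitted, so it assumes an all-string schema):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.StructType

// Hypothetical pruning variant: Spark hands us only the column names it needs.
class PrunedTextRelation(override val sqlContext: SQLContext,
                         path: String,
                         userSchema: StructType)
  extends BaseRelation with PrunedScan with Serializable {

  override def schema: StructType = userSchema

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // Position of each requested column within the full schema.
    val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
    sqlContext.sparkContext.textFile(path).map { line =>
      val all = line.split(",").map(_.trim)
      // Keep only the requested columns, in the requested order.
      Row.fromSeq(indices.map(all(_)).toSeq)
    }
  }
}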