Spark custom external data sources

For Spark external data sources, you first need to understand these classes (their condensed signatures are sketched right after this list):

BaseRelation: defines the schema of the data and turns our data into an RDD[Row]
RelationProvider: the provider of a relation; it creates the BaseRelation
TableScan: reads the data and builds rows, i.e. a full scan that returns all of the data
PrunedScan: column pruning
PrunedFilteredScan: column pruning plus filter pushdown

InsertableRelation: a relation that data can be written back into
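
For reference, these interfaces live in org.apache.spark.sql.sources. Condensed from the Spark 2.x source (member lists abbreviated, not a complete listing):

abstract class BaseRelation {
  def sqlContext: SQLContext
  def schema: StructType
  def needConversion: Boolean = true
  // ...
}

trait RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}

trait TableScan {
  def buildScan(): RDD[Row]
}

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}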

InsertableRelation comes with the following assumptions, which are worth noting (from its Scaladoc):
/**
 * A BaseRelation that can be used to insert data into it through the insert method.
 * If overwrite in insert method is true, the old data in the relation should be overwritten with
 * the new data. If overwrite in insert method is false, the new data should be appended.
 *
 * InsertableRelation has the following three assumptions.
 *   1. It assumes that the data (Rows in the DataFrame) provided to the insert method
 *      exactly matches the ordinal of fields in the schema of the BaseRelation.
 *   2. It assumes that the schema of this relation will not be changed.
 *      Even if the insert method updates the schema (e.g. a relation of JSON or Parquet data may
 *      have a schema update after an insert operation), the new schema will not be used.
 *   3. It assumes that fields of the data provided in the insert method are nullable.
 *      If a data source needs to check the actual nullability of a field, it needs to do it in the
 *      insert method.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
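
As a minimal sketch of how an implementation might honor those assumptions (this class is not part of the post; the CSV output format and constructor are illustrative assumptions):

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation backed by a directory of CSV files.
class CsvInsertableRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with InsertableRelation {

  override def schema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)

  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Assumption 1: columns are matched by ordinal, not by name, so the caller
    // must hand us a DataFrame whose columns are already in schema order.
    val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
    data.write.mode(mode).csv(path)
  }
}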

Now let's take a quick look at JDBCRelation:

private[sql] case class JDBCRelation(
    parts: Array[Partition], jdbcOptions: JDBCOptions)(@transient val sparkSession: SparkSession)
  extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
It extends BaseRelation; since BaseRelation is an abstract class, JDBCRelation has to implement its properties and methods:
override def sqlContext: SQLContext = sparkSession.sqlContext

override val needConversion: Boolean = false

override val schema: StructType = {
  val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
  jdbcOptions.customSchema match {
    case Some(customSchema) => JdbcUtils.getCustomSchema(
      tableSchema, customSchema, sparkSession.sessionState.conf.resolver)
    case None => tableSchema
  }
}
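
Because it mixes in PrunedFilteredScan, JDBCRelation also has to implement the two-argument buildScan, which receives the columns the query needs and the filters Spark can push down. Roughly (recalled from the Spark 2.x source; the exact argument list may differ), it delegates to JDBCRDD:

override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
  // JDBCRDD turns the pruned columns into the SELECT list and the pushed-down
  // filters into the WHERE clause of the SQL it sends to the database.
  JDBCRDD.scanTable(
    sparkSession.sparkContext,
    schema,
    requiredColumns,
    filters,
    parts,
    jdbcOptions).asInstanceOf[RDD[Row]]
}

With that in mind, let's write a complete custom data source of our own: a simple comma-separated text source. First, the provider class that Spark looks up, DefaultSource: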


import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

/**
  * @Author: lih
  * @Date: 2019/8/2 11:40 PM
  * @Version 1.0
  */
class DefaultSource extends RelationProvider with SchemaRelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {
    val path = parameters.get("path")

    path match {
      case Some(p) => new TextDataSourceRelation(sqlContext, p, schema)
      case _ => throw new IllegalArgumentException("path is required ...")
    }
  }
}
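
Because the provider is not registered under a short name, the test program further down has to pass the fully qualified class name to .format(). If you want a short alias, Spark's DataSourceRegister trait can be mixed in; a sketch (the alias "text-demo" is just an example, and the class must also be listed in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file on the classpath):

import org.apache.spark.sql.sources.DataSourceRegister

class DefaultSource extends RelationProvider with SchemaRelationProvider with DataSourceRegister {
  // Lets callers write .format("text-demo") instead of the full class name.
  override def shortName(): String = "text-demo"

  // ... createRelation overrides as shown above ...
}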

import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

/**
  * @Author: lih
  * @Date: 2019/8/2 11:41 PM
  * @Version 1.0
  */
class TextDataSourceRelation(override val sqlContext: SQLContext,
                             path: String,
                             userSchema: StructType
                            ) extends BaseRelation with TableScan with Logging with Serializable {

  override def schema: StructType = {
    if (userSchema != null) {
      userSchema
    } else {
      StructType(
        StructField("id", LongType, false) ::
          StructField("name", StringType, false) ::
          StructField("gender", StringType, false) ::
          StructField("salar", LongType, false) ::
          StructField("comm", LongType, false) :: Nil
      )
    }
  }


  override def buildScan(): RDD[Row] = {
    logError("this is ruozedata custom buildScan...")

    // Read every file under the path as a whole, keeping only the content.
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(_._2)
    val schemaField = schema.fields

    // rdd + schemaField: split each file into lines and each line into fields,
    // then cast every field to the data type declared in the schema.
    val rows = rdd.map(fileContent => {
      val lines = fileContent.split("\n")
      val data = lines.map(_.split(",").map(x => x.trim)).toSeq

      val result = data.map(x => x.zipWithIndex.map {
        case (value, index) =>
          val columnName = schemaField(index).name

          // Map the gender code to a readable label before casting.
          castTo(if (columnName.equalsIgnoreCase("gender")) {
            if (value == "0") {
              "男"
            } else if (value == "1") {
              "女"
            } else {
              "未知"
            }
          } else {
            value
          }, schemaField(index).dataType)
      })
      result.map(x => Row.fromSeq(x))
    })

    rows.flatMap(x => x)
  }

  // Cast a raw string field to the Catalyst data type declared in the schema.
  def castTo(value: String, dataType: DataType): Any = {
    dataType match {
      case _: DoubleType => value.toDouble
      case _: LongType => value.toLong
      case _: StringType => value
      case _ => value // fall back to the raw string for unhandled types
    }
  }
}
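
The relation above only implements TableScan, so every query materializes all five columns. For comparison, here is a sketch (not part of the original post, reusing the imports above) of what a PrunedScan variant could look like: buildScan receives just the columns the query actually selects.

// Sketch only: the same idea with column pruning; type casting omitted for brevity.
class PrunedTextDataSourceRelation(override val sqlContext: SQLContext,
                                   path: String,
                                   userSchema: StructType)
  extends BaseRelation with PrunedScan with Serializable {

  override def schema: StructType = userSchema

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    val fieldIndex = schema.fieldNames.zipWithIndex.toMap
    val wanted = requiredColumns.map(fieldIndex) // positions of the requested columns

    sqlContext.sparkContext.textFile(path).map { line =>
      val values = line.split(",").map(_.trim)
      Row.fromSeq(wanted.map(i => values(i)).toSeq) // keep only the requested columns, in order
    }
  }
}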

Data:
1,li,0,100000,2000
2,zhang,1,20000,23223

import org.apache.spark.sql.SparkSession

object Test {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName).master("local[8]")
      .getOrCreate()

    val df = spark
      .read
      .format("com.ztgx.datasourse.DefaultSource")
      .load("file:///Users/mac/Desktop/1.txt")

    df.show()

    spark.stop()
  }
}

Result:

+---+-----+------+------+-----+
| id| name|gender| salar| comm|
+---+-----+------+------+-----+
|  1|   li|     男|100000| 2000|
|  2|zhang|     女| 20000|23223|
+---+-----+------+------+-----+
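
Because DefaultSource also implements SchemaRelationProvider, the caller can supply its own schema instead of relying on the hard-coded default; a short sketch (same column names as above, shown only for illustration):

import org.apache.spark.sql.types._

val userSchema = StructType(
  StructField("id", LongType, nullable = false) ::
    StructField("name", StringType, nullable = false) ::
    StructField("gender", StringType, nullable = false) ::
    StructField("salar", LongType, nullable = false) ::
    StructField("comm", LongType, nullable = false) :: Nil
)

val df = spark.read
  .format("com.ztgx.datasourse.DefaultSource")
  .schema(userSchema) // routes through SchemaRelationProvider.createRelation
  .load("file:///Users/mac/Desktop/1.txt")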