For Spark external data sources, these are the classes to understand first (their method signatures are sketched right after this list):
BaseRelation: defines the schema of the data and represents the data as an RDD[Row]
RelationProvider: a provider of relations; it creates the BaseRelation
TableScan: reads the data and builds rows; returns all of the data (a full scan)
PrunedScan: column pruning
PrunedFilteredScan: column pruning plus filter pushdown
InsertableRelation: a relation that data can be written back into
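For orientation, these are the actual method signatures the traits declare in org.apache.spark.sql.sources (Spark 2.x); each buildScan variant receives progressively more pushdown information:

trait RelationProvider {
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}

trait TableScan {
  def buildScan(): RDD[Row]
}

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}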
InsertableRelation comes with the following assumptions, which are worth noting:
/**
 * A BaseRelation that can be used to insert data into it through the insert method.
 * If overwrite in insert method is true, the old data in the relation should be overwritten with
 * the new data. If overwrite in insert method is false, the new data should be appended.
 *
 * InsertableRelation has the following three assumptions.
 *   1. It assumes that the data (Rows in the DataFrame) provided to the insert method
 *      exactly matches the ordinal of fields in the schema of the BaseRelation.
 *   2. It assumes that the schema of this relation will not be changed.
 *      Even if the insert method updates the schema (e.g. a relation of JSON or Parquet data may
 *      have a schema update after an insert operation), the new schema will not be used.
 *   3. It assumes that fields of the data provided in the insert method are nullable.
 *      If a data source needs to check the actual nullability of a field, it needs to do it in
 *      the insert method.
 *
 * @since 1.3.0
 */
@InterfaceStability.Stable
trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
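As a minimal, hypothetical sketch of honoring those assumptions (the class name TextInsertableRelation and the CSV output are made up for illustration, not a real Spark class):

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, InsertableRelation}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical relation that writes rows back out as CSV files under `path`.
class TextInsertableRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with InsertableRelation {

  override def schema: StructType =
    StructType(StructField("value", StringType, nullable = true) :: Nil)

  override def insert(data: DataFrame, overwrite: Boolean): Unit = {
    // Assumption 1: columns arrive by ordinal, so write them positionally.
    // Assumption 3: treat every field as nullable unless checked right here.
    val mode = if (overwrite) SaveMode.Overwrite else SaveMode.Append
    data.write.mode(mode).csv(path)
  }
}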
Let's take a quick look at the built-in JDBCRelation:
private[sql] case class JDBCRelation(
    parts: Array[Partition], jdbcOptions: JDBCOptions)(@transient val sparkSession: SparkSession)
  extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation
It extends BaseRelation, which is an abstract class, so it has to implement BaseRelation's abstract members:
override def sqlContext: SQLContext = sparkSession.sqlContext

override val needConversion: Boolean = false

override val schema: StructType = {
  val tableSchema = JDBCRDD.resolveTable(jdbcOptions)
  jdbcOptions.customSchema match {
    case Some(customSchema) => JdbcUtils.getCustomSchema(
      tableSchema, customSchema, sparkSession.sessionState.conf.resolver)
    case None => tableSchema
  }
}
…
Now let's implement a simple text data source of our own. First, the provider:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

/**
 * @Author: lih
 * @Date: 2019/8/2 11:40 PM
 * @Version 1.0
 */
class DefaultSource extends RelationProvider with SchemaRelationProvider {

  // Called when the caller does not supply a schema.
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  // Called when the caller supplies a schema via .schema(...).
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {
    parameters.get("path") match {
      case Some(p) => new TextDataSourceRelation(sqlContext, p, schema)
      case _ => throw new IllegalArgumentException("path is required ...")
    }
  }
}
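As a side note, if you want callers to use a short name instead of the full class name, Spark provides the DataSourceRegister trait. A sketch (the short name "text-demo" is made up), which additionally requires a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister resource file containing the line com.ztgx.datasourse.DefaultSource:

import org.apache.spark.sql.sources.DataSourceRegister

// Mixing this in lets callers write .format("text-demo").
class DefaultSource extends RelationProvider with SchemaRelationProvider with DataSourceRegister {
  override def shortName(): String = "text-demo"
  // ...createRelation implementations exactly as above...
}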
import org.apache.spark.internal.Logging
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

/**
 * @Author: lih
 * @Date: 2019/8/2 11:41 PM
 * @Version 1.0
 */
class TextDataSourceRelation(override val sqlContext: SQLContext,
                             path: String,
                             userSchema: StructType)
  extends BaseRelation with TableScan with Logging with Serializable {

  // Use the caller-supplied schema if there is one; otherwise fall back to a default.
  override def schema: StructType = {
    if (userSchema != null) {
      userSchema
    } else {
      StructType(
        StructField("id", LongType, false) ::
        StructField("name", StringType, false) ::
        StructField("gender", StringType, false) ::
        StructField("salar", LongType, false) ::
        StructField("comm", LongType, false) :: Nil
      )
    }
  }
  // TableScan: read everything, parse each line, and emit Rows.
  override def buildScan(): RDD[Row] = {
    // Logged at ERROR on purpose so it is visible regardless of the log level.
    logError("this is ruozedata custom buildScan...")

    // wholeTextFiles yields (fileName, fileContent) pairs; keep only the content.
    val rdd = sqlContext.sparkContext.wholeTextFiles(path).map(_._2)
    val schemaField = schema.fields

    // Combine the raw lines with the schema: split each line, translate the
    // gender code into a readable value, then cast every column to its type.
    val rows = rdd.map(fileContent => {
      val lines = fileContent.split("\n")
      val data = lines.map(_.split(",").map(x => x.trim)).toSeq
      val result = data.map(x => x.zipWithIndex.map {
        case (value, index) =>
          val columnName = schemaField(index).name
          castTo(if (columnName.equalsIgnoreCase("gender")) {
            if (value == "0") {
              "男"
            } else if (value == "1") {
              "女"
            } else {
              "未知"
            }
          } else {
            value
          }, schemaField(index).dataType)
      })
      result.map(x => Row.fromSeq(x))
    })

    // Each file produced a Seq[Row]; flatten into a single RDD[Row].
    rows.flatMap(x => x)
  }
  // Cast a raw string value to the DataType declared in the schema.
  def castTo(value: String, dataType: DataType): Any = {
    dataType match {
      case _: DoubleType => value.toDouble
      case _: LongType => value.toLong
      case _: StringType => value
      case other => throw new IllegalArgumentException(s"Unsupported data type: $other")
    }
  }
}
Data:
1,li,0,100000,2000
2,zhang,1,20000,23223
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName).master("local[8]")
      .getOrCreate()

    // format() takes our provider class; the "path" given to load() lands in
    // the parameters map passed to createRelation.
    val df = spark
      .read
      .format("com.ztgx.datasourse.DefaultSource")
      .load("file:///Users/mac/Desktop/1.txt")

    df.show()
    spark.stop()
  }
}
Result:
+---+-----+------+------+-----+
| id| name|gender| salar| comm|
+---+-----+------+------+-----+
| 1| li| 男|100000| 2000|
| 2|zhang| 女| 20000|23223|
+---+-----+------+------+-----+
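The TableScan version above always materializes every column. To support column pruning we would implement PrunedScan instead; here is a rough, untested sketch (the class name PrunedTextRelation is made up, and the type casting from castTo above is omitted, so it assumes an all-string schema):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
import org.apache.spark.sql.types.StructType

// Hypothetical pruning variant: Spark hands us only the column names it needs.
class PrunedTextRelation(override val sqlContext: SQLContext,
                         path: String,
                         userSchema: StructType)
  extends BaseRelation with PrunedScan with Serializable {

  override def schema: StructType = userSchema

  override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
    // Position of each requested column within the full schema.
    val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
    sqlContext.sparkContext.textFile(path).map { line =>
      val all = line.split(",").map(_.trim)
      // Keep only the requested columns, in the requested order.
      Row.fromSeq(indices.map(all(_)).toSeq)
    }
  }
}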