Reading files with a custom Spark external data source
While working on a project I learned to use Spark's pluggable external data source API, which lets you define your own way of reading data into a DataFrame. The usage looks like:
val df = spark.sqlContext.read.format("com.spark.datasource")
  .option("", "")
  .option("", "")
  .schema()
  .load("path")
format specifies the data source, option passes any required parameters, schema optionally supplies a user-defined schema, and load takes the file path.
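The same chained read API works with Spark's built-in sources, which makes for a convenient sanity check before wiring up a custom one. A minimal sketch using the built-in csv format (the column names and the header/charset options here are illustrative choices, not requirements):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadExample {
  // Same format/option/schema/load chain, but against the built-in csv source
  def readCsv(spark: SparkSession, path: String): DataFrame = {
    // Explicit schema, so Spark skips type inference
    val schema = StructType(
      StructField("column_1", StringType, nullable = true) ::
      StructField("column_2", StringType, nullable = true) :: Nil)
    spark.read.format("csv")
      .option("header", "false")
      .option("charset", "UTF-8")
      .schema(schema)
      .load(path)
  }
}
```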
Setting up the Scala project
Create a Maven project in IDEA and pick a JDK (I used 1.8), then add the Spark dependencies. A freshly created project only contains a java directory and no scala directory, so:
Open File -> Project Structure, go to Modules -> Sources -> main, right-click and choose New Folder. Create a scala folder, then select it and click Sources to mark it as a source root.
Before you can create Scala files, the Scala plugin must be installed in IDEA. In Project Structure choose Libraries, click "+", and add a Scala SDK. After that you can create Scala files under the scala folder and start writing code. For IDEA setup see:
https://blog.csdn.net/whgyxy/article/details/88854021
Defining a custom data source
A custom data source is simply an implementation of the com.spark.datasource referenced above.
Spark SQL processes structured data in three steps: (1) get an RDD, (2) get a schema, (3) combine the RDD and schema into a DataFrame. A custom data source follows roughly the same three steps.
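The three steps can be sketched directly with the public API before hiding them behind a data source (the column names here are made up):

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object RddPlusSchema {
  def build(spark: SparkSession): DataFrame = {
    // (1) get an RDD[Row]
    val rdd = spark.sparkContext.parallelize(Seq(Row("a", "b"), Row("c", "d")))
    // (2) define a schema
    val schema = StructType(
      StructField("column_1", StringType, nullable = true) ::
      StructField("column_2", StringType, nullable = true) :: Nil)
    // (3) rdd + schema => DataFrame
    spark.createDataFrame(rdd, schema)
  }
}
```

A custom data source packages exactly these steps behind format()/load().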
Three concepts are involved:
(1) BaseRelation: defines the data's schema and turns the data into an RDD[Row]
(2) RelationProvider: creates the BaseRelation
(3) TableScan: reads the data and builds the rows
BaseRelation
An abstract class; subclasses must provide sqlContext and schema:
abstract class BaseRelation() extends scala.AnyRef {
  def sqlContext : org.apache.spark.sql.SQLContext
  def schema : org.apache.spark.sql.types.StructType
  def sizeInBytes : scala.Long = { /* compiled code */ }
  def needConversion : scala.Boolean = { /* compiled code */ }
  def unhandledFilters(filters : scala.Array[org.apache.spark.sql.sources.Filter]) : scala.Array[org.apache.spark.sql.sources.Filter] = { /* compiled code */ }
}
RelationProvider
createRelation creates the BaseRelation:
trait RelationProvider extends scala.AnyRef {
  def createRelation(sqlContext : org.apache.spark.sql.SQLContext, parameters : scala.Predef.Map[scala.Predef.String, scala.Predef.String]) : org.apache.spark.sql.sources.BaseRelation
}
TableScan
buildScan reads the data and returns an RDD[Row]:
trait TableScan extends scala.AnyRef {
  def buildScan() : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
}
So the overall flow is: extend RelationProvider and override createRelation to produce a BaseRelation; in the BaseRelation define the schema and override buildScan() to produce an RDD[Row]; combining the two yields the DataFrame.
DefaultSource is a fixed name: when the application starts, Spark looks for a class named DefaultSource under the package given to format("com.spark.datasource").
import java.io.{File, FileInputStream}
import java.util

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider with SchemaRelationProvider {
  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation = {
    createRelation(sqlContext, parameters, null)
  }

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String], schema: StructType): BaseRelation = {
    val path = parameters.get("path")
    val charSet = parameters.getOrElse("charSet", "UTF-8")
    path match {
      case Some(p) => new DataSourceRelation(sqlContext, p, charSet, schema)
      case _ => throw new IllegalArgumentException("Path is required for files")
    }
  }
}

class DataSourceRelation(override val sqlContext: SQLContext, path: String, charSet: String, userSchema: StructType)
  extends BaseRelation with TableScan with Serializable {

  override def schema: StructType = {
    // Prefer a user-supplied schema; otherwise fall back to a fixed two-column one
    if (userSchema != null) userSchema
    else StructType(
      StructField("column_1", StringType, nullable = true) ::
      StructField("column_2", StringType, nullable = true) :: Nil)
  }

  override def buildScan(): RDD[Row] = {
    val schemaFields = schema.fields
    println("buildScan: called, path: " + path)
    val is = new FileInputStream(new File(path))
    val resultList: java.util.List[Row] = new java.util.ArrayList[Row]()
    // Util.getData does the actual file reading; records come back as a list of maps
    val resultMapList: java.util.List[java.util.Map[String, String]] = Util.getData(path, charSet, is)
    import scala.collection.JavaConversions._
    if (resultMapList != null) {
      // build the rows
      for (i <- 0 until resultMapList.size()) {
        val record = new util.ArrayList[String]()
        val map = resultMapList.get(i)
        for (j <- 0 until schemaFields.length) {
          val field = schemaFields(j)
          record.add(map.get(field.name))
        }
        // build a Row from the record
        val cleanRow = org.apache.spark.sql.Row.fromSeq(record)
        resultList.add(cleanRow)
      }
    }
    // turn the row list into an RDD
    sqlContext.sparkContext.parallelize(resultList)
  }
}
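Util.getData above is the author's own file parser and its source isn't shown in the post. A hypothetical stand-in that reads comma-separated lines into the java.util.List[java.util.Map[String, String]] shape buildScan expects (the comma delimiter and default column names are assumptions):

```scala
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.Charset

object Util {
  // Hypothetical parser: one comma-separated record per line,
  // values mapped positionally onto the given column names.
  def getData(path: String, charSet: String, is: InputStream,
              columns: Seq[String] = Seq("column_1", "column_2")): java.util.List[java.util.Map[String, String]] = {
    val result = new java.util.ArrayList[java.util.Map[String, String]]()
    val reader = new BufferedReader(new InputStreamReader(is, Charset.forName(charSet)))
    try {
      var line = reader.readLine()
      while (line != null) {
        val values = line.split(",", -1)
        val row = new java.util.HashMap[String, String]()
        for ((col, idx) <- columns.zipWithIndex)
          row.put(col, if (idx < values.length) values(idx) else null)
        result.add(row)
        line = reader.readLine()
      }
    } finally reader.close()
    result
  }
}
```

Keying each record by column name is what lets buildScan fill rows in schema order via map.get(field.name).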