Spark SQL: Reading Data with a Custom Data Source

Latest recommended article published 2023-07-31 09:39:28

muyingmiao

Category: Spark

Article link: https://blog.csdn.net/muyingmiao/article/details/103261214

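This article shows how to define a custom TextSource data source in Spark SQL that automatically reads text files with a default schema. It covers the implementation of four components (DefaultSource, TextDatasourceRelation, Utils, and TextApp) and describes the fields of the test file (ID, name, gender, salary, bonus).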

Summary generated by CSDN using AI technology.

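The main flow of a streaming job usually relies on a lot of mapping (lookup) data, most commonly stored as text files. After such a file is read, however, its columns still have to be matched against the corresponding schema. This article defines a custom TextSource data source that applies a default schema automatically when the file is read.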
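
To make the "match the columns against a schema" step concrete, here is a minimal plain-Scala sketch with no Spark dependency. It is not the article's implementation: the object name `ManualSchemaSketch`, the comma delimiter, the column types, and the sample line are all assumptions made for illustration; only the field names (ID, name, gender, salary, bonus) come from the article's summary.

```scala
// Hypothetical sketch: the delimiter, the column types, and the sample row
// are assumptions for illustration, not taken from the article.
object ManualSchemaSketch {
  sealed trait FieldType
  case object IntField extends FieldType
  case object StringField extends FieldType
  case object DoubleField extends FieldType

  // Default schema: (column name, column type) in file order.
  val defaultSchema: Seq[(String, FieldType)] = Seq(
    "id" -> IntField,
    "name" -> StringField,
    "gender" -> StringField,
    "salary" -> DoubleField,
    "bonus" -> DoubleField
  )

  // Cast one raw string to the type the schema declares for its column.
  def castTo(raw: String, fieldType: FieldType): Any = fieldType match {
    case IntField    => raw.trim.toInt
    case StringField => raw.trim
    case DoubleField => raw.trim.toDouble
  }

  // Split a line and pair each value with its schema column.
  def parseLine(line: String, delimiter: String = ","): Map[String, Any] = {
    val parts = line.split(delimiter, -1) // -1 keeps trailing empty fields
    require(parts.length == defaultSchema.length,
      s"expected ${defaultSchema.length} fields, got ${parts.length}")
    defaultSchema.zip(parts).map { case ((name, tpe), raw) =>
      name -> castTo(raw, tpe)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample row, used only to exercise the sketch.
    println(parseLine("1,Alice,F,10000.0,500.0"))
  }
}
```

A custom data source moves this kind of per-column casting behind Spark's data source API, so callers no longer have to repeat it each time they read the file.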
DefaultSource.scala

package com.wxx.bigdata.sql_custome_source

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, SchemaRelationProvider}
import org.apache.spark.sql.types.StructType

// Entry point that Spark resolves from the package name passed to format(...).
class DefaultSource extends RelationProvider with SchemaRelationProvider {
  // Called when the caller supplies an explicit schema.
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String],
                              schema: StructType): BaseRelation = {
    // "path" is the only required option.
    parameters.get("path") match {
      case Some(p) => new TextDatasourceRelation(sqlContext, p, schema)
      case None => throw new IllegalArgumentException("path is required")
    }
  }

  override def createRelation(sqlContext: SQLContext, parameters: Map[String, String]) :Ba