SparkSQL-02: Two Ways to Convert an RDD to a DataFrame

Quoting the original Spark SQL documentation:

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets.

The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

RDD <==> DataFrame (two different methods)

1. Reflection. (The usual Java reflection APIs are Method and Class, invoked via object.invoke.)

        Spark infers the schema from the RDD's element type.

        This works when you already know the schema while writing the application:
        you encode it as a case class whose fields become the columns.

Note: when converting between an RDD and a DataFrame via reflection, you must know the schema in advance.

import org.apache.spark.sql.SparkSession

/**
  * RDD => DataFrame, method 1: via reflection.
  * Test data:
     Andy , 12
     Tom  , 24
          , 12
     Ray  , 32
  */
object SparkSessionT {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionT")
      .master("local[2]")
      .getOrCreate()
    // Read the file into an RDD and map each line onto the Person case class
    // (trim both fields, since the test data pads the comma with spaces)
    val people = spark.sparkContext.textFile("file:///d:/people.txt")
                  .map(x => x.split(","))
                  .map(x => Person(x(0).trim, x(1).trim.toInt))
                  //.filter(x => x.age > 30)
                  //.filter(x => x.name.equals(""))
    // Convert to a DataFrame; the implicits provide rdd.toDF()
    import spark.implicits._
    val personDF = people.toDF()
    // Register a temporary view
    personDF.createOrReplaceTempView("Person")
    // Show the data
    personDF.show()
    // Show the schema
    personDF.printSchema()
    // Stop the Spark application
    spark.stop()
  }
  // The case class from which the schema is inferred
  case class Person(name: String, age: Int)
}
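The heading above says RDD <==> DataFrame, and the reverse direction is a one-liner. A minimal sketch, reusing personDF and the implicits already in scope (the backToRows/backToPeople names are ours):

    // Back from DataFrame to RDD: .rdd yields RDD[Row],
    // while .as[Person].rdd recovers the typed RDD[Person].
    val backToRows   = personDF.rdd
    val backToPeople = personDF.as[Person].rdd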

2. Programmatic interface: construct a schema in code and apply it to an existing RDD.

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

/**
  * RDD => DataFrame, method 2: by programmatically specifying the schema.
  * Test data:
     Andy , 12
     Tom  , 24
          , 12
     Ray  , 32
  */
object SparkSessionPS {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionPS")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Step 1: create an RDD of Rows
    val peopleRDD = spark.sparkContext.textFile("file:///c:/people.txt")
      .map(x => x.split(","))
      .map(attributes => Row(attributes(0).trim, attributes(1).trim))

    // Step 2: define the schema
    val schemaString = "name,age"
    val fields = schemaString.split(",")
      .map(fieldName =>
        StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // Step 3: create the DataFrame from (RDD[Row], StructType)
    val peopleDF = spark.createDataFrame(peopleRDD, schema)
    // Step 4: register a temporary view
    peopleDF.createOrReplaceTempView("people")
    // Step 5: query the view
    val results = spark.sql("SELECT name, age FROM people")
    results.show()
    //results.map(attributes => attributes.getAs[String]("name")).show()
    // Stop the Spark application
    spark.stop()
  }
}
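In the example above every column is declared as StringType. The whole point of the programmatic interface is that column types can also be decided at runtime; here is a minimal sketch, assuming the same spark session and people.txt layout as above (the typedSchema/typedRDD/typedDF names are ours):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Pair each column with its real type instead of defaulting to StringType.
    val typedSchema = StructType(Seq(
      StructField("name", StringType,  nullable = true),
      StructField("age",  IntegerType, nullable = true)
    ))
    // The Row values must line up with the declared types, so age is parsed to Int.
    val typedRDD = spark.sparkContext.textFile("file:///c:/people.txt")
      .map(_.split(","))
      .map(a => Row(a(0).trim, a(1).trim.toInt))
    val typedDF = spark.createDataFrame(typedRDD, typedSchema)
    typedDF.printSchema()  // age now prints as integer rather than string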

Real-world usage

import org.apache.spark.sql.SparkSession

/**
  * Test data:
id|name|phone|email
1|Burke|1-300-746-8446|ullamcorpe.veilt.in@ametnullaDonec.co.uk
2|Kamal|1-668-571-5046|pede.Suspendisse@interdumenim.edu
3|Olga|1-956-311-1686|Aenean.eqet.metus@dictumcursusNunc.edu
4|Belle|l-246-894-6340|vitae.aliguet.nec@neque.co.uk
5|Trevor|1-300-527-4967|dapibus.id@acturpisegestas.net
6|Laurel|1-691-379-9921|adipiscing@consectetueripsum.edu
7|Sara|1-608-140-1995|Donee.nibh@enimEtiamimperdiet.edu
8|Kaseem|1-881-586-2689|cursus.et.magna@euismod.org
9|Lev|1-916-367-5608|vivamus.nisi@ipsumdolor.com
10|Maya|1-271-683-2698|accumsan,convallis@ornarelectusjusto.edu
11|Emi|l-467-270-1337|est@nunc.com
12|Caleb|1-68B-212-0896|Suspendisse@Quisque.edu
13|Florence|1-603-575-2444|sit.amet.dapibus@lacusAliquamrutrum.ca
14|Anika|1-856-828-7883|euismod@ligulaelit.co.uk
15|Tarik|l-398-171-2268|turpis@felnsorci.com
16|Amena|1-878-250-3129|lorem.luctus.ut@scelerisque.com
17|Blossom|1-154-406-9596|Nunc.commodo.auctor@eratsed.co.uk
18|Guy|1-869-521-32BO|senectus.et.netus@lectusrutrum.com
19|Malachi|1-608-637-2772|Proin.mi.Aliquam@estarcu.net
20|Edward|1-711-710-6552|lectus@aliquetlibero.co.uk
21||1-711-710-6552|lectus@aliquetlibero.co.uk
22||1-711-710-6552|lectus@aliquetlibero.co.uk
23|NULL|1-711-710-6552|lectus@aliquetlibero.co.uk
  */
object SparkSessionEXP {
  def main(args: Array[String]): Unit = {
    // Initialize the SparkSession
    val spark = SparkSession
      .builder()
      .appName("SparkSessionEXP")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    // Read the file once and strip the header line
    var studentRDD = spark.sparkContext.textFile("file:///c:/student.data")
    val head = studentRDD.first()
    studentRDD = studentRDD.filter(row => row != head)

    // Build two identical student DataFrames from the header-free RDD for the join test
    val studentDF1 = studentRDD
      .map(x => x.split("\\|"))
      .map(x => student(x(0), x(1), x(2), x(3)))
      .toDF

    val studentDF2 = studentRDD
      .map(x => x.split("\\|"))
      .map(x => student(x(0), x(1), x(2), x(3)))
      .toDF
      //.filter("name = '' or name = 'NULL'")
      /** Find names starting with 'M'. Where does substr come from? In IntelliJ,
        * double-press Shift and search for "functions" to open functions.scala,
        * which holds the full set of SQL functions; Alt+7 lists all of its methods.
        */
      //.filter("substr(name,0,1) = 'M'")
    // Register two student temporary views
    studentDF1.createOrReplaceTempView("student1")
    studentDF2.createOrReplaceTempView("student2")
    // Some common simple operations
    //spark.sql("select * from student1").show(23, false)
    //studentDF1.filter("name = ''").show()
    //studentDF1.select(studentDF1.col("name"), studentDF1.col("id").as("mid")).show(30, false)
    //studentDF1.sort(studentDF1.col("name"), studentDF1.col("id")).show(30, false)
    //spark.sql("select * from student1").show(30, false)

    // Join test: join on the id column.
    // Note the three equals signs (===); the default join type is inner join.
    studentDF1.join(studentDF2, studentDF1.col("id") === studentDF2.col("id")).show(500, false)

    // Stop the Spark application
    spark.stop()
  }
  case class student(id: String, name: String, phone: String, email: String)
}
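Since student1 and student2 are registered as temporary views, the same join can also be expressed through the SQL interface. A minimal sketch (the a/b aliases are ours):

    // Equivalent inner join via SQL, using the temporary views registered above.
    spark.sql(
      """SELECT a.id, a.name, b.phone, b.email
        |FROM student1 a
        |JOIN student2 b ON a.id = b.id""".stripMargin
    ).show(500, false)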
