Spark 06: DataFrame & Dataset

1 Overview

A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.

Dataset: a distributed collection of data.
DataFrame: a distributed dataset organized into named columns (each column has a name, a type, and values).

1) DataFrame vs. RDD

RDD (Resilient Distributed Dataset): a fault-tolerant, distributed collection of elements.

  • java/scala ⇒ runs on the JVM
  • python ⇒ runs in a separate Python runtime, so performance differs between languages

DataFrame:

  • java/scala/python ⇒ all compile down to the same logical plan, so performance is consistent across languages (see the comparison sketch after this list)
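
To make the contrast concrete, below is a minimal sketch (assuming the infos.txt file introduced later in this post, with lines of the form id,name,age; the object name RddVsDataFrameApp is hypothetical). The RDD version spells out how to compute an average age step by step, while the DataFrame version only declares what to compute and lets Catalyst build and optimize a logical plan, which is the same regardless of whether the driver is written in Scala, Java, or Python.

package com.lihaogn.spark

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

/**
  * Hypothetical sketch: the same aggregation written against the RDD API and the DataFrame API
  */
object RddVsDataFrameApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("RddVsDataFrameApp").master("local[2]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.textFile("/Users/Mac/testdata/infos.txt")

    // RDD: imperative, we spell out how to parse the lines and how to aggregate them
    val ages = rdd.map(_.split(",")).map(fields => fields(2).toInt)
    val (sum, count) = ages.aggregate((0, 0))(
      (acc, age) => (acc._1 + age, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    println(sum.toDouble / count)

    // DataFrame: declarative, Catalyst compiles this into an optimized logical plan
    val df = rdd.map(_.split(",")).map(f => (f(0).toInt, f(1), f(2).toInt)).toDF("id", "name", "age")
    df.agg(avg("age")).show()

    spark.stop()
  }
}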

2 Common DataFrame API Operations

1) DataFrameApp.scala

package com.lihaogn.spark

import org.apache.spark.sql.SparkSession

/**
  * Basic DataFrame API operations
  */
object DataFrameApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate()

    // Load a JSON file into a DataFrame
    val peopleDF = spark.read.format("json").
      load("/Users/Mac/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")

    // Print the schema of the DataFrame
    peopleDF.printSchema()

    // Show the first 20 rows of the dataset
    peopleDF.show()

    // Select all values of one column: select name from table-name
    peopleDF.select("name").show()

    // Select several columns and compute on one of them: select name, age+10 as age2 from table-name
    peopleDF.select(peopleDF.col("name"),(peopleDF.col("age")+10).as("age2")).show()

    // Filter rows by a column value: select * from table-name where age>19
    peopleDF.filter(peopleDF.col("age")>19).show()

    // Group by a column, then aggregate: select age, count(1) from table-name group by age
    peopleDF.groupBy("age").count().show()

    spark.stop()

  }
}

2) Run results
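Assuming peopleDF is loaded from the stock people.json that ships with Spark (Michael with no age, Andy aged 30, Justin aged 19), printSchema() and show() print output along these lines:

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

The select, filter and groupBy calls print similarly formatted tables for their respective results.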

3 Interoperating Between DataFrames and RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

Prepare the input file, infos.txt:

1,zhangsan,20
2,lisi,30
3,wangwu,40

1) Method 1: reflection. Prerequisite: the fields and their types are known in advance.
2) Method 2: programmatic. Use this when the first approach cannot meet the requirement, i.e. the columns are not known until runtime.

DataFrameRDDApp.scala

package com.lihaogn.spark

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Converting between DataFrame and RDD
  */
object DataFrameRDDApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate()

    // Reflection approach
//    inferReflection(spark)

    // Programmatic approach
    program(spark)

    spark.stop()
  }

  def program(spark:SparkSession): Unit ={

    // RDD ==> DataFrame
    val rdd=spark.sparkContext.textFile("/Users/Mac/testdata/infos.txt")

    val infoRDD=rdd.map(_.split(",")).map(line=>Row(line(0).toInt,line(1),line(2).toInt))

    val structType=StructType(Array(StructField("id",IntegerType,true),
      StructField("name",StringType,true),
      StructField("age",IntegerType,true)))

    val infoDF=spark.createDataFrame(infoRDD,structType)
    infoDF.printSchema()
    infoDF.show()

    // Query via the DataFrame API
    infoDF.filter(infoDF.col("age")>30).show()

    // Query via SQL on a temporary view
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }


  def inferReflection(spark:SparkSession): Unit ={

    // RDD ==> DataFrame
    val rdd=spark.sparkContext.textFile("/Users/Mac/testdata/infos.txt")

    // Import implicit conversions (required for .toDF())
    import spark.implicits._
    val infoDF=rdd.map(_.split(",")).map(line=>Info(line(0).toInt,line(1),line(2).toInt)).toDF()

    infoDF.show()

    infoDF.filter(infoDF.col("age")>30).show()

    // SQL approach
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }

  case class Info(id: Int, name: String, age: Int)

}
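
With the infos.txt shown above, program(spark) should produce output along these lines: the schema, the full table, and then the two age > 30 queries, which both return only the wangwu row.

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+---+--------+---+
| id|    name|age|
+---+--------+---+
|  1|zhangsan| 20|
|  2|    lisi| 30|
|  3|  wangwu| 40|
+---+--------+---+

+---+------+---+
| id|  name|age|
+---+------+---+
|  3|wangwu| 40|
+---+------+---+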

4 Dataset

1) Usage
DatasetApp.scala

package com.imooc.spark

import org.apache.spark.sql.SparkSession

/**
 * Basic Dataset operations
 */
object DatasetApp {

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("DatasetApp")
      .master("local[2]").getOrCreate()

    // Note: implicit conversions are required (for df.as[Sales] and the typed map below)
    import spark.implicits._

    val path = "file:///home/hadoop/data/sales.csv"

    // How does Spark parse a CSV file? Read the header row and let Spark infer the schema
    val df = spark.read.option("header","true").option("inferSchema","true").csv(path)
    df.show

    val ds = df.as[Sales]
    ds.map(line => line.itemId).show
    
    spark.stop()
  }

  case class Sales(transactionId:Int,customerId:Int,itemId:Int,amountPaid:Double)
}
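
The contents of sales.csv are not shown in the post; a minimal file that matches the Sales case class (a header row is required because option("header","true") is set) might look like this:

transactionId,customerId,itemId,amountPaid
111,1,1,100.0
112,2,2,505.0
113,3,3,510.0

Calling df.as[Sales] turns the untyped DataFrame into a typed Dataset[Sales], so an expression such as line.itemId is checked at compile time, whereas a misspelled column name in a DataFrame query would only fail at runtime.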
