Problem:
Group by columns A and B, then pivot on column C, summing the values in column D.

Implementation:
In Spark the syntax is df.groupBy("A", "B").pivot("C").sum("D"). This syntax is very intuitive, but one point is worth noting: for better performance, explicitly specify the distinct values of the pivot column. For example, if column C has two distinct values (small and large), the more performant version is df.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D"). The example here is written in Scala; the Java and Python APIs are similar.
Data source (tab-separated):
foo one small 1
foo one large 2
foo one large 2
foo two small 3
foo two small 3
bar one large 4
bar one small 5
bar two small 6
bar two large 7
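To make the pivot semantics concrete, here is a minimal pure-Scala sketch (no Spark, standard collections only) of what "group by (A, B), pivot on C, sum D" computes for the sample rows above; the Rec case class and the names rows/pivoted are illustrative, not part of the Spark API:

```scala
// Model one input row: columns A, B, C, D
case class Rec(a: String, b: String, c: String, d: Int)

object PivotSketch {
  def main(args: Array[String]): Unit = {
    // The sample data from above
    val rows = Seq(
      Rec("foo", "one", "small", 1), Rec("foo", "one", "large", 2), Rec("foo", "one", "large", 2),
      Rec("foo", "two", "small", 3), Rec("foo", "two", "small", 3),
      Rec("bar", "one", "large", 4), Rec("bar", "one", "small", 5),
      Rec("bar", "two", "small", 6), Rec("bar", "two", "large", 7)
    )

    // Group by the (A, B) pair, then within each group
    // turn each distinct C value into a "column" holding sum(D)
    val pivoted: Map[(String, String), Map[String, Int]] =
      rows.groupBy(r => (r.a, r.b)).map { case (key, rs) =>
        key -> rs.groupBy(_.c).map { case (c, cs) => c -> cs.map(_.d).sum }
      }

    // e.g. pivoted(("foo", "one")) == Map("small" -> 1, "large" -> 4)
    pivoted.foreach(println)
  }
}
```

Each key of the outer map corresponds to one output row, and each key of the inner map corresponds to one pivoted column, which is exactly the shape Spark's pivot produces.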
Implementation code:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, _}

object TestDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .master("local[2]")
      .getOrCreate()

    // Read the raw tab-separated data
    val pivotRDD = spark.sparkContext.textFile("/Users/xx/Downloads/tmp/spark/input/pivot.txt")

    // Schema for the four columns: A, B, C (strings) and D (integer)
    val schema = StructType(
      List(
        StructField("A", StringType, true),
        StructField("B", StringType, true),
        StructField("C", StringType, true),
        StructField("D", IntegerType, true)
      )
    )

    // Parse each line into a Row matching the schema
    val rowRDD = pivotRDD
      .map(_.split("\t"))
      .map(attributes => Row(attributes(0), attributes(1), attributes(2), attributes(3).toInt))

    val pivotDF = spark.createDataFrame(rowRDD, schema)
    pivotDF.createOrReplaceTempView("pivot")

    // Show the raw data, then the grouped-and-pivoted result
    val results = spark.sql("SELECT * FROM pivot")
    results.show()
    results.groupBy("A", "B").pivot("C").sum("D").show()
  }
}
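As noted earlier, the more performant variant lists the pivot values explicitly, which saves Spark an extra pass over the data to discover the distinct values of C. A sketch against the same results DataFrame as above (the output table is derived by hand from the sample data; row order may vary, and column order follows the Seq):

```scala
// Explicitly listing the distinct values of C; the resulting
// columns appear in the order given in the Seq
results.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D").show()
// Expected result for the sample data (row order may vary):
// +---+---+-----+-----+
// |  A|  B|small|large|
// +---+---+-----+-----+
// |foo|one|    1|    4|
// |foo|two|    6| null|
// |bar|one|    5|    4|
// |bar|two|    6|    7|
// +---+---+-----+-----+
```

Note the null in the (foo, two) row: that group has no "large" records, so the pivoted cell has no value to sum.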