Problem:
Group by columns A and B, then pivot on column C, summing the values in column D.

Implementation:
In Spark the syntax is df.groupBy("A", "B").pivot("C").sum("D"). This syntax is very intuitive, but one point is worth noting: for better performance, explicitly specify the distinct values of the pivot column. For example, if column C has two distinct values (small and large), the more performant version is df.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D"). The example here is written in Scala; the Java and Python APIs are similar.
Data source (tab-separated):
foo one small 1
foo one large 2
foo one large 2
foo two small 3
foo two small 3
bar one large 4
bar one small 5
bar two small 6
bar two large 7
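To make the pivot semantics concrete, here is a minimal pure-Scala sketch (no Spark, standard collections only) of what "group by (A, B), pivot on C, sum D" computes for the sample rows above; the Rec case class and the names rows/pivoted are illustrative, not part of the Spark API:

```scala
// Model one input row: columns A, B, C, D
case class Rec(a: String, b: String, c: String, d: Int)

object PivotSketch {
  def main(args: Array[String]): Unit = {
    // The sample data from above
    val rows = Seq(
      Rec("foo", "one", "small", 1), Rec("foo", "one", "large", 2), Rec("foo", "one", "large", 2),
      Rec("foo", "two", "small", 3), Rec("foo", "two", "small", 3),
      Rec("bar", "one", "large", 4), Rec("bar", "one", "small", 5),
      Rec("bar", "two", "small", 6), Rec("bar", "two", "large", 7)
    )

    // Group by the (A, B) pair, then within each group
    // turn each distinct C value into a "column" holding sum(D)
    val pivoted: Map[(String, String), Map[String, Int]] =
      rows.groupBy(r => (r.a, r.b)).map { case (key, rs) =>
        key -> rs.groupBy(_.c).map { case (c, cs) => c -> cs.map(_.d).sum }
      }

    // e.g. pivoted(("foo", "one")) == Map("small" -> 1, "large" -> 4)
    pivoted.foreach(println)
  }
}
```

Each key of the outer map corresponds to one output row, and each key of the inner map corresponds to one pivoted column, which is exactly the shape Spark's pivot produces.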
Implementation code:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, _}

object TestDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .master("local[2]")
      .getOrCreate()

    // Read the raw tab-separated data
    val pivotRDD = spark.sparkContext.textFile("/Users/xx/Downloads/tmp/spark/input/pivot.txt")

    // Schema for the four columns: A, B, C (strings) and D (integer)
    val schema = StructType(
      List(
        StructField("A", StringType, true),
        StructField("B", StringType, true),
        StructField("C", StringType, true),
        StructField("D", IntegerType, true)
      )
    )

    // Parse each line into a Row matching the schema
    val rowRDD = pivotRDD
      .map(_.split("\t"))
      .map(attributes => Row(attributes(0), attributes(1), attributes(2), attributes(3).toInt))

    val pivotDF = spark.createDataFrame(rowRDD, schema)
    pivotDF.createOrReplaceTempView("pivot")

    // Show the raw data, then the grouped-and-pivoted result
    val results = spark.sql("SELECT * FROM pivot")
    results.show()
    results.groupBy("A", "B").pivot("C").sum("D").show()
  }
}
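As noted earlier, the more performant variant lists the pivot values explicitly, which saves Spark an extra pass over the data to discover the distinct values of C. A sketch against the same results DataFrame as above (the output table is derived by hand from the sample data; row order may vary, and column order follows the Seq):

```scala
// Explicitly listing the distinct values of C; the resulting
// columns appear in the order given in the Seq
results.groupBy("A", "B").pivot("C", Seq("small", "large")).sum("D").show()
// Expected result for the sample data (row order may vary):
// +---+---+-----+-----+
// |  A|  B|small|large|
// +---+---+-----+-----+
// |foo|one|    1|    4|
// |foo|two|    6| null|
// |bar|one|    5|    4|
// |bar|two|    6|    7|
// +---+---+-----+-----+
```

Note the null in the (foo, two) row: that group has no "large" records, so the pivoted cell has no value to sum.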