Spark – How to Sort DataFrame column explained

无艳影

于 2022-10-14 20:41:14 发布

阅读量176

点赞数

分类专栏： spark 文章标签： spark scala 大数据

原文链接：https://sparkbyexamples.com/spark/spark-how-to-sort-dataframe-column-explained/

版权

spark 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

In Spark, you can use either sort() or orderBy() function of DataFrame/Dataset to sort by ascending or descending order based on single or multiple columns, you can also do sorting using Spark SQL sorting functions, In this article, I will explain all these different ways using Scala examples.

Before we start, first let’s create a DataFrame.


  import spark.implicits._

  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.show()

This Yields below output.


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: integer (nullable = false)
 |-- age: integer (nullable = false)
 |-- bonus: integer (nullable = false)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|        James|     Sales|   NY| 90000| 34|10000|
|      Michael|     Sales|   NY| 86000| 56|20000|
|       Robert|     Sales|   CA| 81000| 30|23000|
|        Maria|   Finance|   CA| 90000| 24|23000|
|        Raman|   Finance|   CA| 99000| 40|24000|
|        Scott|   Finance|   NY| 83000| 36|19000|
|          Jen|   Finance|   NY| 79000| 53|15000|
|         Jeff| Marketing|   CA| 80000| 25|18000|
|        Kumar| Marketing|   NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+

DataFrame sorting using the sort() function

Spark DataFrame/Dataset class provides sort() function to sort on one or more columns. By default, it sorts by ascending order.

Syntax


sort(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
sort(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]

Example


df.sort("department","state").show(false)
df.sort(col("department"),col("state")).show(false)

The above two examples return the same below output, the first one takes the DataFrame column name as a string and the next takes columns in Column type. This table sorted by the first department column and then the state column.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+

DataFrame sorting using orderBy() function

Alternatively, Spark DataFrame/Dataset class also provides orderBy() function to sort on one or more columns. By default, it also orders by ascending.

Syntax


orderBy(sortCol : scala.Predef.String, sortCols : scala.Predef.String*) : Dataset[T]
orderBy(sortExprs : org.apache.spark.sql.Column*) : Dataset[T]

Example


df.orderBy("department","state").show(false)
df.orderBy(col("department"),col("state")).show(false)

This returns the same output as the previous section.

Sort by Ascending (ASC)

If you wanted to specify the ascending order/sort explicitly on DataFrame, you can use the <a href="https://sparkbyexamples.com/spark/spark-sql-sort-functions/#asc">asc</a> method of the Column function. for example


df.sort(col("department").asc,col("state").asc).show(false)
df.orderBy(col("department").asc,col("state").asc).show(false)

The above two examples return the same output.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
+-------------+----------+-----+------+---+-----+

Sort by Descending (DESC)

If you wanted to specify the sorting by descending order on DataFrame, you can use the <a href="https://sparkbyexamples.com/spark/spark-sql-sort-functions/#desc">desc</a> method of the Column function. for example


df.sort(col("department").asc,col("state").desc).show(false)
df.orderBy(col("department").asc,col("state").desc).show(false)

This yields the below output.


+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
+-------------+----------+-----+------+---+-----+

Using Sorting functions

Spark SQL function provides several sorting functions, below are some examples of how to use asc and desc functions. Besides these Spark also provides asc_nulls_first and asc_nulls_last functions and equivalent for descending.


df.select($"employee_name",asc("department"),desc("state"),$"salary",$"age",$"bonus").show(false)

Using Sorting on SQL


df.createOrReplaceTempView("EMP")
spark.sql(" select employee_name,asc('department'),desc('state'),salary,age,bonus from EMP").show(false)

The above two examples return the same output as above.

Dataframe Sorting Complete Example


package com.sparkbyexamples.spark.dataframe.functions

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object SortExample extends App {

  val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("SparkByExamples.com")
    .getOrCreate()

  spark.sparkContext.setLogLevel("ERROR")

  import spark.implicits._

  val simpleData = Seq(("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  )
  val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
  df.printSchema()
  df.show()

  df.sort("department","state").show(false)
  df.sort(col("department"),col("state")).show(false)

  df.orderBy("department","state").show(false)
  df.orderBy(col("department"),col("state")).show(false)

  df.sort(col("department").asc,col("state").asc).show(false)
  df.orderBy(col("department").asc,col("state").asc).show(false)

  df.sort(col("department").asc,col("state").desc).show(false)
  df.orderBy(col("department").asc,col("state").desc).show(false)
 df.select($"employee_name",asc("department"),desc("state"),$"salary",$"age",$"bonus").show(false)
  df.createOrReplaceTempView("EMP")
  spark.sql(" select employee_name,asc('department'),desc('state'),salary,age,bonus from EMP").show(false)

}

This complete example is also available at GitHub project for reference.

Conclusion

Here you have learned how to Sort Spark DataFrame columns using sort(), orderBy() and using SQL sort functions and used this function with Spark SQL along with Ascending and Descending sorting orders.

无艳影

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Spark – How to Sort DataFrame column explained

In Spark, you can use either or function of DataFrame/Dataset to sort by ascending or descending order based on single or multiple columns, you can also do sorting using Spark SQL sorting functions, In this article, I will explain all these different way
复制链接

扫一扫